# Let's try: dbt part 3 - seed and source
Before we transform data, we must have a data source first.

dbt lets us configure sources in YAML files. It also offers a data-loading feature called seeds.
## Seeds
Seeds are CSV files that we want to load into tables in the data warehouse. dbt is not a great tool for data integration compared to dedicated tools such as Apache Airflow, but this capability is sometimes all we need.

The best use case is loading small, static data sets for lookups, comparisons, testing, and the like, before the transformations in later steps.
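For instance, a tiny lookup seed can enrich a model through a join. Here is a minimal sketch, assuming a hypothetical seed `country_codes.csv` (columns `code`, `country_name`) and a hypothetical `orders` model:

```sql
-- hypothetical: resolve country names from a lookup seed
SELECT
    o.order_id,
    c.country_name
FROM {{ ref('orders') }} AS o
LEFT JOIN {{ ref('country_codes') }} AS c
    ON o.country_code = c.code
```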
## Prepare seeds
Let’s say I have `person.csv` and want to load it into the warehouse. We have to prepare a configuration file like this.

We should place the target files in the `seeds/` directory, then create a YAML file for the seed configuration. The configuration filename can be anything, but following a naming convention helps organization. Here I use `seed.yml`.

```
dbt_proj01/seeds
├── person.csv
└── seed.yml
```
I prepared a small data set as below.

```csv
id,first_name,last_name,gender,occupation
1,Kenneth,Farrell,M,Product manager
2,Kimberly,Hood,F,Social worker
4,Robert,Haley,M,"Designer, jewellery"
5,Christine,Valentine,F,Barrister
7,Elizabeth,Espinoza,F,Solicitor
```
Now we can configure the seed to be loaded as below.

- I `enabled` this seed.
- I want to load it into the `raw` schema (a dataset, in BigQuery terms).
- I want to rename the table to `person_occupation`.
- This file contains quotes.
- This file has the following columns.

```yaml
version: 2
seeds:
  - name: person
    config:
      enabled: true
      schema: raw
      alias: person_occupation
      quote_columns: true
      column_types:
        id: int64
        first_name: string
        last_name: string
        gender: string
        occupation: string
```
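By the way, the same configs can also live in `dbt_project.yml` instead of a dedicated YAML file. A minimal sketch, assuming the project is named `dbt_proj01`:

```yaml
# dbt_project.yml (hypothetical excerpt)
seeds:
  dbt_proj01:
    person:
      +enabled: true
      +schema: raw
      +quote_columns: true
```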
## Load seeds
Use this command:

```sh
dbt seed
```
We should see the log if the seeds are loaded successfully.

```
20:47:41 Running with dbt=1.10.13
20:47:42 Registered adapter: bigquery=1.10.2
20:47:43 Unable to do partial parsing because change detected to override macro. Starting full parse.
20:47:43 Found 2 models, 1 seed, 4 data tests, 509 macros
20:47:43
20:47:43 Concurrency: 1 threads (target='dev')
20:47:43
20:47:48 1 of 1 START seed file dbt_test_dataset_raw.person_occupation .................. [RUN]
20:47:53 1 of 1 OK loaded seed file dbt_test_dataset_raw.person_occupation .............. [INSERT 5 in 5.19s]
20:47:53
20:47:53 Finished running 1 seed in 0 hours 0 minutes and 9.94 seconds (9.94s).
20:47:53
20:47:53 Completed successfully
20:47:53
20:47:53 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
```
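Note that `dbt seed` loads every seed in the project. We can also target a single seed or force a rebuild of the seed tables:

```sh
dbt seed --select person    # load only the person seed
dbt seed --full-refresh     # drop and re-create seed tables before loading
```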
And check the seed tables.

Aww! We want the schema (dataset) to be `raw`, but it is `dbt_test_dataset_raw` instead. We need to override this.
## Override the schema name
According to the dbt docs, if we supply a value for `schema`, the value does not replace the default schema name from `profiles.yml`; it is appended to it.
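For reference, the built-in macro behaves roughly like this (adapted from the dbt docs), which is why the custom name ends up appended to the default one:

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {#- default behaviour: <default_schema>_<custom_schema_name> -#}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```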
What we have to do is create a macro that overrides this behaviour.

We create a new file named `generate_schema_name.sql`, so the filename describes the action, in the `macros/` directory.

```
dbt_proj01/macros
└── generate_schema_name.sql
```
Write the content as below. This Jinja means:

- If we don’t supply a `schema`, use the default schema.
- If we do supply a `schema`, use the input with no changes.

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```
Run `dbt seed` again.
```
$ dbt seed
20:52:19 Running with dbt=1.10.13
20:52:21 Registered adapter: bigquery=1.10.2
20:52:21 Unable to do partial parsing because change detected to override macro. Starting full parse.
20:52:21 Found 2 models, 1 seed, 4 data tests, 509 macros
20:52:21
20:52:21 Concurrency: 1 threads (target='dev')
20:52:21
20:52:26 1 of 1 START seed file raw.person_occupation ................................... [RUN]
20:52:33 1 of 1 OK loaded seed file raw.person_occupation ............................... [INSERT 5 in 6.71s]
20:52:33
20:52:33 Finished running 1 seed in 0 hours 0 minutes and 11.17 seconds (11.17s).
20:52:33
20:52:33 Completed successfully
20:52:33
20:52:33 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
```
Check again, and now we have the dataset `raw` as expected.
## Use seeds in models
We have loaded our seeds, so it’s time to use them in our models.

Let’s say I want a model that finds all female persons. We can write this:
```sql
SELECT *
FROM {{ ref('person') }}
WHERE gender = 'F'
```
We use `{{ ref('<seed_name>') }}` to refer to the seed tables.
## Compile models
Let’s see the compiled result:
```
$ dbt compile
15:21:22 Running with dbt=1.10.13
15:21:23 Registered adapter: bigquery=1.10.2
15:21:24 Found 1 seed, 3 models, 4 data tests, 509 macros
15:21:24
15:21:24 Concurrency: 1 threads (target='dev')
15:21:24
Compiled node 'woman_jobs' is:
SELECT *
FROM `bluebirz-playground`.`raw`.`person_occupation`
WHERE gender = 'F'
```
As above, dbt substitutes the seed’s actual table into our model (the `FROM` clause of the compiled query).
## Sources
Sources are the real tables/views in the data warehouse that we want to run queries on. As they already exist, dbt only needs configuration telling it where they are.
Let’s say I already have this table `books` in the dataset `raw` in BigQuery.

And the schema of table `books` is as below.
```json
[
  {"name": "id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "isbn", "type": "STRING", "mode": "NULLABLE"},
  {"name": "title", "type": "STRING", "mode": "NULLABLE"},
  {"name": "author", "type": "STRING", "mode": "NULLABLE"},
  {"name": "published_year", "type": "INTEGER", "mode": "NULLABLE"}
]
```
After that, we want a new table derived from `books`: the number of books per decade, counting only books published before the year 2000.
## Prepare sources and models
We have to create configuration files and query files as below.

We create a configuration file and a query file in the `models/books/` directory.

```
models/books
├── old-books-by-decade.sql
└── source-books.yml
```
Here is the configuration file for the source. I name this source `raw_data`, link it to the schema (dataset) `raw`, and specify the table `books`.

```yaml
version: 2
sources:
  - name: raw_data
    schema: raw
    tables:
      - name: books
        description: all books in my library
```
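As a side note, a source entry can also declare freshness checks, verified with `dbt source freshness`. A minimal sketch, assuming the table had a timestamp column such as `updated_at` (our sample `books` table does not):

```yaml
sources:
  - name: raw_data
    schema: raw
    tables:
      - name: books
        loaded_at_field: updated_at  # hypothetical timestamp column
        freshness:
          warn_after: {count: 24, period: hour}
```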
This query computes the decade of each book published before 2000 and counts the books per decade from the source `raw_data.books`.

```sql
{{- config(
    materialized='table',
    schema='transform'
) -}}

SELECT
    SAFE_CAST(FLOOR(published_year / 10) * 10 AS INT64) AS decade,
    COUNT(*) AS book_count
FROM
    {{ source('raw_data', 'books') }}
WHERE
    published_year < 2000
GROUP BY
    decade
```
- This model will be materialized as a table, as configured in the `{{ config() }}` block.
- I put `schema='transform'` here so this table will be created in the schema `transform` (it also passes through our `generate_schema_name` override, which is why the name is used as-is).
- Adding `-` inside the Jinja delimiters, as in `{{- ... -}}`, trims leading/trailing whitespace.
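The same two configs could instead live in `dbt_project.yml`, applied per folder or per model. A minimal sketch, assuming the project is named `dbt_proj01` and the model sits under `models/books/`:

```yaml
# dbt_project.yml (hypothetical excerpt)
models:
  dbt_proj01:
    books:
      +materialized: table
      +schema: transform
```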
## Compile models
So let’s see the final query statement via `dbt compile`.
```
$ dbt compile
18:18:43 Running with dbt=1.10.13
18:18:45 Registered adapter: bigquery=1.10.2
18:18:45 Found 3 models, 2 seeds, 4 data tests, 1 source, 509 macros
18:18:45
18:18:45 Concurrency: 1 threads (target='dev')
18:18:45
Compiled node 'old-books-by-decade' is:
SELECT
    SAFE_CAST(FLOOR(published_year / 10) * 10 AS INT64) AS decade,
    COUNT(*) AS book_count
FROM `bluebirz-playground`.`raw`.`books`
WHERE published_year < 2000
GROUP BY decade
```
Yes! The query looks correct.
## Run models
Now we can run it.
```
$ dbt run
18:18:50 Running with dbt=1.10.13
18:18:51 Registered adapter: bigquery=1.10.2
18:18:52 Found 3 models, 2 seeds, 4 data tests, 1 source, 509 macros
18:18:52
18:18:52 Concurrency: 1 threads (target='dev')
18:18:52
18:18:53 1 of 1 START sql table model transform.old-books-by-decade ..................... [RUN]
18:18:57 1 of 1 OK created sql table model transform.old-books-by-decade ................ [CREATE TABLE (10.0 rows, 3.9 KiB processed) in 3.93s]
18:18:57
18:18:57 Finished running 1 table model in 0 hours 0 minutes and 4.48 seconds (4.48s).
18:18:57
18:18:57 Completed successfully
18:18:57
18:18:57 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
```
See the result table in BigQuery.
This model runs successfully.
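Note that `dbt run` builds every model in the project. To build only this one, we can select it by name:

```sh
dbt run --select old-books-by-decade
```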
## Wrap up
- seeds:
    - put CSV files in the `seeds/` directory.
    - define YAML configurations in syntax like this:

      ```yaml
      version: 2
      seeds:
        - name: <seed_name>
          config:
            enabled: <true|false>
            schema: <schema>
            alias: <alias>
            quote_columns: <true|false>
            column_types:
              <column_name>: <data_type>
      ```

    - execute `dbt seed` to load seeds.
    - refer to seeds in models with `{{ ref('<seed_name>') }}`.
- override the schema name by creating `macros/generate_schema_name.sql` with Jinja as above.
- sources:
    - define YAML configurations in syntax like this:

      ```yaml
      version: 2
      sources:
        - name: <source_name>
          schema: <schema>
          tables:
            - name: <table_name>
              description: <description>
      ```

    - use `{{ source('<source_name>', '<table_name>') }}` to refer to the source table in the model.
    - create a table instead of the default view with `{{ config(materialized='table') }}` in the model.
    - create the table in a specific schema (dataset) with `{{ config(schema='<schema_name>') }}` in the model.
- compile all models with `dbt compile`.
- run all models with `dbt run`.