# Let's try: dbt part 3 - seed and source
Before we transform data, we must have a data source first.

dbt lets us configure sources in YAML files. It also offers a data-loading feature called seeds.
## Seeds
Seeds are CSV files that we want to load into tables in the data warehouse. dbt is not a great tool for data integration compared to dedicated tools such as Apache Airflow, but this capability is sometimes all we need.

The best use case is loading small, static data sets for lookups, comparisons, testing, and the like, before the transformations in later steps.
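For instance, a tiny lookup seed can enrich a model through a join. Here is a minimal sketch, assuming a hypothetical seed `country_codes.csv` (columns `code`, `country_name`) and a hypothetical `orders` model:

```sql
-- hypothetical: resolve country names from a lookup seed
SELECT
    o.order_id,
    c.country_name
FROM {{ ref('orders') }} AS o
LEFT JOIN {{ ref('country_codes') }} AS c
    ON o.country_code = c.code
```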
## Prepare seeds
Let’s say I have `person.csv` and want to load it into the warehouse. We have to prepare a configuration file like this.

We should place the target files in the `seeds/` directory, then create a YAML file for the seed configuration. The configuration filename can be anything, but following a naming convention helps organization. Here I use `seed.yml`.

```
dbt_proj01/seeds
├── person.csv
└── seed.yml
```
I prepared a small data set as below.

```csv
id,first_name,last_name,gender,occupation
1,Kenneth,Farrell,M,Product manager
2,Kimberly,Hood,F,Social worker
4,Robert,Haley,M,"Designer, jewellery"
5,Christine,Valentine,F,Barrister
7,Elizabeth,Espinoza,F,Solicitor
```
Now we can configure the seed to be loaded as below.

- I `enabled` this seed.
- I want to load it into the `raw` schema (a dataset, in BigQuery terms).
- I want to rename the table to `person_occupation`.
- This file contains quotes.
- This file has the following columns.

```yaml
version: 2
seeds:
  - name: person
    config:
      enabled: true
      schema: raw
      alias: person_occupation
      quote_columns: true
      column_types:
        id: int64
        first_name: string
        last_name: string
        gender: string
        occupation: string
```
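By the way, the same configs can also live in `dbt_project.yml` instead of a dedicated YAML file. A minimal sketch, assuming the project is named `dbt_proj01`:

```yaml
# dbt_project.yml (hypothetical excerpt)
seeds:
  dbt_proj01:
    person:
      +enabled: true
      +schema: raw
      +quote_columns: true
```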
## Load seeds
Use this command:

```sh
dbt seed
```
We should see the log if the seeds are loaded successfully.

```
20:47:41 Running with dbt=1.10.13
20:47:42 Registered adapter: bigquery=1.10.2
20:47:43 Unable to do partial parsing because change detected to override macro. Starting full parse.
20:47:43 Found 2 models, 1 seed, 4 data tests, 509 macros
20:47:43
20:47:43 Concurrency: 1 threads (target='dev')
20:47:43
20:47:48 1 of 1 START seed file dbt_test_dataset_raw.person_occupation .................. [RUN]
20:47:53 1 of 1 OK loaded seed file dbt_test_dataset_raw.person_occupation .............. [INSERT 5 in 5.19s]
20:47:53
20:47:53 Finished running 1 seed in 0 hours 0 minutes and 9.94 seconds (9.94s).
20:47:53
20:47:53 Completed successfully
20:47:53
20:47:53 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
```
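Note that `dbt seed` loads every seed in the project. We can also target a single seed or force a rebuild of the seed tables:

```sh
dbt seed --select person    # load only the person seed
dbt seed --full-refresh     # drop and re-create seed tables before loading
```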
And check the seed tables.

Aww! We want the schema (dataset) to be `raw`, but it is `dbt_test_dataset_raw` instead. We need to override this.
## Override the schema name
According to the dbt docs, if we supply a value for `schema`, the value does not replace the default schema name from `profiles.yml`; it is appended to it.
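For reference, the built-in macro behaves roughly like this (adapted from the dbt docs), which is why the custom name ends up appended to the default one:

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {#- default behaviour: <default_schema>_<custom_schema_name> -#}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```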
What we have to do is create a macro that overrides this behaviour.

We create a new file named `generate_schema_name.sql`, so the filename describes the action, in the `macros/` directory.

```
dbt_proj01/macros
└── generate_schema_name.sql
```
Write the content as below. This Jinja means:

- If we don’t supply a `schema`, use the default schema.
- If we do supply a `schema`, use the input with no changes.

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```
Run `dbt seed` again.
```
$ dbt seed
20:52:19 Running with dbt=1.10.13
20:52:21 Registered adapter: bigquery=1.10.2
20:52:21 Unable to do partial parsing because change detected to override macro. Starting full parse.
20:52:21 Found 2 models, 1 seed, 4 data tests, 509 macros
20:52:21
20:52:21 Concurrency: 1 threads (target='dev')
20:52:21
20:52:26 1 of 1 START seed file raw.person_occupation ................................... [RUN]
20:52:33 1 of 1 OK loaded seed file raw.person_occupation ............................... [INSERT 5 in 6.71s]
20:52:33
20:52:33 Finished running 1 seed in 0 hours 0 minutes and 11.17 seconds (11.17s).
20:52:33
20:52:33 Completed successfully
20:52:33
20:52:33 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
```
Check again, and now we have the dataset `raw` as expected.
## Use seeds in models
We have loaded our seeds, so it’s time to use them in our models.

Let’s say I want a model that finds all female persons. We can write this:
```sql
SELECT *
FROM {{ ref('person') }}
WHERE gender = 'F'
```
We use `{{ ref('<seed_name>') }}` to refer to the seed tables.
## Compile models
Let’s see the compiled result:
```
$ dbt compile
15:21:22 Running with dbt=1.10.13
15:21:23 Registered adapter: bigquery=1.10.2
15:21:24 Found 1 seed, 3 models, 4 data tests, 509 macros
15:21:24
15:21:24 Concurrency: 1 threads (target='dev')
15:21:24
Compiled node 'woman_jobs' is:
SELECT *
FROM `bluebirz-playground`.`raw`.`person_occupation`
WHERE gender = 'F'
```
As above, dbt substitutes the seed’s actual table into our model (the `FROM` clause of the compiled query).
## Sources
Sources are the real tables/views in the data warehouse that we want to run queries on. As they already exist, dbt only needs configuration telling it where they are.
Let’s say I already have this table `books` in the dataset `raw` in BigQuery.

And the schema of table `books` is as below.
```json
[
  {"name": "id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "isbn", "type": "STRING", "mode": "NULLABLE"},
  {"name": "title", "type": "STRING", "mode": "NULLABLE"},
  {"name": "author", "type": "STRING", "mode": "NULLABLE"},
  {"name": "published_year", "type": "INTEGER", "mode": "NULLABLE"}
]
```
After that, we want a new table derived from `books`: the number of books per decade, counting only books published before the year 2000.
## Prepare sources and models
We have to create configuration files and query files as below.

We create a configuration file and a query file in the `models/books/` directory.

```
models/books
├── old-books-by-decade.sql
└── source-books.yml
```
Here is the configuration file for the source. I name this source `raw_data`, link it to the schema (dataset) `raw`, and specify the table `books`.

```yaml
version: 2
sources:
  - name: raw_data
    schema: raw
    tables:
      - name: books
        description: all books in my library
```
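As a side note, a source entry can also declare freshness checks, verified with `dbt source freshness`. A minimal sketch, assuming the table had a timestamp column such as `updated_at` (our sample `books` table does not):

```yaml
sources:
  - name: raw_data
    schema: raw
    tables:
      - name: books
        loaded_at_field: updated_at  # hypothetical timestamp column
        freshness:
          warn_after: {count: 24, period: hour}
```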
This query computes the decade of each book published before 2000 and counts the books per decade from the source `raw_data.books`.

```sql
{{- config(
    materialized='table',
    schema='transform'
) -}}

SELECT
    SAFE_CAST(FLOOR(published_year / 10) * 10 AS INT64) AS decade,
    COUNT(*) AS book_count
FROM
    {{ source('raw_data', 'books') }}
WHERE
    published_year < 2000
GROUP BY
    decade
```
- This model will be materialized as a table, as configured in the `{{ config() }}` block.
- I put `schema='transform'` here so this table will be created in the schema `transform` (it also passes through our `generate_schema_name` override, which is why the name is used as-is).
- Adding `-` inside the Jinja delimiters, as in `{{- ... -}}`, trims leading/trailing whitespace.
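The same two configs could instead live in `dbt_project.yml`, applied per folder or per model. A minimal sketch, assuming the project is named `dbt_proj01` and the model sits under `models/books/`:

```yaml
# dbt_project.yml (hypothetical excerpt)
models:
  dbt_proj01:
    books:
      +materialized: table
      +schema: transform
```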
## Compile models
So let’s see the final query statement via `dbt compile`.
```
$ dbt compile
18:18:43 Running with dbt=1.10.13
18:18:45 Registered adapter: bigquery=1.10.2
18:18:45 Found 3 models, 2 seeds, 4 data tests, 1 source, 509 macros
18:18:45
18:18:45 Concurrency: 1 threads (target='dev')
18:18:45
Compiled node 'old-books-by-decade' is:
SELECT
    SAFE_CAST(FLOOR(published_year / 10) * 10 AS INT64) AS decade,
    COUNT(*) AS book_count
FROM `bluebirz-playground`.`raw`.`books`
WHERE published_year < 2000
GROUP BY decade
```
Yes! The query looks correct.
## Run models
Now we can run it.
```
$ dbt run
18:18:50 Running with dbt=1.10.13
18:18:51 Registered adapter: bigquery=1.10.2
18:18:52 Found 3 models, 2 seeds, 4 data tests, 1 source, 509 macros
18:18:52
18:18:52 Concurrency: 1 threads (target='dev')
18:18:52
18:18:53 1 of 1 START sql table model transform.old-books-by-decade ..................... [RUN]
18:18:57 1 of 1 OK created sql table model transform.old-books-by-decade ................ [CREATE TABLE (10.0 rows, 3.9 KiB processed) in 3.93s]
18:18:57
18:18:57 Finished running 1 table model in 0 hours 0 minutes and 4.48 seconds (4.48s).
18:18:57
18:18:57 Completed successfully
18:18:57
18:18:57 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
```
See the result table in BigQuery.
This model runs successfully.
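Note that `dbt run` builds every model in the project. To build only this one, we can select it by name:

```sh
dbt run --select old-books-by-decade
```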
## Wrap up
- seeds:
    - put CSV files in the `seeds/` directory.
    - define YAML configurations in syntax like this:

      ```yaml
      version: 2
      seeds:
        - name: <seed_name>
          config:
            enabled: <true|false>
            schema: <schema>
            alias: <alias>
            quote_columns: <true|false>
            column_types:
              <column_name>: <data_type>
      ```

    - execute `dbt seed` to load seeds.
    - refer to seeds in models with `{{ ref('<seed_name>') }}`.
- override the schema name by creating `macros/generate_schema_name.sql` with Jinja as above.
- sources:
    - define YAML configurations in syntax like this:

      ```yaml
      version: 2
      sources:
        - name: <source_name>
          schema: <schema>
          tables:
            - name: <table_name>
              description: <description>
      ```

    - use `{{ source('<source_name>', '<table_name>') }}` to refer to the source table in the model.
    - create a table instead of the default view with `{{ config(materialized='table') }}` in the model.
    - create the table in a specific schema (dataset) with `{{ config(schema='<schema_name>') }}` in the model.
- compile all models with `dbt compile`.
- run all models with `dbt run`.