Let's try: dbt part 4 - model materialization
Build data products with various transformation models to tackle business problems
In the previous part we set up our data sources, so now we can build data products with various transformation models to tackle business problems.
dbt models
Models are data products that answer real-world questions or give insights to decision makers.
dbt provides model-building tools to manage our datasets and data products. We need to understand when and how to use each model type to gain the best performance and bring a great experience to our data customers.
Right now we can build 5 materialization types1 in dbt:
- View
- Table
- Incremental
- Ephemeral
- Materialized view
Configure materialization
We can configure materialization in 3 ways:

- configure in `dbt_project.yml`
- configure in the `config()` macro in model SQL files
- configure in model YAML files, such as `property.yml`
Given a project structure like this:

```
dbt_proj01
├── dbt_project.yml
└── models
    └── climate
        ├── property.yml
        ├── sources.yml
        └── weekly-delhi-climate.sql
```
In `dbt_project.yml`:

```yaml
models:
  dbt_proj01:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: <materialization type>
    climate:
      +materialized: <materialization type>
```
In the `config()` macro in a model SQL file:

```sql
{{- config(
    materialized='<materialization type>',
) -}}

SELECT ...
```
In a model YAML file, such as `property.yml`:

```yaml
version: 2

models:
  - name: <model name>
    config:
      materialized: <materialization type>
```
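As a concrete sketch, a `property.yml` for our project could look like the following (the model name `weekly-delhi-climate` comes from the file tree above; `view` is just one possible choice):

```yaml
version: 2

models:
  # Model name matches the SQL file name without the extension.
  - name: weekly-delhi-climate
    config:
      # Any of the five types would work here; view is the dbt default.
      materialized: view
```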
Now we’re gonna see how we can utilize and build a model in each type of materialization.
Try building models
Let’s say we have Delhi climate data from Kaggle like this.
And we have `sources.yml` as follows:
```yaml
version: 2

sources:
  - name: delhi_climate
    schema: raw
    tables:
      - name: daily_delhi_climate
        columns:
          - name: date
            description: Date of format YYYY-MM-DD
          - name: meantemp
            description: Mean temperature averaged out from multiple 3 hour intervals in a day.
          - name: humidity
            description: Humidity value for the day (units are grams of water vapor per cubic meter volume of air).
          - name: wind_speed
            description: Wind speed measured in kmph.
          - name: meanpressure
            description: Pressure reading of weather (measured in atm).
```
So we would like to compute the average value of those 4 metrics on a weekly basis, with weeks starting on Sunday.
The calendar of that period is here.
```
                              2016
      October               November              December
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                   1         1  2  3  4  5               1  2  3
 2  3  4  5  6  7  8   6  7  8  9 10 11 12   4  5  6  7  8  9 10
 9 10 11 12 13 14 15  13 14 15 16 17 18 19  11 12 13 14 15 16 17
16 17 18 19 20 21 22  20 21 22 23 24 25 26  18 19 20 21 22 23 24
23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
30 31
```
For example, a week can be Sunday 30 October to Saturday 5 November, the next week is Sunday 6 November to Saturday 12 November, and so on.
Let’s build all 5 models one by one.
Materialization: View
View is the default materialization of dbt models.
We consider a view when we want light transformations, such as selecting some fields, lowercasing values, or unioning with other tables. Managing authorization, allowing or disallowing users to query the underlying data, is one of the big advantages of views.
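To illustrate the authorization point, here is a sketch in BigQuery SQL; the dataset name follows the examples later in this post, while the user email is made up:

```sql
-- Grant read access on the transform dataset holding our curated views,
-- so analysts can query them without any access to the raw source tables.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `bluebirz-playground.transform`
TO "user:analyst@example.com";
```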
We create a view by providing `materialized='view'` in `config()`, or by not providing it at all, since dbt creates a view by default.
```sql
{{- config(schema="transform", alias="weekly_climate_stats", materialized="view") -}}

select
    date_trunc(date, week(sunday)) as week_start_sunday,
    avg(meantemp) as avg_temp,
    avg(humidity) as avg_humidity,
    avg(wind_speed) as avg_wind_speed,
    avg(meanpressure) as avg_pressure
from {{ source("delhi_climate", "daily_delhi_climate") }}
group by all
```
- `{{ config(materialized='view') }}` to make this model a view.
- `date_trunc(date, week(sunday))` returns the Sunday that begins the week containing that `date` (see the quick check after this list).
- `{{ source() }}` to refer to the source defined in `sources.yml`.
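A quick way to convince ourselves of the week logic is a one-liner in BigQuery (9 November 2016 was a Wednesday):

```sql
-- date_trunc(..., week(sunday)) snaps a date back to the Sunday of its week.
select date_trunc(date '2016-11-09', week(sunday)) as week_start_sunday;
-- returns 2016-11-06
```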
Materialization: Table
Another basic materialization, table.
We consider a table when we want a persistent data model for users or dashboards, especially when the query behind the model takes so long to refresh that a view would keep users waiting.
Just put `materialized='table'` and this model becomes a table.

```sql
{{- config(schema="transform", alias="weekly_climate_stats", materialized="table") -}}

select
    date_trunc(date, week(sunday)) as week_start_sunday,
    avg(meantemp) as avg_temp,
    avg(humidity) as avg_humidity,
    avg(wind_speed) as avg_wind_speed,
    avg(meanpressure) as avg_pressure
from {{ source("delhi_climate", "daily_delhi_climate") }}
group by all
```
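Under the hood, dbt compiles a table materialization into a `create or replace table ... as select ...` statement; a sketch based on the compiled output shown later in this post:

```sql
-- Roughly what dbt runs for a table materialization (full rebuild each run).
create or replace table `bluebirz-playground`.`transform`.`weekly_climate_stats`
as (
    select
        date_trunc(date, week(sunday)) as week_start_sunday,
        avg(meantemp) as avg_temp,
        avg(humidity) as avg_humidity,
        avg(wind_speed) as avg_wind_speed,
        avg(meanpressure) as avg_pressure
    from `bluebirz-playground`.`raw`.`daily_delhi_climate`
    group by all
);
```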
Materialization: Incremental
When the table is too big and there is no point refreshing all the data every time, an incremental model would be the right fit.
We consider an incremental model if the table is big and we don’t want to recompute the whole result every time; doing so wastes both time and money. Incremental models let us pick only the new data (the delta) from the source, transform it, then put it into the target table, in the spirit of Slowly Changing Dimensions (old blog: Slowly Changing Dimensions).
In order to utilize incremental models, we need to understand incremental strategies2 such as append or merge; the available strategies depend on the adapter.
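For instance, a minimal append-style sketch, assuming an adapter that supports the `append` strategy (BigQuery defaults to `merge`, so this is for illustration only):

```sql
{{- config(materialized="incremental", incremental_strategy="append") -}}

-- Append only the rows newer than what the target table already holds;
-- no unique_key is needed because existing rows are never updated.
select *
from {{ source("delhi_climate", "daily_delhi_climate") }}
{% if is_incremental() -%}
    where date > (select max(date) from {{ this }})
{%- endif %}
```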
Model
```sql
{{-
    config(
        schema="transform",
        alias="weekly_climate_stats",
        materialized="incremental",
        incremental_strategy="merge",
        unique_key="week_start_sunday",
    )
-}}

select
    date_trunc(date, week(sunday)) as week_start_sunday,
    avg(meantemp) as avg_temp,
    avg(humidity) as avg_humidity,
    avg(wind_speed) as avg_wind_speed,
    avg(meanpressure) as avg_pressure,
    current_timestamp() as inserted_at
from {{ source("delhi_climate", "daily_delhi_climate") }}
{% if is_incremental() -%}
    where
        date_trunc(date, week(sunday)) > (select max(week_start_sunday) from {{ this }})
{%- endif %}
group by all
```
- `materialized="incremental"` to define this model as an incremental model.
- `incremental_strategy="merge"` to use the merge operation.
- `unique_key=["<field1>", "<field2>", ...]` to assign the unique fields used for matching rows in the merge (a multi-field sketch follows this list).
- `is_incremental()` returns `true` only if:
  - the model’s table already exists,
  - this model is incremental, and
  - the run is executed without the flag `--full-refresh`.
- When `is_incremental()` is `false`, the model will be materialized as a table instead.
- `{% if is_incremental() %}...{% endif %}` includes the query inside this block only when `is_incremental()` is `true`. In the model above, this activates the `where` clause that picks new records.
- `where date_trunc(...) > (select max(week_start_sunday) from ...)` means a record is picked as new when its week’s Sunday is later than the latest (max) `week_start_sunday` in the current table.
- `{{ this }}` to refer to this model as a self-reference.
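If one field were not enough to identify a row, `unique_key` takes a list. A hypothetical sketch, pretending our source also had a `city` column (it does not, since the dataset covers Delhi only):

```sql
{{-
    config(
        materialized="incremental",
        incremental_strategy="merge",
        -- rows match only when both fields are equal
        unique_key=["week_start_sunday", "city"],
    )
-}}

select
    date_trunc(date, week(sunday)) as week_start_sunday,
    city,  -- hypothetical column for illustration
    avg(meantemp) as avg_temp
from {{ source("delhi_climate", "daily_delhi_climate") }}
group by all
```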
First run
There is no table for this incremental model yet, `transform.weekly_climate_stats`, and the source has 30 records as of November 2016.

After running, we got the table `transform.weekly_climate_stats` with 5 records. The model runs as a fresh build, via a `create or replace table ...` statement:
```sql
$ cat target/run/dbt_proj01/models/climate/weekly-delhi-climate.sql

create or replace table `bluebirz-playground`.`transform`.`weekly_climate_stats`
OPTIONS()
as (
    select
        date_trunc(date, week(sunday)) as week_start_sunday,
        avg(meantemp) as avg_temp,
        avg(humidity) as avg_humidity,
        avg(wind_speed) as avg_wind_speed,
        avg(meanpressure) as avg_pressure,
        current_timestamp() as inserted_at
    from `bluebirz-playground`.`raw`.`daily_delhi_climate`
    group by all
);
```
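As a quick sanity check, we can count the rows ourselves (the 5 expected weeks start on the Sundays 30 October and 6, 13, 20, 27 November 2016):

```sql
select count(*) as row_count
from `bluebirz-playground`.`transform`.`weekly_climate_stats`;
-- expect 5
```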
Next run
Now we have the incremental model in place and, on top of that, 31 additional records in the source table: the December 2016 data.
We got more rows, 9 in total. The new rows, whose `week_start_sunday` values fall in December 2016, have a different `inserted_at`.

This run, the model uses the `merge` operation and includes the `where` clause to pick new records, as shown in the compiled SQL below. When matched it updates, and when not matched it inserts.
```sql
$ cat target/run/dbt_proj01/models/climate/weekly-delhi-climate.sql

merge into `bluebirz-playground`.`transform`.`weekly_climate_stats` as DBT_INTERNAL_DEST
using (
    select
        date_trunc(date, week(sunday)) as week_start_sunday,
        avg(meantemp) as avg_temp,
        avg(humidity) as avg_humidity,
        avg(wind_speed) as avg_wind_speed,
        avg(meanpressure) as avg_pressure,
        current_timestamp() as inserted_at
    from `bluebirz-playground`.`raw`.`daily_delhi_climate`
    where
        date_trunc(date, week(sunday))
        > (select max(week_start_sunday) from `bluebirz-playground`.`transform`.`weekly_climate_stats`)
    group by all
) as DBT_INTERNAL_SOURCE
on ((DBT_INTERNAL_SOURCE.week_start_sunday = DBT_INTERNAL_DEST.week_start_sunday))
when matched then update set
    `week_start_sunday` = DBT_INTERNAL_SOURCE.`week_start_sunday`,
    `avg_temp` = DBT_INTERNAL_SOURCE.`avg_temp`,
    `avg_humidity` = DBT_INTERNAL_SOURCE.`avg_humidity`,
    `avg_wind_speed` = DBT_INTERNAL_SOURCE.`avg_wind_speed`,
    `avg_pressure` = DBT_INTERNAL_SOURCE.`avg_pressure`,
    `inserted_at` = DBT_INTERNAL_SOURCE.`inserted_at`
when not matched then insert
    (`week_start_sunday`, `avg_temp`, `avg_humidity`, `avg_wind_speed`, `avg_pressure`, `inserted_at`)
values
    (`week_start_sunday`, `avg_temp`, `avg_humidity`, `avg_wind_speed`, `avg_pressure`, `inserted_at`)
```
Full refresh
In case we want to rebuild the model from scratch, we can execute `dbt run --full-refresh`.
We are using the updated source, which has data as of both November and December 2016.
After `dbt run --full-refresh`, we have a new table that looks like the last run, but this one has the same `inserted_at` on every row because we refreshed the whole table at once.
```sql
$ cat target/run/dbt_proj01/models/climate/weekly-delhi-climate.sql

create or replace table `bluebirz-playground`.`transform`.`weekly_climate_stats`
OPTIONS()
as (
    select
        date_trunc(date, week(sunday)) as week_start_sunday,
        avg(meantemp) as avg_temp,
        avg(humidity) as avg_humidity,
        avg(wind_speed) as avg_wind_speed,
        avg(meanpressure) as avg_pressure,
        current_timestamp() as inserted_at
    from `bluebirz-playground`.`raw`.`daily_delhi_climate`
    group by all
);
```
Materialization: Ephemeral
When we want a virtual model that does some work and is also reusable.
We consider ephemeral models to act as reusable functions. dbt compiles this type of model into a CTE (Common Table Expression), so we can’t see a final product in the warehouse, yet we can still call the model with `ref()`.
Adding `materialized='ephemeral'` and the model is ready.

```sql
{{-
    config(
        materialized="ephemeral",
    )
-}}

select
    date_trunc(date, week(sunday)) as week_start_sunday,
    avg(meantemp) as avg_temp,
    avg(humidity) as avg_humidity,
    avg(wind_speed) as avg_wind_speed,
    avg(meanpressure) as avg_pressure,
from {{ source("delhi_climate", "daily_delhi_climate") }}
group by all
```
Use the ephemeral model with `ref('<ephemeral_filename>')`:

```sql
{{-
    config(
        schema="transform",
        alias="weekly_climate_stats",
        materialized="table",
    )
-}}

select *
from {{ ref("weekly_climate_compute_cte") }}
```
The ephemeral model becomes a CTE with the prefix `__dbt__cte__`.

```sql
$ cat target/compiled/dbt_proj01/models/climate/weekly-delhi-stats.sql

with __dbt__cte__weekly_climate_compute_cte as (
    select
        date_trunc(date, week(sunday)) as week_start_sunday,
        avg(meantemp) as avg_temp,
        avg(humidity) as avg_humidity,
        avg(wind_speed) as avg_wind_speed,
        avg(meanpressure) as avg_pressure,
    from `bluebirz-playground`.`raw`.`daily_delhi_climate`
    group by all
)
select *
from __dbt__cte__weekly_climate_compute_cte
```
Materialization: Materialized View
A materialized view is an advanced view that computes and stores the query results beforehand. However, materialized views support only a limited set of SQL features, such as AVG, SUM, and COUNT, but not window functions, user-defined functions, and some others.
We consider materialized views when we want precomputed data while still keeping the view form, with some aggregations, joins, and filters, so we have the flexibility to update the query without maintaining physical tables.
According to BigQuery, materialized views can be configured3 with:

- `enable_refresh` to allow BigQuery to refresh this materialized view automatically when the source tables change.
- `refresh_interval_minutes` is the interval in minutes at which this materialized view will be refreshed from the source tables.
- `max_staleness`6,7 is the time window for serving cached data: if the last refresh happened outside this window, the view will query the source tables instead.
We define a materialized view with `materialized='materialized_view'` plus the other configurations above.

```sql
{{-
    config(
        schema="transform",
        alias="weekly_climate_stats",
        materialized="materialized_view",
        on_configuration_change="apply",
        enable_refresh=True,
        refresh_interval_minutes=180,
        max_staleness="INTERVAL '4:0:0' HOUR TO SECOND",
    )
-}}

select
    date_trunc(date, week(sunday)) as week_start_sunday,
    avg(meantemp) as avg_temp,
    avg(humidity) as avg_humidity,
    avg(wind_speed) as avg_wind_speed,
    avg(meanpressure) as avg_pressure,
from {{ source("delhi_climate", "daily_delhi_climate") }}
group by all
```
After `dbt run`, we should see the new materialized view.
Tips
- Consider partitioned tables as source tables to reduce refresh cost and time.
- Set `refresh_interval_minutes` to align with the updates on the source tables. For example, if the source tables receive new data every 2 hours, we can set `refresh_interval_minutes` to 180: 2 hours plus 1 extra hour as a buffer = 3 × 60 = 180 minutes.
- Set `max_staleness` based on our usage. For example, if we have dashboards running between 9 AM and 10 AM and the materialized views refresh at 7 AM, we can set `max_staleness` to 4 hours to ensure the dashboards will use the cache.
Wrap up
- Configure materialization types in:
  - `dbt_project.yml`
  - model files, using `{{ config(materialized='<materialization type>') }}`
  - model YAML files
- view: `{{ config(materialized='view') }}` to make a view.
- table: `{{ config(materialized='table') }}` to make a table in full load.
- incremental:
  - `{{ config(materialized='incremental') }}` to make an incremental model.
  - `{{ config(incremental_strategy='<strategy>') }}` to define how to handle incremental data.
  - `{{ config(unique_key=["<field1>", "<field2>", ...]) }}` to assign unique fields for incremental data.
  - `is_incremental()` to manage extra logic for selecting incremental data; it returns `true` when running the incremental model on an existing table without the flag `--full-refresh`.
  - `dbt run --full-refresh` to rebuild all incremental models from scratch.
  - `{{ this }}` to refer to the model itself.
- ephemeral:
  - `{{ config(materialized='ephemeral') }}` to create an ephemeral model.
  - Refer to an ephemeral model with `{{ ref('<ephemeral_filename>') }}`.
- materialized view:
  - `{{ config(materialized='materialized_view') }}` to make a materialized view.
  - `{{ config(enable_refresh=True) }}` to enable auto refresh.
  - `{{ config(refresh_interval_minutes=<number>) }}` to set the time interval for auto refresh.
  - `{{ config(max_staleness="<interval time>") }}` to set the max staleness time.
References
- Optimizing Materialized Views with dbt | dbt Developer Blog
- Automatic refresh | Manage materialized views | BigQuery | Google Cloud
- Set the frequency cap | Manage materialized views | BigQuery | Google Cloud
- Use materialized views with max_staleness option | Create materialized views | BigQuery | Google Cloud
- Max_Staleness parameters | Database | Google Developer forums