Slowly Changing Dimensions
Let's talk about the concept of capturing data changes.
We capture data into our databases and data warehouses, and data can come and change over time.
Let’s talk about the concept of capturing those changes.
Slowly Changing Dimensions (SCD)
Slowly Changing Dimensions or SCD is one of core concepts in data warehousing. When we receive new data every day, how could we keep them into our data warehouses in an effective way in terms of cost, performance, and quality?
Types of SCD
SCD can be divided into many types, and each type has its own way to maintain data change. In this blog I will share only 5 popular types:
Type | Action |
---|---|
Type 0 | No change |
Type 1 | Overwrite |
Type 2 | Insert rows |
Type 3 | Separate historical columns |
Type 4 | Separate historical tables |
There are type 5 which is type 1 + 4, type 6 which is type 1 + 2 + 3, and more but they are not much popular because they’re too specific for general use cases and quite challenging to implement.
SCD type 0
Type 0 is a type of unchanging or constants e.g. postal codes.
SCD type 1
Type 1 is an overwriting, no old data is kept. It’s good in case we don’t want to maintain the history or the data is not much important and we want just a current data.
SCD type 2
Type 2 is a popular one when we need to track all history then we keep adding the updated rows into the tables but gain more table size in exchange.
To indicate which record is older or newer, we usually add date time columns like created_at
, updated_at
, effective_date
or some flags such as is_current
or similar for that purpose.
This type should have a new surrogate key as a primary key (sid
in this figure) because original keys (natural keys) would be duplicated when we insert new records of the same key.
SCD type 3
Type 3 is to separate the historical data into dedicated columns. We can choose that we are going to:
- track the first values (to see the first and the last), or
- track the previous values (to see the recent changes)
Surrogate keys are unnecessary for this type.
SCD type 4
Type 4 means we maintain:
- a separated table to track the history (like type 3), and
- update the main table to keep latest data (like type 1).
It’s useful when we need to trace back we can find in historical table, and find the main table in most cases to see the recent data. And surrogate keys are necessary for this type with the same reason as type 2.
OLTP vs OLAP
Basically we can distinguish two types of database systems as:
- OLTP (Online Transaction Processing)
- designed for transactional data
- optimized for writing data, fast query processing, and maintaining data integrity
- Example databases are MySQL and Oracle
- Example use cases are customer relationship management (CRM), e-commerce platforms, and banking systems
- OLAP (Online Analytical Processing)
- designed for analytics and complex queries
- optimized for read-heavy operations and support decision-making processes
- Example databases are Google BigQuery and Amazon RedShift
- Example use cases are data warehouses and business intelligence systems
SCD in my works
In my experience that mostly involves with OLAP such as Google BigQuery, Slowly Changing Dimensions (SCD) type 2 is the type I have used the most with INSERT statements for adding new records. On par with that, MERGE statements or upsert (insert + update) are also frequent ones to update the existing records and insert the new ones. So I can keep track every changes.
References
- Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses
- Slowly Changing Dimensions(SCD): Types with Examples
- Slow Changing Dimension Type 2 and Type 4 Concept and Implementation | by Amitava Nandi | Medium
- Slowly changing dimension - Wikipedia
- SCD: Slowly Changing Dimension, an Ultimate Guide - RADACAD
- Slowly Changing Dimension (SCD) Type 3 - Kontext