Overview
In the world of data warehousing, one of the fundamental challenges is handling changes in dimension data over time. The concept of Slowly Changing Dimensions (SCD) addresses this challenge, allowing businesses to track and store historical data as it evolves. SCD refers to how data in dimensional tables changes slowly and gradually rather than frequently or unpredictably. There are different strategies (or types) for managing these changes, and in this blog, we will focus on SCD Type 2 and SCD Type 3, explaining their concepts and showing how they can be implemented using PySpark.
Slowly Changing Dimension (SCD)
A Slowly Changing Dimension is a dimension whose attribute values change slowly and irregularly over time rather than on a fixed schedule. For example:
- A customer’s address or marital status might change, but these changes are infrequent.
- A product’s category could change if the business redefines its taxonomy.
Types of Slowly Changing Dimensions
Several SCD types define how historical changes in dimension data are stored. The two covered here are Type 2 and Type 3.
SCD Type 2: Full Historical Tracking
SCD Type 2 is used when we want to preserve the entire history of changes in dimension attributes. This means that every time a change occurs in a dimension record, a new record is inserted into the dimension table, and the old record is maintained with an indicator of whether it is the current record. This allows businesses to see exactly how a dimension has evolved.
For example, consider a customer dimension table after an address change (the data mirrors the PySpark example below):

| Customer_ID | Customer_Name | Address  | Start_Date | End_Date   | Current_Flag |
|-------------|---------------|----------|------------|------------|--------------|
| 1           | Aiswarya      | Newtown  | 2020-01-01 | 2023-01-01 | N            |
| 1           | Aiswarya      | Saltlake | 2023-01-02 | NULL       | Y            |

- The first record represents the customer’s address before the change; its End_Date is filled in and its Current_Flag is ‘N’.
- The second record captures the updated address and has a Current_Flag of ‘Y’, indicating that this is the most recent address.
Implementation of SCD Type 2 Using PySpark
In PySpark, implementing SCD Type 2 involves merging incoming data with the existing dimension data and creating new records for updates. Below is a simplified implementation:
PySpark code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, lit

# Initialize Spark session
spark = SparkSession.builder.appName("SCDType2").getOrCreate()

# Existing dimension data: the current record has no End_Date and is flagged 'Y'
existing_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Newtown', '2020-01-01', None, 'Y')],
    'Customer_ID INT, Customer_Name STRING, Address STRING, '
    'Start_Date STRING, End_Date STRING, Current_Flag STRING'
)

# Incoming data with the customer's new address
new_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Saltlake', '2023-01-02')],
    ['Customer_ID', 'Customer_Name', 'Address', 'Start_Date']
)

# Step 1: Identify records whose address has changed
change_df = existing_df.alias('existing').join(
    new_df.alias('new'), on='Customer_ID', how='inner'
).filter(col('existing.Address') != col('new.Address'))

# Step 2: Close out the old records (set the end date and flag them 'N')
updated_existing_df = existing_df.join(
    change_df.select('Customer_ID'), on='Customer_ID', how='inner'
).withColumn('End_Date', current_date().cast('string')) \
 .withColumn('Current_Flag', lit('N'))

# Step 3: Insert new records carrying the updated attributes
new_records_df = change_df.select(
    col('Customer_ID'),
    col('new.Customer_Name').alias('Customer_Name'),
    col('new.Address').alias('Address'),
    col('new.Start_Date').alias('Start_Date'),
).withColumn('End_Date', lit(None).cast('string')) \
 .withColumn('Current_Flag', lit('Y'))

# Combine the closed-out records with the new current records
final_df = updated_existing_df.union(new_records_df)
final_df.show()
In this code:
- We compare the existing data (existing_df) with the incoming new data (new_df).
- For any changed records, we “close” the old records by updating the End_Date and setting the Current_Flag to ‘N’.
- New records are added with the updated information and a Current_Flag of ‘Y’.
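If you run the snippet above, final_df.show() prints something like the following; the End_Date on the closed record is whatever date the job runs, so your output will differ there:

+-----------+-------------+--------+----------+----------+------------+
|Customer_ID|Customer_Name| Address|Start_Date|  End_Date|Current_Flag|
+-----------+-------------+--------+----------+----------+------------+
|          1|     Aiswarya| Newtown|2020-01-01|2025-01-01|           N|
|          1|     Aiswarya|Saltlake|2023-01-02|      NULL|           Y|
+-----------+-------------+--------+----------+----------+------------+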
Benefits:
- Complete historical tracking.
- Allows analysis of data as it existed at any point in time (see the sketch after this list).
- Useful when business processes must be tracked over time, such as customer activity or product changes.
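To make the second benefit concrete, here is a minimal sketch of a point-in-time lookup against the final_df produced above. It assumes End_Date is stored as an ISO date string and is NULL for the current record, so plain string comparison orders dates correctly:

from pyspark.sql.functions import col, lit

# Reconstruct the dimension as it looked on a given date: a version is valid
# if it started on or before that date and had not yet been closed out
as_of = '2022-06-15'
as_of_df = final_df.filter(
    (col('Start_Date') <= lit(as_of)) &
    (col('End_Date').isNull() | (col('End_Date') > lit(as_of)))
)
as_of_df.show()  # returns the 'Newtown' version, which was valid on 2022-06-15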
SCD Type 3: Storing Previous and Current Values
SCD Type 3 tracks limited history by storing only the current and previous values. It’s suitable for infrequent changes where full historical tracking isn’t needed.
Example:
If a customer’s address changes, the table keeps both the current and previous addresses (again mirroring the code below):

| Customer_ID | Customer_Name | Current_Address | Previous_Address |
|-------------|---------------|-----------------|------------------|
| 1           | Aiswarya      | Saltlake        | Newtown          |
Implementation using PySpark:
from pyspark.sql.functions import col

# Reuses the Spark session created in the Type 2 example above

# Existing dimension data (Previous_Address is empty until the first change)
existing_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Newtown', None)],
    'Customer_ID INT, Customer_Name STRING, '
    'Current_Address STRING, Previous_Address STRING'
)

# Incoming data with the latest address
new_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Saltlake')],
    ['Customer_ID', 'Customer_Name', 'Current_Address']
)

# Step 1: Identify customers whose address has changed
change_df = existing_df.alias('existing').join(
    new_df.alias('new'), on='Customer_ID', how='inner'
).filter(col('existing.Current_Address') != col('new.Current_Address'))

# Step 2: Shift the old current address into Previous_Address and
# take the new address as Current_Address
updated_df = change_df.select(
    col('Customer_ID'),
    col('existing.Customer_Name').alias('Customer_Name'),
    col('new.Current_Address').alias('Current_Address'),
    col('existing.Current_Address').alias('Previous_Address'),
)

updated_df.show()
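One caveat with the simplified join above: customers whose address did not change drop out of change_df entirely. As a minimal sketch (assuming the same column names as above), a left join combined with when() keeps every existing customer and only shifts addresses where a change actually occurred:

from pyspark.sql.functions import col, when

# Reusable change predicate: a new address arrived and it differs from the old one
changed = col('new.Current_Address').isNotNull() & \
          (col('new.Current_Address') != col('existing.Current_Address'))

merged_df = existing_df.alias('existing').join(
    new_df.alias('new'), on='Customer_ID', how='left'
).select(
    col('Customer_ID'),
    col('existing.Customer_Name').alias('Customer_Name'),
    # Take the new address when it changed, otherwise keep the old one
    when(changed, col('new.Current_Address'))
        .otherwise(col('existing.Current_Address')).alias('Current_Address'),
    # Shift the old current address into Previous_Address on change
    when(changed, col('existing.Current_Address'))
        .otherwise(col('existing.Previous_Address')).alias('Previous_Address'),
)
merged_df.show()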
Benefits:
- Tracks current and previous states.
- Saves storage compared to Type 2.
- Suitable for scenarios where limited history is sufficient.
Conclusion
Slowly Changing Dimensions, particularly SCD Type 2 and Type 3, are pivotal in ensuring the relevance and accuracy of the data used for analysis in a business intelligence environment. Understanding these strategies allows organizations to choose the right approach based on their needs and requirements. Incorporating PySpark into your data processing framework enables efficient handling of large datasets and simplifies the implementation of SCDs, ultimately leading to more insightful analysis.
So, whether you are developing a system for customer analytics or managing any business process with changing attributes, the strategic implementation of SCDs can provide deeper insights and answer questions about your operations over time.
Drop a query if you have any questions regarding SCD, and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront and many more.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat’s offerings.
FAQs
1. What is the Difference between SCD Type 2 and Type 3?
ANS:
- Type 2: Creates a new record for every change, preserving full history with Start_Date, End_Date, and Current_Flag.
- Type 3: Stores only current and previous values, tracking limited history for recent changes.
2. When to use Type 2 vs. Type 3?
ANS:
- Use Type 2 to track full history (e.g., address changes).
- Use Type 3 for limited changes (e.g., last two addresses).
WRITTEN BY Aiswarya Sahoo