Overview
In the world of data warehousing, one of the fundamental challenges is handling changes in dimension data over time. The concept of Slowly Changing Dimensions (SCD) addresses this challenge, allowing businesses to track and store historical data as it evolves. SCD refers to how data in dimensional tables changes slowly and gradually rather than frequently or unpredictably. There are different strategies (or types) for managing these changes, and in this blog, we will focus on SCD Type 2 and SCD Type 3, explaining their concepts and showing how they can be implemented using PySpark.
Slowly Changing Dimension (SCD)
A Slowly Changing Dimension is a dimension whose attribute values change slowly and irregularly over time rather than on a fixed schedule. For example:
- A customer’s address or marital status might change, but these changes are infrequent.
- A product’s category could change if the business redefines its taxonomy.
Types of Slowly Changing Dimensions
Several SCD types define how historical changes in dimension data are stored. The two covered here are Type 2 and Type 3.
SCD Type 2: Full Historical Tracking
SCD Type 2 is used when we want to preserve the entire history of changes in dimension attributes. This means that every time a change occurs in a dimension record, a new record is inserted into the dimension table, and the old record is maintained with an indicator of whether it is the current record. This allows businesses to see exactly how a dimension has evolved.
For example, consider a customer dimension table after an address change (the data mirrors the PySpark example below):

| Customer_ID | Customer_Name | Address  | Start_Date | End_Date   | Current_Flag |
|-------------|---------------|----------|------------|------------|--------------|
| 1           | Aiswarya      | Newtown  | 2020-01-01 | 2023-01-01 | N            |
| 1           | Aiswarya      | Saltlake | 2023-01-02 | NULL       | Y            |

- The first record represents the customer’s address before the change; its End_Date is filled in and its Current_Flag is ‘N’.
- The second record captures the updated address and has a Current_Flag of ‘Y’, indicating that this is the most recent address.
Implementation of SCD Type 2 Using PySpark
In PySpark, implementing SCD Type 2 involves merging incoming data with the existing dimension data and creating new records for updates. Below is a simplified implementation:
PySpark code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, lit

# Initialize Spark session
spark = SparkSession.builder.appName("SCDType2").getOrCreate()

# Existing dimension data: the current record has no End_Date and is flagged 'Y'
existing_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Newtown', '2020-01-01', None, 'Y')],
    'Customer_ID INT, Customer_Name STRING, Address STRING, '
    'Start_Date STRING, End_Date STRING, Current_Flag STRING'
)

# Incoming data with the customer's new address
new_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Saltlake', '2023-01-02')],
    ['Customer_ID', 'Customer_Name', 'Address', 'Start_Date']
)

# Step 1: Identify records whose address has changed
change_df = existing_df.alias('existing').join(
    new_df.alias('new'), on='Customer_ID', how='inner'
).filter(col('existing.Address') != col('new.Address'))

# Step 2: Close out the old records (set the end date and flag them 'N')
updated_existing_df = existing_df.join(
    change_df.select('Customer_ID'), on='Customer_ID', how='inner'
).withColumn('End_Date', current_date().cast('string')) \
 .withColumn('Current_Flag', lit('N'))

# Step 3: Insert new records carrying the updated attributes
new_records_df = change_df.select(
    col('Customer_ID'),
    col('new.Customer_Name').alias('Customer_Name'),
    col('new.Address').alias('Address'),
    col('new.Start_Date').alias('Start_Date'),
).withColumn('End_Date', lit(None).cast('string')) \
 .withColumn('Current_Flag', lit('Y'))

# Combine the closed-out records with the new current records
final_df = updated_existing_df.union(new_records_df)
final_df.show()
In this code:
- We compare the existing data (existing_df) with the incoming new data (new_df).
- For any changed records, we “close” the old records by updating the End_Date and setting the Current_Flag to ‘N’.
- New records are added with the updated information and a Current_Flag of ‘Y’.
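If you run the snippet above, final_df.show() prints something like the following; the End_Date on the closed record is whatever date the job runs, so your output will differ there:

+-----------+-------------+--------+----------+----------+------------+
|Customer_ID|Customer_Name| Address|Start_Date|  End_Date|Current_Flag|
+-----------+-------------+--------+----------+----------+------------+
|          1|     Aiswarya| Newtown|2020-01-01|2025-01-01|           N|
|          1|     Aiswarya|Saltlake|2023-01-02|      NULL|           Y|
+-----------+-------------+--------+----------+----------+------------+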
Benefits:
- Complete historical tracking.
- Allows analysis of data as it existed at any point in time (see the sketch after this list).
- Useful when business processes must be tracked over time, such as customer activity or product changes.
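To make the second benefit concrete, here is a minimal sketch of a point-in-time lookup against the final_df produced above. It assumes End_Date is stored as an ISO date string and is NULL for the current record, so plain string comparison orders dates correctly:

from pyspark.sql.functions import col, lit

# Reconstruct the dimension as it looked on a given date: a version is valid
# if it started on or before that date and had not yet been closed out
as_of = '2022-06-15'
as_of_df = final_df.filter(
    (col('Start_Date') <= lit(as_of)) &
    (col('End_Date').isNull() | (col('End_Date') > lit(as_of)))
)
as_of_df.show()  # returns the 'Newtown' version, which was valid on 2022-06-15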
SCD Type 3: Storing Previous and Current Values
SCD Type 3 tracks limited history by storing only the current and previous values. It’s suitable for infrequent changes where full historical tracking isn’t needed.
Example:
If a customer’s address changes, the table keeps both the current and previous addresses (again mirroring the code below):

| Customer_ID | Customer_Name | Current_Address | Previous_Address |
|-------------|---------------|-----------------|------------------|
| 1           | Aiswarya      | Saltlake        | Newtown          |
Implementation using PySpark:
from pyspark.sql.functions import col

# Reuses the Spark session created in the Type 2 example above

# Existing dimension data (Previous_Address is empty until the first change)
existing_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Newtown', None)],
    'Customer_ID INT, Customer_Name STRING, '
    'Current_Address STRING, Previous_Address STRING'
)

# Incoming data with the latest address
new_df = spark.createDataFrame(
    [(1, 'Aiswarya', 'Saltlake')],
    ['Customer_ID', 'Customer_Name', 'Current_Address']
)

# Step 1: Identify customers whose address has changed
change_df = existing_df.alias('existing').join(
    new_df.alias('new'), on='Customer_ID', how='inner'
).filter(col('existing.Current_Address') != col('new.Current_Address'))

# Step 2: Shift the old current address into Previous_Address and
# take the new address as Current_Address
updated_df = change_df.select(
    col('Customer_ID'),
    col('existing.Customer_Name').alias('Customer_Name'),
    col('new.Current_Address').alias('Current_Address'),
    col('existing.Current_Address').alias('Previous_Address'),
)

updated_df.show()
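One caveat with the simplified join above: customers whose address did not change drop out of change_df entirely. As a minimal sketch (assuming the same column names as above), a left join combined with when() keeps every existing customer and only shifts addresses where a change actually occurred:

from pyspark.sql.functions import col, when

# Reusable change predicate: a new address arrived and it differs from the old one
changed = col('new.Current_Address').isNotNull() & \
          (col('new.Current_Address') != col('existing.Current_Address'))

merged_df = existing_df.alias('existing').join(
    new_df.alias('new'), on='Customer_ID', how='left'
).select(
    col('Customer_ID'),
    col('existing.Customer_Name').alias('Customer_Name'),
    # Take the new address when it changed, otherwise keep the old one
    when(changed, col('new.Current_Address'))
        .otherwise(col('existing.Current_Address')).alias('Current_Address'),
    # Shift the old current address into Previous_Address on change
    when(changed, col('existing.Current_Address'))
        .otherwise(col('existing.Previous_Address')).alias('Previous_Address'),
)
merged_df.show()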
Benefits:
- Tracks current and previous states.
- Saves storage compared to Type 2.
- Suitable for scenarios where limited history is sufficient.
Conclusion
Slowly Changing Dimensions, particularly SCD Type 2 and Type 3, are pivotal in ensuring the relevance and accuracy of the data used for analysis in a business intelligence environment. Understanding these strategies allows organizations to choose the right approach based on their needs and requirements. Incorporating PySpark into your data processing framework enables efficient handling of large datasets and simplifies the implementation of SCDs, ultimately leading to more insightful analysis.
So, whether you are developing a system for customer analytics or managing any business process with changing attributes, the strategic implementation of SCDs can provide deeper insights and answer questions about your operations over time.
Drop a query if you have any questions regarding SCD, and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront and many more.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat’s offerings.
FAQs
1. What is the Difference between SCD Type 2 and Type 3?
ANS:
- Type 2: Creates a new record for every change, preserving full history with Start_Date, End_Date, and Current_Flag.
- Type 3: Stores only current and previous values, tracking limited history for recent changes.
2. When to use Type 2 vs. Type 3?
ANS:
- Use Type 2 to track full history (e.g., address changes).
- Use Type 3 for limited changes (e.g., last two addresses).
WRITTEN BY Aiswarya Sahoo