Cloud Computing, Data Analytics

4 Mins Read

Implementing SCD Type 2 and Type 3 Using PySpark

Voiced by Amazon Polly

Overview

In the world of data warehousing, one of the fundamental challenges is handling changes in dimension data over time. The concept of Slowly Changing Dimensions (SCD) addresses this challenge, allowing businesses to track and store historical data as it evolves. SCD refers to how data in dimensional tables changes slowly and gradually rather than frequently or unpredictably. There are different strategies (or types) for managing these changes, and in this blog, we will focus on SCD Type 2 and SCD Type 3, explaining their concepts and showing how they can be implemented using PySpark.

Pioneers in Cloud Consulting & Migration Services

  • Reduced infrastructural costs
  • Accelerated application deployment
Get Started

Slowly Changing Dimension (SCD)

A dimension in a data warehouse refers to descriptive information (like a customer’s name, address, product category, etc.) that categorizes facts and measures to enable meaningful analysis. Slowly Changing Dimensions are dimensions that change over time but much slower than transactional data.

For example:

  • A customer’s address or marital status might change, but these changes are infrequent.
  • A product’s category could change if the business redefines its taxonomy.

Types of Slowly Changing Dimensions

Multiple types of SCDs define how to store historical changes in dimension data:

table1

SCD Type 2: Full Historical Tracking

SCD Type 2 is used when we want to preserve the entire history of changes in dimension attributes. This means that every time a change occurs in a dimension record, a new record is inserted into the dimension table, and the old record is maintained with an indicator of whether it is the current record. This allows businesses to see exactly how a dimension has evolved.

For example, consider a customer dimension table:

table2

  • The first record represents the customer’s address before the change.
  • The second record captures the updated address and has a Current_Flag of ‘Y’, indicating that this is the most recent address.

Implementation of SCD Type 2 Using PySpark

In PySpark, implementing SCD Type 2 involves merging incoming data with the existing dimension data and creating new records for updates. Below is a simplified implementation:

Pyspark code:

In this code:

  1. We compare the existing data (existing_df) with the incoming new data (new_df).
  2. For any changed records, we “close” the old records by updating the End_Date and setting the Current_Flag to ‘N’.
  3. New records are added with the updated information and a Current_Flag of ‘Y’.

Benefits:

  • Complete historical tracking.
  • Allows analysis of data as it existed at any point in time.
  • It is useful when business processes need to be tracked over time, such as customer activity, product changes, etc.

SCD Type 3: Storing Previous and Current Values

SCD Type 3 tracks limited history by storing only the current and previous values. It’s suitable for infrequent changes where full historical tracking isn’t needed.

Example:
If a customer’s address changes, the table keeps both current and previous addresses:

table3

Implementation using PySpark:

Benefits:

  • Tracks current and previous states.
  • Saves storage compared to Type 2.
  • Suitable for scenarios where limited history is sufficient.

Conclusion

Slowly Changing Dimensions, particularly SCD Type 2 and Type 3, are pivotal in ensuring the relevance and accuracy of the data used for analysis in a business intelligence environment. Understanding these strategies allows organizations to choose the right approach based on their needs and requirements. Incorporating PySpark into your data processing framework enables efficient handling of large datasets and simplifies the implementation of SCDs, ultimately leading to more insightful analysis.

So, whether you are developing a system for customer analytics or managing any business process with changing attributes, the strategic implementation of SCDs can provide deeper insights and answer questions about your operations over time.

Drop a query if you have any questions regarding SCD and we will get back to you quickly.

Empowering organizations to become ‘data driven’ enterprises with our Cloud experts.

  • Reduced infrastructure costs
  • Timely data-driven decisions
Get Started

About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training PartnerAWS Migration PartnerAWS Data and Analytics PartnerAWS DevOps Competency PartnerAWS GenAI Competency PartnerAmazon QuickSight Service Delivery PartnerAmazon EKS Service Delivery Partner AWS Microsoft Workload PartnersAmazon EC2 Service Delivery PartnerAmazon ECS Service Delivery PartnerAWS Glue Service Delivery PartnerAmazon Redshift Service Delivery PartnerAWS Control Tower Service Delivery PartnerAWS WAF Service Delivery PartnerAmazon CloudFront and many more.

To get started, go through our Consultancy page and Managed Services PackageCloudThat’s offerings.

FAQs

1. What is the Difference between SCD Type 2 and Type 3?

ANS: –

  • Type 2: Creates a new record for every change, preserving full history with Start_Date, End_Date, and Current_Flag.
  • Type 3: Stores only current and previous values, tracking limited history for recent changes.

2. When to use Type 2 vs. Type 3?

ANS: –

  • Use Type 2 to track full history (e.g., address changes).
  • Use Type 3 for limited changes (e.g., last two addresses).

WRITTEN BY Aiswarya Sahoo

Share

Comments

    Click to Comment

Get The Most Out Of Us

Our support doesn't end here. We have monthly newsletters, study guides, practice questions, and more to assist you in upgrading your cloud career. Subscribe to get them all!