Automating Data Analytics and Quality Monitoring with AWS Glue Anomaly Detection

Overview

AWS Glue Data Quality Anomaly Detection is an advanced feature designed to enhance the accuracy and reliability of data quality assessments in AWS Glue, a fully managed ETL service. This feature leverages machine learning to detect unusual patterns and anomalies in your data, automatically adjusting data quality thresholds in response to evolving data trends. Traditional data quality checks often rely on fixed rules that quickly become outdated as data changes over time. AWS Glue’s anomaly detection addresses this challenge by dynamically updating these rules, ensuring that your data remains consistently high-quality, even in complex, rapidly changing environments. This makes it an invaluable tool for organizations looking to maintain the integrity of their data and drive more informed decision-making processes.

Introduction

The recent addition of anomaly detection capabilities in AWS Glue’s Data Quality suite addresses a major challenge in modern data processing: managing evolving data quality rules. By incorporating machine learning, this feature automates the adjustment of thresholds based on changing data patterns, reducing the reliance on static rules that quickly become outdated. Below is a detailed guide on implementing and leveraging this feature in AWS Glue to monitor data quality in dynamic environments.

Implementation Steps

This implementation uses a scenario involving NYC taxi ride data to demonstrate how anomaly detection can be applied. The goal is to monitor ride counts and fare amounts while accounting for daily fluctuations and seasonal trends.

  1. Set Up Resources Using AWS CloudFormation:

Start by deploying the AWS CloudFormation template that provisions the necessary resources. This setup includes:

  • Amazon S3 bucket: For storing input and output data.
  • AWS IAM roles and policies: Ensuring the necessary permissions for the Glue job.
  • AWS Glue Database: To store metadata and configurations.
  • AWS Glue ETL Job: Preconfigured to process NYC taxi ride data and analyze data quality.

The template also includes a data generator script that simulates taxi ride data for 7 days, incorporating synthetic anomalies for testing purposes.

  2. Generate Sample Data:

The data generator job, which runs automatically after the AWS CloudFormation stack is created, populates the dataset with hourly taxi ride records for the first week of May 2024. Here’s how it works:

  • Normal Data: The data remains consistent for the first five days (May 1-5), reflecting typical ride patterns.
  • Anomalous Data: On the sixth day (May 6), the data includes intentional anomalies such as an unexpectedly high fare amount and a spike in ride count. This day is critical for testing the anomaly detection capabilities.

The generated dataset is stored in the Amazon S3 bucket, and the AWS Glue table is automatically created for easy querying and analysis.
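The generator script itself is not shown in this post, so the following is only a minimal sketch of the idea, using the Python standard library. The field names (`pickup_hour`, `ride_count`, `avg_fare`), the value ranges, and the size of the injected spike are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def generate_rides(seed=42):
    """Return hourly summary records for May 1-7, 2024 (hypothetical schema)."""
    rng = random.Random(seed)
    start = datetime(2024, 5, 1)
    anomalous_day = datetime(2024, 5, 6).date()
    records = []
    for hour_index in range(7 * 24):
        ts = start + timedelta(hours=hour_index)
        ride_count = rng.randint(20, 60)    # assumed normal hourly volume
        avg_fare = rng.uniform(10.0, 40.0)  # assumed normal average fare in USD
        if ts.date() == anomalous_day:
            ride_count *= 5                 # injected spike in ride count
            avg_fare *= 10                  # injected unusually high fare
        records.append({
            "pickup_hour": ts.isoformat(),
            "ride_count": ride_count,
            "avg_fare": round(avg_fare, 2),
        })
    return records


records = generate_rides()
print(len(records))  # 168 hourly records (7 days x 24 hours)
```

A fixed seed keeps the sample reproducible, which makes it easier to compare job runs while experimenting.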

  3. Create an AWS Glue Visual ETL Job:

With the sample data ready, create a visual ETL job in AWS Glue Studio. This job performs data transformations and includes quality checks using Glue Data Quality. The key steps include:

  • Define Metrics and Rules: Identify the metrics you want to monitor, such as the range of fare amounts and ride counts. Initially, broad thresholds should be set to accommodate normal variability in the data.
  • Set Data Quality Rules: These rules help evaluate whether data meets quality standards. For instance, the fare amount rule could initially allow for wide variations (e.g., between $5 and $300) but will be refined based on insights gathered during training.

The AWS Glue job lets you create data quality rules and analyzers visually, which makes it easier to set up the required configuration without writing extensive code.
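Behind the visual editor, Glue Data Quality rules are written in DQDL (Data Quality Definition Language). As a rough sketch, the small Python helper below renders a starting ruleset with the broad thresholds described above. The column name `fare_amount`, the `RowCount > 100` starting rule, and the choice of analyzers are assumptions for illustration, so check the DQDL reference before using them as-is.

```python
def block(name, items):
    """Render one DQDL section, e.g. Rules = [ ... ]."""
    body = ",\n".join(f"    {item}" for item in items)
    return f"{name} = [\n{body}\n]"


def build_ruleset(min_rows, fare_low, fare_high):
    """Render a DQDL ruleset string with broad starting thresholds."""
    rules = [
        f"RowCount > {min_rows}",
        # "fare_amount" is an assumed column name
        f'ColumnValues "fare_amount" between {fare_low} and {fare_high}',
    ]
    analyzers = [
        "RowCount",            # gather ride-count statistics on every run
        'Mean "fare_amount"',  # gather average-fare statistics on every run
    ]
    return "\n".join([block("Rules", rules), block("Analyzers", analyzers)])


print(build_ruleset(min_rows=100, fare_low=5, fare_high=300))
```

The rendered text can be pasted into the ruleset editor of the data quality node. Analyzers only collect statistics, while rules produce a pass or fail result.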

  4. Run the Job Over Multiple Days to Train the Model:

To enable effective anomaly detection, run the job multiple times, with each run processing a different day’s data. Start by running the job for the first five days:

  • Days 1-5: As the model processes each day’s data, it establishes baseline trends. The model begins learning the typical ranges for ride counts and fare amounts, adjusting thresholds accordingly.

Running the job multiple times is crucial because it allows the model to recognize seasonal patterns, such as higher ride counts during rush hours or on weekends. The more data the model processes, the better it detects deviations from the norm.
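If you prefer to script the daily runs rather than start them from the console, a sketch along these lines could work. It relies on the boto3 Glue client calls `start_job_run` and `get_job_run`. The job name and the `--run_date` job argument are hypothetical and depend on how your job reads its input.

```python
import time

IN_PROGRESS = ("STARTING", "RUNNING", "STOPPING", "WAITING")


def run_job_for_days(glue_client, job_name, days, poll_seconds=30):
    """Run the Glue job once per day, waiting for each run to finish."""
    final_states = {}
    for day in days:
        run_id = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--run_date": day},  # hypothetical job parameter
        )["JobRunId"]
        while True:
            run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
            state = run["JobRun"]["JobRunState"]
            if state not in IN_PROGRESS:
                break
            time.sleep(poll_seconds)
        final_states[day] = state
        if state != "SUCCEEDED":
            break  # stop so a failed day does not leave a gap in the history
    return final_states


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   days = [f"2024-05-0{d}" for d in range(1, 6)]
#   run_job_for_days(boto3.client("glue"), "my-taxi-dq-job", days)
```

Running the days sequentially keeps the statistics in date order, which matches how the walkthrough trains the model.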

  5. Analyze Results and Detect Anomalies:

After the model has been trained with five days of data, run the job for the sixth day, which includes anomalies. Here’s what you’ll observe:

  • Significant Deviations Detected: The model identifies the spikes in ride count and fare amount as outliers, marking them as potential anomalies.
  • Updated Rule Recommendations: AWS Glue Data Quality suggests updated rules based on the patterns it has learned, providing more accurate ranges for the monitored metrics.

For example, if the original rule allowed fare amounts between $5 and $300, the model might recommend tightening this range after observing consistent patterns.
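To build intuition for how a learned range can flag the sixth day, here is a deliberately simple stand-in: a band of the mean plus or minus three standard deviations over the first five days. This is not the algorithm AWS Glue uses, and the ride counts are made-up numbers.

```python
from statistics import mean, stdev


def learn_bounds(history, k=3.0):
    """Return (low, high) as mean +/- k standard deviations of past values."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomaly(value, bounds):
    """True when the value falls outside the learned band."""
    low, high = bounds
    return value < low or value > high


# Illustrative daily ride counts for May 1-5 (made-up numbers).
daily_ride_counts = [980, 1010, 1045, 990, 1025]
bounds = learn_bounds(daily_ride_counts)

print(is_anomaly(1000, bounds))  # False: within the learned band
print(is_anomaly(4800, bounds))  # True: far above the learned band
```

The same idea applies to fare amounts: the band narrows as the history becomes more consistent, which is why the recommended range ends up tighter than the initial broad rule.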

  6. Update Data Quality Rules:

The next step involves applying the model’s recommendations. In Glue Studio, update your data quality rules based on the following insights:

  • Adjust Row Count Rule: The rule for ride count could be refined to allow only counts between 275 and 1966 based on the typical daily variations observed in the data.
  • Refine Fare Amount Thresholds: The fare amount range might be adjusted from $5-$300 to a narrower range based on actual data trends.

These updates help prevent false positives while maintaining effective anomaly detection for real issues.
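The effect of tightening a rule can be sketched in a few lines of Python. The refined range of 275 to 1966 comes from the step above. The broad starting rule (`RowCount > 100`) and the sample counts are assumptions, and the bounds are treated as exclusive here, so confirm the exact `between` semantics in the DQDL reference.

```python
def passes_broad_rule(row_count):
    """Assumed starting rule, in DQDL terms: RowCount > 100."""
    return row_count > 100


def passes_refined_rule(row_count, low=275, high=1966):
    """Refined rule, in DQDL terms: RowCount between 275 and 1966."""
    return low < row_count < high


for count in (150, 900, 2500):
    print(count, passes_broad_rule(count), passes_refined_rule(count))
# 150 True False
# 900 True True
# 2500 True False
```

Both the unusually low and the unusually high count pass the broad rule but fail the refined one, which is the behavior the updated thresholds are meant to provide.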

  7. Exclude Known Anomalies from Training Data:

To avoid the model learning incorrect patterns from the sixth day’s anomalous data, exclude this day’s data from future training runs. AWS Glue offers straightforward options to exclude specific datasets or runs. This ensures that only accurate, representative data is used to refine the model.
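Why this matters can be shown with a small conceptual sketch. It reuses a simple mean plus or minus three standard deviations band as a stand-in for the model (not the algorithm AWS Glue uses) and made-up ride counts, and compares the learned upper bound with and without the anomalous day. In AWS Glue itself, the exclusion is done through the service, not with code like this.

```python
from statistics import mean, stdev


def bounds_from(history, excluded_days=(), k=3.0):
    """Compute mean +/- k*stdev over the history, skipping excluded days."""
    values = [v for day, v in history.items() if day not in excluded_days]
    center, spread = mean(values), stdev(values)
    return center - k * spread, center + k * spread


# Made-up daily ride counts; May 6 is the known anomaly.
history = {
    "2024-05-01": 980, "2024-05-02": 1010, "2024-05-03": 1045,
    "2024-05-04": 990, "2024-05-05": 1025,
    "2024-05-06": 4800,
}

with_anomaly = bounds_from(history)
without_anomaly = bounds_from(history, excluded_days={"2024-05-06"})

print(round(with_anomaly[1]))     # upper bound inflated by the spike
print(round(without_anomaly[1]))  # upper bound learned from normal days only
```

With the anomalous day left in, the upper bound grows so much that a repeat of the same spike would no longer be flagged. Excluding it keeps the band close to normal behavior.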

  8. Enable Anomaly Detection as a Rule:

Once the model has matured and proven reliable, you can configure anomaly detection as a data quality rule. This rule automatically flags out-of-bounds values based on learned patterns, simplifying data monitoring. For instance, if a sudden increase in ride count occurs outside normal variations, the model alerts the data engineer, enabling proactive issue resolution.
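In DQDL, this is expressed with the `DetectAnomalies` rule type. The helper below renders a ruleset of that shape as a sketch. The column name `fare_amount` and the choice of `Mean` as the monitored statistic are assumptions, so verify the exact syntax against the DQDL reference for your Glue version.

```python
def anomaly_detection_ruleset(fare_column="fare_amount"):
    """Render a DQDL ruleset that applies anomaly detection as rules."""
    rules = [
        'DetectAnomalies "RowCount"',               # ride count per run
        f'DetectAnomalies "Mean" "{fare_column}"',  # assumed fare statistic
    ]
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


print(anomaly_detection_ruleset())
```

Unlike the fixed ranges used earlier, these rules carry no hard-coded thresholds. The expected range comes from the statistics gathered in earlier runs.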

Conclusion

AWS Glue’s anomaly detection feature offers a dynamic approach to managing data quality, especially in environments where data fluctuates due to growth, seasonality, or unexpected events.

Automating the adjustment of thresholds and rules through machine learning reduces the need for constant manual oversight. This feature allows data engineers to detect and address data quality issues as each job run completes, helping protect the integrity of business-critical data.

The implementation above demonstrates how to use AWS Glue’s anomaly detection capabilities effectively, from setting up resources to refining rules based on model recommendations. As data ecosystems grow increasingly complex, this feature helps organizations maintain high-quality data pipelines and supports accurate, reliable analytics for decision-making.

Drop a query if you have any questions regarding AWS Glue Data Quality Anomaly Detection and we will get back to you quickly.

FAQs

1. Why are traditional fixed rules for data quality not sufficient?

ANS: – Traditional fixed rules are static and may quickly become outdated as business conditions evolve. For instance, changes in business scale, seasonal fluctuations, or unexpected events can lead to data patterns that fixed rules cannot accommodate, resulting in missed anomalies or false positives.

2. Can I exclude known anomalies from training the anomaly detection model?

ANS: – Yes, AWS Glue allows you to exclude specific runs or data points from being used in the model’s training. This is particularly useful to prevent the model from considering known anomalies as part of the normal data trend, ensuring more accurate future predictions.

WRITTEN BY Rachana Kampli
