Case Study

Building a Data Pipeline for Data Migration and Processing

Industry 

Software Development

Expertise 

Amazon S3, AWS Glue, AWS Lambda, Amazon MSK

Offerings/Solutions

Streamlined data processing with AWS Glue, AWS Lambda, and Amazon Athena, deployed in Oregon and replicated in Singapore.

About the Client

CustomFit.ai is an AI-powered, intelligent, and precise personalization platform for B2B websites. Established in 2019, it offers a no-code website personalization solution. The platform uses artificial intelligence to identify and understand individual visitors, enabling it to dynamically modify website content based on their preferences.

Highlights

50%

Reduction in manual intervention

40%

Reduction in Amazon Athena query cost

Efficient Data Retrieval

Streamlined access to large datasets, saving time and cost

The Challenge

The client was facing data management issues: applying complex filters that required joining multiple tables had become difficult. To resolve this, we built a robust data pipeline that automates the extract, transform, and load (ETL) process. The pipeline converts JSON data into a structured format suitable for querying and analysis, handles increasing volumes of data, and accommodates future growth.

Solutions

• The solution is deployed in the Oregon region, with Amazon S3 buckets replicated in Singapore.
• Data is sourced from Amazon MSK via an Amazon S3 sink connector and stored in a staging bucket as a single file per day-wise partition.
• An AWS Glue crawler runs on this bucket and creates a raw table in the Glue database.
• Five AWS Glue jobs run daily on top of the raw table, extracting records based on keys in the payload and storing the output in five separate partitions of a processed-data bucket (a sketch of one such job follows this list).
• Within each of the five main partitions, sub-partitions are created on a day-wise basis.
• Crawlers run every day on the main partitions to populate the AWS Glue Data Catalog; these crawlers are created dynamically by an AWS Lambda function.
• S3 event notifications on the processed-data bucket trigger the AWS Lambda function, which creates crawlers when required (a Lambda sketch follows below).
• Amazon Athena builds datasets from the tables registered in the Glue Data Catalog for each main partition key, i.e. n tables for n partitions (an example query follows below).
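
For illustration, the following is a minimal sketch of what one of the five daily AWS Glue jobs might look like. The database, table, event key, and bucket names ("analytics_db", "raw_events", "page_view", "processed-data-bucket") are hypothetical placeholders, not details taken from the case study.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Job arguments: JOB_NAME is standard; run_date is a hypothetical parameter
# passed by the daily schedule.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "run_date"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table that the staging-bucket crawler registered in the Glue Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="raw_events"
).toDF()

# Extract the partition key from the JSON payload and keep only one event type
# per job (the case study runs five such jobs, one per main partition).
events = (
    raw.withColumn("event_type", F.get_json_object(F.col("payload"), "$.type"))
       .filter(F.col("event_type") == "page_view")          # hypothetical key
       .withColumn("event_date", F.lit(args["run_date"]))   # day-wise sub-partition
)

# Write the result to the processed-data bucket, partitioned by day.
(events.write.mode("append")
       .partitionBy("event_date")
       .parquet("s3://processed-data-bucket/page_view/"))   # hypothetical bucket

job.commit()
```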
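
Likewise, a minimal sketch of the Lambda function that creates crawlers on demand, driven by S3 event notifications from the processed-data bucket. The crawler naming scheme, database, IAM role, and schedule are assumptions for illustration only.

```python
import boto3

glue = boto3.client("glue")

CRAWLER_ROLE = "arn:aws:iam::123456789012:role/GlueCrawlerRole"  # hypothetical role
DATABASE = "analytics_db"                                        # hypothetical database


def handler(event, context):
    # Each S3 event record points at a new object in the processed-data bucket.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        partition = key.split("/")[0]          # main partition prefix, e.g. "page_view"
        crawler_name = f"processed-{partition}"

        try:
            glue.get_crawler(Name=crawler_name)      # crawler already exists, nothing to do
        except glue.exceptions.EntityNotFoundException:
            glue.create_crawler(
                Name=crawler_name,
                Role=CRAWLER_ROLE,
                DatabaseName=DATABASE,
                Targets={"S3Targets": [{"Path": f"s3://{bucket}/{partition}/"}]},
                Schedule="cron(0 1 * * ? *)",        # daily run, as described in the solution
            )
    return {"status": "ok"}
```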
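
Finally, a hedged example of querying one of the per-partition Athena tables; the table, database, and results location are placeholders. Filtering on the day-wise partition column is what limits the data Athena scans and therefore the query cost.

```python
import boto3

athena = boto3.client("athena")

# Restricting the query to one day-wise partition prunes the scan to that
# partition's files in the processed-data bucket.
response = athena.start_query_execution(
    QueryString=(
        "SELECT * FROM page_view "          # hypothetical per-partition table
        "WHERE event_date = '2024-01-15'"   # day-wise partition column
    ),
    QueryExecutionContext={"Database": "analytics_db"},            # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},  # hypothetical bucket
)
print(response["QueryExecutionId"])
```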

The Results

Automated AWS Glue jobs streamline data processing, partitioning cuts the data scanned by Amazon Athena by 40%, and the migration pipeline is fully automated.


