Automating AWS Glue Table Version Cleanup with AWS Lambda

Introduction

AWS Glue is a powerful data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

One of its key components is the AWS Glue Data Catalog, which stores metadata about the data in your ecosystem. However, as your AWS Glue catalog grows, maintaining the versions of tables can become challenging and lead to increased costs and clutter. Automating the cleanup of AWS Glue table versions using AWS Lambda can save time, reduce costs, and streamline operations.

In this blog, we will explore how to automate the cleanup of AWS Glue table versions using AWS Lambda. We will cover the prerequisites, a step-by-step implementation guide, and best practices to ensure an efficient and reliable solution.

Why Clean Up AWS Glue Table Versions?

AWS Glue automatically tracks table versions whenever a table schema is updated. While this feature ensures versioning and schema evolution, it can lead to excessive table versions over time. Here are some reasons to clean up old table versions:

  • Cost Management: Table versions count toward the objects stored in the AWS Glue Data Catalog, which AWS bills for beyond the free tier. Cleaning up unnecessary versions helps keep those charges down.
  • Enhanced Performance: Fewer table versions reduce clutter in the Data Catalog and keep operations that list or page through versions fast.
  • Compliance: Keeping a bounded number of versions helps meet data governance policies that limit how much metadata history is retained.

Prerequisites

Before proceeding, ensure that you have:

  1. AWS Account: Access to an AWS account with appropriate permissions.
  2. AWS IAM Role: An AWS IAM role with the necessary permissions to access AWS Glue and Amazon CloudWatch Logs.
  3. AWS Glue Table: One or more existing AWS Glue tables that have multiple versions.
  4. AWS CLI or Console Access: Familiarity with the AWS Management Console or CLI to set up the AWS Lambda function.

Step-by-Step Implementation

Step 1: Define Cleanup Requirements

Determine the criteria for cleanup. For example, you might want to retain the last 5 versions of each table and delete older versions.
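The retention rule above can be sketched as a small pure function. This is a minimal illustration, not the original script: the function name `versions_to_delete` is hypothetical, and it assumes AWS Glue version IDs are numeric strings, with a higher number meaning a newer version.

```python
def versions_to_delete(version_ids, retain_count=5):
    """Return the version IDs that fall outside the newest `retain_count`."""
    if retain_count < 1:
        raise ValueError("retain_count must be at least 1")
    # Assumption: version IDs are numeric strings, so sort them as integers,
    # newest first, and drop everything after the first `retain_count`.
    ordered = sorted(version_ids, key=int, reverse=True)
    return ordered[retain_count:]


# Eight versions, keep the newest five: the three oldest are selected.
print(versions_to_delete(["0", "1", "2", "3", "4", "5", "6", "7"]))
```

Keeping the selection logic separate from any AWS calls makes it easy to check the rule against a list of IDs before anything is deleted.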

Step 2: Create an AWS IAM Role for AWS Lambda

  1. Navigate to the AWS IAM Console.
  2. Create a new role and select AWS Lambda as the trusted entity.
  3. Attach the following policies:
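    1. AWSGlueServiceRole: Provides permissions to access AWS Glue. This managed policy is broader than the cleanup needs; the function only has to list and delete table versions (for example, glue:GetTableVersions and glue:BatchDeleteTableVersion).
    2. CloudWatchLogsFullAccess: Enables logging for the Lambda function. The narrower AWSLambdaBasicExecutionRole managed policy is also sufficient for writing logs.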
  4. Save the role for use in the AWS Lambda function.
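If you prefer a custom policy over managed ones, the documents below are one possible starting point. They are a sketch under assumptions: the helper name `build_cleanup_policy` is hypothetical, the `"*"` resources are placeholders you would narrow to your own catalog, database, and table ARNs, and the action list covers only what a list-and-delete cleanup is expected to call.

```python
import json

# Trust policy: lets the AWS Lambda service assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def build_cleanup_policy(glue_resources=("*",)):
    """Build a permissions policy limited to version cleanup and logging."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageTableVersions",
                "Effect": "Allow",
                "Action": [
                    "glue:GetTableVersions",
                    "glue:BatchDeleteTableVersion",
                ],
                # Placeholder: replace "*" with specific Glue ARNs.
                "Resource": list(glue_resources),
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


print(json.dumps(build_cleanup_policy(), indent=2))
```

The printed JSON can be pasted into the policy editor in the AWS IAM Console when creating the role.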

Step 3: Develop the AWS Lambda Function

The AWS Lambda function is a Python script that uses the boto3 library to interact with AWS Glue and perform the cleanup.
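The original listing is not reproduced here, so the following is a minimal sketch of what such a function could look like rather than the author's exact code. The environment variable names (`DATABASE_NAME`, `TABLE_NAMES`, `RETAIN_COUNT`) and the helper name `cleanup_table_versions` are assumptions. It relies on two AWS Glue API calls available in boto3: the `get_table_versions` paginator and `batch_delete_table_version`, which accepts up to 100 version IDs per call.

```python
import os

# Hypothetical environment variable; defaults to keeping five versions.
RETAIN_COUNT = int(os.environ.get("RETAIN_COUNT", "5"))
BATCH_SIZE = 100  # BatchDeleteTableVersion accepts at most 100 IDs per call


def cleanup_table_versions(glue, database, table, retain_count=RETAIN_COUNT):
    """Delete all but the newest `retain_count` versions; return deleted IDs."""
    if retain_count < 1:
        raise ValueError("retain_count must be at least 1")

    # Collect every version ID, following pagination.
    paginator = glue.get_paginator("get_table_versions")
    version_ids = []
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        version_ids.extend(v["VersionId"] for v in page["TableVersions"])

    # Assumption: IDs are numeric strings and a higher number is newer.
    stale = sorted(version_ids, key=int, reverse=True)[retain_count:]

    deleted = []
    for start in range(0, len(stale), BATCH_SIZE):
        batch = stale[start:start + BATCH_SIZE]
        response = glue.batch_delete_table_version(
            DatabaseName=database, TableName=table, VersionIds=batch
        )
        # Versions reported under "Errors" were not deleted.
        failed = {err["VersionId"] for err in response.get("Errors", [])}
        deleted.extend(v for v in batch if v not in failed)
    return deleted


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised with a stub client.
    import boto3

    glue = boto3.client("glue")
    database = os.environ["DATABASE_NAME"]
    tables = [t.strip() for t in os.environ["TABLE_NAMES"].split(",") if t.strip()]
    result = {t: cleanup_table_versions(glue, database, t) for t in tables}
    print(result)  # Appears in Amazon CloudWatch Logs
    return {"deleted": result}
```

Passing the Glue client into `cleanup_table_versions` is a design choice that keeps the deletion logic testable without AWS credentials.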

Step 4: Deploy the AWS Lambda Function

  1. Open the AWS Lambda Console.
  2. Create a new function with the following details:
    1. Runtime: Python 3.x
    2. Role: Select the AWS IAM role created earlier.
  3. Upload the script or paste the code into the editor.
  4. Configure environment variables (e.g., database name, table name) if needed.
  5. Save and deploy the function.
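For the environment variables mentioned in the steps above, one way to read and validate them is sketched below. The variable names (`DATABASE_NAME`, `TABLE_NAMES`, `RETAIN_COUNT`) and the `CleanupSettings` class are hypothetical; use whatever names you configure on the function.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanupSettings:
    database: str
    tables: tuple
    retain_count: int


def load_settings(env=os.environ):
    """Read cleanup settings from a mapping such as os.environ."""
    # TABLE_NAMES is assumed to be a comma-separated list of table names.
    tables = tuple(t.strip() for t in env.get("TABLE_NAMES", "").split(",") if t.strip())
    if not tables:
        raise ValueError("TABLE_NAMES must list at least one table")
    retain_count = int(env.get("RETAIN_COUNT", "5"))
    if retain_count < 1:
        raise ValueError("RETAIN_COUNT must be at least 1")
    return CleanupSettings(
        database=env["DATABASE_NAME"],  # Raises KeyError if not configured
        tables=tables,
        retain_count=retain_count,
    )


print(load_settings({"DATABASE_NAME": "sales", "TABLE_NAMES": "orders, clicks"}))
```

Failing fast on a missing or invalid value means a misconfigured function stops before it deletes anything.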

Step 5: Schedule the Cleanup Using Amazon EventBridge

To automate the cleanup process, use Amazon EventBridge to trigger the AWS Lambda function on a schedule:

  1. Open the Amazon EventBridge Console.
  2. Create a new rule with the following details:
    1. Rule type: Schedule
    2. Schedule expression: Define how often the function should run (e.g., rate(1 day) for daily or rate(7 days) for weekly).
  3. Set the target to the AWS Lambda function created earlier.
  4. Save the rule.
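The same console steps can also be scripted with boto3. The sketch below is an illustration under assumptions: the rule name, target ID, and function name `schedule_cleanup` are hypothetical, and the function ARN must come from your own deployment. It uses the EventBridge `put_rule` and `put_targets` calls plus the Lambda `add_permission` call, which allows EventBridge to invoke the function.

```python
def schedule_cleanup(events, lambda_client, function_arn,
                     rule_name="glue-table-version-cleanup",
                     schedule="rate(1 day)"):
    """Create a scheduled rule that triggers the cleanup function."""
    # Create (or update) the scheduled rule.
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function for this rule only.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Point the rule at the function.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "cleanup-lambda", "Arn": function_arn}],
    )
    return rule_arn


# Usage (requires AWS credentials and a deployed function):
#   import boto3
#   schedule_cleanup(boto3.client("events"), boto3.client("lambda"), "<function ARN>")
```

Note that `add_permission` fails if a statement with the same ID already exists, so rerunning this sketch unchanged against the same function would need a different `StatementId` or removal of the earlier one.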

Best Practices

  1. Test Before Deployment: Use test tables to validate the AWS Lambda function’s behavior before running it in production.
  2. Use Amazon CloudWatch Logs: Monitor the function’s logs in Amazon CloudWatch to troubleshoot issues and ensure successful execution.
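  3. Backup Important Versions: Retain backups of critical table versions to avoid accidental loss of schema history. Deleting a table version removes only catalog metadata, not the underlying data, but a deleted version cannot be restored from the Data Catalog.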
  4. Fine-Tune Retention Count: Choose an optimal number of versions to retain based on your use case.
  5. Security: Ensure the AWS IAM role has only the necessary permissions to minimize security risks.
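As one way to act on the backup recommendation, the sketch below writes the version metadata returned by AWS Glue to Amazon S3 as JSON before any deletion. The function name, bucket, and key layout are assumptions for illustration; the only AWS call used is the S3 `put_object` operation.

```python
import json
from datetime import datetime, timezone


def backup_table_versions(s3, bucket, database, table, table_versions):
    """Store table version metadata as a JSON object in S3; return its key."""
    # default=str converts datetime fields in the Glue response to strings.
    body = json.dumps(table_versions, default=str, indent=2)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical key layout; adjust to your own conventions.
    key = f"glue-version-backups/{database}/{table}/{stamp}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key


# Usage (requires AWS credentials):
#   import boto3
#   versions = boto3.client("glue").get_table_versions(
#       DatabaseName="sales", TableName="orders")["TableVersions"]
#   backup_table_versions(boto3.client("s3"), "my-backup-bucket",
#                         "sales", "orders", versions)
```

If the function writes backups this way, its AWS IAM role also needs permission to put objects into the chosen bucket.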

Conclusion

Automating the cleanup of AWS Glue table versions with AWS Lambda helps maintain an optimized and cost-effective AWS Glue Data Catalog. By implementing the steps outlined in this blog, you can streamline table version management, reduce costs, and keep the catalog uncluttered.

Drop a query if you have any questions regarding the AWS Glue Data Catalog and we will get back to you quickly.

FAQs

1. What should I do if the AWS Glue table has too many versions, causing API throttling?

ANS: – Handle API rate limits by introducing pauses between calls and by breaking the cleanup into batches, so that a single run does not issue a burst of delete requests.
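A pause-and-retry wrapper is one way to apply this. The sketch below is an assumption-laden illustration: the helper name `call_with_backoff` is hypothetical, and the set of throttling error codes is a guess at what the service returns, so check the codes you actually see in your logs. It reads the error code from the `response` attribute that boto3 client errors carry.

```python
import time

# Assumed throttling error codes; verify against your own error logs.
THROTTLE_CODES = {"ThrottlingException", "Throttling", "TooManyRequestsException"}


def call_with_backoff(func, *, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func(**kwargs), retrying throttling errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(**kwargs)
        except Exception as exc:
            # boto3 client errors expose the service error code here.
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code not in THROTTLE_CODES or attempt == max_attempts:
                raise
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a Glue client (requires AWS credentials):
#   call_with_backoff(glue.batch_delete_table_version,
#                     DatabaseName="sales", TableName="orders",
#                     VersionIds=["1", "2", "3"])
```

boto3 also has built-in retry settings that can be passed through a botocore `Config` object when creating the client, which may be enough on their own for moderate volumes.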

2. How do I configure different retention counts for different tables?

ANS: – You can define a configuration file (e.g., JSON or YAML) or use environment variables that specify the retention count for each table. Modify the script to read these configurations dynamically.
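A JSON configuration of this kind could be read as sketched below. The layout (a `default` count plus a `tables` mapping keyed by `database.table`) and the function names are assumptions chosen for illustration; the JSON string could come from an environment variable or a file.

```python
import json


def load_retention_config(raw_json):
    """Parse a JSON string into (default_count, per_table_counts)."""
    config = json.loads(raw_json)
    default = int(config.get("default", 5))
    per_table = {name: int(count) for name, count in config.get("tables", {}).items()}
    return default, per_table


def retention_for(database, table, default, per_table):
    """Return the retention count for a table, falling back to the default."""
    return per_table.get(f"{database}.{table}", default)


example = '{"default": 5, "tables": {"sales.orders": 20, "sales.clicks": 2}}'
default, per_table = load_retention_config(example)
print(retention_for("sales", "orders", default, per_table))   # 20
print(retention_for("sales", "returns", default, per_table))  # 5
```

Tables without an entry fall back to the default, so only the exceptions need to be listed.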

WRITTEN BY Aehteshaam Shaikh

Aehteshaam Shaikh is working as a Research Associate - Data & AI/ML at CloudThat. He is passionate about Analytics, Machine Learning, Deep Learning, and Cloud Computing and is eager to learn new technologies.
