Overview
Efficiently managing database growth is crucial for optimizing costs and maintaining performance. As data accumulates in Amazon RDS, archiving old or less frequently accessed data to Amazon S3 offers a cost-effective solution. Organizations can build a scalable and seamless archival pipeline by leveraging AWS Lambda for automation.
This blog provides a detailed guide to automate the data archival process, covering architecture, implementation, and best practices.
Why Automate Data Archival?
- Cost Efficiency – Amazon S3 offers lower storage costs compared to Amazon RDS. Archiving data reduces operational expenses.
- Enhanced Database Performance – Removing unused data minimizes query times, backup durations, and resource consumption.
- Regulatory Compliance – Facilitates adherence to data retention policies by securely storing archival data.
- Scalability – Automation handles large datasets without manual intervention, reducing operational overhead.
Architecture Overview
The proposed solution involves the following AWS services:
- Amazon RDS: The source database containing data to be archived.
- Amazon S3: Destination for archived data.
- AWS Lambda: Executes the data extraction and transfer logic.
- Amazon CloudWatch Events: Schedules the archival process.
- AWS Identity and Access Management (IAM): Secures access for AWS Lambda to interact with Amazon RDS and Amazon S3.
Workflow
- Amazon CloudWatch Events triggers the AWS Lambda function on a schedule.
- AWS Lambda connects to the Amazon RDS instance and queries records flagged for archival.
- The data is extracted, transformed into a CSV file, and uploaded to an Amazon S3 bucket.
- Once the archived data is uploaded, the corresponding rows are deleted from Amazon RDS.
Step-by-Step Implementation
Step 1: Set Up Amazon RDS
1. Create or Identify the Amazon RDS Instance
- Ensure the Amazon RDS database has sufficient resources to handle data archival queries.
2. Add an Archive Flag
- Modify your database schema to include a column (e.g., archive_flag) for marking rows eligible for archival.
- Example SQL:
```sql
ALTER TABLE transactions ADD archive_flag BOOLEAN DEFAULT FALSE;
```
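In practice, rows are usually flagged based on age. Below is a minimal sketch of how this could be done with pymysql, assuming the transactions table has a created_at column and that records older than one year should be archived; the connection details are placeholders.

```python
import pymysql

# Placeholder connection details; in practice, load these from environment
# variables or AWS Secrets Manager.
connection = pymysql.connect(
    host="my-rds-endpoint",
    user="admin",
    password="change-me",
    database="mydb"
)

try:
    with connection.cursor() as cursor:
        # Flag transactions older than one year as eligible for archival
        # (assumes a created_at column exists on the table).
        cursor.execute(
            "UPDATE transactions SET archive_flag = TRUE "
            "WHERE created_at < NOW() - INTERVAL 365 DAY"
        )
    connection.commit()
finally:
    connection.close()
```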
Step 2: Create an Amazon S3 Bucket
1. Set Up the Bucket
- Use the Amazon S3 console to create a bucket (e.g., my-data-archive-bucket).
- Enable server-side encryption to secure archived data.
2. Configure Lifecycle Policies
- Define policies to transition old data to cheaper storage tiers like Amazon S3 Glacier.
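Lifecycle policies can also be created programmatically. The sketch below uses boto3 to transition archived objects to Amazon S3 Glacier after 90 days; the bucket name, rule ID, and day count are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Transition every object in the archive bucket to Glacier after 90 days.
# Bucket name and day count are placeholders; adjust to your retention policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": ""},  # apply to all objects in the bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```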
Step 3: Write the AWS Lambda Function
Install Dependencies
Use the Python runtime for AWS Lambda and package the required libraries with your deployment. pymysql must be bundled; boto3 is already included in the Lambda Python runtime, though you can package a specific version if you need one.
```bash
pip install pymysql -t ./lambda_package
pip install boto3 -t ./lambda_package
```
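After copying your handler file into the same directory, the package can be zipped for upload. A minimal sketch using Python's standard library, assuming the handler lives in lambda_function.py inside ./lambda_package:

```python
import shutil

# Create lambda_package.zip from the contents of ./lambda_package
# (the installed dependencies plus lambda_function.py), ready for upload.
shutil.make_archive("lambda_package", "zip", root_dir="./lambda_package")
```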
AWS Lambda Code Example
Here is the Python code for the AWS Lambda function:
```python
import boto3
import pymysql
import csv
import os
from datetime import datetime

# Environment variables (set in the Lambda configuration)
RDS_HOST = os.environ['RDS_HOST']
RDS_USER = os.environ['RDS_USER']
RDS_PASSWORD = os.environ['RDS_PASSWORD']
RDS_DATABASE = os.environ['RDS_DATABASE']
S3_BUCKET = os.environ['S3_BUCKET']
ARCHIVE_TABLE = os.environ['ARCHIVE_TABLE']


def lambda_handler(event, context):
    connection = None
    cursor = None
    try:
        # Connect to the RDS instance
        connection = pymysql.connect(
            host=RDS_HOST,
            user=RDS_USER,
            password=RDS_PASSWORD,
            database=RDS_DATABASE
        )
        cursor = connection.cursor()

        # Query records flagged for archival
        query = f"SELECT * FROM {ARCHIVE_TABLE} WHERE archive_flag = TRUE"
        cursor.execute(query)
        rows = cursor.fetchall()

        if not rows:
            print("No records found for archival.")
            return

        # Save data to a CSV file in Lambda's writable /tmp directory
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        file_name = f"{ARCHIVE_TABLE}_archive_{timestamp}.csv"
        file_path = f"/tmp/{file_name}"

        with open(file_path, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # Write header row from the cursor metadata
            writer.writerow([desc[0] for desc in cursor.description])
            # Write data rows
            writer.writerows(rows)

        # Upload the file to S3
        s3_client = boto3.client('s3')
        s3_client.upload_file(file_path, S3_BUCKET, file_name)
        print(f"Archived data uploaded to S3: {file_name}")

        # Delete archived records only after a successful upload
        delete_query = f"DELETE FROM {ARCHIVE_TABLE} WHERE archive_flag = TRUE"
        cursor.execute(delete_query)
        connection.commit()
        print("Archived records deleted from RDS.")

    except Exception as e:
        print(f"Error occurred: {str(e)}")
        raise  # re-raise so the invocation is reported as failed
    finally:
        # Close resources only if they were successfully created
        if cursor:
            cursor.close()
        if connection:
            connection.close()
```
Environment Variables
Set these environment variables in your AWS Lambda function configuration:
- RDS_HOST
- RDS_USER
- RDS_PASSWORD
- RDS_DATABASE
- S3_BUCKET
- ARCHIVE_TABLE
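These values can be set in the console or scripted. Below is a hedged boto3 sketch; the function name and variable values are placeholders, and for production the database password is better kept in AWS Secrets Manager than in plain environment variables.

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder function name and values; replace with your own.
lambda_client.update_function_configuration(
    FunctionName="rds-archival-function",
    Environment={
        "Variables": {
            "RDS_HOST": "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
            "RDS_USER": "admin",
            "RDS_PASSWORD": "change-me",  # prefer AWS Secrets Manager in production
            "RDS_DATABASE": "mydb",
            "S3_BUCKET": "my-data-archive-bucket",
            "ARCHIVE_TABLE": "transactions",
        }
    },
)
```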
Step 4: Grant Permissions to Lambda
- Attach an AWS IAM role to the AWS Lambda function with:
- AmazonS3FullAccess or, preferably, permissions scoped to the archive bucket (e.g., s3:PutObject).
- SecretsManagerReadWrite (or a policy scoped to the relevant secret) if the database credentials are stored in AWS Secrets Manager. Note that network access to Amazon RDS itself is controlled by the function's VPC configuration and security groups, not by IAM.
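As an illustration, an inline policy along these lines could be attached with boto3; the role name, bucket name, account ID, and secret ARN are placeholders. The role also needs the standard Lambda execution permissions (e.g., the AWSLambdaBasicExecutionRole managed policy, plus AWSLambdaVPCAccessExecutionRole if the function runs inside a VPC).

```python
import json
import boto3

iam = boto3.client("iam")

# Example inline policy scoped to a single bucket and secret (placeholders).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-data-archive-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-archival-*",
        },
    ],
}

iam.put_role_policy(
    RoleName="rds-archival-lambda-role",
    PolicyName="rds-archival-access",
    PolicyDocument=json.dumps(policy),
)
```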
Step 5: Schedule AWS Lambda with Amazon CloudWatch Events
Use Amazon CloudWatch to trigger AWS Lambda on a regular schedule:
- Example: Run daily at midnight (UTC) using a cron-based schedule expression:

```
cron(0 0 * * ? *)
```
Best Practices
- Data Filtering: Ensure your archival query selects only necessary rows to minimize overhead.
- Encryption: Use Amazon S3 bucket encryption for data at rest and SSL/TLS for data in transit.
- Monitoring: Configure Amazon CloudWatch to log AWS Lambda execution details and errors.
- Testing: Run the AWS Lambda function in a test environment to validate results before deploying.
- Scalability: If the workload grows, run archival read queries against an Amazon RDS read replica so the primary instance is not impacted (deletes still have to run on the primary).
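For the monitoring point above, one simple safeguard is an Amazon CloudWatch alarm on the function's Errors metric. A hedged boto3 sketch, with placeholder function and SNS topic names:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever the Lambda function reports any error in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="rds-archival-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "rds-archival-function"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:archival-alerts"],
)
```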
Conclusion
By following the steps and best practices outlined in this blog, you can build a robust, scalable, and secure data archival pipeline.
Drop a query if you have any questions regarding Data Archival and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront and many more.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat's offerings.
FAQs
1. How do I determine which data to archive from my Amazon RDS database?
ANS: – You can define a specific criterion for identifying archival data, such as a date column (e.g., records older than a certain date) or a custom flag column (archive_flag). SQL queries with these filters ensure only relevant data is selected for archival.
2. How can I ensure the security of archived data in Amazon S3?
ANS: – To secure data in Amazon S3:
- Enable server-side encryption (e.g., AES-256 or AWS KMS keys).
- Use bucket policies and AWS IAM roles to restrict access.
- Enable Amazon S3 versioning for data protection against accidental overwrites or deletions.
- Use Amazon S3 Block Public Access settings to prevent unauthorized access.
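These protections can be applied with a few boto3 calls. A minimal sketch, reusing the placeholder bucket name from earlier:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-archive-bucket"  # placeholder

# Default server-side encryption with SSE-S3 (AES-256)
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Versioning for protection against accidental overwrites or deletions
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Block all forms of public access
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```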
WRITTEN BY Deepak Kumar Manjhi