Introduction
Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed service that makes building and running workflow automation on AWS easy.
This blog post provides a detailed step-by-step guide to help you install the Pandas library on MWAA hosted in a private network.
MWAA simplifies the deployment and management of Airflow, a popular tool for orchestrating complex data workflows. However, adding custom Python libraries such as Pandas can be challenging when the environment is hosted in a private network with no direct internet access. This guide walks you through installing Pandas on MWAA while staying within those network restrictions.
Prerequisites
Before you start, ensure you have the following:
- An AWS account with appropriate permissions to create and manage MWAA environments and Amazon S3 buckets.
- An existing MWAA environment hosted in a private network.
- Basic knowledge of Amazon S3, AWS IAM roles, and Amazon VPC configurations.
Step-by-Step Guide
1. Prepare a Custom Requirements File
Create a requirements.txt file that lists Pandas and any other dependencies you might need. MWAA reads this file and installs the listed packages in your environment.
    pandas==1.3.3
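Before uploading the file, it can help to confirm that every entry is pinned to an exact version, since unpinned packages may resolve differently on each environment update. The following is a minimal sketch of such a check, not part of MWAA itself; the helper name `unpinned_requirements` and the regular expression are assumptions made for illustration and cover only simple `name==version` lines.

```python
import re

# Matches an exactly pinned requirement such as "pandas==1.3.3".
# Illustrative pattern only; it does not cover every valid pip specifier.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[A-Za-z0-9,._-]+\])?==[A-Za-z0-9.*+!_-]+$"
)


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


sample = "pandas==1.3.3\nnumpy\n# a comment\n"
print(unpinned_requirements(sample))  # lists the lines that lack an exact pin
```

An empty list means every package line in the file uses the `==` operator.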
2. Set Up an Amazon S3 Bucket for Dependencies
Create an Amazon S3 Bucket: If you don’t already have an Amazon S3 bucket for your MWAA environment, create one. This bucket will be used to store your requirements file.
    aws s3 mb s3://your-bucket-name
Upload the Requirements File: Upload the requirements.txt file to the Amazon S3 bucket.
    aws s3 cp requirements.txt s3://your-bucket-name/requirements.txt
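The same upload can be scripted with boto3 if you prefer Python over the AWS CLI. The sketch below is one possible approach, assuming boto3 is installed and credentials are configured; the function name `upload_requirements` is hypothetical. It also enables bucket versioning, which MWAA requires on its source bucket, and returns the version ID of the uploaded object.

```python
def upload_requirements(s3_client, bucket, path="requirements.txt", key="requirements.txt"):
    """Enable versioning on the bucket, upload the file, and return its S3 version ID."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    with open(path, "rb") as handle:
        response = s3_client.put_object(Bucket=bucket, Key=key, Body=handle.read())
    # VersionId is included in the response once versioning is enabled.
    return response.get("VersionId")


# Example usage (assumes boto3 and valid AWS credentials):
#   import boto3
#   version = upload_requirements(boto3.client("s3"), "your-bucket-name")
#   print("Uploaded requirements.txt, version:", version)
```

Passing the client in as an argument keeps the function easy to test with a stub and lets you reuse a client configured for a specific region or profile.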
3. Configure the MWAA Environment
- Navigate to the MWAA Console: Open the Amazon MWAA console and select your MWAA environment.
- Update the Environment Configuration: Go to the Environment details section. In the Python requirements file field, specify the Amazon S3 path to your requirements.txt file, for example: s3://your-bucket-name/requirements.txt.
- Save Changes: Save the changes and wait for the environment to update. This may take a few minutes.
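The console steps above can also be performed through the MWAA API. The following sketch assumes boto3's `mwaa` client and uses two hypothetical helper names, `apply_requirements` and `wait_until_available`. Note that the API expects `RequirementsS3Path` as a path relative to the environment's source bucket (for example `requirements.txt`), not a full `s3://` URI. The polling loop is deliberately simple; a more careful version might first wait for the status to leave `AVAILABLE` before polling for completion.

```python
import time


def apply_requirements(mwaa_client, env_name, requirements_key, version_id=None):
    """Point the environment at a requirements file stored in its source bucket."""
    kwargs = {"Name": env_name, "RequirementsS3Path": requirements_key}
    if version_id:
        # Pin the exact S3 object version of the requirements file.
        kwargs["RequirementsS3ObjectVersion"] = version_id
    mwaa_client.update_environment(**kwargs)


def wait_until_available(mwaa_client, env_name, poll_seconds=60, max_polls=60):
    """Poll the environment status until the update finishes or fails."""
    for _ in range(max_polls):
        status = mwaa_client.get_environment(Name=env_name)["Environment"]["Status"]
        if status == "AVAILABLE":
            return status
        if status in ("UPDATE_FAILED", "UNAVAILABLE"):
            raise RuntimeError(f"Environment update ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("Environment did not become AVAILABLE in time")


# Example usage (assumes boto3 and valid AWS credentials; the environment
# name below is a placeholder):
#   import boto3
#   client = boto3.client("mwaa")
#   apply_requirements(client, "your-environment-name", "requirements.txt")
#   wait_until_available(client, "your-environment-name")
```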
4. Verify Pandas Installation
To verify that Pandas has been successfully installed, you can create a simple DAG that imports and uses Pandas. Here’s an example:
    from datetime import datetime

    import pandas as pd
    from airflow import DAG
    # Airflow 2.x import path; the older airflow.operators.python_operator module is deprecated.
    from airflow.operators.python import PythonOperator


    def test_pandas():
        # Build a small DataFrame to confirm that Pandas imports and runs on the worker.
        df = pd.DataFrame({'column1': [1, 2], 'column2': [3, 4]})
        print(df)


    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'retries': 1,
    }

    dag = DAG(
        'test_pandas',
        default_args=default_args,
        schedule_interval='@once',
    )

    run_this = PythonOperator(
        task_id='run_test',
        python_callable=test_pandas,
        dag=dag,
    )
- Deploy the DAG: Save the DAG file in your DAGs folder (usually in the Amazon S3 bucket associated with your MWAA environment).
- Trigger the DAG: Trigger the DAG from the Airflow UI to ensure it runs successfully.
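Beyond checking that the import works, you may want the test task to confirm that the installed version matches the one pinned in requirements.txt. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `check_pinned` are illustrative assumptions, and `check_pinned` could serve as the `python_callable` of a task.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pinned(package, expected):
    """Raise if the installed version differs from the pinned one."""
    found = installed_version(package)
    if found != expected:
        raise RuntimeError(f"{package}: expected {expected}, found {found}")
    return found


# Example usage inside a task callable:
#   check_pinned("pandas", "1.3.3")
```

If the task fails with this error, the task log shows both the expected and the installed version, which makes a mismatch between the requirements file and the environment easier to spot.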
Handling Private Network Restrictions
If your MWAA environment is hosted in a private network, ensure that the necessary endpoints and permissions are in place so the environment can reach the Amazon S3 bucket and other required services:
- Amazon VPC Endpoints: Ensure that your Amazon VPC has the following endpoints configured:
  - Amazon S3 Endpoint: To allow access to the Amazon S3 bucket.
  - Other Service Endpoints: Depending on your specific requirements.
- AWS IAM Roles and Policies: The AWS IAM role associated with your MWAA environment should have the necessary permissions to access the Amazon S3 bucket and any other AWS resources required.
Note that an Amazon S3 endpoint alone does not give pip a route to a package index. Without a NAT gateway or a private package repository, the environment cannot download Pandas from PyPI, and the package has to be supplied another way, for example as wheel files bundled in plugins.zip and referenced from requirements.txt.
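The endpoint and IAM checks described above can be sketched in Python. This is an illustration under stated assumptions rather than a definitive checklist: the helper names are hypothetical, the list of required services defaults to S3 only (extend it to match your environment), and the policy shows read access to the bucket only, not the full set of permissions an MWAA role needs.

```python
def missing_endpoint_services(endpoints, region, required=("s3",)):
    """Return required service suffixes that have no available endpoint.

    `endpoints` is the "VpcEndpoints" list returned by EC2 describe_vpc_endpoints.
    """
    present = {
        ep["ServiceName"]
        for ep in endpoints
        if ep.get("State", "available").lower() == "available"
    }
    return [svc for svc in required if f"com.amazonaws.{region}.{svc}" not in present]


def s3_read_policy(bucket):
    """Build a read-only S3 policy document for the environment's bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }


# Example usage (assumes boto3 and valid credentials; the VPC ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   eps = ec2.describe_vpc_endpoints(
#       Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]
#   )["VpcEndpoints"]
#   print(missing_endpoint_services(eps, "us-east-1", required=("s3", "logs")))
```

An empty list from `missing_endpoint_services` means every service you listed has an available endpoint in the VPC.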
Conclusion
Installing the Pandas library on MWAA hosted in a private network requires a few additional steps compared to a public network setup. By preparing a custom requirements file, configuring your Amazon S3 bucket, and ensuring the necessary network endpoints and permissions are in place, you can successfully extend your MWAA environment’s functionality with Pandas. This guide provides a comprehensive overview to help you navigate the installation process smoothly.
By following these steps, you can take full advantage of the powerful data manipulation capabilities of Pandas within your MWAA workflows, enabling more efficient and effective data processing.
FAQs
1. What if my MWAA environment doesn't update with the new requirements file?
ANS: – Ensure the Amazon S3 URI is correct and your requirements.txt file is accessible. Also, make sure there are no syntax errors in the file. If the problem persists, check the MWAA logs for any errors.
2. How do I specify a specific version of Pandas in the requirements file?
ANS: – To specify a specific Pandas version, use the == operator, like pandas==1.3.3.
3. How long does it take for MWAA to update after changing the requirements file?
ANS: – The update process can take several minutes, depending on the size of the requirements file and the current load on your MWAA environment.
WRITTEN BY Sunil H G
Sunil H G is a highly skilled and motivated Research Associate at CloudThat. He is an expert in working with popular data analysis and visualization libraries such as Pandas, Numpy, Matplotlib, and Seaborn. He has a strong background in data science and can effectively communicate complex data insights to both technical and non-technical audiences. Sunil's dedication to continuous learning, problem-solving skills, and passion for data-driven solutions make him a valuable asset to any team.