Overview:
In this blog, we will explore the pivotal role of Apache Airflow in the data engineering domain and delve into the process of deploying Airflow on Kubernetes, along with integrating Git for enhanced version control and collaboration. Join us as we uncover how this integration reshapes the landscape of data engineering practices.
Introduction
Apache Airflow has emerged as a leading open-source platform for orchestrating complex data workflows. Its intuitive user interface and scalable architecture make it a preferred choice among data engineers for scheduling, monitoring, and managing workflows.
Prerequisites
Before getting started, ensure you have the following:
- An AWS account with appropriate permissions to work on the Amazon EKS cluster
- Working knowledge of Kubernetes.
- AWS CLI configured with credentials.
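If the AWS CLI is configured but kubectl is not yet pointed at your EKS cluster, updating the kubeconfig is a quick first step; the region and cluster name below are placeholders for your own values:

$ aws eks update-kubeconfig --region <region> --name <cluster-name>
$ kubectl get nodes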
Airflow Deployment on Kubernetes with Helm
Kubernetes, renowned for its container orchestration prowess, complements Airflow’s capabilities by providing a scalable and resilient infrastructure for deploying and managing workflow execution.
Helm: Kubernetes Package Manager
Helm, the package manager for Kubernetes, simplifies the deployment and management of Kubernetes applications through the concept of charts—pre-packaged Kubernetes resources.
Steps to Deploy Airflow on Kubernetes using Helm:
- Install Helm: Begin by installing Helm on the machine you use to manage your Kubernetes cluster. Helm facilitates the management of Kubernetes applications through charts.
The commands below install Helm from the terminal:

$ sudo apt-get update
$ sudo apt-get install -y helm
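If the helm package is not available in your distribution's default apt repositories, the official Helm install script is a commonly used alternative; review the script before running it:

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh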
- Prepare Airflow Helm Chart: Utilize an existing Airflow Helm chart or customize one according to your requirements. Helm charts encapsulate Kubernetes manifests, configurations, and dependencies necessary for deploying Airflow.
Note that helm init is only required if you are running Helm 2; Helm 3 removed Tiller, so no initialization step is needed there.

$ helm init
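You can check which Helm version you are running before proceeding:

$ helm version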
- Install Airflow Chart: Add the official Apache Airflow chart repository so the chart can be deployed onto your Kubernetes cluster with Helm’s simple command-line interface.
$ helm repo add apache-airflow https://airflow.apache.org
$ helm repo update
$ helm repo list
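The rest of this walkthrough installs the release into a namespace called airflow-dep. If it does not exist yet, create it first (any namespace name works as long as you use it consistently in the later commands); alternatively, the --create-namespace flag on helm upgrade --install achieves the same result:

$ kubectl create namespace airflow-dep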
- Deploy Airflow Through Helm:
First, export the chart’s default values to a file so they can be customized:

$ helm show values apache-airflow/airflow > newvalues.yaml
This writes the chart’s configurable values to newvalues.yaml. Change the required values and then install or upgrade the release using the command below.
$ helm upgrade --install airflow apache-airflow/airflow -n airflow-dep -f newvalues.yaml --debug
This deploys the changes to the airflow-dep namespace.
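As an optional sanity check, confirm that the release is deployed and the Airflow pods are running:

$ helm status airflow -n airflow-dep
$ kubectl get pods -n airflow-dep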
To expose the Airflow webserver outside the cluster, create an Ingress similar to the one below (this example assumes the AWS Load Balancer Controller is installed in the cluster):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow-ingress
  namespace: airflow-dep
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=4000
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/subnets: <subnet-ids>
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb   # needed unless alb is the cluster's default IngressClass
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: airflow-webserver
                port:
                  number: 8080
After applying this manifest, check the Ingress to get the load balancer address:

$ kubectl get ingress -n airflow-dep
Copy the address from the output and open it in a browser to reach the Airflow login page. Log in with the default username and password, both admin.
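If the load balancer is not ready yet, port-forwarding the webserver Service is a quick alternative way to reach the UI. The Service name airflow-webserver matches the one referenced in the Ingress above; adjust it if your release names it differently:

$ kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow-dep

The UI is then available at http://localhost:8080.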
Integrating Git with Airflow:
To integrate Git with Airflow, you typically configure Airflow to pull DAGs (Directed Acyclic Graphs) from a Git repository. This enables version-controlled management of your workflow definitions.
- Create a Git Repository: Create a repository containing your Airflow DAG definitions.
- Configure Airflow to Pull DAGs from Git: Update your Airflow configuration (either through the Helm chart values or directly in Airflow’s configuration) to specify the Git repository URL and credentials if necessary.
In newvalues.yaml, enable the gitSync block under the dags section and point it at your GitHub repository; this is what tells the Airflow deployment to sync DAGs from Git, as shown in the sketch below.
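A minimal sketch of the relevant gitSync block in newvalues.yaml is shown below. The key names follow the official apache-airflow/airflow chart; the repository URL, branch, and subPath are placeholders for your own repository, so verify them against the chart version you deployed:

dags:
  gitSync:
    enabled: true
    # SSH URL of the repository that holds your DAG definitions
    repo: git@github.com:<your-org>/<your-dag-repo>.git
    branch: main
    # folder inside the repository where the DAG files live
    subPath: "dags"
    # Kubernetes secret holding the private SSH key (created in the steps below)
    sshKeySecret: airflow-ssh-git-secret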
- Then create a public/private key pair using ssh-keygen:
$ ssh-keygen -t ed25519 -C "<mailaddress>"
- Copy the public key by printing it with the command below:
$ cat ~/.ssh/id_ed25519.pub
The output starts with:

ssh-ed25519 <code> <mailaddress>
- Copy the generated public key and add it to your GitHub repository (for example, as a deploy key) so the repository can be cloned over SSH.
- Create a secret in the same namespace where Airflow is deployed:
$ kubectl create secret generic airflow-ssh-git-secret --from-file=gitSshKey=/root/.ssh/id_ed25519 -n airflow-dep
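Optionally confirm that the secret was created in the right namespace:

$ kubectl get secret airflow-ssh-git-secret -n airflow-dep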
Edit newvalues.yaml and set the secret name under the gitSync section (the sshKeySecret field shown in the sketch above) to airflow-ssh-git-secret. Then upgrade the deployment:
$ helm upgrade --install airflow apache-airflow/airflow -n airflow-dep -f newvalues.yaml --debug
- Sync DAGs: Once this configuration is in place, Airflow automatically syncs and loads DAGs from the specified Git repository.
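To confirm the sync is working, you can tail the git-sync sidecar that the chart adds to the scheduler pod. The deployment and container names below assume the release is called airflow and follow the chart's defaults, so adjust them if yours differ:

$ kubectl logs deploy/airflow-scheduler -c git-sync -n airflow-dep --tail=20

New or updated DAGs should then appear in the Airflow UI after the next sync interval.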
Conclusion:
In conclusion, Airflow is a robust and flexible platform for orchestrating complex workflows and data pipelines. With its rich features, scalability, and extensibility, Airflow empowers organizations to automate and monitor their data processes efficiently. By leveraging Airflow’s dynamic DAGs, task dependencies, and the broad ecosystem of integrations, teams can achieve greater agility, reliability, and visibility in their data workflows. As organizations continue to embrace data-driven decision-making, Airflow remains a powerful tool for orchestrating data pipelines and unlocking valuable insights from diverse data sources.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, and many more.
To get started, explore CloudThat’s Consultancy page and Managed Services Package offerings.
FAQs
1. What is Airflow's main purpose?
ANS: – Airflow is primarily designed to orchestrate and schedule complex data workflows and pipelines.
2. How does Airflow handle task dependencies?
ANS: – Airflow uses Directed Acyclic Graphs (DAGs) to represent workflows, allowing users to define task dependencies and execution orders.
3. Can Airflow integrate with other systems and tools?
ANS: – Yes, Airflow provides a rich ecosystem of integrations and operators to interact with various systems, such as databases, cloud services, and messaging queues.
4. Is Airflow scalable for large-scale data processing?
ANS: – Airflow is highly scalable and can handle large-scale data processing tasks by leveraging distributed execution and parallelism.
5. How does Airflow ensure reliability and monitoring of workflows?
ANS: – Airflow offers robust monitoring and logging capabilities, including task-level retries, alerts, and visualization of workflow status through its web UI.
WRITTEN BY Karthik Kumar P V
Karthik Kumar Patro Voona is a Research Associate (Kubernetes) at CloudThat Technologies. He holds a Bachelor's degree in Information Technology and has good programming knowledge of Python. He has experience in both AWS and Azure and a passion for cloud computing and DevOps. He has good working experience with Kubernetes and DevOps tools like Terraform, Ansible, and Jenkins. He is a good team player, adaptive, and interested in exploring new technologies.