Resilience Engineering in Amazon EKS with LitmusChaos

Introduction

Modern applications are increasingly deployed on distributed systems like Kubernetes, and Amazon Elastic Kubernetes Service (EKS) is a preferred choice for hosting containerized workloads. While Amazon EKS offers scalability and reliability, ensuring the resilience of applications in production requires rigorous testing under failure scenarios. Resilience Engineering is a proactive approach to building robust systems, and LitmusChaos is an open-source chaos engineering platform designed for Kubernetes-native environments.

In this blog, we will explore how resilience engineering principles can be applied to Amazon EKS using LitmusChaos to create reliable, failure-resistant applications.

Resilience Engineering

Resilience engineering is the practice of designing systems to withstand and recover from unexpected failures. The goal is to minimize downtime, preserve data integrity, and ensure a seamless user experience even under adverse conditions. In Kubernetes environments, resilience engineering involves:

Simulating real-world failures (e.g., node crashes, network delays).
Observing how applications behave under stress.
Identifying and fixing vulnerabilities before they impact production.

Why Resilience Engineering on Amazon EKS?

Amazon EKS is a managed platform for deploying Kubernetes workloads. However, resilience challenges can arise due to the dynamic and distributed nature of Kubernetes-based systems. Resilience engineering helps ensure that applications hosted on EKS remain functional and recover gracefully during failures. Here is why it matters for Amazon EKS environments:

Handling Node Failures: Amazon EKS automatically manages the control plane and integrates with managed node groups, but worker nodes can still fail due to hardware issues, resource exhaustion, or misconfigurations. Resilience engineering ensures pods are rescheduled quickly and services remain unaffected.
Managing Network Issues: In distributed applications, inter-service communication is vital. Network delays, packet drops, or DNS failures can disrupt service reliability. Simulating these issues helps validate retry logic, timeouts, and fallback mechanisms.
Resource Exhaustion: Kubernetes clusters often deal with varying workloads, which can lead to resource contention. Resilience engineering tests the cluster’s ability to handle CPU, memory, and disk exhaustion without impacting critical services.
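A node failure of the kind described above can be rehearsed by hand before any chaos tooling is introduced. The sketch below is an illustration and not part of the original walkthrough: the function name simulate_node_drain is hypothetical, and it assumes kubectl is already configured against a test EKS cluster.

```shell
# Hypothetical helper: evict workloads from one node and list where pods run afterward.
simulate_node_drain() {
  # kubectl drain cordons the node first, then evicts its pods.
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data || return 1
  # The NODE column shows where the evicted pods were rescheduled.
  kubectl get pods --all-namespaces -o wide
}
```

Calling simulate_node_drain <node-name> and reading the pod list shows whether workloads moved to the remaining nodes. Running kubectl uncordon <node-name> afterward returns the node to service.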

To implement resilience engineering in Amazon EKS, we use LitmusChaos as the primary tool for running chaos engineering experiments.

LitmusChaos

LitmusChaos is an open-source chaos engineering tool designed to help developers and SREs ensure the resilience of their Kubernetes workloads. When used with Amazon Elastic Kubernetes Service (EKS), LitmusChaos enables teams to simulate real-world failure scenarios, identify vulnerabilities, and build robust applications capable of withstanding disruptions.

Key Features of LitmusChaos

Kubernetes-Native Design: Seamlessly integrates with Kubernetes ecosystems, including Amazon EKS, GKE, AKS, and more.
Extensive Chaos Experiments: Provides a library of predefined experiments, such as pod delete, node drain, network latency, and resource stress, with the flexibility to define custom scenarios.
Chaos Workflows: Automates and sequences chaos experiments using tools like Argo Workflows or Litmus’s built-in workflow engine.
Role-Based Access Control (RBAC): Ensures secure execution of chaos experiments in multi-tenant environments.
ChaosCenter UI: A centralized dashboard for managing, monitoring, and analyzing chaos experiments.
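To make the idea of a predefined experiment concrete, the sketch below writes a ChaosEngine manifest for the pod delete experiment. It is a minimal illustration under stated assumptions: the file name, the target namespace default, the label app=nginx, and the service account pod-delete-sa are all hypothetical, and field details may differ between Litmus versions.

```shell
# Write a hypothetical ChaosEngine manifest for the pod-delete experiment.
# Assumed names: namespace "default", label app=nginx, service account pod-delete-sa.
cat > pod-delete-engine.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-pod-delete
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total chaos duration in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between pod deletions
              value: "10"
            - name: FORCE                  # "false" requests graceful deletion
              value: "false"
EOF
echo "wrote pod-delete-engine.yaml"
```

Once Litmus is installed and the pod delete experiment and its service account exist in the target namespace, a manifest like this could be applied with kubectl apply -f pod-delete-engine.yaml.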

Deploying LitmusChaos in the Amazon EKS cluster using the Helm package

Helm is a package manager for Kubernetes that simplifies the deployment and management of applications and services within a Kubernetes cluster. It allows you to define, install, and upgrade even the most complex Kubernetes applications using “charts”, which are pre-configured Kubernetes resources packaged together. Helm provides several benefits:

Package Management: Helm charts allow you to define, install, and manage Kubernetes applications in a repeatable way.
Simplified Deployment: With Helm, you can install applications with a single command (helm install), which reduces the complexity of Kubernetes manifests.
Version Control: Helm supports versioning, so you can track and manage changes to applications over time, making it easier to roll back or upgrade applications.
Reusability: Charts can be shared and reused across different teams and projects, reducing duplication of effort.
Customization: Helm charts support customization through configuration values, allowing you to tailor deployments based on your environment or needs.
Release Management: Helm manages releases, enabling the deployment of specific versions of an application and providing easy rollback to previous versions.
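The install, upgrade, and rollback benefits listed above map onto a handful of Helm commands. The following is a sketch only: the release name my-app, the chart path ./my-app-chart, the namespace demo, and the replicaCount value key are hypothetical placeholders, not taken from the LitmusChaos chart.

```shell
# Hypothetical release, chart, and namespace; substitute your own.
RELEASE="my-app"
CHART="./my-app-chart"
NAMESPACE="demo"

helm_lifecycle_demo() {
  # Revision 1: first install of the release.
  helm install "$RELEASE" "$CHART" -n "$NAMESPACE" --create-namespace || return 1
  # Revision 2: upgrade with an overridden value (replicaCount is an assumed key).
  helm upgrade "$RELEASE" "$CHART" -n "$NAMESPACE" --set replicaCount=3 || return 1
  # List the revisions Helm has recorded for this release.
  helm history "$RELEASE" -n "$NAMESPACE" || return 1
  # Return to revision 1.
  helm rollback "$RELEASE" 1 -n "$NAMESPACE"
}
```

Each upgrade or rollback creates a new numbered revision, which is what makes the version control and release management points above practical.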

Preparing Amazon EKS Cluster for LitmusChaos setup

Step 1:

Create an Amazon EKS Cluster: Set up an Amazon EKS cluster in your AWS account with the necessary AWS IAM roles and permissions. To install LitmusChaos, ensure the cluster includes three nodes of the t3a.medium instance type. Additionally, configure a storage class to support the setup of the Chaos Hub and to store chaos experiments. Litmus requests storage from this class through a PersistentVolumeClaim (PVC).
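One way to create a cluster of this shape is with the eksctl CLI, which the original steps do not prescribe. In this sketch the cluster name litmus-demo, the region us-east-1, and the node group name litmus-nodes are hypothetical, and eksctl is assumed to be installed with AWS credentials configured.

```shell
# Hypothetical names; adjust to your account and region.
CLUSTER_NAME="litmus-demo"
AWS_REGION="us-east-1"

create_cluster() {
  # Three managed t3a.medium worker nodes, matching the sizing described in Step 1.
  eksctl create cluster \
    --name "$CLUSTER_NAME" \
    --region "$AWS_REGION" \
    --nodegroup-name litmus-nodes \
    --node-type t3a.medium \
    --nodes 3 \
    --managed
}
```

Running create_cluster provisions the control plane and the node group. The same result can be reached through the AWS console or infrastructure-as-code tooling.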

Step 2:

Configure Storage Class: Create a YAML file named litmus-sc.yaml that defines the storage class, and apply it to the cluster using the kubectl apply command.
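The original manifest is not reproduced in the text, so the following is a plausible sketch and not the author's exact file. It assumes the storage class is named litmus-sc, is backed by the Amazon EBS CSI driver, and uses gp3 volumes.

```shell
# Write an assumed storage class manifest; the name litmus-sc and gp3 type are illustrative.
cat > litmus-sc.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: litmus-sc
provisioner: ebs.csi.aws.com        # Amazon EBS CSI driver
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF
echo "wrote litmus-sc.yaml"
```

The file can then be applied with kubectl apply -f litmus-sc.yaml. WaitForFirstConsumer delays volume creation until a pod is scheduled, which keeps the EBS volume in the same Availability Zone as the node that uses it.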

Verify that the storage class was created with the following command:

kubectl get storageclass

Ensure that the Amazon EBS CSI driver has been added to the Amazon EKS cluster from the Add-ons section. Without it, PVCs that rely on an EBS-backed storage class cannot be provisioned.
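The add-on can also be installed from the command line instead of the console. This is a sketch under assumptions: the function name install_ebs_csi_addon is hypothetical, the AWS CLI is assumed to be configured for the account that owns the cluster, and the driver typically also needs an IAM role with EBS permissions (which can be passed with --service-account-role-arn).

```shell
# Hypothetical helper: install the EBS CSI add-on and print its status.
install_ebs_csi_addon() {
  # $1 is the EKS cluster name.
  aws eks create-addon \
    --cluster-name "$1" \
    --addon-name aws-ebs-csi-driver || return 1
  # The status is expected to move from CREATING to ACTIVE.
  aws eks describe-addon \
    --cluster-name "$1" \
    --addon-name aws-ebs-csi-driver \
    --query 'addon.status' --output text
}
```

For example, install_ebs_csi_addon <cluster-name> installs the add-on and reports its current state.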

Deploying LitmusChaos using Helm in Amazon EKS Cluster

Step 3:

First, pull the Litmus Helm chart with the helm pull command and extract the downloaded .tgz archive. Open the chart in VS Code, locate the PVC settings in values.yaml, and set the storage class to the name defined in litmus-sc.yaml. Then change the frontend service type to NodePort so that the LitmusChaos UI page can be reached from outside the cluster.
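The pull and extract steps can be sketched as follows. The repository alias litmuschaos and its URL are assumptions based on the public Litmus Helm repository and are not stated in the original steps. The exact key names in values.yaml vary by chart version, so the sketch searches for them instead of naming them.

```shell
# Hypothetical helper: download the Litmus chart and show the settings to edit.
fetch_litmus_chart() {
  # Assumed repository alias and URL for the Litmus Helm charts.
  helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/ || return 1
  helm repo update || return 1
  # --untar unpacks the chart into ./litmus instead of leaving a .tgz archive.
  helm pull litmuschaos/litmus --untar || return 1
  # Show candidate lines for the storage class and the service type.
  grep -n -i -E 'storageClass|NodePort|ClusterIP' litmus/values.yaml
}
```

The line numbers printed by grep point to the places in values.yaml where the storage class and the frontend service type are set.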

Step 4:

To install LitmusChaos, create a namespace named litmus:

kubectl create namespace litmus

Step 5:

For installation, use the helm install command with a release name and the chart path:

helm install <release-name> <chart-path> -n litmus

For example: helm install litmus-chaos litmus/ -n litmus

Check whether all pods are running with the following command:

kubectl get pods -n litmus
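The install and the pod check can also be combined so that the shell waits for the pods instead of being polled by hand. This is a minimal sketch: the function name install_and_verify_litmus is hypothetical, the litmus namespace is assumed to exist already, and the five-minute timeout is an arbitrary choice.

```shell
# Hypothetical helper: install the chart, then wait for the pods to become Ready.
install_and_verify_litmus() {
  # Same install command as in Step 5; litmus/ is the unpacked chart directory.
  helm install litmus-chaos litmus/ -n litmus || return 1
  # Block until every pod in the namespace reports Ready, or give up after 5 minutes.
  kubectl wait --for=condition=Ready pods --all -n litmus --timeout=300s || return 1
  kubectl get pods -n litmus
}
```

If the wait times out, kubectl describe pod <pod-name> -n litmus usually shows the cause, for example a PVC that is still pending.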

Step 6:

The installation of LitmusChaos in the Amazon EKS cluster is now complete. Next, port-forward the frontend service with kubectl port-forward to reach the ChaosCenter UI page.
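The exact port-forward command is not reproduced in the text, so the following is a sketch under assumptions. The frontend service name depends on the Helm release name, so the function looks it up instead of hard-coding it, and it reads the service port from the cluster. The function name forward_chaoscenter is hypothetical. Local port 9094 matches the port used in this walkthrough.

```shell
# Hypothetical helper: forward local port 9094 to the ChaosCenter frontend service.
forward_chaoscenter() {
  # Find the frontend Service; its exact name depends on the Helm release name.
  frontend_svc="$(kubectl get svc -n litmus -o name | grep -i frontend | head -n 1)"
  if [ -z "$frontend_svc" ]; then
    echo "no frontend service found in namespace litmus" >&2
    return 1
  fi
  # Read the Service port instead of hard-coding it.
  svc_port="$(kubectl get "$frontend_svc" -n litmus -o jsonpath='{.spec.ports[0].port}')"
  # Listen on all interfaces so the UI is reachable through the host's public IP.
  kubectl port-forward "$frontend_svc" "9094:${svc_port}" -n litmus --address 0.0.0.0
}
```

Because --address 0.0.0.0 exposes the port on every network interface of the host, it is worth restricting inbound access to port 9094 in the instance's security group.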

Access ChaosCenter in a browser using the public IP address of your Amazon EC2 instance and the forwarded port, which is 9094 here.

<public_ip>:9094

The LitmusChaos login page then appears. The default username is “admin” and the default password is “litmus”. After logging in with these credentials, you can change the password.

Conclusion

Resilience engineering is essential for ensuring the robustness of applications in distributed systems. With Amazon EKS and LitmusChaos, teams can proactively identify weaknesses, test recovery mechanisms, and build fault-tolerant systems.

By embracing chaos engineering, organizations can improve their application reliability and deliver seamless user experiences, even during unexpected failures.

Drop a query if you have any questions regarding resilience engineering, and we will get back to you quickly.

About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront and many more.

FAQs

1. What is the role of ChaosCenter in LitmusChaos?

ANS: – ChaosCenter is the control plane for managing chaos experiments, visualizing workflows, and analyzing results, making it easier to manage resilience testing in Amazon EKS clusters.

2. How do you implement chaos experiments in ChaosCenter?

ANS: – LitmusChaos’s ChaosCenter provides a unified platform for orchestrating, executing, and monitoring chaos experiments. It offers a user-friendly interface to design, automate, and analyze experiments in Kubernetes environments like Amazon EKS. Refer to the official LitmusChaos documentation at LitmusChaos.io for instructions on performing the experiments.

WRITTEN BY Sidda Sonali

Sidda Sonali is a Research Intern at CloudThat. She is keenly interested in learning advanced technologies and gaining insights into emerging and upcoming cloud services. Sonali actively seeks opportunities to learn about new cloud innovations and best practices.