Overview
In today’s data-driven landscape, organizations often face the challenge of efficiently processing massive amounts of real-time data. This blog delves into integrating Amazon Kinesis and Amazon EMR to streamline real-time data ingestion and processing. It highlights the benefits of leveraging Kinesis for scalable, durable, and near real-time data streaming while utilizing Amazon EMR for powerful big data analytics. With a focus on use cases such as real-time analytics, data transformation, and log processing, the blog provides a detailed step-by-step guide to set up and optimize this robust data pipeline.
Why Use Amazon Kinesis with Amazon EMR?
Amazon Kinesis is a scalable and durable real-time data streaming service. Integrating it with Amazon EMR allows organizations to process streaming data in near real-time, making it ideal for use cases like:
- Real-time analytics: Monitoring website activity, processing IoT sensor data, or financial transactions.
- Data transformation: Converting raw data into structured formats for downstream analysis.
- Log processing: Streaming logs from various applications and processing them on the fly.
By leveraging Amazon Kinesis, you can decouple data producers from consumers and ensure that Amazon EMR processes data only once it is ready for consumption.
Components of the Pipeline
- Amazon Kinesis Data Streams: This is the primary channel for ingesting real-time data from producers.
- Amazon Kinesis Data Firehose: Delivers data from streams to destinations like Amazon S3, enabling Amazon EMR to process data in batches.
- Amazon EMR: Processes the ingested data using frameworks like Apache Spark, Hive, or Presto.
Below is a detailed step-by-step guide to setting up the pipeline.
Step-by-Step Guide
Step 1: Create an Amazon Kinesis Data Stream
- Log in to the AWS Management Console and navigate to Amazon Kinesis Data Streams.
- Create a new stream:
- Name your stream (e.g., real-time-stream).
- Define the number of shards based on your expected data throughput. Each shard supports up to 1 MB/sec write and 2 MB/sec read.
- Click Create Stream.
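If you prefer to script this step, the same stream can be created with boto3. A minimal sketch; the shard count of 2 is a placeholder to size from your expected throughput:

```python
import boto3

# Create the stream programmatically (equivalent to the console steps above)
kinesis_client = boto3.client('kinesis')
kinesis_client.create_stream(
    StreamName='real-time-stream',
    ShardCount=2  # placeholder; derive from your write/read rates
)

# Block until the stream is ACTIVE before sending records
kinesis_client.get_waiter('stream_exists').wait(StreamName='real-time-stream')
```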
Configuring Data Producers
You can use AWS SDKs or libraries like boto3 for Python to send data to the stream. For example:
```python
import boto3
import json

def send_to_kinesis(stream_name, data):
    kinesis_client = boto3.client('kinesis')
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(data),
        PartitionKey="partition-key"
    )
    return response

# Example usage
send_to_kinesis("real-time-stream", {"event": "page_view", "user": "1234"})
```
Step 2: Configure Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose acts as a bridge between Kinesis Data Streams and Amazon S3.
- Navigate to Amazon Kinesis Data Firehose in the AWS Management Console.
- Create a Delivery Stream:
- Choose Source as Amazon Kinesis Data Stream and select your stream (real-time-stream).
- Choose Destination as Amazon S3.
- Configure the destination:
- Create or specify an Amazon S3 bucket (e.g., emr-data-bucket).
- Optionally enable data transformation using an AWS Lambda function.
- Define the buffer size and interval:
- e.g., 5 MB or 60 seconds (whichever condition is met first).
- Click Create Delivery Stream.
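The delivery stream can also be created programmatically. A minimal boto3 sketch; the ARNs, account ID, and role names below are placeholders you would replace with your own:

```python
import boto3

firehose_client = boto3.client('firehose')

# Read from the Kinesis stream and buffer records into S3,
# flushing at 5 MB or 60 seconds, whichever comes first
firehose_client.create_delivery_stream(
    DeliveryStreamName='real-time-delivery-stream',
    DeliveryStreamType='KinesisStreamAsSource',
    KinesisStreamSourceConfiguration={
        'KinesisStreamARN': 'arn:aws:kinesis:us-east-1:123456789012:stream/real-time-stream',
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-read-role'
    },
    ExtendedS3DestinationConfiguration={
        'BucketARN': 'arn:aws:s3:::emr-data-bucket',
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-s3-role',
        'BufferingHints': {'SizeInMBs': 5, 'IntervalInSeconds': 60},
        'CompressionFormat': 'GZIP'
    }
)
```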
Step 3: Set Up an Amazon EMR Cluster
- Navigate to the Amazon EMR Console and create a new cluster:
- Choose the desired Amazon EMR version (e.g., emr-6.x) with Apache Spark.
- Select an instance type like m5.xlarge based on your workload.
- Configure Amazon S3 Input and Output Paths:
- Input Path: s3://emr-data-bucket/
- Output Path: s3://emr-output-bucket/
- Enable permissions:
- Attach AWS IAM roles allowing Amazon EMR to access Amazon S3 and Amazon Kinesis.
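For repeatable setups, the cluster can be launched with boto3 as well. A sketch with illustrative defaults; the release label, instance counts, and IAM role names are assumptions to adapt to your account:

```python
import boto3

emr_client = boto3.client('emr')

# Launch a small Spark cluster that stays alive between steps
response = emr_client.run_job_flow(
    Name='kinesis-emr-cluster',
    ReleaseLabel='emr-6.10.0',            # any emr-6.x release with Spark
    Applications=[{'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {'Name': 'Primary', 'InstanceRole': 'MASTER',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
            {'Name': 'Core', 'InstanceRole': 'CORE',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 2},
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',    # default EMR roles assumed to exist
    ServiceRole='EMR_DefaultRole',
    LogUri='s3://emr-output-bucket/logs/'
)
print(response['JobFlowId'])
```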
Step 4: Process Data on Amazon EMR Using Apache Spark
Apache Spark on Amazon EMR can process the data in the Amazon S3 bucket to which Amazon Kinesis Data Firehose delivers it.
Example Spark Job
Save the following Spark job as process_stream.py:
```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("KinesisEMRProcessing").getOrCreate()

# Read data from S3
input_path = "s3://emr-data-bucket/"
data = spark.read.json(input_path)

# Perform transformations
data_transformed = data.withColumnRenamed("event", "event_type")

# Write output to S3
output_path = "s3://emr-output-bucket/processed/"
data_transformed.write.mode("overwrite").json(output_path)

print("Data processing complete!")
```
Submit the Job
Use the following command to submit the Spark job to your Amazon EMR cluster:
```bash
spark-submit --deploy-mode cluster --master yarn process_stream.py
```
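Alternatively, the job can be submitted as an EMR step, which is convenient for automation. A boto3 sketch; the cluster ID and the S3 path to the script are placeholders:

```python
import boto3

emr_client = boto3.client('emr')

# Run process_stream.py as a step on an existing cluster
emr_client.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster ID
    Steps=[{
        'Name': 'process-kinesis-data',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     '--master', 'yarn',
                     's3://emr-data-bucket/scripts/process_stream.py']
        }
    }]
)
```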
Step 5: Monitor the Pipeline
- Monitor Amazon Kinesis Data Streams:
- Use the Amazon Kinesis monitoring dashboard to track incoming data rate and shard utilization.
- Monitor Firehose:
- Check the delivery stream metrics for data delivery success/failure rates.
- Monitor Amazon EMR:
- Use Amazon CloudWatch to monitor Amazon EMR cluster health and Spark job performance.
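Beyond the dashboards, these metrics can be pulled programmatically. A minimal sketch that reads the stream's incoming-bytes metric from Amazon CloudWatch over the last hour:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Sum of bytes written to the stream, in 5-minute buckets
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/Kinesis',
    MetricName='IncomingBytes',
    Dimensions=[{'Name': 'StreamName', 'Value': 'real-time-stream'}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Sum']
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Sum'])
```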
Best Practices
- Optimize Shard Count: Adjust the shard count dynamically using the Amazon Kinesis Scaling Utility or AWS Application Auto Scaling.
- Enable Compression: Configure Firehose to compress data (e.g., using GZIP or Snappy) to reduce storage costs and speed up processing.
- Use Partitioning: Partition data in Amazon S3 by timestamp or other meaningful keys to improve query performance in Amazon EMR (a short example follows this list).
- Secure Your Pipeline:
- Encrypt data at rest using Amazon S3 server-side encryption (SSE-S3 or SSE-KMS).
- Encrypt data in transit using SSL/TLS.
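To illustrate the partitioning tip above, a Spark job can write its output partitioned by date so downstream queries can prune by day. A sketch; the `timestamp` field is a hypothetical column in your events:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("PartitionedWrite").getOrCreate()

data = spark.read.json("s3://emr-data-bucket/")

# Derive a date column from the (hypothetical) event timestamp and
# use it as the partition key for the S3 layout
(data.withColumn("event_date", to_date(col("timestamp")))
     .write
     .partitionBy("event_date")
     .mode("overwrite")
     .json("s3://emr-output-bucket/partitioned/"))
```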
Conclusion
Integrating Amazon Kinesis with Amazon EMR gives you a scalable pipeline for near real-time ingestion and large-scale processing. With proper monitoring and optimization, this architecture can deliver high performance while remaining cost-effective.
Drop a query if you have any questions regarding Amazon Kinesis or Amazon EMR and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront and many more.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat's offerings.
FAQs
1. Can I use Amazon Kinesis Data Analytics instead of Amazon Kinesis Data Firehose in this setup?
ANS: – Yes, you can use Amazon Kinesis Data Analytics to process streaming data in real-time before sending it to an Amazon S3 bucket or other destinations. Amazon Kinesis Data Analytics allows for SQL-based transformations on streaming data, which can be beneficial if you need pre-processing before ingesting data into Amazon EMR.
2. How do I determine the optimal number of shards for my Kinesis Data Stream?
ANS: – The number of shards depends on your data’s throughput requirements:
- Each shard supports up to 1 MB/second write and 2 MB/second read throughput.
- Calculate your expected data ingestion rate and divide it by these limits to determine the required number of shards. Using the Kinesis Scaling Utility or AWS Application Auto Scaling, you can also scale dynamically.
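As a rough illustration of that calculation, with made-up throughput figures:

```python
import math

# Hypothetical workload: 4.5 MB/s ingested, 7 MB/s read by consumers
write_mb_per_sec = 4.5
read_mb_per_sec = 7.0

# Per-shard limits: 1 MB/s write, 2 MB/s read
shards_for_writes = math.ceil(write_mb_per_sec / 1)  # 5
shards_for_reads = math.ceil(read_mb_per_sec / 2)    # 4

print(max(shards_for_writes, shards_for_reads))      # 5 shards required
```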
WRITTEN BY Deepak Kumar Manjhi