Overview
Debezium is an innovative open-source platform for Change Data Capture (CDC). It captures real-time data changes from various databases and transforms them into a stream of change events. Debezium supports popular databases like MySQL, PostgreSQL, and MongoDB and seamlessly integrates with Apache Kafka for efficient data streaming. With Debezium, organizations can unlock the power of real-time analytics, data integration, and event-driven architectures. Its resilience, low latency, and scalability make it an indispensable tool for capturing and leveraging database changes.
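For context, each change event is a structured record that pairs the row's state before and after the change with metadata about its origin. Below is a trimmed, illustrative sketch of the event envelope for an insert; the field values are made up for this example:

{
  "payload": {
    "before": null,
    "after": { "id": 2, "name": "cloudthat" },
    "source": { "connector": "mysql", "db": "mysqldb", "table": "sample" },
    "op": "c",
    "ts_ms": 1690000000000
  }
}

Here "op": "c" denotes a create (insert); updates and deletes are marked "u" and "d".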
This solution is explained in three parts. In Part 1, we created a VPC and launched a private Amazon EC2 instance that can be reached over SSH without a bastion host, using an Amazon EC2 Instance Connect (EIC) Endpoint. In this second part, we will launch Amazon RDS MySQL, install Apache Kafka, and configure Debezium on the private Amazon EC2 instance.
Apache Kafka
Apache Kafka is an open-source distributed streaming platform. It has become the standard choice for building scalable, real-time data pipelines and event-driven architectures. Debezium will be integrated with Apache Kafka for efficient data streaming.
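As a quick illustration of the Kafka model we rely on below, here is a sketch of creating a topic, publishing one message, and reading it back. The commands assume the kafka directory and broker configured later in this post, and the demo topic name is made up:

# Create a topic, publish one message, and read it back
./kafka/bin/kafka-topics.sh --create --topic demo --bootstrap-server PrivateIP:9092
echo 'hello' | ./kafka/bin/kafka-console-producer.sh --topic demo --bootstrap-server PrivateIP:9092
./kafka/bin/kafka-console-consumer.sh --topic demo --from-beginning --bootstrap-server PrivateIP:9092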
Amazon RDS MySQL
Amazon RDS MySQL is a popular managed relational database service from Amazon Web Services (AWS). It offers a simplified, scalable way to deploy and manage MySQL databases in the cloud, automating time-consuming administrative tasks such as hardware provisioning, software patching, and database backups, and it supports high availability through replication options. Known for its reliability, performance, and ease of use, it is an excellent choice for applications that need a robust and scalable MySQL database. We will connect the Debezium connector to Amazon RDS MySQL.
Steps to Launch Amazon RDS MySQL
Step 1: Create a DB parameter group
Go to the Amazon RDS service -> Parameter Groups -> Create, as follows
Step 2: Once the parameter group is created, open it and click Edit parameters
Step 3: Set binlog_format to ROW and click Save changes
Step 4: Now let's create the Amazon RDS MySQL database
Step 5: Select the VPC created earlier
Step 6: Enter the initial database name, select the DB parameter group created earlier, and enable automated backups for at least 1 day; binary logs are retained only when backups are enabled, so CDC will not work without this.
Keep the other settings as default (edit if required) and click Create; the database takes a few minutes to become available. If you prefer the CLI, see the sketch after these steps.
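For readers who prefer the AWS CLI, here is a minimal sketch of the same steps. The parameter group name, engine family, instance identifier, instance class, and password below are assumptions; match them to your console choices:

# Create a parameter group and set binlog_format to ROW
aws rds create-db-parameter-group \
  --db-parameter-group-name logbin \
  --db-parameter-group-family mysql8.0 \
  --description "ROW binlog format for Debezium CDC"
aws rds modify-db-parameter-group \
  --db-parameter-group-name logbin \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate"

# Launch the instance with backups enabled so binlogs are retained
aws rds create-db-instance \
  --db-instance-identifier debezium-mysql \
  --engine mysql \
  --db-instance-class db.t3.micro \
  --allocated-storage 20 \
  --master-username admin \
  --master-user-password 'YourStrongPassword' \
  --db-name mysqldb \
  --db-parameter-group-name logbin \
  --backup-retention-period 1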
Steps to Install Apache Kafka and configure Debezium on Private Amazon EC2
Step 1: Install Apache Kafka and configure Debezium on the private Amazon EC2 instance
In Part 1, we connected to the Amazon EC2 instance over SSH using the Amazon EC2 Instance Connect Endpoint. Now run the following commands; the NAT Gateway provides the outbound internet access needed for the downloads.
sudo su
apt-get update
apt-get install -y wget net-tools netcat tar openjdk-8-jdk
wget https://archive.apache.org/dist/kafka/2.7.0/kafka_2.12-2.7.0.tgz
tar -xzf kafka_2.12-2.7.0.tgz
mv kafka_2.12-2.7.0 kafka
cd ./kafka/config/
vi zookeeper.properties        # press i for insert mode
# replace the existing dataDir line with the following, then press Escape and :wq to save and quit
dataDir=/root/zookeeper
rm -rf server.properties
vi server.properties           # press i, paste the block below, replace PrivateIP with the instance's private IP, then Escape and :wq

# starting of the code
broker.id=0
# listeners=PLAINTEXT://your.host.name:9092
advertised.listeners=PLAINTEXT://PrivateIP:9092
zookeeper.connect=PrivateIP:2181
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
auto.create.topics.enable=true
log.dirs=/home/ubuntu/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connection.timeout.ms=6000
# end of the code

cd /home/ubuntu
wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/1.3.1.Final/debezium-connector-mysql-1.3.1.Final-plugin.tar.gz
tar -xvzf debezium-connector-mysql-1.3.1.Final-plugin.tar.gz
cd kafka
mkdir connect
cd ..
sudo mv debezium-connector-mysql ./kafka/connect
vi ./kafka/config/connect-standalone.properties   # edit these two lines
bootstrap.servers=EC2PrivateIPAddress:9092
plugin.path=/home/ubuntu/kafka/connect/           # add this line
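A quick sanity check before moving on, with paths following the commands above:

# Confirm Java is installed and the connector jars sit under plugin.path
java -version
ls /home/ubuntu/kafka/connect/debezium-connector-mysql/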
Step 2: Connect to MySQL
sudo apt install mysql-server
mysql -h RDS-MySQL-Endpoint -u admin -p
Password: enter the password you set while creating the database.
Verify that binary logging is enabled and set to ROW:

show global variables like 'log_bin';
show global variables like 'binlog_format';
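If the parameter group and backups were configured correctly, the output should look roughly like this (illustrative):

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+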
use mysqldb;
create table sample (id int, name varchar(20));
insert into sample values(2,'cloudthat');
select * from sample;
GRANT ALL PRIVILEGES ON mysqldb.* TO 'admin'@'%';
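The admin user works here, but if you later switch to a dedicated Debezium user, the MySQL connector also needs replication-related privileges. A minimal sketch (the debezium user name and password are hypothetical):

CREATE USER 'debezium'@'%' IDENTIFIED BY 'dbzpass';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'debezium'@'%';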
Keep this tab open and duplicate it for the next steps.
Step 3: Debezium configuration
vi ./kafka/config/connect-debezium-mysql.properties

name=test-connector
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=RDS-ENDPOINT
database.port=3306
database.user=admin
database.password=adminpass
database.server.id=1
database.server.name=mysql
database.include.list=mysqldb
table.include.list=mysqldb.sample
database.history.kafka.bootstrap.servers=EC2PrivateIPAddress:9092
database.history.kafka.topic=dbhistory.test
include.schema.changes=true
tombstones.on.delete=false
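If you ever run Kafka Connect in distributed mode instead of standalone, the same connector can be registered through the Connect REST API on port 8083. The JSON below simply mirrors the properties above; it is a sketch and not part of this walkthrough:

curl -s -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "test-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "RDS-ENDPOINT",
    "database.port": "3306",
    "database.user": "admin",
    "database.password": "adminpass",
    "database.server.id": "1",
    "database.server.name": "mysql",
    "database.include.list": "mysqldb",
    "table.include.list": "mysqldb.sample",
    "database.history.kafka.bootstrap.servers": "EC2PrivateIPAddress:9092",
    "database.history.kafka.topic": "dbhistory.test",
    "include.schema.changes": "true",
    "tombstones.on.delete": "false"
  }
}'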
Step 4: Topic creation with Debezium and listing
sudo systemctl restart mysql
./kafka/bin/zookeeper-server-start.sh -daemon ./kafka/config/zookeeper.properties
./kafka/bin/kafka-server-start.sh -daemon ./kafka/config/server.properties
./kafka/bin/kafka-topics.sh --list --bootstrap-server EC2PrivateIPAddress:9092
The above command doesn't display any topics yet because the Debezium connector hasn't been started.
Run the following as a single command:

./kafka/bin/connect-standalone.sh ./kafka/config/connect-standalone.properties ./kafka/config/connect-debezium-mysql.properties
Open a duplicate tab and check whether the topics were created:
./kafka/bin/kafka-topics.sh --list --bootstrap-server EC2PrivateIPAddress:9092
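If everything is wired up correctly, the list should include topics along these lines; Debezium names table topics serverName.databaseName.tableName, so the output below is illustrative:

# Illustrative output:
# dbhistory.test
# mysql
# mysql.mysqldb.sample

# To watch change events for the sample table:
./kafka/bin/kafka-console-consumer.sh --topic mysql.mysqldb.sample --from-beginning --bootstrap-server EC2PrivateIPAddress:9092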
We will continue in Part 3, where we will perform CRUD operations on the MySQL database to verify that Debezium is working.
Conclusion
In the above process, we installed Apache Kafka on Amazon EC2 and used the Debezium connector to capture CDC data into Kafka topics. Multiple tables, across multiple databases, can be configured in the Debezium configuration file; a topic is created for each table specified there, and the CDC data is sent to the respective topics.
Drop a query if you have any questions regarding Apache Kafka, and we will get back to you quickly.
About CloudThat
CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all stakeholders in the cloud computing sphere.
To get started, go through our Consultancy page and Managed Services Package, CloudThat's offerings.
FAQs
1. What if I get a port 8083 error?
ANS: – Use lsof -i :8083 to identify the process holding the port, kill it, and run the command again. See the sketch below.
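A minimal sketch of that fix (the PID is whatever lsof reports):

sudo lsof -i :8083        # find the process holding the port
sudo kill -9 <PID>        # stop it, then rerun connect-standalone.sh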
2. Can we have more than one table?
ANS: – Yes, multiple tables can be listed in table.include.list, separated by commas, as shown below.
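For example, in connect-debezium-mysql.properties (the extra database and table names here are hypothetical):

database.include.list=mysqldb,inventorydb
table.include.list=mysqldb.sample,mysqldb.orders,inventorydb.products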
WRITTEN BY Suresh Kumar Reddy
Yerraballi Suresh Kumar Reddy is working as a Research Associate - Data and AI/ML at CloudThat. He is a self-motivated and hard-working Cloud Data Science aspirant who is adept at using analytical tools for analyzing and extracting meaningful insights from data.