CDC Ingestion Architecture

  • This repo is a proof of concept for a Change Data Capture (CDC) system built with Debezium and Spark Structured Streaming.
  • Written in Python; Docker Compose is used for the demonstration.

Demo directories

.
├── docs/               # attached documentation
├── env/                # service environments
├── main.py             # main Spark streaming service (sketched below)
├── pyspark.Dockerfile  # base image for running the Spark streaming service
├── schemas/            # Avro schema registry interfaces
└── svc/                # service utility package
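
The entry point main.py loads a stream class given by --cls and hands it the JSON options passed via --opts (see the Use cases below). The following is only a hypothetical sketch of such a dispatcher, assuming the stream classes accept the options as keyword arguments and expose a run() method; the actual svc package may differ.

# Hypothetical sketch of main.py's dispatch logic, not the repo's actual code:
# load the class named by --cls and hand it the decoded --opts JSON.
import argparse
import importlib
import json


def load_class(dotted_path):
    """Import a class from a dotted path such as 'svc.streams.CDCMySQLTable'."""
    module_name, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Launch a streaming service")
    parser.add_argument("--cls", required=True, help="dotted path of the stream class")
    parser.add_argument("--opts", default="{}", help="JSON-encoded options for the stream")
    args = parser.parse_args()

    stream_cls = load_class(args.cls)
    stream = stream_cls(**json.loads(args.opts))   # e.g. app_name, table_name
    stream.run()                                   # assumed entry method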

Instructions

  • To launch the POC demo:
$ docker-compose build && docker-compose up -d
  • To follow the logs of a particular service:
$ docker-compose logs -f <service>
  • To tear down all services:
$ docker-compose down

Use cases

  1. Table CDC Streaming
  • Captures data changes on a specific table; a simplified stream sketch follows this use case.
  • Usage
$ python main.py \
      --cls "svc.streams.CDCMySQLTable" \
      --opts "{\"app_name\": \"Customers Table CDC Stream\", \"table_name\": \"customers\"}"
  • Tables with the _cdc suffix capture data changes for their respective source tables.

[Screenshot: Screen Shot 2022-05-18 at 09 27 15]

  • For instance, the customers table schema:

[Screenshot: Screen Shot 2022-05-18 at 09 27 40]

  • The CDC schema of the customers table:

[Screenshot: Screen Shot 2022-05-18 at 09 27 34]

  • Notes:
    • the op field captures the action performed on the source table: c -> create, u -> update, d -> delete
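
Under the hood, such a table stream would read the table's Debezium topic from Kafka and unpack the change envelope. Below is a simplified, self-contained sketch of that idea, assuming JSON-encoded Debezium messages with the default envelope (the repo itself uses Avro with the schema registry) and the demo topic dbserver1.inventory.customers.

# Simplified sketch of a per-table CDC stream (assumes JSON-encoded Debezium
# messages with the default envelope; the repo itself uses Avro + schema registry).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import MapType, StringType, StructField, StructType

spark = SparkSession.builder.appName("Customers Table CDC Stream").getOrCreate()

# Debezium change envelope: before/after row images plus the operation code.
envelope = StructType([
    StructField("before", MapType(StringType(), StringType())),
    StructField("after", MapType(StringType(), StringType())),
    StructField("op", StringType()),     # c = create, u = update, d = delete
    StructField("ts_ms", StringType()),
])
message = StructType([StructField("payload", envelope)])

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "dbserver1.inventory.customers")   # demo topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), message).alias("msg"))
    .select("msg.payload.op", "msg.payload.before", "msg.payload.after")
)

query = changes.writeStream.format("console").outputMode("append").start()
query.awaitTermination()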
  2. Click Through Rate Streaming (WIP)
  • Joins the clicks and impressions streams to compute and emit the click-through rate; a join sketch follows below.
  • Usage
$ python main.py \
      --cls "svc.streams.ClickThroughRateStreaming" \
      --opts "{\"app_name\": \"Click Through Rate Stream\", \"click_table\": \"clicks\", \"impression_table\": \"impressions\"}"
  • [WIP] Populate dummy data
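
As a rough illustration of how the clicks/impressions join could be expressed with Spark Structured Streaming, here is a sketch using watermarks and a time-range join condition. The topic and column names are assumptions for the demo, not necessarily the repo's actual schema, and the CTR ratio itself would be computed downstream of the joined stream.

# Illustrative sketch of the clicks/impressions stream-stream join behind the
# CTR computation; topic and column names are assumptions, not the repo's schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("Click Through Rate Stream").getOrCreate()

def read_events(topic, id_alias, time_alias):
    """Read a Kafka topic, exposing the record key and timestamp under stream-specific names."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", topic)
        .load()
        .selectExpr(f"CAST(key AS STRING) AS {id_alias}", f"timestamp AS {time_alias}")
    )

impressions = read_events("dbserver1.inventory.impressions", "impressionAdId", "impressionTime")
clicks = read_events("dbserver1.inventory.clicks", "clickAdId", "clickTime")

# Watermarks bound the join state; a click is matched to an impression
# seen within the preceding hour.
joined = (
    impressions.withWatermark("impressionTime", "2 hours")
    .join(
        clicks.withWatermark("clickTime", "3 hours"),
        expr(
            """
            clickAdId = impressionAdId AND
            clickTime >= impressionTime AND
            clickTime <= impressionTime + interval 1 hour
            """
        ),
        "leftOuter",  # keep unclicked impressions so CTR can be derived downstream
    )
)

query = joined.writeStream.format("console").outputMode("append").start()
query.awaitTermination()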

Notes

  • To launch a concrete stream attached to existing systems:
$ docker-compose build  # to build cdc-ingestion image 
$ docker run -it --rm --name cdc-addresses-stream \
    --network your-compose-network \
    --volume ${PWD}:/app \
    --env-file env/kafka.env \
    --env-file env/mysql.env \
    --env-file env/connectors.env \
    --env-file env/streamsvc.env \
    cdc-ingestion \
    main.py \
      --cls "svc.streams.CDCMySQLTable" \
      --opts "{\"app_name\": \"arbitrary stream\", \"table_name\": \"addresses\"}"
  • To run a Kafka consumer on a particular topic for debugging:
$ docker run -it --rm --name avro-consumer \
    --network your-compose-network \
    debezium/connect:1.9 \
    /kafka/bin/kafka-console-consumer.sh \
      --bootstrap-server kafka:9092 \
      --property print.key=true \
      --formatter io.confluent.kafka.formatter.AvroMessageFormatter \
      --property schema.registry.url=http://schema-registry:8081 \
      --topic dbserver1.inventory.customers
