This guide provides a step-by-step walkthrough to set up a streaming pipeline from an OLTP database to Kafka and finally to ClickHouse using Debezium, Kafka Connect, and the ClickHouse Sink Connector in a Kubernetes environment.
- Step 1: Deploy Kafka Connect with Debezium
- Step 2: Deploy Debezium Source Connector for PostgreSQL
- Step 3: Deploy ClickHouse
- Step 4: Deploy ClickHouse Sink Connector
- Step 5: Verify the Setup
- Step 6: ClickHouse Integration in Metabase (Kubernetes)
- Documentation Reference
To use the io.debezium.connector.postgresql.PostgresConnector
and com.clickhouse.kafka.connect.ClickHouseSinkConnector
in your Kafka Connect setup, you need to create a custom Docker image with these plugins.
The Kafka Connect image build is now automated using a Cloud Build CI/CD pipeline. This includes:
- Debezium PostgreSQL Connector:
io.debezium.connector.postgresql.PostgresConnector
- ClickHouse Sink Connector:
com.clickhouse.kafka.connect.ClickHouseSinkConnector
- Custom SMT Transformation Plugins: For data transformation during streaming
The custom Kafka Connect image is automatically built via Cloud Build CI/CD when changes are pushed to the repository. The build process:
- Triggers: Automatically triggered on code changes
- Build Process: Uses devops-kafka-connect/cloudbuild.yaml
- Plugin Installation: Automated via install-plugins.sh scripts
- Registry Push: Automatically pushes to `asia-docker.pkg.dev/sbx-ci-cd/public/devops-kafka-connect:latest`
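For reference, a minimal Dockerfile sketch for such a custom image is shown below. The base image, tag, and plugin path are assumptions; the actual image is defined by `devops-kafka-connect/cloudbuild.yaml` and `install-plugins.sh`:

```dockerfile
# Hypothetical sketch; the real build is driven by cloudbuild.yaml and install-plugins.sh.
FROM confluentinc/cp-kafka-connect:7.6.0

# Directory scanned by Kafka Connect for plugins (plugin.path)
ENV CONNECT_PLUGIN_PATH=/usr/share/confluent-hub-components

# install-plugins.sh is assumed to download the Debezium PostgreSQL connector,
# the ClickHouse sink connector, and the custom SMT jars into the plugin path.
COPY install-plugins.sh /tmp/install-plugins.sh
RUN /tmp/install-plugins.sh "${CONNECT_PLUGIN_PATH}"
```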
Kafka Connect is required to run both Debezium (for Change Data Capture) and the ClickHouse Sink Connector (for storing data in ClickHouse).
- Create a `kafka-connect.yaml` file with the necessary configurations for Kafka Connect (a hedged sketch follows this list).
- Apply the manifest to your Kubernetes cluster:
  `kubectl apply -f kafka/kafka-connect.yaml`
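A minimal sketch of `kafka-connect.yaml`, assuming a plain Kubernetes Deployment that runs the custom image in distributed mode with Confluent-style `CONNECT_*` environment variables; your actual manifest (for example a Strimzi `KafkaConnect` resource) may look quite different:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-connect
  namespace: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-connect
  template:
    metadata:
      labels:
        app: kafka-connect
    spec:
      containers:
        - name: kafka-connect
          image: asia-docker.pkg.dev/sbx-ci-cd/public/devops-kafka-connect:latest
          ports:
            - containerPort: 8083   # Kafka Connect REST API
          env:
            - name: CONNECT_BOOTSTRAP_SERVERS
              value: "<kafka-url>:9092"
            - name: CONNECT_GROUP_ID
              value: "kafka-connect"
            - name: CONNECT_CONFIG_STORAGE_TOPIC
              value: "connect-configs"
            - name: CONNECT_OFFSET_STORAGE_TOPIC
              value: "connect-offsets"
            - name: CONNECT_STATUS_STORAGE_TOPIC
              value: "connect-status"
            - name: CONNECT_KEY_CONVERTER
              value: "org.apache.kafka.connect.json.JsonConverter"
            - name: CONNECT_VALUE_CONVERTER
              value: "org.apache.kafka.connect.json.JsonConverter"
            - name: CONNECT_REST_ADVERTISED_HOST_NAME
              value: "kafka-connect"
```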
Verify the Plugins: Once the Kafka Connect pod is running, verify that the plugins are available:
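For example, assuming the Kafka Connect REST API is reachable on port 8083 (e.g. via `kubectl port-forward`):

```bash
# List the connector plugins installed in this Kafka Connect worker
curl -s http://localhost:8083/connector-plugins
```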
You should see both io.debezium.connector.postgresql.PostgresConnector
and com.clickhouse.kafka.connect.ClickHouseSinkConnector
in the list of available plugins.
Debezium monitors PostgreSQL and publishes changes to Kafka.
Run the following SQL commands in your PostgreSQL database to enable logical replication:
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
SELECT pg_reload_conf();
Note: `wal_level`, `max_replication_slots`, and `max_wal_senders` only take effect after a PostgreSQL restart; `pg_reload_conf()` alone does not apply them.
Create a replication user:
CREATE USER debezium WITH REPLICATION ENCRYPTED PASSWORD '<your_password>';
GRANT USAGE ON SCHEMA public TO debezium;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO debezium;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO debezium;
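To confirm the role exists and has the replication attribute, you can query `pg_roles`:

```sql
-- Verify that the debezium role was created with REPLICATION
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'debezium';
```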
Create a publication for the tables you want to monitor:
-- Sample SQL to create a publication on selected tables
CREATE PUBLICATION dbz_publication
FOR TABLE
public.table_name1,
public.table_name2,
public.table_name3;
Adding Tables to an Existing PostgreSQL Publication
To add more tables to your existing dbz_publication, you can use the ALTER PUBLICATION command:
ALTER PUBLICATION dbz_publication
ADD TABLE public.new_table_name1, public.new_table_name2;
Replace new_table_name1
and new_table_name2
with the actual names of the tables you want to add.
How to Remove Tables from a Publication
ALTER PUBLICATION dbz_publication DROP TABLE public.table_name;
Replace table_name with the actual name of the table you want to drop from the publication.
Verify the replication slots and publications:
SELECT * FROM pg_replication_slots;
SELECT * FROM pg_publication;
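You can also list the tables currently included in the publication:

```sql
-- Show which tables are part of dbz_publication
SELECT * FROM pg_publication_tables WHERE pubname = 'dbz_publication';
```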
To clean up, use:
SELECT pg_drop_replication_slot('dbz');
DROP PUBLICATION dbz_publication;
- Create a `debezium-postgres.json` file with the connector configuration (a hedged sample is shown after this list).
- To capture all tables in the database, use:
  `"table.include.list": "public.*"`
  Or remove this line entirely to capture all tables by default.
- Deploy the connector using the sample config:
  `./insights-infra/Aiven/connectors/backbone_debezium_postgres.sh`

To delete the connector (example command):
`curl -X DELETE http://localhost:8083/connectors/<connector>`
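For reference, a minimal `debezium-postgres.json` might look like the sketch below. The connector name, database host, port, credentials, database name, and table list are placeholders for your environment; `slot.name`, `publication.name`, and the `postgres` topic prefix follow the examples used elsewhere in this guide:

```json
{
  "name": "debezium-postgres",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "<postgres-host>",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "<your_password>",
    "database.dbname": "<your_database>",
    "topic.prefix": "postgres",
    "plugin.name": "pgoutput",
    "slot.name": "dbz",
    "publication.name": "dbz_publication",
    "publication.autocreate.mode": "disabled",
    "table.include.list": "public.table_name1,public.table_name2"
  }
}
```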
ClickHouse will store the data streamed from Kafka.
- Create a `clickhouse.yaml` file with the necessary configurations.
- Use Helm to deploy ClickHouse:

```bash
helm show values oci://registry-1.docker.io/bitnamicharts/clickhouse --version 8.0.10 > clickhouse/helm/values.yaml
helm install clickhouse oci://registry-1.docker.io/bitnamicharts/clickhouse -f clickhouse/helm/values.yaml --version 8.0.10 -n kafka
helm upgrade clickhouse oci://registry-1.docker.io/bitnamicharts/clickhouse -f clickhouse/helm/values.yaml --version 8.0.10 -n kafka
```
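Once the chart is installed, you can verify that the ClickHouse pods are up (the label and pod name below follow Bitnami chart defaults and may differ in your setup):

```bash
# Check the ClickHouse pods created by the Bitnami chart
kubectl get pods -n kafka -l app.kubernetes.io/name=clickhouse

# Run a test query from inside a pod; credentials from your values.yaml may be required
kubectl exec -it clickhouse-shard0-0 -n kafka -- clickhouse-client --query "SELECT version()"
```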
The ClickHouse Sink Connector pulls data from Kafka and inserts it into ClickHouse.
Run the following SQL in ClickHouse to create a table:
-- Sample query to create a table in ClickHouse
CREATE DATABASE sbx;
CREATE TABLE sbx.wms (
    id Int64,
    name String,
    created_at DateTime,
    updatedAt DateTime
) ENGINE = ReplacingMergeTree(updatedAt)
ORDER BY (name, id);
Create all required tables using the SQL queries in the `sql-queries` folder.
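Because the sample table uses the ReplacingMergeTree engine keyed on `updatedAt`, duplicate rows for the same `ORDER BY` key are only collapsed when background merges run. To force deduplicated results at query time you can use the `FINAL` modifier:

```sql
-- Deduplicate at read time (slower than a plain SELECT)
SELECT * FROM sbx.wms FINAL;
```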
- Create a `clickhouse-sink.json` file with the connector configuration (a hedged sample is shown after the commands below).
- Ensure all `topics` and `topic2TableMap` entries are present.
- Deploy the connector:
  `curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d @clickhouse/samples/clickhouse-sink.json`
To update the connector:
curl -X PUT http://localhost:8083/connectors/clickhouse-connect/config -H "Content-Type: application/json" -d @clickhouse/clickhouse-sink-update.json
To delete the connector:
curl -X DELETE http://localhost:8083/connectors/clickhouse-connect
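For reference, a minimal `clickhouse-sink.json` might look like the sketch below. The connector name matches the `clickhouse-connect` name used in the commands above; the hostname, credentials, topics, and table mappings are placeholders for your environment:

```json
{
  "name": "clickhouse-connect",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "postgres.public.table_name1,postgres.public.table_name2",
    "hostname": "clickhouse.kafka.svc.cluster.local",
    "port": "8123",
    "database": "sbx",
    "username": "default",
    "password": "<your_password>",
    "ssl": "false",
    "topic2TableMap": "postgres.public.table_name1=table_name1,postgres.public.table_name2=table_name2"
  }
}
```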
List all Kafka topics:
kubectl exec -it <kafka-pod-name> -n dev -- kafka-topics.sh --list --bootstrap-server <kafka-url>:9092
You should see a topic like `postgres.public.your_table`.
Consume messages from a Kafka topic:
kubectl exec -it <kafka-pod-name> -n dev -- kafka-console-consumer.sh --topic postgres.public.ccs_cedge --from-beginning --bootstrap-server <kafka-url>:9092
Query the data in ClickHouse:
clickhouse-client --host clickhouse --query "SELECT * FROM your_db.your_table;"
Run the following query to check the number of threads:
SELECT
name,
value
FROM system.settings
WHERE name = 'max_threads';
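If you need a different value for the current session, you can override it (the value below is just an example; tune it to your hardware):

```sql
-- Override max_threads for the current session only
SET max_threads = 8;
```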
Metabase is an easy-to-use, open-source UI tool for asking questions about your data. It connects to ClickHouse using a JDBC driver. Follow these steps to deploy Metabase in a Kubernetes environment and integrate it with ClickHouse.
- Download the latest ClickHouse plugin JAR using an initContainer and mount the plugin via a shared volume:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metabase
  labels:
    app: metabase
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metabase
  template:
    metadata:
      labels:
        app: metabase
    spec:
      containers:
        - name: metabase
          image: metabase/metabase:latest
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: plugins
              mountPath: /plugins
          env:
            - name: MB_PLUGINS_DIR
              value: "/plugins"
      initContainers:
        - name: download-clickhouse-plugin
          image: curlimages/curl:latest
          command:
            [
              "sh",
              "-c",
              "curl -L https://github.com/clickhouse/metabase-clickhouse-driver/releases/latest/download/clickhouse.metabase-driver.jar -o /plugins/clickhouse.metabase-driver.jar"
            ]
          volumeMounts:
            - name: plugins
              mountPath: /plugins
      volumes:
        - name: plugins
          emptyDir: {}
```
- Apply the deployment and wait for the Metabase pod to load the new driver.
- Access Metabase and use the driver.
- Open the Metabase UI in your browser (`https://<EXTERNAL-URL>`).
- Go to Settings > Admin > Databases.
- Click Add Database.
- Fill in the following details:
  - Database Type: Select `ClickHouse`.
  - Host: Enter the hostname or IP address of your ClickHouse service.
  - Port: Enter the ClickHouse port (default is `8123`).
  - Database Name: Enter the name of your ClickHouse database (e.g., `sbx`).
  - Username: Enter your ClickHouse username.
  - Password: Enter your ClickHouse password.
- Click Save to add the database.
Once the database is added, you can start creating dashboards and visualizations and asking questions about your data in ClickHouse using Metabase.
- Ensure all Kubernetes resources are properly configured and deployed.
- Use secure credentials and secrets for sensitive configurations.
- Monitor the pipeline for errors and performance bottlenecks.
- Kafka Client Authentication for Google Cloud
- ClickHouse Kafka Connect Sink
- PostgreSQL Interface
- Avoid Nullable Columns
- Optimizing Types
- Denormalizing Data
- Connecting Metabase to ClickHouse
- I've operated petabyte-scale ClickHouse® clusters for 5 years
- Using Materialized Views in ClickHouse
- Materialized Views and JOINs
- Conditional Functions
- Refreshable Materialized View
- Handling DELETE operations with ClickHouse Kafka Connect Sink
- is_deleted