Airbyte tuning capabilities #48436
rafaelmariotti-bellesa asked this question in Deployment
Hi,
I've been trying Airbyte for a few weeks now in a Kubernetes cluster (EKS on AWS, deployed with the Airbyte Helm chart version 1.2.0), with a dedicated Postgres database on AWS RDS. After some struggling, I finally got it working.
I started adding a few sources and destinations, and some of them are fast enough. For some, though, Shopify in this case, the first sync takes far too long. I'm trying to sync one year of data, and after 12 hours of execution I gave up waiting for it.
I've read lots of posts here, and also on the old discussion board, where I found some configurations that are deprecated and others that helped me improve my environment a bit. Here's what I have so far:
EKS configuration (sketched in eksctl form below):
- Node group 1: m6i.xlarge (4 vCPU, 16 GB), minimum/desired of 1 instance, maximum of 5
- Node group 2: m6i.2xlarge (8 vCPU, 32 GB), minimum/desired of 1 instance, maximum of 5
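In eksctl terms, the two node groups look roughly like this (a simplified sketch, not my actual config; the cluster name, region, node group names, and the airbyte-pool labels are placeholders):

```yaml
# Hypothetical eksctl config mirroring the node groups above (placeholder names/labels).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: airbyte          # placeholder cluster name
  region: us-east-1      # placeholder region
managedNodeGroups:
  - name: airbyte-core   # node group 1: everything except workers/jobs
    instanceType: m6i.xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 5
    labels:
      airbyte-pool: core
  - name: airbyte-jobs   # node group 2: worker, launcher, and job pods
    instanceType: m6i.2xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 5
    labels:
      airbyte-pool: jobs
```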
Following the Airbyte documentation, I used nodeSelector to dedicate node group 2 exclusively to the worker, workload-launcher, orchestrator, workload-api-server, and the job pods; all the other deployments go to node group 1. This is the values.yaml I have configured so far, and with this configuration I was able to make Airbyte work.
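The pinning itself looks roughly like the sketch below rather than my exact file (this assumes the 1.x chart exposes per-component nodeSelector keys and global.jobs.kube.nodeSelector for the job/orchestrator pods; the airbyte-pool label is the placeholder from above):

```yaml
# Simplified sketch, not the actual values.yaml: pin workers and job pods to node group 2.
worker:
  nodeSelector:
    airbyte-pool: jobs
workload-launcher:
  nodeSelector:
    airbyte-pool: jobs
workload-api-server:
  nodeSelector:
    airbyte-pool: jobs
global:
  jobs:
    kube:
      nodeSelector:        # applies to the job and orchestrator pods spawned for syncs
        airbyte-pool: jobs
```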
However, what I want now is to make it faster.
I've tried lots of different things:
- Changing the MAX_*_WORKERS and WORKLOAD_LAUNCHER_PARALLELISM variables, but that seems to have no effect (see the sketch after this list for how they are injected).
- The user_id and presentment prices options in the Shopify source.
- Setting containerOrchestrator inside the worker and workload-launcher sections of the values.yaml file, but that also seems to have no effect (I checked the worker variables, and CONTAINER_ORCHESTRATOR_ENABLED is always set to true).

I also noticed in the logs for this replication that the flush worker is taking a lot of time to copy the files to the staging tables in Snowflake. The log I looked at is for a single stream running, no parallel streams.
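For reference, variables like these can be injected through the chart roughly as in the sketch below (this assumes the worker and workload-launcher sections accept an extraEnv list in this chart version; the numeric values are placeholders, not a recommendation):

```yaml
# Sketch only: injecting tuning variables via extraEnv (placeholder values).
worker:
  extraEnv:
    - name: MAX_SYNC_WORKERS
      value: "10"
    - name: MAX_CHECK_WORKERS
      value: "10"
workload-launcher:
  extraEnv:
    - name: WORKLOAD_LAUNCHER_PARALLELISM
      value: "10"
```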
Now, a few questions:
- Are WORKLOAD_LAUNCHER_PARALLELISM and MAX_*_WORKERS still valid in the current versions of Airbyte? If not, what should I use instead?
- Any consideration for tuning is welcome here, ANYTHING! :)
Thank you