Airbyte tuning capabilities #48436
rafaelmariotti-bellesa asked this question in Deployment
Hi,
I've been trying Airbyte for a few weeks now in a Kubernetes cluster (EKS on AWS, deployed with the Airbyte Helm chart version 1.2.0), with a dedicated Postgres database on AWS RDS. After some struggling, I finally got it working.
I started adding a few sources and destinations, and some of them are fast enough. For some, though, Shopify in this case, the first sync takes far too long. I'm trying to sync one year of data, and after 12 hours of execution I gave up waiting for it.
I've read lots of posts here, and also on the old discussion board, where I found some configurations that are deprecated and others that helped me improve my environment a bit. Here's what I have so far:
EKS configuration (sketched in eksctl form below):
- Node group 1: m6i.xlarge (4 vCPU, 16 GB), minimum/desired of 1 instance, maximum of 5
- Node group 2: m6i.2xlarge (8 vCPU, 32 GB), minimum/desired of 1 instance, maximum of 5
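In eksctl terms, the two node groups look roughly like this (a simplified sketch, not my actual config; the cluster name, region, node group names, and the airbyte-pool labels are placeholders):

```yaml
# Hypothetical eksctl config mirroring the node groups above (placeholder names/labels).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: airbyte          # placeholder cluster name
  region: us-east-1      # placeholder region
managedNodeGroups:
  - name: airbyte-core   # node group 1: everything except workers/jobs
    instanceType: m6i.xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 5
    labels:
      airbyte-pool: core
  - name: airbyte-jobs   # node group 2: worker, launcher, and job pods
    instanceType: m6i.2xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 5
    labels:
      airbyte-pool: jobs
```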
Following the Airbyte documentation, I used nodeSelector to dedicate node group 2 exclusively to the worker, workload-launcher, orchestrator, workload-api-server, and the job pods; all the other deployments go to node group 1. This is the values.yaml I have configured so far, and with this configuration I was able to make Airbyte work.
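The pinning itself looks roughly like the sketch below rather than my exact file (this assumes the 1.x chart exposes per-component nodeSelector keys and global.jobs.kube.nodeSelector for the job/orchestrator pods; the airbyte-pool label is the placeholder from above):

```yaml
# Simplified sketch, not the actual values.yaml: pin workers and job pods to node group 2.
worker:
  nodeSelector:
    airbyte-pool: jobs
workload-launcher:
  nodeSelector:
    airbyte-pool: jobs
workload-api-server:
  nodeSelector:
    airbyte-pool: jobs
global:
  jobs:
    kube:
      nodeSelector:        # applies to the job and orchestrator pods spawned for syncs
        airbyte-pool: jobs
```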
However, what I want now is to make it faster.
I've tried lots of different things:
- Changing the MAX_*_WORKERS and WORKLOAD_LAUNCHER_PARALLELISM variables, but that seems to have no effect (see the sketch after this list for how they are injected).
- The user_id and presentment prices options in the Shopify source.
- Setting containerOrchestrator inside the worker and workload-launcher sections of the values.yaml file, but that also seems to have no effect (I checked the worker variables, and CONTAINER_ORCHESTRATOR_ENABLED is always set to true).

I also noticed in the logs for this replication that the flush worker is taking a lot of time to copy the files to the staging tables in Snowflake. The log I looked at is for a single stream running, no parallel streams.
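For reference, variables like these can be injected through the chart roughly as in the sketch below (this assumes the worker and workload-launcher sections accept an extraEnv list in this chart version; the numeric values are placeholders, not a recommendation):

```yaml
# Sketch only: injecting tuning variables via extraEnv (placeholder values).
worker:
  extraEnv:
    - name: MAX_SYNC_WORKERS
      value: "10"
    - name: MAX_CHECK_WORKERS
      value: "10"
workload-launcher:
  extraEnv:
    - name: WORKLOAD_LAUNCHER_PARALLELISM
      value: "10"
```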
Now, a few questions:
- Are WORKLOAD_LAUNCHER_PARALLELISM and MAX_*_WORKERS still valid in the current versions of Airbyte? If not, what should I use instead?
- Any consideration for tuning is welcome here, ANYTHING! :)
Thank you