Clarification Needed on Airbyte Job Parallelism and Worker Roles in GKE Deployment #42439
-
Hello, I have Airbyte deployed in GKE using Helm chart 0.248.5 and Airbyte version 0.63.4, and I'm trying to understand how scaling Airbyte works internally. I've read several threads, blogs, and docs about job parallelism in Airbyte, but it's still unclear to me how some of the components mentioned work together.

Here is what I understand so far: when a sync job starts, several pods are initialized (not all at the same time), such as check, orchestrator, discovery, source, and destination. I also understand the role of the MAX_*_WORKERS parameters in the Helm values (a rough sketch of our values is below), and the role of the Temporal ports in communicating with these created pods. There is a limit of 40 TEMPORAL_WORKER_PORTS for communicating with pods, so I assume at most 40 pods can communicate with the Temporal DB at once.

What I don't understand is the role of the airbyte-worker deployment. There seems to be only one persistent pod backing it, and I'm not sure what it does in relation to the actual workers that run the jobs. A Daspire document (https://docs.daspire.com/deploying-airbyte/on-kubernetes/#increasing-job-parallelism) says, "The number of worker pods can be changed by increasing the number of replicas for the airbyte-worker deployment," but I have no idea how old that document is or whether it's still relevant. Additionally, my company has a horizontal pod autoscaler configured on the airbyte-worker deployment, scaling from 1 to 50 replicas, but when I've monitored it during periods with a high number of jobs, the replica count always stays at 1, so I'm not even sure it's necessary.

I've also noticed that Airbyte sometimes uses the words "jobs" and "workers" interchangeably and sometimes with specific meanings, so I'm confused about the airbyte-worker pod in Kubernetes. Is it really a worker, or are only the pods created to execute sync jobs considered workers? What does airbyte-worker do in Kubernetes? And finally, is it still relevant to create more airbyte-worker replicas to increase job concurrency? Thank you for your help!
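For reference, here is roughly what the relevant part of our values.yaml looks like. This is a sketch from memory: the key paths (worker.extraEnv, etc.) and the numbers are illustrative and may differ between chart versions.

```yaml
# Sketch of our Helm values (illustrative; exact key paths and defaults
# vary between chart versions, so verify against your chart's schema).
worker:
  replicaCount: 1              # the persistent airbyte-worker deployment
  extraEnv:
    - name: MAX_SYNC_WORKERS
      value: "10"              # max concurrent sync jobs per worker
    - name: MAX_CHECK_WORKERS
      value: "5"               # max concurrent connection-check jobs
    - name: MAX_DISCOVER_WORKERS
      value: "5"               # max concurrent schema-discovery jobs
    - name: TEMPORAL_WORKER_PORTS
      value: "9001,9002,9003"  # comma-separated list; ~40 ports by default
```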
-
In Kubernetes deployments, the airbyte-worker deployment has fewer responsibilities because it delegates to a newer job called the container orchestrator (whereas in a docker compose deployment this is all on the worker). More on that here: https://docs.airbyte.com/understanding-airbyte/jobs#decoupling-worker-and-job-processes

(Some of the "whys" of the orchestrator pod listed there will also probably give you an idea of which of the sources you're reading are current or out of date.)

This leaves the worker primarily responsible for initiating the orchestrator (which in turn launches the appropriate source/read and destination/write pods), monitoring state, and handling any additional interventions needed. When this change was made, the need to scale the worker deployment became much less common. In our case we still occasionally see contention or timeouts when initiating/running many concurrent jobs (>100), so we run 2 replicas of the worker deployment (sketch below).
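If it helps, bumping the worker replica count is just a Helm values change. A minimal sketch, assuming a recent chart version where the worker is configured under a `worker` key (key paths vary between chart versions, so check your chart's values schema):

```yaml
# values.yaml (sketch): run two airbyte-worker replicas.
# The exact key path depends on your Helm chart version.
worker:
  replicaCount: 2
```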
The most current docs on scaling are probably the Self-Managed Enterprise docs. Those apply equally well in general to OSS deployments (it's the same codebase), but their team seems to be working hard on bringing all the other docs current as well (most recently updating and greatly improving the Deployment docs).

P.S. Regarding the interchangeable usage of terms, I think some of it comes from generic concepts vs. deployment names. But also, naming things is the hardest part of development and docs writing 😂