
Implement traffic distribution for reliability, latency and cost in Kubernetes cloud providers #675

Open
mlophez opened this issue Sep 4, 2024 · 6 comments

Comments

@mlophez

mlophez commented Sep 4, 2024

In cloud-based Kubernetes environments, it is common to have multiple availability zones within the same cluster.

Starting from Kubernetes 1.31, kube-proxy implements a type of load balancing that prioritizes keeping traffic within the same availability zone. This ensures better latencies and significant cost savings, as traffic crossing availability zones can be expensive.

https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution
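
For reference, here is a minimal sketch (my own illustration, not taken from the controller) of how that feature is set on a Service using the Go client libraries, assuming the TrafficDistribution field that ships with the 1.31 API:

  // Minimal sketch: a Service asking kube-proxy to prefer same-zone endpoints.
  // The TrafficDistribution field name is assumed from the 1.31 core/v1 API.
  package main

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/utils/ptr"
  )

  func exampleService() *corev1.Service {
      return &corev1.Service{
          ObjectMeta: metav1.ObjectMeta{Name: "my-service"},
          Spec: corev1.ServiceSpec{
              Selector: map[string]string{"app": "my-app"},
              Ports:    []corev1.ServicePort{{Port: 80}},
              // "PreferClose" asks kube-proxy to keep traffic in the client's zone when possible.
              TrafficDistribution: ptr.To("PreferClose"),
          },
      }
  }

  func main() {
      fmt.Println(*exampleService().Spec.TrafficDistribution)
  }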

Since the HAProxy Ingress Controller doesn't use Kubernetes services but directly updates the backends with the pod IPs, I have the following question:

Would it be possible to have an option for HAProxy to route traffic only to the pods within the same availability zone?

This could be achieved by detecting the availability zone where the controller is running at startup, and then registering the SRVs from the same AZ as normal and the others as backups.
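
To make the idea concrete, here is a rough sketch of that logic; the names are hypothetical and not taken from the controller's code, and the controller's own zone could be read, for example, from its node's topology.kubernetes.io/zone label:

  // Hypothetical sketch of the proposed behaviour: endpoints outside the
  // controller's availability zone are registered as backup servers.
  package main

  import "fmt"

  // haproxyServer is a simplified stand-in for a backend server line.
  type haproxyServer struct {
      Name   string
      Addr   string
      Backup bool
  }

  // buildServers marks every endpoint outside the controller's own zone as a
  // backup, so HAProxy only uses it when no local endpoint is available.
  func buildServers(controllerZone string, endpointZones map[string]string) []haproxyServer {
      var servers []haproxyServer
      i := 0
      for addr, zone := range endpointZones {
          i++
          servers = append(servers, haproxyServer{
              Name:   fmt.Sprintf("SRV_%d", i),
              Addr:   addr,
              Backup: zone != controllerZone, // cross-zone endpoints become backups
          })
      }
      return servers
  }

  func main() {
      servers := buildServers("eu-west-1a", map[string]string{
          "10.0.1.12:8080": "eu-west-1a",
          "10.0.2.34:8080": "eu-west-1b",
      })
      for _, s := range servers {
          fmt.Printf("%s %s backup=%v\n", s.Name, s.Addr, s.Backup)
      }
  }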

What do you think?

This would allow us to save considerable costs and improve latencies.

In our specific case, we use HAProxy both for traffic coming from the internet and for traffic between services.

I am willing to collaborate on a solution if some guidance is provided.

Best regards.

@mlophez
Author

mlophez commented Sep 20, 2024

Hello again,

Thanks to @juampa, we've implemented a POC in version 2.10.16 of the controller, which is the one we're currently using.

Since our EKS clusters use the VPC-CNI add-on, where pod IPs are allocated from the VPC subnet ranges, we can map IP → subnet → availability zone and register the pods that don't belong to the controller's zone as backups.

For this, we have implemented two new environment variables.
[screenshot: the two new environment variables added in the POC]
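
The logic behind the POC is roughly the following; the environment variable names in this sketch are placeholders, the real ones are the two shown above:

  // Rough sketch of the IP -> subnet -> zone mapping used in the POC.
  // AZ_SUBNET_MAP and CONTROLLER_ZONE are hypothetical placeholder names.
  package main

  import (
      "fmt"
      "net"
      "os"
      "strings"
  )

  // zoneForIP returns the availability zone whose subnet contains ip.
  // subnetMap has the form "10.0.1.0/24=eu-west-1a,10.0.2.0/24=eu-west-1b".
  func zoneForIP(ip net.IP, subnetMap string) string {
      for _, entry := range strings.Split(subnetMap, ",") {
          parts := strings.SplitN(entry, "=", 2)
          if len(parts) != 2 {
              continue
          }
          _, cidr, err := net.ParseCIDR(strings.TrimSpace(parts[0]))
          if err != nil {
              continue
          }
          if cidr.Contains(ip) {
              return strings.TrimSpace(parts[1])
          }
      }
      return ""
  }

  func main() {
      subnetMap := os.Getenv("AZ_SUBNET_MAP")   // placeholder name
      localZone := os.Getenv("CONTROLLER_ZONE") // placeholder name

      podIP := net.ParseIP("10.0.2.34")
      backup := zoneForIP(podIP, subnetMap) != localZone
      fmt.Printf("pod %s -> backup=%v (controller zone %s)\n", podIP, backup, localZone)
  }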

This works; however, in this version, when the service is created from scratch, the backup status doesn't appear correctly in the stats, although it shows up correctly in the haproxy.cfg.

[screenshot: HAProxy stats page where the backup status is not reflected]

We think it could be a bug or that we are not making the changes in the correct places.

You can see the code here:
mlophez@66dc4c5

We have tested other approaches by extracting real-time data from the node and the service.

We look forward to your feedback.

Thank you.

@hdurand0710
Contributor

Hi @MLR96,

Thanks for your contribution.
As for the stats, here is what I get

[screenshot: HAProxy stats page showing the backup server]

with

  server SRV_1 10.244.0.7:8888 backup enabled
  server SRV_2 127.0.0.1:8888 disabled
  server SRV_3 127.0.0.1:8888 disabled
  server SRV_4 127.0.0.1:8888 disabled

In the stats, the line for the backup SRV is blue, with Y in the `Bck` column.

Can you give me more information on what's incorrect for you in the stats?
Thanks

@mlophez
Author

mlophez commented Sep 20, 2024

Hi @hdurand0710,

When the replicas of a deployment are scaled up and new srv entries are generated, the "backup" keyword is correctly set in the haproxy.cfg file, but in the runtime both the regular and the backup servers simply appear as "UP".

[screenshot: haproxy.cfg with the backup keyword set on the new servers]

[screenshot: runtime stats where the same servers all appear as UP]

Once HAProxy is reloaded, the configuration is reflected correctly in the runtime. We believe this might be an error when registering the srv through the runtime, or that our code is incorrect.

[screenshot: runtime stats showing the backup state correctly after the reload]

Thank you.

@hdurand0710
Contributor

@MLR96
Thanks for your quick feedback.
The issue is clear.
Indeed, with the current state of the code (and in the POC), we are not sending the backup state through the runtime command, hence the stats remain wrong until the next reload.
The current runtime commands do not allow setting a server as backup.
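
For illustration, this is the kind of command the controller can send over the runtime socket today (the socket path below is only an example): there is a state keyword, but nothing equivalent for backup, so the flag written to haproxy.cfg only takes effect on the next reload.

  // Minimal sketch of talking to the HAProxy runtime (stats) socket from Go.
  // The socket path is an example; adjust to the actual deployment.
  package main

  import (
      "fmt"
      "io"
      "net"
  )

  // runtimeCommand sends a single command to the runtime socket and returns the raw response.
  func runtimeCommand(socketPath, cmd string) (string, error) {
      conn, err := net.Dial("unix", socketPath)
      if err != nil {
          return "", err
      }
      defer conn.Close()
      if _, err := fmt.Fprintf(conn, "%s\n", cmd); err != nil {
          return "", err
      }
      out, err := io.ReadAll(conn)
      return string(out), err
  }

  func main() {
      // Runtime commands can change a server's operational state...
      resp, err := runtimeCommand("/var/run/haproxy-runtime-api.sock",
          "set server my_backend/SRV_2 state maint")
      fmt.Println(resp, err)
      // ...but there is no "set server ... backup" command, so the backup
      // keyword in haproxy.cfg is only picked up on the next reload.
  }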
We will discuss the best way to solve this with the team.
Thanks

@mlophez
Author

mlophez commented Sep 20, 2024

@hdurand0710
Thank you for your attention.
We are looking forward to eliminating the cost of network traffic that crosses availability zones.
Once you reach a consensus on how to implement this for all scenarios, we are ready to open a PR.
Regards

@mlophez
Author

mlophez commented Sep 24, 2024

Hi @hdurand0710

I was reviewing the runtime update and I was able to make everything work correctly. It seems that the runtime update and the haproxy.cfg file update happen in different places.

I’m still not entirely sure, but after adding the following commit, everything works as it should in the POC.

mlophez@20bceda

Best regards.
