Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP chore(aws-provider): aws tags batching functionality #5058

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ivankatliarchuk
Copy link
Contributor

@ivankatliarchuk ivankatliarchuk commented Feb 3, 2025

Description

No issue to day

Multi tenant platform, with dozens of customers, clients of random size and different requirements.

Some of them have route53 setups with 100+, 200-300 zones in accounts. We have no control over things like split zones or similar stuff.

So what is going on. Filter by tag is a great feature, as it helps to attach metadata to workloads of different types and etc. At the moment the logic

response, err := client.ListTagsForResource(ctx, &route53.ListTagsForResourceInput{
does one request per zone, it only works for external-dns when the size is ~ 20-40 zones. More important, as we have some controls and platform customers build other tooling that rely on Route53 API, they all throttled as well. So we have outages, as a result the external-dns need to be stopped time-to-time.

At the moment with current implementation, the external-dns not even able to fetch this zones data, cachind does not help either, as zones are not in the cache. external-dns do not crash, but zones list is empty, hence no records to update

Rate limit error on UI when external-dns refreshing a zone list

Screenshot 2025-02-02 at 16 13 32

Current implementation

Screenshot 2025-02-03 at 08 31 06

With batching API in use

Screenshot 2025-02-03 at 09 27 38

Why there is a difference from ~30 seconds FAIL and occasionally SUCCESS vs ~ 6 seconds SUCCESS

  • Current implementation makes one-to-one call per zone
  • aws-sdk-go-v2 has default retry 3x times with back-off, so before to fail, it tries 3x times before to return a soft error. This ads-up to API rate limits and timings
  • Forthe size of 160 zones, current implementation does 160 calls, while with batching there is only 17 requests

** Follow up **

  • I'll create a proposal to implement a global rate limiter for providers as well. Not sure if I'm going to have time to implement, but at least we could get a consensus around required/nice-to-have API rate limiting for providers as abusing 3rd Party API is not the great idea, instead this should be respected

Checklist

  • Unit tests updated
  • End user documentation updated

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raffo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 3, 2025
@ivankatliarchuk ivankatliarchuk marked this pull request as draft February 3, 2025 09:17
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025
@ivankatliarchuk ivankatliarchuk changed the title chore(aws-provider): aws tags batching functionality WIP chore(aws-provider): aws tags batching functionality Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants