Delayed freight discovery #2544

Open
4 tasks done
wmiller112 opened this issue Sep 18, 2024 · 5 comments

Comments

@wmiller112
Contributor

Checklist

  • I've searched the issue queue to verify this is not a duplicate bug report.
  • I've included steps to reproduce the bug.
  • I've pasted the output of kargo version.
  • I've pasted logs, if applicable.

Description

Frequently, new image tags that match the specified warehouse configuration are available but are not discovered in a timely manner. The warehouse is configured with a spec.interval of 2m0s, but with debug logs enabled, I see that the warehouse reconcile loop runs far less frequently than that, and at irregular intervals. Clicking the refresh button also does not seem to reliably trigger a check for new artifacts.

Steps to Reproduce

  • Create a warehouse with spec similar to
spec:
  freightCreationPolicy: Automatic
  interval: 2m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestFromBranch
        discoveryLimit: 20
        includePaths:
          - regex:path/to/manifests/.*
        repoURL: https://github.com/<org>/<repo>
    - image:
        allowTags: ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
        discoveryLimit: 20
        gitRepoURL: https://github.com/<org>/<repo>
        imageSelectionStrategy: Lexical
        repoURL: xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo>
  • Push a tag matching the allowTags pattern
  • Monitor the controller's debug logs and observe that the refresh does not happen at the specified interval
  • Click refresh on the warehouse and confirm that the refresh annotation is added with an updated timestamp
  • Continue to monitor the debug logs; frequently, no refresh happens within the specified interval after clicking refresh
  • Eventually a reconcile loop runs, the new tag is discovered, and the rest of the process works as expected
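
Before pushing a tag, it can help to confirm locally that it really matches the allowTags pattern from the spec. The following is a minimal sketch, not Kargo's implementation: the sample tags are invented for illustration, and the descending sort truncated to the discovery limit is only an assumed approximation of how Lexical selection with discoveryLimit behaves.

```python
import re

# Pattern copied from the warehouse spec's allowTags field.
ALLOW_TAGS = re.compile(
    r"^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$"
)


def select_tags(tags, limit=20):
    """Keep tags matching ALLOW_TAGS, in descending lexical order.

    Assumption: Lexical selection is approximated here as a descending
    string sort truncated to the discovery limit.
    """
    matching = [t for t in tags if ALLOW_TAGS.match(t)]
    return sorted(matching, reverse=True)[:limit]


# Hypothetical tags, invented for illustration only.
SAMPLE_TAGS = [
    "production-2024-09-18T15-40-00Z-" + "a" * 40,
    "production-2024-09-18T16-00-00Z-" + "b" * 40,
    "staging-2024-09-18T16-00-00Z-" + "c" * 40,     # wrong prefix, rejected
    "production-2024-09-18T16-00-00Z-" + "B" * 40,  # uppercase hex, rejected
    "latest",                                       # rejected
]

if __name__ == "__main__":
    for tag in select_tags(SAMPLE_TAGS):
        print(tag)
```

If a pushed tag is filtered out by this check, the delay is a pattern mismatch rather than a reconcile problem. If it passes, the tag should be eligible for discovery on the next reconcile.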

Version

Client Version: v0.8.4
Server Version: v0.8.4

Logs

time="2024-09-18T15:48:08Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources
time="2024-09-18T15:48:08Z" level=debug msg="obtained credentials for git repo" namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="discovered commits" count=20 namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="Controller IAM role is not authorized to assume project-specific role or project-specific role is not authorized to obtain an ECR auth token. Falling back to using controller's IAM role directly." awsAccountID=xxxxxxxxxx awsRegion=us-west-2 namespace=l<project> project=l<project> warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="got ECR authorization token" awsAccountID=xxxxxxxxxx awsRegion=us-west-2 namespace=<project> project=<project> warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="obtained credentials for image repo" namespace=<project> repo=xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="discovered images" count=20 namespace=<project> repo=xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="discovered latest artifacts" namespace=<project> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="done reconciling Warehouse" namespace=<project> warehouse=sources
...
time="2024-09-18T16:04:26Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources
time="2024-09-18T16:04:26Z" level=debug msg="obtained credentials for git repo" namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T16:04:39Z" level=debug msg="discovered commits" count=20 namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T16:04:39Z" level=debug msg="obtained credentials for image repo" namespace=<project> repo=717232957798.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="discovered images" count=20 namespace=<project> repo=717232957798.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="discovered latest artifacts" namespace=<project> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="created Freight" freight=7a4a82965db54bf6d231bf6bfb68f13f39dcd9ac namespace=<project> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="done reconciling Warehouse" namespace=<project> warehouse=sources
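
The gap between reconciles can be measured directly from logs like these. The sketch below assumes the controller log format shown in this issue (a time="..." field and the "reconciling Warehouse" message); the helper name reconcile_gaps and the shortened excerpt are hypothetical. The original excerpt elides lines, so the computed gap only holds if no reconcile was logged in the elided part.

```python
import re
from datetime import datetime, timedelta

TIME_RE = re.compile(r'time="([^"]+)"')
# Matches the start of a reconcile, not the "done reconciling" line.
START_MSG = 'msg="reconciling Warehouse"'


def reconcile_gaps(lines, interval=timedelta(minutes=2)):
    """Return (start, gap, late) for each pair of consecutive reconcile starts.

    `late` is True when the gap exceeds the configured interval.
    """
    starts = []
    for line in lines:
        if START_MSG not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            starts.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ"))
    return [(b, b - a, (b - a) > interval) for a, b in zip(starts, starts[1:])]


# Three lines taken from the logs in this issue, with spacing normalized.
LOG_EXCERPT = [
    'time="2024-09-18T15:48:08Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources',
    'time="2024-09-18T15:48:23Z" level=debug msg="done reconciling Warehouse" namespace=<project> warehouse=sources',
    'time="2024-09-18T16:04:26Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources',
]

if __name__ == "__main__":
    for start, gap, late in reconcile_gaps(LOG_EXCERPT):
        print(start.isoformat(), gap, "LATE" if late else "ok")
```

For the two reconcile starts shown (15:48:08 and 16:04:26), the gap is 16 minutes 18 seconds against a configured interval of 2 minutes.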
@wmiller112
Contributor Author

Looking at this a bit more, I do see the following in the API pod logs, specifically when a manual refresh is triggered:

time="2024-10-01T18:35:17Z" level=error msg="finished streaming call" connect.code=internal connect.duration=3.79452559s connect.method=WatchWarehouses connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-01T18:35:13Z" error="internal: context canceled"
time="2024-10-01T18:35:17Z" level=error msg="finished streaming call" connect.code=internal connect.duration=3.791075582s connect.method=WatchStages connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-01T18:35:13Z" error="internal: context canceled"

I figured it was a resource limitation, but the controller seems to use as much memory as I give it: it was at 2Gi when I opened this issue and is currently at 10Gi. The interval refreshes seem a lot more consistent with more memory, but the above error happens any time a manual refresh is triggered. Once a manual refresh is triggered, I sometimes observe that interval refreshes stop working as well, until I manually remove the refresh annotation.

@hiddeco
Contributor

hiddeco commented Oct 1, 2024

How many tags in total are there in the targeted image repository?

@wmiller112
Contributor Author

wmiller112 commented Oct 1, 2024

The largest repository has about 1100 total tags, of which ~60 match the filter. There are ~1500 images total, some untagged, but policies keep it around this number pretty consistently. I find that it also happens for projects whose warehouses have a very small number of tags, though.

@krancour
Member

krancour commented Oct 2, 2024

Those API server logs do not appear related to this in any way. They look to me like the UI's client disconnecting abruptly.

@krancour krancour modified the milestones: v1.0.0, Post-GA Oct 7, 2024
@wmiller112
Contributor Author

wmiller112 commented Oct 9, 2024

I've updated image retention. The maximum number of tags in any repo Kargo uses is now 500, with around 30 matching in any repo, but the issue persists.
