Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup CI manifest #540

Closed
NobodyXu opened this issue Jun 15, 2024 · 10 comments
Closed

Speedup CI manifest #540

NobodyXu opened this issue Jun 15, 2024 · 10 comments

Comments

@NobodyXu
Copy link
Collaborator

NobodyXu commented Jun 15, 2024

CI manifest is quite slow, sometimes taking up to 20m.

I think using a dedicated github token might help, secrets.GITHUB_TOKEN only provides 1000 API hits per hour, where one created by user gets 5000.

Switching to GraphQL might also help, since we can fetch all the artifacts information when fetching the release information.

@NobodyXu
Copy link
Collaborator Author

A more efficient approach would be webhooks, though it can only be created by the repository owner.

We can ask them to help with this project, it would be the most efficient way without having to go through GitHub API, and most GitHub friendly.

The webhook information can be cached on a dedicated github branch, so that the manifest CI won't refetch them.

@taiki-e
Copy link
Owner

taiki-e commented Jun 15, 2024

cc @jayvdb

AFAIK this is especially slow since the recent codegen changes (#473).

fetch all the artifacts information when fetching the release information.

Currently, we always call requests that have a limit of 100 releases per request in several times to get all the releases. However, if codegen is called with the "latest" argument, as in CI, it is usually sufficient to look at the latest 100 releases.

Eventually we may need to change the overall logic, I guess there are a number of areas that could be made better without such a big deal.

@NobodyXu
Copy link
Collaborator Author

Eventually we may need to change the overall logic

@taiki-e I think using web hook for certain projects and cache them in a dedicated branch would be much efficient.

cc @sunshowers

@taiki-e
Copy link
Owner

taiki-e commented Jun 15, 2024

By the way, in my understanding, one of the actual causes of the "slowness" is that even though the token has reached the rate limit, that fact is not remembered on the codegen side, and it attempts to download with the token that has reached the rate limit each time the download is invoked, repeatedly.

@NobodyXu
Copy link
Collaborator Author

that fact is not remembered on the codegen side, and it attempts to download with the token that has reached the rate limit each time the download is invoked, repeatedly.

The binstalk-downloader crate checks for rate limit

https://github.com/cargo-bins/cargo-binstall/blob/0d7080e6a9e1ca43a37026b52683e367c416c87f/crates/binstalk-downloader/src/remote.rs#L182

and applies a backoff strategy

https://github.com/cargo-bins/cargo-binstall/blob/0d7080e6a9e1ca43a37026b52683e367c416c87f/crates/binstalk-downloader/src/remote/delay_request.rs#L70

@NobodyXu
Copy link
Collaborator Author

We also have crate binstalk-git-repo-api for some GraphQL APIs.

I could add a artifact listing API for specific version, if you need any.

@jayvdb
Copy link
Contributor

jayvdb commented Jun 16, 2024

AFAIK this is especially slow since the recent codegen changes (#473).

Yes, this is the main cause. It added one github API fetch of each tools https://api.github.com/repos/{repo}, and then also quite a few github_head() calls per tool to find the license files.

A very quick improvement for the license file fetches is to use the existing manifest data to first verify the existing URLs still exist, and fallback to the existing logic which attempt to find the appropriate license files.

We can likely also skip a lot of fetching by storing, in the manifests, a copy of the github headers related to caching, e.g etag , which exists on the API hits, including /releases list, (these provide a weak etag identifier) and on the binary release assets (a strong etag identifier).

Due to the way the code is structured, the /releases list API and release asset fetches are easy to skip if github.com says the content isnt modified.

https://api.github.com/repos/{repo} are not likely to be "not modified" in CI - most of these repos have something change every day, due to fields like created_at, updated_at & pushed_at, and forks, open_issues & watchers. So I wouldnt try (yet) to restructure the code to handle this hit being unmodified. However, switching this to graphql could allow the response that might provide an etag that is more stable, and thus worth restructuring the code to avoid the unnecessary refetches.

@NobodyXu
Copy link
Collaborator Author

A very quick improvement for the license file fetches

For license file, can't we just grab it from the tarball downloaded from crates.io?

The tarball on crates.io is immutable so it can be cached for each release, and I think we already download from crates.io?

@jayvdb
Copy link
Contributor

jayvdb commented Jun 16, 2024

For license file, can't we just grab it from the tarball downloaded from crates.io?

Yup; good idea.

The tarball on crates.io is immutable so it can be cached for each release, and I think we already download from crates.io?

yes we do.

@taiki-e
Copy link
Owner

taiki-e commented Jun 19, 2024

Before a5ddc5a: about 11m - 22m
After a5ddc5a: 4m

GITHUB_TOKEN wasn't applied properly in the first place (I think it worked at one point, but at some point GitHub may have made the validation more strict).

Closing this issue since this is no longer an urgent issue. (Opened separate issue #546 for license file fetch improvements.)
However, I always welcome concrete suggestions/patches for further performance improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants