Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry support to resolving in opm alpha render-template #1513

Open
porridge opened this issue Dec 9, 2024 · 3 comments
Open

Add retry support to resolving in opm alpha render-template #1513

porridge opened this issue Dec 9, 2024 · 3 comments

Comments

@porridge
Copy link
Member

porridge commented Dec 9, 2024

With a larger catalog, this command might want to pull a large number of images.
In my experience even the public production image registries are not reliable enough to return all images successfully.

What is important, even some errors which seem to be client-side issues (such as 403s, see following example) need to be retried, as in my experience they are regularly reported as intermittent issues by (apparently) confused registries.

2024/12/03 18:58:31 render reference "<some-pullspec>": failed to pull image "<some-pullspec>": error resolving name for image ref <some-pullspec>: failed to authorize: failed to fetch oauth token: unexpected status from GET request to <some-url>: 403 Forbidden

While it is possible to retry the whole opm command, this is not terribly effective, since everything is re-fetched from scratch, so the probability of success does not increase.

@bentito
Copy link
Contributor

bentito commented Dec 9, 2024

So you're saying that we should, in the case of the 403 that you give, retry? How can we tell the difference between a real 403, and this one that you say is being made up erroneously by the registry? If it's a real 403, retrying is a waste of time and if a tool output "Error 403, retrying.", I'd really wonder about the tool.

Can you give a more relevant example, like a server-side fail where we don't properly retry?

Because it looks like here:

if nonRetriablePullError.MatchString(pullErr.Error()) {

we use the standard Go retry code and you're just hitting a non-retriable error, no?

@porridge
Copy link
Member Author

So you're saying that we should, in the case of the 403 that you give, retry?

Unfortunately, yes 😢

According to the data from the image prefetching service for our CI clusters (not directly related to our opm usage, but pulling from mostly from the same image registry) the intermittent authorization errors constitute ~0.3% of all image pull errors on average. Looking at historical records, the errors come and go in waves, so at times this rate is significantly higher.

A quick calculation shows that with a catalog template of 100 bundles (roughly the size of ours at the moment) there is a ~0.7% chance of the opm alpha render-template failing due to these intermittent auth issues, on average. Not huge, but it does happen and is really annoying.

How can we tell the difference between a real 403, and this one that you say is being made up erroneously by the registry?

I think you cannot! 😭

If it's a real 403, retrying is a waste of time and if a tool output "Error 403, retrying.", I'd really wonder about the tool.

Actually looking at the code you pointed at, you already seem to retry fetching on all errors, retriable or not! The only non-retried case is the sentinel error for docker schema v1 manifest. So this is good from my PoV.

Can you give a more relevant example, like a server-side fail where we don't properly retry?

Because it looks like here:

if nonRetriablePullError.MatchString(pullErr.Error()) {

we use the standard Go retry code and you're just hitting a non-retriable error, no?

I actually took a closer look, and judging from the error message I pasted, and from the registry code you pointed at, it seems like the error happens during resolving, here, which happens before the retry handling you pointed out.

There do not seem to be any retries around the Resolve() call. 🤔

Could I ask for opt-in retrying of all errors around that call?

@porridge porridge changed the title Add retry support to opm alpha render-template Add retry support to resolving in opm alpha render-template Dec 10, 2024
@bentito
Copy link
Contributor

bentito commented Dec 11, 2024

@porridge you're right, the code I highlighted isn't the code in question.

@grokspawn I think we could add (optional?) retries at resolve time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants