fix(crates): Detect stale upload caches and retry #98
Conversation
I'm open to suggestions regarding tests :)
Let's try 🤷♂️
src/targets/crates.ts
Outdated
      return await spawnProcess(CARGO_BIN, args, { env });
    } catch (e) {
      if (i == 0 && e.message.includes(VERSION_ERROR)) {
        logger.debug('Potential stale cache detected, trying again...');
-        logger.debug('Potential stale cache detected, trying again...');
+        logger.warn('Potential stale cache detected, trying again...');
To make it more visible.
Just double-checking - are you certain? I'd rather swallow this and only if it occurs again emit an error. Usually the entire crate publishing stage is silent, so it might be a bit surprising if suddenly the warning pops up but then nothing after that.
It's just to detect these cases in case the user forgot to set log level to debug.
> Usually the entire crate publishing stage is silent
We can log something else to make it less lonely?
OK, let me add some info logs and also the crate name in the warning.
Updated, how do you like the new messages?
Bbbjutiful 👍
Co-authored-by: Anton Ovchinnikov <[email protected]>
I'd like to verify this once with an actual release. I'll check back in once it worked and then we can merge this.
src/targets/crates.ts
Outdated
  };

  logger.info(`Publishing ${crate.name}`);
  for (let i = 0; i <= 1; i++) {
I think this for loop construct makes it harder to read the code than repeating the `return await spawnProcess(CARGO_BIN, args, { env });` part. If you don't want to repeat that, we can abstract it out into its own function.
Or at least we should define `MAX_RETRIES = 1` and do `i <= MAX_RETRIES`.
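For illustration, a hypothetical refactor along those lines could pull the retry into its own helper. The `publish` callback and `isStaleCacheError` predicate below stand in for the real `spawnProcess(CARGO_BIN, args, { env })` call and error check; the names are mine, not from this PR:

```typescript
// Sketch only: a named retry bound instead of a bare literal in the loop.
const MAX_RETRIES = 1;

async function publishWithRetry(
  publish: () => Promise<string>,
  isStaleCacheError: (e: Error) => boolean
): Promise<string> {
  for (let i = 0; i <= MAX_RETRIES; i++) {
    try {
      return await publish();
    } catch (e) {
      // Retry only on a suspected stale cache, and only while we have
      // retries left; otherwise rethrow immediately.
      if (i < MAX_RETRIES && isStaleCacheError(e as Error)) {
        continue;
      }
      throw e;
    }
  }
  throw new Error('unreachable');
}
```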
Good point, I’ll repeat for now.
Unfortunately, that still did not help. Craft detects the condition successfully, but I'm still getting the error. I am not sure what the proper resolution is, because as soon as I try it manually, it just magically works. You can see that it even pulls a new version of the "crates.io index". Potentially increase the timeout?
src/targets/crates.ts
Outdated
    if (e.message.includes(VERSION_ERROR)) {
      logger.warn(`Potential stale cache detected, trying again...`);
      await sleepAsync(RETRY_DELAY_MS);
      return spawnProcess(CARGO_BIN, args, { env });
hah, skipping `await`, very clever optimization :)
It's closer to the original. However, I think I'll have to revert this and really spin for multiple times + increase the timeout. No idea what the root cause of this is.
Thanks, I tried that as well. Unfortunately, you can see in the above output that the crates index was in fact refreshed, but still didn't contain the new version.
This leads me to believe that it's server-side caches that interfere with this, and the only way around it would be larger delays and more retries.
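As a numeric sketch of what "larger delays and more retries" could look like, an exponential backoff schedule can be laid out like this. The constant values here are assumptions for illustration, not the ones merged in this PR:

```typescript
// Assumed values for illustration only.
const RETRY_DELAY_SECS = 3;
const RETRY_EXP_FACTOR = 2;
const MAX_ATTEMPTS = 5;

// Delay before each retry: base * factor^i for the i-th retry.
function backoffDelays(base: number, factor: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(base * Math.pow(factor, i));
  }
  return delays;
}

const delays = backoffDelays(RETRY_DELAY_SECS, RETRY_EXP_FACTOR, MAX_ATTEMPTS);
// Cumulative wait across all retries.
const totalWaitSecs = delays.reduce((a, b) => a + b, 0);
```

With these assumed values the schedule is 3, 6, 12, 24, 48 seconds, so even a slow server-side cache gets a generous window before giving up.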
What if we kept polling the API directly until we get a response (i.e. https://crates.io/api/v1/crates/symbolic/7.3.5)?
There are a few challenges with this:
- The crates API is viewed as an implementation detail, as far as I could see in the cargo issues. I'd rather stick to what cargo offers.
- For publishing a package, we would have to first resolve all dependencies and then check the API for every dependency. This is much more code than I'd hoped for.
- Even then, it doesn't guarantee that cargo properly refreshes the index.
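For completeness, the rejected polling idea could be sketched generically like this. `pollUntil` is a hypothetical helper of mine; in the real case the `check` callback would issue an HTTP GET against the crates.io API URL above:

```typescript
// Sketch only: poll a check function until it succeeds or we run out of
// attempts, sleeping a fixed delay between attempts.
async function pollUntil(
  check: () => Promise<boolean>,
  maxAttempts: number,
  delayMs: number
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    if (await check()) {
      return true;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

As the list above notes, even a successful poll would not guarantee that cargo's local index has caught up.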
I ran another test with this implementation and got the same result:
It is clearly visible how cargo updates the index and then still doesn't find the crate after 12 seconds. This happens predictably, and I even reproduced it with a manual release. This can only be something we're missing. I'm wondering if we should just remove the
* master: (23 commits)
  feat(crates): Add noDevDeps option (#112)
  fix: Write to cache in base artifact provider
  fix: Logger scopes for gcs and artifact providers
  build(ci): Have better defaults for CI environments (#110)
  fix(gha): Use single quotes for string literals (#108)
  ref(gha): Remove ENV inputs, add no-merge and keep-branch (#107)
  fix(docker): Fix CARGO_HOME and RUST_HOME since GHA changes HOME (#106)
  docs: Add missing CHANGELOG entry for cargo upgrade (#105)
  build(docker): Upgrade cargo to a recent version (#104)
  fix(gha): Remove no-merge and keep-branch temporarily
  fix(gha): Try to skip empty args using null
  fix(gha): Remove defaults on craft arguments
  fix(gha): Only pass publish args to publish
  feat(gha): Add GitHub Action for Craft (#103)
  docs: Fix `changelogPolicy` enum (#102)
  build(docker): Add a `craft` binary into the Docker image (#101)
  docs: Fix `artifactProvider` example (#100)
  release: 0.10.0
  meta: Update Changelog
  build(gcb): Add a public Docker image (#99)
  ...
Ok, it does what it should do now. Can I haz a new approve before marge, please? Here is fully trustworthy proof:
🎖️
    try {
      return await spawnProcess(CARGO_BIN, args, { env });
    } catch (e) {
      if (i < MAX_ATTEMPTS && e.message.includes(VERSION_ERROR)) {
This differs from the for loop's condition, intentional? I find having two checks a bit confusing btw. Feels like this should be a while loop. How about a slightly different approach?
const MAX_WAIT_SECS = 60;
let totalWait = 0;
let delay = RETRY_DELAY_SECS;
let error;
do {
if (error) {
if (error.message.includes(VERSION_ERROR)) {
logger.warn(`Publish failed due to potentially stale cache. Trying again in ${delay}s...`);
await sleepAsync(delay * 1000);
totalWait += delay;
delay *= RETRY_EXP_FACTOR;
} else {
break;
}
}
try {
return await spawnProcess(CARGO_BIN, args, { env });
} catch (err) {
error = err;
}
} while (totalWait <= MAX_WAIT_SECS)
throw error;
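Extracted into a standalone, testable form, the suggested do/while flow behaves as sketched below. The spawn call is replaced by an injected `publish` function, the timings are scaled down to milliseconds, and all names and constant values (including the `VERSION_ERROR` string) are illustrative, not this PR's final code:

```typescript
// Assumed values, scaled down from seconds for illustration.
const MAX_WAIT_MS = 600;
const RETRY_EXP_FACTOR = 2;
const VERSION_ERROR = 'no matching version';

async function publishWithBackoff(
  publish: () => Promise<string>,
  initialDelayMs: number
): Promise<string> {
  let totalWait = 0;
  let delay = initialDelayMs;
  let error: Error | undefined;
  do {
    if (error) {
      if (error.message.includes(VERSION_ERROR)) {
        // Suspected stale cache: wait, then grow the delay exponentially.
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
        totalWait += delay;
        delay *= RETRY_EXP_FACTOR;
      } else {
        break; // Unrelated error: rethrow immediately.
      }
    }
    try {
      return await publish();
    } catch (err) {
      error = err as Error;
    }
  } while (totalWait <= MAX_WAIT_MS);
  throw error;
}
```

Note how an unrelated error exits via `break` after a single attempt, while stale-cache errors keep retrying until the total wait budget is spent.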
Yes, this is intentional. The loop needs to take another turn so that we get into the `else` branch. In your implementation, you do not throw the error in the last iteration.
I thought about a total wait time (which could be written as a `for` loop, too), but rather wanted a fixed number of retries. And then doing `RETRY_DELAY_SECS * Math.pow(RETRY_EXP_FACTOR, MAX_ATTEMPTS)` seemed too cumbersome.
> In your implementation, you do not throw the error in the last iteration.

Yes I do, as I do not check the count or total wait in the `if` separately.
I need more sleep/coffee.
<3
The publishing errors also happen locally. When publishing a crate with `--no-verify`, it seems that the crate caches can go out of date and `cargo` will fail to publish. We can detect this and retry publishing with exponential backoff.
I even ran into this when publishing manually with sufficient time between the workspace crates. The error always looks like this:
It is unclear what the root cause of this is. When cargo updates its index, it runs a "fast path" request against the GitHub API which seems to get cached. Waiting for 20-30 seconds seems sufficient to invalidate those caches.
Fixes #67
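As a rough sanity check on the "20-30 seconds" observation, the cumulative wait of an exponential backoff schedule has a closed form (a geometric series). The constants below are assumed for illustration, not taken from this PR:

```typescript
// Total wait across n retries with exponential backoff:
//   base * (factor^n - 1) / (factor - 1)   (geometric series, factor > 1)
function totalBackoffSecs(base: number, factor: number, attempts: number): number {
  return (base * (Math.pow(factor, attempts) - 1)) / (factor - 1);
}

// Assumed values: 3s base delay, doubling, 4 retries.
const budgetSecs = totalBackoffSecs(3, 2, 4);
```

With these assumed values the budget is 45 seconds, comfortably above the 20-30 second window that seems sufficient to invalidate the caches.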