Skip to content

fix(crates): Detect stale upload caches and retry #98

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 18, 2020

Conversation

jan-auer
Copy link
Member

@jan-auer jan-auer commented Jun 9, 2020

The publishing errors also happen locally. When publishing a crate with --no-verify, it seems that the crate caches can go out of date and cargo will fail to publish. We can detect this and retry publishing with exponential backoff.

I even ran into this when publishing manually with sufficient time between the workspace crates. The error always looks like this:

cargo publish --no-verify
    Updating crates.io index
   Packaging symbolic v7.3.4 (/private/tmp/symbolic)
error: failed to prepare local package for uploading

Caused by:
  failed to select a version for the requirement `symbolic-symcache = "^7.3.4"`
  candidate versions found which didn't match: 7.3.3, 7.3.2, 7.3.1, ...
  location searched: crates.io index
required by package `symbolic v7.3.4 (/private/tmp/symbolic)`

It is unclear what the root cause of this is. When cargo updates its index, it runs a "fast path" request against the GitHub API which seems to get cached. Waiting for 20-30 seconds seems sufficient to invalidate those caches.

Fixes #67

@jan-auer jan-auer requested a review from tonyo June 9, 2020 10:11
@jan-auer jan-auer self-assigned this Jun 9, 2020
@jan-auer
Copy link
Member Author

jan-auer commented Jun 9, 2020

I'm open to suggestions regarding tests :)

Copy link
Contributor

@tonyo tonyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try 🤷‍♂️

return await spawnProcess(CARGO_BIN, args, { env });
} catch (e) {
if (i == 0 && e.message.includes(VERSION_ERROR)) {
logger.debug('Potential stale cache detected, trying again...');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug('Potential stale cache detected, trying again...');
logger.warn('Potential stale cache detected, trying again...');

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it more visible.

Copy link
Member Author

@jan-auer jan-auer Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checking - are you certain? I'd rather swallow this and only if it occurs again emit an error. Usually the entire crate publishing stage is silent, so it might be a bit surprising if suddenly the warning pops up but then nothing after that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just to detect these cases in case the user forgot to set log level to debug.

Usually the entire crate publishing stage is silent

We can log something else to make it less lonely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let me add some info logs and also the crate name in the warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, how do you like the new messages?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bbbjutiful 👍

Co-authored-by: Anton Ovchinnikov <[email protected]>
@jan-auer
Copy link
Member Author

jan-auer commented Jun 9, 2020

I'd like to verify this once with an actual release. I'll check back in once it worked and then we can merge this.

};

logger.info(`Publishing ${crate.name}`);
for (let i = 0; i <= 1; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this for loop construct makes it harder to read the code than repeating the return await spawnProcess(CARGO_BIN, args, { env }); part. If you don't want to repeat that, we can abstract that out into its own function.

Or at least we should define MAX_RETRIES=1 and do i <= MAX_RETRIES.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I’ll repeat for now.

@jan-auer
Copy link
Member Author

jan-auer commented Jun 9, 2020

Unfortunately, that still did not help. Craft detects the condition successfully, but I'm still getting the error. I am not sure what the proper resolution is, because as soon as I try it manually, it just magically works.

You can see that it even pulls a new version of the "crates.io index". Potentially increase the timeout?

...
ℹ info [crates] › Publishing symbolic
⚠ warn [crates] › Potential stale cache detected, trying again...
✖ error Error: Process "cargo" errored with code 101

  STDOUT: cargo:


  STDERR:cargo:     Updating crates.io index
  cargo:    Packaging symbolic v7.3.5 (/var/folders/7p/zvqpr1r53zv_b6tzlz7tw2vw0000gn/T/craft-crates-hJ3Y7K)
  cargo: error: failed to prepare local package for uploading
  cargo:
  cargo: Caused by:
  cargo:   failed to select a version for the requirement `symbolic-common = "^7.3.5"`
  cargo:   candidate versions found which didn't match: 7.3.4, 7.3.3, 7.3.2, ...
  cargo:   location searched: crates.io index
  cargo: required by package `symbolic v7.3.5 (/var/folders/7p/zvqpr1r53zv_b6tzlz7tw2vw0000gn/T/craft-crates-hJ3Y7K)`
  cargo:

if (e.message.includes(VERSION_ERROR)) {
logger.warn(`Potential stale cache detected, trying again...`);
await sleepAsync(RETRY_DELAY_MS);
return spawnProcess(CARGO_BIN, args, { env });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, skipping await, very clever optimization :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's closer to the original. However, I think I'll have to revert this and really spin for multiple times + increase the timeout. No idea what the root cause of this is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I tried that as well. Unfortunately, you can see in the above output that the crates index was in fact refreshed, but still didn’t contain the new version.

This leads me to believe that it’s server side caches that interfer with this, and the only way around would be larger delays and more retries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we kept polling the API directly until we get a response (ie https://crates.io/api/v1/crates/symbolic/7.3.5)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two challenges with this:

  • The crates API is viewed as an implementation detail as far as I could see in the cargo issues. I'd rather stick to what cargo offers
  • For publishing a package, we would have to first resolve all dependencies, and then check the API for every dependency. This is much more code that I'd hoped for.
  • Even then, it doesn't guarantee us that cargo properly refreshes the index.

@jan-auer
Copy link
Member Author

I ran another test with this implementation and got the same result now:

ℹ info [crates] › Publishing symbolic
⚠ warn [crates] › Potential stale cache detected, trying again in 3s...
⚠ warn [crates] › Potential stale cache detected, trying again in 3s...
⚠ warn [crates] › Potential stale cache detected, trying again in 3s...
⚠ warn [crates] › Potential stale cache detected, trying again in 3s...
✖ error Error: Process "cargo" errored with code 101

  STDERR:cargo:     Updating crates.io index
  cargo:    Packaging symbolic v7.3.6 (/var/folders/7p/zvqpr1r53zv_b6tzlz7tw2vw0000gn/T/craft-crates-FkEZbQ)
  cargo: error: failed to prepare local package for uploading
...

It is clearly visible how cargo updates the index, and then still doesn't find the crate after 12 seconds. This happens predictably, and I even reproduced this with a manual release. This can only be something we're missing.

I'm wondering if we should just remove the --no-verify and take into account that it might take some time to build and verify the crates.

jan-auer added 2 commits June 18, 2020 19:26
* master: (23 commits)
  feat(crates): Add noDevDeps option (#112)
  fix: Write to cache in base artifact provider
  fix: Logger scopes for gcs and artifact providers
  build(ci): Have better defaults for CI environments (#110)
  fix(gha): Use single quotes for string literals (#108)
  ref(gha): Remove ENV inputs, add no-merge and keep-branch (#107)
  fix(docker): Fix CARGO_HOME and RUST_HOME since GHA changes HOME (#106)
  docs: Add missing CHANGELOG entry for cargo upgrade (#105)
  build(docker): Upgrade cargo to a recent version (#104)
  fix(gha): Remove no-merge and keep-branch temporarily
  fix(gha): Try to skip empty args using null
  fix(gha): Remove defaults on craft arguments
  fix(gha): Only pass publish args to publish
  feat(gha): Add GitHub Action for Craft (#103)
  docs: Fix `changelogPolicy` enum (#102)
  build(docker): Add a `craft` binary into the Docker image (#101)
  docs: Fix `artifactProvider` example (#100)
  release: 0.10.0
  meta: Update Changelog
  build(gcb): Add a public Docker image (#99)
  ...
@jan-auer
Copy link
Member Author

Ok, it does what it should do now. Can I haz a new approve before marge, please?

Here is fully trustworthy proof:

ℹ info [crates] › Publishing symbolic-common
ℹ info [crates] › Publishing symbolic-sourcemap
ℹ info [crates] › Publishing symbolic-unreal
ℹ info [crates] › Publishing symbolic-debuginfo
ℹ info [crates] › Publishing symbolic-demangle
ℹ info [crates] › Publishing symbolic-proguard
ℹ info [crates] › Publishing symbolic-minidump
ℹ info [crates] › Publishing symbolic-symcache
ℹ info [crates] › Publishing symbolic
⚠ warn [crates] › Publish failed, trying again in 2s...
⚠ warn [crates] › Publish failed, trying again in 4s...
⚠ warn [crates] › Publish failed, trying again in 8s...
⚠ warn [crates] › Publish failed, trying again in 16s...
ℹ info [crates] › Crates release complete

@jan-auer jan-auer requested review from BYK and tonyo June 18, 2020 17:34
Copy link
Contributor

@tonyo tonyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎖️

try {
return await spawnProcess(CARGO_BIN, args, { env });
} catch (e) {
if (i < MAX_ATTEMPTS && e.message.includes(VERSION_ERROR)) {
Copy link
Member

@BYK BYK Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This differs from the for loop's condition, intentional? I find having two checks a bit confusing btw. Feels like this should be a while loop. How about a slightly different approach?

const MAX_WAIT_SECS = 60;
let totalWait = 0;
let delay = RETRY_DELAY_SECS;
let error;

do {
  if (error) {
    if (error.message.includes(VERSION_ERROR)) {
      logger.warn(`Publish failed due to potentially stale cache. Trying again in ${delay}s...`);
      await sleepAsync(delay * 1000);
      totalWait += delay;
      delay *= RETRY_EXP_FACTOR;
    } else {
      break;
    }
  }
  try {
    return await spawnProcess(CARGO_BIN, args, { env });
  } catch (err) {
    error = err;
  }
} while (totalWait <= MAX_WAIT_SECS)

throw error;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is intentional. The loop needs to to take another turn so that we get into the else branch. In your implementation, you do not throw the error in the last iteration.

I thought about a total wait time (which could be written as a for loop, too), but rather wanted a fixed number of retries. And then doing RETRY_DELAY_SECS * Math.pow(RETRY_EXP_FACTOR, MAX_ATTEMPTS) seemed to cumbersome.

Copy link
Member

@BYK BYK Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your implementation, you do not throw the error in the last iteration.

Yes I do as I do not check the count or total wait in the if separately.

I need more sleep/coffee.

Copy link
Member

@BYK BYK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3

@jan-auer jan-auer merged commit c9d32ea into master Jun 18, 2020
@jan-auer jan-auer deleted the fix/crates-cache-retry branch June 18, 2020 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crates.io Releases seem to suffer from race conditions
3 participants