Skip to content

fix(vendor)!: direct extraction for registry sources #15514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

weihanglo
Copy link
Member

@weihanglo weihanglo commented May 10, 2025

What does this PR try to resolve?

PathSource::list_files has some heurstic rules for listing files. Those rules are mainly designed for cargo package.

Previously, cargo-vendor relies on those rules to understand what files to vendor. However, it shouldn't use those rules because:

  • Package extracted from a .crate tarball isn't Git-controlled, some rules may apply differently.
  • The extracted package already went through PathSource::list_files during packaging. It should be clean enough.
  • Should keep crate sources from registry sources in a pristine state, which is exactly what vendoring is meant for.

Instead, we switch to direct extraction into the vendor directory
to ensure source code is the same as in the .crate tarball.

How should we test and review this PR?

There is already a test vendor::package_exclude for the fix of #9054,
though now I think it is not a good fix. The test change shows the correct behavior change.

I am not sure if we want more tests.

Also, there are some caveats with this fix:

  • The overwrite protection in unpack_package assumes the unpack
    directory is always <pkg>-<version>.
    We don't want to remove this,
    but for cargo-vendor supports vendoring without version suffix.
    For that case, we need a temporary staging area,
    and move the unpacked source then.
  • The heurstic in PathSource::list_files did something "good" in
    general cases, like excluding hidden directories. That means
    common directories like .github or .config won't be vendored.
    After this, those get included. This is another round of churns.
    We might want to get other cargo-vendor changes along with this
    in one single release.

Additional information

This also opens a door for

Since we are changing vendored source (again), we might want to remove the .rej/.orig special rules in cargo-vendor, as well as look into the new source-dedup vendor dir layout.

Summary Notes

Generated by triagebot, see help for how to add more

@rustbot
Copy link
Collaborator

rustbot commented May 10, 2025

r? @ehuss

rustbot has assigned @ehuss.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added Command-vendor S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 10, 2025
@epage
Copy link
Contributor

epage commented May 11, 2025

Except this doesn't guarantee things are pristine until we vendor the contents of the .crate. I worry this might cause us to vendor things like target/ directories.

@weihanglo
Copy link
Member Author

Except this doesn't guarantee things are pristine until we vendor the contents of the .crate. I worry this might cause us to vendor things like target/ directories.

The chance of this happening is low because cargo-vendor holds a MutateExclusive cache lock during the entire call. Also basically everything is the same except using walkdir instead of PathSource::list_files. We can add a filter to walkdir to exclude target/ directory, if that helps.

This patch aims to be as simple as possible. The “direct extraction” is the correct fix but require more architectural changes.

@epage
Copy link
Contributor

epage commented May 11, 2025

The chance of this happening is low because cargo-vendor holds a MutateExclusive cache lock during the entire call. Also basically everything is the same except using walkdir instead of PathSource::list_files. We can add a filter to walkdir to exclude target/ directory, if that helps.

And that we delete the extracted .crate so we get a fresh extraction.

So looking over this again, this does not fix #15090. This PR only changes which files we list, it doesn't change how we copy them which is what is causing the problem in #15090.

The “direct extraction” is the correct fix but require more architectural changes.

Is it all that big of an architectural change? Can we get an instance of the registry source and and call an extract method directly on it?

@weihanglo
Copy link
Member Author

weihanglo commented May 11, 2025

So looking over this again, this does not fix #15090. This PR only changes which files we list, it doesn't change how we copy them which is what is causing the problem in #15090.

How does it not fix #15090? cargo-vendor doesn't change any content of a registry package, at least cp_sources() doesn't. It does skip some files and create a .cargo-checksum.json though.

Is it all that big of an architectural change? Can we get an instance of the registry source and and call an extract method directly on it?

Maybe not really an architectural change, but we'll need to have an abstraction over local/remote/http registry (which is RegistryData/RegistrySource). Ensure .crates be prepared. And decouple unpack_package and get_pkg from the actual download logic. And then we still need to walk the extraced content to compute checksum. Not architectural but also not small.

weihanglo added 6 commits May 11, 2025 14:56
`PathSource::list_files` has some heurstic rules for listing files.
Those rules are mainly designed for `cargo package`.

Previously, cargo-vendor relies on those rules to understand what
files to vendor. However, it shouldn't use those rules because:

* Package extracted from a `.crate` tarball isn't Git-controlled,
  some rules may apply differently.
* The extracted package already went through `PathSource::list_files`
  during packaging. It should be clean enough.
* Should keep crate sources from registry sources in a pristine state,
  which is exactly what vendoring is meant for.

Instead, we switch to direct extraction into the vendor directory
to ensure source code is the same as in the `.crate` tarball.

There are some caveats:

* The overwrite protection in `unpack_package` assumes the unpack
  directory is always `<pkg>-<version`>.
  We don't want to remove this,
  but for cargo-vendor supports vendoring without version suffix.
  For that case, we need a temporary staging area,
  and move the unpacked source then.
* The heurstic in `PathSource::list_files` did something "good" in
  general cases, like excluding hidden directories. That means
  common directorys like `.github` or `.config` won't be vendored.
  After this, those get included. This is another round of churns.
  We might want to get other `cargo-vendor` changes along with this
  in one single release.
By doing direct extractions, source code is kept in its original state
The old workaround is irrelevant now.
@rustbot rustbot added the A-registries Area: registries label May 11, 2025
Not sure if it is really needed,
though Cargo had better follow what platform support says.

> The behavior on Windows is the same on Windows 10 1607 and higher
> if `FileRenameInfoEx` is supported by the filesystem; otherwise,
> `from` can be anything, but `to` must *not* be a directory.

https://doc.rust-lang.org/1.86.0/src/std/fs.rs.html#2430-2435
@weihanglo weihanglo changed the title fix(vendor): dont use PathSource heuristic when copying files fix(vendor)!: direct extraction for registry sources May 11, 2025
@weihanglo
Copy link
Member Author

Anyway, I got direct extraction working. Doesn't look too terrible than I thought!

@weihanglo
Copy link
Member Author

@rustbot note benchmark-result

Also, the benchmark result shows that it is ~60% faster when vendoring cargo and its dependencies. On Apple M1 Pro with 500GiB APFS SSD

Benchmark 1: cargo-15514 vendor
  Time (mean ± σ):     12.331 s ±  0.221 s    [User: 2.705 s, System: 7.211 s]
  Range (min … max):   12.083 s … 12.753 s    10 runs

Benchmark 2: cargo-15514 vendor --versioned-dirs
  Time (mean ± σ):     12.737 s ±  0.242 s    [User: 2.704 s, System: 7.681 s]
  Range (min … max):   12.481 s … 13.240 s    10 runs

Benchmark 3: cargor-master vendor
  Time (mean ± σ):     19.639 s ±  0.475 s    [User: 2.947 s, System: 12.391 s]
  Range (min … max):   19.006 s … 20.481 s    10 runs

Benchmark 4: cargor-master vendor --versioned-dirs
  Time (mean ± σ):     19.574 s ±  0.558 s    [User: 2.948 s, System: 12.422 s]
  Range (min … max):   18.739 s … 20.489 s    10 runs

Summary
  cargofast vendor ran
    1.03 ± 0.03 times faster than cargo-15514 vendor --versioned-dirs
    1.59 ± 0.05 times faster than cargo-master vendor --versioned-dirs
    1.59 ± 0.05 times faster than cargo-master vendor

Copy link
Contributor

@ehuss ehuss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I have not gone over this with a fine-toothed comb, but the overall structure looks good to me.

I'm not going to click merge right now in case anyone else wants to look closer, or you had any additional concerns or changes you were thinking of. If there isn't more progress by sometime next week, then I think it would be fine to just merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment