-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simplify SourceID Hash #14800
simplify SourceID Hash #14800
Conversation
If this turns out not to be a valid transformation I am happy to update the PR to be tests to prove why it's not okay. |
_ => self.inner.url.as_str().hash(into), | ||
} | ||
self.inner.canonical_url.hash(into); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most CanonicalUrl differences are related to git somehow. I'm assuming that means we can't hit them in any other way. There is the stripping of a trailing /
. However, we probably should have been respecting that in the hash in the first place I assume. The question is if this will change anything significant enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps if you have a index hosted on git, then GITHUB.com/me/my-index.git/
could change a lot. I think the biggest concern here is having to recheck out dependencies. So even if this does affect people we might just eat that cost. On the other hand there's a lot of complexity here and maybe there are bigger problems I forgot about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't yet looked closer. I wonder if this would affect hacks like this one: #5478 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that would be affected by this change. But it is something we have to watch for as I "clean up" this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true, since [patch.<url>]
uses URLs directly, not SourceId.
This change may affect anywhere we let people define URL and parse it as SourceId
, like the [registries]
table? We could also deem this as a bug and fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would hope that we rely on Eq
for those cases and not Hash
. So that the problem you identified is not a problem with this PR, rather one to investigate on the next PR that changes Eq
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Have checked other usages of it, none of them seems user-facing like the path to index files. It should be fine to move over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the paths to crates.io Git and spare index on users disks are not affected by this change, you have my vote.
To be clear, the goal here is to simplify the code a little bit? Are there any other intended improvements? And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in If that's the tradeoff, I'm a little on the fence and could go either way. If I'm understanding it correctly, would it be possible to update the |
From this PR on its own, no other improvements. On its own, this is a small simplification that does not pull its weight. I have spiked out a larger re-factor that passes tests and that IMO does make things clearer and faster. It relies on several small changes like this that really need to be discussed thoroughly. So I wanted to pull out each of them in turn to make sure we found all of the unintended consequences.
That is the only concern I'm aware of. But it is possible that I missed something.
Would it be helpful for me to post the
We can definitely add a unit test for this changing behavior. I should probably start by making sure it is in fact user facing with an end-to-end test. If it is, then yes a unit test is good idea. I will try and do that. |
While trying to update the tests as requested I thought of another concern. If the user has two custom registries that canonical eyes to the same thing, before this PR they would end up in separate directories in |
Theoretically yes. However I doubt somebody relies on this behavior to distinguish custom registries. It either needs to have trailing |
☔ The latest upstream changes (presumably a8f7db2) made this pull request unmergeable. Please resolve the merge conflicts. |
This comment has been minimized.
This comment has been minimized.
I was going to abandon this PR, but before I do... @weihanglo do those filenames still go through this code path at all? Where do they go through |
The code path doesn't change at all. cargo/src/cargo/core/source_id.rs Line 553 in 652623b
|
Good point! But we can just in line that. I just rebased and added a commit. The first commit changes I feel like this alleviates the relevant concerns. |
That is true. However I am forgetting the motivation of this PR. It doesn't seem to simplify the current code. Does this change open any door for future improvement? |
Separating the "must return the same hash value across versions of the code" into 1 function, even though there's more code, I do think it is a very small when for clarity. The Hash trait does not normally have that requirement, our reliance on it always feels a little awkward.
Yes, and I'm happy to post a draft PR of where this is going if that would help. The idea is that I like to separately intern the parts of SourceID that are part of
Opinions can and do vary. I am open to however you wish to proceed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, Jacob!
Update cargo 6 commits in 652623b779c88fe44afede28bf7f1c9c07812511..c86f4b3a1b153218e6e50861214b0b4b4e695f23 2024-12-20 15:44:42 +0000 to 2024-12-24 17:49:48 +0000 - fix(package): check dirtiness of path fields in manifest (rust-lang/cargo#14966) - test: make path arguments more generic and flexible (rust-lang/cargo#14979) - Moved manifest metadata tracking from fingerprint to dep info (rust-lang/cargo#14973) - fix: assure possibly blocking non-files (like FIFOs) won't be picked up for publishing. (rust-lang/cargo#14977) - simplify SourceID Hash (rust-lang/cargo#14800) - upgrade `gix` to the latest release 0.69. (rust-lang/cargo#14975)
### What does this PR try to resolve? This is a followup to #14800. Like that PR, this is a small incremental change that does not pull its own weight. If this PR is accepted, the next PR will unlock large performance wins. I am not posting them together because the logic of why this PR is correct is subtle and deserves to be discussed and reviewed without unrelated code changes. ### How should we test and review this PR? All tests pass on all commits. This **should** be reviewed one commit at a time. ### Additional information I pushed one commit at a time, so that CI can confirm that the assert (in the first commit) is never hit.
What does this PR try to resolve?
Despite being interned
SourceId::Eq
is not aptr::eq
. Which in turn is becauseSourceId
s concept of identity is a complete mess. The code relies on having to IDs that areEq
but do not have the same values for their fields.As one measure of this
SourceId
has animpl Hash
which does something different fromfn full_hash
andfn stable_hash
. SeparatelySourceIdInner
has a different implementation. Similar levels of complexity exist forEq
. Every one of theseimpl
s was added due to a real bug/issue we've had that needs to stay fixed. Not all of witch are reproducible enough to have made it into our test suite.I have some ideas for how to reorganize the types so that this is easier to reason about and faster. But given the history and the complexity I want to move extremely carefully.
How should we test and review this PR?
The test pass, and it's a one line change, but this still needs careful review.
Additional information
r? @ehuss I remember you and Alex working very hard to track down most of these bugs.