Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify SourceID Hash #14800

Merged
merged 2 commits into from
Dec 23, 2024
Merged

simplify SourceID Hash #14800

merged 2 commits into from
Dec 23, 2024

Conversation

Eh2406
Copy link
Contributor

@Eh2406 Eh2406 commented Nov 8, 2024

What does this PR try to resolve?

Despite being interned SourceId::Eq is not a ptr::eq. Which in turn is because SourceIds concept of identity is a complete mess. The code relies on having to IDs that are Eq but do not have the same values for their fields.

As one measure of this SourceId has an impl Hash which does something different from fn full_hash and fn stable_hash. Separately SourceIdInner has a different implementation. Similar levels of complexity exist for Eq. Every one of these impls was added due to a real bug/issue we've had that needs to stay fixed. Not all of witch are reproducible enough to have made it into our test suite.

I have some ideas for how to reorganize the types so that this is easier to reason about and faster. But given the history and the complexity I want to move extremely carefully.

How should we test and review this PR?

The test pass, and it's a one line change, but this still needs careful review.

Additional information

r? @ehuss I remember you and Alex working very hard to track down most of these bugs.

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 8, 2024
@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 8, 2024

If this turns out not to be a valid transformation I am happy to update the PR to be tests to prove why it's not okay.

Comment on lines -676 to 679
_ => self.inner.url.as_str().hash(into),
}
self.inner.canonical_url.hash(into);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most CanonicalUrl differences are related to git somehow. I'm assuming that means we can't hit them in any other way. There is the stripping of a trailing /. However, we probably should have been respecting that in the hash in the first place I assume. The question is if this will change anything significant enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if you have a index hosted on git, then GITHUB.com/me/my-index.git/ could change a lot. I think the biggest concern here is having to recheck out dependencies. So even if this does affect people we might just eat that cost. On the other hand there's a lot of complexity here and maybe there are bigger problems I forgot about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't yet looked closer. I wonder if this would affect hacks like this one: #5478 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would be affected by this change. But it is something we have to watch for as I "clean up" this code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, since [patch.<url>] uses URLs directly, not SourceId.

This change may affect anywhere we let people define URL and parse it as SourceId, like the [registries] table? We could also deem this as a bug and fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that we rely on Eq for those cases and not Hash. So that the problem you identified is not a problem with this PR, rather one to investigate on the next PR that changes Eq.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Have checked other usages of it, none of them seems user-facing like the path to index files. It should be fine to move over.

Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the paths to crates.io Git and spare index on users disks are not affected by this change, you have my vote.

@ehuss
Copy link
Contributor

ehuss commented Nov 12, 2024

To be clear, the goal here is to simplify the code a little bit? Are there any other intended improvements?

And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in ~/.cargo/registry/*/..., and duplicated cache contents?

If that's the tradeoff, I'm a little on the fence and could go either way.

If I'm understanding it correctly, would it be possible to update the test_stable_hash test with an example that illustrates the hash changing with this PR?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 12, 2024

To be clear, the goal here is to simplify the code a little bit? Are there any other intended improvements?

From this PR on its own, no other improvements. On its own, this is a small simplification that does not pull its weight. I have spiked out a larger re-factor that passes tests and that IMO does make things clearer and faster. It relies on several small changes like this that really need to be discussed thoroughly. So I wanted to pull out each of them in turn to make sure we found all of the unintended consequences.

And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in ~/.cargo/registry/*/..., and duplicated cache contents?

That is the only concern I'm aware of. But it is possible that I missed something.

If that's the tradeoff, I'm a little on the fence and could go either way.

Would it be helpful for me to post the full current spike as a draft PR to share where this is going?

If I'm understanding it correctly, would it be possible to update the test_stable_hash test with an example that illustrates the hash changing with this PR?

We can definitely add a unit test for this changing behavior. I should probably start by making sure it is in fact user facing with an end-to-end test. If it is, then yes a unit test is good idea. I will try and do that.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 14, 2024

And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in ~/.cargo/registry/*/..., and duplicated cache contents?

That is the only concern I'm aware of. But it is possible that I missed something.

While trying to update the tests as requested I thought of another concern. If the user has two custom registries that canonical eyes to the same thing, before this PR they would end up in separate directories in ~/.cargo/registry/*/... and now they would end up in the same folder. With this collision cause problems?

@weihanglo
Copy link
Member

, before this PR they would end up in separate directories in ~/.cargo/registry/*/... and now they would end up in the same folder. With this collision cause problems?

Theoretically yes. However I doubt somebody relies on this behavior to distinguish custom registries. It either needs to have trailing / or .git. or on gitub.com with protocol scheme other than https. We could notify interest groups in advance. Not sure what is the correct channel, t-cargo/build-integration doesn't seem to be the best fit.

@bors
Copy link
Contributor

bors commented Nov 15, 2024

☔ The latest upstream changes (presumably a8f7db2) made this pull request unmergeable. Please resolve the merge conflicts.

@rustbot

This comment has been minimized.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Dec 20, 2024

I was going to abandon this PR, but before I do... @weihanglo do those filenames still go through this code path at all? Where do they go through stable_hash?

@weihanglo
Copy link
Member

I was going to abandon this PR, but before I do... @weihanglo do those filenames still go through this code path at all? Where do they go through stable_hash?

The code path doesn't change at all. stable_hash still relies on Hash impl for SourceId

self.hash(into)

@Eh2406
Copy link
Contributor Author

Eh2406 commented Dec 20, 2024

Good point! But we can just in line that. I just rebased and added a commit. The first commit changes stable_hash to not depend on hash but to have its own copy of the code. Then the second commit simplifies the main hash.

I feel like this alleviates the relevant concerns.

@weihanglo
Copy link
Member

Good point! But we can just in line that. […] I feel like this alleviates the relevant concerns.

That is true. However I am forgetting the motivation of this PR. It doesn't seem to simplify the current code. Does this change open any door for future improvement?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Dec 23, 2024

It doesn't seem to simplify the current code.

Separating the "must return the same hash value across versions of the code" into 1 function, even though there's more code, I do think it is a very small when for clarity. The Hash trait does not normally have that requirement, our reliance on it always feels a little awkward.

I am forgetting the motivation of this PR. [..] Does this change open any door for future improvement?

Yes, and I'm happy to post a draft PR of where this is going if that would help. The idea is that I like to separately intern the parts of SourceID that are part of Eq from the parts that are mutable. In order for that to be effective, I need to minimize the parts that are relevant to Eq. But, Hash == Hash must imply Eq == Eq so in order to simplify Eq I needed to start by simplifying Hash. I think the three changes deserve their own PRs is because:

  • this PR, the change to Hash, is mostly worried about the stable hash interactions.
  • the next PR, the change to Eq, It is mostly concerned with the interactions between the interning strategy and its implications on the rest of the code.
  • the last PR, changes to interning, are mostly concerned about performance.
    If I did it in one PR, I'd be very concerned that one or more of these concerns might get lost in the other conversations.

Opinions can and do vary. I am open to however you wish to proceed.

Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, Jacob!

@weihanglo weihanglo added this pull request to the merge queue Dec 23, 2024
Merged via the queue into rust-lang:master with commit 0c157e0 Dec 23, 2024
20 checks passed
@Eh2406 Eh2406 deleted the HashSourceId branch December 24, 2024 17:40
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 25, 2024
Update cargo

6 commits in 652623b779c88fe44afede28bf7f1c9c07812511..c86f4b3a1b153218e6e50861214b0b4b4e695f23
2024-12-20 15:44:42 +0000 to 2024-12-24 17:49:48 +0000
- fix(package): check dirtiness of path fields in manifest (rust-lang/cargo#14966)
- test: make path arguments more generic and flexible (rust-lang/cargo#14979)
- Moved manifest metadata tracking from fingerprint to dep info (rust-lang/cargo#14973)
- fix: assure possibly blocking non-files (like FIFOs) won't be picked up for publishing. (rust-lang/cargo#14977)
- simplify SourceID Hash (rust-lang/cargo#14800)
- upgrade `gix` to the latest release 0.69. (rust-lang/cargo#14975)
@rustbot rustbot added this to the 1.85.0 milestone Dec 25, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jan 7, 2025
### What does this PR try to resolve?

This is a followup to #14800. Like that PR, this is a small incremental
change that does not pull its own weight. If this PR is accepted, the
next PR will unlock large performance wins. I am not posting them
together because the logic of why this PR is correct is subtle and
deserves to be discussed and reviewed without unrelated code changes.

### How should we test and review this PR?

All tests pass on all commits. This **should** be reviewed one commit at
a time.

### Additional information

I pushed one commit at a time, so that CI can confirm that the assert
(in the first commit) is never hit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants