Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate VCS urls in hash-checking mode using their commit hashes #6469

Open
jcushman opened this issue May 6, 2019 · 26 comments · May be fixed by #11968
Open

Validate VCS urls in hash-checking mode using their commit hashes #6469

jcushman opened this issue May 6, 2019 · 26 comments · May be fixed by #11968
Assignees
Labels
C: vcs pip's interaction with version control systems like git, svn and bzr state: needs discussion This needs some more discussion type: enhancement Improvements to functionality

Comments

@jcushman
Copy link

jcushman commented May 6, 2019

What's the problem this feature will solve?

It is often useful to temporarily switch a requirement to a VCS URL, for example to refer to a pull request in a fork prior to release, and then switch back once the feature is released.

This practice is incompatible with --require-hashes, however. If any requirements are hashed, then the VCS URL requirement is rejected with "The editable requirement cannot be installed when requiring hashes, because there is no single file to hash" or "Can't verify hashes for these requirements because we don't have a way to hash version control repositories."

This makes it more difficult to use hashes, which ultimately discourages secure pip use.

Describe the solution you'd like

Many VCS URLs contain sha1 hashes in the refspec. (sha1 is not a great hash -- more discussion below.) A requirement like git+https://github.com/requests/requests.git@e52932c427438c30c3600a690fb8093a1d643ef3#egg=requests could be accepted with --require-hashes as long as the commit validates.

Support on the pip side would help to encourage use of hashes by pip-tools users. A user could include git+https://github.com/requests/requests.git@master#egg in requirements.in, and use pip-compile --generate-hashes to write a pinned VCS URL along with other hashed requirements to requirements.txt. Currently pip-compile --generate-hashes necessarily generates an uninstallable requirements.txt with VCS URLs, which discourages use of --generate-hashes.

Alternative Solutions

  • Don't require hashes for VCS URLs, on the grounds that some hashes are better than none. (I believe this is the approach taken by pipenv, though I haven't verified.) This still checks hashes for most requirements, and the security risk of allowing some unhashed URLs is "attacker has compromised communications with a specific VCS server", which for many users would not significantly change their risk profile.

  • (Provide option to disable hash-checking #4344) Same as above, but as an opt-in flag, either with a global flag or a --hash=skip flag per line.

  • Keep VCS URLs in a separate requirements file, as recommended on editable cannot be installed when requiring hashes #4995. This adds complexity to the install process without any additional security over the flag option.

  • Some VCS URLs can be converted to artifact URLs, like https://github.com/requests/requests/archive/e52932c427438c30c3600a690fb8093a1d643ef3.zip#egg=requests, which allows them to be hashed. This doesn't work for other packages, however, such as those that use setuptools_scm to set their version. For example, pip install https://github.com/jazzband/pip-tools/archive/f97e62ecb0d9b70965c8eff952c001d8e2722e94.zip will fail with "setuptools-scm was unable to detect version".

Additional context

This feature request was discussed when --require-hashes was first added and ultimately rejected by the author because of the insecurity of sha1:

@jezdez on Oct 8, 2015 Contributor
I was wondering about that, wouldn't some VCSs at least allow providing a hash for a repo? Is that out of scope and need to be added at some point? Or just a silly idea?

@erikrose on Oct 8, 2015 Author Contributor
So yes, this is something I thought we might add in the future: pip would recognize hash-based git refspecs as okay for hash-checking mode. It would run git fsck over them to make sure the hashes really match. (git doesn't check them implicitly, at least in older versions, allowing a malicious git server to pass off whatever it wants.) My only reservation is that SHA1 isn't considered a very collision-resistant hash anymore.

@sigmavirus24 on Oct 12, 2015 Member
git fsck is good except that sometimes genuine commits are bad. requests has a commit that was accepted through a PR and which has (much later) been determined to be invalid by git fsck. There's nothing we can do now without rewriting all of the history since that commit if we were to fix it.

@erikrose
erikrose on Oct 12, 2015 Author Contributor
I assume later commits fsck fine, correct? I'd be willing to accept that broken commits can't be used as hash-checked VCS checkouts by pip. But the point is moot because SHA1 will probably be cost-effectively breakable in a few years.

Since then, there's been a lot more discussion of the risks of using sha1 for refspecs following the ShAttered attack (where Google paid to create a PDF hash collision, and could have done the same for a git commit).

After that attack:

In short, a sha1 collision attack requires attackers to first issue a specially-crafted, suspicious benign commit, at great expense, and then replace it with a malicious one. So a security model that is concerned with sha1 attacks assumes that an attacker has already crafted a seemingly-benign commit and had it distributed -- by which point, the Mercurial page argues, there are cheaper, easier, and more deniable ways to succeed in the attacker's aim.

That argument is certainly debatable! So I'm hoping to discuss it here: would it be worth counting sha1 refspecs as "hashed" for now, in order to encourage use of pip install --require-hashes and pip-compile --generate-hashes?

@erikrose
Copy link
Contributor

I think you (and others) make a good case that stonewalling SHA1 VCS URLs as unhashable is counterproductive in a lot of practical cases, like when using secure internal git servers. The obvious options I see are…

  • Just letting them through. Not really my favorite, as it downgrades security silently. But maybe no users care.
  • Adding an --almost-require-hashes option or similar. Seems okay on the face of it.

However, I will put this here for everyone's evaluation. Seems like SHA1 collisions are about to get another notch cheaper: https://www.zdnet.com/article/sha-1-collision-attacks-are-now-actually-practical-and-a-looming-danger/.

In my ideal world, git would release SHA256 support, and we'd have no reservations at all.

@jcushman
Copy link
Author

@erikrose:

Adding an --almost-require-hashes option or similar. Seems okay on the face of it.

How about a special --hash value, like:

# accept the hash value provided in the VCS url:
git+https://github.com/requests/requests.git@e52932c427438c30c3600a690fb8093a1d643ef3#egg=requests --hash=vcs

or

# perform no hash validation on this line:
git+https://localhost:9000/requests/requests.git@e52932c427438c30c3600a690fb8093a1d643ef3#egg=requests --hash=skip

@erikrose
Copy link
Contributor

That also occurred to me, but I thought it'd be best to make it clear to the runner of pip rather than just the author of the requirements file. But, now that you mention it, a package author doesn't (ideally) include hashes, as that would frustrate downstream integrators. So it's fine.

@funtikas25
Copy link

hello pleas help what the hash is hier ???????
[18:21:27] [WARNING] there has been a problem while writing to the session file ('OperationalError: attempt to write a readonly database')
[18:21:30] [INFO] retrieved: 07b1e0439ba077bc7bdfbbf2341b1d1f:dc
[18:21:30] [DEBUG] performed 249 queries in 380.19 seconds
[18:21:30] [INFO] retrieving the length of query output
[18:21:30] [INFO] retrieved: 21
[18:23:18] [DEBUG] performed 10 queries in 107.48 seconds
[18:23:18] [DEBUG] starting 8 threads
[18:24:14] [INFO] retrieved: o_i________________ 2/21 (9%)

@pradyunsg pradyunsg added the S: needs triage Issues/PRs that need to be triaged label Jun 22, 2019
@awbirdsall
Copy link

I've run into this same problem. I just want to point out another limitation with using an artifact url like https://github.com/requests/requests/archive/e52932c427438c30c3600a690fb8093a1d643ef3.zip#egg=requests is, as far as I can tell, there isn't good support by pip/pip-tools for working with this style url for a private repo. (jazzband/pip-tools#947 (comment))

@chrahunt chrahunt added C: vcs pip's interaction with version control systems like git, svn and bzr state: needs discussion This needs some more discussion type: enhancement Improvements to functionality labels Dec 8, 2019
@triage-new-issues triage-new-issues bot removed the S: needs triage Issues/PRs that need to be triaged label Dec 8, 2019
@sbidoul
Copy link
Member

sbidoul commented Jan 17, 2020

I would be in favor of having --require-hashes accept such VCS URLs too.

In #6851 I added is_immutable_rev_checkout that could help implementing this.

@lcarva
Copy link

lcarva commented Jun 25, 2020

Why not just hash the whole VCS directory and use whatever hash algorithm the user wants?

This would probably have to exclude things like the .git directory which may(?) contain different files depending on the version of git used.

@mitar
Copy link

mitar commented Sep 18, 2020

Yes, let's just fix this in a backwards compatible way by start supporting a hash of a VCS checked out directory. I do not see why it is so hard to recursively compute hash of a directory in a deterministic way and this is it.

@uranusjr
Copy link
Member

It is not hard to compute a directory’s hash, but to compute reliably. .git is already mentioned as a problem, and there are many other platform- and tool-specific files that can also affect this (.DS_Store comes to mind), and pip would be stuck in maintenance hell if it starts adding those specific rules for obscure edge cases. Checking against commit ID is much more practical approach IMO.

@uranusjr
Copy link
Member

Another way to do this is to stick with only checking artifact hashes. There has been talks about pip should build an sdist for source tree installations (to solve unrelated issues), and pip can use the built sdist’s hash here. That would be consistent.

@mitar
Copy link

mitar commented Sep 18, 2020

Commit ID sadly is not complete. For example if you use git LFS to pull some data but that is not installed on the target system. Then git LFS will not pull data and your checkout of the repository will not really contain expected files.

I do not think it is maintenance hell. You do not compute hash on the file system, you compute it on git's view of the filesystem. So, .DS_Store should be in .gitignore and this is why it is not included in the hash. If it is not in .gitignore and it is present but on the source system it was not, then hash SHOULD fail. This is why it is there. To assure exact directory reproduction.

And the way to achieve this is simply to call git archive | sha256sum and this is it.

There has been talks about pip should build an sdist for source tree installations (to solve unrelated issues), and pip can use the built sdist’s hash here. That would be consistent.

+1 on that one. Yes, hashes should probably not be computed on repository directory, but on what is being seen as source to be installed from.

@erikrose
Copy link
Contributor

In reply to @uranusjr, that's been brought up before, but we'd have to do some work to make sure the building of sdists are deterministic. For starters, modification timestamps make their way into tarballs.

@zyv
Copy link

zyv commented Oct 16, 2020

This is preventing us in moneymeets/python-poetry-buildpack#19 to enable hashes - people use private repositories as requirements a lot for deployment - I'm not really concerned about SHA1 weaknesses, as it seems to me, that not using hashes at all is less secure as using SHA1 for private VCS dependencies, but I like the idea of git archive | sha256. Isn't it a perfect solution that alleviates all previous concerns (security, reproducibility, maintenance burden)?

@aleksanb
Copy link

aleksanb commented Nov 3, 2021

We would like to migrate away from Pipenv to pip-compile from pip-tools with --generate-hashes which uses normal pip under the hood, but have a number of git dependencies due to forking libraries and making custom patches.
Those git dependencies prevent us from enabling --generate-hashes, without doing some of the acrobatics described in the initial post.

Has anything changed since the issue was created two years ago that could allow for this feature to move forward now?
The current situation is unfortunate, especially compared to other package managers like cargo and yarn that just support git deps out of the box together with version locking.

@pfmoore
Copy link
Member

pfmoore commented Nov 3, 2021

Not that I'm aware of. It would need someone to come up with a proposal and PR, and so far no-one has been sufficiently motivated to do so.

@sbidoul
Copy link
Member

sbidoul commented Nov 3, 2021

I'd personally be happy to review a PR doing that.

I think it will involve our existing is_immutable_rev_checkout() function to reject branches that look like hashes.

Also VCS references that point an immutable ref cause wheels built from them to be cached, and subsequently reused. So the question of whether the pip wheel cache can be trusted might come up, as we can't hash check wheels coming from there.

@jcushman
Copy link
Author

jcushman commented Nov 3, 2021

Seems like handling all the edge cases has made this bug prohibitive for anyone to take on. Given how long it's been open, I wonder if anyone is up for trying a simpler PR that just supports the per-requirement opt-out I suggested up-thread:

git+https://localhost:9000/requests/requests.git@e52932c427438c30c3600a690fb8093a1d643ef3#egg=requests --hash=skip

I imagine that's a much simpler PR to write, and it lets projects move forward that are better off with 99% hashing than none (in particular because the requirements where they're skipping hashing are likely to be ones they control). [Edit: and it doesn't interfere with a more complete solution if-and-when someone is able to write one.]

@aleksanb
Copy link

aleksanb commented Nov 4, 2021

I imagine that's a much simpler PR to write, and it lets projects move forward that are better off with 99% hashing than none (in particular because the requirements where they're skipping hashing are likely to be ones they control). [Edit: and it doesn't interfere with a more complete solution if-and-when someone is able to write one.]

This is our case! Our git deps are mainly for forked repos where we have a handful of patches. Right now i worked around this restriction by adding patchfiles directly to our project, and git apply'ing them into our virtualenv after install has completed.

ahal added a commit to ahal/taskgraph that referenced this issue May 3, 2022
Currently when installing pip dependencies via the
`<REPO>_PIP_REQUIREMENTS` environment variable, `run-task` explicitly
passes in the `--require-hashes` option. This makes it difficult to test
against pre-release versions of Taskgraph since VCS urls do not support
hashes (pypa/pip#6469).

Luckily when `pip` encounters a requirements with `--hash`, the
`--require-hashes` flag is implied. This means that if `run-task`
doesn't explicitly pass it in, we'll be able to test against pre-release
versions of taskgraph by adding the git URL and then simply *not* using
the `--generate-hashes` flag in `pip-compile`. But in production we'd
still get hash verification.

This does loosen restrictions a bit, since previously a repo would have
no choice other than to use hashes. But now they could opt not to.
Personally I feel like this is a trade-off that should be left to repo
owners anyway.
ahal added a commit to ahal/taskgraph that referenced this issue May 3, 2022
Currently when installing pip dependencies via the
`<REPO>_PIP_REQUIREMENTS` environment variable, `run-task` explicitly
passes in the `--require-hashes` option. This makes it difficult to test
against pre-release versions of Taskgraph since VCS urls do not support
hashes (pypa/pip#6469).

Luckily when `pip` encounters a requirements with `--hash`, the
`--require-hashes` flag is implied. This means that if `run-task`
doesn't explicitly pass it in, we'll be able to test against pre-release
versions of taskgraph by adding the git URL and then simply *not* using
the `--generate-hashes` flag in `pip-compile`. But in production we'd
still get hash verification.

This does loosen restrictions a bit, since previously a repo would have
no choice other than to use hashes. But now they could opt not to.
Personally I feel like this is a trade-off that should be left to repo
owners anyway.
ahal added a commit to ahal/taskgraph that referenced this issue May 5, 2022
Currently when installing pip dependencies via the
`<REPO>_PIP_REQUIREMENTS` environment variable, `run-task` explicitly
passes in the `--require-hashes` option. This makes it difficult to test
against pre-release versions of Taskgraph since VCS urls do not support
hashes (pypa/pip#6469).

To work around this, a `PIP_DISABLE_REQUIRE_HASHES` environment variable
is being introduced. When set, `run-task` will not pass down
`--require-hashes` to the `pip install` command. Allowing consumers to
test against URL specifiers.
ahal added a commit to taskcluster/taskgraph that referenced this issue May 5, 2022
Currently when installing pip dependencies via the
`<REPO>_PIP_REQUIREMENTS` environment variable, `run-task` explicitly
passes in the `--require-hashes` option. This makes it difficult to test
against pre-release versions of Taskgraph since VCS urls do not support
hashes (pypa/pip#6469).

To work around this, a `PIP_DISABLE_REQUIRE_HASHES` environment variable
is being introduced. When set, `run-task` will not pass down
`--require-hashes` to the `pip install` command. Allowing consumers to
test against URL specifiers.
@4thel00z
Copy link

workaround:

  1. Save your vc deps in a separate file without --hash
  2. pip install -r vcs-reqs.txt; pip install -r reqs.txt

@davidxia
Copy link

Do I have to run one pip install on multiple requirement files to get a consistent dependency set? If so, that workaround can result in an inconsistent dependency set, right?

@sbidoul
Copy link
Member

sbidoul commented Apr 13, 2023

I plan to work on this for pip 23.2 (July). My plan is to accept VCS URLs which point to immutable (git) commit hashes when pip is in --require-hashes mode.

Granted git's sha1 may be considered weaker than the sha256 hashes we require for archives. On the other hand I think rejecting it as it is now forces users of VCS URLs into a more insecure situation than they deserve as they can use hash validation at all for other requirements.

If this is controversial I'm happy to enable this new behaviour with an option.

@sbidoul sbidoul added this to the 23.2 milestone Apr 13, 2023
@sbidoul sbidoul self-assigned this Apr 13, 2023
@uranusjr
Copy link
Member

How does this apply to other VCS?

@sbidoul
Copy link
Member

sbidoul commented Apr 14, 2023

How does this apply to other VCS?

I plan to rely on our is_immutable_rev_checkout method, which is currently used to determine if a wheel built from a VCS URL can be cached. It is currently implemented only for git, but at soon as it gets implemented for other VCS they will immediately benefit from it. VCS that do not support the concept (such as SVN) will continue to cause an error in this situation.

@sbidoul
Copy link
Member

sbidoul commented Apr 16, 2023

A draft PR is at #11968. I'll very much welcome tests by interested users.

@pfmoore pfmoore removed this from the 23.2 milestone Jun 25, 2023
@uranusjr
Copy link
Member

Please put comments on a specific pull request in the pull request thread instead. Commenting here may cause confusion in the future if more than one pull request is associated with the issue.

@ncoghlan
Copy link
Member

ncoghlan commented Dec 9, 2024

Just noting another problem with the "hash and build the exported source archive" approach: doing that produces incorrect package version metadata if the sdist creation process relies on VCS metadata to generate the version number (e.g. versioneer, setuptools-scm).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: vcs pip's interaction with version control systems like git, svn and bzr state: needs discussion This needs some more discussion type: enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.