Package source priority (HPC cluster use case) #10156

Closed
pradyunsg opened this issue Jul 12, 2021 · 20 comments
Labels
type: support User Support

Comments

@pradyunsg
Member

Originally posted by @mboisson in #8606 (comment)

Here is what we can and cannot do in our environment (HPC clusters).

  1. We can't run a server. This is for HPC clusters which are not under our control. The only thing we distribute is a filesystem. We can't run long-standing processes and we don't control the network.
  2. We can't run binary (manylinux or other) wheels from PyPI for two primary reasons: missing libraries or libraries installed in non-standard locations, and non-optimized CPU instructions.
  3. We can't host all of PyPI; that's just too much, and pure-Python wheels work just fine.
  4. We can't change our requirement files or install differently, because we aren't the ones doing the installation. We support end users (researchers); we don't do their work, and they install whatever packages they need.

What we can do, and have been doing:

  1. Provide users a directory with wheels that we compile from source
  2. Define a PIP_CONFIG_FILE in the users' environments to make sure their pip install ... commands use some settings. We have been using this to point their pip to our wheel directory, which used to be preferred over PyPI. We are now also using it to put a constraint on pip < 21, since that's now broken for us.
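
To illustrate the second point, here is a minimal sketch of how such a configuration file could be generated. The paths are hypothetical, and the actual Compute Canada configuration may well differ. The `find-links` and `constraint` keys mirror pip's `--find-links` and `--constraint` command-line options; the constraints file is assumed to contain a line such as `pip<21`.

```python
import configparser
import io


def render_pip_config(wheelhouse, constraints_file):
    """Return the text of a pip configuration file (INI format)."""
    config = configparser.ConfigParser()
    # [global] applies to every pip command; find-links adds a wheel directory.
    config["global"] = {"find-links": wheelhouse}
    # [install] applies only to `pip install`.
    config["install"] = {"constraint": constraints_file}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical paths, for illustration only.
print(render_pip_config("/cluster/wheelhouse", "/cluster/config/constraints.txt"))
```

The resulting text would be saved to a file, and PIP_CONFIG_FILE in the user's environment would point at that file.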
@pradyunsg
Member Author

There was additional discussion in #8606, but I'm not moving that over for my own sanity. :)

@mboisson

Other relevant information in this comment:
#8606 (comment)

We are not trying to provide a "secure" version and exclude installing other versions from PyPI. We are trying to provide an "optimized" and "working" version which PyPI fails to do in some instances.

Our users install everything from PyPI anyway. We are not trying to make it more secure than PyPI, we are trying to make it more optimal, and to repair broken binary packages provided on PyPI.

In that use case, shadowing PyPI in some circumstances is precisely what we want and need to do.

Yes, I acknowledge that PyPI is subject to supply chain attack. But that's not what we are trying to fix.

@pradyunsg
Member Author

pradyunsg commented Jul 12, 2021

IIUC, your use case is "we have wheels that we want to use preferentially over what's on PyPI for the same versions of the package".

Since you're building these wheels yourself, can you add local version identifiers to those built wheels? Specifically those would look like 1.0.0+awesomelyfast. :)

@mboisson

I am not familiar with those, but carry on ?

Would these involve just renaming the wheel files ? Or do we need to inject some metadata inside of them ? We have over 6000 wheel files at the moment, so rebuilding them is a large endeavour.

Assuming we can, how do these impact dependency and version resolution in pip ?

@pradyunsg
Member Author

pradyunsg commented Jul 12, 2021

In that use case, shadowing PyPI in some circumstances is precisely what we want and need to do.

I guess a way to better explain why your use case broke would be -- substitute "PyPI" for "target company's package index" and hopefully you see that it models the intent of a supply chain attack.

Would these involve just renaming the wheel files ? Or do we need to inject some metadata inside of them ?

Unless I'm missing something, you'll need to do 2 things -- rename the wheels and update the version in the METADATA file inside the wheel (that's package-name.dist-info/METADATA) [1]

Assuming we can, how do these impact dependency and version resolution in pip ?

pip will preferentially use local versions of packages. If there's 1.0.0 and 1.0.0+awesomelyfast, pip will use 1.0.0+awesomelyfast.

If (somehow) that doesn't happen, that's a bug in pip that we'd need to fix. :)


To be clear, it's not that anyone thinks that your use case isn't important / worth catering to somehow, but it was extremely unclear what you're trying to do and the story came together in pieces (which can all be put together now, hopefully). That said, it is very likely that you'll need to change something to be compatible with the new way that things work -- based on ComputeCanada/software-stack#80 being filed, I'm gonna be optimistic and guess that you'd be open to changes on this front to keep things working without needing to fork things and add maintenance workload for yourself. :)


[1] The file's format is documented at https://packaging.python.org/specifications/core-metadata/ and an example of how to parse it is at https://github.com/pradyunsg/installer/blob/35ce9141733f1d3fcedfd5e19ea5e34d732fe822/src/installer/utils.py#L69.
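
The two steps above (rename the wheel, update the version in METADATA) could be scripted roughly as follows. This is only a sketch under several assumptions: the function names are hypothetical, the wheel's .dist-info directory is assumed to be named exactly like the name and version in the filename, and RECORD is regenerated because renaming the directory and editing METADATA invalidates the recorded paths and hashes. Signed wheels are not handled.

```python
import base64
import csv
import hashlib
import io
import zipfile
from pathlib import Path


def set_version(metadata_bytes, new_version):
    """Replace the Version header in a METADATA file, leaving the body alone."""
    lines = metadata_bytes.decode("utf-8").split("\n")
    for i, line in enumerate(lines):
        if not line.strip():
            break  # blank line: headers are over, the description follows
        if line.startswith("Version:"):
            lines[i] = f"Version: {new_version}"
            break
    return "\n".join(lines).encode("utf-8")


def add_local_version(wheel_path, out_dir, label):
    """Write a copy of a wheel whose version carries a +label suffix."""
    wheel_path = Path(wheel_path)
    name, version, rest = wheel_path.name.split("-", 2)
    new_version = f"{version}+{label}"
    old_prefix = f"{name}-{version}."      # covers .dist-info and .data
    new_prefix = f"{name}-{new_version}."
    record_name = f"{new_prefix}dist-info/RECORD"
    out_path = Path(out_dir) / f"{name}-{new_version}-{rest}"
    record = io.StringIO()
    rows = csv.writer(record, lineterminator="\n")

    with zipfile.ZipFile(wheel_path) as src, zipfile.ZipFile(out_path, "w") as dst:
        for info in src.infolist():
            arcname = info.filename
            if arcname.startswith(old_prefix):
                arcname = new_prefix + arcname[len(old_prefix):]
            if info.is_dir() or arcname == record_name:
                continue  # RECORD is regenerated below
            data = src.read(info)
            if arcname == f"{new_prefix}dist-info/METADATA":
                data = set_version(data, new_version)
            new_info = zipfile.ZipInfo(arcname, date_time=info.date_time)
            new_info.external_attr = info.external_attr  # keep permissions
            new_info.compress_type = zipfile.ZIP_DEFLATED
            dst.writestr(new_info, data)
            digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
            rows.writerow([arcname, "sha256=" + digest.rstrip(b"=").decode(), len(data)])
        rows.writerow([record_name, "", ""])  # RECORD lists itself without a hash
        dst.writestr(record_name, record.getvalue())
    return out_path
```

The original wheel is left untouched, so the output can be inspected (or test-installed) before replacing anything in the wheelhouse.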

@pradyunsg
Member Author

Another option is wheel build tags -- which might be an even better answer. If a wheel has a build tag, that wheel is preferred over a wheel that doesn't.

That should only require renaming the wheel files. You can read more about the filename here: https://packaging.python.org/specifications/binary-distribution-format/#file-name-convention
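
A minimal sketch of that rename, assuming the input filename has no build tag yet and follows the name-version-python-abi-platform layout from the specification linked above. The function name is hypothetical.

```python
def add_build_tag(filename, build_tag):
    """Return a wheel filename with a build tag inserted after the version."""
    if not build_tag[:1].isdigit():
        raise ValueError("a build tag must start with a digit")
    stem, ext = filename.rsplit(".", 1)
    parts = stem.split("-")
    if ext != "whl" or len(parts) != 5:
        raise ValueError(f"not an untagged wheel filename: {filename}")
    name, version, python_tag, abi_tag, platform_tag = parts
    return f"{name}-{version}-{build_tag}-{python_tag}-{abi_tag}-{platform_tag}.whl"


print(add_build_tag("package-1.0.0-cp38-none-linux_x86_64.whl", "1"))
```

Applying it to a whole directory would then be a matter of renaming each file to the returned name.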

@pradyunsg added the type: support (User Support) label Jul 12, 2021
@pradyunsg
Member Author

From #8606 (comment):

So... from the stackoverflow comment

Where's this SO answer/comment?

@mboisson

Would these involve just renaming the wheel files ? Or do we need to inject some metadata inside of them ?

Unless I'm missing something, you'll need to do 2 things -- rename the wheels and update the version in the METADATA file inside the wheel (that's package-name.dist-info/METADATA) [1]

I am hoping we don't need to hack the wheels manually (i.e. unzip them and sed the METADATA file) ?
We basically build our wheels with python setup.py bdist_wheel (after some hacking/patching/changing environment variables to use other dependencies, etc.). I guess there is an option to add a local version identifier ?

Assuming we can, how do these impact dependency and version resolution in pip ?

pip will preferentially use local versions of packages. If there's 1.0.0 and 1.0.0+awesomelyfast, pip will use 1.0.0+awesomelyfast.

Even with pip install <something>==1.0.0 ? Or with a requirement file that specifies an exact version ?
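
For reference, PEP 440 (the version scheme pip follows) says that when a `==` specifier has no local version label, the local label of candidate versions is ignored during matching, so a pin on 1.0.0 should still accept 1.0.0+awesomelyfast. A deliberately simplified sketch of that rule follows; it uses plain string comparison with no version normalization, and the function name is hypothetical.

```python
def matches_pinned(candidate_version, pinned_version):
    """Simplified `==` check following the PEP 440 rule for local labels."""
    if "+" not in pinned_version:
        # A pin without a local label ignores the candidate's local label.
        candidate_version = candidate_version.split("+", 1)[0]
    return candidate_version == pinned_version


print(matches_pinned("1.0.0+awesomelyfast", "1.0.0"))
print(matches_pinned("1.0.0", "1.0.0+awesomelyfast"))
```

The reverse does not hold: a pin that names a local label only matches candidates carrying that exact label.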

@mboisson

From #8606 (comment):

So... from the stackoverflow comment

Where's this SO answer/comment?

https://stackoverflow.com/a/67442488

@mboisson

mboisson commented Jul 12, 2021

Another option is wheel build tags -- which might be an even better answer. If a wheel has a build tag, that wheel is preferred over a wheel that doesn't.

That should only require renaming the wheel files. You can read more about the filename here: https://packaging.python.org/specifications/binary-distribution-format/#file-name-convention

Thanks. I'll need to do some testing of how it actually behaves in dependency and version resolution with the latest pip. The local version is more tempting because we could tag them with a meaningful word (computecanada in our case), rather than just a number.

It would be useful to have a description of how dependency resolution is expected to behave, because unfortunately, history has shown that we can run tests and have something that works, only to have the way it works changed in the next version of pip.

In particular, if I have, locally:

package-1.0.0+local-cp38-none-linux_x86_64.whl
package-1.0.0-cp38-none-linux_x86_64.whl
package-1.0.0-1-cp38-none-linux_x86_64.whl
package-1.0.0-2-cp38-none-linux_x86_64.whl

and PyPI has

package-1.0.0-cp38-none-linux_x86_64.whl
package-1.0.0-1-cp38-none-linux_x86_64.whl

and I run
pip install package==1.0.0 --find-links=/my/local/wheelhouse/directory
what is the expected order of priority ?

@pradyunsg
Member Author

pradyunsg commented Jul 12, 2021

The local version is more tempting because we could tag them with a meaningful word (computecanada in our case), rather than just a number.

Build tags can be 1computecanada or something like that. It's a digit + (optional) string.

Wheels on PyPI can't have build tags, IIRC. Ignoring that, it's gonna be the local version, followed by decreasing build tags followed by remote 1.0.0 followed by local 1.0.0 I think.

Nonetheless, I think the local version will be preferred over build tags. The order of preference is specified at: https://github.com/pypa/pip/blob/main/src/pip/_internal/index/package_finder.py#L530
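
The ordering described above (version first, with a local label ranking higher, then build tag) can be sketched as a sort key. This is not pip's actual code, only a simplified model: it handles plain release versions like 1.0.0, compares local labels as plain strings, ignores compatibility-tag priority (all files here share the same tags), and does not model the tie between identically named files from different sources. The function names are hypothetical.

```python
import re


def parse_wheel(filename):
    """Split a wheel filename into (version, build_tag); build_tag may be ''."""
    parts = filename[: -len(".whl")].split("-")
    version = parts[1]
    build_tag = parts[2] if len(parts) == 6 else ""
    return version, build_tag


def preference_key(filename):
    """Higher keys are preferred: version first, then build tag."""
    version, build_tag = parse_wheel(filename)
    public, _, local = version.partition("+")
    release = tuple(int(part) for part in public.split("."))
    # A version with a local label sorts above the same version without one.
    version_key = (release, local != "", local)
    # Build tags compare as (leading number, remaining text); no tag sorts lowest.
    match = re.match(r"(\d+)(.*)$", build_tag)
    tag_key = (int(match.group(1)), match.group(2)) if match else ()
    return (version_key, tag_key)


wheels = [
    "package-1.0.0+local-cp38-none-linux_x86_64.whl",
    "package-1.0.0-cp38-none-linux_x86_64.whl",
    "package-1.0.0-1-cp38-none-linux_x86_64.whl",
    "package-1.0.0-2-cp38-none-linux_x86_64.whl",
]
for name in sorted(wheels, key=preference_key, reverse=True):
    print(name)
```

Under these assumptions the +local wheel comes first, followed by build tag 2, build tag 1, and finally the untagged wheel.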

@uranusjr
Member

Wheels on PyPI can't have build tags, IIRC.

I believe build tags are possible on PyPI, although practically nobody uses them.

@mboisson

I have confirmed that either build tags or local version specifiers will work for this use case, with a slight preference for local version specifiers.

Thanks for pointing out these options!

I know I annoyed some people, and I apologize for that. I realize we come from a very different point of view, and I think both sides misunderstood the other side's point of view.

I suggest writing (and I am volunteering to write the draft) a use case description/blog article/documentation page about this, presenting my understanding of both the security use case and the priority use case, how they differ, and which solution is valid for both, hopefully with input from the devs if I get something wrong. I saw many issues (not just from me) being opened about what users misunderstood as setting an index that has priority over another one, and I genuinely think I can contribute to making it clearer for everyone.

Are the devs interested in such a contribution, and if so, what form should it take ?

@pradyunsg
Member Author

Color me very interested!

The whole "dependency confusion / package source priority" situation has suffered from fairly poor communication -- there seems to be no single place describing what the users can do and what mechanisms/tools they have at hand to deal with the situation (which leads to them asking for a "would work for me" approach -> "what do you mean it's unmaintainable" -- basically, generally a bad experience for everyone involved). In other words, I think we would benefit a lot from this being documented clearly in pip's documentation. :)

What we do to get there though, depends on how you'd prefer to do it. I personally prefer just having you write a blog post somewhere -- it avoids the overhead of needing to go through multiple rounds of review before adding something into pip's documentation. :)

  • a PR to pip's documentation would be nice, although this will need some back and forth about how exactly to do this -- we don't really have usecase-based docs clearly provided in the documentation today and there's no good place to put it right now. This is something I'm chipping away at in Major changes to the documentation #9475. :)
  • you can post this content in a blog post somewhere and share it with us -- one of our contributors (or most likely, I) can then pull that into pip's documentation with appropriate changes, to adapt it for the different style (blog post vs prose-heavy documentation). This basically means that instead of reviewing a PR + back-and-forth on it, we'd need to copy-paste + copy-edit the content. OTOH, it allows the blog post to stand on its own being helpful right now (eg: we can just post a link to that in the earlier issue as well as on the SO pages, and possibly link to it from relevant sections of pip's documentation).

@mboisson

Ok. I posted this here:
https://github.com/ComputeCanada/software-stack/blob/main/pip-which-version.md

Feedback is welcome, especially if I got things wrong.

Hopefully this post can be useful to others in order to untangle some misconceptions.

@mboisson

mboisson commented Jul 14, 2021

After meandering through the way pip decides which package to install and understanding the security point of view, I would recommend the following: considering that all indexes are treated equally, if pip figures out that two Python packages with the same labels are available from two different indexes, it should issue a warning about it (with a way to disable the warning, assuming the risk), with pointers to alternative solutions.

@uranusjr
Copy link
Member

By two packages with the same label, do you mean e.g. there are two pages for package foo, or there are two entries for file foo-1.0-py3-none-any.whl? Because the former is an absolutely valid strategy to support architectures that PyPI doesn’t and potentially can’t host, and is used by many projects like piwheels and pytorch. Showing a warning would cause great disruption to those projects (because unassuming users would bug them thinking they are at fault), who have been meticulously following community guidelines and doing things “right”. The latter is likely harmless, but probably also not very useful?

@mboisson

I meant if there are two entries for a given file. Or more generally, if there are two entries which have the same priority as far as pip goes, i.e. the behavior of which one gets installed is undefined and could change over time.

That is ubiquitous if you build wheels yourself and don't use a build-tag or local version. It will definitely yield wheels that have the same name as the versions available on PyPI.

@uranusjr
Member

Sounds reasonable to me. We can also try to squelch the warning when the files have identical hashes (obtained from the index page or by directly hashing the file on the file system). I’d very much welcome a PR on this.
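
The check being discussed could look roughly like the sketch below. It is not an implementation inside pip; the function name and the (filename, source, hash) tuple shape are assumptions made for illustration. A file is flagged only when it is offered by more than one source and the hashes differ, so identical copies stay silent.

```python
from collections import defaultdict


def ambiguous_files(candidates):
    """Return filenames offered by several sources with differing hashes.

    `candidates` is an iterable of (filename, source, sha256_hex) tuples.
    """
    hashes_by_file = defaultdict(set)
    sources_by_file = defaultdict(set)
    for filename, source, sha256_hex in candidates:
        hashes_by_file[filename].add(sha256_hex)
        sources_by_file[filename].add(source)
    return sorted(
        name
        for name in hashes_by_file
        if len(sources_by_file[name]) > 1 and len(hashes_by_file[name]) > 1
    )


# Hypothetical data: one rebuilt wheel differs from PyPI, one is an exact copy.
found = ambiguous_files([
    ("package-1.0.0-cp38-none-linux_x86_64.whl", "pypi", "aaa"),
    ("package-1.0.0-cp38-none-linux_x86_64.whl", "wheelhouse", "bbb"),
    ("other-2.0-py3-none-any.whl", "pypi", "ccc"),
    ("other-2.0-py3-none-any.whl", "wheelhouse", "ccc"),
])
print(found)
```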

@pradyunsg
Member Author

Closing this out, since this seems to have been resolved.

@github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2021