Unable to set up pulp-python sync from JFrog local pypi repository #669

Open
grzleadams opened this issue May 14, 2024 · 19 comments
@grzleadams

Version
Deployed via Operator:

{
  "versions": [
    {
      "component": "core",
      "version": "3.49.1",
      "package": "pulpcore",
      "module": "pulpcore.app",
      "domain_compatible": true
    },
    {
      "component": "python",
      "version": "3.11.0",
      "package": "pulp-python",
      "module": "pulp_python.app",
      "domain_compatible": false
    },

Describe the bug
I set up a Pulp python remote pointing at a local JFrog pypi repository ("url": "https://<redacted>/artifactory/api/pypi/pypi-local/simple") with a valid username and password, and linked it to a Pulp python repository and distribution. However, during the sync it appears that either the credentials are not being sent with the requests or the request URL is malformed. From the JFrog logs (note the non_authenticated_user and the 401):

2024-05-14T16:33:15.472Z|<redacted>|<redacted>|non_authenticated_user|GET|/api/pypi/pypi-local/simple/pypi/<redacted>/json|401|-1|0|1|bandersnatch/6.1.0 (cpython 3.9.18-final0, Linux x86_64) (aiohttp 3.9.3)

For what it's worth, that URL also looks strange... I would expect .../simple/<redacted>/json, not .../simple/pypi/<redacted>/json. It's worth noting that Artifactory requires the username/password to be included in the URL but Pulp prevents that:

Error: {"url":["The remote url contains username or password. Please use remote username or password instead."]}
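The error above asks for the credentials to be moved out of the URL and into the remote's username and password fields. As a minimal sketch of that split (not Pulp's own validation code; the host name and credentials are made-up examples), the standard library can separate the two:

```python
from urllib.parse import urlsplit, urlunsplit


def split_credentials(url):
    """Return (url_without_userinfo, username, password).

    Illustrative helper only. IPv6 hosts are not handled, and the
    username/password are returned exactly as written in the URL
    (no percent-decoding is applied).
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return clean, parts.username, parts.password


# Hypothetical host and credentials, for illustration only.
clean_url, user, password = split_credentials(
    "https://alice:s3cret@artifactory.example.com/artifactory/api/pypi/pypi-local/"
)
```

The three returned values correspond to the remote's url, username and password fields respectively.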

To Reproduce
Steps to reproduce the behavior:

  1. Create a local pypi repository in JFrog.
  2. Create a Pulp remote pointing at the JFrog pypi repository (and then set up a Pulp repository and distribution as needed to sync it).
  3. Attempt an authenticated sync.

Expected behavior
The sync should happen successfully.

Additional context
N/A

@gerrod3 gerrod3 transferred this issue from pulp/pulpcore May 16, 2024
@gerrod3
Contributor

gerrod3 commented May 16, 2024

I moved the issue to the pulp_python repository since the issue is with this plugin.

I would try removing the simple/ part from the remote url and retrying the sync. I'm not entirely sure how JFrog sets up their pypi repository, but assuming that https://<host>/artifactory/pypi/pypi-local/ is the base index page for your repository, this should be the url you use for your remote.

It is still possible we might not fully support authenticated syncs, at least through normal use of the remote's username and password fields. If you are still getting 401s, try using the https://<username>:<password>@<host>/.../ url form.

@grzleadams
Author

grzleadams commented May 16, 2024

Unfortunately, I did try removing simple/ (and a bunch of other permutations of the URL) and nothing worked. I did try to set the URL to include the credentials but Pulp wouldn't let me (I get the url contains username or password error I mentioned before). Is there a way to work around that/set it directly on the remote without using the CLI/API (I assume the validation happens either way)?

@gerrod3
Contributor

gerrod3 commented May 16, 2024

If you want to directly set the url on the remote without validation, you can do it through the shell. On the pulp instance, run pulpcore-manager shell_plus; this should bring up a Python shell with some classes already imported. Try:

py_remote = Remote.objects.get(name="your_python_remote_name")
py_remote.url = "https://<username>:<password>@<host>/artifactory/pypi/pypi-local/"
py_remote.save()

This should bypass the validation done through the API.
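One caveat when putting credentials into a URL: characters such as @, : or / in the username or password have to be percent-encoded, otherwise the URL parses incorrectly. A minimal sketch of building such a URL with the standard library follows. The helper name, host and credentials are hypothetical, and whether Pulp's downloader decodes the encoded form as expected was not verified in this thread.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(url, username, password):
    """Embed percent-encoded credentials into a URL.

    Assumes `url` does not already contain a userinfo part.
    """
    parts = urlsplit(url)
    # safe="" makes quote() encode every reserved character, including "/" and "@".
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical values, for illustration only.
remote_url = with_credentials(
    "https://artifactory.example.com/artifactory/api/pypi/pypi-local/",
    "alice",
    "p@ss/word",
)
```

The resulting string could then be assigned to py_remote.url as in the snippet above.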

@grzleadams
Author

Is shell_plus available in 3.49.1/the minimal image?

Unknown command: 'shell_plus'. Did you mean shell?

@gerrod3
Contributor

gerrod3 commented May 16, 2024

It might not be. Instead use its suggestion pulpcore-manager shell and then add this line to the top: from pulpcore.app.models import Remote.

@grzleadams
Author

Setting the credentials in the URL seems to have worked, so we're not getting the unauthenticated user business anymore.

2024-05-16T18:01:01.122Z|<thread_id>|<ipaddress>|<authenticated_user>|GET|/api/pypi/pypi-local/simple/pypi/<module_name>/json|404|-1|0|3|bandersnatch/6.1.0 (cpython 3.9.18-final0, Linux x86_64) (aiohttp 3.9.3)
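The request-log lines quoted in this thread are pipe-delimited, so the user, path and status can be pulled out mechanically when comparing many requests. The sketch below assumes the field order visible in the quoted lines (timestamp, thread id, IP address, user, method, path, status, then three numeric fields whose meaning is not stated here, then the user agent); the field names and sample values are illustrative, not taken from JFrog documentation.

```python
# Names for the leading fields, inferred from the log lines quoted in this thread.
FIELDS = ("timestamp", "thread_id", "ip_address", "user", "method", "path", "status")


def parse_request_log(line):
    """Parse one pipe-delimited request-log line into a dict."""
    values = line.rstrip("\n").split("|")
    record = dict(zip(FIELDS, values))
    record["status"] = int(record["status"])
    # The user agent is the last field; the numeric fields before it are skipped.
    record["user_agent"] = values[-1]
    return record


# Sample line with placeholder thread id, IP address and package name.
sample = (
    "2024-05-14T16:33:15.472Z|trace|10.0.0.1|non_authenticated_user|GET|"
    "/api/pypi/pypi-local/simple/pypi/example/json|401|-1|0|1|"
    "bandersnatch/6.1.0 (cpython 3.9.18-final0, Linux x86_64) (aiohttp 3.9.3)"
)
record = parse_request_log(sample)
```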

The 404 appears to be related to both simple/pypi and /json; fixing both gives an HTML response that lists all available module versions. Are those two things required by the PyPI API spec?

@gerrod3
Contributor

gerrod3 commented May 16, 2024

Both /simple/ and /pypi/<package_name>/json are PyPI APIs. /simple/ is used by pip for package installs and /pypi/* is used by bandersnatch (the tool Pulp uses under the hood) for syncing. When specifying the url for syncing you should only use the base-url of your index, no /simple/ or /pypi/* as bandersnatch will add the /pypi ending itself.
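To make the URL layout concrete, here is a small sketch that derives both endpoints from a base index URL and strips a trailing /simple if one was supplied. It mirrors the description in this comment rather than bandersnatch's actual code; the function name, host and package name are hypothetical.

```python
def index_endpoints(base_url, package):
    """Return the endpoints derived from a base index URL.

    A trailing "/simple" is removed first, since the remote URL is meant
    to be the base of the index.
    """
    base = base_url.rstrip("/")
    if base.endswith("/simple"):
        base = base[: -len("/simple")]
    return {
        "simple_index": f"{base}/simple/",               # listing used by pip
        "simple_project": f"{base}/simple/{package}/",   # per-package page used by pip
        "json_metadata": f"{base}/pypi/{package}/json",  # per-package metadata used for syncing
    }


# Hypothetical host and package name.
endpoints = index_endpoints(
    "https://artifactory.example.com/artifactory/api/pypi/pypi-local/simple", "requests"
)
```

Without the normalization step, a remote URL ending in /simple would produce the .../simple/pypi/<package>/json path seen in the JFrog logs earlier in the thread.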

@grzleadams
Author

Do you know if bandersnatch will follow redirects? Apparently JFrog is doing something with their reverse proxy that requires it (for example, to just curl the simple index you need -L).

@gerrod3
Contributor

gerrod3 commented May 16, 2024

It should follow redirects, and same with pip as well.

@grzleadams
Author

grzleadams commented May 16, 2024

I looked through the worker logs and it looks like Pulp finds the package list (there are in fact 26 packages to sync) but hits .netrc errors when trying to pull them:

pulp []: pulpcore.tasking.tasks:INFO: Starting task <task_id>
pulp []: bandersnatch:INFO: Initialized release plugin blocklist_release, filtering []
pulp []: bandersnatch.mirror:INFO: Syncing with https://<url>/artifactory/api/pypi/pypi-local.
pulp []: pulp_python.app.tasks.sync:INFO: Attempt 0 to get package list from https://<url>/artifactory/api/pypi/pypi-local
pulp []: pulp_python.app.tasks.sync:INFO: Syncing all packages.
pulp []: aiohttp.client:WARNING: Could not read .netrc file: [Errno 2] No such file or directory: '.fake-netrc'
pulp []: pulp_python.app.tasks.sync:INFO: Attempt 1 to get package list from https://<url>/artifactory/api/pypi/pypi-local
pulp []: pulp_python.app.tasks.sync:INFO: Syncing all packages.
pulp []: aiohttp.client:WARNING: Could not read .netrc file: [Errno 2] No such file or directory: '.fake-netrc'
pulp []: pulp_python.app.tasks.sync:INFO: Attempt 2 to get package list from https://<url>/artifactory/api/pypi/pypi-local
pulp []: pulp_python.app.tasks.sync:INFO: Syncing all packages.
pulp []: aiohttp.client:WARNING: Could not read .netrc file: [Errno 2] No such file or directory: '.fake-netrc'
pulp []: pulp_python.app.tasks.sync:INFO: Failed to get package list using XMLRPC, trying parse simple page.
pulp []: bandersnatch.mirror:INFO: No project filters are enabled. Skipping filtering
pulp []: pulp_python.app.tasks.sync:INFO: 26 packages to sync.
pulp []: bandersnatch.mirror:INFO: No metadata filters are enabled. Skipping metadata filtering
pulp []: bandersnatch.mirror:INFO: No release file filters are enabled. Skipping release file filtering
pulp []: bandersnatch.package:INFO: Fetching metadata for package: <module> (serial 0)
pulp []: aiohttp.client:WARNING: Could not read .netrc file: [Errno 2] No such file or directory: '.fake-netrc'
pulp []: bandersnatch.package:INFO: <module> no longer exists on PyPI
<snip>
pulp []: pulpcore.tasking.tasks:INFO: Task completed <task_id>
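In the log above, the "no longer exists on PyPI" line follows a metadata fetch for that package. Comparing how many of those lines appear against the "N packages to sync" line gives a quick view of how many packages were skipped. The sketch below assumes the message formats shown in this log; the function name and sample package name are hypothetical.

```python
import re


def summarize_sync_log(lines):
    """Return (planned_package_count, packages_reported_missing)."""
    planned = None
    missing = []
    for line in lines:
        match = re.search(r"(\d+) packages to sync", line)
        if match:
            planned = int(match.group(1))
        match = re.search(
            r"bandersnatch\.package:INFO: (\S+) no longer exists on PyPI", line
        )
        if match:
            missing.append(match.group(1))
    return planned, missing


# Sample lines in the same format as the worker log, with a placeholder package name.
log_lines = [
    "pulp []: pulp_python.app.tasks.sync:INFO: 26 packages to sync.",
    "pulp []: bandersnatch.package:INFO: Fetching metadata for package: example (serial 0)",
    "pulp []: bandersnatch.package:INFO: example no longer exists on PyPI",
]
planned, missing = summarize_sync_log(log_lines)
```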

@gerrod3
Contributor

gerrod3 commented May 17, 2024

Those Could not read .netrc file: warnings are harmless and shouldn't affect the sync; they were fixed in pulp_python 3.11.1.

Can you check the output of the sync task? The logs say it completed, so it should give info on how many packages it synced. pulp task show --href <task_href> or --uuid <task_id>.

If the number of synced packages is zero, can you try to curl https://<url>/artifactory/api/pypi/pypi-local/pypi/<package_name>/json and see if it responds with that package's metadata as JSON? This should be the endpoint that the sync requests for each package it is syncing.
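When checking that endpoint by hand, it helps to distinguish a real JSON metadata document from an HTML page or error body. The check below is a minimal sketch: it assumes the PyPI-style response shape with top-level "info" and "releases" keys, and the function name and sample bodies are made up for illustration.

```python
import json


def looks_like_pypi_json(body):
    """Return True if `body` parses as JSON and has the expected top-level keys."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        # HTML pages and empty responses end up here.
        return False
    return isinstance(data, dict) and "info" in data and "releases" in data


# Made-up response bodies, for illustration only.
json_body = '{"info": {"name": "example", "version": "1.0"}, "releases": {"1.0": []}}'
html_body = "<html><body><a href='example-1.0.tar.gz'>example-1.0.tar.gz</a></body></html>"

json_ok = looks_like_pypi_json(json_body)
html_ok = looks_like_pypi_json(html_body)
```

The response body from the curl command can be passed to this function directly as a string or bytes.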

@gerrod3
Contributor

gerrod3 commented Jun 20, 2024

@grzleadams Did you ever get the sync to work?

@grzleadams
Author

No, we were in a bit of a time crunch so I just downloaded all the files and added them to Pulp manually.

@gerrod3
Contributor

gerrod3 commented Jun 20, 2024

I see. Well when you have time, I am willing to continue helping out to resolve this issue, else we can close it if no longer needed.

@treydock

I came across this and have discovered that it appears Artifactory does not have the /pypi/<package>/json endpoint. Examples: https://jfrog.atlassian.net/browse/RTFACT-30744

I have attempted many permutations of URLs, and the only ones that appear to work are /simple/ and /simple/<package>/. If having only those URLs will not work with bandersnatch and Pulp, then I think Pulp cannot sync from Artifactory.

@gerrod3
Contributor

gerrod3 commented Dec 18, 2024

Good find @treydock. @grzleadams Looks like we found the cause of the problem: they haven't implemented the JSON API. Some options you could try:

  1. Sync from the remote sources used to populate your Artifactory repository
  2. Set up a pull-through cache pointing to your Artifactory repository. This doesn't use the JSON API, but it does require you to request (pip install) the package once for Pulp to download the artifact. Also, it will only download one version/file at a time, so if you have multiple versions/distributions of a package you will have to request each one manually.

We can probably support syncing from repositories that don't support the JSON API, but there would be some limitations: only immediate syncing would be available, and some of the advanced metadata filters might not work. If this is something you want, we can change this to a feature request.

@grzleadams
Author

We actually moved off JFrog, so I don't feel strongly either way. I don't know how many people are trying to sync Artifactory to Pulp... my guess is not many.

@treydock

For now we're going to sync directly from pypi.org and then at some point move to having teams package Python dependencies as RPMs using something like pyp2rpm, similar to what Foreman does to package Pulp.

We likely won't be putting a lot of effort into using Pulp for Python packages due to sync issues that I think are just limitations of the PyPI API and what bandersnatch has to do to perform a full sync of pypi.org. Even with the broken Artifactory API, the PyPI sync took 8 hours, and that prevents new publications while the sync is running. Our use case is a little unique in that we will be frequently creating publications and distributions to create numerous and frequent "snapshots" of all repositories that can be accessed in parallel to other "snapshots". This means we had to switch to include lists for PyPI remotes in Pulp so that the sync makes a much smaller number of calls to the upstream PyPI API.

One issue we ran into using include list with the Python plugin for Pulp is bandersnatch doesn't resolve dependencies so I had to come up with a script to use johnnydep to work out dependencies for a list of Python packages and populate the correct include list. In our unique use case this would mean a team wishing to install a new Python package would first have to inform Pulp of this new package need and also run the script to pull out the dependency list to fully populate the includes list.
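The johnnydep-based script itself is not shown in this thread. As a rough illustration of the idea only, the sketch below builds an include list by walking the declared requirements of packages already installed in the current environment, using the standard library's importlib.metadata rather than johnnydep. It ignores version specifiers, extras and environment markers, so it over-includes rather than under-includes, and it only sees packages that are installed locally. All function names are hypothetical.

```python
import re
from importlib import metadata


def requirement_name(requirement):
    """Extract a normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if not match:
        return None
    # Normalize runs of "-", "_" and "." to a single "-" and lowercase the name.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def installed_requires(name):
    """Return the requirement strings declared by an installed package."""
    try:
        return metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []


def include_list(roots, requires=installed_requires):
    """Return the sorted transitive closure of `roots` over declared requirements."""
    seen = set()
    stack = [requirement_name(root) for root in roots]
    while stack:
        name = stack.pop()
        if name is None or name in seen:
            continue
        seen.add(name)
        for requirement in requires(name):
            stack.append(requirement_name(requirement))
    return sorted(seen)


# A made-up dependency graph, used in place of real installed metadata.
fake_graph = {
    "app": ["requests>=2", "rich; extra == 'cli'"],
    "requests": ["urllib3<3,>=1.21.1", "idna"],
}
includes = include_list(["app"], requires=lambda name: fake_graph.get(name, []))
```

Passing a custom `requires` callable, as in the last line, keeps the traversal separate from where the metadata comes from, so a different source could be swapped in.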

@gerrod3
Contributor

gerrod3 commented Dec 18, 2024

Yes, I don't recommend trying to sync all of PyPI using Pulp. There are just too many packages, and due to limitations of Pulp's current architecture you would need an insanely powerful machine for the database to handle the sync. The frequent snapshots sound like a pretty normal use case for Pulp, but I guess the syncs were taking too long to meet the frequency requirement?

As for dependency solving for syncs, I had a PR opened to add this feature (#626), but I never finished it because Python dependencies are hard to solve and I wasn't confident in my solution. The big issue is that the metadata is not set up for easy dependency solving, so the best you can do is cast as wide a net as possible for potential dependencies, but even with that strategy some might still slip through the cracks.
