Mirroring - Filtering for specific (ex. "py3.13") minor versions / slimming down size of mirroring? #1860

Open
InfiniteBSOD opened this issue Jan 25, 2025 · 2 comments
Labels: help wanted (Extra attention is needed), needs_external_pr (Will rely on non maintainer PR in order to close)

@InfiniteBSOD

Hello,

Using Python 3.13.1 on WSL2 (Ubuntu 24.04.1).

My bandersnatch.conf has the following configuration:

[mirror]
; The directory where the mirror data will be stored.
directory = /mnt/d/bandersnatch

; Save JSON metadata into the web tree:
; URL/pypi/PKG_NAME/json (Symlink) -> URL/json/PKG_NAME
json = true

; Save package release files
release-files = true

; Cleanup legacy non PEP 503 normalized named simple directories
cleanup = false

; The PyPI server which will be mirrored.
; master = https://test.pypi.org
; scheme for PyPI server MUST be https
master = https://pypi.org

; The network socket timeout to use for all connections. This is set to a
; somewhat aggressively low value: rather fail quickly temporarily and re-run
; the client soon instead of having a process hang infinitely and have TCP not
; catching up for ages.
timeout = 50

; The global-timeout sets aiohttp total timeout for its coroutines
; This is set incredibly high by default as aiohttp coroutines need to be
; equipped to handle mirroring large PyPI packages on slow connections.
global-timeout = 1800

; Number of worker threads to use for parallel downloads.
; Recommendations for worker thread setting:
; - leave the default of 3 to avoid overloading the pypi master
; - official servers located in data centers could run 10 workers
; - anything beyond 10 is probably unreasonable and avoided by bandersnatch
workers = 3

; Whether to hash package indexes
; Note that package index directory hashing is incompatible with pip, and so
; this should only be used in an environment where it is behind an application
; that can translate URIs to filesystem locations.  For example, with the
; following Apache RewriteRule:
;     RewriteRule ^([^/])([^/]*)/$ /mirror/pypi/web/simple/$1/$1$2/
;     RewriteRule ^([^/])([^/]*)/([^/]+)$/ /mirror/pypi/web/simple/$1/$1$2/$3
; OR
; following nginx rewrite rules:
;     rewrite ^/simple/([^/])([^/]*)/$ /simple/$1/$1$2/ last;
;     rewrite ^/simple/([^/])([^/]*)/([^/]+)$/ /simple/$1/$1$2/$3 last;
; Setting this to true would put the package 'abc' index in simple/a/abc.
; Recommended setting: the default of false for full pip/pypi compatibility.
hash-index = false

; Format for simple API to be stored in
; Since PEP691 we have HTML and JSON
simple-format = ALL

; Whether to stop a sync quickly after an error is found or whether to continue
; syncing but not marking the sync as successful. Value should be "true" or
; "false".
stop-on-error = false

; The storage backend that will be used to save data and metadata while
; mirroring packages. By default, use the filesystem backend. Other options
; currently include: 'swift'
storage-backend = filesystem

; Advanced logging configuration. Uncomment and set to the location of a
; python logging format logging config file.
; log-config = /etc/bandersnatch-log.conf

; Generate index pages with absolute urls rather than relative links. This is
; generally not necessary, but was added for the official internal PyPI mirror,
; which requires serving packages from https://files.pythonhosted.org
; root_uri = https://example.com

; Number of consumers which verify metadata
verifiers = 3

; Number of prior simple index.html to store. Used as a safeguard against
; upstream changes generating blank index.html files. Prior versions are
; stored under as "versions/index_<serial>_<timestamp>.html" and the current
; index.html will be a symlink to the latest version.
; If set to 0 no prior versions are stored and index.html is the latest version.
; If unset defaults to 0.
; keep_index_versions = 0

; Configure an option to compare whether a file is identical. By default the
; "hash" method is used which reads local file content and computes hashes,
; which is slow but more reliable; when "stat" method is used, file size and
; change time are used to compare, which is useful to reduce IO workload when
; verifying a lot of files frequently.
; Possible values are: hash (default), stat
compare-method = hash

; Configure to download packages from an alternative mirror.
; By default bandersnatch downloads packages from the server in the "url"
; value of json response from master server. This option asks bandersnatch
; to try to download from the configured PyPI mirror first, and fallback to
; "url" value if it was not successful (unable to get content or checksum
; mismatch). It is useful to sync most of the files from an existing, nearby
; mirror, for example when setting up a new server sitting next to an existing
; one for the purpose of load sharing.
; Downloading only from the mirror site without fallback is also possible,
; but be aware this could lead to more failures than expected and is not
; recommended for most scenarios.
; download-mirror = https://pypi-mirror.example.com/
; download-mirror-no-fallback = False

; vim: set ft=cfg:

; Configure a file to write out the list of files downloaded during the mirror.
; This is useful for situations when mirroring to offline systems where a process
; is required to only sync new files to the upstream mirror.
; The file will be named as set in the diff-file, and overwritten unless the
; diff-append-epoch setting is set to true.  If this is true, the epoch date will
; be appended to the filename (i.e. /path/to/diff-1568129735)
; diff-file = /srv/pypi/mirrored-files
; diff-append-epoch = true

[plugins]
enabled =
    exclude_platform
    latest_release

[blocklist]
platforms =
    macos
    freebsd
    py2.4
    py2.5
    py2.6
    py2.7
    py3.1
    py3.2
    py3.3
    py3.4
    py3.5
    py3.6
    py3.7
    py3.8
    py3.9

[latest_release]
keep = 3

I want to serve a local PyPI repository in an offline environment, catering only to Python 3.13 on Windows and Linux.
AFAIK the "platforms" option of the "blocklist" section only supports py2.4 ~ py2.7 and py3.1 ~ py3.10.
So I can't exclude "py3.11" and "py3.12"?
I omitted "py3.10" from the blocklist above because, as I understand it, "py3.10" encompasses "py3.1x", meaning Python 3.10 to 3.13?

With my config I downloaded 736GB until I received an error (sadly the clipboard messed up so I can't share it here).

Any ideas on being able to slim down my mirroring?

@cooperlees
Contributor

You should be able to restart the sync: it'll just checksum the files already downloaded for each project and not need to redownload them ...

  • The bummer is we won't delete already downloaded packages - We do not have a smart deleting story
    • So if you add more filters and rerun, we won't "cleanup" ... We have many open issues here and really need PyPI enhancements to be better there :(
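Since a rerun with stricter filters leaves earlier downloads in place, it can help to see where the disk space actually goes before deciding on any manual cleanup. The following is a minimal sketch, not part of bandersnatch: it walks a directory tree and sums wheel sizes per Python tag, relying only on the standard wheel filename layout (`name-version[-build]-pythontag-abitag-platformtag.whl`). The function name `bytes_per_python_tag` is hypothetical, and the script only reports; it deletes nothing.

```python
from collections import Counter
from pathlib import Path


def bytes_per_python_tag(root: Path) -> Counter:
    """Sum the size in bytes of all wheels under root, grouped by Python tag."""
    totals: Counter = Counter()
    for wheel in root.rglob("*.whl"):
        parts = wheel.stem.split("-")
        if len(parts) < 5:
            continue  # not a well-formed wheel filename, skip it
        # The Python tag is the third field from the end; compressed tags
        # such as "py2.py3" are counted once under each tag they name.
        for tag in parts[-3].split("."):
            totals[tag] += wheel.stat().st_size
    return totals
```

Calling `bytes_per_python_tag(Path("/mnt/d/bandersnatch"))` and printing `most_common()` on the result would list the tags that take the most space. Source distributions (`.tar.gz`, `.zip`) carry no Python tag in their filename and are not counted by this sketch.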

I wonder what we can do to avoid having to hard code py versions for the blocklists. That said py3.11 and 3.12 have been defined for 2 years (6528a9f) - What version of bandersnatch are you running?
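To answer the version question without relying on shell output, the installed distribution version can be read with the standard library. This is a small sketch; the helper name `installed_version` is made up for illustration, and it assumes bandersnatch was installed under the distribution name `bandersnatch` in the same environment that runs the script.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "bandersnatch") -> Optional[str]:
    """Return the installed version of dist_name, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is missing here.
print(installed_version())
```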

> I omitted "py3.10" in the blocklist above since as I understand "py3.10" encompasses "py3.1x" meaning python 3.10 to 3.13?

Hmmm, I wonder if that's a bug and we should just make it explicit ...
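To illustrate why listing each minor version explicitly matters, here is a hedged sketch of exact tag matching on wheel filenames. It is not the `exclude_platform` plugin's code, and whether the plugin matches this way is exactly the open question above. The mapping from a blocklist entry such as `py3.10` to the tags `cp310` and `py310` is an assumption made for the example, all function names are hypothetical, and the filenames are illustrative.

```python
def wheel_python_tags(filename: str) -> set[str]:
    """Return the Python tags of a wheel filename, e.g. {'cp313'}."""
    if not filename.endswith(".whl"):
        return set()
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        return set()
    # Layout: name-version[-build]-pythontag-abitag-platformtag
    return set(parts[-3].split("."))


def entry_to_tags(entry: str) -> set[str]:
    """Map a blocklist entry like 'py3.10' to exact tags (assumed mapping)."""
    if not entry.startswith("py"):
        return set()  # platform entries such as "macos" are out of scope here
    digits = entry[2:].replace(".", "")
    return {f"cp{digits}", f"py{digits}"}


def is_blocked(filename: str, entries: list[str]) -> bool:
    """True if any Python tag of the wheel exactly matches a blocked entry."""
    blocked = set().union(*(entry_to_tags(entry) for entry in entries))
    return bool(wheel_python_tags(filename) & blocked)


# Under exact matching, "py3.10" covers cp310 only, not cp311 or later.
print(is_blocked("demo-1.0-cp310-cp310-win_amd64.whl", ["py3.10"]))
print(is_blocked("demo-1.0-cp311-cp311-win_amd64.whl", ["py3.10"]))
```

With exact matching, an entry for one minor version never implies another, so a mirror aimed at Python 3.13 would need `py3.10`, `py3.11` and `py3.12` listed individually.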

  • As always, our plugins are 99% contributor supplied, so I am happy to accept enhancements should there be bugs or unexpected behaviour

If you install the master version of bandersnatch you should even be able to set 3.13 (not that you need to), thanks to b6e1288:

  • pip install [-U] git+https://github.com/pypa/bandersnatch

cooperlees added the "help wanted" and "needs_external_pr" labels on Jan 25, 2025
@InfiniteBSOD
Author

InfiniteBSOD commented Jan 26, 2025

> Hmmm, I wonder if that's a bug and we should just make it explicit ...

Really appreciate the answer, and the information that py3.11 and py3.12 are supported, for example in platform filtering.
I simply read the docs here, and since I didn't see any mention of py3.11 and py3.12 I believed they weren't supported. That's also why I assumed that "py3.10" encompassed "py3.1x" ("py3.10", "py3.11", "py3.12" and "py3.13").

I modified my "bandersnatch.conf" and added "py3.10", "py3.11" and "py3.12" to my blocklist for platforms and have started a new mirroring session 👍
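A quick way to double-check such a change is to parse the config and list the Python 3 minor versions below the target that the blocklist does not name. The sketch below uses only the standard library `configparser`; the function name `missing_minor_versions` and the `EXAMPLE` text are illustrative, with `EXAMPLE` reproducing the `[blocklist]` section from the original config. Whether a given entry name is accepted still depends on the installed bandersnatch version, as noted above.

```python
import configparser


def missing_minor_versions(conf_text: str, target_minor: int) -> list[str]:
    """List py3.N entries (1 <= N < target_minor) absent from [blocklist] platforms."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Indented continuation lines form one multi-line value; split on whitespace.
    listed = set(parser.get("blocklist", "platforms", fallback="").split())
    wanted = [f"py3.{minor}" for minor in range(1, target_minor)]
    return [entry for entry in wanted if entry not in listed]


# The [blocklist] section as first posted, before py3.10 to py3.12 were added.
EXAMPLE = """
[blocklist]
platforms =
    macos
    freebsd
    py2.4
    py2.5
    py2.6
    py2.7
    py3.1
    py3.2
    py3.3
    py3.4
    py3.5
    py3.6
    py3.7
    py3.8
    py3.9
"""

print(missing_minor_versions(EXAMPLE, target_minor=13))
```

For a real check, the text of `bandersnatch.conf` could be read from disk and passed in place of `EXAMPLE`.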

So I believe simply updating the docs here would clear up any confusion.

Update:

Received the same error I received earlier, quite a hefty log so linking to Pastebin.
