Enable syncing without filelists present #3780
Conversation
I tested this locally and it works, but it requires a new version of createrepo_c that isn't released yet: rpm-software-management/createrepo_c#441. @ggainey, I do have some misgivings, for the reasons we've previously discussed. Content is immutable and these fields aren't part of the natural key. Syncing the repo means that you can't, for instance, upload the same RPM and "complete" the filelist. We already have this problem with metadata, so it's no different really, just an additional case of it. But I don't know that "trusting" the metadata is such a great idea, at least not without doing something like chucking on a metadata digest like we do for updaterecords.
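For illustration, a minimal sketch of what such a per-unit metadata digest could look like; the field names and the serialization are assumptions for the example, not pulp_rpm's actual model:

```python
import hashlib
import json


def metadata_digest(files, changelogs):
    """Stable digest over metadata that is NOT part of the Package natural key
    (filelists, changelogs), so a later sync can tell whether the remote's copy
    of that metadata differs from what we stored. Field names are illustrative."""
    payload = json.dumps(
        {"files": sorted(files), "changelogs": sorted(changelogs)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two units with the same NEVRA but different filelists would get different
# digests, so an incomplete unit could be detected and refreshed on re-sync.
print(metadata_digest([["file", "", "/usr/bin/foo"]], []))
```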
As a day-of-learning experiment I wrote a script that does the whole "parse only primary.xml and download the headers" approach. It takes about 15 minutes on the Fedora 40 release repo just to do all the downloading, which is a lot more than 1 minute. But that's theoretically work that would only ever need to be done the first time a new package is seen, and never again thereafter. So maybe it's not so bad? One way or another, we probably should fix the issue of packages being immutable but also global and potentially incomplete, though.
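For reference, here is roughly what that experiment looks like. This is a sketch rather than the actual script: the mirror URL, the fixed 512 KiB range request, and reading primary.xml directly (instead of resolving it through repomd.xml) are all simplifying assumptions.

```python
import gzip
import tempfile
import urllib.request
import xml.etree.ElementTree as ET

import rpm  # python3-rpm bindings

BASE_URL = "https://example.org/fedora/40/Everything/x86_64/os/"  # hypothetical mirror
PRIMARY = "repodata/primary.xml.gz"  # assumed path; normally resolved via repomd.xml
NS = {"c": "http://linux.duke.edu/metadata/common"}
HEADER_BYTES = 512 * 1024  # assumption: lead + signature + header fit in 512 KiB


def package_hrefs(primary_xml):
    """Yield location/@href for every package listed in primary.xml."""
    root = ET.fromstring(primary_xml)
    for pkg in root.findall("c:package", NS):
        yield pkg.find("c:location", NS).get("href")


def fetch_header(href):
    """Range-request just the start of the RPM and parse the header from it."""
    req = urllib.request.Request(
        BASE_URL + href, headers={"Range": f"bytes=0-{HEADER_BYTES - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        blob = resp.read()
    ts = rpm.TransactionSet()
    ts.setVSFlags(rpm._RPMVSF_NOSIGNATURES | rpm._RPMVSF_NODIGESTS)
    with tempfile.TemporaryFile() as tmp:
        tmp.write(blob)
        tmp.seek(0)
        return ts.hdrFromFdno(tmp.fileno())


with urllib.request.urlopen(BASE_URL + PRIMARY) as resp:
    primary_xml = gzip.decompress(resp.read())

for href in package_hrefs(primary_xml):
    hdr = fetch_header(href)
    print(hdr[rpm.RPMTAG_NEVRA], len(hdr[rpm.RPMTAG_FILENAMES] or []), "files")
```

Most of the 15 minutes is the sequential range requests; real code would parallelize those and pick the range size more carefully.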
That depends entirely on the "typical" amount of change in the remote repo. If it's one package a week, over time it's a clear win. If every update to a repo affects, say, 25% of the packages, then it's a net loss. Also, "only pay it once" is fine - unless your users are mad about even that first one-time-only cost. Also also, "how long will this take" depends on network speed. There are A LOT of variables here. I don't have an answer - I don't think there IS "an" answer - I think we need to think really carefully about the tradeoffs, and whether it's worth the complexity we're thinking about adding.
We would also probably need to flip the sync pipeline around: filter out dups first, then optionally download artifacts (you don't want to duplicate effort in the immediate-download case), then the plugin decides whether or not downloading the header is necessary based on whether the full artifact was downloaded, and then the new content unit is created. It would be a whole thing.
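To make that reordering concrete, a rough sketch against pulpcore's stages API; the stage ordering, the class name, and the commented-out RpmHeaderFetcher stage are hypothetical, not an existing implementation:

```python
from pulpcore.plugin.stages import (
    ArtifactDownloader,
    ArtifactSaver,
    ContentSaver,
    DeclarativeVersion,
    QueryExistingArtifacts,
    QueryExistingContents,
    RemoteArtifactSaver,
    ResolveContentFutures,
)


class HeaderAwareDeclarativeVersion(DeclarativeVersion):  # hypothetical class
    def pipeline_stages(self, new_version):
        # Reordered relative to the stock pipeline: known content is filtered
        # out before anything is downloaded, so the downloader only sees
        # genuinely new units, and a plugin stage could then decide per unit
        # whether the full artifact or just the RPM header is needed.
        return [
            QueryExistingContents(),   # 1. drop units we already have
            QueryExistingArtifacts(),
            ArtifactDownloader(),      # 2. download only what is still wanted
            ArtifactSaver(),
            # RpmHeaderFetcher(),      # 3. hypothetical: fill filelists from headers
            ContentSaver(),            # 4. finally create the new content units
            RemoteArtifactSaver(),
            ResolveContentFutures(),
        ]
```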
This feels like A LOT of heuristic work to solve an edge case
If we sync such a repo, are we going to also skip publishing such files?
@ipanova No, because primary.xml generally contains at least a subset of the filelist, so we ought to have something. And if not, there's still no harm in publishing a filelists.xml without files, and it's not worth handling the edge cases to do otherwise.
closes #3777