Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable syncing without filelists present #3780

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

dralley
Copy link
Contributor

@dralley dralley commented Nov 8, 2024

closes #3777

@dralley
Copy link
Contributor Author

dralley commented Nov 8, 2024

I tested this locally and it works, but it requires a new version of createrepo_c that isn't released yet.

rpm-software-management/createrepo_c#441

@ggainey, I do have some misgivings for the reasons we've previously discussed. Content is immutable and these fields aren't part of the natural key. Syncing the repo means that you can't, for instance, upload the same RPM and "complete" the filelist.

We already have this problem w/ metadata, so it's no different really, just an additional case of it. But idk that "trusting" the metadata is such a great idea, at least not without doing something like chucking on a metadata digest like we do for updaterecords.

@dralley
Copy link
Contributor Author

dralley commented Nov 8, 2024

As a day-of-learning experiment I wrote a script that does the whole "parse only primary.xml and download the headers" approach. It takes about 15 minutes on the Fedora 40 release repo just to do all the downloading, which is a lot more than 1 minute. But that's theoretically work that would only ever need to be done the first time a new package is seen, and never again thereafter. So maybe it's not so bad?

One way or another we probably should fix the problem of packages being immutable but also global and potentially incomplete issue though.

@ggainey
Copy link
Contributor

ggainey commented Nov 8, 2024

But that's theoretically work that would only ever need to be done the first time a new package is seen, and never again thereafter. So maybe it's not so bad?

That depends entirely on the "typical" amount of change in the remote repo. If it's one package a week, over time it's a clear win. If every update to a repo affects, say, 25% of the packages, then it's a net loss. Also, "only pay it once" is fine - unless your users are made about even that first one-time-only cost. Also also, "how long will this take" is dependent on network-speed.

There are A LOT of variables here. I don't have an answer - I don't think there IS "an" answer - I think we need to think really carefully about the tradeoffs, and whether it's worth the complexity we're thinking about adding.

@dralley
Copy link
Contributor Author

dralley commented Nov 8, 2024

We would also probably need to flip the sync pipeline around. Filter out dups first, then optionally download artifacts (you don't want to duplicate effort in the immediate download case), then the plugin decides whether or not downloading the header is necessary based on whether the full artifact was downloaded, then the new content unit is created. It would be a whole thing.

@ggainey
Copy link
Contributor

ggainey commented Nov 8, 2024

This feels like A LOT of heuristic work to solve an edge-case

@dralley dralley force-pushed the missing-filelists branch 2 times, most recently from 349b2be to d322769 Compare November 12, 2024 14:47
@dralley dralley marked this pull request as ready for review November 12, 2024 14:47
@dralley dralley force-pushed the missing-filelists branch 2 times, most recently from 736e9b8 to de9a067 Compare November 12, 2024 16:03
@ipanova
Copy link
Member

ipanova commented Nov 14, 2024

if we sync such a repo, are we going to also skip publishing such files?

@dralley
Copy link
Contributor Author

dralley commented Nov 14, 2024

@ipanova No because primary.xml contains at least a subset of the filelist, generally, so we ought to have something. And if not there's still no harm in publishing a filelists.xml w/o files and not worth the edge cases to do otherwise.

@dralley dralley merged commit 6abc7e7 into pulp:main Nov 14, 2024
12 checks passed
@dralley dralley deleted the missing-filelists branch November 14, 2024 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repositories without filelists not syncable
3 participants