Create mirrors.cicku.me #1030

Merged
merged 8 commits into from
Jun 10, 2024

Conversation

cicku
Contributor

@cicku cicku commented May 10, 2024

Add a new mirror

@jonathanspw
Member

Please follow the geolocation example at the bottom of the page at https://wiki.almalinux.org/Mirrors.html

The system uses this to serve your mirror to users geographically close to your mirror.

@jonathanspw
Member

Just noticed the filename doesn't end in .yml. Please rename the file to mirrors.cicku.me.yml

@cicku
Contributor Author

cicku commented May 10, 2024

My mirror is behind a CDN, so what country should I fill in? It is available everywhere. Even cloud_type is not sufficient.

@jonathanspw
Member

I see. Are the files pre-cached basically everywhere or only cached on each edge based on need? I see CF Magic Transit is what's behind it.

This is certainly an interesting situation that we've not encountered yet.

Where is the actual server that sits behind the CDN in this case?

@cicku
Contributor Author

cicku commented May 10, 2024

There are 4 bare metal servers powering the mirror:

  1. US West is the main one, it has ~160 TB storage.
  2. US East is the fallback, it has ~120 TB storage.
  3. UK is the one load balancing with US West and not used much at the moment due to some hardware issues. ~80 TB storage.
  4. Singapore has a ~40 TB server that acts as the fallback for US West when there is a cable issue; it does not hold all mirror files.

And I have additional VPSs which will act as the warm-up site.

Files will not always be stored on each bare metal server, but if more users visit the same edge, the requested files will be cached in the same PoP (referred to as hot cache). If the file is not in the hot cache, Cache Reserve is used (cold cache). Tiered Cache is also enabled for fast fetches from both hot and cold caches, and all cached files are kept for a certain time period. I have rules written for each project I support, so I can customize them (for example, ISO files do not need to be updated often, so they can stay in the cold cache for a month or two without changes).
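The lookup order described above (hot cache at the PoP, then the cold cache, then the origin server) can be sketched roughly as follows. This is only an illustration of the general pattern, not Cloudflare's implementation: the `fetch` function, the dictionaries standing in for the two cache tiers, and the `origin` callback are all hypothetical names, and TTL expiry and the Tiered Cache topology are left out.

```python
from typing import Callable, Dict, Tuple


def fetch(
    path: str,
    hot: Dict[str, bytes],
    cold: Dict[str, bytes],
    origin: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (body, tier), where tier names the layer that served the request.

    `hot` stands in for the per-PoP cache, `cold` for a persistent cache
    such as Cache Reserve, and `origin` for the bare metal server.
    All three are hypothetical stand-ins for illustration.
    """
    if path in hot:
        return hot[path], "hot"

    if path in cold:
        body = cold[path]
        hot[path] = body  # promote so the next request at this PoP is a hot hit
        return body, "cold"

    # Miss in both tiers: go to the origin and populate both caches.
    body = origin(path)
    cold[path] = body
    hot[path] = body
    return body, "origin"
```

The first request for a rarely used file pays the full origin round trip, which is the case the later comments about slow cache misses are concerned with.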


Magic Transit is not directly used by the mirror site (my homepage is the same across different subdomains 💡); it is behind the scenes, though. I do have an rsync service behind Spectrum, but it is a private one, and due to bandwidth concerns I do not plan to announce it.

@jonathanspw
Member

CDNs are a bit tricky for the mirror system because they go against what it was designed to do. The mirror system distributes traffic to local mirrors, and potentially cold caches where files don't currently exist would degrade the user experience.

The best thing will be to set the geolocation on this to the primary location where the files will always be, with a DNS entry tied only to that location and not one that'd fall back to other locations, which, even if hot, would result in a sub-par user experience by potentially serving users across the country/world.

@cicku
Contributor Author

cicku commented May 31, 2024

My understanding of a modern mirror ecosystem is that a CDN can co-exist with local mirrors: a CDN may not technically be the best/fastest option, but it load-balances global traffic. A package manager should regularly perform a latency check/speed test and select the best mirror, like fastestmirror in dnf. Since the CDN does not have a bandwidth issue (I can do 0.5 PB in a single day based on load testing), latency will be the only concern.

> The best thing will be to set the geolocation on this to the primary location where the files will always be, with a DNS entry tied only to that location

I have a long list of subdomains, such as jp.mirrors.cicku.me, which only serves traffic around Japan. If you need them, I can provide the mirror that way as well; we can try a few for testing before adding all of them to the list.

@cicku
Contributor Author

cicku commented May 31, 2024

I do have one question about the YAML format: should I put all subdomains in a single file, or create the files one by one?

@jonathanspw
Member

> My understanding of a modern mirror ecosystem is that a CDN can co-exist with local mirrors: a CDN may not technically be the best/fastest option, but it load-balances global traffic. A package manager should regularly perform a latency check/speed test and select the best mirror, like fastestmirror in dnf. Since the CDN does not have a bandwidth issue (I can do 0.5 PB in a single day based on load testing), latency will be the only concern.

We do not rely on fastestmirror; instead, our mirror system does the logic to try to serve the best mirror to you, and that's why CDNs don't play nicely with our mirror system. Since we expect one mirror to represent one location for our geolocation logic, a CDN that can exist from any number of places poses a problem.

Furthermore, when said CDN has an endpoint that dies and it redirects traffic to another endpoint, that is great, but it causes the user sub-par performance if it is redirected to one far away. We can do a better job of removing problematic mirrors from the list and redirecting users to other mirrors that are close to them rather than the CDN doing potentially less than ideal things.

Having said all that - if you have records to each of your locations, and it sounds like you do, the solution here is to create a mirror entry in mirrors.d for each location with DNS that goes directly to it and doesn't hit CDN/fallback logic on your end. Then you can also provide accurate location data for each mirror and we can serve it to users accordingly. Preferably these locations have direct storage of the files and don't rely on a caching architecture that has the potential to have the files get removed - again it's all about providing the best experience to end users (translation: fast dnf transactions/downloads).
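To make the "one mirror, one location" assumption concrete, here is a rough sketch of distance-based mirror selection. It is not the AlmaLinux mirror service's actual code, which is not shown in this thread; the `Mirror` class, the `nearest` helper, the hostnames, and the coordinates are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mirror:
    name: str
    lat: float  # latitude of the single location this entry represents
    lon: float  # longitude of that location


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest(mirrors: List[Mirror], lat: float, lon: float, count: int = 1) -> List[Mirror]:
    """Return the `count` mirrors closest to the client's coordinates."""
    ranked = sorted(mirrors, key=lambda m: haversine_km(lat, lon, m.lat, m.lon))
    return ranked[:count]


# Hypothetical per-location entries with approximate city coordinates.
MIRRORS = [
    Mirror("us-west.mirror.example", 34.05, -118.24),  # Los Angeles
    Mirror("uk.mirror.example", 51.51, -0.13),         # London
    Mirror("sg.mirror.example", 1.35, 103.82),         # Singapore
]
```

A ranking like this only works if each entry's coordinates describe where the bytes are really served from, which is why a single anycast hostname that answers from anywhere does not fit the model.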

If you must use a hot/cold caching architecture then there are some TTLs we could provide that would result in good UX, but you'd have to bypass any rules that'd remove things based on infrequent access...but again direct storage is much preferred.

Thanks for working on setting up mirroring, it is very much appreciated :) Let me know what you think about my comments.

@cicku cicku marked this pull request as draft June 2, 2024 17:51
@jonathanspw
Member

Thanks for your work on this. I know it's still a draft, but you'll need to drop the asn line from each config. The result of that would be sending all Cloudflare-originated traffic to your mirrors, which isn't how things should be configured here.
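A small sketch of why the asn line matters, assuming (as the comment above implies) that a mirror whose ASN matches the client's network is preferred over a geolocation match. The `pick` function and the dictionary layout are hypothetical; 13335 is Cloudflare's AS number and is used here only as an example value.

```python
from typing import Dict, List


def pick(mirrors: List[Dict], client_asn: int, client_country: str) -> List[str]:
    """Prefer mirrors that declare the client's ASN, else match by country.

    This ordering is an assumption made for illustration; the real
    selection logic is not described in detail in this thread.
    """
    asn_matches = [m["name"] for m in mirrors if m.get("asn") == client_asn]
    if asn_matches:
        return asn_matches
    return [m["name"] for m in mirrors if m["country"] == client_country]


# With the asn line present, a client in Germany whose traffic originates
# from AS13335 is sent to the US entry instead of the local German mirror.
WITH_ASN = [
    {"name": "cdn-us.example", "country": "US", "asn": 13335},
    {"name": "local-de.example", "country": "DE"},
]

# With the asn line dropped, the same client falls through to geolocation.
WITHOUT_ASN = [
    {"name": "cdn-us.example", "country": "US"},
    {"name": "local-de.example", "country": "DE"},
]
```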

Other than that looking good so far! Thanks for working on this!

@cicku
Contributor Author

cicku commented Jun 5, 2024

I'm going to submit these as the "first wave"; there are actually far more hostnames for other regions, but they are not for public use at the moment.

@cicku cicku marked this pull request as ready for review June 5, 2024 16:44
@codyro
Member

codyro commented Jun 6, 2024

The address will need to point to the repository, e.g. https://us.mirrors.cicku.me/almalinux/

@jonathanspw
Member

The addresses all seem to still be doing some CDN-ish things. I believe these are mostly anycast addresses?

Is each endpoint otherwise direct with a hot copy of the files in the given location? If so I think we're ready to get this merged.

@cicku
Contributor Author

cicku commented Jun 7, 2024

Yes, everything is anycast based.

The hot cache is used as much as possible, but it does not hold all files. More users => more files in hot cache => faster pull.

@jonathanspw
Member

What is the TTL on files? I noticed HTTP headers show CF caches in action, and when I get cache hits things are very fast. On cache misses things can be <5 MB/s at times, which is certainly not great.

I'm not overly concerned with common updates, I'm sure they'll stay hot in caches, but ISOs, cloud images, etc. may not get used enough to stay hot in the cache, which will lead to sub-par UX if they're slow to download.
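One way to tell hits from misses when testing is to look at the cache status header on each response. The sketch below assumes the header in question is Cloudflare's `CF-Cache-Status` (the comment above does not name it) and only classifies a headers mapping; the `cache_outcome` function and its three-way grouping are hypothetical simplifications.

```python
from typing import Mapping


def cache_outcome(headers: Mapping[str, str]) -> str:
    """Classify a response as 'hit', 'miss', 'uncached', or 'unknown'.

    Header names are compared case-insensitively. The grouping of
    status values is a simplification chosen for this example.
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    status = normalized.get("cf-cache-status", "").strip().upper()

    if status == "HIT":
        return "hit"
    if status in {"MISS", "EXPIRED"}:
        return "miss"  # served from a higher tier or the origin
    if status in {"BYPASS", "DYNAMIC"}:
        return "uncached"  # the response was never eligible for caching
    return "unknown"
```

Running the same download twice and comparing the outcome with the observed speed gives a quick picture of how much slower a miss is than a hit for a given file.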

@cicku
Contributor Author

cicku commented Jun 8, 2024

ISO and images: 14d
RPMs: 1d
Other files including repodata: 3h
Timestamp: No cache

@jonathanspw
Member

Can you adjust everything to max TTL - longer is better, except the following?

*/repomd.xml - same as rsync ttl, 1h-3h recommended
/timestamp.txt - same as rsync ttl, 1h-3h recommended
/TIME - same as rsync ttl, 1h-3h recommended
*/repomd.xml.asc - no caching. caching this causes fits for some reason so they need to be passed through.

Everything else should be as long as possible. 14d would work, longer would be better. On our internally cache-tiered things we use a 6mo TTL by default.
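The TTL recommendations above can be expressed as an ordered rule table. This is a sketch only: the `ttl_for` function and `RULES` list are hypothetical, paths are assumed to be relative to the repository root, 3 hours is taken as the upper end of the recommended 1h-3h range, and 6 months is approximated as 180 days.

```python
import fnmatch
from typing import List, Optional, Tuple

HOUR = 3600
DAY = 24 * HOUR

# Ordered rules: the first matching pattern wins. None means "bypass the cache".
RULES: List[Tuple[str, Optional[int]]] = [
    ("*/repomd.xml.asc", None),   # signature files are passed through uncached
    ("*/repomd.xml", 3 * HOUR),   # keep in step with the rsync interval
    ("/timestamp.txt", 3 * HOUR),
    ("/TIME", 3 * HOUR),
]

DEFAULT_TTL = 180 * DAY  # "as long as possible" for everything else


def ttl_for(path: str) -> Optional[int]:
    """Return the cache TTL in seconds for a path, or None to bypass caching."""
    for pattern, ttl in RULES:
        # fnmatchcase keeps matching case-sensitive on every platform,
        # and "*" also matches "/" so one pattern covers nested repos.
        if fnmatch.fnmatchcase(path, pattern):
            return ttl
    return DEFAULT_TTL
```

Keeping the short-lived metadata rules ahead of the long default means packages and images can stay cached for months while clients still see fresh repository metadata after each sync.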

@cicku
Contributor Author

cicku commented Jun 10, 2024

I will push the images to 6 months (technically speaking the max is 1y, but that barely happens) and RPMs to 30d.

Regarding asc files, I think it is safe to have the same TTL as the rsync frequency, but I will bypass the cache as suggested.


The configuration is open to further adjustment, but I need to see real-world feedback first; I do not want to spend too much time tweaking at the moment.

@jonathanspw
Member

Thanks for all the mirrors. Let's see how this goes!

@jonathanspw jonathanspw merged commit ff05c9a into AlmaLinux:master Jun 10, 2024
1 check passed