borg2: build_chunkindex_from_repo is slow #8397

ThomasWaldmann · 2024-09-19T15:19:50Z

Problem:

That function does a repository.list(), listing all the object IDs in the repo to build an in-memory chunkindex.

Because every object is stored as a separate file in a two-level-deep directory structure, that takes (1+)256+65536 listdir() calls in the worst case. Depending on store speed, connection latency, etc., that can take quite a while.
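To make the worst-case count concrete, here is a small sketch of the arithmetic. The function name is made up for illustration, and the 20 ms per-call latency is only an assumed figure for a remote store, not a measurement.

```python
def worst_case_listdir_calls(levels: int = 2, fanout: int = 256) -> int:
    """Number of directories that must be listed if every directory exists.

    Level 0 is the data/ root itself; level k holds fanout**k directories.
    """
    return sum(fanout ** k for k in range(levels + 1))


calls = worst_case_listdir_calls()  # 1 + 256 + 65536 = 65793

# Assumed round-trip latency per listdir() call (illustrative only).
assumed_latency_s = 0.020
print(f"{calls} listdir() calls, roughly {calls * assumed_latency_s / 60:.0f} minutes")
```

With that assumed latency, the listing alone would take on the order of twenty minutes, before any object is read.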

The in-memory chunkindex is currently not persisted to the local cache, so it has to be rebuilt this way every time.


ThomasWaldmann · 2024-09-19T15:30:53Z

Analysis:

There are only a few borg2 commands that remove objects from data/ in the store:

borg compact (deletes unused objects)
borg check --repair (deletes corrupted objects)
borg debug obj-delete (expert-only, rarely used)
borg repo-delete (deletes the complete repository)

Notably, these commands do NOT delete objects from data/:

borg delete (just kills the entry in archives/)
borg prune (basically calls borg delete internally)

So, the set of objects in data/ only ever grows until compact/check is run (we can ignore borg debug and borg repo-delete).

borg create must not assume a chunk is in the repo when it in fact isn't anymore. That would create a corrupt archive, referencing a non-existing object.

OTOH, storing a chunk into the repo that already exists in there (but we did not know) is only a performance issue, not a correctness problem.
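The asymmetry above can be expressed as a set comparison between a (possibly stale) index and the real repo contents. This is only an illustrative sketch: the function name and the string labels are made up, and plain Python sets stand in for the ChunkIndex.

```python
def classify_stale_index(index_ids: set, repo_ids: set) -> str:
    """Classify how a cached index differs from the true repo contents."""
    # IDs the index claims are present, but the repo no longer has:
    # borg create would skip storing them and reference missing objects.
    phantom = index_ids - repo_ids
    # IDs the repo has, but the index does not know about:
    # borg create would store them again, which only costs time.
    unknown = repo_ids - index_ids
    if phantom:
        return "unsafe"
    if unknown:
        return "safe-but-slow"
    return "exact"


repo = {"a1", "b2", "c3"}
print(classify_stale_index({"a1", "b2"}, repo))        # index misses c3
print(classify_stale_index({"a1", "b2", "d4"}, repo))  # d4 was deleted from the repo
```

In other words, a cached index is safe to use as long as it is a subset of what is really in data/.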

ThomasWaldmann · 2024-09-19T15:50:53Z

Implementation idea:

uptodate = TBD
load chunkindex from cache if uptodate else (use build_chunkindex_from_repo + save to cache immediately)
update in-memory chunkindex (borg create adds new chunks)
save updated chunkindex to cache

The up-to-date check and lockless operation (even when multiple borg processes of the same user on the same machine use the same repository) need more thought.
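A minimal sketch of the load-or-rebuild flow, under several assumptions: the index is just a set of hex IDs serialized as JSON (not borg's real ChunkIndex format), `is_uptodate` is a caller-supplied placeholder for the still-TBD check, and `build_from_repo` stands in for build_chunkindex_from_repo. All names here are hypothetical. Writing to a temporary file and renaming it with `os.replace` is one way to keep concurrent readers from ever seeing a half-written cache file without taking a lock.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def save_index(path: str, ids: Set[str]) -> None:
    """Write the index to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(ids), f)
    os.replace(tmp_path, path)  # readers see either the old or the new file


def load_or_build(
    path: str,
    is_uptodate: Callable[[], bool],
    build_from_repo: Callable[[], Iterable[str]],
) -> Set[str]:
    """Load the cached index if it is usable, else rebuild and save at once."""
    if os.path.exists(path) and is_uptodate():
        with open(path) as f:
            return set(json.load(f))
    ids = set(build_from_repo())  # the slow repository listing
    save_index(path, ids)
    return ids
```

After borg create has added new chunks to the in-memory set, calling `save_index` again would cover the last step of the list above.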

ThomasWaldmann · 2024-09-19T22:02:24Z

Another idea:

borg compact needs to use repository.list() anyway (and holds an exclusive lock), so it could build the list of object IDs (after it has deleted objects during garbage collection) and store it in the repository (cache/* maybe?) and also in the local cache.
The format could be a compacted ChunkIndex file.
Clients could load that ChunkIndex from the local cache or from the repo.
borg check --repair would either have to update these caches or invalidate them IF it has deleted objects.
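The division of labour between compact and check --repair could look roughly like the toy model below. A plain dict stands in for the store, the key name `cache/chunkindex` is invented (the text above only says "cache/* maybe?"), and neither function reflects borg's actual code.

```python
CACHE_KEY = "cache/chunkindex"  # hypothetical name for the stored index
DATA_PREFIX = "data/"


def compact(store: dict, used_ids: set) -> None:
    """Delete unused objects, then publish the list of remaining object IDs."""
    for key in [k for k in store if k.startswith(DATA_PREFIX)]:
        if key[len(DATA_PREFIX):] not in used_ids:
            del store[key]
    # The listing is taken after garbage collection, so it contains only
    # objects that really exist at this point.
    store[CACHE_KEY] = sorted(
        k[len(DATA_PREFIX):] for k in store if k.startswith(DATA_PREFIX)
    )


def check_repair(store: dict, corrupted_ids: list) -> bool:
    """Delete corrupted objects; invalidate the index only if any were deleted."""
    deleted = False
    for obj_id in corrupted_ids:
        if store.pop(DATA_PREFIX + obj_id, None) is not None:
            deleted = True
    if deleted:
        store.pop(CACHE_KEY, None)  # clients must fall back to a rebuild
    return deleted
```

Invalidating is the simpler of the two options named above; updating the stored index in place would avoid forcing the next client into a full rebuild.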

ThomasWaldmann modified the milestones: 2.0.0rc1, 2.0.0b11 Sep 19, 2024
