Check filesystem for existing corpora on creation, import and deletion #313

matthias-stemmler · 2025-02-11T20:45:45Z

I found an issue with the check whether a corpus with a given name exists when trying to create an empty corpus, import a corpus or delete a corpus: The current implementation only checks if there is a cache entry for the given name, but not whether a corpus actually exists on disk. This can be a problem in two cases:

When a corpus has never been added to the cache
When a corpus has been evicted from the cache because the cache grew too large

We originally stumbled upon this in matthias-stemmler/annimate#471, where it caused an interesting case of data corruption. The sequence of events was presumably as follows:

The user imported all corpora of ReM 1.0.
They loaded all corpora, but due to limited memory, some got evicted from the cache.
They tried to import all corpora of ReM 2.x.
- Since the names are identical in both versions and I set overwrite_existing to false, this should have failed. However, since existence is checked against the cache, the import went through, at least for those corpora that had been evicted.
Additionally, the old data weren't deleted, but instead the new data were imported on top of them.
- This means that components that exist in ReM 1.0, but not in ReM 2.x, remained on disk. In this case it was Coverage/default_ns/, Coverage/VIRTUAL/ and Ordering/default_ns/.
Now node ids from the ReM 1.0 components were reinterpreted in the context of ReM 2.x, causing nonsensical edges such as coverage between nodes of different documents.
This made the subgraph method return wrong results, causing a broken export.

We could of course fix it by deleting and reimporting the corpora.

When looking through the code, I found that the same issue as for the import also applies to corpus creation and deletion. All other operations seem fine to me.

For the fix, I'm now checking whether a corpus exists on disk in addition to the cache check. I kept the latter to avoid changing the behavior in case a corpus exists only in the cache and not on disk (which can happen, for instance, when the info method is called on a nonexistent corpus).
I also noticed that the deletion logic acquires an exclusive lock on the cache entry before deleting from disk, so I figured the import should do the same in case the corpus already exists.

Thanks to @S-Oppermann for reporting the original issue and providing the data for analysis.

codecov · 2025-02-11T20:56:52Z

Codecov Report

Attention: Patch coverage is 60.71429% with 11 lines in your changes missing coverage. Please review.

Project coverage is 67.86%. Comparing base (c368b15) to head (67e17c9).

Files with missing lines	Patch %	Lines
graphannis/src/annis/db/corpusstorage.rs	60.71%	5 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #313      +/-   ##
==========================================
+ Coverage   67.70%   67.86%   +0.15%     
==========================================
  Files          76       76              
  Lines       18958    18968      +10     
  Branches    18958    18968      +10     
==========================================
+ Hits        12835    12872      +37     
+ Misses       4662     4633      -29     
- Partials     1461     1463       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Check filesystem for existing corpora on creation, import and deletion

67e17c9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check filesystem for existing corpora on creation, import and deletion #313

Check filesystem for existing corpora on creation, import and deletion #313

matthias-stemmler commented Feb 11, 2025

codecov bot commented Feb 11, 2025

Check filesystem for existing corpora on creation, import and deletion #313

Are you sure you want to change the base?

Check filesystem for existing corpora on creation, import and deletion #313

Conversation

matthias-stemmler commented Feb 11, 2025

codecov bot commented Feb 11, 2025

Codecov Report