Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check filesystem for existing corpora on creation, import and deletion #313

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

matthias-stemmler
Copy link
Contributor

I found an issue with the check whether a corpus with a given name exists when trying to create an empty corpus, import a corpus or delete a corpus: The current implementation only checks if there is a cache entry for the given name, but not whether a corpus actually exists on disk. This can be a problem in two cases:

  • When a corpus has never been added to the cache
  • When a corpus has been evicted from the cache because the cache grew too large

We originally stumbled upon this in matthias-stemmler/annimate#471, where it caused an interesting case of data corruption. The sequence of events was presumably as follows:

  • The user imported all corpora of ReM 1.0.
  • They loaded all corpora, but due to limited memory, some got evicted from the cache.
  • They tried to import all corpora of ReM 2.x.
    • Since the names are identical in both versions and I set overwrite_existing to false, this should have failed. However, since existence is checked against the cache, the import went through, at least for those corpora that had been evicted.
  • Additionally, the old data weren't deleted, but instead the new data were imported on top of them.
    • This means that components that exist in ReM 1.0, but not in ReM 2.x, remained on disk. In this case it was Coverage/default_ns/, Coverage/VIRTUAL/ and Ordering/default_ns/.
  • Now node ids from the ReM 1.0 components were reinterpreted in the context of ReM 2.x, causing nonsensical edges such as coverage between nodes of different documents.
  • This made the subgraph method return wrong results, causing a broken export.

We could of course fix it by deleting and reimporting the corpora.

When looking through the code, I found that the same issue as for the import also applies to corpus creation and deletion. All other operations seem fine to me.

For the fix, I'm now checking whether a corpus exists on disk in addition to the cache check. I kept the latter to avoid changing the behavior in case a corpus exists only in the cache and not on disk (which can happen, for instance, when the info method is called on a nonexistent corpus).
I also noticed that the deletion logic acquires an exclusive lock on the cache entry before deleting from disk, so I figured the import should do the same in case the corpus already exists.

Thanks to @S-Oppermann for reporting the original issue and providing the data for analysis.

Copy link

codecov bot commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 60.71429% with 11 lines in your changes missing coverage. Please review.

Project coverage is 67.86%. Comparing base (c368b15) to head (67e17c9).

Files with missing lines Patch % Lines
graphannis/src/annis/db/corpusstorage.rs 60.71% 5 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #313      +/-   ##
==========================================
+ Coverage   67.70%   67.86%   +0.15%     
==========================================
  Files          76       76              
  Lines       18958    18968      +10     
  Branches    18958    18968      +10     
==========================================
+ Hits        12835    12872      +37     
+ Misses       4662     4633      -29     
- Partials     1461     1463       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant