Check filesystem for existing corpora on creation, import and deletion #313
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I found an issue with the check whether a corpus with a given name exists when trying to create an empty corpus, import a corpus or delete a corpus: The current implementation only checks if there is a cache entry for the given name, but not whether a corpus actually exists on disk. This can be a problem in two cases:
We originally stumbled upon this in matthias-stemmler/annimate#471, where it caused an interesting case of data corruption. The sequence of events was presumably as follows:
overwrite_existing
tofalse
, this should have failed. However, since existence is checked against the cache, the import went through, at least for those corpora that had been evicted.Coverage/default_ns/
,Coverage/VIRTUAL/
andOrdering/default_ns/
.subgraph
method return wrong results, causing a broken export.We could of course fix it by deleting and reimporting the corpora.
When looking through the code, I found that the same issue as for the import also applies to corpus creation and deletion. All other operations seem fine to me.
For the fix, I'm now checking whether a corpus exists on disk in addition to the cache check. I kept the latter to avoid changing the behavior in case a corpus exists only in the cache and not on disk (which can happen, for instance, when the
info
method is called on a nonexistent corpus).I also noticed that the deletion logic acquires an exclusive lock on the cache entry before deleting from disk, so I figured the import should do the same in case the corpus already exists.
Thanks to @S-Oppermann for reporting the original issue and providing the data for analysis.