Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content-sqlite: verify database integrity during module load #4340

Merged
merged 3 commits into from
May 24, 2022

Conversation

garlick
Copy link
Member

@garlick garlick commented May 24, 2022

Problem: if the content-sqlite database is corrupted, this is typically discovered at runtime when the KVS attempts to load/store a blob, which causes secondary fallout. It would be better to catch that condition early.

Add the quick_check sqlite pragma. Since pragmas are executed before the module enters the reactor, if this check fails, the module load fails, and rc1 aborts before the higher level services are loaded.

This addresses #4338

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

garlick added 2 commits May 24, 2022 10:55
Problem: if content-sqlite encounters a fatal error, it may not
unregister itself from the content cache in the broker before
exiting the module thread.

Fix the control flow so the backing store is always unregistered
on failure.
Problem: if database is corrupt, content-sqlite may load successfully
then fail a load/store request, resulting in secondary failures
such as a KVS module load failure, and content-flush hang in rc3.

Add the sqlite "quick_check" pragma which detects a corrupt database
during module initialization.

Fixes flux-framework#4338
@garlick
Copy link
Member Author

garlick commented May 24, 2022

Thanks, I tacked on a bit of testing to try to improve the numbers.

Problem: there is no coverage for content-sqlite's handling of
a corrupt database file.

Attempt to create a couple of forms of database corruption
and ensure that the content-sqlite module fails to load.
@chu11
Copy link
Member

chu11 commented May 24, 2022

Would it be worthwhile to copy the quick_check over to job-archive? I will copy over to PR #4336 of course.

@garlick
Copy link
Member Author

garlick commented May 24, 2022

Would it be worthwhile to copy the quick_check over to job-archive? I will copy over to PR #4336 of course

Maybe, although I haven't seen that database get corrupted on my test system. Be aware the overhead (when the pragma is evaluated during initialization) is proprtional to the number of rows in the database.

@chu11
Copy link
Member

chu11 commented May 24, 2022

Be aware the overhead (when the pragma is evaluated during initialization) is proprtional to the number of rows in the database.

ahhh, which there would no cleanup in the job-archive dbs long term. I guess its ok to leave this out for the time being in job-archive and job-db.

@codecov
Copy link

codecov bot commented May 24, 2022

Codecov Report

Merging #4340 (e6518a9) into master (fe3b799) will increase coverage by 0.03%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master    #4340      +/-   ##
==========================================
+ Coverage   83.62%   83.65%   +0.03%     
==========================================
  Files         389      389              
  Lines       65548    65550       +2     
==========================================
+ Hits        54814    54836      +22     
+ Misses      10734    10714      -20     
Impacted Files Coverage Δ
src/modules/content-sqlite/content-sqlite.c 67.69% <50.00%> (+4.68%) ⬆️
src/modules/job-info/guest_watch.c 77.02% <0.00%> (-0.55%) ⬇️
src/broker/state_machine.c 82.70% <0.00%> (-0.40%) ⬇️
src/cmd/flux-module.c 83.96% <0.00%> (-0.30%) ⬇️
src/cmd/flux-job.c 87.40% <0.00%> (+0.13%) ⬆️
src/common/libterminus/terminus.c 86.06% <0.00%> (+0.24%) ⬆️
src/common/libpmi/simple_server.c 87.12% <0.00%> (+0.49%) ⬆️
src/shell/pmi/pmi.c 82.95% <0.00%> (+0.65%) ⬆️

@garlick
Copy link
Member Author

garlick commented May 24, 2022

I couldn't quite hit the code path I wanted by selectively mangling the database file, although I did cover another uncovered path. Oh well, let's call this good. Thanks for the review! Setting MWP.

@mergify mergify bot merged commit f1b97f0 into flux-framework:master May 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants