Assess risk of corruption of SQLite DB #15

Closed · giovannipizzi opened this issue Apr 13, 2020 · 9 comments
@giovannipizzi (Member) commented Apr 13, 2020

Note that, in any case, this should only happen during "maintenance" operations (see also #6).

One option is that maintenance operations always perform a backup of the relevant files (e.g. of all pack indexes, including the various WAL files), maybe keeping a few past copies. Also, assess with tests that one can properly read back historical versions.

Probably this is enough to feel confident that it's reliable?

@dev-zero

Well, it is not only about the DB but also about the files being streamed: ENOSPC comes to mind (which would at the same time result in a truncated incoming file as well as a truncated WAL), but also temporary hardware or network errors may require a careful design. And they don't necessarily happen only in the methods you declared to be @maintenance.

@giovannipizzi (Member Author)

Thanks, this is a very good comment.
BTW, this morning I was reading this: How To Corrupt An SQLite Database File. I will need to look into it in detail to see whether the implementation on top of SQLite suffers from some of the problems that SQLite itself takes care of (the current one actually does).

In terms of a full disk, I see the problem for the WAL file (I need to check whether SQLite already has a way to cope with this; it might).
In terms of the objects, in the current implementation my expectation is:

  • for loose objects: if the write fails, it fails in the sandbox, so the object is never partially written to disk. Then, once the sandbox file is written, the os.rename should (AFAIK) be atomic and shouldn't fail, as it does not increase the disk usage? (A small sketch of this pattern follows below.)
  • for packing: if the disk gets full while writing a pack, the pack write fails; this should raise an exception, not commit to the DB, and leave the object loose. As such, data shouldn't be corrupted: one just has some bytes in the pack that are not referenced by anybody. Not ideal of course (you could repack, but that's hard to do with a full hard drive :-) ), but it should avoid data loss?
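To make the first bullet concrete, here is a minimal sketch of the sandbox-then-rename pattern described above; the function and directory names (`write_loose`, `sandbox_dir`, `loose_dir`) are illustrative, not the actual disk-objectstore API:

```python
import os
import tempfile
import uuid

def write_loose(data: bytes, sandbox_dir: str, loose_dir: str) -> str:
    """Sketch only: the content is written entirely inside the sandbox, so a
    failure there (e.g. ENOSPC) can never leave a partial loose object; only
    the final os.rename publishes it."""
    fd, tmp_path = tempfile.mkstemp(dir=sandbox_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        key = uuid.uuid4().hex
        # Atomic on POSIX if sandbox_dir and loose_dir are on the same filesystem.
        os.rename(tmp_path, os.path.join(loose_dir, key))
        return key
    except BaseException:
        # Any leftover is a stray sandbox file, never a partial loose object.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```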

@dev-zero


> * for loose objects: if the write fails, it fails in the sandbox, so the object is never partially written to disk.

I have to check how you implemented the sandbox, but I guess when receiving data you're already starting to write to disk?

> Then, once the sandbox file is written, the os.rename should (AFAIK) be atomic and shouldn't fail, as it does not increase the disk usage?

This really depends on the filesystem used. Based on the documentation for os.rename, it seems that at least on POSIX os.rename will fail if src and dst are not on the same filesystem; otherwise the operation is atomic. But that still doesn't mean it can't fail due to some other error, just that it won't move the data.
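As a small illustration of this point (not code from the package): os.rename raises OSError with errno EXDEV when src and dst live on different filesystems, and one has to fall back to a non-atomic copy in that case:

```python
import errno
import os
import shutil

def atomic_move(src: str, dst: str) -> None:
    """Sketch: prefer the atomic os.rename; fall back to a copy-based move only
    when the two paths live on different filesystems (not atomic)."""
    try:
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # some other failure: permissions, disk full, ...
        shutil.move(src, dst)  # cross-filesystem fallback, not atomic
```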

> * for packing: if the disk gets full while writing a pack, the pack write fails; this should raise an exception, not commit to the DB, and leave the object loose. As such, data shouldn't be corrupted: one just has some bytes in the pack that are not referenced by anybody. Not ideal of course (you could repack, but that's hard to do with a full hard drive :-) ), but it should avoid data loss?

Yes and no; the problem is how you're going to recover from that: extra bytes at the end of a pack go unnoticed until you start writing to the pack again, unless you do a consistency check (from the DB you should be able to get the metadata and then check the size) or you have a WAL to replay. Unfortunately there are other reasons for additional or missing bytes: a crash after an operation could also leave you with a truncated file, even though you (the application) finished writing the data successfully.
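A consistency check along these lines could look roughly like the sketch below; the table and column names (`obj`, `offset`, `length`, `pack_id`) are assumptions for illustration, not the actual disk-objectstore schema:

```python
import os
import sqlite3

def unreferenced_tail_bytes(pack_path: str, index_db_path: str, pack_id: int) -> int:
    """Sketch: compare the pack size on disk with the end of the last object the
    index knows about; a positive result means trailing, unreferenced bytes."""
    conn = sqlite3.connect(index_db_path)
    try:
        row = conn.execute(
            "SELECT MAX(offset + length) FROM obj WHERE pack_id = ?", (pack_id,)
        ).fetchone()
    finally:
        conn.close()
    known_end = row[0] or 0
    return os.path.getsize(pack_path) - known_end
```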

@giovannipizzi (Member Author)

> I have to check how you implemented the sandbox, but I guess when receiving data you're already starting to write to disk?

Yes. What I actually meant is not that there will not be leftover files on disk, but that there will be no leftover partial loose objects.
One can then tell users that if nobody is using the container, all content in the sandbox folder is a leftover and can be deleted.

BTW, this is also what AiiDA is doing now during .store(), so it's not going to be more reliable than the current behaviour, but not less reliable either, I think.

> This really depends on the filesystem used. Based on the documentation for os.rename, it seems that at least on POSIX os.rename will fail if src and dst are not on the same filesystem; otherwise the operation is atomic. But that still doesn't mean it can't fail due to some other error, just that it won't move the data.

Indeed. I also checked, and the tests seem to indicate that this is also the case on Windows (I earlier used shutil.move instead, and the tests on Windows failed badly).
However, the logic is that the application will get a correct response with no exception only if the file was correctly moved. Otherwise, the file might remain in the sandbox (see discussion above), but the application will know that there was a problem.

The question is rather: what is the risk that the application thinks the file has been moved, while the OS still has to actually do it, and then, if the disk fails, the file remains in the old position or, even worse, the filesystem partition gets corrupted? But this is a general problem, and I imagine that modern journaled filesystems would at least not end up with a corrupted filesystem table.

The second part of this comment is whether the filesystem table is also cached, and whether there exists a sync/fsync call for that (for a file I could call fsync, but can I also do it for a file that is already closed, meaning to make sure it has been moved?).

> Yes and no; the problem is how you're going to recover from that: extra bytes at the end of a pack go unnoticed until you start writing to the pack again, unless you do a consistency check (from the DB you should be able to get the metadata and then check the size) or you have a WAL to replay. Unfortunately there are other reasons for additional or missing bytes: a crash after an operation could also leave you with a truncated file, even though you (the application) finished writing the data successfully.

My reasoning is that even if I append garbage to the end of a pack, this does not corrupt the disk-objectstore as it's implemented now: these bytes are just unaccounted for. It would be equivalent to having an object packed and then soft-deleting it by just removing the reference from SQLite, or to a process that was adding packed data but crashed. Actually, this makes me think that maybe, instead of always opening the pack in append mode, I could open it normally, seek to the last byte I am aware of (say, the largest offset in that pack plus the length of that object, assuming there is no accidental overlap between packed objects), truncate the file, and then start writing again from there.

The question is rather to understand under which conditions the application might think that it correctly appended to a file and correctly closed it, but the data can still get lost.

I imagine that if I call fsync at the end (I am not doing it, and probably I should), and we trust that the OS is really doing it and only continuing afterwards, and the same is true for the underlying hardware, then this should be safe. (Some hardware, like pen drives, tells you that it flushed but actually does it later, as discussed in the document "How to corrupt a SQLite DB" linked earlier - but I think there is nothing we can do about that, and in that case probably also the SQLite DB might get corrupted.)

@dev-zero

> The second part of this comment is whether the filesystem table is also cached, and whether there exists a sync/fsync call for that (for a file I could call fsync, but can I also do it for a file that is already closed, meaning to make sure it has been moved?).

It seems that you can actually get an fd for a directory and then call fsync on that, yes.
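For reference, a minimal sketch of how that could be done in Python (standard os calls, nothing specific to the package):

```python
import os

def fsync_directory(dir_path: str) -> None:
    """Flush the directory entry itself, so that a completed os.rename into this
    directory survives a crash (POSIX; not meaningful on Windows)."""
    fd = os.open(dir_path, os.O_RDONLY)  # directories can be opened read-only
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```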

> My reasoning is that even if I append garbage to the end of a pack, this does not corrupt the disk-objectstore as it's implemented now: these bytes are just unaccounted for. It would be equivalent to having an object packed and then soft-deleting it by just removing the reference from SQLite, or to a process that was adding packed data but crashed. Actually, this makes me think that maybe, instead of always opening the pack in append mode, I could open it normally, seek to the last byte I am aware of (say, the largest offset in that pack plus the length of that object, assuming there is no accidental overlap between packed objects), truncate the file, and then start writing again from there.

You should even be able to avoid the truncation until you finish writing. Positioning in the file and starting to write will simply overwrite it.
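A rough sketch of that idea (illustrative only; `known_end` would be computed from the SQLite index as the largest offset plus length for that pack):

```python
import os

def open_pack_for_writing(pack_path: str, known_end: int):
    """Sketch: open the pack for update instead of append, position at the last
    byte the index is aware of, and simply overwrite any unreferenced tail."""
    handle = open(pack_path, "r+b") if os.path.exists(pack_path) else open(pack_path, "wb")
    handle.seek(known_end)
    # Per the remark above, this truncate() could even be skipped or deferred
    # until the write is finished; it just drops the stale tail right away.
    handle.truncate()
    return handle
```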

> The question is rather to understand under which conditions the application might think that it correctly appended to a file and correctly closed it, but the data can still get lost.

This is actually very difficult, as the PostgreSQL developers had to find out:

> I imagine that if I call fsync at the end (I am not doing it, and probably I should), and we trust that the OS is really doing it and only continuing afterwards, and the same is true for the underlying hardware, then this should be safe. (Some hardware, like pen drives, tells you that it flushed but actually does it later, as discussed in the document "How to corrupt a SQLite DB" linked earlier - but I think there is nothing we can do about that, and in that case probably also the SQLite DB might get corrupted.)

A very good read seems to be (I have to study it once more in detail):

https://danluu.com/file-consistency/

I haven't completely thought it through, but using buffered I/O for incoming data (without direct writes to packs, for performance reasons) and direct I/O from there to the packs might be something? Allowing for low-latency writing, while a low-priority thread packs up the files later on (similar to git gc)?

@giovannipizzi (Member Author)

Thanks for the very interesting links! I'll look into them in detail.
Regarding your last comment - maybe in a sense it's what I am already doing?
When you write a new object, this just goes into a regular file (via the sandbox); precisely:

  • you ask for a handle for a new object
  • I give you back the handle that tempfile gives me, in a sandbox folder
  • you write into it and "close" it by going out of the context manager
  • I close the file, get the key (now a UUID, could be a hash in the future), and do an os.rename into an appropriately named file in the folder for loose files
  • [TBD: I should fsync here, both the file and the parent folder, according to your last link - see the sketch after this list]
  • return -> you can now ask for the object key that I created

The goal is that after this point you shouldn't lose your data (putting fsync in already gives more guarantees than AiiDA does today, BTW).
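A sketch of those steps as a context manager, including the two [TBD] fsync calls; the names (`new_object_writer`, `sandbox_dir`, `loose_dir`) are illustrative and this is not the actual disk-objectstore code:

```python
import os
import tempfile
import uuid
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def new_object_writer(sandbox_dir: str, loose_dir: str):
    """Sketch of the steps above: hand out a tempfile handle in the sandbox; on a
    clean exit, fsync the file, rename it into the loose folder, fsync the
    destination directory, and expose the new key."""
    fd, tmp_path = tempfile.mkstemp(dir=sandbox_dir)
    obj = SimpleNamespace(handle=os.fdopen(fd, "wb"), key=None)
    try:
        yield obj                       # the caller writes into obj.handle
        obj.handle.flush()
        os.fsync(obj.handle.fileno())   # [TBD] fsync the file itself
        obj.handle.close()
        obj.key = uuid.uuid4().hex      # currently a UUID, could become a hash
        os.rename(tmp_path, os.path.join(loose_dir, obj.key))
        dir_fd = os.open(loose_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)            # [TBD] fsync the parent folder as well
        finally:
            os.close(dir_fd)
    except BaseException:
        if not obj.handle.closed:
            obj.handle.close()
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage sketch:
# with new_object_writer("/path/to/sandbox", "/path/to/loose") as obj:
#     obj.handle.write(b"some content")
# print(obj.key)
```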

Then, packing is triggered by the user, when he/she wants. In my idea, this should be done e.g. in a cron job (daily or weekly, for instance).
In principle the user might not call it ever, if he/she does not care, and then you never use packs.
(One might think of automatic packing in the future, as suggested by @greschd in his comments to the AEP - I'm going to think about it, but for now we can just "suggest" that people pack periodically, or e.g. when there are too many loose objects; e.g. we could check when restarting the daemon, when exporting, or during other similar operations).

Packing will therefore be done in a different process, whenever the user wants, which is probably what you are suggesting, right? (The idea comes exactly from git, which by default stores loose objects and then lets you pack with git gc - and I think it repacks automatically before certain operations like push/pull.)

(sorry if instead this was already very obvious and I misunderstood your comment).

Regarding buffered vs. direct I/O: I am not doing anything explicit in this direction.
I guess a normal file write is buffered by default, right? For writing to packs, I am again using the default (in the separate process that does the packing).
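Right: Python's built-in open() does buffered I/O by default, so data may sit in the Python-level buffer and then in the OS page cache; an explicit flush plus fsync is needed before one can rely on it being on disk. An illustrative snippet (the file name is made up, not from the package):

```python
import os

with open("pack_0001.bin", "ab") as pack:        # hypothetical pack file
    pack.write(b"...some packed object data...")
    pack.flush()                                 # empty the Python-level buffer
    os.fsync(pack.fileno())                      # ask the OS to push it to disk
```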

@giovannipizzi (Member Author)

BTW, the good news is that in your link they say that SQLite is the best at taking care of these problems, and that SQLite+WAL didn't show any vulnerability in their static analysis - better than e.g. Postgres, and much better than git (at least at the time the article was written). This justifies my choice of using SQLite+WAL (of course it's not enough per se, if I then introduce bugs in coordinating the SQLite DB with the packs' content ;-) ).
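For reference, enabling WAL mode on an SQLite connection is a one-line pragma; a minimal sketch with the standard sqlite3 module (the actual package wires up its DB access differently, and the index file name here is made up):

```python
import sqlite3

conn = sqlite3.connect("pack_index.sqlite")   # hypothetical index file name
conn.execute("PRAGMA journal_mode=WAL;")      # switch to write-ahead logging
conn.execute("PRAGMA synchronous=NORMAL;")    # a common pairing with WAL mode
conn.commit()
conn.close()
```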

@giovannipizzi (Member Author)

Check also #28

@giovannipizzi (Member Author)

From the links (and from my experience after testing) it's clear that if there is a problem, it's not going to be in the SQLite part, so I am closing this.
