EOF issue with zfsbackup-go #59
Let me preface this by saying that I am not a Go programmer, nor an expert on B2; just a humble punter trying to get zfsbackup-go working on my Solaris machine. I don't know enough to know whether this issue is in zfsbackup-go's usage of Blazer, or in Blazer itself.

cf. someone1/zfsbackup-go#17

`ListObjects()` in `Writer.getLargeFile()` could return EOF, which was causing zfsbackup-go to hang starting a new upload. Treating EOF identically to `len(objs) < 1` (i.e. restarting with `w.Resume = false`) seemed to fix it, but I'm not sure if this is the correct fix.
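A minimal sketch of the workaround being described, in Go; all of the helper names here (`listUnfinished`, `startNew`, `resumeFrom`) are hypothetical stand-ins for blazer internals, not real API names (imports: `context`, `io`):

```go
// Hypothetical sketch of the workaround: treat io.EOF from the listing
// call the same as an empty result, so the writer falls back to a
// fresh upload instead of hanging.
func getLargeFile(
	ctx context.Context,
	listUnfinished func(context.Context) ([]string, error),
	startNew func(context.Context) error,
	resumeFrom func(context.Context, string) error,
) error {
	objs, err := listUnfinished(ctx)
	if err != nil && err != io.EOF {
		return err // a real listing failure
	}
	if err == io.EOF || len(objs) < 1 {
		// Nothing to resume: restart as if w.Resume were false.
		return startNew(ctx)
	}
	return resumeFrom(ctx, objs[0])
}
```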
This feels like an issue in Blazer but let me see if I can get a test case failing. This code is possibly more complicated than it ought to be 🙄
I suspect you've surfaced a bug where, if `Resume` is true but there's nothing to resume, it just blows up.
Thanks for catching this! Try with the latest commit; you should be good.
Apologies for the tardy chiming in - I added that bit since retries after a network failure would result in incomplete open large-file uploads. When I made the change I thought blazer would know to start a new file if there was nothing to resume (I may have just read the code wrong at the time!). Anyway, thanks for all the debugging @lhoward and for the quick fix @kurin!
This fix unfortunately gets it stuck in an endless loop with: …
Does this happen with any invocation? I have a ZFS box around here somewhere I can try to test this on, but if you can reproduce it with the environment variable `B2_LOG_LEVEL=2` you'll get a lot of debug lines that you could add to the bug (be sure to purge them of secret data).
I wasn't exactly sure what data to consider secret, so for now I'll just paste in a summary – happy to send you the complete details privately if you mail me (lukeh at padl.com).
The key appears to be this ("chunks don't match"): …
Oooh, okay, so what's happening is that to resume the large file, the checksums of the already-uploaded chunks have to match the checksums of the corresponding chunks of the current file; if not, you're really uploading a different file. The reason your checksums don't match, I'm pretty sure, is that the snapshot you're uploading is encrypted, and the encryption is of course producing a different data stream each time.

This library is designed so that clients who receive errors will by default simply retry forever; the only permanent failure should be if the process itself dies. @someone1, do you have examples of the kinds of failures that led you to set `Resume` to true?

I need to think about whether there's a better way to handle non-matching resumes. I think it might be a good idea to add an option, when there's a conflict, to remove the partially uploaded file and start from scratch.
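To illustrate why an encrypted stream defeats resume-by-checksum, here is a small self-contained Go sketch (not part of blazer or zfsbackup-go): encrypting the identical plaintext twice with a random nonce yields different ciphertext, so the SHA1 checksums B2 compares on resume can never match.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha1"
	"fmt"
	"log"
)

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		log.Fatal(err)
	}
	plaintext := []byte("the same zfs send stream, byte for byte")

	for attempt := 1; attempt <= 2; attempt++ {
		block, err := aes.NewCipher(key)
		if err != nil {
			log.Fatal(err)
		}
		gcm, err := cipher.NewGCM(block)
		if err != nil {
			log.Fatal(err)
		}
		nonce := make([]byte, gcm.NonceSize())
		if _, err := rand.Read(nonce); err != nil {
			log.Fatal(err)
		}
		// A fresh nonce makes the ciphertext differ on every attempt,
		// so the checksum of the "same" chunk changes between runs.
		ciphertext := gcm.Seal(nonce, nonce, plaintext, nil)
		fmt.Printf("attempt %d chunk SHA1: %x\n", attempt, sha1.Sum(ciphertext))
	}
}
```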
So setting resume to false should fix it? There is a resume command-line option to zfsbackup BTW but I'm not setting that. And yes, I am encrypting.
Is this resuming within or across a zfsbackup session, I wonder? In the latter case it seems like it should be disabled if encryption is on (even if it's off, unless the volumes are cached locally, possibly an unencrypted backup may differ?). Just speculating.
I'm not familiar with zfsbackup-go, but it sounds to me like it's starting the upload from the beginning with `Resume` set to true. I'm wondering what's happening to cause zfsbackup-go to restart from the beginning, however, because that shouldn't be necessary. If the backup is interrupted (e.g. if the network drops), the client should just hang until service is restored.
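For reference, a minimal sketch of how a blazer client sets this up; the bucket and file names are made up, and error handling is kept terse:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/kurin/blazer/b2"
)

func main() {
	ctx := context.Background()
	client, err := b2.NewClient(ctx, os.Getenv("B2_ACCOUNT_ID"), os.Getenv("B2_ACCOUNT_KEY"))
	if err != nil {
		log.Fatal(err)
	}
	bucket, err := client.Bucket(ctx, "my-backups") // made-up bucket name
	if err != nil {
		log.Fatal(err)
	}

	src, err := os.Open("snapshot.zstream") // made-up file name
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	w := bucket.Object("snapshot.zstream").NewWriter(ctx)
	w.Resume = true // try to pick up a matching unfinished large file
	if _, err := io.Copy(w, src); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```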
If in zfsbackup-go's …
@kurin - I think it was a mistake to add `Resume: true` on my end. I think it would make sense to provide the following in blazer: …
I'd love to have my code better tested against B2 - the bonfire package you have seems interesting in this regard - can it be used to set up a local B2 emulator to write tests against? If so, mind sharing a quick example how?
I don't believe B2 has a way to automatically cancel large files. The only permanent failure for the writer should be a cancelled context. Any other failure ought to be recoverable, and so wouldn't benefit from an abort-on-failure option, and when the context is cancelled we can't use it to clean up after ourselves. It's possible to list unfinished files (with …).

Bonfire is an experimental reimplementation of the B2 server API that I haven't had time to work on in a while.
B2 does not automatically cancel large files - but it does have a b2_cancel_large_file call which is not exposed via blazer. The Writer interface already handles chunking a file, initiating a large file upload, and completing it for you; can it be updated to also cancel the upload if any failures occur (or expose a call that does)? In addition to a cancelled context, wouldn't a permanent network failure or internal server errors from Backblaze eventually fail any sort of retry/backoff attempts blazer makes, or does it operate in such a way that it will NEVER return an error? I know the GCS/S3 Go SDKs will eventually fail outside of a cancelled context.

The automatic cleanup would be great for situations where the power is pulled and large files are left, but that's a Backblaze feature, not blazer (S3 uses lifecycle rules for this). The AWS SDK for Go's s3manager will start a multipart upload and cancel it should it exhaust retries (network failures, cancelled context, etc.). This is helpful as a high-level API where the details surrounding running and managing a multipart upload are hidden from the developer. I thought it'd make sense as an option to the blazer Writer. The options I was suggesting are (naming aside): …
If you think this is better left outside the Writer interface, I understand - I like the feature of …
The B2 integration checklist specifies that some errors (including network errors) should be handled with exponentially backed-off retries. So in that sense blazer will retry forever, even given network failures, until a context is cancelled or the process exits. It turns out some blazer clients would actually prefer just to receive and handle errors themselves, and I've kicked around some ideas for how to implement that (#55) but nothing's been done. It probably makes sense to add the ability to say "if NO progress has been made in some interval, then kill the upload". At the very least I should expose the b2_cancel_large_file request. I can't think of a way to gracefully handle the dead-context situation; it might be best to make an unexported cancel that doesn't rely on a context at all.
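For reference, that retry discipline looks roughly like the generic sketch below (not blazer's actual implementation; imports: `context`, `time`):

```go
// retryForever retries op with exponential backoff. Per the checklist
// behavior described above, the only permanent failure is a cancelled
// context.
func retryForever(ctx context.Context, op func() error) error {
	backoff := 500 * time.Millisecond
	for {
		if err := op(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2 // exponential back-off, capped at one minute
		}
	}
}
```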
In the interim, backups with …
I agree there should be some way to call b2_cancel_large_file - looking forward to an implementation! As for error handling/retries - given the B2 integration checklist, I think the current behavior is to spec and therefore fine (assuming it errors out for any non-408 400 error codes; seems like everything else should be retried). I can cancel requests using the Context I pass in. One thing might be worth adjusting, though: instead of using `time.Sleep`, select on the context as well, so a cancelled context doesn't have to wait out the backoff:

```go
select {
case <-w.ctx.Done():
	err = w.ctx.Err()
case <-time.After(sleep):
	// Remainder of code from 169-182
}
```
Yeah, that's probably a good idea.
So I've added a thing for this: https://godoc.org/github.com/kurin/blazer/b2#WithCancelOnError

I'm not super happy about the interface, though. The idea with `ctxf` is that you probably want a new context if the old one is cancelled, and probably with a short deadline on it, so that your writer doesn't hang forever even after you cancel the writer's context. Play around with this and see how it fits your needs.
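A sketch of wiring this up from the caller's side, based on the description above and continuing from the earlier writer setup. The exact signature of `WithCancelOnError` is in the linked godoc; the error-callback parameter shown here is an assumption, as are the bucket and object names:

```go
// ctxf hands the writer a fresh, short-deadline context for cleanup,
// so the cancel call can still reach B2 even after the writer's own
// context has been cancelled.
ctxf := func() context.Context {
	cctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	_ = cancel // the deadline bounds this context's lifetime
	return cctx
}

w := bucket.Object("backup/vol0").NewWriter(ctx, b2.WithCancelOnError(ctxf, func(err error) {
	log.Printf("cleanup of unfinished large file failed: %v", err)
}))
```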
Sorry to come back to this issue so late - finally updating zfsbackup-go, and while updating my backend code all around I noticed how the Google Cloud Storage client handles all sorts of situations using a cancelled context, and wonder if it would be a good fit here: https://pkg.go.dev/cloud.google.com/go/storage?tab=doc#Writer.CloseWithError (read the deprecation notice)

Basically, instead of using an option or anything of the sort, if the context passed to the writer is cancelled, the upload is aborted. I'm using the new option you added, cancelling the context manually on error, and just opting not to call `Close`.
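For context, the GCS pattern referenced above looks like this (a sketch against cloud.google.com/go/storage; the bucket and object names are made up): aborting an upload is done by cancelling the writer's context rather than by a dedicated abort call.

```go
import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

// upload writes src to GCS, aborting by cancelling the writer's
// context if the copy fails partway through.
func upload(client *storage.Client, src io.Reader) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	w := client.Bucket("my-bucket").Object("my-object").NewWriter(ctx)
	if _, err := io.Copy(w, src); err != nil {
		cancel() // aborts the upload; the object is never committed
		return err
	}
	// Close finalizes the upload; the object becomes visible only on success.
	return w.Close()
}
```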
Automatically calling … You should always call `Close`.
Hmmm - the problem I'm experiencing is that if I start an upload and there's an error while writing it to b2, calling … For the GCS client, I just cancel the context. My test is somewhat contrived, but I basically try to … Maybe there's an edge case here since …
I'd need to think about whether we can infer from a cancelled context whether the user does or does not want to clean up the underlying file. On the one hand, it would suck to get partway through a huge file just to have a ctrl-c or something undo all your work; on the other hand, as you say, we don't want to leave partials lying around. Today, if you cancel the context and call Close(), the file should be left in an unwritten state; it can't complete the upload without a live context.