-
Notifications
You must be signed in to change notification settings - Fork 454
Support expiring ops from oplog #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Awesome - thanks for this. A lot of people have been very excited about this feature. I'll review & merge in the next couple of days if I can. |
Great, thanks. 😉 |
Not to be pushy, but is there any chance you could review this in the next few days? I'd like to get it into production asap since our db is growing really fast. Thanks a ton! |
Meta: This is one of those cases where it'd be useful to have an easy pay-for-features system. I'm currently (by choice) unemployed and I'm working on the new JSON type (which I'm finding quite enjoyable). I try to avoid context switching, especially when working on a hard problem. I don't want to context switch to other problems purely out of social obligation. In the long run, that will make me feel resentful of the opensource work I do, and that doesn't benefit anyone. There are a few solutions which come to mind. The most obvious ones are things like gittip or having some people who I trust that can review code like this. Its something to think about going forward. In this case the diff is actually quite small and I'll review it today. |
The code looks good. Can you also add tests to see what happens if you call |
I get it. Where is the best place to send you tips? Gittip? We greatly appreciate the work you've put into ShareJS etc. over the years. It is a seriously awesome enabling technology. I'll add the test for subscribing at an old version. Any thoughts about the last paragraph of the original issue? Not sure if we need to fetch the snapshot in getOps to check if ops are missing. Thanks! |
Thanks - I appreciate you saying so :) Yeah gittip is great. There's also a bitcoin address down the bottom of the gittip page. I didn't notice that comment - I'll take another close readthrough of that code in a few hours. I remember thinking carefully about the logic in that function, and I can't remember all the cases off the top of my head. I agree that it'd be nice to avoid calling getSnapshot there - I'll have a think about it soon. |
You're right - we straight out don't have enough information in lib/index.js's getOps function to do the right thing when Lets talk about the interfaces between parts. There's 3 interoperating parts in a getOps call:
It seems like there's two possible APIs:
Imagine the code consuming these operations. Imagine if I ask for ops 1-10 and you return ops 5-10. Lets talk through a few use cases:
But nobody has implemented a playback control yet, as far as I know. For now I'm happy with simply returning an error if any elements of the range are missing. Question: what is the API for the oplog and the driver? If the oplog's API is 'Best effort at returning as many ops as possible in the range', does the redis driver do the right thing in all cases? Is that the right API for each of those components? I think we need to tighten the interface constraints. First, the oplog should not be best-effort. If I ask for ops from 10-20, it should return a continuous range of values (it should never return 10,11,13,...). If its missing some operations, it should either return a partial set to the end of the requested range, or it should error. We need to be clear about what happens here. Looking at the code, currently livedb-mongo could do literally anything here - it simply returns the set of all operations which are still in mongo. Question: How does mongo handle TTLs? Will it ever be possible for mongo to return operation 10 but not operation 11? (I bet it is.) What API should the driver's getOps function expose? I bet the redis driver is currently buggy if the oplog returns a partial set of results. But if the oplog errors instead, the error should get propagated correctly. -> So lets just make livedb error if your range isn't in the database for now. Separately there's the question you actually asked. What should we do if I suppose another way to do it is to have a getops wrapper function in livedb that calls oplog.getOps then checks the results. If the range is the wrong size, it errors. If from is null and the range is empty, it calls oplog.getVersion to check the version. However, there's a race condition. Imagine the document is version 10. You call getOps(10,null) which returns an empty list. At the same time, someone else submits an operation. You call getVersion to make sure that you aren't missing any operations and it says you're at version 11 - so you are missing an operation! In this case, you should retry your getOps call and make sure you get that operation you missed. Sorry for the huge brain dump - I hope thats everything. |
Wow, thanks for all the thought you put into this. Here's what I think we should do. I think, for now at least, we should return an error when ops are missing. There should be a wrapper around I think this race condition is already handled by |
Requested ranges are checked against the returned ops now by livedb-mongo, so an error will be returned if ops are missing.
I decided that it was easier and made more sense to just handle this stuff from livedb-mongo since the driver calls that directly from subscribe, etc., and it means we only need to do these checks in one place. I basically did what you and I described above, so see share/sharedb-mongo#12 for the details. In terms of this PR, there are only test changes that are required now (due to behavior change in getOps - emitting error when ops are missing vs emitting what we have). The other changes I made earlier could be backed out I guess since we handle it from the db layer. Do you want me to do that, or keep them around (even though they are not complete) to catch oddities in redis or other places? |
Just merged master back into this branch. Any chance of reviewing this and share/sharedb-mongo#12 soon? FWIW, we've been running this in production at Storify for a few months now. |
@devongovett Without the ability to purge historical data, the official LiveDB is not a good option for me unless I had unlimited resources. Are you still using your |
Yes, still running this in production. |
Sweet, I'll start using your fork. It's a shame this feature never got merged. |
Reviewing? I did review it. Its missing the things I said, and missing tests. |
Uhh... I made changes since you reviewed and moved a bunch of stuff into the mongo backend instead of doing it here. There are a few tests here, but the real meat of the logic was moved into the mongo backend, so most of the tests are there. |
@@ -172,7 +172,7 @@ RedisDriver.prototype.atomicSubmit = function(cName, docName, opData, options, c | |||
// In this case, we should write a no-op ramp to the snapshot | |||
// version, followed by a delete & a create to fill in the missing | |||
// ops. | |||
throw Error('Missing oplog for ' + cName + ' ' + docName); | |||
return callback('Missing oplog for ' + cName + ' ' + docName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return callback('Missing oplog');
... so people can match on the error string.
Hah - I never took the money out of gittip and a week ago they quietly started refunding it :/
|
If a document is deleted, do the ops get purged as well? Not sure if this is implemented in the standard behavior already. |
This goes along with share/sharedb-mongo#12. It handles emitting
Missing operations
errors when operations requested from the oplog are not there. This happens insubmit
andgetOps
(and by extensionfetch
, since it callsgetOps
).This error should occur if a client has been offline longer than the ttl and tries to submit/fetch some ops. It should only occur if other edits have taken place while the user was offline, which have also expired. Specifically:
If 3 and 4 are swapped (i.e. user 2 doesn’t edit anything before the ops are dropped), the error shouldn't occur. Basically, it should be fairly rare.
When the error happens, ShareJS sends it back to the client, which has an opportunity to notify the user that their changes could not be merged. Not ideal, but necessary to keep the data size in mongo from growing forever.
One case I thought of but I'm not sure about is whether we need to get the snapshot version in
getOps
whento
is null to determine whether ops are expected from the oplog. Right now, the check for missing operations only occurs when there are ops returned (checkops[0].v === from
). Ifto
is null, we may also need to check if ops were expected iffrom < snapshot.v
, which means fetching the snapshot there. I don't really like that, but it may be necessary. Insubmit
this is handled already. What do you think?