-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OAK-11139 Allow downloading only recently changed nodes from MongoDB #1734
Conversation
...apache/jackrabbit/oak/index/indexer/document/incrementalstore/MergeIncrementalTreeStore.java
Show resolved
Hide resolved
...apache/jackrabbit/oak/index/indexer/document/incrementalstore/MergeIncrementalTreeStore.java
Show resolved
Hide resolved
...rg/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedTreeStoreIT.java
Show resolved
Hide resolved
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/IndexerSupport.java
Show resolved
Hide resolved
...java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileNodeStoreBuilder.java
Show resolved
Hide resolved
...g/apache/jackrabbit/oak/index/indexer/document/incrementalstore/IncrementalStoreOperand.java
Show resolved
Hide resolved
} | ||
|
||
public void setMinModified(long minModified) { | ||
this.minModified = minModified; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oak documentation mentions:
The _modified field contains an indexed low-resolution timestamp when the node was last modified. The time resolution is five seconds. This field is also updated when a branch commit modifies a node.
If I am understanding it correctly anything updated till 4 sec will be set to 0 as per time resolution.
i.e for a 14th second update _modified property will be set to 10 sec.
For a 29th second update _modified property will be set to 25 sec.
So to make sure that we don't miss any update
I think minModified should be
this.minModified = minModified - minModified/5
@stefan-egli , @mreutegg am I correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minModified is the minimum (lower bound) of the "_modified" property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything updated till 4 sec will be set to 0
Anything between on January 1st, 1970 midnight (up to ~5 seconds later), will be set to 0.
@@ -76,12 +76,21 @@ public class IndexerSupport { | |||
private File indexDefinitions; | |||
private final String checkpoint; | |||
private File existingDataDumpDir; | |||
private long minModified; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use lastModifed instead of minModified as we are using _modified property of mongMK which denotes lastModified timestamp?
Oak document:
The _modified field contains an indexed low-resolution timestamp when the node was last modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Last modified" refers to the last time a node was modified. It refers to the property "jcr:lastModified", which may or may not exist.
"Min modified" means the minimum value (lower bound) of the "_modified" property.
_modified property of mongMK which denotes lastModified timestamp
The "_modified" property does not denote the "jcr:lastModified" timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, _modified is independent of jcr:lastModified. My point was to just add info about what this property maps to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand now! Thanks!
} | ||
treeStore.removeNode(line.path); | ||
break; | ||
case REMOVE_IF_EXISTS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference from a behaviour point of view between the 2 modes is that there is warning logged for DELETE, which doesn't seem to warrant a new mode. But do you forsee different implementations going forward? In that case maybe it's ok to have these different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's the point: "REMOVE_IF_EXISTS" should not log a warning, but "DELETE" should.
I think both implementations are needed: one for the "diff" command (DELETE), and one for the "top-up" command (REMOVE_IF_EXISTS).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
I have one query on the overall top up implementation - when getting the topup directly from mongo using the _modified filter - we would miss the deleted nodes - right ? Eventually this would lead to larger and larger index stores ? We filter the non-existant nodes in queries, so even if this won't effect query results - but would it still impact the performance ? Let's say a bulk delete on a large content repository just before the top up. |
Only for those deleted nodes that the "diff" doesn't see. (My plan is to run both the diff and the top-up.)
Until the bugs in the "diff" parts are fixed, yes. But the verification would still notice it, and so we could re-download everything once we see that it's too many.
If the bugs in the "diff" logic are not fixed, then I think the behavior is the same as it is now. This is because "diff" is also used for async index update. If it turns out that the "diff" logic can not be fixed easily, then we could also change the download logic (pipelined strategy) to emit "removed" entries. But I think we don't need to implement this logic currently. |
No description provided.