Improve the changelog and / or create a blog post about recent repeat step multi-query improvements
#3799
Comments
Just want to say that I also think that a blog post would be great here. In general, I think it would be really great if we could create some blog posts to accompany the 1.0.0 release, as it has quite a few interesting features that we want to make users aware of. The downside is of course that someone needs to take the time to write those posts.
I have plans to write a blog post related to multi-query improvements in JanusGraph. That said, I’m not sure what platform to use yet.
I was planning to write a blog post to introduce the string vertex id feature. I was planning to publish it on my own Medium blog, but if there's an official one set up, I'd like to publish it there too.
Fixes JanusGraph#3799 Signed-off-by: Oleksandr Porunov <[email protected]>
This issue tracks documentation improvements for the following feature: #3783
As seen from the related discussion, the changelog entry for this improvement is hard to understand, and people might be confused about what really changed. We need to improve the wording, add more examples, and explain how JanusGraph's batch queries worked before #3783 and how they work now.
Current changelog says:
Batch registration for nested batch compatible steps is changed for `repeat` step

Previously any batch compatible steps like `out`, `in`, `values`, etc. would receive vertices for batch registration from all `repeat` parent steps, but only for their starts in case of multi-nested repeat steps (skipping their subsequent iterations registration).
With JanusGraph 1.0.0 batches registration for the subsequent iterations of multi-nested repeat steps are used as well.
In the example above multi-nested `repeat` case would not register vertices returned from the inner `emit()` step for the next outer iteration which would result in sequential calls of `in("connects")` for next outer iteration. The behaviour is now changed to register these vertices for the next child `repeat` step start.
The behaviour can be controlled by `query.batch.repeat-step-mode` configuration option. In case the old behaviour is preferable then `query.batch.repeat-step-mode` should be set to `starts_only_of_all_repeat_parents`.
However, in cases when transaction cache is small and repeat step traverses more than one level deep, it could result for some vertices to be re-fetched again which would mean a waste of operation when it isn't necessary.
In such situations `closest_repeat_parent` mode might be more preferable than `all_repeat_parents`. With `closest_repeat_parent` mode vertices for batch registration will be received from the start of the closest `repeat` step as well as the end of the closest `repeat` step (for the next iteration). Any other parent `repeat` steps will be ignored.
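For instance, switching back to the old behaviour would be a one-line change in the graph configuration. This fragment is only a sketch: the option name and value are the ones quoted in the changelog text above, and it is assumed to be added to an otherwise working JanusGraph properties file.

```properties
# Assumed to live in an existing JanusGraph properties file.
# Option name and value are taken from the changelog text quoted above.
query.batch.repeat-step-mode=starts_only_of_all_repeat_parents
```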
The changelog's first sentence is quite confusing and hard to understand. We need to restructure it or simply write a blog post to explain what changed.
In short, the old behavior was a bad version of the new `starts_only_of_all_repeat_parents` mode. I say a bad version because:
- Not only the `MultiQueriable` steps which are `start` steps of their `repeat` parent steps would receive vertices for batch queries, but ALL `MultiQueriable` children steps would receive vertices for batching from all their `repeat` steps. For example, `g.V(v1,v2,v3).repeat(out().out().out().out()).emit()` - in this example you may think that only the first `out` step will be registered with `JanusGraphMultiQueryStep`, which is placed before the `.repeat()` step. That's true after #3783 (Revamp JanusGraphMultiQueryStrategy for better parent step usage), but it was not the case before. Before #3783 we would register each of those four `out` steps with `JanusGraphMultiQueryStep`. So, each of those steps would most likely perform badly for its first batch query, because it would have to perform unnecessary operations for unnecessary vertices before it had a chance to perform batch requests for the needed vertices. This could be considered a `performance` bug. Nevertheless, it's fixed in #3783.
- `g.V(v1).emit().repeat(out()).until(loops().is(P.gt(5)))`. In this situation, before #3783, we didn't register the `out()` result of the previous iteration with the `out()` of the next iterations. Thus, we basically performed single-vertex batches, which are quite inefficient. In #3783 the behavior is changed and we now register whatever result we have after a `repeat` iteration with the next iteration.
- `repeat` steps are trickier because we might now want to register next iterations for nested `repeat` steps. Thus, we need some modes which control the behavior. Do we want to register with all parents' `repeat` step start? Do we want to register with all parents' `repeat` step end (i.e. next iterations)? It all depends on the situation, case by case. For example, `g.V(v1,v2,v3).emit().repeat(__.repeat(__.in("connects")).emit()).until(__.loops().is(P.gt(10)))` - in this situation we need to understand that each `repeat` iteration processes a single `Traverser`. So, do we really want to register `v1`, `v2`, and `v3` with `in("connects")` as the first batch query request? In case the transaction cache is small, it might be that when we traverse `v2` after `v1` the cache is already gone and we will perform a multi-query request for `v2` again, thus making the first multi-query request redundant. `v2` is not used by the first traverser going into `in("connects")`, and most likely `v2` won't even be the second or third traverser going into `in("connects")`, because there could be several levels going from `v1`. Nevertheless, in some situations users want to retrieve `v2` and `v3` together with `v1` at the first access because they know that their transaction cache is big enough and they will eventually access `v2` and `v3` during their traversal. The same logic applies to the next iterations as well, i.e. we don't know if the user wants to request vertices in batch for the next iteration or not. We can say it only for the first parent `repeat` step, but not for all parent `repeat` steps. IN SHORT: Previously the logic would not take next step iterations into account, but would always register vertices from the start of ALL repeat steps which are direct parents of each other. It's now changed and we can say whether we want to use the closest `repeat` parent step only, all `repeat` steps for starts, or all `repeat` steps both for starts and ends.
- `g.V(v1,v2,v3).emit().repeat(union(repeat(out()).emit())).until(loops().is(5))` - in this situation, as you might notice, the inner `repeat` step is not a direct child of the outer `repeat` step. Even though `union` is a start step and the inner `repeat` step is also a start step, before #3783 `v1`, `v2`, `v3` wouldn't be registered for the first batch because those `repeat` steps are not directly referenced. In #3783 the behavior is improved and we can now detect and skip any multi-query compatible start parent steps. Thus `v1`, `v2`, and `v3` will be registered for the first batch request (if `all_repeat_parents` or `starts_only_of_all_repeat_parents` modes are used).

We need to explain all of the above information in some other form which is easier for current users to follow. Also, we need to explain why those `repeat` modes exist and how they work.
Here are the current mode descriptions from `batch-processing.md`:
Multi-nested `repeat` step modes:
By default, in cases when batch start steps have multiple `repeat` step parents the batch registration is considering all `repeat` parent steps.
However, in cases when transaction cache is small and repeat step traverses more than one level
deep, it could result for some vertices to be re-fetched again or vertices which don't need to be fetched due to early
cycle end could potentially be fetched into the transaction cache. It would mean a waste of operation when it isn't necessary.
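The re-fetch scenario can be illustrated with a toy model. The sketch below is plain Python, not JanusGraph code: `TinyTxCache` and `v2_survives_deep_walk` are hypothetical names, and the LRU eviction policy is an assumption made only to keep the example small. It prefetches `v1`, `v2`, `v3` in one batch, walks a number of other vertices reached from `v1`, and then checks whether `v2` is still cached when the traversal finally gets to it.

```python
from collections import OrderedDict


class TinyTxCache:
    """Toy LRU stand-in for a transaction cache (assumption, not JanusGraph's cache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.backend_fetches = 0

    def prefetch(self, vertices):
        """Load a batch of vertices up front, as a batch registration would."""
        for vertex in vertices:
            self._load(vertex)

    def get(self, vertex):
        """Return True on a cache hit; on a miss, fetch the vertex and return False."""
        if vertex in self.entries:
            self.entries.move_to_end(vertex)  # mark as most recently used
            return True
        self._load(vertex)
        return False

    def _load(self, vertex):
        self.backend_fetches += 1
        self.entries[vertex] = True
        self.entries.move_to_end(vertex)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


def v2_survives_deep_walk(capacity, walk_length):
    """Prefetch v1..v3, walk `walk_length` other vertices from v1, then access v2."""
    cache = TinyTxCache(capacity)
    cache.prefetch(["v1", "v2", "v3"])
    cache.get("v1")
    for i in range(walk_length):
        cache.get(f"n{i}")  # hypothetical vertices reached while traversing from v1
    return cache.get("v2")
```

With a capacity of 4 and a walk over 5 other vertices, `v2` is evicted before it is used, so fetching it in the first batch was wasted work; with a sufficiently large cache it is still there when needed.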
Thus, JanusGraph provides a configuration option `query.batch.repeat-step-mode` to control multi-repeat step behaviour:
- `closest_repeat_parent` (default option) - consider the closest `repeat` step only. `out("knows")` will be receiving vertices for batching from the `and` step input for the first iterations as well as the `out("knows")` step output for the next iterations.
- `all_repeat_parents` - consider registering vertices from the start and end of each `repeat` step parent. `out("knows")` will be receiving vertices for batching from the most outer `repeat` step input (for the first iterations), the most outer `repeat` step output (which is `and` output) (for the first iterations), the `and` step input (for the first iterations), and from the `out("knows")` output (for the next iterations).
- `starts_only_of_all_repeat_parents` - consider registering vertices from the start of each `repeat` step parent. `out("knows")` will be receiving vertices for batching from the most outer `repeat` step input (for the first iterations), the `and` step input (for the first iterations), and from the `out("knows")` output (for the next iterations).
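The three mode descriptions above can be summarised as a small lookup. This is a plain-Python sketch of how I read those descriptions, not JanusGraph's implementation: `registration_points` and the `inner` / `outer` labels are hypothetical, with `inner` standing for the closest `repeat` parent of `out("knows")` and `outer` for the most outer one. Including the closest `repeat` end in `starts_only_of_all_repeat_parents` follows the "output (for the next iterations)" wording of that bullet and is an assumption about the intended meaning.

```python
def registration_points(mode, repeat_parents):
    """List the repeat boundaries that feed vertices into a nested batch step.

    `repeat_parents` is ordered from the closest (innermost) repeat step to the
    most outer one. Each result is a (repeat_name, boundary) pair, where
    "start" means the repeat step input (first iterations) and "end" means the
    repeat step output (next iterations).
    """
    if not repeat_parents:
        return []
    closest = repeat_parents[0]
    if mode == "closest_repeat_parent":
        points = {(closest, "start"), (closest, "end")}
    elif mode == "all_repeat_parents":
        points = {(r, b) for r in repeat_parents for b in ("start", "end")}
    elif mode == "starts_only_of_all_repeat_parents":
        points = {(r, "start") for r in repeat_parents}
        points.add((closest, "end"))  # assumption: closest output still feeds next iterations
    else:
        raise ValueError(f"unknown repeat step mode: {mode}")
    return sorted(points)
```

For two nested `repeat` steps, `registration_points("closest_repeat_parent", ["inner", "outer"])` only lists the `inner` boundaries, while `all_repeat_parents` lists all four.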