Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the changelog and / or create a blog post about recent repeat step multi-query improvements #3799

Open
porunov opened this issue Jun 4, 2023 · 3 comments · May be fixed by #4067
Open
Assignees
Milestone

Comments

@porunov
Copy link
Member

porunov commented Jun 4, 2023

This issue is here to track documentation improvements for the next feature: #3783
As seen from the next discussion the change-log regarding this improvement is hard to understand and people might be confused what really changed. We need to improve wording, add more examples, explain how JanusGraph's batch queries was working before #3783 and how it works now.
Current changelog says:


Batch registration for nested batch compatible steps is changed for repeat step

Previously any batch compatible steps like out, in, values, etc. would receive vertices for batch registration
from all repeat parent steps, but only for their starts in case of multi-nested repeat steps
(skipping their subsequent iterations registration).
With JanusGraph 1.0.0 batches registration for the subsequent iterations of multi-nested repeat steps are used as well.

g.V(startVertexId).emit().
    repeat(__.repeat(__.in("connects")).emit()).
    until(__.loops().is(P.gt(2)))

In the example above multi-nested repeat case would not register vertices returned from the inner emit() step
for the next outer iteration which would result in sequential calls of in("connects") for next outer iteration. The behaviour is
now changed to register these vertices for the next child repeat step start.

The behaviour can be controlled by query.batch.repeat-step-mode configuration option.
In case the old behaviour is preferable then query.batch.repeat-step-mode should be set to starts_only_of_all_repeat_parents.

However, in cases when transaction cache is small and repeat step traverses more than one level
deep, it could result for some vertices to be re-fetched again which would mean a waste of operation when it isn't necessary.
In such situations closest_repeat_parent mode might be more preferable than all_repeat_parents.
With closest_repeat_parent mode vertices for batch registration will be received from the start of the closest
repeat step as well as the end of the closest repeat step (for the next iteration). Any other parent repeat steps
will be ignored.


The changelog's first sentence is quite confusing and hard to understand. We need to restructure it in something different or simply write a blog post to explain what changed.

In short, the old behavior was a bad version of the new starts_only_of_all_repeat_parents mode.
I say a bad version because:

  • Previously not only MultiQueriable steps which are start steps of their repeat parent steps would receive vertices for batch queries, but ALL MultiQueriable children steps would receive vertices for batching from all their repeat steps. For example, g.(v1,v2,v3).repeat(out().out().out().out()).emit() - in this example you may think that only the first out step will be registered with JanusGraphMultiQueryStep which is placed before .repeat() step. It's true after Revamp JanusGraphMultiQueryStrategy for better parent step usage [cql-tests] [tp-tests] #3783 , but not the case before. Before Revamp JanusGraphMultiQueryStrategy for better parent step usage [cql-tests] [tp-tests] #3783 we would register each of those four out steps with JanusGraphMultiQueryStep. So, each of those steps will perform badly for their first batch query most likely because they will have to perform unnecessary operations for unnecessary vertices before they have a chance to perform batch requests for needed vertices. This could be considered as a performance bug. Nevertheless it's fixed in Revamp JanusGraphMultiQueryStrategy for better parent step usage [cql-tests] [tp-tests] #3783 .
  • Next iterations were often not considered. For example, g.V(v1).emit().repeat(out()).until(loops().is(P.gt(5))). In this situation we don't register out() result of the previous iteration to out() of the next iterations. Thus, we basically perform single vertex batches which are quite inefficient. In Revamp JanusGraphMultiQueryStrategy for better parent step usage [cql-tests] [tp-tests] #3783 the behavior is changed and we now register whatever result we have after repeat iteration with the next iteration.
  • multi-nested repeat steps are more trickier because we might now want to register next iterations for nested repeat steps. Thus, we have to have some modes which control the behavior. Do we want to register with all parents' repeat step start? Do we want to register with all parents' repeat step end (i.e. next iterations)? It all depends in case by case situation. For example, g.V(v1,v2,v3).emit().repeat(__.repeat(__.in("connects")).emit()).until(__.loops().is(P.gt(10))) in this situation we need to understand that each repeat iteration performs a single Traverser. So, do we really want to register v1, v2, and v3 with in("connects") as the first batch query request? In case transaction cache is small then it might be that when we traverse v2 after v1 - the cache is already gone and we will perform multi-query request for v2 again, thus making the first multi-query request redundant. v2 is not used on the first traverser going into in("connects") and most likely v2 won't be even second or third traverser going into in("connects") because there could be several levels going from v1. Nevertheless, in some situations users want to retrieve v2 and v3 together with v1 at the first access because they know that their transaction cache is big enough and they eventually will access v2 and v3 during their traversal. The same logic applies to the next iterations as well. I.e. we don't know if the user wants to request vertices in batch for the next iteration or not. We can say it only for the first parent repeat step, but not for all parent repeat steps. IN SHORT: Previously the logic would not take into account next step iterations, but always would always register vertices from the start of ALL repeat steps which are direct parents to each other. It's not changed and we can now say if we want to use the closest repeat parent step only, all repeat steps for starts or all repeat steps both for starts and ends.
  • Continuing to talk about multi-nested repeat steps, if we check the previous issue, we see that I said ALL repeat steps which are direct parents to each other. This is actually another flow. The problem is that we don't take into account non-repeat steps in this situation. For example, g.(v1,v2,v3).emit().repeat(union(repeat(out()).emit())).until(loops().is(5)) - in this situation, as you might notice, inner repeat step is not a direct child of outer repeat step. Even so union is a start step and the inner repeat step is also a start step - v1, v2, v3 won't be registered for the first batch because those repeat steps are not directly referenced. In Revamp JanusGraphMultiQueryStrategy for better parent step usage [cql-tests] [tp-tests] #3783 the behavior is improved and we can now detect and skip any multi-query compatible start parent steps. Thus v1, v2, and v3 will be registered for the first batch request (if all_repeat_parents or starts_only_of_all_repeat_parents modes are used).

We need to explain all of the above information in some other form which is easier to catch up by current users. Also, we need to explain why those repeat modes exist and how they work.

Here is the current modes descriptions from the batch-processing.md:


Multi-nested repeat step modes:

By default, in cases when batch start steps have multiple repeat step parents the batch registration is considering all repeat
parent steps.
However, in cases when transaction cache is small and repeat step traverses more than one level
deep, it could result for some vertices to be re-fetched again or vertices which don't need to be fetched due to early
cycle end could potentially be fetched into the transaction cache. It would mean a waste of operation when it isn't necessary.

Thus, JanusGraph provides a configuration option query.batch.repeat-step-mode to control multi-repeat step behaviour:

  • closest_repeat_parent (default option) - consider the closest repeat step only.
    g.V().repeat(and(repeat(out("knows")).emit())).emit()
    In the example above, out("knows") will be receiving vertices for batching from the and step input for the first iterations
    as well as the out("knows") step output for the next iterations.
  • all_repeat_parents - consider registering vertices from the start and end of each repeat step parent.
    g.V().repeat(and(repeat(out("knows")).emit())).emit()
    In the example above, out("knows") will be receiving vertices for batching from the most outer repeat step input
    (for the first iterations), the most outer repeat step output (which is and output) (for the first iterations),
    the and step input (for the first iterations), and from the out("knows") output (for the next iterations).
  • starts_only_of_all_repeat_parents - consider registering vertices from the start of each repeat step parent.
    g.V().repeat(and(repeat(out("knows")).emit())).emit()
    In the example above, out("knows") will be receiving vertices for batching from the most outer repeat step input
    (for the first iterations), the and step input (for the first iterations), and from the out("knows") output
    (for the next iterations).
@FlorianHockmann
Copy link
Member

Just want to say that I also think that a blog post would be great here. In general, I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of. Downside is of course that someone needs to take the time to write those posts.
It also looks like we already have a domain in place for such a blog: https://github.com/JanusGraph/blog.janusgraph.org

@porunov
Copy link
Member Author

porunov commented Jun 5, 2023

Just want to say that I also think that a blog post would be great here. In general, I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of. Downside is of course that someone needs to take the time to write those posts.
It also looks like we already have a domain in place for such a blog: https://github.com/JanusGraph/blog.janusgraph.org

I have plans to write a blog post related to multi-query improvements in JanusGraph. That said, I’m not sure what platform to use yet.
I know only Medium for blog posts.
It would be great to have all related blog posts under blog.janusgraph.org, but I didn’t research how to do that, so not sure about it. I guess some blog posts UI integration could exist.

@li-boxuan
Copy link
Member

I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of

I was planning to write a blog post to introduce the string vertex id feature. I was planning to publish it on my own medium blog, but if there's an official one set up, I'd like to publish it there too.

@porunov porunov self-assigned this Oct 16, 2023
porunov added a commit to porunov/janusgraph that referenced this issue Oct 18, 2023
@porunov porunov linked a pull request Oct 18, 2023 that will close this issue
9 tasks
@porunov porunov modified the milestones: Release v1.0.0, Unscheduled Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants