Feature Request: support more than 2048 node/edge labels #64

Open
kuzeko opened this issue Aug 21, 2020 · 12 comments

Comments

@kuzeko

kuzeko commented Aug 21, 2020

Currently ArangoDB cannot support more than 2048 node/edge labels (in total?)
since all labels are mapped to collections.

See: #63

Some comments:

1 - A new option would be needed to choose between storing one label per collection and storing labels as properties.

2 - This also needs to take into account how indexing and traversal are handled for queries like:

g.V().out('label1').out('label2')...

3 - The documentation should mention this limitation up front.
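
To make point 2 concrete, here is a minimal Python sketch of how the same chain of out() steps could be rendered as AQL under each storage choice. It is not the provider's actual translation: the function names, the single edge collection `E`, the `label` attribute, and the assumption that the traversal starts from one vertex bound to `@start` (rather than from all vertices as in g.V()) are all hypothetical.

```python
import re

# Conservative check so a label can be used directly as a collection name.
_SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def out_chain_collections(labels):
    """Render out(l1).out(l2)... with one edge collection per label."""
    lines, prev = [], "@start"
    for i, label in enumerate(labels):
        if not _SAFE_NAME.match(label):
            raise ValueError(f"label not usable as a collection name: {label!r}")
        indent = "  " * i
        # Each hop traverses only the edge collection named after the label.
        lines.append(f"{indent}FOR v{i} IN 1..1 OUTBOUND {prev} {label}")
        prev = f"v{i}"
    lines.append("  " * len(labels) + f"RETURN {prev}")
    return "\n".join(lines), {}


def out_chain_property(labels, edge_collection="E", attr="label"):
    """Render the same chain with all edges in one collection,
    filtering on a (hypothetical) label attribute at every hop."""
    lines, bind, prev = [], {}, "@start"
    for i, label in enumerate(labels):
        indent = "  " * i
        lines.append(f"{indent}FOR v{i}, e{i} IN 1..1 OUTBOUND {prev} {edge_collection}")
        lines.append(f"{indent}  FILTER e{i}.{attr} == @label{i}")
        bind[f"label{i}"] = label  # label values travel as bind parameters
        prev = f"v{i}"
    lines.append("  " * len(labels) + f"RETURN {prev}")
    return "\n".join(lines), bind


if __name__ == "__main__":
    for build in (out_chain_collections, out_chain_property):
        query, bind_vars = build(["label1", "label2"])
        print(query, bind_vars, sep="\n", end="\n\n")
```

In the first form the number of distinct labels is bounded by the number of collections; in the second it is not, but every hop depends on an index over the label attribute to stay fast.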

@MartinBrugnara
Contributor

@arcanefoam IIRC, the adapter for TinkerPop 2.x worked around this problem by putting all nodes in a collection V and all edges in a collection E. But this solution would surely have performance implications...

@arcanefoam
Collaborator

Hi, as stated on #54, the provider architecture and the use of ArangoDB's named graph API make this change a difficult one.

However, this particular request adds the extra burden of not using collections for labels. We would probably no longer be able to use the graph syntax of AQL, which means a complete rewrite of the AQL queries used by the current implementation. Implementation-wise, we would need to change the architecture to add something like an "AQL translator"; we could then provide different implementations, one with graph AQL and one with plain AQL (plus the integrity checks for #54).

@kuzeko
Author

kuzeko commented Aug 24, 2020

Sorry, but does this mean that, in the end, the real problem is that Arango graphs and AQL itself have a physical limit on the number of node/edge labels?
So we are trying to fix a real limitation of the underlying system with a hack in the Gremlin driver?

@arcanefoam
Collaborator

In the current implementation each label is mapped to a collection. As @dothebart mentioned, there is a limit of 2048 collections, and as a result there is a limit on the number of labels supported by the Arango TinkerPop provider.

The decision to model labels as collections was an architectural one driven by the decision to rely on the ArangoDB AQL support for graphs (see #58). If labels are modelled as vertex/edge properties then we could have two collections (Vertex, Edges), but we would need to use "normal" AQL and implement all the graph logic in the provider. It is possible, but hard.
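
As a rough illustration of the two layouts, the Python sketch below shows where a labelled edge would land in each model. The function names and the `label` attribute are hypothetical; only the `_from` and `_to` keys and the single `Edges` collection come from this discussion.

```python
def store_edge_per_collection(label, from_id, to_id, props=None):
    """Collection-per-label model: the label selects the edge collection."""
    doc = {"_from": from_id, "_to": to_id, **(props or {})}
    return label, doc


def store_edge_as_property(label, from_id, to_id, props=None,
                           collection="Edges", attr="label"):
    """Two-collection model: every edge goes to one collection and
    carries its label as an ordinary (hypothetical) attribute."""
    doc = {"_from": from_id, "_to": to_id, **(props or {}), attr: label}
    return collection, doc


if __name__ == "__main__":
    print(store_edge_per_collection("knows", "Vertex/1", "Vertex/2", {"since": 2020}))
    print(store_edge_as_property("knows", "Vertex/1", "Vertex/2", {"since": 2020}))
```

With the second layout the label is just data, so the graph logic that the named graph API currently provides per collection would have to be expressed through filters in the provider.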

Does this clarify your question?

@kuzeko
Author

kuzeko commented Aug 24, 2020

Not completely. Does ArangoDB natively have a concept of node/edge labels? If yes, I understand that in AQL a node label is the same as a node collection, and likewise for edges.
If this is also correct, then trying to have this driver support more node/edge labels is equivalent to trying to have the driver overcome a limitation of the underlying system.
So here is my question: is this what we are trying to achieve?

@arcanefoam
Collaborator

arcanefoam commented Aug 24, 2020 via email

@kuzeko
Author

kuzeko commented Aug 25, 2020

Thanks, this is all clear.

But I still believe we are trying to overcome a limitation of Arango: not having labels (which is worse than only allowing 2048 labels).

Anyway, for the reference of others coming to this issue, here is the relevant bit I found: https://www.arangodb.com/docs/stable/graphs.html#multiple-edge-collections-vs-filters-on-edge-document-attributes

If you want to only traverse edges of a specific type, there are two ways to achieve this. The first would be an attribute in the edge document - i.e. type, where you specify a differentiator for the edge - i.e. "friends", "family", "married" or "workmates", so you can later FILTER e.type = "friends" if you only want to follow the friend edges.

Another way, which may be more efficient in some cases, is to use different edge collections for different types of edges, so you have friend_edges, family_edges, married_edges and workmate_edges as collection names. You can then configure several named graphs including a subset of the available edge and vertex collections - or you use anonymous graph queries, where you specify a list of edge collections to take into account in that query. To only follow friend edges, you would specify friend_edges as sole edge collection

The multiple edge collections approach is limited by the number of collections that can be used simultaneously in one query. Every collection used in a query requires some resources inside of ArangoDB and the number is therefore limited to cap the resource requirements. You may also have constraints on other edge attributes, such as a hash index with a unique constraint, which requires the documents to be in a single collection for the uniqueness guarantee, and it may thus not be possible to store the different types of edges in multiple edge collections.

So, if your edges have about a dozen different types, it’s okay to choose the collection approach, otherwise the FILTER approach is preferred. You can still use FILTER operations on edges of course. You can get rid of a FILTER on the type with the former approach, everything else can stay the same.

@dothebart
Contributor

dothebart commented Aug 25, 2020

ArangoDB's data model is less strict than that of other available solutions about what edges and vertices have to look like.

For ArangoDB, all that sets edges apart from regular documents is that they live in special edge collections, which require (and index) the _from and _to keys.
For example, in the ArangoDB web interface you can choose any attribute to be shown as an edge or vertex label.

https://www.arangodb.com/docs/stable/aql/graphs-traversals.html#filtering-edges-on-the-path demonstrates how to classify edges by an additional property.

@kuzeko
Author

kuzeko commented Aug 25, 2020

It is not a "strict" data model; it is the property graph model.
As far as I know, other solutions also let you filter on edge properties, yet when you search by label you get an additional speedup because labels are backed by a special data structure or index.
The way I see it, the only thing in ArangoDB that can mimic that and offer similar advantages is collections, which are limited though.

@dothebart
Contributor

dothebart commented Aug 27, 2020

Hi,
In ArangoDB you can use vertex-centric indices for that.
Each one indexes either _from or _to together with the attribute you want to use as the label.
Hence, if you want to traverse the graph in both forward and backward direction (in AQL: OUTBOUND, INBOUND, or ANY), you need to create two indices.

So it is probably a question of how many labels you have: whether the cost of maintaining the additional edge collections or that of the different indices is higher.
If the count is low (and index selectivity is therefore bad), going the multiple edge collection way is probably better.
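
A small Python sketch of which index definitions this implies per traversal direction. The helper name and the `label` attribute are assumptions; the dictionaries follow the general `type`/`fields` shape of an ArangoDB index definition, and nothing here talks to a server.

```python
def vertex_centric_indexes(direction, label_attr="label"):
    """Return the index definitions needed for label-filtered traversals
    in the given AQL direction (OUTBOUND, INBOUND, or ANY)."""
    fields_by_direction = {
        # Outbound hops look up edges by their source vertex.
        "OUTBOUND": [["_from", label_attr]],
        # Inbound hops look up edges by their target vertex.
        "INBOUND": [["_to", label_attr]],
        # Traversing both ways needs one index per side.
        "ANY": [["_from", label_attr], ["_to", label_attr]],
    }
    key = direction.upper()
    if key not in fields_by_direction:
        raise ValueError(f"unknown direction: {direction!r}")
    return [{"type": "persistent", "fields": fields}
            for fields in fields_by_direction[key]]


if __name__ == "__main__":
    for d in ("OUTBOUND", "INBOUND", "ANY"):
        print(d, vertex_centric_indexes(d))
```

Each returned definition would then be created on the single edge collection, so the maintenance cost is two indices at most, independent of how many labels exist.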

@kuzeko
Author

kuzeko commented Aug 28, 2020

@dothebart edge collections are not an option, because I have some 4 thousand edge types.

@dothebart
Contributor

Yes, for your use case vertex-centric indices are definitely the way to go.
I just wanted to mention that both ways are viable options, and one should choose based on one's own use case.

So maybe @arcanefoam (or you?) can create an option to choose between vertex-centric indices and collections.
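
One way such an option could look, as a hedged Python sketch rather than a proposal for the provider's real configuration: `LabelStorage` and `check_label_storage` are made-up names, and 2048 is simply the collection limit quoted earlier in this thread.

```python
from enum import Enum

MAX_COLLECTIONS = 2048  # collection limit as quoted in this thread


class LabelStorage(Enum):
    """Hypothetical configuration values for how labels are stored."""
    COLLECTION_PER_LABEL = "collections"
    VERTEX_CENTRIC_INDEX = "vertex-centric-index"


def check_label_storage(mode, label_count, max_collections=MAX_COLLECTIONS):
    """Reject a configuration that cannot hold the requested number of labels."""
    if mode is LabelStorage.COLLECTION_PER_LABEL and label_count > max_collections:
        raise ValueError(
            f"{label_count} labels exceed the limit of {max_collections} collections; "
            f"use {LabelStorage.VERTEX_CENTRIC_INDEX.value} instead"
        )
    return mode


if __name__ == "__main__":
    print(check_label_storage(LabelStorage.VERTEX_CENTRIC_INDEX, 4000))
    print(check_label_storage(LabelStorage.COLLECTION_PER_LABEL, 12))
```

Failing early at configuration time would also surface the limitation to users before any data is loaded.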


4 participants