-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialization order with blank nodes is non-deterministic #8
Comments
@fkleedorfer Does that use RDF Canonization approaches as per qudt/qudt-public-repo#959 (comment) by @dr-shorthair? Eg in this simple triangle, all the blank nodes have a different "role". So no matter what numbering I use in the input, it should always result in the same the output _:b1 :foo _:b2, _:b3.
_:b2 :foo _:b3. |
No. I discussed the issue with @afs and his comment
led me to believe it may be enough just to be able to reproduce the ordering of the blank nodes in the input file. That's what I implemented in #15 Canonicalization I think would not help. It is guaranteed to produce the same output given some input, but the hashes (and therefore the ordering) of the blank nodes could change when the input data changes. (The PR has a longer description now, though still sloppy - sorry for that) |
Jena stability is based on input order. Canonization stability is based on graph structure, which may be a better choice. We need to test: which of the two approaches produces more similar output for my first example, and for this second one (which adds 2 more blanks before the numbered blanks): [] :foo [].
_:b1 :foo _:b2, _:b3.
_:b2 :foo _:b3. |
It's more complicated than that! The pretty Turtle writer uses a gathering algorithm; it has an extension point to choose subject order and then predicate object. It uses memory so does not scale beyond RAM. There are several other pretty features that require pre-processing (e.g RDF lists) and use memory. The flat and block Turtle writers are streaming. An insert-order preserving graph implementation is doable.
That's interesting because the canonization hash may go several steps out from a blank node. If the goal includes git diff, as mentioned in apache/jena#2672 (comment), then a change in one part of the graph may affect another part several steps away. |
I think the goal being git diff, the size is not going into the billion triples, also considering that turtle-formatter is aimed at manual editing and human consumption. turtle-formatter does not use any jena writer, instead its a highly configurable custom writer. For obtaining the blank node ordering, a jena reader is used (with a wrapper around the parser profile). Thus, uri resources ordering can be configured and blank node ordering is as they appear in the file. Sound like the right characteristics for git diff to me. @VladimirAlexiev I will add these examples as test cases. Turtle-formatter will remove blank node labels that are not needed, which might happen here. If that is a problem, we can add a configuration property to customize this behavior |
@VladimirAlexiev your examples are formatted as follows (just copy&pasting from the unit tests here): Both results are stable (the transformation is executed 20 times with the same result). As I said earlier, it would not be hard to add an option for preserving labeled nodes. This would make sense in TTL/TRIG, but not so much in ntriples/nquads, where all blank nodes are labeled, and it would be hard-ish to determine the hand-labeled ones. But here, we are dealing with TTL, so the question is: should we do that? EDIT: this is with the default style. input:
result:
input
result
|
In my experience blank identifiers are not used often. Eg they are not used in owl:Restriction, which is a main use case here. But @fkleedorfer, I guess the output you cite is with the old version (without your PR) because it preserves given identifiers, not order (numbering). |
This result is with the PR. I went the extra mile to preserve labels present in the input. I think that is required as inserting a labeled blank will otherwise change all later labels. Now that I think about it, I have to check for collisions between preexisting labels and generated ones. |
I agree with @VladimirAlexiev that blank node labels are not used often, at least not in manually created RDF. Since turtle-formatter tries to avoid using labels anyways whenever possible and labels only show up for blank nodes with multiple incoming edges, we only talk about those: IMHO it's not necessary to preserve original labels, but it also doesn't hurt. |
This issue was originally reported here:
write
command generation is not stable owl-cli#8When multiple statements share the same subject and predicate and their objects are blank nodes, the serialization order is non-deterministic. Although they are sorted (cf. code) the reason could be that the lexicographic order of blank node ids varies between runs due to Jena's parser logic, i.e. blank node ids are already nondeterministic.
Solutions include:
The text was updated successfully, but these errors were encountered: