Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization order with blank nodes is non-deterministic #8

Closed
atextor opened this issue Jan 4, 2023 · 10 comments · Fixed by #15
Closed

Serialization order with blank nodes is non-deterministic #8

atextor opened this issue Jan 4, 2023 · 10 comments · Fixed by #15

Comments

@atextor
Copy link
Owner

atextor commented Jan 4, 2023

This issue was originally reported here:

When multiple statements share the same subject and predicate and their objects are blank nodes, the serialization order is non-deterministic. Although they are sorted (cf. code) the reason could be that the lexicographic order of blank node ids varies between runs due to Jena's parser logic, i.e. blank node ids are already nondeterministic.
Solutions include:

  • Implement own parser (see also Option to retain comments? #2)
  • Don't use blank node ids for sorting. Instead, sort by a hash over the statements that have the blank node as subject.
@fkleedorfer
Copy link
Contributor

@atextor There is a PR that fixes this: #15

@VladimirAlexiev
Copy link

@fkleedorfer Does that use RDF Canonization approaches as per qudt/qudt-public-repo#959 (comment) by @dr-shorthair?

Eg in this simple triangle, all the blank nodes have a different "role". So no matter what numbering I use in the input, it should always result in the same the output

_:b1 :foo _:b2, _:b3.
_:b2 :foo _:b3.

@fkleedorfer
Copy link
Contributor

fkleedorfer commented Sep 9, 2024

No. I discussed the issue with @afs and his comment

Unlabelled blank nodes become _:0000 inumbered in encounter order and the parser output is deterministic with the order in the file. Obviously, if you add something in the middle, it has a knock on effect.

led me to believe it may be enough just to be able to reproduce the ordering of the blank nodes in the input file. That's what I implemented in #15

Canonicalization I think would not help. It is guaranteed to produce the same output given some input, but the hashes (and therefore the ordering) of the blank nodes could change when the input data changes.

(The PR has a longer description now, though still sloppy - sorry for that)

@VladimirAlexiev
Copy link

Jena stability is based on input order. Canonization stability is based on graph structure, which may be a better choice.

We need to test: which of the two approaches produces more similar output for my first example, and for this second one (which adds 2 more blanks before the numbered blanks):

[] :foo [].
_:b1 :foo _:b2, _:b3.
_:b2 :foo _:b3.

@afs
Copy link

afs commented Sep 9, 2024

Jena stability is based on input order.

It's more complicated than that! The pretty Turtle writer uses a gathering algorithm; it has an extension point to choose subject order and then predicate object. It uses memory so does not scale beyond RAM. There are several other pretty features that require pre-processing (e.g RDF lists) and use memory.

The flat and block Turtle writers are streaming. An insert-order preserving graph implementation is doable.

Canonization stability is based on graph structure

That's interesting because the canonization hash may go several steps out from a blank node.

If the goal includes git diff, as mentioned in apache/jena#2672 (comment), then a change in one part of the graph may affect another part several steps away.

@fkleedorfer
Copy link
Contributor

fkleedorfer commented Sep 9, 2024

Jena stability is based on input order.

It's more complicated than that! The pretty Turtle writer uses a gathering algorithm; it has an extension point to choose subject order and then predicate object. It uses memory so does not scale beyond RAM. There are several other pretty features that require pre-processing (e.g RDF lists) and use memory.

The flat and block Turtle writers are streaming. An insert-order preserving graph implementation is doable.

Canonization stability is based on graph structure

That's interesting because the canonization hash may go several steps out from a blank node.

If the goal includes git diff, as mentioned in apache/jena#2672 (comment), then a change in one part of the graph may affect another part several steps away.

I think the goal being git diff, the size is not going into the billion triples, also considering that turtle-formatter is aimed at manual editing and human consumption.

turtle-formatter does not use any jena writer, instead its a highly configurable custom writer. For obtaining the blank node ordering, a jena reader is used (with a wrapper around the parser profile). Thus, uri resources ordering can be configured and blank node ordering is as they appear in the file. Sound like the right characteristics for git diff to me.

@VladimirAlexiev I will add these examples as test cases. Turtle-formatter will remove blank node labels that are not needed, which might happen here. If that is a problem, we can add a configuration property to customize this behavior

@fkleedorfer
Copy link
Contributor

fkleedorfer commented Sep 9, 2024

@VladimirAlexiev your examples are formatted as follows (just copy&pasting from the unit tests here):

Both results are stable (the transformation is executed 20 times with the same result). As I said earlier, it would not be hard to add an option for preserving labeled nodes. This would make sense in TTL/TRIG, but not so much in ntriples/nquads, where all blank nodes are labeled, and it would be hard-ish to determine the hand-labeled ones. But here, we are dealing with TTL, so the question is: should we do that?

EDIT: this is with the default style.

input:

 @prefix : <http://example.com/ns#> .
 _:b1 :foo _:b2, _:b3.
 _:b2 :foo _:b3.

result:

 @prefix : <http://example.com/ns#> .

 [
   :foo [
     :foo _:b3 ;
   ] ;
   :foo _:b3 ;
 ] .

input


 @prefix : <http://example.com/ns#> .
 [] :foo [] .
 _:b1 :foo _:b2, _:b3.
 _:b2 :foo _:b3.

result

 @prefix : <http://example.com/ns#> .
 
 [
   :foo [];
 ] .
 
 [
   :foo [
     :foo _:b3 ;
   ] ;
   :foo _:b3 ;
 ] .

@VladimirAlexiev
Copy link

VladimirAlexiev commented Sep 10, 2024

In my experience blank identifiers are not used often. Eg they are not used in owl:Restriction, which is a main use case here.
So I was wrong: preserving blanks order (numbering) should be enough, and canonization is not needed.

But @fkleedorfer, I guess the output you cite is with the old version (without your PR) because it preserves given identifiers, not order (numbering).
Actually it doesn't match the documentation, which says identifiers are rewritten to the form _:gen0, where 0 is replaced with the number

@fkleedorfer
Copy link
Contributor

This result is with the PR. I went the extra mile to preserve labels present in the input. I think that is required as inserting a labeled blank will otherwise change all later labels.

Now that I think about it, I have to check for collisions between preexisting labels and generated ones.

@atextor
Copy link
Owner Author

atextor commented Sep 12, 2024

I agree with @VladimirAlexiev that blank node labels are not used often, at least not in manually created RDF. Since turtle-formatter tries to avoid using labels anyways whenever possible and labels only show up for blank nodes with multiple incoming edges, we only talk about those: IMHO it's not necessary to preserve original labels, but it also doesn't hurt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants