Serialization order with blank nodes is non-deterministic #8

atextor · 2023-01-04T05:05:41Z

This issue was originally reported here:

write command generation is not stable owl-cli#8

When multiple statements share the same subject and predicate and their objects are blank nodes, the serialization order is non-deterministic. Although they are sorted (cf. code) the reason could be that the lexicographic order of blank node ids varies between runs due to Jena's parser logic, i.e. blank node ids are already nondeterministic.
Solutions include:

Implement own parser (see also Option to retain comments? #2)
Don't use blank node ids for sorting. Instead, sort by a hash over the statements that have the blank node as subject.

The text was updated successfully, but these errors were encountered:

fkleedorfer · 2024-09-09T15:09:44Z

@atextor There is a PR that fixes this: #15

VladimirAlexiev · 2024-09-09T15:24:25Z

@fkleedorfer Does that use RDF Canonization approaches as per qudt/qudt-public-repo#959 (comment) by @dr-shorthair?

Eg in this simple triangle, all the blank nodes have a different "role". So no matter what numbering I use in the input, it should always result in the same the output

_:b1 :foo _:b2, _:b3.
_:b2 :foo _:b3.

fkleedorfer · 2024-09-09T15:32:36Z

No. I discussed the issue with @afs and his comment

Unlabelled blank nodes become _:0000 inumbered in encounter order and the parser output is deterministic with the order in the file. Obviously, if you add something in the middle, it has a knock on effect.

led me to believe it may be enough just to be able to reproduce the ordering of the blank nodes in the input file. That's what I implemented in #15

Canonicalization I think would not help. It is guaranteed to produce the same output given some input, but the hashes (and therefore the ordering) of the blank nodes could change when the input data changes.

(The PR has a longer description now, though still sloppy - sorry for that)

VladimirAlexiev · 2024-09-09T15:40:08Z

Jena stability is based on input order. Canonization stability is based on graph structure, which may be a better choice.

We need to test: which of the two approaches produces more similar output for my first example, and for this second one (which adds 2 more blanks before the numbered blanks):

[] :foo [].
_:b1 :foo _:b2, _:b3.
_:b2 :foo _:b3.

afs · 2024-09-09T16:35:38Z

Jena stability is based on input order.

It's more complicated than that! The pretty Turtle writer uses a gathering algorithm; it has an extension point to choose subject order and then predicate object. It uses memory so does not scale beyond RAM. There are several other pretty features that require pre-processing (e.g RDF lists) and use memory.

The flat and block Turtle writers are streaming. An insert-order preserving graph implementation is doable.

Canonization stability is based on graph structure

That's interesting because the canonization hash may go several steps out from a blank node.

If the goal includes git diff, as mentioned in apache/jena#2672 (comment), then a change in one part of the graph may affect another part several steps away.

fkleedorfer · 2024-09-09T17:38:23Z

Jena stability is based on input order.

It's more complicated than that! The pretty Turtle writer uses a gathering algorithm; it has an extension point to choose subject order and then predicate object. It uses memory so does not scale beyond RAM. There are several other pretty features that require pre-processing (e.g RDF lists) and use memory.

The flat and block Turtle writers are streaming. An insert-order preserving graph implementation is doable.

Canonization stability is based on graph structure

That's interesting because the canonization hash may go several steps out from a blank node.

If the goal includes git diff, as mentioned in apache/jena#2672 (comment), then a change in one part of the graph may affect another part several steps away.

I think the goal being git diff, the size is not going into the billion triples, also considering that turtle-formatter is aimed at manual editing and human consumption.

turtle-formatter does not use any jena writer, instead its a highly configurable custom writer. For obtaining the blank node ordering, a jena reader is used (with a wrapper around the parser profile). Thus, uri resources ordering can be configured and blank node ordering is as they appear in the file. Sound like the right characteristics for git diff to me.

@VladimirAlexiev I will add these examples as test cases. Turtle-formatter will remove blank node labels that are not needed, which might happen here. If that is a problem, we can add a configuration property to customize this behavior

fkleedorfer · 2024-09-09T18:44:51Z

@VladimirAlexiev your examples are formatted as follows (just copy&pasting from the unit tests here):

Both results are stable (the transformation is executed 20 times with the same result). As I said earlier, it would not be hard to add an option for preserving labeled nodes. This would make sense in TTL/TRIG, but not so much in ntriples/nquads, where all blank nodes are labeled, and it would be hard-ish to determine the hand-labeled ones. But here, we are dealing with TTL, so the question is: should we do that?

EDIT: this is with the default style.

input:

 @prefix : <http://example.com/ns#> .
 _:b1 :foo _:b2, _:b3.
 _:b2 :foo _:b3.

result:

 @prefix : <http://example.com/ns#> .

 [
   :foo [
     :foo _:b3 ;
   ] ;
   :foo _:b3 ;
 ] .

input


 @prefix : <http://example.com/ns#> .
 [] :foo [] .
 _:b1 :foo _:b2, _:b3.
 _:b2 :foo _:b3.

result

 @prefix : <http://example.com/ns#> .
 
 [
   :foo [];
 ] .
 
 [
   :foo [
     :foo _:b3 ;
   ] ;
   :foo _:b3 ;
 ] .

VladimirAlexiev · 2024-09-10T04:54:19Z

In my experience blank identifiers are not used often. Eg they are not used in owl:Restriction, which is a main use case here.
So I was wrong: preserving blanks order (numbering) should be enough, and canonization is not needed.

But @fkleedorfer, I guess the output you cite is with the old version (without your PR) because it preserves given identifiers, not order (numbering).
Actually it doesn't match the documentation, which says identifiers are rewritten to the form _:gen0, where 0 is replaced with the number

fkleedorfer · 2024-09-10T05:40:13Z

This result is with the PR. I went the extra mile to preserve labels present in the input. I think that is required as inserting a labeled blank will otherwise change all later labels.

Now that I think about it, I have to check for collisions between preexisting labels and generated ones.

atextor · 2024-09-12T09:03:15Z

I agree with @VladimirAlexiev that blank node labels are not used often, at least not in manually created RDF. Since turtle-formatter tries to avoid using labels anyways whenever possible and labels only show up for blank nodes with multiple incoming edges, we only talk about those: IMHO it's not necessary to preserve original labels, but it also doesn't hurt.

VladimirAlexiev mentioned this issue Sep 10, 2024

Use more shortcuts for blank nodes #21

Open

VladimirAlexiev mentioned this issue Sep 10, 2024

LOV data dump hosted as a git repo pyvandenbussche/lov#80

Open

fkleedorfer mentioned this issue Sep 10, 2024

Bugfix for Blank node ordering, RDF lists and DecimalFormat #15

Merged

atextor closed this as completed in #15 Sep 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialization order with blank nodes is non-deterministic #8

Serialization order with blank nodes is non-deterministic #8

atextor commented Jan 4, 2023

fkleedorfer commented Sep 9, 2024

VladimirAlexiev commented Sep 9, 2024

fkleedorfer commented Sep 9, 2024 •

edited

Loading

VladimirAlexiev commented Sep 9, 2024

afs commented Sep 9, 2024

fkleedorfer commented Sep 9, 2024 •

edited

Loading

fkleedorfer commented Sep 9, 2024 •

edited

Loading

VladimirAlexiev commented Sep 10, 2024 •

edited

Loading

fkleedorfer commented Sep 10, 2024

atextor commented Sep 12, 2024

Serialization order with blank nodes is non-deterministic #8

Serialization order with blank nodes is non-deterministic #8

Comments

atextor commented Jan 4, 2023

fkleedorfer commented Sep 9, 2024

VladimirAlexiev commented Sep 9, 2024

fkleedorfer commented Sep 9, 2024 • edited Loading

VladimirAlexiev commented Sep 9, 2024

afs commented Sep 9, 2024

fkleedorfer commented Sep 9, 2024 • edited Loading

fkleedorfer commented Sep 9, 2024 • edited Loading

VladimirAlexiev commented Sep 10, 2024 • edited Loading

fkleedorfer commented Sep 10, 2024

atextor commented Sep 12, 2024

fkleedorfer commented Sep 9, 2024 •

edited

Loading

fkleedorfer commented Sep 9, 2024 •

edited

Loading

fkleedorfer commented Sep 9, 2024 •

edited

Loading

VladimirAlexiev commented Sep 10, 2024 •

edited

Loading