Optimize SirixDeweyId.toBytes-method #512

Open
JohannesLichtenberger opened this issue Aug 8, 2022 · 22 comments

@JohannesLichtenberger
Member

We can probably use the new Vector API to leverage explicit SIMD instructions.
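
A minimal, hedged sketch of what that could look like (not SirixDB code; it assumes JDK 17+ with the incubating module enabled via --add-modules jdk.incubator.vector). It only processes an int[] of DeweyID divisions lane-wise as a stand-in for the real per-division work in toBytes(int[]):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

final class SimdSketch {
  private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  // Illustrative lane-wise transform over the division values; the real
  // per-division work (bit-length computation, packing) would replace the add.
  static int[] incrementDivisions(int[] divisions) {
    int[] result = new int[divisions.length];
    int i = 0;
    for (int upper = SPECIES.loopBound(divisions.length); i < upper; i += SPECIES.length()) {
      IntVector.fromArray(SPECIES, divisions, i).add(1).intoArray(result, i);
    }
    for (; i < divisions.length; i++) { // scalar tail loop
      result[i] = divisions[i] + 1;
    }
    return result;
  }
}
```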

@IlayMenahem

Hi, I'm new to open source contributing; I've read about SIMD and I can help with this task.
Also, can you point me to where exactly the SirixDeweyId.toBytes method is?

@JohannesLichtenberger
Member Author

The class is called SirixDeweyID and the method is toBytes(int[]). We've optimized the method a bit and also optimized other things regarding the storage of these hierarchical IDs. On my notebook, the runtime for importing a 3.8 GB file is now around 5 minutes with the generation of these IDs enabled (without it, the import takes around 3:42 min; hashes are enabled in both cases). I'm currently profiling the method via the following test (you have to remove the @Ignore annotation and add the cityofchicago.json file, which I've uploaded here: https://www.transfernow.net/dl/20220812X7v5ZTGZ, to the JSON resource folder):

JsonShredderTest::testChicago

The concept of DeweyIDs is described in the following documents (they are stored using Huffman codes, as numbers often repeat in the hierarchical labels; we're using them for our JSON nodes, too): http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/mathis/documents/HHM_SBBD.pdf and http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publications/2010/HM10.JIDM.pdf
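
Just to convey the idea (this is not the actual Huffman-based layout from the papers, only a hedged simplification): a variable-length encoding gives the small, frequently repeating division values short codes, e.g. a varint-style scheme like the following:

```java
import java.io.ByteArrayOutputStream;

final class NaiveDeweyEncoder {
  // Varint-style encoding: 7 payload bits per byte, high bit set as a
  // continuation flag. SirixDeweyID.toBytes(int[]) uses a Huffman-style
  // code table instead; this only illustrates the variable-length idea.
  static byte[] toBytes(int[] divisions) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int division : divisions) {
      while ((division & ~0x7F) != 0) {
        out.write((division & 0x7F) | 0x80);
        division >>>= 7;
      }
      out.write(division);
    }
    return out.toByteArray();
  }
}
```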

@JohannesLichtenberger
Member Author

It would probably be a good start to profile the mentioned testChicago method and to switch the DeweyID generation on and off.

@IlayMenahem

I've tried to run the test with and without @Ignore; both times I got the same message. I also tried to find out whether there is any special guidance on how to run the tests, and there isn't.

The supplied phased action failed with an exception.
A problem occurred configuring root project 'sirix-parent'.
Failed to notify project evaluation listener.
A problem occurred configuring project ':sirix-core'.
A problem occurred evaluating project ':sirix-core'.
Could not find method api() for arguments [io.sirix:brackit:0.2] on object of type org.gradle.api.internal.artifacts.dsl.dependencies.DefaultDependencyHandler.

@JohannesLichtenberger
Member Author

Can you try executing it using IntelliJ?

@yeohbraddy

Hi,

If @IlayMenahem is no longer working on this, I'd like to contribute! It is my first time contributing to open source so apologies if this is against etiquette.

Thank you!

@JohannesLichtenberger
Member Author

Sure, thanks :-) I'm also thinking about not storing DeweyIDs for object field values to reduce storage and CPU processing costs. I'm currently still on holiday with my small tablet so I can't code right now.

@JohannesLichtenberger
Member Author

@yeohbraddy I guess it's perfectly fair to assume so; in my experience, after a week or more without any comments/PRs etc., the user hasn't had time (or something like that) to work on the issue.

@JohannesLichtenberger
Member Author

@yeohbraddy any work in progress?

@JohannesLichtenberger
Member Author

@yeohbraddy I guess you don't have time to work on the issue!? Let me know; then I'd unassign the issue for now or have another look myself. I think it should be possible to use explicit SIMD.

@yeohbraddy

Hi @JohannesLichtenberger, sorry I missed the previous ping! Unfortunately, I don't have time to work on this issue; I started a new job 😌

@yeohbraddy removed their assignment Nov 22, 2022
@ConnorCable
Contributor

Hi! I'm new to OSS and interested in taking this on. I think I have a grasp of the issue here, but I'd like to jump on a call with whoever is available to learn more and get a sense of direction.

thank you!

@JohannesLichtenberger
Member Author

Sounds great. The thing is that these IDs, even when Huffman encoded, almost double storage costs in my tests and also carry a huge performance penalty due to the toBytes() method, which adds a lot of CPU time. The class is based on this class: https://github.com/sebbae/brackitdb/blob/master/brackitdb-server/src/main/java/org/brackit/server/node/XTCdeweyID.java

You can check JsonShredderTest::testChicago and enable the storage of DeweyIDs. You'll need this file to execute the test: https://www.file-upload.net/download-15038251/cityofchicago.zip.html

@JohannesLichtenberger
Member Author

You can also join our Discord channel if you need further explanations :-)

@adamretter

@JohannesLichtenberger I have been lurking for a while. I contribute to eXist-db where we use Dynamic Level Numbering for our hierarchical ids. Perhaps they would be more efficient for your needs?

@ConnorCable
Contributor

@adamretter Hey Adam, do you have any papers or resources regarding Dynamic Level Numbering I can take a look at?

@JohannesLichtenberger
Member Author

Hi Adam, awesome that you've had an eye on SirixDB 👍 Can you elaborate on how they are implemented in eXist-db? I think in SirixDB they are especially useful for sorting XQuery results in document order (see the sketch below). However, I've switched to disabling them by default for JSON storage because of the space consumption and the drop in insertion latency.
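
As a small aside on the document-order point, a hedged sketch (not SirixDB's actual comparator): two DeweyIDs compare in document order by comparing their division arrays lexicographically, with the shorter (ancestor) label sorting first.

```java
final class DeweyIdOrder {
  // Lexicographic comparison of division arrays; ancestors sort before descendants.
  static int compareDocumentOrder(int[] a, int[] b) {
    final int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      if (a[i] != b[i]) {
        return Integer.compare(a[i], b[i]);
      }
    }
    return Integer.compare(a.length, b.length);
  }
}
```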

BTW: Do you have an idea how to store bigger files and run automatic integration tests on these?

@JohannesLichtenberger
Member Author

Even Git LFS only allows storing up to 2 or 3 GB of data at most, IIRC (but maybe that's another issue :-))

@adamretter

adamretter commented Dec 14, 2022

@ConnorCable The paper is titled Supporting Efficient Streaming and Insertion of XML Data in RDBMS and is authored by Timo Böhme and Erhard Rahm.

There is an implementation of this in eXist-db here - https://github.com/eXist-db/exist/blob/develop/exist-core/src/main/java/org/exist/numbering/DLN.java#L42

The implementation in eXist-db is by no means perfect. eXist-db also introduces sub-levels to support XQuery Update.
If I recall correctly, it had some bugs, and there were further optimisations that could be made... which we improved upon for our DLN implementation in FusionDB.

I actually think the ORDPATH approach from Microsoft has some advantages over DLN; unfortunately, though, Microsoft holds a patent on ORDPATH, so I deemed it unimplementable for my uses.
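
For intuition only (this is not eXist-db's DLN implementation; see the linked DLN.java for that): a DLN-style label is essentially a dot-separated path of level values, so parent and ancestor checks reduce to simple prefix operations.

```java
final class DlnLabelSketch {
  // E.g. the parent of "1.2.3" is "1.2"; a root-level label has no parent.
  static String parentOf(String label) {
    final int lastDot = label.lastIndexOf('.');
    return lastDot < 0 ? null : label.substring(0, lastDot);
  }

  // "1.2" is an ancestor of "1.2.3" because it is a proper dot-separated prefix.
  static boolean isAncestorOf(String ancestor, String descendant) {
    return descendant.startsWith(ancestor + ".");
  }
}
```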

@adamretter

adamretter commented Dec 14, 2022

I think in SirixDB they are especially useful for sorting regarding document order of XQuery results

@JohannesLichtenberger Sure. We also use them for fast sibling, parent, and ancestor lookups/proxies.

BTW: Do you have an idea how to store bigger files and run automatic integration tests on these?

I am not quite sure what you are asking about here. I haven't looked in detail into how SirixDB stores its data. eXist-db stores its XML as the events (SAX-like) needed to build a DOM in a B+Tree. FusionDB takes a similar approach but uses an LSM-tree instead.

@JohannesLichtenberger
Member Author

JohannesLichtenberger commented Dec 14, 2022

The serialization, at least, is straightforward for DLN without the Huffman encoding. I'm also skipping values that share the same prefix in the page fragment during serialization :-)

True, but I think the hierarchical node labels suffer from worse performance for axis traversals compared to pre/post or pre/size/dist encodings (I don't remember the exact encoding in BaseX), or, in our case, a persistent DOM. However, maybe it also depends on the index structure. I think they are ideally stored in a trie instead of, as in BrackitDB, in a B+ tree.

It would be amazing to see a version storing only the leaf nodes plus the path summary (I think in BrackitDB/XTC they called it elementless storage).

In SirixDB we needed unique IDs that are never reassigned (thus based on a simple AtomicLong-based sequence generator), and we store a nodeKey/parent/firstChild/lastChild/rightSibling/leftSibling encoding (basically a doubly linked list of children, see the sketch below) in order to get rid of variable-length node labels such as the DeweyIDs we're discussing here. But I think in some cases they are also really useful, as mentioned in the papers by Theo Haerder, even if we use a single writer, for instance.
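
A hedged sketch (not SirixDB's actual record layout) of that fixed-size structural encoding; each node only stores keys, so there are no variable-length labels:

```java
// Hypothetical illustration: six fixed-size keys per node provide parent,
// first/last child, and left/right sibling navigation (a doubly linked list
// of children) without any hierarchical label.
record StructuralNode(long nodeKey,
                      long parentKey,
                      long firstChildKey,
                      long lastChildKey,
                      long leftSiblingKey,
                      long rightSiblingKey) {
}
```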

Regarding DeweyIDs in SirixDB: we can quickly select the diffs in the subtrees of specific nodes between two revisions, and also in document/preorder, which is also nice when storing JSON. So they might be useful in some cases; that's why I implemented the optional storage in the first place, but performance and storage costs are currently high.

BTW: Do you already have a query engine for FusionDB? Otherwise, I'd like to mention Brackit, which implements JSONiq with set-oriented operators and implicit join recognition to support hash joins, sort-merge joins, and so on. It's meant to be a query engine for different backends, where you can add AST rewrite rules for index accesses, add additional expressions, etc., and it should implement all optimizations that are not part of the individual storage engine :-)

Sorry for the very, very long reply, but I'm missing people with whom I can share some thoughts and ideas.

@JohannesLichtenberger
Member Author

JohannesLichtenberger commented Dec 14, 2022

With bigger files, I mean importing JSON or XML files of up to at least a few GB and running other integration tests on them. Maybe most database system engineers add data via the APIs instead of importing GBs of data from other storage formats for the tests...

Currently, I'm commenting/uncommenting tests locally, but that's, of course, not the right way because I can't store these files in Git.
