Optimize SirixDeweyId.toBytes-method #512

Open
JohannesLichtenberger opened this issue Aug 8, 2022 · 22 comments

@JohannesLichtenberger
Member

We can probably use the new Vector API to leverage explicit SIMD instructions.
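
A minimal, hedged sketch of what that could look like (not SirixDB code; it assumes JDK 17+ with the incubating module enabled via --add-modules jdk.incubator.vector). It only processes an int[] of DeweyID divisions lane-wise as a stand-in for the real per-division work in toBytes(int[]):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

final class SimdSketch {
  private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  // Illustrative lane-wise transform over the division values; the real
  // per-division work (bit-length computation, packing) would replace the add.
  static int[] incrementDivisions(int[] divisions) {
    int[] result = new int[divisions.length];
    int i = 0;
    for (int upper = SPECIES.loopBound(divisions.length); i < upper; i += SPECIES.length()) {
      IntVector.fromArray(SPECIES, divisions, i).add(1).intoArray(result, i);
    }
    for (; i < divisions.length; i++) { // scalar tail loop
      result[i] = divisions[i] + 1;
    }
    return result;
  }
}
```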

@IlayMenahem

Hi, I'm new to open source contributing; I've read about SIMD and I can help with this task.
Also, can you point me to where exactly the SirixDeweyId.toBytes method is?

@JohannesLichtenberger
Member Author

The class is called SirixDeweyID and the method is toBytes(int[]). We've optimized the method a bit and also optimized other things regarding the storage of these hierarchical IDs. On my notebook, the runtime for importing a 3.8 GB file is now around 5 minutes with the generation of these IDs enabled (without it, the import takes around 3:42 min; hashes are enabled in both cases). I'm currently profiling the method via the following test (you have to remove the @Ignore annotation and add the cityofchicago.json file, which I've uploaded here: https://www.transfernow.net/dl/20220812X7v5ZTGZ, to the JSON resource folder):

JsonShredderTest::testChicago

The concept of DeweyIDs is described in the following documents (they are stored using Huffman codes, as numbers often repeat in the hierarchical labels; we're using them for our JSON nodes, too): http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/mathis/documents/HHM_SBBD.pdf and http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publications/2010/HM10.JIDM.pdf
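
Just to convey the idea (this is not the actual Huffman-based layout from the papers, only a hedged simplification): a variable-length encoding gives the small, frequently repeating division values short codes, e.g. a varint-style scheme like the following:

```java
import java.io.ByteArrayOutputStream;

final class NaiveDeweyEncoder {
  // Varint-style encoding: 7 payload bits per byte, high bit set as a
  // continuation flag. SirixDeweyID.toBytes(int[]) uses a Huffman-style
  // code table instead; this only illustrates the variable-length idea.
  static byte[] toBytes(int[] divisions) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int division : divisions) {
      while ((division & ~0x7F) != 0) {
        out.write((division & 0x7F) | 0x80);
        division >>>= 7;
      }
      out.write(division);
    }
    return out.toByteArray();
  }
}
```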

@JohannesLichtenberger
Member Author

It would probably be a good start to profile the mentioned testChicago method and to switch the DeweyID generation on and off.

@IlayMenahem

I've tried to run the test with and without @Ignore; both times I got the same message. I also tried to find out whether there is any special guidance on how to run the tests, and there isn't.

The supplied phased action failed with an exception.
A problem occurred configuring root project 'sirix-parent'.
Failed to notify project evaluation listener.
A problem occurred configuring project ':sirix-core'.
A problem occurred evaluating project ':sirix-core'.
Could not find method api() for arguments [io.sirix:brackit:0.2] on object of type org.gradle.api.internal.artifacts.dsl.dependencies.DefaultDependencyHandler.

@JohannesLichtenberger
Member Author

Can you try executing it using IntelliJ?

@yeohbraddy

Hi,

If @IlayMenahem is no longer working on this, I'd like to contribute! It is my first time contributing to open source so apologies if this is against etiquette.

Thank you!

@JohannesLichtenberger
Member Author

Sure, thanks :-) I'm also thinking about not storing DeweyIDs for object field values to reduce storage and CPU processing costs. I'm currently still on holiday with my small tablet so I can't code right now.

@JohannesLichtenberger
Member Author

@yeohbraddy I guess it's perfectly fair to assume so; in my experience, after a week or more without any comments/PRs etc., the user hasn't had time (or something like that) to work on the issue.

@JohannesLichtenberger
Member Author

@yeohbraddy any work in progress?

@JohannesLichtenberger
Member Author

@yeohbraddy I guess you don't have time to work on the issue!? Let me know; then I'd unassign the issue for now or have another look myself. I think it should be possible to use explicit SIMD.

@yeohbraddy

Hi @JohannesLichtenberger, sorry I missed the previous ping! Unfortunately, I don't have time to work on this issue; I started a new job 😌

@yeohbraddy removed their assignment Nov 22, 2022
@ConnorCable
Contributor

Hi! I'm new to OSS and interested in taking this on. I think I have a grasp of the issue here, but I'd like to jump on a call with whoever is available to learn more and get a sense of direction.

thank you!

@JohannesLichtenberger
Member Author

Sounds great. The thing is that these IDs, even when Huffman encoded, almost double storage costs in my tests and also carry a huge performance penalty due to the toBytes() method, which adds a lot of CPU time. The class is based on this class: https://github.com/sebbae/brackitdb/blob/master/brackitdb-server/src/main/java/org/brackit/server/node/XTCdeweyID.java

You can check JsonShredderTest::testChicago and enable the storage of DeweyIDs. You'll need this file to execute the test: https://www.file-upload.net/download-15038251/cityofchicago.zip.html

@JohannesLichtenberger
Member Author

You can also join our Discord channel if you need further explanations :-)

@adamretter

@JohannesLichtenberger I have been lurking for a while. I contribute to eXist-db where we use Dynamic Level Numbering for our hierarchical ids. Perhaps they would be more efficient for your needs?

@ConnorCable
Contributor

@adamretter Hey Adam, do you have any papers or resources regarding Dynamic Level Numbering I can take a look at?

@JohannesLichtenberger
Member Author

Hi Adam, awesome that you've had an eye on SirixDB 👍 Can you elaborate on how they are implemented in eXist-db? I think in SirixDB they are especially useful for sorting XQuery results in document order (see the sketch below). However, I've switched to disabling them by default for JSON storage because of the space consumption and the drop in insertion latency.
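
As a small aside on the document-order point, a hedged sketch (not SirixDB's actual comparator): two DeweyIDs compare in document order by comparing their division arrays lexicographically, with the shorter (ancestor) label sorting first.

```java
final class DeweyIdOrder {
  // Lexicographic comparison of division arrays; ancestors sort before descendants.
  static int compareDocumentOrder(int[] a, int[] b) {
    final int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      if (a[i] != b[i]) {
        return Integer.compare(a[i], b[i]);
      }
    }
    return Integer.compare(a.length, b.length);
  }
}
```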

BTW: Do you have an idea how to store bigger files and run automatic integration tests on these?

@JohannesLichtenberger
Member Author

Even Git LFS only allows storing up to 2 or 3 GB of data at most, IIRC (but maybe that's another issue :-))

@adamretter

adamretter commented Dec 14, 2022

@ConnorCable The paper is titled Supporting Efficient Streaming and Insertion of XML Data in RDBMS and is authored by Timo Böhme and Erhard Rahm.

There is an implementation of this in eXist-db here - https://github.com/eXist-db/exist/blob/develop/exist-core/src/main/java/org/exist/numbering/DLN.java#L42

The implementation in eXist-db is by no means perfect. eXist-db also introduces sub-levels to support XQuery Update.
If I recall correctly, it had some bugs, and there were further optimisations that could be made... which we improved upon for our DLN implementation in FusionDB.

I actually think the ORDPATH approach from Microsoft has some advantages over DLN; unfortunately, though, Microsoft holds a patent on ORDPATH, so I deemed it unimplementable for my uses.
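
For intuition only (this is not eXist-db's DLN implementation; see the linked DLN.java for that): a DLN-style label is essentially a dot-separated path of level values, so parent and ancestor checks reduce to simple prefix operations.

```java
final class DlnLabelSketch {
  // E.g. the parent of "1.2.3" is "1.2"; a root-level label has no parent.
  static String parentOf(String label) {
    final int lastDot = label.lastIndexOf('.');
    return lastDot < 0 ? null : label.substring(0, lastDot);
  }

  // "1.2" is an ancestor of "1.2.3" because it is a proper dot-separated prefix.
  static boolean isAncestorOf(String ancestor, String descendant) {
    return descendant.startsWith(ancestor + ".");
  }
}
```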

@adamretter

adamretter commented Dec 14, 2022

I think in SirixDB they are especially useful for sorting regarding document order of XQuery results

@JohannesLichtenberger Sure. We also use them for fast sibling, parent, and ancestor lookups/proxies.

BTW: Do you have an idea how to store bigger files and run automatic integration tests on these?

I am not quite sure what you are asking about here. I haven't looked in detail into how SirixDB stores its data. eXist-db stores its XML as the events (SAX-like) needed to build a DOM in a B+Tree. FusionDB takes a similar approach but uses an LSM-tree instead.

@JohannesLichtenberger
Member Author

JohannesLichtenberger commented Dec 14, 2022

The serialization, at least, is straightforward for DLN without the Huffman encoding. I'm also skipping values that share the same prefix in the page fragment during serialization :-)

True, but I think the hierarchical node labels suffer from worse performance for axis traversals compared to pre/post or pre/size/dist encodings (I don't remember the exact encoding in BaseX), or, in our case, a persistent DOM. However, maybe it also depends on the index structure. I think they are ideally stored in a trie instead of, as in BrackitDB, in a B+ tree.

It would be amazing to see a version storing only the leaf nodes plus the path summary (I think in BrackitDB/XTC they called it elementless storage).

In SirixDB we needed unique IDs that are never reassigned (thus based on a simple AtomicLong-based sequence generator), and we store a nodeKey/parent/firstChild/lastChild/rightSibling/leftSibling encoding (basically a doubly linked list of children, see the sketch below) in order to get rid of variable-length node labels such as the DeweyIDs we're discussing here. But I think in some cases they are also really useful, as mentioned in the papers by Theo Haerder, even if we use a single writer, for instance.
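
A hedged sketch (not SirixDB's actual record layout) of that fixed-size structural encoding; each node only stores keys, so there are no variable-length labels:

```java
// Hypothetical illustration: six fixed-size keys per node provide parent,
// first/last child, and left/right sibling navigation (a doubly linked list
// of children) without any hierarchical label.
record StructuralNode(long nodeKey,
                      long parentKey,
                      long firstChildKey,
                      long lastChildKey,
                      long leftSiblingKey,
                      long rightSiblingKey) {
}
```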

Regarding DeweyIDs in SirixDB: we can quickly select the diffs in the subtrees of specific nodes between two revisions, and also in document/preorder, which is also nice when storing JSON. So they might be useful in some cases; that's why I implemented the optional storage in the first place, but performance and storage costs are currently high.

BTW: Do you already have a query engine for FusionDB? Otherwise, I'd like to mention Brackit, which implements JSONiq with set-oriented operators and implicit join recognition to support hash joins, sort-merge joins, and so on. It's meant to be a query engine for different backends, where you can add AST rewrite rules for index accesses, add additional expressions, etc., and it should implement all optimizations that are not part of the individual storage engine :-)

Sorry for the very, very long reply, but I'm missing people with whom I can share some thoughts and ideas.

@JohannesLichtenberger
Member Author

JohannesLichtenberger commented Dec 14, 2022

With bigger files, I mean importing JSON or XML files of up to at least a few GB and running other integration tests on them. Maybe most database system engineers add data via the APIs instead of importing GBs of data from other storage formats for the tests...

Currently, I'm commenting/uncommenting tests locally, but that's, of course, not the right way because I can't store these files in Git.
