-
-
Notifications
You must be signed in to change notification settings - Fork 253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize SirixDeweyId.toBytes-method #512
Comments
hi i'm new to open source contributing; i read about SIMD and i can help on this task. |
The class is called
The concept of DeweyIDs is described in the following documents (they are stored using huffman codes as numbers often times repeat in the hierarchical label -- we're using them for our JSON nodes, too): http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/mathis/documents/HHM_SBBD.pdf and http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publications/2010/HM10.JIDM.pdf |
It would probably a good start to profile the mentioned testChicago-method and switch on/off the DeweyID generation. |
i've tried to use the test with and without @ignore, both times i got the same message. i also tried to find if there is any special guidance on how to conduct test and there wasn't. The supplied phased action failed with an exception. |
Can you try executing using IntelliJ? |
Hi, If @IlayMenahem is no longer working on this, I'd like to contribute! It is my first time contributing to open source so apologies if this is against etiquette. Thank you! |
Sure, thanks :-) I'm also thinking about not storing |
@yeohbraddy I guess it's perfectly fair to assume, as in my experience after a week or more without any comments/PRs etc.pp. the user hasn't had time or something like that to work on the issue. |
@yeohbraddy any work in progress? |
@yeohbraddy guess you don't have time to work on the issue!? Let me know, then I'd unassign the issue for now or will have another look myself. I think it should be possible to use explicit SIMD. |
Hi @JohannesLichtenberger, sorry I missed the previous ping! Yes I don't have time to work on this issue unfortunately, I started a new job 😌 |
Hi! I’m new to OSS and interested in taking this on. I think I have a a grasp of the issue here but I’d like to jump on a call with whoever to learn more and get a sense of direction . thank you! |
Sounds great. The thing is, that these IDs even if huffman encoded almost double storage costs in my tests and also have a huge performance penalty due to the You can check |
You can also join our Discord channel if you need further explanations :-) |
@JohannesLichtenberger I have been lurking for a while. I contribute to eXist-db where we use Dynamic Level Numbering for our hierarchical ids. Perhaps they would be more efficient for your needs? |
@adamretter Hey Adam, do you have any papers or resources regarding Dynamic Level Numbering I can take a look at? |
Hi Adam, awesome, that you've had an eye on SirixDB 👍 can you elaborate how they are implemented in ExistDB? I think in SirixDB they are especially useful for sorting regarding document order of XQuery results. However, I switched to disable them for JSON storage per default, because of the space consumption and drop regarding insertion latency. BTW: Do you have an idea how to store bigger files and run automatoc integration tests on these? |
Even Git LFS does only allow to store up to 2 or 3 Gb of data at most IIRC (but maybe that's another issue :-)) |
@ConnorCable The paper is titled There is an implementation of this in eXist-db here - https://github.com/eXist-db/exist/blob/develop/exist-core/src/main/java/org/exist/numbering/DLN.java#L42 The implementation in eXist-db is by no means perfect. eXist-db also introduces sub-levels to support XQuery Update. I actually think the ORDPATH approach from Microsoft has some advantages over DLN; unfortunately though Microsoft hold a patent on ORDPATH, and so I deemed it unimplementable for my uses. |
@JohannesLichtenberger Sure. We also use them for fast sibling, parent, and ancestor lookups/proxies.
I am not quite sure what you are asking about here. I haven't looked in detail into how SirixDB stores its data. eXist-db stores its XML as the events (SAX like) needed to build a DOM in a B+Tree. FusionDB takes a similar approach but uses an LSM-tree instead. |
The serialization at least is straight forward without the huffman encoding regarding DLN. I'm also skipping values, which share the same prefix in the page fragment during serialization :-) True, but I think the hierarchical node labels suffer from worse performance regarding the axis traversal in comparison to the It would be amazing to see a version of storing only the leaf nodes plus the path summary (I think in BrackitDB/XTC they called it elementless storage). In SirixDB we needed unique IDs which are never reassigned (thus based on a simple AtomicLong-based sequence generator), and we store a Regarding DeweyIDs in SirixDB: we can quickly select the diffs in subtrees of specific nodes between two revisions and also in document/preorder, which is also nice in the case of storing JSON, so it might also be useful in some cases, that's why I've implemented the optional storage in the first place, but performance and storage costs currently are high. BTW: Do you already have a query engine for FusionDB? Otherwise, I'd like to mention Brackit, which implements JSONiq with set-orientated operators and implicit join recognition to support hash joins, sort-merge joins... and it's supposed to be a query engine for different backends where you can add AST rewrite rules for index accesses, add additional expressions... and it should implement all optimizations, which are not part of the individual storage engine :-) Sorry for the very, very long reply, but I'm missing people with whom I can share some thoughts and ideas. |
With bigger files, I mean to import JSON or XML files up to at least a few Gb and do other integration tests. Maybe most database system engineers add data via the APIs instead of importing Gbs of data from other storage formats for the tests... Currently, I'm commenting/uncommenting tests locally, but that's, of course, not the right way because I can't store these files in Git. |
We can probably use the new Vector API to use explicit SIMD instructions.
The text was updated successfully, but these errors were encountered: