Xpath experiment #1
Draft: Adzz wants to merge 47 commits into master from xpath_experiment
There is a lot of stuff in here. A good place to start is the findings.md doc, where we attempt to capture everything we've been up to. Some high-level TL;DRs:

* High memory seems to come from how xmerl represents the XML, namely parents, position and the inclusion of smaller fields that we probably just don't need (nsinfo, expanded name, etc.).

We attempt an integration with DataSchema. There are a few ways that can work:

1. Define a Saxy data accessor. This would result in Saxy.parse_string being called once per field in the schema, but it ignores everything except the one path it is looking for. It also returns as soon as we know we have what we need. Preliminary results suggest it might be a bit slower but uses about half the memory. What makes this very tricky is figuring out what to do when we hit a has_many. This really feels solvable, but the best I have ATM is a hacky solution, and it is still incomplete too.
2. Define our own "reducer", i.e. a to_struct fn that takes the schema and the XML and perhaps handles has_many differently. This is as yet unexplored. It certainly feels less clean, but if it works, who cares.
3. Alter the representation of the schema, possibly to be keyed by the xpath. Then, as we progress through the XML, we detect when we have reached a field we care about (based on the schema) and save it if we have. This feels promising because we parse through the doc once, but (1) the representation of schemas is different and (2) it's a bit tricky to implement.
4. Think about it from scratch a bit, rather than trying to fit it into established paradigms: what's the simplest way to get what we want? (It might be one of the solutions proposed, but we should think about it.)

We also attempt to keep the current system but, instead of creating Erlang records and xmerl, create a map of the XML, removing the unnecessary things like "parents" etc. Preliminary results show that this clearly reduces memory a lot.
We now need to figure out what the data we serialise to should look like AND we need to figure out an xpath replacement / integration. What's nice about this approach is that it still works with DataSchema.
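The first integration idea above (an accessor that walks the document for a single path and returns as soon as it has the value) can be sketched roughly as follows. This is only an illustration in Python's standard `xml.sax`, not the Saxy code from this branch. `PathAccessor`, `_Found` and `get_text` are hypothetical names, and raising an exception is simply how this sketch stops the parse early.

```python
import xml.sax


class _Found(Exception):
    """Raised to abort parsing once the target text has been read."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


class PathAccessor(xml.sax.ContentHandler):
    """Tracks the current element path and captures text at one target path."""

    def __init__(self, target):
        super().__init__()
        self.target = target  # e.g. ["Menu", "Salad", "Name"]
        self.stack = []       # names of the currently open elements
        self.chunks = None    # text pieces while inside the target, else None

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            self.chunks = []

    def characters(self, content):
        # Everything outside the target path is ignored.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if self.chunks is not None and self.stack == self.target:
            # We have what we need: stop without reading the rest of the doc.
            raise _Found("".join(self.chunks))
        self.stack.pop()


def get_text(xml_string, path):
    """Return the text of the first element at `path`, or None if absent."""
    handler = PathAccessor(path.strip("/").split("/"))
    try:
        xml.sax.parseString(xml_string.encode("utf-8"), handler)
    except _Found as found:
        return found.value
    return None
```

Note that this only ever returns the first match, which is exactly why a has_many is awkward for this approach: collecting every match means giving up the early return, or re-parsing once per item.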
This uses the Saxy handlers to create a map where the keys are dynamic. The theory is that this would make the querying faster. We can see that this approach still uses significantly less memory than the current xmerl approach, but it does fare worse than our other "slimmed down map" approach. And it still feels like far too much mems. The next approaches are to:

* Use a tuple instead of a dynamic map.
* Try the "slimmed down map" with a struct.
* ...

We also need to bench the querying, which is easy to do for the simple case, but do we want to support `//` etc.?
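For the "simple case" of querying, a sketch like the one below may be enough to benchmark against. It is written in Python purely for illustration and assumes a hypothetical node shape (`name`, `text`, `children`) for the slimmed-down map, not the representation actually used in this branch. It only follows absolute child steps and deliberately does not support `//`.

```python
def query(node, path):
    """Return every node reached by following `path` from `node`.

    `node` is assumed to look like
    {"name": str, "text": str, "children": [node, ...]}.
    Only absolute child steps such as "/Menu/Salad/Name" are handled.
    """
    steps = path.strip("/").split("/")
    if node["name"] != steps[0]:
        return []
    matches = [node]
    for step in steps[1:]:
        # Keep the children of every current match whose name equals the step,
        # preserving document order.
        matches = [
            child
            for parent in matches
            for child in parent["children"]
            if child["name"] == step
        ]
    return matches
```

Supporting `//` would mean searching all descendants at that step rather than direct children only, which is where the cost of a query stops being proportional to the path length.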
Also ensures we return any children in the correct order.
Dynamic map tuple thing. There are some bits still to implement, like list of, and probably loads of edge cases and improvements. But this should let us benchmark the steamed ham examples vs SweetXml.
This is banging if it holds up!! 5 times less mems and quicker. Really need to try with a larger input now. That's gonna require porting large schema to the new one though....
Day 474, captain's log. We are close; there are noises outside. We are about to change where we pop off the stack, from inside characters to inside the end element. This is because if a tag doesn't have characters in it, we would never pop off the stack!
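The stack change described above can be illustrated with a small sketch. It uses Python's standard `xml.sax` rather than Saxy, and the node shape and the names `TreeBuilder` and `parse` are assumptions made for the example. The point is that every element produces exactly one end event, whereas an empty element produces no characters event at all.

```python
import xml.sax


class TreeBuilder(xml.sax.ContentHandler):
    """Builds {"name", "text", "children"} dicts using an explicit stack."""

    def __init__(self):
        super().__init__()
        self.stack = []   # elements that are currently open
        self.root = None  # set when the outermost element closes

    def startElement(self, name, attrs):
        self.stack.append({"name": name, "text": "", "children": []})

    def characters(self, content):
        # Only accumulate text here and never pop: an element without text
        # would not trigger this callback, so it would stay on the stack.
        self.stack[-1]["text"] += content

    def endElement(self, name):
        # Popping here is safe because every element has exactly one end event.
        node = self.stack.pop()
        if self.stack:
            # Attach to the parent in the order the elements close,
            # which keeps siblings in document order.
            self.stack[-1]["children"].append(node)
        else:
            self.root = node


def parse(xml_string):
    """Parse `xml_string` and return the root node dict."""
    handler = TreeBuilder()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return handler.root
```

With this layout an input such as `<Menu><Salad/><Salad><Name>Greek</Name></Salad></Menu>` still ends with an empty stack, and the empty `Salad` lands in the parent's children ahead of the second one.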
…and stuff which smells like not putting it in the parent correctly
One correct has_many, one to go. We are about to experiment with changing how a path should be structured... We are moving to making the last node in the has_many xpath NOT Salads but Salad. I think this might even match xpath...
These are some iterations on the "straight to struct" approach - mainly experimenting with changing the accumulator in some way.
ALRIGHT, this should let us benchmark the performance of querying! What we should have done is runtime schemas, then had one struct used for both; then we could compare the two to_struct results for equality. In fact we still could, especially if we wrote a function that turned the compile-time schema into a runtime one. Well, guess we nearly have that already with __data_schema_fields.
…ge here. But we did get a working version of straight to struct
This is a bit of a mess, but this work has been to aid investigation into XML parsing in platform.
This is the notion doc that started it:
https://www.notion.so/duffel/Farelogix-Investigation-85bd0d4dca7e453d89a64bc6b04f1f4c
This is the Jira epic:
https://duffel.atlassian.net/browse/ES-106
A lot of it won't make a huge amount of sense, as this branch has been a scratchpad for me to try out ideas that have either been progressed elsewhere or canned.
Broadly these were some of the experimental approaches: