
Xpath experiment #1

Draft: Adzz wants to merge 47 commits into master
Conversation


@Adzz commented May 13, 2022

This is a bit of a mess, but this work has been done to aid the investigation into XML parsing in platform.

This is the notion doc that started it:

https://www.notion.so/duffel/Farelogix-Investigation-85bd0d4dca7e453d89a64bc6b04f1f4c

This is the Jira epic:

https://duffel.atlassian.net/browse/ES-106

A lot of it won't make a huge amount of sense, as this branch has been a scratchpad for me to try out ideas that have either been progressed elsewhere or canned.

Broadly, these were some of the experimental approaches:

  • Iterating over a schema for the XML and parsing the whole document once per field in that schema. On each pass everything except the selected path is ignored, and we just pull out the one value we need.
  • A Saxy handler that goes straight to a struct - the struct is defined by a data schema.
  • A Saxy handler that creates an XMLNode struct, which we then query with a data schema accessor.
  • We accidentally re-implemented simple form (Saxy.SimpleForm) without realising it already existed.
  • Benchmarked the current approaches (a rough sketch of the current DataSchema + SweetXml setup follows this list).
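For reference, a minimal sketch of roughly what the current DataSchema + SweetXml setup looks like. Module names, paths and the cast functions are illustrative, and the accessor callbacks follow DataSchema's documented accessor behaviour rather than the branch's exact code:

```elixir
defmodule XpathAccessor do
  # Illustrative accessor: each schema field is resolved by running a SweetXml
  # xpath against an already-parsed (xmerl) document.
  @behaviour DataSchema.DataAccessBehaviour
  import SweetXml, only: [sigil_x: 2]

  @impl true
  def field(doc, path), do: SweetXml.xpath(doc, ~x"#{path}"s)

  @impl true
  def list_of(doc, path), do: SweetXml.xpath(doc, ~x"#{path}"ls)

  @impl true
  def has_one(doc, path), do: SweetXml.xpath(doc, ~x"#{path}")

  @impl true
  def has_many(doc, path), do: SweetXml.xpath(doc, ~x"#{path}"l)
end

defmodule Offer do
  import DataSchema, only: [data_schema: 1]

  @data_accessor XpathAccessor
  data_schema(
    field: {:total_price, "//Offer/TotalPrice/text()", &{:ok, to_string(&1)}}
  )
end

# DataSchema.to_struct(xml, Offer)
```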

Adzz added 30 commits April 30, 2022 22:23
There is a lot of stuff in here. A good place to start is with the
findings.md doc, where we attempt to capture everything we've been up to.
Some high-level TL;DRs:

* High memory seems to come from how xmerl represents the XML, namely
  parents, position, and the inclusion of smaller fields that we probably
  just don't need (nsinfo, expanded name, etc.).
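As a quick way to see that per-element bookkeeping, the xmerl record definition can be extracted in iex (the exact field list may vary slightly by OTP version):

```elixir
# Inspect what xmerl stores for every element node; :parents and :pos grow
# with the document, and :nsinfo / :expanded_name etc. are carried even when
# unused.
require Record

Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
|> Keyword.keys()
|> IO.inspect(label: "xmlElement fields")
# Expected to include :name, :expanded_name, :nsinfo, :namespace, :parents,
# :pos, :attributes, :content, ...
```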

We attempt an integration with DataSchema; there are a few ways that can
work:

1. Define a Saxy data accessor. This would result in Saxy.parse_string
   being called once per field in the schema, but it ignores everything
   except the one path it is looking for, and it returns as soon as we
   know we have what we needed (see the sketch after this list).
   Preliminary results suggest it might be a bit slower but uses about
   half the memory. What makes this very tricky is figuring out what to
   do when we hit a has_many. This really feels solvable, but the best I
   have ATM is a hacky solution - still incomplete too.
2. We define our own "reducer", i.e. a to_struct fn that takes the
   schema and the XML and perhaps handles has_many differently. This is
   as yet unexplored. It certainly feels less clean, but if it works,
   who cares.
3. Alter the representation of the schema - possibly to be keyed by the
   xpath. Then, as we progress through the XML, we detect when we have
   reached a field we care about (based on the schema) and save its
   value if so. This feels promising because we only parse through the
   doc once, but 1. the representation of schemas is different and 2.
   it's a bit tricky to implement.
4. We should think about it from scratch a bit. Rather than trying to
   fit it into established paradigms, what's the simplest way to get
   what we want? (It might be one of the solutions proposed, but we
   should think about it.)
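A minimal sketch of option 1's one-path-per-parse idea (illustrative only; the real accessor also has to deal with attributes, has_many, etc., and the module name and state shape here are made up):

```elixir
defmodule OnePathHandler do
  # Illustrative only: resolve a single path per Saxy.parse_string call,
  # ignoring everything else and stopping as soon as the value is found.
  @behaviour Saxy.Handler

  # State is {target_path, current_path_reversed}.
  def handle_event(:start_document, _prolog, target), do: {:ok, {target, []}}

  def handle_event(:start_element, {name, _attrs}, {target, current}),
    do: {:ok, {target, [name | current]}}

  def handle_event(:characters, chars, {target, current}) do
    if Enum.reverse(current) == target do
      # Found the field we were asked for, so stop parsing immediately.
      {:stop, chars}
    else
      {:ok, {target, current}}
    end
  end

  def handle_event(:end_element, _name, {target, [_ | rest]}),
    do: {:ok, {target, rest}}

  def handle_event(:end_document, _data, _state), do: {:ok, :not_found}
end

# One full parse per schema field, bailing out at the first match:
# Saxy.parse_string(xml, OnePathHandler, ["Response", "Offer", "TotalPrice"])
```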

We also attempt to keep the current system but, instead of creating
Erlang records with xmerl, create a map of the XML - removing
unnecessary things like "parents" etc. Preliminary results show that
this clearly reduces memory a lot. We now need to figure out what the
data we serialise to should look like, AND we need to figure out an
xpath replacement / integration. What's nice about this approach is
that it still works with DataSchema.
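To make the shape concrete, a node in the slimmed-down map might look something like this (field names illustrative; the point is what gets dropped relative to xmerl):

```elixir
# <Price currency="GBP">12.30</Price> as a slimmed-down node: just name,
# attributes and content, with no parents / pos / namespace bookkeeping.
%{
  name: "Price",
  attributes: [{"currency", "GBP"}],
  content: ["12.30"]
}
```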
This uses the Saxy handlers to create a map where the keys are dynamic.
The theory is that this would make the querying faster. We can see that
this approach still uses significantly less memory than the current
xmerl approach, but it does fare worse than our other "slimmed down map"
approach. And it still feels like far too much memory. The next
approaches are to:

* Use a tuple instead of a dynamic map.
* Try the "slimmed down map" with a struct.
* ...

We also need to bench the querying, which is easy to do for the simple
case, but do we want to support `//` etc.?
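A sketch of how the query benchmarking could look, assuming Benchee and SweetXml are available; the XML, the slim-map stand-in and the walk function are placeholders for whatever representation ends up being tested:

```elixir
import SweetXml, only: [sigil_x: 2]

xml = "<Offer><TotalPrice>12.30</TotalPrice></Offer>"

# The xmerl document the current pipeline queries, and a hand-built stand-in
# for the slimmed-down map representation.
xmerl_doc = SweetXml.parse(xml)

slim_map = %{
  name: "Offer",
  attributes: [],
  content: [%{name: "TotalPrice", attributes: [], content: ["12.30"]}]
}

# memory_time makes Benchee report memory usage as well as run time.
Benchee.run(
  %{
    "sweet_xml xpath" => fn -> SweetXml.xpath(xmerl_doc, ~x"//TotalPrice/text()"s) end,
    "slim map walk" => fn ->
      slim_map.content
      |> Enum.find(&(&1.name == "TotalPrice"))
      |> Map.get(:content)
    end
  },
  memory_time: 2
)
```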
Also ensures we return any children in the correct order.
dynamic map tuple thing. There are some bits still to implement, like
list_of, and probably loads of edge cases and improvements. But this
should let us benchmark the steamed ham examples vs SweetXml.
This is banging if it holds up!! 5 times less memory and quicker.

Really need to try with a larger input now. That's gonna require
porting the large schema to the new one though....
Day 474, captain's log. We are close; there are noises outside.

We are about to change where we pop off the stack, from inside the characters callback to inside the end_element callback. This is because if a tag doesn't have any characters in it, we would never pop it off the stack!
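A minimal sketch of the stack handling being described (not the branch's exact handler): the pop-and-attach-to-parent step lives in the end_element callback, because an empty element like `<Salad/>` never fires a characters event.

```elixir
defmodule SimpleTreeHandler do
  # Illustrative only. Builds a {name, attributes, children} tree with a stack;
  # nodes are popped and attached to their parent in :end_element, because an
  # element with no text content never produces a :characters event.
  @behaviour Saxy.Handler

  def handle_event(:start_document, _prolog, _state), do: {:ok, []}

  def handle_event(:start_element, {name, attributes}, stack) do
    # Push a fresh node; children/text accumulate in reverse order.
    {:ok, [{name, attributes, []} | stack]}
  end

  def handle_event(:characters, chars, [{name, attrs, children} | rest]) do
    {:ok, [{name, attrs, [chars | children]} | rest]}
  end

  def handle_event(:end_element, name, [{name, attrs, children}]) do
    # Root element closed: keep the finished tree as the state.
    {:ok, [{name, attrs, Enum.reverse(children)}]}
  end

  def handle_event(:end_element, name, [{name, attrs, children}, parent | rest]) do
    # Pop the finished node and attach it to its parent, preserving child order.
    {pname, pattrs, pchildren} = parent
    node = {name, attrs, Enum.reverse(children)}
    {:ok, [{pname, pattrs, [node | pchildren]} | rest]}
  end

  def handle_event(:end_document, _data, [root]), do: {:ok, root}
end

# {:ok, tree} =
#   Saxy.parse_string("<Salads><Salad>Caesar</Salad><Salad/></Salads>", SimpleTreeHandler, [])
```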
…and stuff which smells like not putting it in the parent correctly
one correct has_many, one to go

We are about to experiment with changing how a path should be
structured.... We are moving to making the last node in the has_many
xpath NOT Salads but Salad. I think this might even match XPath semantics...
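Illustratively, in DataSchema field terms (modules and paths made up), the change is:

```elixir
# has_many path convention change (modules and paths illustrative):
[
  # before - path ends at the wrapping <Salads> element:
  has_many: {:salads, "/Menu/Salads", Salad},
  # after - path ends at the repeating <Salad> element itself:
  has_many: {:salads, "/Menu/Salads/Salad", Salad}
]
```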
These are some iterations on the "straight to struct" approach - mainly
experimenting with changing the accumulator in some way.
Adzz added 17 commits May 9, 2022 00:48
ALRIGHT, this should let us benchmark the performance of querying!

What we should have done is use runtime schemas and have one struct used
for both; then we could compare the two to_struct results for equality.
In fact we still could, especially if we wrote a function that turned
the compile-time schema into a runtime one. Well, we nearly have that
already with __data_schema_fields.
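A rough sketch of that comparison; __data_schema_fields is the generated field list mentioned above, the module name is made up, and the runtime-schema side is left as comments because its exact API is the open question here:

```elixir
# Field list generated by the compile-time schema (module name made up);
# this is what could seed an equivalent runtime schema.
fields = CompileTimeOffer.__data_schema_fields()
IO.inspect(length(fields), label: "fields to port to a runtime schema")

# Then, roughly:
#   result_a = to_struct via the current xmerl/SweetXml pipeline
#   result_b = to_struct via the experimental pipeline built from `fields`
#   result_a == result_b
```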
…ge here. But we did get a working version of straight to struct