
Xpath experiment #1

Draft: Adzz wants to merge 47 commits into master
Conversation


@Adzz commented May 13, 2022

This is a bit of a mess, but this work has been done to aid the investigation into XML parsing in platform.

This is the notion doc that started it:

https://www.notion.so/duffel/Farelogix-Investigation-85bd0d4dca7e453d89a64bc6b04f1f4c

This is the Jira epic:

https://duffel.atlassian.net/browse/ES-106

A lot of it won't make a huge amount of sense, as this branch has been a scratchpad for me to try out ideas that have either been progressed elsewhere or canned.

Broadly, these were some of the experimental approaches:

  • Iterating over a schema for the XML and parsing the whole document once per field in that schema. On each pass everything except the selected path is ignored, and we just pull out the one value we need.
  • A Saxy handler that goes straight to a struct - the struct is defined by a data schema.
  • A Saxy handler that creates an XMLNode struct, which we then query with a data schema accessor.
  • We accidentally re-implemented simple form (Saxy.SimpleForm) without realising it already existed.
  • Benchmarked the current approaches (a rough sketch of the current DataSchema + SweetXml setup follows this list).
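For reference, a minimal sketch of roughly what the current DataSchema + SweetXml setup looks like. Module names, paths and the cast functions are illustrative, and the accessor callbacks follow DataSchema's documented accessor behaviour rather than the branch's exact code:

```elixir
defmodule XpathAccessor do
  # Illustrative accessor: each schema field is resolved by running a SweetXml
  # xpath against an already-parsed (xmerl) document.
  @behaviour DataSchema.DataAccessBehaviour
  import SweetXml, only: [sigil_x: 2]

  @impl true
  def field(doc, path), do: SweetXml.xpath(doc, ~x"#{path}"s)

  @impl true
  def list_of(doc, path), do: SweetXml.xpath(doc, ~x"#{path}"ls)

  @impl true
  def has_one(doc, path), do: SweetXml.xpath(doc, ~x"#{path}")

  @impl true
  def has_many(doc, path), do: SweetXml.xpath(doc, ~x"#{path}"l)
end

defmodule Offer do
  import DataSchema, only: [data_schema: 1]

  @data_accessor XpathAccessor
  data_schema(
    field: {:total_price, "//Offer/TotalPrice/text()", &{:ok, to_string(&1)}}
  )
end

# DataSchema.to_struct(xml, Offer)
```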

Adzz added 30 commits April 30, 2022 22:23
There is a lot of stuff in here. A good place to start is with the
findings.md doc, where we attempt to capture everything we've been up to.
Some high-level TL;DRs:

* High memory seems to come from how xmerl represents the XML, namely
  parents, position, and the inclusion of smaller fields that we probably
  just don't need (nsinfo, expanded name, etc.).
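As a quick way to see that per-element bookkeeping, the xmerl record definition can be extracted in iex (the exact field list may vary slightly by OTP version):

```elixir
# Inspect what xmerl stores for every element node; :parents and :pos grow
# with the document, and :nsinfo / :expanded_name etc. are carried even when
# unused.
require Record

Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
|> Keyword.keys()
|> IO.inspect(label: "xmlElement fields")
# Expected to include :name, :expanded_name, :nsinfo, :namespace, :parents,
# :pos, :attributes, :content, ...
```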

We attempt an integration with DataSchema; there are a few ways that can
work:

1. Define a Saxy data accessor. This would result in Saxy.parse_string
   being called once per field in the schema, but it ignores everything
   except the one path it is looking for, and it returns as soon as we
   know we have what we needed (see the sketch after this list).
   Preliminary results suggest it might be a bit slower but uses about
   half the memory. What makes this very tricky is figuring out what to
   do when we hit a has_many. This really feels solvable, but the best I
   have ATM is a hacky solution - still incomplete too.
2. We define our own "reducer", i.e. a to_struct fn that takes the
   schema and the XML and perhaps handles has_many differently. This is
   as yet unexplored. It certainly feels less clean, but if it works,
   who cares.
3. Alter the representation of the schema - possibly to be keyed by the
   xpath. Then, as we progress through the XML, we detect when we have
   reached a field we care about (based on the schema) and save its
   value if so. This feels promising because we only parse through the
   doc once, but 1. the representation of schemas is different and 2.
   it's a bit tricky to implement.
4. We should think about it from scratch a bit. Rather than trying to
   fit it into established paradigms, what's the simplest way to get
   what we want? (It might be one of the solutions proposed, but we
   should think about it.)
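A minimal sketch of option 1's one-path-per-parse idea (illustrative only; the real accessor also has to deal with attributes, has_many, etc., and the module name and state shape here are made up):

```elixir
defmodule OnePathHandler do
  # Illustrative only: resolve a single path per Saxy.parse_string call,
  # ignoring everything else and stopping as soon as the value is found.
  @behaviour Saxy.Handler

  # State is {target_path, current_path_reversed}.
  def handle_event(:start_document, _prolog, target), do: {:ok, {target, []}}

  def handle_event(:start_element, {name, _attrs}, {target, current}),
    do: {:ok, {target, [name | current]}}

  def handle_event(:characters, chars, {target, current}) do
    if Enum.reverse(current) == target do
      # Found the field we were asked for, so stop parsing immediately.
      {:stop, chars}
    else
      {:ok, {target, current}}
    end
  end

  def handle_event(:end_element, _name, {target, [_ | rest]}),
    do: {:ok, {target, rest}}

  def handle_event(:end_document, _data, _state), do: {:ok, :not_found}
end

# One full parse per schema field, bailing out at the first match:
# Saxy.parse_string(xml, OnePathHandler, ["Response", "Offer", "TotalPrice"])
```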

We also attempt to keep the current system but, instead of creating
Erlang records with xmerl, create a map of the XML - removing
unnecessary things like "parents" etc. Preliminary results show that
this clearly reduces memory a lot. We now need to figure out what the
data we serialise to should look like, AND we need to figure out an
xpath replacement / integration. What's nice about this approach is
that it still works with DataSchema.
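To make the shape concrete, a node in the slimmed-down map might look something like this (field names illustrative; the point is what gets dropped relative to xmerl):

```elixir
# <Price currency="GBP">12.30</Price> as a slimmed-down node: just name,
# attributes and content, with no parents / pos / namespace bookkeeping.
%{
  name: "Price",
  attributes: [{"currency", "GBP"}],
  content: ["12.30"]
}
```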
This uses the Saxy handlers to create a map where the keys are dynamic.
The theory is that this would make the querying faster. We can see that
this approach still uses significantly less memory than the current
xmerl approach, but it does fare worse than our other "slimmed down map"
approach. And it still feels like far too much memory. The next
approaches are to:

* Use a tuple instead of a dynamic map.
* Try the "slimmed down map" with a struct.
* ...

We also need to bench the querying, which is easy to do for the simple
case, but do we want to support `//` etc.?
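A sketch of how the query benchmarking could look, assuming Benchee and SweetXml are available; the XML, the slim-map stand-in and the walk function are placeholders for whatever representation ends up being tested:

```elixir
import SweetXml, only: [sigil_x: 2]

xml = "<Offer><TotalPrice>12.30</TotalPrice></Offer>"

# The xmerl document the current pipeline queries, and a hand-built stand-in
# for the slimmed-down map representation.
xmerl_doc = SweetXml.parse(xml)

slim_map = %{
  name: "Offer",
  attributes: [],
  content: [%{name: "TotalPrice", attributes: [], content: ["12.30"]}]
}

# memory_time makes Benchee report memory usage as well as run time.
Benchee.run(
  %{
    "sweet_xml xpath" => fn -> SweetXml.xpath(xmerl_doc, ~x"//TotalPrice/text()"s) end,
    "slim map walk" => fn ->
      slim_map.content
      |> Enum.find(&(&1.name == "TotalPrice"))
      |> Map.get(:content)
    end
  },
  memory_time: 2
)
```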
Also ensures we return any children in the correct order.
dynamic map tuple thing. There are some bits still to implement, like
list_of, and probably loads of edge cases and improvements. But this
should let us benchmark the steamed ham examples vs SweetXml.
This is banging if it holds up!! 5 times less memory and quicker.

Really need to try with a larger input now. That's gonna require
porting the large schema to the new one though....
Day 474, captain's log. We are close; there are noises outside.

We are about to change where we pop off the stack, from inside the characters callback to inside the end_element callback. This is because if a tag doesn't have any characters in it, we would never pop it off the stack!
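A minimal sketch of the stack handling being described (not the branch's exact handler): the pop-and-attach-to-parent step lives in the end_element callback, because an empty element like `<Salad/>` never fires a characters event.

```elixir
defmodule SimpleTreeHandler do
  # Illustrative only. Builds a {name, attributes, children} tree with a stack;
  # nodes are popped and attached to their parent in :end_element, because an
  # element with no text content never produces a :characters event.
  @behaviour Saxy.Handler

  def handle_event(:start_document, _prolog, _state), do: {:ok, []}

  def handle_event(:start_element, {name, attributes}, stack) do
    # Push a fresh node; children/text accumulate in reverse order.
    {:ok, [{name, attributes, []} | stack]}
  end

  def handle_event(:characters, chars, [{name, attrs, children} | rest]) do
    {:ok, [{name, attrs, [chars | children]} | rest]}
  end

  def handle_event(:end_element, name, [{name, attrs, children}]) do
    # Root element closed: keep the finished tree as the state.
    {:ok, [{name, attrs, Enum.reverse(children)}]}
  end

  def handle_event(:end_element, name, [{name, attrs, children}, parent | rest]) do
    # Pop the finished node and attach it to its parent, preserving child order.
    {pname, pattrs, pchildren} = parent
    node = {name, attrs, Enum.reverse(children)}
    {:ok, [{pname, pattrs, [node | pchildren]} | rest]}
  end

  def handle_event(:end_document, _data, [root]), do: {:ok, root}
end

# {:ok, tree} =
#   Saxy.parse_string("<Salads><Salad>Caesar</Salad><Salad/></Salads>", SimpleTreeHandler, [])
```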
…and stuff which smells like not putting it in the parent correctly
one correct has_many, one to go

We are about to experiment with changing how a path should be
structured.... We are moving to making the last node in the has_many
xpath NOT Salads but Salad. I think this might even match XPath semantics...
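Illustratively, in DataSchema field terms (modules and paths made up), the change is:

```elixir
# has_many path convention change (modules and paths illustrative):
[
  # before - path ends at the wrapping <Salads> element:
  has_many: {:salads, "/Menu/Salads", Salad},
  # after - path ends at the repeating <Salad> element itself:
  has_many: {:salads, "/Menu/Salads/Salad", Salad}
]
```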
These are some iterations on the "straight to struct" approach - mainly
experimenting with changing the accumulator in some way.
Adzz added 17 commits May 9, 2022 00:48
ALRIGHT, this should let us benchmark the performance of querying!

What we should have done is use runtime schemas and have one struct used
for both; then we could compare the two to_struct results for equality.
In fact we still could, especially if we wrote a function that turned
the compile-time schema into a runtime one. Well, we nearly have that
already with __data_schema_fields.
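A rough sketch of that comparison; __data_schema_fields is the generated field list mentioned above, the module name is made up, and the runtime-schema side is left as comments because its exact API is the open question here:

```elixir
# Field list generated by the compile-time schema (module name made up);
# this is what could seed an equivalent runtime schema.
fields = CompileTimeOffer.__data_schema_fields()
IO.inspect(length(fields), label: "fields to port to a runtime schema")

# Then, roughly:
#   result_a = to_struct via the current xmerl/SweetXml pipeline
#   result_b = to_struct via the experimental pipeline built from `fields`
#   result_a == result_b
```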
…ge here. But we did get a working version of straight to struct