Experiment with alternatives to Xmerl
There is a lot of stuff in here. A good place to start is the findings.md doc, where we attempt to capture everything we've been up to. Some high-level TL;DRs:

* High memory seems to come from how xmerl represents the XML, namely parents, position and the inclusion of smaller fields that we probably just don't need (nsinfo, expanded name, etc.).

We attempt an integration with DataSchema. There are a few ways that can work:

1. Define a Saxy data accessor. This would result in Saxy.parse_string being called once per field in the schema, but it ignores everything except the one path it is looking for. It also returns as soon as we know we have what we need. Preliminary results suggest it might be a bit slower but uses roughly half the memory. What makes this very tricky is figuring out what to do when we hit a has_many. This really feels solvable, but the best I have ATM is a hacky solution, and it is still incomplete.
2. Define our own "reducer", i.e. a to_struct fn that takes the schema and the XML and perhaps handles has_many differently. This is as yet unexplored. It certainly feels less clean, but if it works, who cares.
3. Alter the representation of the schema, possibly to be keyed by the xpath. Then as we progress through the XML we detect when we have reached a field we care about (based on the schema) and save it. This feels promising because we parse through the doc only once, but (a) the representation of schemas is different and (b) it's a bit tricky to implement.
4. Think about it from scratch a bit, rather than trying to fit it into established paradigms: what's the simplest way to get what we want? (It might be one of the solutions proposed, but we should think about it.)

We also attempt to keep the current system, but instead of creating Erlang records with xmerl we create a map of the XML, removing the unnecessary things like "parents" etc. Preliminary results show that this clearly reduces memory a lot. We now need to figure out what the data we serialise to should look like AND we need to figure out an xpath replacement / integration. What's nice about this approach is that it still works with DataSchema.
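To make option 3 above a bit more concrete, here is a rough sketch of what "schema keyed by path, single pass" could look like. This is only an illustration, not what the commit implements: the module name `PathKeyedSketch`, the `collect/2` function and the schema shape (a map from path string to field name) are all made up for the example and are not DataSchema's API. The events are plain tuples shaped like the ones Saxy passes to a handler's `handle_event/3` (`:start_element` with `{name, attributes}`, `:characters` with the text, `:end_element` with the name), so a real integration would presumably do the same bookkeeping inside a `Saxy.Handler`. It deliberately ignores has_many and early termination.

```elixir
defmodule PathKeyedSketch do
  @moduledoc """
  Hypothetical sketch: collect field values in one pass over SAX-style
  events, using a schema keyed by element path. Not DataSchema's real API.
  """

  # `events` is a list of {type, data} tuples shaped like Saxy handler events.
  # `schema` maps a path string such as "/order/id" to the field to fill.
  def collect(events, schema) do
    {_path, acc} = Enum.reduce(events, {[], %{}}, &step(&1, &2, schema))
    acc
  end

  # Entering an element: push its name onto the current path (stored reversed).
  defp step({:start_element, {name, _attrs}}, {path, acc}, _schema) do
    {[name | path], acc}
  end

  # Leaving an element: pop it off the current path.
  defp step({:end_element, _name}, {[_current | rest], acc}, _schema) do
    {rest, acc}
  end

  # Text: keep it only if the current path is one the schema cares about.
  # Text may arrive in several chunks, so chunks for a field are concatenated.
  defp step({:characters, text}, {path, acc}, schema) do
    key = "/" <> (path |> Enum.reverse() |> Enum.join("/"))

    case schema do
      %{^key => field} -> {path, Map.update(acc, field, text, &(&1 <> text))}
      _ -> {path, acc}
    end
  end

  # Anything else (document start/end, etc.) leaves the state untouched.
  defp step(_other, state, _schema), do: state
end

# Hand-written events standing in for what a SAX parser would emit for
# <order><id>42</id><customer><name>Ada</name></customer></order>
events = [
  {:start_element, {"order", []}},
  {:start_element, {"id", []}},
  {:characters, "42"},
  {:end_element, "id"},
  {:start_element, {"customer", []}},
  {:start_element, {"name", []}},
  {:characters, "Ada"},
  {:end_element, "name"},
  {:end_element, "customer"},
  {:end_element, "order"}
]

schema = %{"/order/id" => :id, "/order/customer/name" => :customer_name}

result = PathKeyedSketch.collect(events, schema)
```

The point of the sketch is that nothing is kept except the current path and the fields the schema asks for, which is where the memory saving would come from. The hard part noted above, has_many, would need the path stack to also know when a repeated element starts and ends.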