Experiment with alternatives to Xmerl
There is a lot of stuff in here. A good place to start is the
findings.md doc, where we attempt to capture everything we've been up to.
Some high-level TL;DRs:

* High memory seems to come from how xmerl represents the XML, namely
  parents, position and the inclusion of smaller fields that we probably
  just don't need: nsinfo, expanded name, etc.

We attempt an integration with DataSchema. There are a few ways that can
work:

1. Define a Saxy data accessor. This would result in Saxy.parse_string
   being called once per field in the schema, but it ignores everything
   except the one path it is looking for. It also returns as soon as we
   know we got what we needed. Preliminary results suggest it might be a
   bit slower but uses roughly half the memory. What makes this very
   tricky is figuring out what to do when we hit a has_many. This really
   feels solvable, but the best I have ATM is a hacky solution that is
   still incomplete too.
2. We define our own "reducer", i.e. a to_struct fn that takes the
   schema and the XML and perhaps handles has_many differently. This is
   as yet unexplored. It certainly feels less clean, but if it works,
   who cares.
3. Alter the representation of the schema, possibly to be keyed by the
   xpath. Then, as we progress through the XML, we detect when we have
   reached a field we care about (based on the schema) and we save it if
   we have. This feels promising because we parse through the doc once,
   but (a) the representation of schemas is different and (b) it's a bit
   tricky to implement.
4. We should think about it from scratch a bit, rather than trying to
   fit it into established paradigms: what's the simplest way to get
   what we want? (Might be one of the solutions proposed but we should
   think about it.)
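
Option 3 can be sketched without committing to a parser. The sketch
below assumes a hypothetical schema keyed by path (a list of element
names) and a hand-written list of SAX-style events shaped like the
:start_element, :characters and :end_element events a Saxy handler
receives. PathSchemaSketch, its schema and the event list are
illustrative only and are not code from this commit.

```elixir
defmodule PathSchemaSketch do
  # Hypothetical schema keyed by path: the list of element names leading to
  # the text node we want, mapped to the field it should populate.
  @schema %{
    ["SteamedHam", "ReadyDate"] => :ready_date,
    ["SteamedHam", "Type"] => :type
  }

  # Folds over the events once. State is {reversed path stack, fields so far}.
  def run(events) do
    {_stack, fields} = Enum.reduce(events, {[], %{}}, &handle_event/2)
    fields
  end

  # Entering an element pushes its name onto the stack.
  defp handle_event({:start_element, {name, _attributes}}, {stack, fields}) do
    {[name | stack], fields}
  end

  # Leaving an element pops it.
  defp handle_event({:end_element, _name}, {[_ | stack], fields}) do
    {stack, fields}
  end

  # Text is saved only when the current path is one the schema cares about.
  defp handle_event({:characters, text}, {stack, fields}) do
    case Map.fetch(@schema, Enum.reverse(stack)) do
      {:ok, field} -> {stack, Map.put(fields, field, text)}
      :error -> {stack, fields}
    end
  end
end

# Hand-written events standing in for what a SAX parser would emit.
events = [
  {:start_element, {"SteamedHam", [{"price", "1"}]}},
  {:start_element, {"ReadyDate", []}},
  {:characters, "2021-09-11"},
  {:end_element, "ReadyDate"},
  {:start_element, {"Ignored", []}},
  {:characters, "not in the schema"},
  {:end_element, "Ignored"},
  {:end_element, "SteamedHam"}
]

fields = PathSchemaSketch.run(events)
IO.inspect(fields)
```

The sketch only covers plain text fields. A has_many would need the
state to collect one entry per repeated element, which is exactly the
part that is still unsolved above.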

We also attempt to keep the current system but, instead of creating
xmerl's Erlang records, we create a map of the XML, removing the
unnecessary things like "parents" etc. Preliminary results show that
this clearly reduces memory a lot. We now need to figure out what the
data we serialise to should look like AND we need to figure out an xpath
replacement / integration. What's nice about this approach is that it
still works with DataSchema.
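
As a rough illustration of why a map can be smaller, the sketch below
compares a tuple modelled on an xmerl-style element record with one
possible slim map. The map keys (:name, :attributes, :content) and the
sample values are assumptions for illustration; the output of
Saxy.XmerlMap may look different. :erts_debug.flat_size/1 reports the
size of a term in words.

```elixir
# A tuple standing in for an xmerl-style element record. Besides the name,
# attributes and content, it carries bookkeeping such as the expanded name,
# nsinfo, namespace, parents and position. Values here are made up.
record_like =
  {:xmlElement, :ReadyDate, :ReadyDate, [], {:xmlNamespace, [], []},
   [SteamedHam: 1], 1, [], ["2021-09-11"], [], ~c"/tmp", :undeclared}

# Hypothetical slim map that keeps only what a schema lookup needs.
slim = %{name: :ReadyDate, attributes: [], content: ["2021-09-11"]}

# Size of each term in machine words, not bytes.
record_words = :erts_debug.flat_size(record_like)
slim_words = :erts_debug.flat_size(slim)

IO.puts("record-like: #{record_words} words, slim map: #{slim_words} words")
```

This measures a single element only. The real saving depends on the
document, so the Benchee memory measurements remain the thing to trust.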
Adzz committed May 2, 2022
1 parent ad5db2c commit ad5bca4
Showing 17 changed files with 207,683 additions and 151 deletions.
57 changes: 29 additions & 28 deletions bench.exs
@@ -1,3 +1,14 @@
# It would def be better and more accurate to test against larger schemas - especially
# more nested and self referential ones. My suspicion is that we use a lot of memory
# if there is a lot of nesting because we have to maintain a stack in the current approach
# I need to more fully understand what we do there and why, but that stack could be large

# Like not super sure why we need parents if we create a tree from the XML...
# SO maybe we can just trim that...

# ANYWAY. Larger schema would be better but requires some more work, so we can start
# small and test larger with more promising results.

xml = """
<SteamedHam price="1">
<ReadyDate>2021-09-11</ReadyDate>
@@ -14,35 +25,25 @@ xml = """
</Salads>
</SteamedHam>
"""
seat_xml = File.read!("/Users/adz/Duffel/saxy/test/support/fixture/jetstar_seat.xml")
# large_xml = File.read!("/Users/adz/Duffel/saxy/test/support/fixture/really_large.xml")

# Benchee.run(%{
# "current to xmerl (records)" => fn ->
# Saxy.Xmerl.parse_string(large_xml, [atom_fun: &String.to_atom/1])
# end,
# "current to map" => fn ->
# Saxy.XmerlMap.parse_string(large_xml, [atom_fun: &String.to_atom/1])
# end
# }, memory_time: 1, reduction_time: 1)

Benchee.run(%{
"just to xmerl" => fn ->
Saxy.Xmerl.parse_string(xml, [atom_fun: &String.to_atom/1])
end,
"xmerl -> data_schema" => fn ->
{:ok, xmerl} = Saxy.Xmerl.parse_string(xml, [atom_fun: &String.to_atom/1])
DataSchema.to_struct(xmerl, SteamedHam)
seat_xml = File.read!("/Users/adz/Duffel/saxy/test/support/fixture/jetstar_seat.xml")
{:ok, xmerl} = Saxy.Xmerl.parse_string(seat_xml, [atom_fun: &String.to_atom/1])
DataSchema.to_struct(xmerl, SeatAvailabilityResponse)
end,
"Xperiment" => fn ->
Saxy.Experiment.parse_to_struct(xml, SteamedHam)
end
# "xmerl -> sweet map" => fn ->
# xmerl = Saxy.Xmerl.parse_string(xml, [atom_fun: &String.to_atom/1])

# # type: "/SteamedHam/Type/text()", &__MODULE__.to_upcase(&1)},
# # price: "/SteamedHam/@price", fn x -> {:ok, String.to_integer(x)} end},
# has_many: {:salads, "/SteamedHam/Salads", Salad},
# has_one: {:sauce, "/SteamedHam/Sauce", Sauce},
# aggregate: {:ready_datetime, @datetime_fields, &__MODULE__.datetime/1}

# %{
# type: "/SteamedHam/Type/text()",
# price: "/SteamedHam/@price",
# salads: []
# }

# end,
# "xmerl -> sweet struct" => fn ->

# end
})
# "experiment parse many times" => fn ->
# Saxy.Experiment.parse_to_struct(xml, SteamedHam)
# end
}, memory_time: 1, reduction_time: 1)