Experiment with alternatives to Xmerl

There is a lot in here. A good place to start is the findings.md doc, where we attempt to capture everything we've been up to. Some high-level TL;DRs:

* High memory seems to come from how xmerl represents the XML, namely parents, position, and the inclusion of smaller fields that we probably just don't need (nsinfo, expanded name, etc.).

We attempt an integration with DataSchema. There are a few ways that can work:

1. Define a Saxy data accessor. This would result in Saxy.parse_string being called once per field in the schema, but it ignores everything except the one path it is looking for. It also returns as soon as we know we have what we need. Preliminary results suggest it might be a bit slower but uses roughly half the memory. What makes this very tricky is figuring out what to do when we hit a has_many. This really feels solvable, but the best I have ATM is a hacky solution that is still incomplete.
2. Define our own "reducer", i.e. a to_struct fn that takes the schema and the XML and perhaps handles has_many differently. This is as yet unexplored. It certainly feels less clean, but if it works, who cares.
3. Alter the representation of the schema, possibly to be keyed by the xpath. Then, as we progress through the XML, we detect when we have reached a field we care about (based on the schema) and save it if we have. This feels promising because we parse through the doc only once, but (1) the representation of schemas is different and (2) it's a bit tricky to implement.
4. Think about it from scratch a bit, rather than trying to fit it into established paradigms: what's the simplest way to get what we want? (It might be one of the solutions proposed, but we should think about it.)

We also attempt to keep the current system but, instead of creating xmerl's Erlang records, create a map of the XML, removing unnecessary things like "parents" etc. Preliminary results show that this clearly reduces memory a lot. We now need to figure out what the data we serialise to should look like AND we need to figure out an xpath replacement / integration. What's nice about this approach is that it still works with DataSchema.
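
To make option 3 a little more concrete, here is a minimal sketch of a schema keyed by path that is filled in during a single pass. It is not the code in this commit: the module name PathSchemaSketch, the schema shape, and the "medium rare" value are hypothetical, and the hand-written event list only assumes the general shape of SAX events (start_element, characters, end_element) rather than calling Saxy itself. It handles text fields only and ignores attributes and has_many, which is exactly the hard part noted above.

```elixir
defmodule PathSchemaSketch do
  # Hypothetical schema representation: absolute element path => field name.
  @schema %{
    ["SteamedHam", "ReadyDate"] => :ready_date,
    ["SteamedHam", "Type"] => :type
  }

  # Folds over SAX-style events once, keeping the current path as a stack.
  def to_map(events, schema \\ @schema) do
    {_path, acc} = Enum.reduce(events, {[], %{}}, &handle(&1, &2, schema))
    acc
  end

  # Entering an element pushes its name onto the path stack.
  defp handle({:start_element, {name, _attrs}}, {path, acc}, _schema),
    do: {[name | path], acc}

  # Leaving an element pops it again.
  defp handle({:end_element, _name}, {[_current | rest], acc}, _schema),
    do: {rest, acc}

  # Text is kept only when the current path is a key in the schema.
  defp handle({:characters, text}, {path, acc}, schema) do
    case Map.fetch(schema, Enum.reverse(path)) do
      {:ok, field} -> {path, Map.put(acc, field, text)}
      :error -> {path, acc}
    end
  end

  # Any other event is ignored.
  defp handle(_other, state, _schema), do: state
end

# Hand-written events standing in for what a SAX parser would emit.
events = [
  {:start_element, {"SteamedHam", [{"price", "1"}]}},
  {:start_element, {"ReadyDate", []}},
  {:characters, "2021-09-11"},
  {:end_element, "ReadyDate"},
  {:start_element, {"Type", []}},
  {:characters, "medium rare"},
  {:end_element, "Type"},
  {:end_element, "SteamedHam"}
]

result = PathSchemaSketch.to_map(events)
```

The only state carried through the pass is the path stack plus the fields collected so far, so nothing like a full tree with parent pointers is ever built. Whether that holds up once has_many and nested schemas are involved is still open.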
duffelhq · May 2, 2022 · ad5bca4 · ad5bca4
1 parent ad5db2c
commit ad5bca4
Showing 17 changed files with 207,683 additions and 151 deletions.
diff --git a/bench.exs b/bench.exs
@@ -1,3 +1,14 @@
+# It would def be better and more accurate to test against larger schemas - especially
+# more nested and self referential ones. My suspicion is that we use a lot of memory
+# if there is a lot of nesting because we have to maintain a stack in the current approach
+# I need to more fully understand what we do there and why, but that stack could be large
+
+# Like not super sure why we need parents if we create a tree from the XML...
+# SO maybe we can just trim that...
+
+# ANYWAY. Larger schema would be better but requires some more work, so we can start
+# small and test larger with more promising results.
+
 xml = """
 <SteamedHam price="1">
   <ReadyDate>2021-09-11</ReadyDate>
@@ -14,35 +25,25 @@ xml = """
   </Salads>
 </SteamedHam>
 """
+seat_xml = File.read!("/Users/adz/Duffel/saxy/test/support/fixture/jetstar_seat.xml")
+# large_xml = File.read!("/Users/adz/Duffel/saxy/test/support/fixture/really_large.xml")
+
+# Benchee.run(%{
+#   "current to xmerl (records)" => fn ->
+#     Saxy.Xmerl.parse_string(large_xml, [atom_fun: &String.to_atom/1])
+#    end,
+#    "current to map" => fn ->
+#     Saxy.XmerlMap.parse_string(large_xml, [atom_fun: &String.to_atom/1])
+#    end
+# }, memory_time: 1, reduction_time: 1)
 
 Benchee.run(%{
-  "just to xmerl" => fn ->
-    Saxy.Xmerl.parse_string(xml, [atom_fun: &String.to_atom/1])
-   end,
   "xmerl -> data_schema" => fn ->
-    {:ok, xmerl} = Saxy.Xmerl.parse_string(xml, [atom_fun: &String.to_atom/1])
-    DataSchema.to_struct(xmerl, SteamedHam)
+    seat_xml = File.read!("/Users/adz/Duffel/saxy/test/support/fixture/jetstar_seat.xml")
+    {:ok, xmerl} = Saxy.Xmerl.parse_string(seat_xml, [atom_fun: &String.to_atom/1])
+    DataSchema.to_struct(xmerl, SeatAvailabilityResponse)
   end,
-  "Xperiment" => fn ->
-    Saxy.Experiment.parse_to_struct(xml, SteamedHam)
-  end
-#   "xmerl -> sweet map" => fn ->
-#     xmerl = Saxy.Xmerl.parse_string(xml, [atom_fun: &String.to_atom/1])
-
-# # type: "/SteamedHam/Type/text()", &__MODULE__.to_upcase(&1)},
-# # price: "/SteamedHam/@price", fn x -> {:ok, String.to_integer(x)} end},
-#     has_many: {:salads, "/SteamedHam/Salads", Salad},
-#     has_one: {:sauce, "/SteamedHam/Sauce", Sauce},
-#     aggregate: {:ready_datetime, @datetime_fields, &__MODULE__.datetime/1}
-
-#     %{
-#       type: "/SteamedHam/Type/text()",
-#       price: "/SteamedHam/@price",
-#       salads: []
-#     }
-
-#   end,
-#   "xmerl -> sweet struct" => fn ->
-
-#   end
-})
+  # "experiment parse many times" => fn ->
+  #   Saxy.Experiment.parse_to_struct(xml, SteamedHam)
+  # end
+}, memory_time: 1, reduction_time: 1)