-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New parameter 'land_on' to xml_find_function_calls() #2496
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2496 +/- ##
==========================================
- Coverage 98.54% 98.54% -0.01%
==========================================
Files 126 126
Lines 5720 5713 -7
==========================================
- Hits 5637 5630 -7
Misses 83 83 ☔ View full report in Codecov by Sentry. |
I'd be interested in a benchmark for two scenarios:
and
The second one will give us info on how to best examine call parameters, the first one on how expensive it is to ascend in the tree in the first place. Maybe a benchmark showing how expensive the descent is, is also interesting. This would motivate the layout of the cache. I have a feeling, getting the parent node ( |
OK, using data.table's inst/tests/tests.Rraw since that file is huge. Working on the benchmarks, but first, something surprising: x = get_source_expressions("inst/tests/tests.Rraw")
xml = x$expressions[[length(x$expressions)]]$full_xml_parsed_content system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
# user system elapsed
# 0.162 0.000 0.164
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr"))
# user system elapsed
# 2.231 0.000 2.239
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr"))
# user system elapsed
# 4.364 0.003 4.385 AFAIK a node can only have one parent? So it's very surprising it is so slow to find the parents of a node. There's also this, which is extra puzzling: system.time(xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]"))
# user system elapsed
# 0.352 0.000 0.354
system.time(xml_find_all(xml, "//expr[expr/SYMBOL_FUNCTION_CALL]"))
# user system elapsed
# 4.317 0.016 4.368
system.time(xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]/parent::expr"))
# user system elapsed
# 2.449 0.000 2.455 |
call_symbol = xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")
call_symbol_expr = xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]")
call_expr = xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]/parent::expr")
library(microbenchmark)
microbenchmark(times = 200L,
call_symbol = xml_find_all(call_symbol, "parent::expr/parent::expr[SYMBOL_SUB[text() = 'na.rm']]"),
call_symbol_expr = xml_find_all(call_symbol_expr, "parent::expr[SYMBOL_SUB[text() = 'na.rm']]"),
call_expr = xml_find_all(call_expr, "self::expr[SYMBOL_SUB[text() = 'na.rm']]")
)
microbenchmark(times = 200L,
call_symbol = xml_find_all(call_symbol, "parent::expr/following-sibling::SYMBOL_SUB[text() = 'na.rm']"),
call_symbol_expr = xml_find_all(call_symbol_expr, "following-sibling::SYMBOL_SUB[text() = 'na.rm']"),
call_expr = xml_find_all(call_expr, "./SYMBOL_SUB[text() = 'na.rm']")
)
microbenchmark(times = 200L,
call_symbol = xml_find_all(call_symbol, "parent::expr/following-sibling::expr[1][expr/SYMBOL_FUNCTION_CALL[text() = 'sum']]"),
call_symbol_expr = xml_find_all(call_symbol_expr, "following-sibling::expr[1][expr/SYMBOL_FUNCTION_CALL[text() = 'sum']]"),
call_expr = xml_find_all(call_expr, "./expr[2][expr/SYMBOL_FUNCTION_CALL[text() = 'sum']]")
) Timings in order:
Not a ton of differentiation. Maybe there's a better benchmark to run. |
The third benchmark confuses me. Why would |
We should use x <- get_source_expressions("https://raw.githubusercontent.com/Rdatatable/data.table/master/inst/tests/tests.Rraw")
xml <- x$expressions[[length(x$expressions)]]$full_xml_parsed_content
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
# user system elapsed
# 0.073 0.000 0.072
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr"))
# user system elapsed
# 2.02 0.00 2.02
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr"))
# user system elapsed
# 3.903 0.000 3.904
system.time(xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")))
# user system elapsed
# 0.232 0.000 0.233
system.time(xml_parent(xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))))
# user system elapsed
# 0.590 0.007 0.597 |
Related: If we know the type of the parent node, system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::*"))
# user system elapsed
# 1.989 0.000 1.989
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::*/parent::*"))
# user system elapsed
# 3.877 0.000 3.877 |
Very interesting, seems like libxml2 has some serious issues 😂 |
Note that |
Updated the benchmarks with |
Closes #2431.
Draft for now -- demonstrating what implementation could look like / how XPaths could be adjusted.
It's looking like the most common landing is
parent::expr/parent::expr
. If so, I think it makes sense for the implementation to land the cached calls there by default, and usexml_find_first()
to "descend" optionally, rather than the default behavior being an extra call to "ascend" every time. That is, we want to save the extra call toxml_find_first()
as often as possible.