New parameter 'land_on' to xml_find_function_calls() #2496

Draft
wants to merge 3 commits into
base: main

MichaelChirico
Collaborator

Closes #2431.

Draft for now -- demonstrating what implementation could look like / how XPaths could be adjusted.

It's looking like the most common landing is parent::expr/parent::expr. If so, I think it makes sense for the implementation to land the cached calls there by default, and use xml_find_first() to "descend" optionally, rather than the default behavior being an extra call to "ascend" every time. That is, we want to save the extra call to xml_find_first() as often as possible.
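
A minimal sketch of that idea, assuming a hand-written, simplified XML tree in place of lintr's real parse output. The helper name `find_calls_sketch()` and the `land_on` values below are hypothetical, chosen only for illustration, and are not lintr's actual API:

```r
library(xml2)

# Hand-written, simplified stand-in for the parse tree of:
#   sum(x, na.rm = TRUE); mean(y)
xml <- read_xml("<exprlist>
  <expr>
    <expr><SYMBOL_FUNCTION_CALL>sum</SYMBOL_FUNCTION_CALL></expr>
    <OP-LEFT-PAREN>(</OP-LEFT-PAREN>
    <expr><SYMBOL>x</SYMBOL></expr>
    <OP-COMMA>,</OP-COMMA>
    <SYMBOL_SUB>na.rm</SYMBOL_SUB>
    <EQ_SUB>=</EQ_SUB>
    <expr><NUM_CONST>TRUE</NUM_CONST></expr>
    <OP-RIGHT-PAREN>)</OP-RIGHT-PAREN>
  </expr>
  <expr>
    <expr><SYMBOL_FUNCTION_CALL>mean</SYMBOL_FUNCTION_CALL></expr>
    <OP-LEFT-PAREN>(</OP-LEFT-PAREN>
    <expr><SYMBOL>y</SYMBOL></expr>
    <OP-RIGHT-PAREN>)</OP-RIGHT-PAREN>
  </expr>
</exprlist>")

# Hypothetical helper (not lintr's API): land on the full call <expr> by
# default and descend with xml_find_first() only when asked to.
find_calls_sketch <- function(xml,
                              land_on = c("call_expr", "call_symbol_expr", "call_symbol")) {
  land_on <- match.arg(land_on)
  call_expr <- xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr")
  switch(land_on,
    call_expr = call_expr,
    call_symbol_expr = xml_find_first(call_expr, "expr[SYMBOL_FUNCTION_CALL]"),
    call_symbol = xml_find_first(call_expr, "expr/SYMBOL_FUNCTION_CALL")
  )
}

calls <- find_calls_sketch(xml)                    # default: no extra descent
symbols <- find_calls_sketch(xml, "call_symbol")   # optional descent
xml_text(symbols)
```

With this layout the common case pays for no extra lookup, and only callers that need the symbol itself pay for the `xml_find_first()` descent.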

@MichaelChirico MichaelChirico marked this pull request as draft December 20, 2023 02:29
@codecov-commenter

codecov-commenter commented Dec 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f00f4a9) 98.54% compared to head (234c9ff) 98.54%.

❗ Current head 234c9ff differs from pull request most recent head 143cba1. Consider uploading reports for the commit 143cba1 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2496      +/-   ##
==========================================
- Coverage   98.54%   98.54%   -0.01%     
==========================================
  Files         126      126              
  Lines        5720     5713       -7     
==========================================
- Hits         5637     5630       -7     
  Misses         83       83              


@AshesITR
Collaborator

AshesITR commented Dec 20, 2023

I'd be interested in a benchmark for two scenarios:

self::expr[cond] vs. parent::expr[cond]

and

self::expr[expr[2]] vs following-sibling::expr[1] on the appropriate "starting nodes".

The second one will tell us how best to examine call parameters; the first, how expensive it is to ascend the tree in the first place. A benchmark showing how expensive the descent is might also be interesting. Together, these would motivate the layout of the cache.

I have a feeling that getting the parent node (xml_parent()) is cheaper than descending. Maybe it's even worth caching all options directly at the cost of a bit more memory.

@MichaelChirico
Collaborator Author

OK, using data.table's inst/tests/tests.Rraw since that file is huge.

Working on the benchmarks, but first, something surprising:

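library(lintr)
library(xml2)

# The last entry of $expressions holds the parse tree for the whole file.
x = get_source_expressions("inst/tests/tests.Rraw")
xml = x$expressions[[length(x$expressions)]]$full_xml_parsed_content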
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
#    user  system elapsed 
#   0.162   0.000   0.164

system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr"))
#    user  system elapsed 
#   2.231   0.000   2.239 

system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr"))
#    user  system elapsed 
#   4.364   0.003   4.385 

AFAIK a node can only have one parent? So it's very surprising it is so slow to find the parents of a node.

There's also this, which is extra puzzling:

system.time(xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]"))
#    user  system elapsed 
#   0.352   0.000   0.354 
system.time(xml_find_all(xml, "//expr[expr/SYMBOL_FUNCTION_CALL]"))
#    user  system elapsed 
#   4.317   0.016   4.368 
system.time(xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]/parent::expr"))
#    user  system elapsed 
#   2.449   0.000   2.455 

@MichaelChirico
Collaborator Author

MichaelChirico commented Dec 20, 2023

call_symbol = xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")
call_symbol_expr = xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]")
call_expr = xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]/parent::expr")

library(microbenchmark)

microbenchmark(times = 200L,
  call_symbol = xml_find_all(call_symbol, "parent::expr/parent::expr[SYMBOL_SUB[text() = 'na.rm']]"),
  call_symbol_expr = xml_find_all(call_symbol_expr, "parent::expr[SYMBOL_SUB[text() = 'na.rm']]"),
  call_expr = xml_find_all(call_expr, "self::expr[SYMBOL_SUB[text() = 'na.rm']]")
)

microbenchmark(times = 200L,
  call_symbol = xml_find_all(call_symbol, "parent::expr/following-sibling::SYMBOL_SUB[text() = 'na.rm']"),
  call_symbol_expr = xml_find_all(call_symbol_expr, "following-sibling::SYMBOL_SUB[text() = 'na.rm']"),
  call_expr = xml_find_all(call_expr, "./SYMBOL_SUB[text() = 'na.rm']")
)

microbenchmark(times = 200L,
  call_symbol = xml_find_all(call_symbol, "parent::expr/following-sibling::expr[1][expr/SYMBOL_FUNCTION_CALL[text() = 'sum']]"),
  call_symbol_expr = xml_find_all(call_symbol_expr, "following-sibling::expr[1][expr/SYMBOL_FUNCTION_CALL[text() = 'sum']]"),
  call_expr = xml_find_all(call_expr, "./expr[2][expr/SYMBOL_FUNCTION_CALL[text() = 'sum']]")
)

Timings in order:

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
      call_symbol 397.0347 580.1961 783.0970 734.1186 965.4032 1908.150   200   a
 call_symbol_expr 395.7689 577.9087 752.7817 732.6729 929.1027 2275.043   200   a
        call_expr 387.1860 568.2300 771.1322 739.5183 941.1022 1957.151   200   a

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
      call_symbol 373.3066 493.4054 720.3137 677.9868 897.6890 1899.974   200   a
 call_symbol_expr 364.4360 524.0173 711.2700 686.5143 866.4839 1815.027   200   a
        call_expr 363.6562 522.7739 710.8393 687.5207 866.8302 1792.148   200   a

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
      call_symbol 407.0132 557.7811 757.9634 730.3022 939.6353 1829.729   200   a
 call_symbol_expr 396.9409 585.3716 777.3550 732.0125 946.6847 2183.543   200   a
        call_expr 390.7928 579.7217 760.2763 736.1617 935.4866 1386.491   200   a

Not a ton of differentiation. Maybe there's a better benchmark to run.
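
As a sanity check that the three variants in the first benchmark really select the same nodes, here is a small sketch on a hand-written, simplified tree (an assumption standing in for the real tests.Rraw parse output). It only checks equivalence of the results and says nothing about speed:

```r
library(xml2)

# Hand-written, simplified stand-in for: sum(x, na.rm = TRUE); mean(y)
xml <- read_xml("<exprlist>
  <expr>
    <expr><SYMBOL_FUNCTION_CALL>sum</SYMBOL_FUNCTION_CALL></expr>
    <expr><SYMBOL>x</SYMBOL></expr>
    <SYMBOL_SUB>na.rm</SYMBOL_SUB>
    <EQ_SUB>=</EQ_SUB>
    <expr><NUM_CONST>TRUE</NUM_CONST></expr>
  </expr>
  <expr>
    <expr><SYMBOL_FUNCTION_CALL>mean</SYMBOL_FUNCTION_CALL></expr>
    <expr><SYMBOL>y</SYMBOL></expr>
  </expr>
</exprlist>")

# The same three starting node sets as in the benchmark above.
call_symbol <- xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")
call_symbol_expr <- xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]")
call_expr <- xml_find_all(xml, "//expr[SYMBOL_FUNCTION_CALL]/parent::expr")

# The same three relative XPaths, each adjusted for its starting point.
r1 <- xml_find_all(call_symbol, "parent::expr/parent::expr[SYMBOL_SUB[text() = 'na.rm']]")
r2 <- xml_find_all(call_symbol_expr, "parent::expr[SYMBOL_SUB[text() = 'na.rm']]")
r3 <- xml_find_all(call_expr, "self::expr[SYMBOL_SUB[text() = 'na.rm']]")

# Compare node identity via each node's absolute path in the document.
stopifnot(
  identical(xml_path(r1), xml_path(r2)),
  identical(xml_path(r2), xml_path(r3))
)
```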

@AshesITR
Collaborator

The third benchmark confuses me. Why would parent::expr/following-sibling::expr[1] be faster than following-sibling::expr[1]?
Maybe try ./following-sibling::expr[1]?

@AshesITR
Collaborator

> system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
> #    user  system elapsed 
> #   0.162   0.000   0.164
>
> system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr"))
> #    user  system elapsed 
> #   2.231   0.000   2.239 
>
> system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr"))
> #    user  system elapsed 
> #   4.364   0.003   4.385 

We should use xml_parent() instead:

x <- get_source_expressions("https://raw.githubusercontent.com/Rdatatable/data.table/master/inst/tests/tests.Rraw")
xml <- x$expressions[[length(x$expressions)]]$full_xml_parsed_content
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
#   user  system elapsed 
#  0.073   0.000   0.072 
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr"))
#   user  system elapsed 
#   2.02    0.00    2.02 
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr"))
#   user  system elapsed 
#  3.903   0.000   3.904 
system.time(xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")))
#   user  system elapsed 
#  0.232   0.000   0.233 
system.time(xml_parent(xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))))
#   user  system elapsed 
#  0.590   0.007   0.597 
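
A small sketch of why the substitution is safe, assuming a hand-written, simplified tree rather than the real parse output: chaining `xml_parent()` reaches the same nodes as the `parent::expr/parent::expr` XPath step.

```r
library(xml2)

# Hand-written, simplified stand-in for the nested call: sum(mean(x))
xml <- read_xml("<exprlist>
  <expr>
    <expr><SYMBOL_FUNCTION_CALL>sum</SYMBOL_FUNCTION_CALL></expr>
    <expr>
      <expr><SYMBOL_FUNCTION_CALL>mean</SYMBOL_FUNCTION_CALL></expr>
      <expr><SYMBOL>x</SYMBOL></expr>
    </expr>
  </expr>
</exprlist>")

# Ascend twice inside the XPath engine ...
via_xpath <- xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr/parent::expr")

# ... or find the symbols once and ascend with xml_parent().
via_parent <- xml_parent(xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")))

# Both routes should reach the same nodes, compared by absolute path.
stopifnot(identical(xml_path(via_xpath), xml_path(via_parent)))
```

Note that `xml_parent()` does not filter by node name, so this matches `parent::expr` only because the parent and grandparent of a SYMBOL_FUNCTION_CALL are both `expr` nodes in this tree.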

@AshesITR
Collaborator

Related: If we know the type of the parent node, parent::* seems marginally faster than parent::expr (1.989s vs. 2.02s elapsed for one step, 3.877s vs. 3.904s for two):

system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::*"))
#   user  system elapsed 
#  1.989   0.000   1.989 
system.time(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::*/parent::*"))
#   user  system elapsed 
#  3.877   0.000   3.877 

@MichaelChirico
Collaborator Author

> We should use xml_parent() instead:

Very interesting, seems like libxml2 has some serious issues 😂

@MichaelChirico
Collaborator Author

> The third benchmark confuses me. Why would parent::expr/following-sibling::expr[1] be faster than following-sibling::expr[1]? Maybe try ./following-sibling::expr[1]?

Note that [lq, uq] for the former includes the latter, i.e., it could just be noise / they're roughly the same performance.
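
That overlap can be checked directly from the quartiles reported in the third benchmark table. A minimal sketch in base R, with the numbers copied from the table in this thread:

```r
# Quartiles (ms) copied from the third benchmark table in this thread.
bench <- data.frame(
  expr   = c("call_symbol", "call_symbol_expr", "call_expr"),
  lq     = c(557.7811, 585.3716, 579.7217),
  median = c(730.3022, 732.0125, 736.1617),
  uq     = c(939.6353, 946.6847, 935.4866)
)

# inside[i, j] is TRUE when the median of expression j falls within the
# [lq, uq] interval of expression i.
idx <- seq_len(nrow(bench))
inside <- outer(idx, idx, function(i, j) {
  bench$median[j] >= bench$lq[i] & bench$median[j] <= bench$uq[i]
})
dimnames(inside) <- list(interval = bench$expr, median = bench$expr)

inside
all(inside)  # TRUE: every median lies inside every interquartile range
```

Every median sits inside every other variant's interquartile range, which is consistent with reading the differences as noise.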

@MichaelChirico
Collaborator Author

Updated the benchmarks with n=200. There's really not much differentiation -- timings differ by <1% at O(10µs) scale.

@MichaelChirico MichaelChirico mentioned this pull request Mar 20, 2024
Development

Successfully merging this pull request may close these issues.

Should xml_find_function_calls() land on the parent expr instead?
3 participants