Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use file-level expressions and avoid slow loop for whitespace_linter() for up to 50% speedup #2024

Merged
merged 8 commits into from
Jul 29, 2023

Conversation

MichaelChirico
Copy link
Collaborator

@MichaelChirico MichaelChirico commented Jul 27, 2023

I was testing out why our lint suite ran so slow on a particular long file & found somewhat surprisingly that whitespace_linter() was the slowest of ~100 linters.

We can test this on R's src/library/tools/R/QC.R:

exp <- get_source_expressions("QC.R")
l <- whitespace_linter()
system.time(replicate(100, lapply(exp$expressions, l)))

## BEFORE
#    user  system elapsed 
# 398.371   1.090 400.302 

## AFTER
#    user  system elapsed 
#   1.153   0.004   1.163 

End-to-end timing including get_source_expressions():

system.time(replicate(10, lint("QC.R", whitespace_linter())))

## BEFORE
#    user  system elapsed 
#  66.333   1.945  68.471 

## AFTER
#    user  system elapsed 
#  31.435   0.819  32.291 

So an end-to-end (user-facing) 50% improvement, inclusive of parsing time.

So while this approach is not as cache-friendly, of course checking initial whitespace in a file should not be expensive, so I don't think we lose much by switching to a file-level search.

@MichaelChirico
Copy link
Collaborator Author

Hmm, I think the error fixed by b5793a7 also reduced the efficiency improvement. I only saw a 3% improvement just now.

@jimhester
Copy link
Member

Can you update the timings after the full change. I doubt they will change much, but still good to be sure

@MichaelChirico
Copy link
Collaborator Author

I'm looking more closely at what the bottleneck is, and re_matches() seems to be a big piece of it

c.f.

source_expressions <- get_source_expressions("QC.R")$expressions
full_expression <- source_expressions[[length(source_expressions)]]

regex <- "^(?:\\s)*(?:\t)+"

system.time(re_matches(full_expression[["file_lines"]], regex, locations = TRUE, global = TRUE))
#    user  system elapsed 
#   1.956   0.016   1.982 
system.time(grepl(regex, full_expression[["file_lines"]]))
#    user  system elapsed 
#    0.01    0.00    0.01 

@MichaelChirico
Copy link
Collaborator Author

MichaelChirico commented Jul 27, 2023

I think the issue is global=TRUE. That regex doesn't need global=TRUE -- we can use regexpr() and avoid gregexpr().

system.time(re_matches(full_expression[["file_lines"]], regex, locations = TRUE, global = FALSE))
#    user  system elapsed 
#   0.002   0.000   0.002 

global=TRUE is slow because it has to do a slow iteration over a long list of gregexpr() outputs, few of which even have regex matches.

@MichaelChirico
Copy link
Collaborator Author

Bad news: my simple 5-character PR has ballooned substantially

Good news: this PR now marks a 400x (!!) speed improvement for QC.R

(timings added to the top)

@codecov-commenter
Copy link

codecov-commenter commented Jul 27, 2023

Codecov Report

Merging #2024 (181d72d) into main (1b459fc) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 181d72d differs from pull request most recent head 28c4e98. Consider uploading reports for the commit 28c4e98 to get more accurate results

@@            Coverage Diff             @@
##             main    #2024      +/-   ##
==========================================
- Coverage   98.56%   98.55%   -0.01%     
==========================================
  Files         113      113              
  Lines        5006     4995      -11     
==========================================
- Hits         4934     4923      -11     
  Misses         72       72              
Files Changed Coverage Δ
R/make_linter_from_regex.R 100.00% <100.00%> (ø)

@MichaelChirico MichaelChirico changed the title Use file-level expressions for whitespace_linter() for up to 50% speedup Use file-level expressions and avoid slow loop for whitespace_linter() for up to 50% speedup Jul 29, 2023
@MichaelChirico MichaelChirico merged commit c3c10bb into main Jul 29, 2023
21 checks passed
@MichaelChirico MichaelChirico deleted the MichaelChirico-patch-1 branch July 29, 2023 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants