Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance #14

Closed
wants to merge 6 commits into from
Closed

Improve performance #14

wants to merge 6 commits into from

Conversation

Bisaloo
Copy link
Member

@Bisaloo Bisaloo commented Jul 28, 2024

Should fix epiverse-trace/cleanepi#163 for the most part. In total, this allows a ~20x speedup.

I'll likely propose a C implementation of number_from() in a couple of weeks for fun and as an exercise. It won't need to be merged and can live as an alternative branch for users with high performance need & a C compiler.

main version

library(rio)
library(cleanepi)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) |> 
  cleanepi::standardize_column_names()
#> Warning: The `trust` argument of `import()` should be explicit for serialization formats
#> as of rio 1.0.3.
#> ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
#> ℹ The deprecated feature was likely used in the rio package.
#>   Please report the issue at <https://github.com/gesistsa/rio/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

tictoc::tic()
test <- numberize::numberize(covid$numero_de_hospitalizaciones_recientes, lang = "es")
tictoc::toc()
#> 73.346 sec elapsed

Created on 2024-07-27 with reprex v2.1.1

proposed version

library(rio)
library(cleanepi)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) |> 
  cleanepi::standardize_column_names()
#> Warning: The `trust` argument of `import()` should be explicit for serialization formats
#> as of rio 1.0.3.
#> ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
#> ℹ The deprecated feature was likely used in the rio package.
#>   Please report the issue at <https://github.com/gesistsa/rio/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

tictoc::tic()
test <- numberize::numberize(covid$numero_de_hospitalizaciones_recientes, lang = "es")
tictoc::toc()
#> 3.088 sec elapsed

Created on 2024-07-27 with reprex v2.1.1

Bisaloo and others added 6 commits July 27, 2024 23:32
To avoid the cost of recreating the data.frame each time

3.5x faster than previous commit
2x faster than previous commit
3x faster than the previous commit
@TimTaylor
Copy link

TimTaylor commented Jul 29, 2024

I wonder if you can also get a little speed up by switching the ordering here to check the 100 digit branch first? This likely depends on your inputs but I think you're likely to generally have more 100 digits so you can break out of the loop quicker. It may also be worth being "cute" and avoiding the ifelse() call (actually the ifelse() should probably be an if then else which will probably speed things up as is).

numberize/R/numberize.R

Lines 63 to 72 in dcd2d2e

for (d in digits) {
if (d %in% c(1E3, 1E6, 1E9, 1E12)) {
total <- ifelse(summed == 0, total + d, total + summed * d)
summed <- 0
} else if (d == 100) {
summed <- ifelse(summed == 0, d, summed * d)
} else {
summed <- summed + d
}
}

Changing the above to something like

for (d in digits) {
    if (d == 100) {
      summed <- d + d * (summed -1 ) * (summed != 0)
    } else if (d %in% c(1E3, 1E6, 1E9, 1E12)) {
      total <- total + d + d * (summed - 1) * (summed != 0)
      summed <- 0
    } else {
      summed <- summed + d
    }
  }

I didn't benchmark the dataset above, just some toy examples locally where it seemed to help. It feels like you could do some more complicated vectorisation as well but would need to ponder that some more and may not be worth the effort.

@avallecam
Copy link
Member

avallecam commented Jul 29, 2024

Should fix epiverse-trace/cleanepi#163 for the most part. In total, this allows a ~20x speedup.

👍 running the proposed version took me 4.25 sec elapsed. I would only like to note that after installing pak::pak("epiverse-trace/numberize@improve-performance"), I needed to restart R to make this installed branch work.

@Bisaloo Bisaloo requested a review from bahadzie July 29, 2024 09:55
@Bisaloo
Copy link
Member Author

Bisaloo commented Aug 27, 2024

@bahadzie, would you be able to take care of the conflicts in this PR? Let me know if you've tried and gotten stuck, and I can give it a look myself.

@bahadzie
Copy link
Member

@Bisaloo there a lot of other changes that were implemented as a result of the full package review during CRAN submission that is causing 6 tests currently in main to fail.

In my opinion our options are

  1. rebase the PR on main and fix the merge conflicts
  2. implement the performance improvements from a new PR branching of main

What's your preference?

@Bisaloo
Copy link
Member Author

Bisaloo commented Sep 20, 2024

In my opinion our options are

  1. rebase the PR on main and fix the merge conflicts

  2. implement the performance improvements from a new PR branching of main

What's your preference?

If you're taking care of it, I don't have a preference. Both solutions are valid and will lead to the same result.

@bahadzie
Copy link
Member

I will take care of it.

@Bisaloo
Copy link
Member Author

Bisaloo commented Oct 7, 2024

Superseded by #15

@Bisaloo Bisaloo closed this Oct 7, 2024
@Bisaloo Bisaloo deleted the improve-performance branch October 7, 2024 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

convert_to_numeric() in a dataset of 500,000+ rows took 2.5 minutes
5 participants