Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Progress bar/indicator for "by" operations #6228

Merged
merged 9 commits into from
Jul 29, 2024
Merged

Progress bar/indicator for "by" operations #6228

merged 9 commits into from
Jul 29, 2024

Conversation

joshhwuu
Copy link
Member

@joshhwuu joshhwuu commented Jul 8, 2024

Closes #3060

Similar to fwrite progress printing, triggers when ETA >= 3s, prints the following, updating every second until complete.

> dt = data.table(a = 1:100000000)
> dt[, 1, by = a]
# Processed 48203243 groups out of 100000000. 48% done. Time elapsed: 10s. ETA: 10s
  • Custom progress() function to show groups done, groups remaining, time elapsed
  • Add new option or argument to trigger progress printing
  • Add to gforce operations?
  • Benchmarking with current (calls to wallclock() are quite expensive) to measure overhead

Benchmarking

With Progress Printing

dt = data.table(a = 1:10000000)
microbenchmark(copy(dt)[, 1, by = a, showProgress = FALSE], times = 50)
Unit: seconds
                                        expr      min       lq     mean   median       uq      max neval
 copy(dt)[, 1, by = a, showProgress = FALSE] 1.994391 2.036843 2.096827 2.069326 2.111358 2.365085    50

microbenchmark(copy(dt)[, 1, by = a, showProgress = TRUE], times = 50)
Unit: seconds
                                       expr      min      lq     mean   median       uq     max neval
 copy(dt)[, 1, by = a, showProgress = TRUE] 2.103596 2.14602 2.182603 2.168911 2.205532 2.40131    50

Current Master

dt = data.table(a = 1:10000000)
microbenchmark(copy(dt)[, 1, by = a], times = 50)
Unit: seconds
                  expr     min     lq     mean   median       uq      max neval
 copy(dt)[, 1, by = a] 2.01209 2.0507 2.113709 2.086281 2.105044 3.377778    50

Performance analysis using atime:

a.result = atime::atime(
  N = 2^seq(2, 24),
  setup = {
    rows.per.group = 10
    N.rows = N*rows.per.group
    DT = data.table(i=1:N.rows, g=rep(1:N, each=rows.per.group))
  },
  seconds.limit = 3,
  default = DT[, 1, by=g],
  "TRUE" = DT[, 1, by=g, showProgress=TRUE],
  "FALSE" = DT[, 1, by=g, showProgress=FALSE]
)
plot(a.result)

image

Copy link

github-actions bot commented Jul 8, 2024

Comparison Plot

Generated via commit 78807e9

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 23 seconds

Time taken to run atime::atime_pkg on the tests: 6 minutes and 12 seconds

@joshhwuu

This comment was marked as outdated.

@tdhock
Copy link
Member

tdhock commented Jul 16, 2024

please add an argument to [.data.table as in fread

     showProgress=getOption("datatable.showProgress", interactive()),

Also please add docs similar to ?fread

showProgress: ‘TRUE’ displays progress on the console if the ETA is
          greater than 3 seconds. It is produced in fread's C code
          where the very nice (but R level) txtProgressBar and
          tkProgressBar are not easily available.

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 16, 2024

New argument added, allows us to do:

DT[, 1, by = a, showProgress=TRUE] (Will also automatically start if interactive() and > ?1s)

R CMD check should pass now.

However, there's a memory leak associated with the changes found by valgrind, will investigate.

Edit: Apparently this also happens with current master... not sure if this is known:

my_test_file.R:

library(devtools)
load_all()
dt = data.table(a = 1:1000000)
dt[, 1, by = a]

R -d "valgrind --tool=memcheck --leak-check=full" --vanilla < my_test_file.R

==190100== 
==190100== HEAP SUMMARY:
==190100==     in use at exit: 144,647,575 bytes in 26,707 blocks
==190100==   total heap usage: 309,984 allocs, 283,277 frees, 448,491,612 bytes allocated
==190100== 
==190100== 2,352 bytes in 7 blocks are possibly lost in loss record 438 of 2,164
==190100==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==190100==    by 0x40147D9: calloc (rtld-malloc.h:44)
==190100==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==190100==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==190100==    by 0x4D597B4: allocate_stack (allocatestack.c:430)
==190100==    by 0x4D597B4: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==190100==    by 0x575D25F: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==190100==    by 0x5753A10: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==190100==    by 0x16BBEB60: forder (forder.c:491)
==190100==    by 0x4958279: ??? (in /usr/lib/R/lib/libR.so)
==190100==    by 0x495888C: ??? (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49AFB97: Rf_eval (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49B3237: ??? (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49AF935: Rf_eval (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49B16DE: ??? (in /usr/lib/R/lib/libR.so)
==190100== 
==190100== LEAK SUMMARY:
==190100==    definitely lost: 0 bytes in 0 blocks
==190100==    indirectly lost: 0 bytes in 0 blocks
==190100==      possibly lost: 2,352 bytes in 7 blocks
==190100==    still reachable: 144,645,223 bytes in 26,700 blocks
==190100==                       of which reachable via heuristic:
==190100==                         newarray           : 4,264 bytes in 1 blocks
==190100==         suppressed: 0 bytes in 0 blocks
==190100== Reachable blocks (those to which a pointer was found) are not shown.
==190100== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190100== 
==190100== For lists of detected and suppressed errors, rerun with: -s
==190100== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@tdhock @Anirban166 have you seen this before? Should I file an issue?

After doing a quick search on some forums, apparently Valgrind often doesn't playtoo well with openMP, this is most likely a false positive from Valgrind:
https://medium.com/@auraham/pseudo-memory-leaks-when-using-openmp-11a383cc4cf9
https://stackoverflow.com/questions/6973489/valgrind-and-openmp-still-reachable-and-possibly-lost-is-that-bad
dmlc/xgboost#8238

R/data.table.R Outdated Show resolved Hide resolved
Co-authored-by: Michael Chirico <[email protected]>
@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

I did some benchmarking https://tdhock.github.io/blog/2024/atime-showProgress/ and it looks like the proposed code with new showProgress arg has the same performance as previous master, at least for small data sizes / compute times. (<0.1 sec).

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 18, 2024

I did some benchmarking https://tdhock.github.io/blog/2024/atime-showProgress/ and it looks like the proposed code with new showProgress arg has the same performance as previous master, at least for small data sizes / compute times. (<0.1 sec).

Great! Really cool article! I included some benchmarks for some large (>2s) operations in the PR description above and it seems that on average the new changes aren't that much slower than current master (at least not by magnitudes). I think performance is nearly identical for small (<1s) operations because a built in feature of the progress indicator is that it doesn't print unless the operation is longer than 1 second, similar to fwrite (3s)

When I log on tomorrow I'll do some more comprehensive benchmarking with some of the steps you've outlined in your article!

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

it doesn't print unless the operation is longer than 1 second, similar to fwrite (3s)

I wonder if we easily can make that time delay (1 or 3 seconds) the same and user controlled? If not in thie PR then please create another issue.

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

For longer by operations to benchmark before and after showProgress, https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 18, 2024

I wonder if we easily can make that time delay (1 or 3 seconds) the same and user controlled? If not in thie PR then please create another issue.

An option to select how long the function to wait before progress printing should be easy to add, it is simply a check. (This means yes, I can change this to match fwrite's 3 seconds, I just did 1 second in this draft so I can easily test the feature without having to use super large data) Not sure how useful it'll be though, seeing as fwrite and fread don't have this option and I've yet to see someone ask for it

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 18, 2024

Also @tdhock and anyone that's interested in this feature, there currently isn't support for any gforce optimized operations (ie sum, mean, order etc), this is because I've only added progress printing to dogroups which is only called whenever gforce = FALSE

Would we want progress printing for gforce operations as well? It will add a little overhead but can be good for consistency

On second note: I took a quick glance at the implementations for the gforce optimized functions, and it seems that if we were to implement this for gforce, I would have to do one for every single separate function, and figure out how to with parallelization. For this PR, I think it may be a little out of scope, unless I find an easier way to do it while I dig around

@MichaelChirico
Copy link
Member

I also don't see a particular need to implement for GForce, besides consistency.

With how GForce actually operates, it's ~roughly equivalent to just doing an un-grouped operation, for which we don't offer any progress bar. So in that sense, offering a progress bar in this case would then be inconsistent with the lack of bar for any j computation.

So, I'm happy to at very least leave this as out of scope for this issue & wait if such request will be made later.

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 18, 2024

Ok, then I believe this PR should be ready to go, are there any preferences to having a 1 second vs 3 second wait for progression printing? Otherwise, I can add a news (does this count as a new feature or a note?) entry, write better documentation

Although I would prefer that we merge #6296 first, as some overlapping variables between the two now that we no longer want to rely on clock()

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

I would prefer consistent 3 second wait.
Is there any other code/data that can obviously be shared between the different progress indicators?

@MichaelChirico
Copy link
Member

I agree on the need for consistency, and a preference to share code between the implementations.

One thing I immediately see is the current implementation uses Rprintf(), which can't be called from fread.c (since that's R-agnostic). I think it will require a bit more investment to get the implementations to share code, therefore I'm OK to defer that as a follow-up issue. WDYT @tdhock?

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

Rather than deferring code sharing to a follow up issue, I would rather get it right here.
At least look at what is common to fread and fwrite, and use the same here.

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 18, 2024

I would prefer consistent 3 second wait.

Ok 👍

Is there any other code/data that can obviously be shared between the different progress indicators?

fread progress printing is just a progress bar, printing out equal signs until it is done using a custom progress function, fwrite prints operation-specific information iteratively similar to 'by' operations, although the information it prints differs. As for gforce, the information would be based on how each of the different optimized functions work internally, ie the differences between gmean and gsum and how they handle groups. AFAIU, gforce doesn't iteratively handle each group, instead it handles every group at once via vectorization, so how progress printing would work isn't immediately clear to me

@joshhwuu
Copy link
Member Author

There's definitely a world where all of the code can use void progress() implemented for fread, (the prototype for this PR actually called that), it just means the progress printing wouldn't be very specific, although I did see here that there might've been intention at one point to use it for fwrite:

// TODO: use progress() as in fread

@joshhwuu joshhwuu marked this pull request as ready for review July 19, 2024 18:09
@tdhock
Copy link
Member

tdhock commented Jul 29, 2024

ok great thanks for the analysis about fread/fwrite, so I guess we create a follow-up issue about sharing progress() between all three.

please add a NEWS item.

@tdhock
Copy link
Member

tdhock commented Jul 29, 2024

great thanks

@tdhock tdhock merged commit 4704c82 into master Jul 29, 2024
5 checks passed
MichaelChirico added a commit that referenced this pull request Sep 4, 2024
* replace calls to clock() with wallclock()

* reorganize and clarify time variables following #6228

* better translatability

* NEWS

* better NEWS

---------

Co-authored-by: Michael Chirico <[email protected]>
int ETA = (int)(avgTimePerGroup*(ngrp-i-1));
if (hasPrinted || ETA >= 0) {
if (verbose && !hasPrinted) Rprintf(_("\n"));
Rprintf(_("\rProcessed %d groups out of %d. %.0f%% done. Time elapsed: %ds. ETA: %ds."), i+1, ngrp, 100.0*(i+1)/ngrp, (int)(now-startTime), ETA);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there inconsistency on whether the message ends with \n?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm correct you are talking about lines 449 and 462? 449 doesn't include \n because it is updated many times and I would like for it to replace the previous line instead of printing to a new line. 462 includes \n because things printed afterwards need to be in a new line (later verbose messages, resulting data.table, etc). I believe fread's progress does something similar where the the final call at the end of the operation prints a complete output with a new line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're worried about this inconsistency I think we can get around it by deferring to a helper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

progress indicator/bar for "by" operations
3 participants