-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Progress bar/indicator for "by" operations #6228
Conversation
Generated via commit 78807e9 Download link for the artifact containing the test results: ↓ atime-results.zip Time taken to finish the standard R installation steps: 12 minutes and 23 seconds Time taken to run |
This comment was marked as outdated.
This comment was marked as outdated.
please add an argument to showProgress=getOption("datatable.showProgress", interactive()), Also please add docs similar to ?fread
|
New argument added, allows us to do:
R CMD check should pass now.
Edit: Apparently this also happens with current master... not sure if this is known:
library(devtools)
load_all()
dt = data.table(a = 1:1000000)
dt[, 1, by = a]
@tdhock @Anirban166 have you seen this before? Should I file an issue? After doing a quick search on some forums, apparently Valgrind often doesn't playtoo well with openMP, this is most likely a false positive from Valgrind: |
Co-authored-by: Michael Chirico <[email protected]>
I did some benchmarking https://tdhock.github.io/blog/2024/atime-showProgress/ and it looks like the proposed code with new showProgress arg has the same performance as previous master, at least for small data sizes / compute times. (<0.1 sec). |
Great! Really cool article! I included some benchmarks for some large (>2s) operations in the PR description above and it seems that on average the new changes aren't that much slower than current master (at least not by magnitudes). I think performance is nearly identical for small (<1s) operations because a built in feature of the progress indicator is that it doesn't print unless the operation is longer than 1 second, similar to fwrite (3s) When I log on tomorrow I'll do some more comprehensive benchmarking with some of the steps you've outlined in your article! |
I wonder if we easily can make that time delay (1 or 3 seconds) the same and user controlled? If not in thie PR then please create another issue. |
For longer by operations to benchmark before and after showProgress, https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping |
An option to select how long the function to wait before progress printing should be easy to add, it is simply a check. (This means yes, I can change this to match |
Also @tdhock and anyone that's interested in this feature, there currently isn't support for any Would we want progress printing for On second note: I took a quick glance at the implementations for the |
I also don't see a particular need to implement for GForce, besides consistency. With how GForce actually operates, it's ~roughly equivalent to just doing an un-grouped operation, for which we don't offer any progress bar. So in that sense, offering a progress bar in this case would then be inconsistent with the lack of bar for any So, I'm happy to at very least leave this as out of scope for this issue & wait if such request will be made later. |
Ok, then I believe this PR should be ready to go, are there any preferences to having a 1 second vs 3 second wait for progression printing? Otherwise, I can add a news (does this count as a new feature or a note?) entry, write better documentation Although I would prefer that we merge #6296 first, as some overlapping variables between the two now that we no longer want to rely on |
I would prefer consistent 3 second wait. |
I agree on the need for consistency, and a preference to share code between the implementations. One thing I immediately see is the current implementation uses |
Rather than deferring code sharing to a follow up issue, I would rather get it right here. |
Ok 👍
|
There's definitely a world where all of the code can use Line 957 in 0030b15
|
ok great thanks for the analysis about fread/fwrite, so I guess we create a follow-up issue about sharing progress() between all three. please add a NEWS item. |
great thanks |
* replace calls to clock() with wallclock() * reorganize and clarify time variables following #6228 * better translatability * NEWS * better NEWS --------- Co-authored-by: Michael Chirico <[email protected]>
int ETA = (int)(avgTimePerGroup*(ngrp-i-1)); | ||
if (hasPrinted || ETA >= 0) { | ||
if (verbose && !hasPrinted) Rprintf(_("\n")); | ||
Rprintf(_("\rProcessed %d groups out of %d. %.0f%% done. Time elapsed: %ds. ETA: %ds."), i+1, ngrp, 100.0*(i+1)/ngrp, (int)(now-startTime), ETA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is there inconsistency on whether the message ends with \n
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm correct you are talking about lines 449 and 462? 449 doesn't include \n
because it is updated many times and I would like for it to replace the previous line instead of printing to a new line. 462 includes \n
because things printed afterwards need to be in a new line (later verbose messages, resulting data.table, etc). I believe fread
's progress
does something similar where the the final call at the end of the operation prints a complete output with a new line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're worried about this inconsistency I think we can get around it by deferring to a helper
Closes #3060
Similar to
fwrite
progress printing, triggers when ETA >= 3s, prints the following, updating every second until complete.progress()
function to show groups done, groups remaining, time elapsedAdd to gforce operations?wallclock()
are quite expensive) to measure overheadBenchmarking
With Progress Printing
Current Master
Performance analysis using atime: