> @philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be to write a simple benchmark of gzipped fwrite that you think would capture the important pieces of what's changed here, does that make sense?
OK for a regression test, but note that the core of fwrite hasn't changed: same buffer sizes, same number of jobs, same number of rows per job. Personally, I observe timings similar to the previous version.
One point of discussion: I notice that #2020 introduced a change I had never realized before this PR. By default, fwrite's scipen parameter takes the value of the R option scipen. Like a lot of people, I have options(scipen = 999) in my .Rprofile because I don't like scientific notation on screen, but I had never realized that fwrite was penalized by it. Why? Because maxLineLen can become very high, which produces a lot of batches with few rows each and buffers that are barely filled. In fwrite, scipen is capped at 350, but that is still a very high limit, and maxLineLen is used to determine the number and sizes of fwrite's chunks. To test the impact, I have a little program and compared its verbose output with both settings.
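The benchmark script itself isn't shown above, so as a purely illustrative sketch: assuming a table of 10000 rows by 1000 numeric columns (inferred from the ~46 MiB uncompressed size in the verbose output below, not confirmed in the original post), something along these lines should produce comparable output:

```r
library(data.table)

# Assumed data shape: 10000 rows x 1000 numeric columns (an assumption, not from the original post)
set.seed(1)
N <- 10000
ncol <- 1000
dt <- data.table(i = 1:N)
dt[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]

options(scipen = 0)   # first run; then repeat with options(scipen = 999)
fwrite(dt, tempfile(fileext = ".csv.gz"), compress = "gzip", verbose = TRUE)
```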
With scipen = 0
maxLineLen=61026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 73 batches of 137 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=21719798 (20 MiB), ratio=44.4%, crc=26a23b0a
Written 10000 rows in 1.569 secs using 4 threads. MaxBuffUsed=7%
With scipen = 999
maxLineLen=761026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 910 batches of 11 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=22745125 (21 MiB), ratio=46.5%, crc=26a23b0a
Written 10000 rows in 1.202 secs using 4 threads. MaxBuffUsed=0%
In the last case the real mean line length is ~5000 bytes (48947760 bytes / 10000 rows ≈ 4895), but it is estimated at 761026. The compression ratio is worse (46.5% vs 44.4%) because the buffers are only very lightly filled. Surprisingly, the timing is better despite the OpenMP thread-scheduling overhead of so many small batches.
In my opinion, the scipen parameter should default to 0 in fwrite rather than to the scipen option, which is related to the digits option; digits is not exposed in fwrite, and the number of digits in fwrite's output is higher anyway (20?).
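As a side note, and assuming the current fwrite signature (which exposes scipen as a per-call argument), a user with options(scipen = 999) in their .Rprofile can already avoid the penalty on a per-call basis; a small illustration, reusing dt from the sketch above:

```r
# Override the global scipen option for this call only, so the maxLineLen
# estimate (and hence the batch sizing) stays realistic.
fwrite(dt, tempfile(fileext = ".csv.gz"), compress = "gzip", scipen = 0)
```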
I use this little bench to measure the scipen impact, and I think it can be used for atime. Here is what I've tried to add:
# Issue with fwrite length for gzip output, fixed in: https://github.com/Rdatatable/data.table/pull/6393
# No regression timing test
"No regression fwrite" = atime::atime_test(
N = 10^seq(2, 8),
setup = {
set.seed(1)
ncol = 1000
L <- data.table(i=1:N)
L[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]
},
expr = {
fwrite(dt, "/dev/null", compress="gzip")
},
Fast = "117ab45674f1e56304abca83f9f0df50ab0274be", # Close-to-last merge commit in the PR
Slow = "e73c2c849f921cf4ef51e3809842e0fee9b9f52c"),
but I'm not sure that /dev/null is portable, and if we write to a real file, the disk I/O dominates the timing.
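One possible workaround, assuming nullfile() (available since base R 3.6.0) is acceptable here -- it returns "/dev/null" on Unix-alikes and "nul:" on Windows, though whether fwrite's gzip path handles "nul:" on Windows would need to be verified:

```r
# Portable-ish null sink: nullfile() is "/dev/null" on Unix-alikes, "nul:" on Windows.
fwrite(L, nullfile(), compress = "gzip")

# Fallback: write to a temporary file and accept that disk I/O becomes part of the timing.
f <- tempfile(fileext = ".csv.gz")
fwrite(L, f, compress = "gzip")
unlink(f)
```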
OK for someone else to continue this and test that there is no timing regression.
Originally posted by @philippechataignon in #6393 (comment)