If you are viewing this file on CRAN, please check latest news on GitHub here.
-
Empty RHS of `:=` is no longer an error when the `i` clause returns no rows to assign to anyway, #2829. Thanks to @cguill95 for reporting and to @MarkusBonsch for fixing.
-
Fixed runaway memory usage with R-devel (R > 3.5.0), #2882. Thanks to many people but in particular to Trang Nguyen for making the breakthrough reproducible example, Paul Bailey for liaising, and Luke Tierney for then pinpointing the issue. It was caused by an interaction of two or more data.table threads operating on new compact vectors in the ALTREP framework, such as the sequence `1:n`. This interaction could result in R's garbage collector turning off, and hence the memory explosion. Problems may occur in R 3.5.0 too but we were only able to reproduce in R > 3.5.0. The R code in data.table's implementation benefits from ALTREP (`for` loops in R no longer allocate their range vector input, for example) but compact vectors are not so appropriate as data.table columns. Sequences such as `1:n` are common in test data but not very common in real-world datasets. Therefore, there is no need for data.table to support columns which are ALTREP compact sequences. The `data.table()` function already expanded compact vectors (by happy accident) but `setDT()` did not (it now does). If, somehow, a compact vector still reaches the internal parallel regions, a helpful error will now be generated. If this happens, please report it as a bug.
-
Tests 1590.3 & 1590.4 now pass when users run `test.data.table()` on Windows, #2856. Thanks to Avraham Adler for reporting. Those tests were passing on AppVeyor, win-builder and CRAN's Windows because `R CMD check` sets `LC_COLLATE=C` as documented in R-exts$1.3.1, whereas by default on Windows `LC_COLLATE` is usually a regional Windows-1252 dialect such as `English_United States.1252`.
-
Around 1 billion very small groups (of size 1 or 2 rows) could result in `"Failed to realloc working memory"` even when plenty of memory is available, #2777. Thanks once again to @jsams for the detailed report as a follow up to bug fix 40 in v1.11.0.
-
`test.data.table()` created/overwrote variable `x` in `.GlobalEnv`, #2828; i.e. a modification of user's workspace which is not allowed. Thanks to @etienne-s for reporting.
-
`as.chron` methods for `IDate` and `ITime` have been removed, #2825. `as.chron` still works since `IDate` inherits from `Date`. We are not sure why we had specific methods in the first place. It may have been from a time when `IDate` did not inherit from `Date`, perhaps. Note that we don't use `chron` ourselves in our own work.
-
Fixed `SETLENGTH() cannot be applied to an ALTVEC object` starting in R-devel (R 3.6.0) on 1 May 2018, a few hours after 1.11.0 was accepted on CRAN, #2820. Many thanks to Luke Tierney for pinpointing the problem.
-
Fixed some rare memory faults in `fread()` and `rbindlist()` found with `gctorture2()` and `rchk`, #2841.
-
`fread()`'s `na.strings=` argument :

    "NA"                                      # old default
    getOption("datatable.na.strings", "NA")   # this release; i.e. the same; no change yet
    getOption("datatable.na.strings", "")     # future release

This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,` => `NA` for consistency with numeric types, and `,"",` => empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"` in case `"NA"` is a valid string in the data; e.g., 2-character id columns sometimes contain it. Instead, `fwrite` has always written `,,` by default for an `<NA>` in a character column. The use of R's `getOption()` allows users to move forward now, using `options(datatable.na.strings="")`, or restore old behaviour when the default's default is changed in future, using `options(datatable.na.strings="NA")`.
-
`fread()` and `fwrite()`'s `logical01=` argument :

    logical01 = FALSE                         # old default
    getOption("datatable.logical01", FALSE)   # this release; i.e. the same; no change yet
    getOption("datatable.logical01", TRUE)    # future release

This option controls whether a column of all 0's and 1's is read as `integer`, or `logical` directly to avoid needing to change the type afterwards to `logical` or use `colClasses`. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. When the default's default changes to `TRUE` for `fread` we do not expect much impact since all arithmetic operators that are currently receiving 0's and 1's as type `integer` (think `sum()`) but instead could receive `logical`, would return exactly the same result on the 0's and 1's as `logical` type. However, code that is manipulating column types using `is.integer` or `is.logical` on `fread`'s result, could require change. It could be painful if `DT[(logical_column)]` (i.e. `DT[logical_column==TRUE]`) changed behaviour due to `logical_column` no longer being type `logical` but `integer`. But that is not the change proposed. The change is the other way around; i.e., a previously `integer` column holding only 0's and 1's would now be type `logical`. Since it's that way around, we believe the scope for breakage is limited. We think a lot of code is converting 0/1 integer columns to logical anyway, either using `colClasses=` or afterwards with an assign. For `fwrite`, the level of breakage depends on the consumer of the output file. We believe `0/1` is a better more standard default choice to move to. See notes below about improvements to `fread`'s sampling for type guessing, and automatic rereading in the rare cases of out-of-sample type surprises.
These options are meant for temporary use to aid your migration, #2652. You are not meant to set them to the old default and then not migrate your code that is dependent on the default. Either set the argument explicitly so your code is not dependent on the default, or change the code to cope with the new default. Over the next few years we will slowly start to remove these options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. For example, at the end of NOTES for this version (below in this file) is a note about the usage of `datatable.old.unique.by.key` now warning, as you were warned it would do over a year ago. When that change was introduced, the default was changed and that option provided a way to restore the old behaviour. These `fread`/`fwrite` changes are even more cautious and do not even change the default's default yet. This notice gives you extra warning to move forward, and a chance to object.
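One way to follow the advice above and keep code independent of the defaults is to pass both arguments explicitly. The following is a minimal sketch only; the input string and its column names are made up for illustration.

```r
library(data.table)

# Hypothetical input: `name` holds an empty field, a quoted empty string and the text NA;
# `flag` holds only 0s and 1s.
txt <- 'id,name,flag\n1,,0\n2,"",1\n3,NA,1\n'

# Explicit arguments, so the result does not depend on the defaults or on options().
DT <- fread(txt, na.strings = "", logical01 = TRUE)

is.na(DT$name)   # which fields were read as missing
class(DT$flag)   # type chosen for the 0/1 column
```

Setting the arguments at the call site, rather than through `options()`, keeps the behaviour visible where the file is read.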
-
`fread()`:
- Efficiency savings at C level including parallelization announced here; e.g. a 9GB 2 column integer csv input is 50s down to 12s to cold load on a 4 core laptop with 16GB RAM and SSD. Run `echo 3 >/proc/sys/vm/drop_caches` first to measure cold load time. Subsequent load time (after file has been cached by OS on the first run) 40s down to 6s.
- The fread for small data page has been revised.
- Memory maps lazily; e.g. reading just the first 10 rows with `nrow=10` is 12s down to 0.01s from cold for the 9GB file. Large files close to your RAM limit may work more reliably too. The progress meter will commence sooner and more consistently.
- `fread` has always jumped to the middle and to the end of the file for a much improved column type guess. The sample size is increased from 100 rows at 10 jump points (1,000 rows) to 100 rows at 100 jump points (10,000 row sample). In the rare case of there still being out-of-sample type exceptions, those columns are now automatically reread so you don't have to use `colClasses` yourself.
- Large number of columns support; e.g. 12,000 columns tested.
- Quoting rules are more robust and flexible. See point 10 on the wiki page here.
- Numeric data that has been quoted is now detected and read as numeric.
- The ability to position `autostart` anywhere inside one of multiple tables in a single file is removed with warning. It used to search upwards from that line to find the start of the table based on a consistent number of columns. People appear to be using `skip="string"` or `skip=nrow` to find the header row exactly, which is retained and simpler. It was too difficult to retain search-upwards-autostart together with skipping/filling blank lines, filling incomplete rows and parallelization too. If there is any header info above the column names, it is still auto detected and auto skipped (particularly useful when loading a set of files where the column names start on different lines due to a varying height messy header).
- `dec=','` is now implemented directly so there is no dependency on locale. The options `datatable.fread.dec.experiment` and `datatable.fread.dec.locale` have been removed.
- `\\r\\r\\n` line endings are now handled such as produced by `base::download.file()` when it doubles up `\\r`. Other rare line endings (`\\r` and `\\n\\r`) are now more robust.
- Mixed line endings are now handled; e.g. a file formed by concatenating a Unix file and a Windows file so that some lines end with `\\n` while others end with `\\r\\n`.
- Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first row against the column types ascertained by the 10,000 rows sample (or `colClasses` if provided). If a numeric column has a string value at the top, then column names are deemed present.
- Detects GB-18030 and UTF-16 encodings and in verbose mode prints a message about BOM detection.
- Detects and ignores trailing ^Z end-of-file control character sometimes created on MS DOS/Windows, #1612. Thanks to Gergely Daróczi for reporting and providing a file.
- Added ability to recognize and parse hexadecimal floating point numbers, as used for example in Java. Thanks to @scottstanfield for the report, #2316.
- Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, #1800. Thanks to Jori Liesenborgs for highlighting and the PR.
- If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, #2423. Thanks to Michael Chirico for the suggestion.
- `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, #1616. `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` is now deprecated and in future will start to warn when used.
- Single-column input with blank lines is now valid and the blank lines are significant (representing `NA`). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing `NA` which are written as blank. There is no change when `ncol>1`; i.e., input stops with detailed warning at the first blank line, because a blank line when `ncol>1` is invalid input due to no separators being present. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, #2106.
- Too few column names are now auto filled with default column names, with warning, #1625. If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. Similarly, too many column names now automatically sets `fill=TRUE`, with warning.
- `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, #1267.
- Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., #1139 and zUMIs/19. Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set the `TMPDIR` environment variable to `/dev/shm`; see `?tempdir`.
- Detecting whether a very long input string is a file name or data is now much faster, #2531. Many thanks to @javrucebo for the detailed report, benchmarks and suggestions.
- A column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.
- New argument `index` to complement the existing `key` argument for applying secondary orderings out of the box for convenience, #2633.
- A warning is now issued whenever incorrectly quoted fields have been detected and fixed using a non-standard quote rule. `fread` has always used these advanced rules but now it warns that it is using them. Most file writers correctly quote fields if the field contains the field separator, but a common error is not to also quote fields that contain a quote and then escape those quotes, particularly if that quote occurs at the start of the field. The ability to detect and fix such files is referred to as self-healing. Ambiguities are resolved using the knowledge that the number of columns is constant, and therefore this ability is not available when `fill=TRUE`. This feature can be improved in future by using column type consistency as well as the number of fields.

    txt = 'A,B\n1,hello\n2,"howdy" said Joe\n3,bonjour\n'
    cat(txt)
    # A,B
    # 1,hello
    # 2,"howdy" said Joe
    # 3,bonjour
    fread(txt)
           A                B
       <int>           <char>
    1:     1            hello
    2:     2 "howdy" said Joe
    3:     3          bonjour
    Warning message:
    In fread(txt) : Found and resolved improper quoting
- Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney, Ananda Mahto, @memoryfull, @brandenkmurray for testing dev and reporting these regressions before release to CRAN: #1464, #1671, #1888, #1895, #2070, #2073, #2087, #2091, #2092, #2107, #2118, #2123, #2167, #2194, #2196, #2201, #2222, #2228, #2238, #2246, #2251, #2265, #2267, #2285, #2287, #2299, #2322, #2347, #2352, #2370, #2371, #2395, #2404, #2446, #2453, #2457, #2464, #2481, #2499, #2512, #2515, #2516, #2518, #2520, #2523, #2526, #2535, #2542, #2548, #2561, #2600, #2625, #2666, #2697, #2735, #2744.
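
Two of the `fread()` items above, `dec=','` and the new `index` argument, can be tried together on a small string input. This is a sketch under assumptions: the data and the column names `grp` and `price` are invented for illustration.

```r
library(data.table)

# Hypothetical semicolon-separated input that uses a decimal comma.
txt <- "grp;price\na;1,5\nb;2,25\nc;10,0\n"

# dec="," is handled by fread itself, independent of the session locale;
# index= sets a secondary index on the result in the same call.
DT <- fread(txt, sep = ";", dec = ",", index = "grp")

DT$price      # numeric values parsed from the decimal-comma fields
indices(DT)   # secondary indices currently set on DT
```

Passing `sep` explicitly here is a cautious choice for such a tiny input; on real files the default `"auto"` detection may well be sufficient.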
-
`fwrite()`:
- empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request #2214 and Pasha for the code change and tests, #2215.
- `logical01` has been added and the old name `logicalAsInt` retained. Please move to the new name when convenient for you. The old argument name (`logicalAsInt`) will slowly be deprecated over the next few years. The default is unchanged: `FALSE`, so `logical` is still written as `"TRUE"`/`"FALSE"` in full by default. We intend to change the default's default in future to `TRUE`; see the notice at the top of these release notes.
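
The two `fwrite()` changes can be seen by writing a small table to a temporary file and reading the raw lines back. A minimal sketch, assuming an illustrative table with columns `id`, `s` and `flag`:

```r
library(data.table)

# Illustrative table: an ordinary string, an empty string and a missing value in `s`.
DT <- data.table(id = 1:3, s = c("a", "", NA), flag = c(TRUE, FALSE, NA))

f <- tempfile(fileext = ".csv")   # temporary output file
fwrite(DT, f, logical01 = TRUE)   # write logical as 1/0 rather than TRUE/FALSE
out <- readLines(f)
print(out)                        # inspect how "" and NA differ in the raw file
unlink(f)
```

Reading the file with `readLines()` rather than `fread()` shows the quoting exactly as written.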
-
Added helpful message when subsetting by a logical column without wrapping it in parentheses, #1844. Thanks @dracodoc for the suggestion and @MichaelChirico for the PR.
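
For reference, the parenthesised form that the new message points to looks like this. The table and the column name `keep` are illustrative only.

```r
library(data.table)

DT <- data.table(id = 1:4, keep = c(TRUE, FALSE, TRUE, FALSE))

# Parentheses make `keep` evaluate as a column, giving a logical row filter.
a <- DT[(keep)]
# Equivalent explicit comparison.
b <- DT[keep == TRUE]

identical(a$id, b$id)   # both forms select the same rows
```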
-
`tables` gains `index` argument for supplementary metadata about `data.table`s in memory (or any optionally specified environment), part of #1648. Thanks due variously to @jangorecki, @rsaporta, @MichaelChirico for ideas and work towards PR.
-
Improved auto-detection of `character` inputs' formats to `as.ITime` to mirror the logic in `as.POSIXlt.character`, #1383. Thanks @franknarf1 for identifying a discrepancy and @MichaelChirico for investigating.
-
`setcolorder()` now accepts fewer than `ncol(DT)` columns to be moved to the front, #592. Thanks @MichaelChirico for the PR. This also incidentally fixed #2007 whereby explicitly setting `select = NULL` in `fread` errored; thanks to @rcapell for reporting that and @dselivanov and @MichaelChirico for investigating and providing a new test.
-
Three new Grouping Sets functions: `rollup`, `cube` and `groupingsets`, #1377. They allow aggregation on various grouping levels at once, producing sub-totals and a grand total.
-
`as.data.table()` gains new method for `array`s to return a useful data.table, #1418.
-
`print.data.table()` (all via master issue #1523):
- gains `print.keys` argument, `FALSE` by default, which displays the keys and/or indices (secondary keys) of a `data.table`. Thanks @MichaelChirico for the PR, Yike Lu for the suggestion and Arun for honing that idea to its present form.
- gains `col.names` argument, `"auto"` by default, which toggles which registers of column names to include in printed output. `"top"` forces `data.frame`-like behavior where column names are only ever included at the top of the output, as opposed to the default behavior which appends the column names below the output as well for longer (>20 rows) tables. `"none"` shuts down column name printing altogether. Thanks @MichaelChirico for the PR, Oleg Bondar for the suggestion, and Arun for guiding commentary.
- list columns would print the first 6 items in each cell followed by a comma if there are more than 6 in that cell. Now it ends ",..." to make it clearer, part of #1523. Thanks to @franknarf1 for drawing attention to an issue raised on Stack Overflow by @TMOTTM here.
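
The two new printing arguments can be compared on a table longer than 20 rows. This is only a sketch; the 30-row table and its column names are assumptions made for the example.

```r
library(data.table)

DT <- data.table(id = 1:30, value = (1:30) / 2, key = "id")

# Default ("auto"): for tables longer than 20 rows the column names are repeated below the output.
out_auto <- capture.output(print(DT))
# "top": column names only at the top, as print.data.frame does.
out_top <- capture.output(print(DT, col.names = "top"))

length(out_auto) - length(out_top)   # the extra line is the repeated header

# Show the key (and any indices) above the table.
print(DT, print.keys = TRUE, col.names = "top")
```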
-
`setkeyv` accelerated if key already exists, #2331. Thanks to @MarkusBonsch for the PR.
-
Keys and indexes are now partially retained up to the key column assigned to with ':=', #2372. They used to be dropped completely if any one of the columns was affected by `:=`. Thanks to @MarkusBonsch for the PR.
-
Faster `as.IDate` and `as.ITime` methods for `POSIXct` and `numeric`, #1392. Thanks to Jan Gorecki for the PR.
-
`unique(DT)` now returns `DT` early when there are no duplicates to save RAM, #2013. Thanks to Michael Chirico for the PR, and thanks to @mgahan for pointing out a reversion in `na.omit.data.table` before release, #2660.
-
`uniqueN()` is now faster on logical vectors. Thanks to Hugh Parsonage for PR#2648.

    N = 1e9
                                          was      now
    x = c(TRUE,FALSE,NA,rep(TRUE,N))
    uniqueN(x) == 3                      5.4s    0.00s
    x = c(TRUE,rep(FALSE,N), NA)
    uniqueN(x,na.rm=TRUE) == 2           5.4s    0.00s
    x = c(rep(TRUE,N),FALSE,NA)
    uniqueN(x) == 3                      6.7s    0.38s
-
Subsetting optimization with keys and indices is now possible for compound queries like `DT[a==1 & b==2]`, #2472. Thanks to @MichaelChirico for reporting and to @MarkusBonsch for the implementation.
-
`melt.data.table` now offers friendlier functionality for providing `value.name` for `list` input to `measure.vars`, #1547. Thanks @MichaelChirico and @franknarf1 for the suggestion and use cases, @jangorecki and @mrdwab for implementation feedback, and @MichaelChirico for ultimate implementation.
-
`update.dev.pkg` is a new function to update a package from a development repository; it will download package sources only when a newer commit is available in the repository. `data.table::update.dev.pkg()` updates `data.table` by default, but any package can be used.
-
Item 1 in NEWS for v1.10.2 on CRAN in Jan 2017 included :

> When j is a symbol prefixed with `..` it will be looked up in calling scope and its value taken to be column names or numbers. When you see the `..` prefix think one-level-up, like the directory `..` in all operating systems means the parent directory. In future the `..` prefix could be made to work on all symbols appearing anywhere inside `DT[...]`.

The response has been positive (this tweet and FR#2655) and so this prefix is now expanded to all symbols appearing in `j=` as a first step; e.g. :

    cols = "colB"
    DT[, c(..cols, "colC")]   # same as DT[, .(colB,colC)]
    DT[, -..cols]             # all columns other than colB

Thus, `with=` should no longer be needed in any cases. Please change to using the `..` prefix and over the next few years we will start to formally deprecate and remove the `with=` parameter. If this is well received, the `..` prefix could be expanded to symbols appearing in `i=` and `by=`, too. Note that column names should not now start with `..`. If a symbol `..var` is used in `j=` but `..var` exists as a column name, the column still takes precedence, for backwards compatibility. Over the next few years, data.table will start issuing warnings/errors when it sees column names starting with `..`. This affects one CRAN package out of 475 using data.table, so we do not believe this restriction to be unreasonable. Our main focus here which we believe `..` achieves is to resolve the more common ambiguity when `var` is in calling scope and `var` is a column name too. Further, we have not forgotten that in the past we recommended prefixing the variable in calling scope with `..` yourself. If you did that and `..var` exists in calling scope, that still works, provided neither `var` exists in calling scope nor `..var` exists as a column name. Please now remove the `..` prefix on `..var` in calling scope to tidy this up. In future data.table will start to warn/error on such usage.
-
`setindexv` can now assign multiple (separate) indices by accepting a `list` in the `cols` argument.
-
`as.matrix.data.table` method now has an additional `rownames` argument allowing for a single column to be used as the `rownames` after conversion to a `matrix`. Thanks to @sritchie73 for the suggestion, use cases, #2692 and implementation PR#2702 and @MichaelChirico for additional use cases.
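
A small sketch of the `rownames` argument described above, using an illustrative table whose `id` column supplies the row names:

```r
library(data.table)

DT <- data.table(id = c("a", "b", "c"), x = 1:3, y = c(10, 20, 30))

# Use the `id` column as row names; it is dropped from the matrix body.
m <- as.matrix(DT, rownames = "id")
m
```

Without `rownames=`, the character `id` column would be kept as a matrix column and the whole matrix would become character.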
-
The new quote rules handle this single field `"Our Stock Screen Delivers an Israeli Software Company (MNDO, CTCH)<\/a> SmallCapInvestor.com - Thu, May 19, 2011 10:02 AM EDT<\/cite><\/div>Yesterday in \""Google, But for Finding Great Stocks\"", I discussed the value of stock screeners as a powerful tool"`, #2051. Thanks to @scarrascoso for reporting. Example file added to test suite.
-
`fwrite()` creates a file with permissions that now play correctly with `Sys.umask()`, #2049. Thanks to @gnguy for reporting.
-
`fread()` no longer holds an open lock on the file when a line outside the large sample has too many fields and generates an error, #2044. Thanks to Hugh Parsonage for reporting.
-
Setting `j = {}` no longer results in an error, #2142. Thanks Michael Chirico for the pull request.
-
Segfault in `rbindlist()` when one or more items are empty, #2019. Thanks Michael Lang for the pull request. Another segfault if the result would be more than 2bn rows, thanks to @jsams's comment in #2340.
-
Error printing 0-length `ITime` and `NA` objects, #2032 and #2171. Thanks Michael Chirico for the pull requests and @franknarf1 for pointing out a shortcoming of the initial fix.
-
`as.IDate.POSIXct` error with `NULL` timezone, #1973. Thanks @lbilli for reporting and Michael Chirico for the pull request.
-
Printing a null `data.table` with `print` no longer visibly outputs `NULL`, #1852. Thanks @aaronmcdaid for spotting and @MichaelChirico for the PR.
-
`data.table` now works with Shiny Reactivity / Flexdashboard. The error was typically something like `col not found` in `DT[col==val]`. Thanks to Dirk Eddelbuettel for leading Matt through reproducible steps and @sergeganakou and Richard White for reporting. Closes #2001 and shiny/#1696.
-
The `as.IDate.POSIXct` method passed `tzone` along but was not exported. So `tzone` is now taken into account by `as.IDate` too as well as `IDateTime`, #977 and #1498. Tests added.
-
Named logical vector now selects rows as expected from single row data.table. Thanks to @skranz for reporting. Closes #2152.
-
`fread()`'s rare `Internal error: Sampling jump point 10 is before the last jump ended` has been fixed, #2157. Thanks to Frank Erickson and Artem Klevtsov for reporting with example files which are now added to the test suite.
-
`CJ()` no longer loses attribute information, #2029. Thanks to @MarkusBonsch and @royalts for the pull request.
-
`split.data.table` respects `factor` ordering in `by` argument, #2082. Thanks to @MichaelChirico for identifying and fixing the issue.
-
`.SD` would incorrectly include symbol on lhs of `:=` when `.SDcols` is specified and `get()` appears in `j`. Thanks @renkun-ken for reporting and the PR, and @ProfFancyPants for reporting a regression introduced in the PR. Closes #2326 and #2338.
-
Integer values that are too large to fit in `int64` will now be read as strings, #2250.
-
Internal-only `.shallow` now retains keys correctly, #2336. Thanks to @MarkusBonsch for reporting, fixing (PR #2337) and adding 37 tests. This much advances the journey towards exporting `shallow()`, #2323.
-
`isoweek` calculation is correct regardless of local timezone setting (`Sys.timezone()`), #2407. Thanks to @MoebiusAV and @SimonCoulombe for reporting and @MichaelChirico for fixing.
-
Fixed `as.xts.data.table` to support all xts supported time based index classes, #2408. Thanks to @ebs238 for reporting and for the PR.
-
A memory leak when a very small number such as `0.58E-2141` is bumped to type `character` is resolved, #918.
-
The edge case `setnames(data.table(), character(0))` now works rather than error, #2452.
-
Order of rows returned in non-equi joins was incorrect in certain scenarios as reported under #1991. This is now fixed. Thanks to @Henrik-P for reporting.
-
Non-equi joins work as expected when `x` in `x[i, on=...]` is a 0-row data.table. Closes #1986.
-
Non-equi joins along with `by=.EACHI` returned incorrect result in some rare cases as reported under #2360. This is fixed now. This fix also takes care of #2275. Thanks to @ebs238 for the nice minimal reproducible report, @Mihael for asking on SO and to @Frank for following up on SO and filing an issue.
-
`by=.EACHI` works now when `list` columns are being returned and some join values are missing, #2300. Thanks to @jangorecki and @franknarf1 for the reproducible examples which have been added to the test suite.
-
Indices are now retrieved by exact name, #2465. This prevents usage of wrong indices as well as unexpected row reordering in join results. Thanks to @pannnda for reporting and providing a reproducible example and to @MarkusBonsch for fixing.
-
`setnames` of whole table when original table had `NA` names skipped replacing those, #2475. Thanks to @franknarf1 and BenoitLondon on StackOverflow for the report and @MichaelChirico for fixing.
-
`CJ()` works with multiple empty vectors now, #2511. Thanks to @MarkusBonsch for fixing.
-
`:=` assignment of one vector to two or more columns, e.g. `DT[, c("x", "y") := 1:10]`, failed to copy the `1:10` data causing errors later if and when those columns were updated by reference, #2540. This is an old issue (#185) that had been fixed but reappeared when code was refactored. Thanks to @patrickhowerter for the detailed report with reproducible example and to @MarkusBonsch for fixing and strengthening tests so it doesn't reappear.
-
"Negative length vectors not allowed" error when grouping
`median` and `var` fixed, #2046 and #2111. Thanks to @caneff and @osofr for reporting and to @kmillar for debugging and explaining the cause.
-
Fixed a bug on Windows where `data.table`s containing non-UTF8 strings in `key`s were not properly sorted, #2462, #1826 and StackOverflow. Thanks to @shrektan for reporting and fixing.
-
`x.` prefixes during joins sometimes resulted in a "column not found" error. This is now fixed. Closes #2313. Thanks to @franknarf1 for the MRE.
-
`setattr()` no longer segfaults when setting 'class' to empty character vector, #2386. Thanks to @hatal175 for reporting and to @MarkusBonsch for fixing.
-
Fixed cases where the result of `merge.data.table()` would contain duplicate column names if `by.x` was also in `names(y)`. `merge.data.table()` gains the `no.dups` argument (default TRUE) to match the corresponding patched behaviour in `base:::merge.data.frame()`. Now, when `by.x` is also in `names(y)` the column name from `y` has the corresponding `suffixes` added to it. `by.x` remains unchanged for backwards compatibility reasons. In addition, where duplicate column names arise anyway (i.e. `suffixes = c("", "")`) `merge.data.table()` will now throw a warning to match the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reporting and fixing PR#2631 and PR#2653.
-
`CJ()` now fails with proper error message when results would exceed max integer, #2636.
-
`NA` in character columns now displays as `<NA>` just like base R to distinguish from `""` and `"NA"`.
-
`getDTthreads()` could return INT_MAX (2 billion) after an explicit call to `setDTthreads(0)`, PR#2708.
-
Fixed a bug on Windows where `data.table` could break if garbage collection was triggered when sorting a large number of non-ASCII characters. Thanks to @shrektan for reporting and fixing PR#2678, #2674.
-
Internal aliasing of `.` to `list` was over-aggressive in applying `list` even when `.` was intended within `bquote`, #1912. Thanks @MichaelChirico for reporting/filing and @ecoRoland for suggesting and testing a fix.
-
Attempt to allocate a wildly large amount of RAM (16EB) when grouping by key and there are close to 2 billion 1-row groups, #2777. Thanks to @jsams for the detailed report.
-
Fixed a bug where `print(dt, class=TRUE)` showed only `topn - 1` rows. Thanks to @heavywatal for reporting #2803 and filing PR#2804.
-
The license has been changed from GPL to MPL (Mozilla Public License). All contributors were consulted and approved. PR#2456 details the reasons for the change.
-
`?data.table` makes explicit the option of using a `logical` vector in `j` to select columns, #1978. Thanks @Henrik-P for the note and @MichaelChirico for filing.
-
Test 1675.1 updated to cope with a change in R-devel in June 2017 related to `factor()` and `NA` levels.
-
Package `ezknitr` has been added to the whitelist of packages that run user code and should be considered data.table-aware, #2266. Thanks to Matt Mills for testing and reporting.
-
Printing with `quote = TRUE` now quotes column names as well, #1319. Thanks @jan-glx for the suggestion and @MichaelChirico for the PR.
-
Added a blurb to `?melt.data.table` explicating the subtle difference in behavior of the `id.vars` argument vis-a-vis its analog in `reshape2::melt`, #1699. Thanks @MichaelChirico for uncovering and filing.
-
Added some clarification about the usage of `on` to `?data.table`, #2383. Thanks to @peterlittlejohn for volunteering his confusion and @MichaelChirico for brushing things up.
-
Clarified that "data.table always sorts in `C-locale`" means that upper-case letters are sorted before lower-case letters by ordering in data.table (e.g. `setorder`, `setkey`, `DT[order(...)]`). Thanks to @hughparsonage for the pull request editing the documentation. Note this makes no difference in most cases of data; e.g. ids where only uppercase or lowercase letters are used (`"AB123"<"AC234"` is always true, regardless), or country names and words which are consistently capitalized. For example, `"America" < "Brazil"` is not affected (it's always true), and neither is `"america" < "brazil"` (always true too); since the first letter is consistently capitalized. But, whether `"america" < "Brazil"` (the words are not consistently capitalized) is true or false in base R depends on the locale of your R session. In America it is true by default and false if i) you type `Sys.setlocale(locale="C")`, or ii) the R session has been started in a C locale for you, which can happen on servers/services (the locale comes from the environment the R session is started in). However, `"america" < "Brazil"` is always, consistently false in data.table which can be a surprise because it differs from base R by default in most regions. It is false because `"B"<"a"` is true because all upper-case letters come first, followed by all lower case letters (the ascii number of each letter determines the order, which is what is meant by `C-locale`).
-
`data.table`'s dependency has been moved forward from R 3.0.0 (Apr 2013) to R 3.1.0 (Apr 2014; i.e. 3.5 years old). We keep this dependency as old as possible for as long as possible as requested by users in managed environments. Thanks to Jan Gorecki, the test suite from latest dev now runs on R 3.1.0 continuously, as well as R-release (currently 3.4.2) and latest R-devel snapshot. Our CRAN release procedures also double check with this stated dependency before release to CRAN. The primary motivation for the bump to R 3.1.0 was allowing one new test which relies on better non-copying behaviour in that version, #2484. It also allows further internal simplifications. Thanks to @MichaelChirico for fixing another test that failed on R 3.1.0 due to slightly different behaviour of `base::read.csv` in R 3.1.0-only which the test was comparing to, #2489.
-
New vignette added: Importing data.table - focused on using data.table as a dependency in R packages. Answers the most commonly asked questions and promotes good practices.
-
As warned in v1.9.8 release notes below in this file (on CRAN 25 Nov 2016) it has been 1 year since then and so use of
options(datatable.old.unique.by.key=TRUE)
to restore the old default is now deprecated with warning. The new warning states that this option still works and repeats the request to pass by=key(DT)
explicitly to unique()
, duplicated()
, uniqueN()
and anyDuplicated()
and to stop using this option. In another year, this warning will become an error. Another year after that the option will be removed. -
As
set2key()
and key2()
have been warning since v1.9.8 on CRAN Nov 2016, their warnings have now been upgraded to errors. Note that when they were introduced in version 1.9.4 (Oct 2014) they were marked as 'experimental' in NEWS item 4. They will be removed in one year.
Was warning: set2key() will be deprecated in the next relase. Please use setindex() instead.
Now error: set2key() is now deprecated. Please use setindex() instead.
-
The option
datatable.showProgress
is no longer set to a default value when the package is loaded. Instead, the default=
argument of getOption
is used by both fwrite
and fread
. The default is the result of interactive()
at the time of the call. Using getOption
in this way is intended to be more helpful to users looking at args(fread)
and ?fread
. -
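The `getOption(..., default=)` pattern described above can be sketched in base R as follows. The helper name `show_progress_default` is hypothetical and is not part of data.table. It only illustrates how an unset option falls back to `interactive()` at call time.

```r
# Hypothetical helper: consult the option when called, falling back to
# interactive() if the option has not been set.
show_progress_default <- function() {
  getOption("datatable.showProgress", default = interactive())
}

# Unset the option (remembering any previous value) and read the fallback.
old <- options(datatable.showProgress = NULL)
unset_value <- show_progress_default()

# An explicit user setting takes precedence over the fallback.
options(datatable.showProgress = FALSE)
explicit_value <- show_progress_default()

# Restore whatever was set before.
options(old)
```

Because the fallback is evaluated on each call, the same script can show progress in an interactive session and stay quiet under `Rscript` without setting the option.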
print.data.table()
invisibly returns its first argument instead of NULL
. This behavior is compatible with the standard print.data.frame()
and tibble's print.tbl_df()
. Thanks to @heavywatal for PR#2807.
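A small sketch of what the invisible return value allows, assuming a data.table version that includes this change. The table `DT` is made up for illustration.

```r
library(data.table)

DT <- data.table(A = 1:3, B = c("x", "y", "z"))

# print() displays the table; its return value can now be captured
# or passed on to further calls.
ret <- print(DT)

is.data.table(ret)
```

Here `ret` holds the printed table rather than `NULL`, so `print()` can sit in the middle of a chain of calls.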
- Fixed crash/hang on MacOS when
parallel::mclapply
is used and data.table is merely loaded, #2418. Oddly, all tests including test 1705 (which tests mclapply
with data.table) passed fine on CRAN. It appears to be some versions of MacOS or some versions of libraries on MacOS, perhaps. Many thanks to Martin Morgan for reporting and confirming this fix works. Thanks also to @asenabouth, Joe Thorley and Danton Noriega for testing, debugging and confirming that automatic parallelism inside data.table (such as fwrite
) works well even on these MacOS installations. See also news items below for 1.10.4-1 and 1.10.4-2.
-
OpenMP on MacOS is now supported by CRAN and included in CRAN's package binaries for Mac. But installing v1.10.4-1 from source on MacOS failed when OpenMP was not enabled at compile time, #2409. Thanks to Liz Macfie and @fupangpangpang for reporting. The startup message when OpenMP is not enabled has been updated.
-
Two rare potential memory faults fixed, thanks to CRAN's automated use of latest compiler tools; e.g. clang-5 and gcc-7
-
The
nanotime
v0.2.0 update on CRAN 22 June 2017 changed from integer64
to S4
and broke fwrite
of nanotime
columns. Fixed to work with nanotime
both before and after v0.2.0. -
Pass R-devel changes related to
deparse(,backtick=)
and factor()
. -
Internal
NAMED()==2
now MAYBE_SHARED()
, #2330. Back-ported to pass under the stated dependency, R 3.0.0. -
Attempted improvement on Mac-only when the
parallel
package is used too (which forks), #2137. Intel's OpenMP implementation appears to leave threads running after the OpenMP parallel region (inside data.table) has finished, unlike GNU libgomp. So, if and when parallel
's fork
is invoked by the user after data.table has run in parallel already, instability occurs. The problem only occurs with Mac package binaries from CRAN because they are built by CRAN with Intel's OpenMP library. No known problems on Windows or Linux and no known problems on any platform when parallel
is not used. If this Mac-only fix still doesn't work, call setDTthreads(1)
immediately after library(data.table)
which has been reported to fix the problem by putting data.table
into single threaded mode earlier. -
When
fread()
and print()
see integer64
columns are present but package bit64
is not installed, the warning is now displayed as intended. Thanks to a question by Santosh on r-help and forwarded by Bill Dunlap.
- The new specialized
nanotime
writer in fwrite()
type punned using *(long long *)&REAL(column)[i]
which, strictly, is undefined behaviour under C standards. It passed a plethora of tests on linux (gcc 5.4 and clang 3.8), win-builder and 6 out of 10 CRAN flavours using gcc. But it failed (wrong data written) with the newest version of clang (3.9.1) as used by CRAN on the failing flavors, and on solaris-sparc. Replaced with the union method and added a grep to CRAN_Release.cmd.
-
When
j
is a symbol prefixed with ..
it will be looked up in calling scope and its value taken to be column names or numbers.
    myCols = c("colA","colB")
    DT[, myCols, with=FALSE]
    DT[, ..myCols]  # same
When you see the
..
prefix think one-level-up like the directory ..
in all operating systems meaning the parent directory. In future the ..
prefix could be made to work on all symbols appearing anywhere inside DT[...]
. It is intended to be a convenient way to protect your code from accidentally picking up a column name. Similar to how x.
and i.
prefixes (analogous to SQL table aliases) can already be used to disambiguate the same column name present in both x
and i
. A symbol prefix rather than a ..()
function will be easier for us to optimize internally and more convenient if you have many variables in calling scope that you wish to use in your expressions safely. This feature was first raised in 2012 and long wished for, #633. It is experimental. -
When
fread()
or print()
see integer64
columns are present, bit64
's namespace is now automatically loaded for convenience. -
fwrite()
now supports the new nanotime
type by Dirk Eddelbuettel, #1982. Aside: data.table
already automatically supported nanotime
in grouping and joining operations via longstanding support of its underlying integer64
type. -
indices()
gains a new argument vectors
, default FALSE
. This strsplits the index names by __
for you, #1589.
    DT = data.table(A=1:3, B=6:4)
    setindex(DT, B)
    setindex(DT, B, A)
    indices(DT)
    [1] "B"    "B__A"
    indices(DT, vectors=TRUE)
    [[1]]
    [1] "B"
    [[2]]
    [1] "B" "A"
-
Some long-standing potential instability has been discovered and resolved many thanks to a detailed report from Bill Dunlap and Michael Sannella. At C level any call of the form
setAttrib(x, install(), allocVector())
can be unstable in any R package. Despite setAttrib()
PROTECTing its inputs, the 3rd argument (allocVector
) can be executed first only for its result to be released by install()
's potential GC before reaching setAttrib
's PROTECTion of its inputs. Fixed by either PROTECTing or pre-install()
ing. Added to CRAN_Release.cmd procedures: i) grep
s to prevent usage of this idiom in future and ii) running data.table's test suite with gctorture(TRUE)
. -
A new potential instability introduced in the last release (v1.10.0) in GForce optimized grouping has been fixed by reverting one change from malloc to R_alloc. Thanks again to Michael Sannella for the detailed report.
-
fwrite()
could write floating point values incorrectly, #1968. A thread-local variable was incorrectly thread-global. This variable's usage lifetime is only a few clock cycles so it needed large data and many threads for several threads to overlap their usage of it and cause the problem. Many thanks to @mgahan and @jmosser for finding and reporting.
-
fwrite()
's ..turbo
option has been removed as the warning message warned. If you've found a problem, please report it. -
No known issues have arisen due to
DT[,1]
and DT[,c("colA","colB")]
now returning columns as introduced in v1.9.8. However, as we've moved forward by setting options('datatable.WhenJisSymbolThenCallingScope'=TRUE)
introduced then too, it has become clear a better solution is needed. All 340 CRAN and Bioconductor packages that use data.table have been checked with this option on. 331 lines would need to be changed in 59 packages. Their usage is elegant, correct and recommended, though. Examples are DT[1, encoding]
in quanteda and DT[winner=="first", freq]
in xgboost. These are looking up the columns encoding
and freq
respectively and returning them as vectors. But if, for some reason, those columns are removed from DT
and encoding
or freq
are still variables in calling scope, their values in calling scope would be returned. That cannot be what was intended and could lead to silent bugs. This was the risk we were trying to avoid.
options('datatable.WhenJisSymbolThenCallingScope')
is now removed. A migration timeline is no longer needed. The new strategy needs no code changes and has no breakage. It was proposed and discussed in point 2 here, as follows.
When j
is a symbol (as in the quanteda and xgboost examples above) it will continue to be looked up as a column name and returned as a vector, as has always been the case. If it's not a column name however, it is now a helpful error explaining that data.table is different to data.frame and what to do instead (use ..
prefix or with=FALSE
). The old behaviour of returning the symbol's value in calling scope can never have been useful to anybody and therefore not depended on. Just as the DT[,1]
change could be made in v1.9.8, this change can be made now. This change increases robustness with no downside. Rerunning all 340 CRAN and Bioconductor package checks reveals 2 packages throwing the new error: partools and simcausal. Their maintainers have been informed that there is a likely bug on those lines due to data.table's (now remedied) weakness. This is exactly what we wanted to reveal and improve. -
As before, and as we can see is in common use in CRAN and Bioconductor packages using data.table,
DT[,myCols,with=FALSE]
continues to look up myCols
in calling scope and take its value as column names or numbers. You can move to the new experimental convenience feature DT[, ..myCols]
if you wish at leisure.
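As a minimal sketch of the two equivalent forms, assuming a data.table version that supports the `..` prefix (the table `DT` and its column names are made up for illustration):

```r
library(data.table)

DT <- data.table(colA = 1:3, colB = 4:6, colC = 7:9)
myCols <- c("colA", "colB")

# Long-standing form: myCols is looked up in calling scope.
res_with <- DT[, myCols, with = FALSE]

# Experimental convenience form using the `..` prefix.
res_dotdot <- DT[, ..myCols]

# Both select the same columns.
same_result <- isTRUE(all.equal(res_with, res_dotdot))

# A bare symbol that is a column name is still returned as a vector.
vec <- DT[, colC]
```

The `..` form spells out at the call site that `myCols` is a variable in calling scope rather than a column of `DT`.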
-
fwrite(..., quote='auto')
already quoted a field if it contained a sep
or \n
, or sep2[2]
when list
columns are present. Now it also quotes a field if it contains a double quote ("
) as documented, #1925. Thanks to Aki Matsuo for reporting. Tests added. The qmethod
tests did test escaping embedded double quotes, but only when sep
or \n
was present in the field as well to trigger the quoting of the field. -
Fixed 3 test failures on Solaris only, #1934. Two were on both sparc and x86 and related to a
tzone
attribute difference between as.POSIXct
and as.POSIXlt
even when passed the default tz=""
. The third was on sparc only: a minor rounding issue in fwrite()
of 1e-305. -
Regression crash fixed when 0's occur at the end of a non-empty subset of an empty table, #1937. Thanks Arun for tracking down. Tests added. For example, subsetting the empty
DT=data.table(a=character())
with DT[c(1,0)]
should return a 1 row result with one NA
since 1 is past the end of nrow(DT)==0
, the same result as DT[1]
. -
Fixed newly reported crash that also occurred in old v1.9.6 when
by=.EACHI
, nomatch=0
, the first item in i
has no match AND j
has a function call that is passed a key column, #1933. Many thanks to Reino Bruner for finding and reporting with a reproducible example. Tests added. -
Fixed
fread()
error occurring for a subset of Windows users: showProgress is not type integer but type 'logical'.
, #1944 and #1111. Our tests cover this usage (it is just default usage), pass on AppVeyor (Windows), win-builder (Windows) and CRAN's Windows so perhaps it only occurs on a specific and different version of Windows to all those. Thanks to @demydd for reporting. Fixed by using strictly logical
type at R level and Rboolean
at C level, consistently throughout. -
Combining
on=
(new in v1.9.6) with by=
or keyby=
gave incorrect results, #1943. Many thanks to Henrik-P for the detailed and reproducible report. Tests added. -
New function
rleidv
was ignoring its cols
argument, #1942. Thanks Josh O'Brien for reporting. Tests added.
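To illustrate what honouring `cols` means, here is a short sketch with a made-up table. Run-length ids are computed from the selected columns only, or from all columns when `cols` is left at its default.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "a"),
                 v = c(1, 2, 2, 2, 2))

# Run ids based on column g only: a new id each time g changes.
id_g <- rleidv(DT, cols = "g")

# Run ids based on column v only.
id_v <- rleidv(DT, cols = "v")

# Default: a new id whenever any column changes.
id_all <- rleidv(DT)
```

With the fix, `id_g` and `id_v` differ from `id_all`, because each looks at one column rather than at every column of `DT`.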
-
It seems OpenMP is not available on CRAN's Mac platform; NOTEs appeared in CRAN checks for v1.9.8. Moved
Rprintf
from init.c
to packageStartupMessage
to avoid the NOTE as requested urgently by Professor Ripley. Also fixed the bad grammar of the message: 'single threaded' now 'single-threaded'. If you have a Mac and run macOS or OS X on it (I run Ubuntu on mine) please contact CRAN maintainers and/or Apple if you'd like CRAN's Mac binary to support OpenMP. Otherwise, please follow these instructions for OpenMP on Mac which people have reported success with. -
Just to state explicitly: data.table does not now depend on or require OpenMP. If you don't have it (as on CRAN's Mac it appears but not in general on Mac) then data.table should build, run and pass all tests just fine.
-
There are now 5,910 raw tests as reported by
test.data.table()
. Tests cover 91% of the 4k lines of R and 89% of the 7k lines of C. These stats are now known thanks to Jim Hester's Covr package and Codecov.io. If anyone is looking for something to help with, creating tests to hit the missed lines shown by clicking the R
and src
folders at the bottom here would be very much appreciated. -
The FAQ vignette has been revised given the changes in v1.9.8. In particular, the very first FAQ.
-
With hindsight, the last release v1.9.8 should have been named v1.10.0 to convey it wasn't just a patch release from .6 to .8 owing to the 'potentially breaking changes' items. Thanks to @neomantic for correctly pointing this out. The best we can do now is bump to 1.10.0.