---
title: "Importing data.table"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Importing data.table}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
<style>
h2 {
font-size: 20px;
}
</style>
This document is focused on using data.table as a dependency in other R packages. If you are interested in using data.table C code from a non-R application, or in calling its C functions directly, jump to the last section of this vignette.
Importing data.table is no different from importing other R packages. This vignette is meant to answer the most common questions that come up around that subject. The way of defining a dependency presented here can be applied to other R packages as well.
## Why to import data.table
One of the biggest features of data.table is its concise syntax, which makes exploratory analysis faster to write and easier to read. The same quality can lead package authors to use data.table in their own packages. Another, maybe even more important, reason is high performance. When you outsource heavy computing tasks from your package to data.table, you usually get top performance without needing to learn any high-performance computing techniques or tricks.
## Importing data.table is easy
It is very easy to use data.table as a dependency because data.table does not have any dependencies of its own. This holds for both operating system dependencies and R dependencies. It means that if you have R installed on your machine, it already has everything that is needed to install data.table. It also means that adding data.table as a dependency of your package will not pull in a chain of further recursive dependencies, which makes it very convenient for offline installation.
## DESCRIPTION file
The first place to define a dependency in a package is the DESCRIPTION file. Most commonly, you will need to add `data.table` under the `Imports:` field. Doing so requires data.table to be installed before your package can be installed. As mentioned above, no other packages will be installed because data.table does not have any dependencies of its own. You can also specify the minimum required version of a dependency; for example, if your package uses the `fwrite` function, which was introduced in data.table 1.9.8, you can declare `Imports: data.table (>= 1.9.8)`. This ensures that data.table 1.9.8 or later is installed before your package can be installed. Another way is to use `Depends: data.table` instead of `Imports`, but this should be avoided because it attaches data.table to the search path of everyone who loads your package, forcing data.table on your end users.
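As a rough sketch of what such a declaration looks like, the snippet below writes a minimal DESCRIPTION for the example package `a.pkg` to a temporary file and reads the `Imports` field back with base R's `read.dcf`. The `Version` and `Title` values are placeholders invented for this illustration; only the `Imports` line comes from the text above.

```r
# Hypothetical DESCRIPTION content for the example package `a.pkg`.
# Version and Title are placeholders; the Imports line is the one that matters.
desc_lines = c(
  "Package: a.pkg",
  "Version: 0.0.1",
  "Title: Example Package Using data.table",
  "Imports: data.table (>= 1.9.8)"
)

# Write the content to a temporary file and parse it as a DCF file,
# which is the format DESCRIPTION files use.
desc_file = tempfile("DESCRIPTION")
writeLines(desc_lines, desc_file)
desc = read.dcf(desc_file, fields = c("Package", "Imports"))

# The declared dependency, including the minimum version requirement.
imports = desc[1L, "Imports"]
print(imports)
```

Reading the field back this way is only a quick sanity check that the dependency is declared where you expect it; it does not verify that the requirement is satisfied on the current machine.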
## NAMESPACE file
The next thing is to define which parts of data.table your package uses. This needs to be done in the NAMESPACE file. Most commonly, package authors will want to use `import(data.table)`, which imports all exported functions from the data.table package.
If you want to use just a subset of data.table functions, let's say only the fast CSV reader and writer, you can use `importFrom(data.table, fread, fwrite)` in the NAMESPACE file. It is also possible to import all functions from a package while excluding particular ones, using `import(data.table, except=c(fread, fwrite))`.
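If your NAMESPACE file is generated by roxygen2 rather than written by hand, the same imports are usually declared through roxygen tags in an R source file. The sketch below assumes roxygen2 is in use; the file name `R/a.pkg-package.R` is hypothetical, and the two tags are alternatives, so a real package would normally contain only one of them.

```r
# Hypothetical file R/a.pkg-package.R, assuming NAMESPACE is generated by roxygen2.

# Option 1: import every exported data.table function.
# roxygen2 turns this tag into an `import(data.table)` directive.
#' @import data.table
NULL

# Option 2: import only the CSV reader and writer.
# roxygen2 turns this tag into the equivalent `importFrom` directives.
#' @importFrom data.table fread fwrite
NULL
```

After regenerating the documentation, it is worth opening the NAMESPACE file to confirm that the expected directives are actually there.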
## Usage
As an example, we will define two functions in the `a.pkg` package that use data.table. The `gen` function generates a data.table, and `aggr` aggregates that data.table.
```r
gen = function (n = 100L) {
  dt = as.data.table(list(id = seq_len(n)))
  dt[, grp := ((id - 1) %% 26) + 1
     ][, grp := letters[grp]
       ][]
}
aggr = function (x) {
  stopifnot(
    is.data.table(x),
    "grp" %in% names(x)
  )
  x[, .N, by = grp]
}
```
## Testing
Be sure to include tests in your package. Before each major release of data.table, we check reverse dependencies. This means that if any changes in data.table would break your code, we will be able to spot the breaking changes and inform you before releasing the new version. This of course assumes that you publish your package to CRAN. The most basic test can be a plain-text R script in your package directory, `tests/test.R`:
```r
library(a.pkg)
dt = gen()
stopifnot(nrow(dt) == 100)
dt2 = aggr(dt)
stopifnot(nrow(dt2) < 100)
```
When testing your package, you may want to use `R CMD check --no-stop-on-test-error`, which continues to run all your test scripts instead of stopping at the first script that fails (requires R 3.4.0).
## Testing using testthat
It is very common to use the testthat package for testing. Testing a package that imports data.table is no different from testing other packages. An example test script `tests/testthat/test-pkg.R`:
```r
context("pkg tests")
test_that("generate dt", { expect_true(nrow(gen()) == 100) })
test_that("aggregate dt", { expect_true(nrow(aggr(gen())) < 100) })
```
## Dealing with "undefined global functions or variables"
`data.table`'s use of Non-Standard Evaluation (especially on the left-hand side of `:=`) is not well-recognised by `R CMD check`. This results in `NOTE`s like the following during package check:
```
* checking R code for possible problems ... NOTE
aggr: no visible binding for global variable 'grp'
gen: no visible binding for global variable 'grp'
gen: no visible binding for global variable 'id'
Undefined global functions or variables:
grp id
```
The easiest way to deal with this is to pre-define those variables and set them to `NULL`, optionally adding a comment (as is done in the refined version of `gen` below). When possible, you can also use a character vector instead of symbols (as in `aggr` below). The functions from the above example would then look like:
```r
gen = function (n = 100L) {
  id = grp = NULL # due to NSE notes in R CMD check
  dt = as.data.table(list(id = seq_len(n)))
  dt[, grp := ((id - 1) %% 26) + 1
     ][, grp := letters[grp]
       ][]
}
aggr = function (x) {
  stopifnot(
    is.data.table(x),
    "grp" %in% names(x)
  )
  x[, .N, by = "grp"]
}
```
The case for `:=` is slightly different, because `:=` is interpreted as a function in the above code; so instead of registering `:=` as `NULL`, you must register it as a function, e.g.:
```r
`:=` = function(...) NULL
```
If you don't mind having `id` and `grp` registered as variables globally in your package namespace you can use `?globalVariables`. Be aware that these notes do not have any impact on the code or its functionality; if you are not going to publish your package, you may simply choose to ignore them.
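A minimal sketch of the `globalVariables` approach is shown below. The file name `R/globals.R` is hypothetical; in a package this call would sit at the top level of one of the R source files, and the names passed are the ones from the `NOTE` shown earlier.

```r
# Hypothetical file R/globals.R inside the example package `a.pkg`.
# Declares `id` and `grp` as known global names so that R CMD check
# no longer reports them as undefined. It does not create the variables.
declared = utils::globalVariables(c("id", "grp"))
```

Compared with setting the variables to `NULL` inside each function, this declares the names once for the whole package, at the cost of silencing the check for those names everywhere in it.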
## Troubleshooting
If you face any problems in this process, before asking questions or reporting issues please confirm that the problem is reproducible in a clean R session using the R console: `R CMD check package.name`.
Some of the most common issues developers face are related to helper tools that are meant to automate package development tasks, for example using roxygen to generate the NAMESPACE file from metadata in the R code files. Others are related to helpers that build and check the package. Thus, be sure to double check using the R console, and also ensure that the proper import is defined in the DESCRIPTION and NAMESPACE files. If you are not able to reproduce your problems using just the R console to build and check, you may try to get some support in [devtools#192](https://github.com/hadley/devtools/issues/192) or [devtools#1472](https://github.com/hadley/devtools/issues/1472).
## License
Since version 1.10.5, data.table is licensed under the Mozilla Public License (MPL). The reasons for the change from GPL should be read in full [here](https://github.com/Rdatatable/data.table/pull/2456), and you can read more about MPL on Wikipedia [here](https://en.wikipedia.org/wiki/Mozilla_Public_License) and [here](https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses).
## Optionally import data.table: Suggests
If you want to use data.table conditionally, i.e., only when it is installed, you should use `Suggests: data.table` in your DESCRIPTION file instead of `Imports: data.table`. By default, this definition does not force data.table to be installed when your package is installed. It also requires you to use data.table conditionally in your package code, which should be done with the `?requireNamespace` function. The example below demonstrates conditional use of data.table's fast CSV writer `?fwrite`. If the data.table package is not installed, the much slower base R `?write.table` function is used to write the CSV file instead.
```r
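my.write = function (x) {
  if(requireNamespace("data.table", quietly=TRUE)) {
    data.table::fwrite(x, "data.csv")
  } else {
    # sep="," is needed for CSV output; write.table separates fields with a space by default
    write.table(x, "data.csv", sep=",", row.names=FALSE)
  }
}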
```
A slightly more extended version also ensures that the installed data.table is recent enough to have the `fwrite` function available:
```r
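my.write = function (x) {
  if(requireNamespace("data.table", quietly=TRUE) &&
    utils::packageVersion("data.table") >= "1.9.8") {
    data.table::fwrite(x, "data.csv")
  } else {
    # sep="," is needed for CSV output; write.table separates fields with a space by default
    write.table(x, "data.csv", sep=",", row.names=FALSE)
  }
}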
```
When using a package as a suggested dependency, you should not import it in the NAMESPACE file; just mention it in the DESCRIPTION file. You also have to use the `data.table::` prefix when calling data.table functions, because none of them are imported.
## Further information on dependencies
For more canonical documentation on defining package dependencies, see the official manual: [Writing R Extensions](https://cran.r-project.org/doc/manuals/r-release/R-exts.html).
## Importing from non-R app
Some tiny parts of data.table C code were isolated from the R C API and can now be used from non-R applications by linking to .so / .dll files. More details about this will be provided later; for now you can study the C code that was isolated from the R C API in [src/fread.c](https://github.com/Rdatatable/data.table/blob/master/src/fread.c) and [src/fwrite.c](https://github.com/Rdatatable/data.table/blob/master/src/fwrite.c).