From 2bb53d46a88dfc54b63acb72064d21c63f5c73fd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kirill=20M=C3=BCller?= Date: Mon, 27 Jan 2025 16:48:43 +0100 Subject: [PATCH] feat: Move content from README to vignettes (#504) --- .github/CODE_OF_CONDUCT.md | 126 ++++++++++++++++ R/relational.R | 6 +- README.Rmd | 67 +++------ README.md | 106 ++++---------- _pkgdown.yml | 14 ++ index.md | 87 +++-------- vignettes/developers.Rmd | 195 ++++++------------------- vignettes/extend.Rmd | 156 ++++++++++++++++++++ vignettes/funnel.Rmd | 276 +++++++++++++++++++++++++++++++++++ vignettes/large.Rmd | 288 +++++++++++++++++++++++++++++++++++++ vignettes/limits.Rmd | 2 +- vignettes/telemetry.Rmd | 66 +++++++++ 12 files changed, 1042 insertions(+), 347 deletions(-) create mode 100644 .github/CODE_OF_CONDUCT.md create mode 100644 vignettes/extend.Rmd create mode 100644 vignettes/funnel.Rmd create mode 100644 vignettes/large.Rmd create mode 100644 vignettes/telemetry.Rmd diff --git a/.github/CODE_OF_CONDUCT.md b/.github/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..3ac34c82d --- /dev/null +++ b/.github/CODE_OF_CONDUCT.md @@ -0,0 +1,126 @@ +# Contributor Covenant Code of Conduct + +## Our Pledge + +We as members, contributors, and leaders pledge to make participation in our +community a harassment-free experience for everyone, regardless of age, body +size, visible or invisible disability, ethnicity, sex characteristics, gender +identity and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, caste, color, religion, or sexual +identity and orientation. + +We pledge to act and interact in ways that contribute to an open, welcoming, +diverse, inclusive, and healthy community. 
+ +## Our Standards + +Examples of behavior that contributes to a positive environment for our +community include: + +* Demonstrating empathy and kindness toward other people +* Being respectful of differing opinions, viewpoints, and experiences +* Giving and gracefully accepting constructive feedback +* Accepting responsibility and apologizing to those affected by our mistakes, + and learning from the experience +* Focusing on what is best not just for us as individuals, but for the overall + community + +Examples of unacceptable behavior include: + +* The use of sexualized language or imagery, and sexual attention or advances of + any kind +* Trolling, insulting or derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or email address, + without their explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Enforcement Responsibilities + +Community leaders are responsible for clarifying and enforcing our standards of +acceptable behavior and will take appropriate and fair corrective action in +response to any behavior that they deem inappropriate, threatening, offensive, +or harmful. + +Community leaders have the right and responsibility to remove, edit, or reject +comments, commits, code, wiki edits, issues, and other contributions that are +not aligned to this Code of Conduct, and will communicate reasons for moderation +decisions when appropriate. + +## Scope + +This Code of Conduct applies within all community spaces, and also applies when +an individual is officially representing the community in public spaces. +Examples of representing our community include using an official e-mail address, +posting via an official social media account, or acting as an appointed +representative at an online or offline event. 
+ +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported to the community leaders responsible for enforcement at codeofconduct@posit.co. +All complaints will be reviewed and investigated promptly and fairly. + +All community leaders are obligated to respect the privacy and security of the +reporter of any incident. + +## Enforcement Guidelines + +Community leaders will follow these Community Impact Guidelines in determining +the consequences for any action they deem in violation of this Code of Conduct: + +### 1. Correction + +**Community Impact**: Use of inappropriate language or other behavior deemed +unprofessional or unwelcome in the community. + +**Consequence**: A private, written warning from community leaders, providing +clarity around the nature of the violation and an explanation of why the +behavior was inappropriate. A public apology may be requested. + +### 2. Warning + +**Community Impact**: A violation through a single incident or series of +actions. + +**Consequence**: A warning with consequences for continued behavior. No +interaction with the people involved, including unsolicited interaction with +those enforcing the Code of Conduct, for a specified period of time. This +includes avoiding interactions in community spaces as well as external channels +like social media. Violating these terms may lead to a temporary or permanent +ban. + +### 3. Temporary Ban + +**Community Impact**: A serious violation of community standards, including +sustained inappropriate behavior. + +**Consequence**: A temporary ban from any sort of interaction or public +communication with the community for a specified period of time. No public or +private interaction with the people involved, including unsolicited interaction +with those enforcing the Code of Conduct, is allowed during this period. +Violating these terms may lead to a permanent ban. + +### 4. 
Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the
community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
<https://www.contributor-covenant.org/version/2/1/code_of_conduct.html>.

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/inclusion).

For answers to common questions about this code of conduct, see the FAQ at
<https://www.contributor-covenant.org/faq>. Translations are available at
<https://www.contributor-covenant.org/translations>.

[homepage]: https://www.contributor-covenant.org

diff --git a/R/relational.R b/R/relational.R
index 27dfefd7b..518cad468 100644
--- a/R/relational.R
+++ b/R/relational.R
@@ -129,10 +129,10 @@ check_funneled <- function(x, duckplyr_error, call = caller_env()) {
   duckplyr_error_msg <- if (is.character(duckplyr_error)) duckplyr_error
   duckplyr_error_parent <- if (is_condition(duckplyr_error)) duckplyr_error
   cli::cli_abort(parent = duckplyr_error_parent, call = call, c(
-      "This operation cannot be carried out by DuckDB, and the input is a lazy duckplyr frame.",
+      "This operation cannot be carried out by DuckDB, and the input is a funneled duckplyr frame.",
       "*" = duckplyr_error_msg,
-      "i" = "Use {.code compute(lazy = FALSE)} to materialize to temporary storage and continue with {.pkg duckplyr}.",
+      "i" = "Use {.code compute(funnel = FALSE)} to materialize to temporary storage and continue with {.pkg duckplyr}.",
+      "i" = 'See {.run vignette("funnel")} for other options.'
)) } } diff --git a/README.Rmd b/README.Rmd index 896ea16e5..47d67f106 100644 --- a/README.Rmd +++ b/README.Rmd @@ -29,6 +29,8 @@ clean_output <- function(x, options) { index <- strsplit(paste(index, collapse = "\n"), "\n---\n")[[1]][[2]] writeLines(index, "index.md") + # FIXME: Change to the main site after release + x <- gsub('(`vignette[(]"([^"]+)"[)]`)', "[\\1](https://duckplyr.tidyverse.org/dev/articles/\\2.html)", x) x <- fansi::strip_sgr(x) x } @@ -130,14 +132,14 @@ class(out) Nothing has been computed yet. Querying the number of rows, or a column, starts the computation: -```{r} +```{r cache = TRUE} out$month ``` Note that, unlike dplyr, the results are not ordered, see `?config` for details. However, once materialized, the results are stable: -```{r} +```{r cache = TRUE} out ``` @@ -168,7 +170,7 @@ db_exec("LOAD httpfs") flights <- read_parquet_duckdb(urls) ``` -Unlike with local data frames, the default is to disallow automatic materialization of the results on access. +Unlike with local data frames, the default is to disallow automatic materialization if the result is too large. ```{r error = TRUE} nrow(flights) @@ -213,62 +215,25 @@ Over 10M rows analyzed in about 10 seconds over the internet, that's not bad. Of course, working with Parquet, CSV, or JSON files downloaded locally is possible as well. -## Using duckplyr in other packages +## Further reading -Refer to `vignette("developers", package = "duckplyr")`. +- `vignette("large")`: Tools for working with large data -## Telemetry +- `vignette("funnel")`: How duckplyr is both eager and lazy at the same time -As a drop-in replacement for dplyr, duckplyr will use DuckDB for the operations only if it can, and fall back to dplyr otherwise. -A fallback will not change the correctness of the results, but it may be slower or consume more memory. -We would like to guide our efforts towards improving duckplyr, focusing on the features with the most impact. 
-To this end, duckplyr collects and uploads telemetry data about fallback situations, but only if permitted by the user: +- `vignette("limits")`: Translation employed by duckplyr, and current limitations -- Collection is on by default, but can be turned off. -- Uploads are done upon request only. -- There is an option to automatically upload when the package is loaded, this is also opt-in. - -The data collected contains: - -- The package version -- The error message -- The operation being performed, and the arguments - - For the input data frames, only the structure is included (column types only), no column names or data - -```{r include = FALSE} -Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = "") -Sys.setenv(DUCKPLYR_FALLBACK_AUTOUPLOAD = "") -fallback_purge() -``` +- `vignette("developers")`: Using duckplyr for individual data frames and in other packages -Fallback is silent by default, but can be made verbose. - -```{r} -Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) -out <- - nycflights13::flights %>% - duckplyr::as_duckdb_tibble() %>% - mutate(inflight_delay = arr_delay - dep_delay) -``` - -After logs have been collected, the upload options are displayed the next time the duckplyr package is loaded in an R session. - -```{r, echo = FALSE} -fallback_autoupload() -``` +- `vignette("telemetry")`: Telemetry in duckplyr -The `fallback_sitrep()` function describes the current configuration and the available options. +## Getting help -## How is this different from dbplyr? +If you encounter a clear bug, please file an issue with a minimal reproducible example on [GitHub](https://github.com/tidyverse/duckplyr/issues). For questions and other discussion, please use [forum.posit.co](https://forum.posit.co/). -The duckplyr package is a dplyr backend that uses DuckDB, a high-performance, embeddable analytical database. -It is designed to be a fully compatible drop-in replacement for dplyr, with *exactly* the same syntax and semantics: -- Input and output are data frames or tibbles. 
-- All dplyr verbs are supported, with fallback. -- All R data types and functions are supported, with fallback. -- No SQL is generated. +## Code of conduct -The dbplyr package is a dplyr backend that connects to SQL databases, and is designed to work with various databases that support SQL, including DuckDB. -Data must be copied into and collected from the database, and the syntax and semantics are similar but not identical to plain dplyr. +Please note that this project is released with a [Contributor Code of Conduct](https://duckplyr.tidyverse.org/CODE_OF_CONDUCT). +By participating in this project you agree to abide by its terms. diff --git a/README.md b/README.md index dabbb6df8..c37c4df4e 100644 --- a/README.md +++ b/README.md @@ -67,7 +67,7 @@ The following code aggregates the inflight delay by year and month for the first half of the year. We use a variant of the `nycflights13::flights` dataset, where the timezone has been set to UTC to work around a current limitation of duckplyr, see -`vignette("limits.html")`. +[`vignette("limits.html")`](https://duckplyr.tidyverse.org/dev/articles/limits.html.html). ``` r flights_df() @@ -114,7 +114,7 @@ starts the computation: ``` r out$month -#> [1] 2 4 5 1 3 6 +#> [1] 4 1 3 6 2 5 ``` Note that, unlike dplyr, the results are not ordered, see `?config` for @@ -125,12 +125,12 @@ out #> # A tibble: 6 × 4 #> year month mean_inflight_delay median_inflight_delay #> -#> 1 2013 2 -5.15 -6 -#> 2 2013 4 -2.67 -5 -#> 3 2013 5 -9.37 -10 -#> 4 2013 1 -3.86 -5 -#> 5 2013 3 -7.36 -9 -#> 6 2013 6 -4.24 -7 +#> 1 2013 4 -2.67 -5 +#> 2 2013 1 -3.86 -5 +#> 3 2013 3 -7.36 -9 +#> 4 2013 6 -4.24 -7 +#> 5 2013 2 -5.15 -6 +#> 6 2013 5 -9.37 -10 ``` Restart R, or call `duckplyr::methods_restore()` to revert to the @@ -168,11 +168,11 @@ flights <- read_parquet_duckdb(urls) ``` Unlike with local data frames, the default is to disallow automatic -materialization of the results on access. +materialization if the result is too large. 
``` r nrow(flights) -#> Error: Materialization is disabled, use collect() or as_tibble() to materialize. +#> Error: Materialization would result in 9091 rows, which exceeds the limit of 9090. Use collect() or as_tibble() to materialize. ``` Queries on the remote data are executed lazily, and the results are not @@ -335,77 +335,33 @@ Over 10M rows analyzed in about 10 seconds over the internet, that’s not bad. Of course, working with Parquet, CSV, or JSON files downloaded locally is possible as well. -## Using duckplyr in other packages +## Further reading -Refer to `vignette("developers", package = "duckplyr")`. +- [`vignette("large")`](https://duckplyr.tidyverse.org/dev/articles/large.html): + Tools for working with large data -## Telemetry +- [`vignette("funnel")`](https://duckplyr.tidyverse.org/dev/articles/funnel.html): + How duckplyr is both eager and lazy at the same time -As a drop-in replacement for dplyr, duckplyr will use DuckDB for the -operations only if it can, and fall back to dplyr otherwise. A fallback -will not change the correctness of the results, but it may be slower or -consume more memory. We would like to guide our efforts towards -improving duckplyr, focusing on the features with the most impact. To -this end, duckplyr collects and uploads telemetry data about fallback -situations, but only if permitted by the user: +- [`vignette("limits")`](https://duckplyr.tidyverse.org/dev/articles/limits.html): + Translation employed by duckplyr, and current limitations -- Collection is on by default, but can be turned off. -- Uploads are done upon request only. -- There is an option to automatically upload when the package is loaded, - this is also opt-in. 
+- [`vignette("developers")`](https://duckplyr.tidyverse.org/dev/articles/developers.html): + Using duckplyr for individual data frames and in other packages -The data collected contains: +- [`vignette("telemetry")`](https://duckplyr.tidyverse.org/dev/articles/telemetry.html): + Telemetry in duckplyr -- The package version -- The error message -- The operation being performed, and the arguments - - For the input data frames, only the structure is included (column - types only), no column names or data +## Getting help -Fallback is silent by default, but can be made verbose. +If you encounter a clear bug, please file an issue with a minimal +reproducible example on +[GitHub](https://github.com/tidyverse/duckplyr/issues). For questions +and other discussion, please use +[forum.posit.co](https://forum.posit.co/). -``` r -Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) -out <- - nycflights13::flights %>% - duckplyr::as_duckdb_tibble() %>% - mutate(inflight_delay = arr_delay - dep_delay) -#> Error processing duckplyr query with DuckDB, falling back to dplyr. -#> Caused by error in `check_df_for_rel()` at duckplyr/R/relational-duckdb.R:108:3: -#> ! Attributes are lost during conversion. Affected column: `time_hour`. -``` +## Code of conduct -After logs have been collected, the upload options are displayed the -next time the duckplyr package is loaded in an R session. - - #> The duckplyr package is configured to fall back to dplyr when it encounters an - #> incompatibility. Fallback events can be collected and uploaded for analysis to - #> guide future development. By default, data will be collected but no data will - #> be uploaded. - #> ℹ Automatic fallback uploading is not controlled and therefore disabled, see - #> `?duckplyr::fallback()`. - #> ✔ Number of reports ready for upload: 1. - #> → Review with `duckplyr::fallback_review()`, upload with - #> `duckplyr::fallback_upload()`. - #> ℹ Configure automatic uploading with `duckplyr::fallback_config()`. 
- -The `fallback_sitrep()` function describes the current configuration and -the available options. - -## How is this different from dbplyr? - -The duckplyr package is a dplyr backend that uses DuckDB, a -high-performance, embeddable analytical database. It is designed to be a -fully compatible drop-in replacement for dplyr, with *exactly* the same -syntax and semantics: - -- Input and output are data frames or tibbles. -- All dplyr verbs are supported, with fallback. -- All R data types and functions are supported, with fallback. -- No SQL is generated. - -The dbplyr package is a dplyr backend that connects to SQL databases, -and is designed to work with various databases that support SQL, -including DuckDB. Data must be copied into and collected from the -database, and the syntax and semantics are similar but not identical to -plain dplyr. +Please note that this project is released with a [Contributor Code of +Conduct](https://duckplyr.tidyverse.org/CODE_OF_CONDUCT). By +participating in this project you agree to abide by its terms. diff --git a/_pkgdown.yml b/_pkgdown.yml index c05fc45b9..12f2acb64 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -8,6 +8,20 @@ template: in_header: | +articles: +- title: Articles + navbar: ~ + contents: + - large + - funnel + - limits + - developers + - telemetry + +- title: internal + contents: + - extend + reference: - title: Using duckplyr contents: diff --git a/index.md b/index.md index 78bc8a755..e30694f13 100644 --- a/index.md +++ b/index.md @@ -112,7 +112,7 @@ Querying the number of rows, or a column, starts the computation: ``` r out$month -#> [1] 2 4 5 1 3 6 +#> [1] 4 1 3 6 2 5 ``` Note that, unlike dplyr, the results are not ordered, see `?config` for details. 
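Because row order of duckplyr results is not guaranteed, a pipeline that needs deterministic output can end with an explicit `arrange()`. A minimal sketch (editor's illustration, not part of the patch; assumes the duckplyr package is installed):

```r
library(duckplyr)

stable <-
  duckdb_tibble(g = c(2, 1, 2, 1), v = c(10, 20, 30, 40)) |>
  dplyr::summarize(.by = g, total = sum(v)) |>
  dplyr::arrange(g) # explicit ordering makes the output deterministic

stable
```

Without the `arrange()`, the two groups could come back in either order; with it, the result is stable across runs.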
@@ -124,12 +124,12 @@ out #> # A tibble: 6 × 4 #> year month mean_inflight_delay median_inflight_delay #>     -#> 1 2013 2 -5.15 -6 -#> 2 2013 4 -2.67 -5 -#> 3 2013 5 -9.37 -10 -#> 4 2013 1 -3.86 -5 -#> 5 2013 3 -7.36 -9 -#> 6 2013 6 -4.24 -7 +#> 1 2013 4 -2.67 -5 +#> 2 2013 1 -3.86 -5 +#> 3 2013 3 -7.36 -9 +#> 4 2013 6 -4.24 -7 +#> 5 2013 2 -5.15 -6 +#> 6 2013 5 -9.37 -10 ``` Restart R, or call `duckplyr::methods_restore()` to revert to the default dplyr implementation. @@ -166,12 +166,12 @@ db_exec("LOAD httpfs") flights <- read_parquet_duckdb(urls) ``` -Unlike with local data frames, the default is to disallow automatic materialization of the results on access. +Unlike with local data frames, the default is to disallow automatic materialization if the result is too large. ``` r nrow(flights) -#> Error: Materialization is disabled, use collect() or as_tibble() to materialize. +#> Error: Materialization would result in 9091 rows, which exceeds the limit of 9090. Use collect() or as_tibble() to materialize. ``` Queries on the remote data are executed lazily, and the results are not materialized until explicitly requested. @@ -335,72 +335,25 @@ Over 10M rows analyzed in about 10 seconds over the internet, that's not bad. Of course, working with Parquet, CSV, or JSON files downloaded locally is possible as well. -## Using duckplyr in other packages +## Further reading -Refer to `vignette("developers", package = "duckplyr")`. +- `vignette("large")`: Tools for working with large data -## Telemetry +- `vignette("funnel")`: How duckplyr is both eager and lazy at the same time -As a drop-in replacement for dplyr, duckplyr will use DuckDB for the operations only if it can, and fall back to dplyr otherwise. -A fallback will not change the correctness of the results, but it may be slower or consume more memory. -We would like to guide our efforts towards improving duckplyr, focusing on the features with the most impact. 
-To this end, duckplyr collects and uploads telemetry data about fallback situations, but only if permitted by the user: +- `vignette("limits")`: Translation employed by duckplyr, and current limitations -- Collection is on by default, but can be turned off. -- Uploads are done upon request only. -- There is an option to automatically upload when the package is loaded, this is also opt-in. +- `vignette("developers")`: Using duckplyr for individual data frames and in other packages -The data collected contains: +- `vignette("telemetry")`: Telemetry in duckplyr -- The package version -- The error message -- The operation being performed, and the arguments - - For the input data frames, only the structure is included (column types only), no column names or data +## Getting help +If you encounter a clear bug, please file an issue with a minimal reproducible example on [GitHub](https://github.com/tidyverse/duckplyr/issues). For questions and other discussion, please use [forum.posit.co](https://forum.posit.co/). -Fallback is silent by default, but can be made verbose. +## Code of conduct -``` r -Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) -out <- - nycflights13::flights %>% - duckplyr::as_duckdb_tibble() %>% - mutate(inflight_delay = arr_delay - dep_delay) -#> Error processing duckplyr query with DuckDB, falling back to dplyr. -#> Caused by error in `check_df_for_rel()` at duckplyr/R/relational-duckdb.R:108:3: -#> ! Attributes are lost during conversion. Affected column: `time_hour`. -``` - -After logs have been collected, the upload options are displayed the next time the duckplyr package is loaded in an R session. - - -``` -#> The duckplyr package is configured to fall back to dplyr when it encounters an -#> incompatibility. Fallback events can be collected and uploaded for analysis to -#> guide future development. By default, data will be collected but no data will -#> be uploaded. 
-#> ℹ Automatic fallback uploading is not controlled and therefore disabled, see -#> `?duckplyr::fallback()`. -#> ✔ Number of reports ready for upload: 1. -#> → Review with `duckplyr::fallback_review()`, upload with -#> `duckplyr::fallback_upload()`. -#> ℹ Configure automatic uploading with `duckplyr::fallback_config()`. -``` - -The `fallback_sitrep()` function describes the current configuration and the available options. - - -## How is this different from dbplyr? - -The duckplyr package is a dplyr backend that uses DuckDB, a high-performance, embeddable analytical database. -It is designed to be a fully compatible drop-in replacement for dplyr, with *exactly* the same syntax and semantics: - -- Input and output are data frames or tibbles. -- All dplyr verbs are supported, with fallback. -- All R data types and functions are supported, with fallback. -- No SQL is generated. - -The dbplyr package is a dplyr backend that connects to SQL databases, and is designed to work with various databases that support SQL, including DuckDB. -Data must be copied into and collected from the database, and the syntax and semantics are similar but not identical to plain dplyr. +Please note that this project is released with a [Contributor Code of Conduct](https://duckplyr.tidyverse.org/CODE_OF_CONDUCT). +By participating in this project you agree to abide by its terms. 
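The funneled materialization limit described above can be sketched as follows (editor's illustration based on the API introduced in this patch, where `duckdb_tibble()` accepts a `.funnel` limit; assumes duckplyr is installed):

```r
library(duckplyr)

# A frame funneled to at most 3 rows: results larger than the limit
# refuse to materialize automatically.
data <- duckdb_tibble(x = 1:5, .funnel = c(rows = 3))

# nrow(data) would raise the materialization error;
# explicit materialization is always allowed:
collected <- dplyr::collect(data)
nrow(collected)
```

`as_tibble()` works the same way as `collect()` here: both opt out of the funnel for a single, explicit materialization.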
diff --git a/vignettes/developers.Rmd b/vignettes/developers.Rmd index 6ae117612..d5e374c27 100644 --- a/vignettes/developers.Rmd +++ b/vignettes/developers.Rmd @@ -1,8 +1,8 @@ --- -title: "Use of duckplyr in other packages" +title: "Selective use of duckplyr" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{developers} + %\VignetteIndexEntry{30 Selective use of duckplyr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- @@ -30,21 +30,23 @@ knitr::opts_chunk$set( Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0) ``` +This vignette demonstrates how to use duckplyr selectively, for individual data frames or for other packages. + ```{r attach} library(conflicted) library(dplyr) conflict_prefer("filter", "dplyr") ``` -## Use of duckplyr for individual data frames +## External data frame -To enable duckplyr **for individual data frames instead of session wide**, +To enable duckplyr for individual data frames instead of session-wide, -- do **not** load duckplyr with `library()`. +- do *not* load duckplyr with `library()`. - use `duckplyr::as_duckdb_tibble()` as the first step in your pipe, without attaching the package. ```{r} -unfunneled <- +lazy <- duckplyr::flights_df() |> duckplyr::as_duckdb_tibble() |> filter(!is.na(arr_delay), !is.na(dep_delay)) |> @@ -60,15 +62,15 @@ unfunneled <- The result is a tibble, with its own class. ```{r} -class(unfunneled) -names(unfunneled) +class(lazy) +names(lazy) ``` DuckDB is responsible for eventually carrying out the operations. -Despite the late filter, the summary is not computed for the months in the second half of the year. +Despite the filter coming very late in the pipeline, it is applied to the raw data. ```{r} -unfunneled |> +lazy |> explain() ``` @@ -76,166 +78,59 @@ All data frame operations are supported. Computation happens upon the first request. 
```{r} -unfunneled$mean_inflight_delay +lazy$mean_inflight_delay ``` After the computation has been carried out, the results are preserved and available immediately: ```{r} -unfunneled +lazy ``` -## Funneling +## Own data -The default mode for `as_duckdb_tibble()` is unfunneled. -This allows applying all data frame operations on the results, including column subsetting or retrieving the number of rows. -In addition, if an operation cannot be carried out by duckdb, the dplyr fallback is used transparently. -Use `funnel = TRUE` to ensure that all operations are carried out by DuckDB, or fail. -This is also the default for the ingestion functions such as `read_parquet_duckdb()`. +Construct duckplyr frames directly with `duckplyr::duckdb_tibble()`: ```{r} -funneled <- - duckplyr::flights_df() |> - duckplyr::as_duckdb_tibble(funnel = TRUE) +data <- duckplyr::duckdb_tibble( + x = 1:10, + y = 5, + z = letters[1:10] +) +data ``` -Columns or the row count cannot be accessed directly in this mode: - -```{r error = TRUE} -nrow(funneled) -``` -Also, operations that are not (yet) supported will fail: +## In other packages -```{r error = TRUE} -funneled |> - mutate(inflight_delay = arr_delay - dep_delay) |> - summarize( - .by = c(year, month), - mean_inflight_delay = mean(inflight_delay, na.rm = TRUE), - median_inflight_delay = median(inflight_delay, na.rm = TRUE), - ) -``` +Like other dependencies, duckplyr must be declared in the `DESCRIPTION` file and optionally imported in the `NAMESPACE` file. +Because duckplyr does not import dplyr, it is necessary to import both packages. +The recipe below shows how to achieve this with the usethis package. -See `vignette("limits")` for current limitations, and the contributing guide for how to add support for additional operations. 
+- Add dplyr as a dependency with `usethis::use_package("dplyr")` +- Add duckplyr as a dependency with `usethis::use_package("duckplyr")` +- In your code, use a pattern like `data |> duckplyr::as_duckdb_tibble() |> dplyr::filter(...)` +- To avoid the package prefix and simply write `as_duckdb_tibble()` or `filter()`: + - Import the duckplyr function with `usethis::use_import_from("duckplyr", "as_duckdb_tibble")` + - Import the dplyr function with `usethis::use_import_from("dplyr", "filter")` -## Extensibility +Learn more about usethis at . -duckplyr also defines a set of generics that provide a low-level implementer's interface for dplyr's high-level user interface. -Other packages may then implement methods for those generics. -```{r extensibility} -library(conflicted) -library(dplyr) -conflict_prefer("filter", "dplyr") -library(duckplyr) -``` +## Funneling +The default mode for `as_duckdb_tibble()` and `duckdb_tibble()` is unfunneled. +This means that the dplyr operations are carried out by DuckDB when possible, and also available as data frames upon first request. +Use `as_duckdb_tibble(funnel = TRUE)` or `duckdb_tibble(.funnel = TRUE)` to avoid materializing intermediate data and to ensure that all operations are carried out by DuckDB or fail. +Funneling can also limit the number of rows or cells that are materialized: -```{r overwrite, echo = FALSE} -methods_overwrite() +```{r} +data <- duckplyr::duckdb_tibble(x = 1:5, .funnel = c(rows = 3)) +data ``` -```{r extensibility2} -# Create a relational to be used by examples below -new_dfrel <- function(x) { - stopifnot(is.data.frame(x)) - new_relational(list(x), class = "dfrel") -} -mtcars_rel <- new_dfrel(mtcars[1:5, 1:4]) - -# Example 1: return a data.frame -rel_to_df.dfrel <- function(rel, ...) { - unclass(rel)[[1]] -} -rel_to_df(mtcars_rel) - -# Example 2: A (random) filter -rel_filter.dfrel <- function(rel, exprs, ...) 
{ - df <- unclass(rel)[[1]] - - # A real implementation would evaluate the predicates defined - # by the exprs argument - new_dfrel(df[sample.int(nrow(df), 3, replace = TRUE), ]) -} - -rel_filter( - mtcars_rel, - list( - relexpr_function( - "gt", - list(relexpr_reference("cyl"), relexpr_constant("6")) - ) - ) -) - -# Example 3: A custom projection -rel_project.dfrel <- function(rel, exprs, ...) { - df <- unclass(rel)[[1]] - - # A real implementation would evaluate the expressions defined - # by the exprs argument - new_dfrel(df[seq_len(min(3, base::ncol(df)))]) -} - -rel_project( - mtcars_rel, - list(relexpr_reference("cyl"), relexpr_reference("disp")) -) - -# Example 4: A custom ordering (eg, ascending by mpg) -rel_order.dfrel <- function(rel, exprs, ...) { - df <- unclass(rel)[[1]] - - # A real implementation would evaluate the expressions defined - # by the exprs argument - new_dfrel(df[order(df[[1]]), ]) -} - -rel_order( - mtcars_rel, - list(relexpr_reference("mpg")) -) - -# Example 5: A custom join -rel_join.dfrel <- function(left, right, conds, join, ...) { - left_df <- unclass(left)[[1]] - right_df <- unclass(right)[[1]] - - # A real implementation would evaluate the expressions - # defined by the conds argument, - # use different join types based on the join argument, - # and implement the join itself instead of relaying to left_join(). - new_dfrel(dplyr::left_join(left_df, right_df)) -} - -rel_join(new_dfrel(data.frame(mpg = 21)), mtcars_rel) - -# Example 6: Limit the maximum rows returned -rel_limit.dfrel <- function(rel, n, ...) { - df <- unclass(rel)[[1]] - - new_dfrel(df[seq_len(n), ]) -} - -rel_limit(mtcars_rel, 3) - -# Example 7: Suppress duplicate rows -# (ignoring row names) -rel_distinct.dfrel <- function(rel, ...) { - df <- unclass(rel)[[1]] - - new_dfrel(df[!duplicated(df), ]) -} - -rel_distinct(new_dfrel(mtcars[1:3, 1:4])) - -# Example 8: Return column names -rel_names.dfrel <- function(rel, ...) 
{ - df <- unclass(rel)[[1]] - - names(df) -} - -rel_names(mtcars_rel) +```{r error = TRUE} +nrow(data) ``` + +Learn more about funneling in `vignette("funnel")`, and about the translation employed by duckplyr in `vignette("limits")`. diff --git a/vignettes/extend.Rmd b/vignettes/extend.Rmd new file mode 100644 index 000000000..456c7cbde --- /dev/null +++ b/vignettes/extend.Rmd @@ -0,0 +1,156 @@ +--- +title: "Implementer's interface" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{99 Implementer's interface} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +clean_output <- function(x, options) { + x <- gsub("0x[0-9a-f]+", "0xdeadbeef", x) + x <- gsub("dataframe_[0-9]*_[0-9]*", " dataframe_42_42 ", x) + x <- gsub("[0-9]*\\.___row_number ASC", "42.___row_number ASC", x) + x <- gsub("─", "-", x) + x +} + +local({ + hook_source <- knitr::knit_hooks$get("document") + knitr::knit_hooks$set(document = clean_output) +}) + +knitr::opts_chunk$set( + collapse = TRUE, + eval = identical(Sys.getenv("IN_PKGDOWN"), "true") || (getRversion() >= "4.1" && rlang::is_installed(c("conflicted", "nycflights13"))), + comment = "#>" +) + +Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0) +``` + +```{r attach} +library(conflicted) +library(dplyr) +conflict_prefer("filter", "dplyr") +``` + +duckplyr also defines a set of generics that provide a low-level implementer's interface for dplyr's high-level user interface. +Other packages may then implement methods for those generics. 
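Before implementing methods for these generics, it can help to see the relational plan that duckplyr itself builds for a pipeline; `explain()` prints the plan DuckDB would execute. A small sketch (editor's illustration, assuming duckplyr is installed):

```r
library(duckplyr)

lazy <- duckdb_tibble(x = 1:10) |>
  dplyr::filter(x > 5) |>
  dplyr::mutate(y = x * 2)

# Print the relational plan DuckDB would execute for this pipeline
dplyr::explain(lazy)
```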
+ +```{r extensibility} +library(conflicted) +library(dplyr) +conflict_prefer("filter", "dplyr") +library(duckplyr) +``` + + +```{r overwrite, echo = FALSE} +methods_overwrite() +``` + +```{r extensibility2} +# Create a relational to be used by examples below +new_dfrel <- function(x) { + stopifnot(is.data.frame(x)) + new_relational(list(x), class = "dfrel") +} +mtcars_rel <- new_dfrel(mtcars[1:5, 1:4]) + +# Example 1: return a data.frame +rel_to_df.dfrel <- function(rel, ...) { + unclass(rel)[[1]] +} +rel_to_df(mtcars_rel) + +# Example 2: A (random) filter +rel_filter.dfrel <- function(rel, exprs, ...) { + df <- unclass(rel)[[1]] + + # A real implementation would evaluate the predicates defined + # by the exprs argument + new_dfrel(df[sample.int(nrow(df), 3, replace = TRUE), ]) +} + +rel_filter( + mtcars_rel, + list( + relexpr_function( + "gt", + list(relexpr_reference("cyl"), relexpr_constant("6")) + ) + ) +) + +# Example 3: A custom projection +rel_project.dfrel <- function(rel, exprs, ...) { + df <- unclass(rel)[[1]] + + # A real implementation would evaluate the expressions defined + # by the exprs argument + new_dfrel(df[seq_len(min(3, base::ncol(df)))]) +} + +rel_project( + mtcars_rel, + list(relexpr_reference("cyl"), relexpr_reference("disp")) +) + +# Example 4: A custom ordering (eg, ascending by mpg) +rel_order.dfrel <- function(rel, exprs, ...) { + df <- unclass(rel)[[1]] + + # A real implementation would evaluate the expressions defined + # by the exprs argument + new_dfrel(df[order(df[[1]]), ]) +} + +rel_order( + mtcars_rel, + list(relexpr_reference("mpg")) +) + +# Example 5: A custom join +rel_join.dfrel <- function(left, right, conds, join, ...) { + left_df <- unclass(left)[[1]] + right_df <- unclass(right)[[1]] + + # A real implementation would evaluate the expressions + # defined by the conds argument, + # use different join types based on the join argument, + # and implement the join itself instead of relaying to left_join(). 
+ new_dfrel(dplyr::left_join(left_df, right_df)) +} + +rel_join(new_dfrel(data.frame(mpg = 21)), mtcars_rel) + +# Example 6: Limit the maximum rows returned +rel_limit.dfrel <- function(rel, n, ...) { + df <- unclass(rel)[[1]] + + new_dfrel(df[seq_len(n), ]) +} + +rel_limit(mtcars_rel, 3) + +# Example 7: Suppress duplicate rows +# (ignoring row names) +rel_distinct.dfrel <- function(rel, ...) { + df <- unclass(rel)[[1]] + + new_dfrel(df[!duplicated(df), ]) +} + +rel_distinct(new_dfrel(mtcars[1:3, 1:4])) + +# Example 8: Return column names +rel_names.dfrel <- function(rel, ...) { + df <- unclass(rel)[[1]] + + names(df) +} + +rel_names(mtcars_rel) +``` diff --git a/vignettes/funnel.Rmd b/vignettes/funnel.Rmd new file mode 100644 index 000000000..375f1f725 --- /dev/null +++ b/vignettes/funnel.Rmd @@ -0,0 +1,276 @@ +--- +title: "Funneling" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{10 Funneling} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +clean_output <- function(x, options) { + x <- gsub("0x[0-9a-f]+", "0xdeadbeef", x) + x <- gsub("dataframe_[0-9]*_[0-9]*", " dataframe_42_42 ", x) + x <- gsub("[0-9]*\\.___row_number ASC", "42.___row_number ASC", x) + x <- gsub("─", "-", x) + x +} + +local({ + hook_source <- knitr::knit_hooks$get("document") + knitr::knit_hooks$set(document = clean_output) +}) + +knitr::opts_chunk$set( + collapse = TRUE, + eval = identical(Sys.getenv("IN_PKGDOWN"), "true") || (getRversion() >= "4.1" && rlang::is_installed(c("conflicted", "nycflights13"))), + comment = "#>" +) + +Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0) +``` + +This vignette discusses eager and lazy computation, and funneling. + +```{r attach} +library(conflicted) +library(dplyr) +conflict_prefer("filter", "dplyr") +``` + +## Introduction + +Data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects. 
+In particular, direct column access like `df$x`, or retrieving the number of rows with `nrow()`, works identically.
+Conceptually, duckplyr frames are "eager": from a user's perspective, they behave like regular data frames.
+
+```{r}
+df <-
+  duckplyr::duckdb_tibble(x = 1:5) |>
+  mutate(y = x + 1)
+df
+class(df)
+df$y
+nrow(df)
+```
+
+Under the hood, two key differences provide improved performance and usability:
+lazy materialization and funneling.
+
+
+## Eager and lazy computation
+
+For a duckplyr frame that is the result of a dplyr operation, accessing column data or retrieving the number of rows will trigger a computation that is carried out by DuckDB, not dplyr.
+In this sense, duckplyr frames are also "lazy": the computation is deferred until the last possible moment, allowing DuckDB to optimize the whole pipeline.
+
+### Example
+
+The following example computes summary statistics of the arrival delay for flights departing from Newark airport (EWR), by month:
+
+```{r}
+flights <- duckplyr::flights_df()
+
+flights_duckdb <-
+  flights |>
+  duckplyr::as_duckdb_tibble()
+
+system.time(
+  mean_arr_delay_ewr <-
+    flights_duckdb |>
+    filter(origin == "EWR", !is.na(arr_delay)) |>
+    summarize(
+      .by = month,
+      mean_arr_delay = mean(arr_delay),
+      min_arr_delay = min(arr_delay),
+      max_arr_delay = max(arr_delay),
+      median_arr_delay = median(arr_delay),
+    )
+)
+```
+
+Setting up the pipeline is fast; the size of the data does not affect the setup cost.
+Because the computation is deferred, DuckDB can optimize the whole pipeline, which can be seen in the output below:
+
+```{r}
+mean_arr_delay_ewr |>
+  explain()
+```
+
+The first step is to prune the unneeded columns: only `origin`, `month`, and `arr_delay` are kept.
+The result becomes available when accessed:
+
+```{r}
+system.time(mean_arr_delay_ewr$mean_arr_delay[[1]])
+```
+
+### Comparison
+
+The functionality is similar to lazy tables in dbplyr and lazy frames in dtplyr.
However, the behavior is different: at the time of writing, the internal structure of a lazy table or frame differs from that of a data frame, and columns cannot be accessed directly.
+
+| | **Eager** 😃 | **Lazy** 😴 |
+|-------------|:------------:|:-----------:|
+| **dplyr** | ✅ | |
+| **dbplyr** | | ✅ |
+| **dtplyr** | | ✅ |
+| **duckplyr**| ✅ | ✅ |
+
+In contrast, with dplyr, each intermediate step, as well as the final result, is a proper data frame that is computed right away, forfeiting the opportunity for optimization:
+
+```{r}
+system.time(
+  flights |>
+    filter(origin == "EWR", !is.na(arr_delay)) |>
+    summarize(
+      .by = month,
+      mean_arr_delay = mean(arr_delay),
+      min_arr_delay = min(arr_delay),
+      max_arr_delay = max(arr_delay),
+      median_arr_delay = median(arr_delay),
+    )
+)
+```
+
+
+## Funneling
+
+Being both "eager" and "lazy" at the same time introduces a challenge:
+it is too easy to accidentally trigger computation,
+which may be prohibitive if an intermediate result is too large.
+This is where funneling comes in.
+
+
+### Concept
+
+For unfunneled duckplyr frames, as in the two previous examples, the underlying DuckDB computation is carried out upon the first request.
+Once the results are computed, they are cached, and subsequent requests are fast.
+This is a good choice for small to medium-sized data, where DuckDB can provide a nice speedup but materializing the data is affordable at any stage.
+This is the default for `duckdb_tibble()` and `as_duckdb_tibble()`.
+
+For funneled duckplyr frames, accessing a column or requesting the number of rows triggers an error.
+This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out.
+
+
+### Example
+
+The example below demonstrates the use of funneled duckplyr frames.
+
+```{r}
+flights_funneled <-
+  flights |>
+  duckplyr::as_duckdb_tibble(funnel = TRUE)
+```
+
+In this example, `flights_funneled` is a funneled duckplyr frame.
+The data can be displayed, and column names and types can be accessed.
+
+```{r}
+flights_funneled
+names(flights_funneled)[1:10]
+class(flights_funneled)
+class(flights_funneled[[1]])
+```
+
+On the other hand, accessing a column or requesting the number of rows triggers an error:
+
+```{r error = TRUE}
+nrow(flights_funneled)
+flights_funneled[[1]]
+```
+
+
+### Enforcing DuckDB operation
+
+For operations not supported by duckplyr, the original dplyr implementation is used as a fallback.
+Because the original dplyr implementation accesses columns directly, the data must be materialized before a fallback can be executed.
+Therefore, funneled frames allow you to check that all operations are supported by DuckDB: for a funneled frame, created with `funnel = TRUE`, fallbacks to dplyr are not possible.
+
+```{r error = TRUE}
+flights_funneled |>
+  group_by(origin) |>
+  summarize(n = n()) |>
+  ungroup()
+```
+
+The same pipeline with an unfunneled frame works, but the computation is carried out by dplyr:
+
+```{r}
+flights_funneled |>
+  duckplyr::as_duckdb_tibble(funnel = FALSE) |>
+  group_by(origin) |>
+  summarize(n = n()) |>
+  ungroup()
+```
+
+By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.
+See `?fallback` for details on fallbacks, and `vignette("limits")` for the operations supported by duckplyr.
+
+
+### Unfunneling
+
+A funneled duckplyr frame can be converted to an unfunneled one with `as_duckdb_tibble(funnel = FALSE)`.
+The `collect.duckplyr_df()` method triggers computation and converts to a plain tibble.
+The difference between the two is the class of the returned object: + +```{r} +flights_funneled |> + duckplyr::as_duckdb_tibble(funnel = FALSE) |> + class() + +flights_funneled |> + collect() |> + class() +``` + +The same behavior is achieved with `as_tibble()` and `as.data.frame()`: + +```{r} +flights_funneled |> + as_tibble() |> + class() + +flights_funneled |> + as.data.frame() |> + class() +``` + +See `vignette("large")` for techniques for working with large data sets. + +### Comparison + +Funneled duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you *explicitly* request it with `collect.duckplyr_df()` or through other means. +However, funneled duckplyr frames can be unfunneled at any time, and vice versa. +In dtplyr and dbplyr, there are no unfunneled frames: collection always needs to be explicit. + + +## Partial funneling + +Partial funneling is a compromise between funneling and unfunneling. +Materialization is allowed for data up to a certain size, measured in cells (values) or rows in the resulting data frame. + +```{r} +nrow(flights) +flights_partial <- + flights |> + duckplyr::as_duckdb_tibble(funnel = c(rows = 100000)) +``` + +In this example, the data is materialized only if the result has fewer than 100,000 rows. + +```{r error = TRUE} +flights_partial |> + select(origin) |> + nrow() +``` + +The original input is too large to be materialized, so the operation fails. +On the other hand, the result after aggregation is small enough to be materialized: + +```{r} +flights_partial |> + count(origin) |> + nrow() +``` + +Partial funneling is a good choice for data sets where the cost of materializing the data is prohibitive only for large results. +The default for the ingestion functions like `read_parquet_duckdb()` is to limit the result size to one million cells (values in the resulting data frame). +See `vignette("large")` for more details on working with large data sets. 
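+As a closing recap, the funneling level is a property of the frame that can be changed at any time with `as_duckdb_tibble()`.
+The sketch below cycles the same data through the three modes discussed above; the concrete row limit is an arbitrary illustration:
+
+```{r}
+# Fully funneled: any materialization triggers an error
+funneled <- duckplyr::as_duckdb_tibble(flights, funnel = TRUE)
+
+# Partially funneled: materialization is allowed for results
+# of up to 100,000 rows
+partial <- duckplyr::as_duckdb_tibble(funneled, funnel = c(rows = 100000))
+
+# Unfunneled: materialization happens transparently on first access
+unfunneled <- duckplyr::as_duckdb_tibble(partial, funnel = FALSE)
+nrow(unfunneled)
+```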
diff --git a/vignettes/large.Rmd b/vignettes/large.Rmd
new file mode 100644
index 000000000..94b20c5b9
--- /dev/null
+++ b/vignettes/large.Rmd
@@ -0,0 +1,288 @@
+---
+title: "Large data"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{01 Large data}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+clean_output <- function(x, options) {
+  x <- gsub("0x[0-9a-f]+", "0xdeadbeef", x)
+  x <- gsub("dataframe_[0-9]*_[0-9]*", " dataframe_42_42 ", x)
+  x <- gsub("[0-9]*\\.___row_number ASC", "42.___row_number ASC", x)
+  x <- gsub("─", "-", x)
+  x
+}
+
+local({
+  hook_source <- knitr::knit_hooks$get("document")
+  knitr::knit_hooks$set(document = clean_output)
+})
+
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  eval = identical(Sys.getenv("IN_PKGDOWN"), "true") || (getRversion() >= "4.1" && rlang::is_installed(c("conflicted", "dbplyr", "nycflights13"))),
+  comment = "#>"
+)
+
+options(conflicts.policy = list(warn = FALSE))
+
+Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
+```
+
+This vignette discusses how to work with large data sets using duckplyr.
+
+```{r attach}
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr")
+```
+
+
+## Introduction
+
+Data frames and other objects in R are stored in RAM, which can become a bottleneck when working with large data sets.
+A variety of tools have been developed to work with large data sets in R.
+One example is the dbplyr package, a dplyr backend designed to work with the many databases that support SQL.
+This is a viable approach if the data is already stored in a database, or if the data is stored in Parquet or CSV files and loaded as a lazy table via `duckdb::tbl_file()`.
+
+The dbplyr package translates dplyr code to SQL.
+The syntax and semantics are very similar to plain dplyr, but not identical.
+In contrast, the duckplyr package aims to be a fully compatible drop-in replacement for dplyr, with *exactly* the same syntax and semantics:
+
+- Input and output are data frames or tibbles.
+- All dplyr verbs are supported, with fallback.
+- All R data types and functions are supported, with fallback.
+- No SQL is generated; instead, DuckDB's "relational" interface is used.
+
+Full compatibility means fewer surprises and less cognitive load for the user.
+With DuckDB as the backend, duckplyr can also handle large data sets that do not fit into RAM, keeping full dplyr compatibility.
+The tools for bringing data into and out of R memory are modeled after the dplyr and dbplyr packages, and are described in the following sections.
+
+See `vignette("funnel")` on eager and lazy data, `vignette("limits")` for limitations in the translation employed by duckplyr, and `?fallback` for more information on fallback.
+
+
+## To duckplyr
+
+The `duckdb_tibble()` function creates a duckplyr data frame from vectors:
+
+```{r}
+df <- duckdb_tibble(x = 1:5, y = letters[1:5])
+df
+```
+
+The `duckdb_tibble()` function is a drop-in replacement for `tibble()` and can be used in the same way.
+
+Similarly, `as_duckdb_tibble()` can be used to convert a data frame or another object to a duckplyr data frame:
+
+```{r}
+flights_df() |>
+  as_duckdb_tibble()
+```
+
+Existing code that uses DuckDB via dbplyr can also take advantage of duckplyr.
+The following code creates a DuckDB connection and writes a data frame to a table:
+
+```{r}
+path_duckdb <- tempfile(fileext = ".duckdb")
+con <- DBI::dbConnect(duckdb::duckdb(path_duckdb))
+DBI::dbWriteTable(con, "data", data.frame(x = 1:5, y = letters[1:5]))
+
+dbplyr_data <- tbl(con, "data")
+dbplyr_data
+dbplyr_data |>
+  explain()
+```
+
+The `explain()` output shows that the data is actually coming from a DuckDB table.
+The `as_duckdb_tibble()` function can then be used to seamlessly convert the data to a duckplyr frame:
+
+```{r}
+dbplyr_data |>
+  as_duckdb_tibble()
+dbplyr_data |>
+  as_duckdb_tibble() |>
+  explain()
+```
+
+This only works for DuckDB connections.
+For other databases, turn the data into an R data frame or export it to a file before using `as_duckdb_tibble()`.
+
+```{r}
+DBI::dbDisconnect(con)
+```
+
+For other common cases, the `as_duckdb_tibble()` function fails with a helpful error message:
+
+```{r error = TRUE}
+duckdb_tibble(a = 1) |>
+  group_by(a) |>
+  as_duckdb_tibble()
+duckdb_tibble(a = 1) |>
+  rowwise() |>
+  as_duckdb_tibble()
+readr::read_csv("a\n1", show_col_types = FALSE) |>
+  as_duckdb_tibble()
+```
+
+In all cases, `as_tibble()` can be used to proceed with the existing code.
+
+## From files
+
+DuckDB supports data ingestion from CSV, Parquet, and JSON files.
+The `read_csv_duckdb()` function accepts a file path and returns a duckplyr frame.
+
+```{r}
+path_csv_1 <- tempfile(fileext = ".csv")
+writeLines("x,y\n1,a\n2,b\n3,c", path_csv_1)
+read_csv_duckdb(path_csv_1)
+```
+
+Reading multiple files is also supported:
+
+```{r}
+path_csv_2 <- tempfile(fileext = ".csv")
+writeLines("x,y\n4,d\n5,e\n6,f", path_csv_2)
+read_csv_duckdb(c(path_csv_1, path_csv_2))
+```
+
+The `options` argument can be used to control how the files are read.
+
+Similarly, the `read_parquet_duckdb()` and `read_json_duckdb()` functions can be used to read Parquet and JSON files, respectively.
+
+For reading from HTTPS or S3 URLs, the [httpfs extension](https://duckdb.org/docs/extensions/httpfs/overview.html) must be installed and loaded in each session.
+
+```{r}
+db_exec("INSTALL httpfs")
+db_exec("LOAD httpfs")
+```
+Once loaded, the `read_csv_duckdb()`, `read_parquet_duckdb()`, and `read_json_duckdb()` functions can be used with URLs: + +```{r} +url <- "https://blobs.duckdb.org/flight-data-partitioned/Year=2024/data_0.parquet" +flights_parquet <- read_parquet_duckdb(url) +flights_parquet +``` + +In all cases, the data is read lazily: only the metadata is read initially, and the data is read as required. +This means that data can be read from files that are larger than the available RAM. +The Parquet format is particularly efficient for this purpose, as it stores data in a columnar format and allows reading only the columns that are required. +See `vignette("funnel")` for more details on the concept of lazy data. + +## From DuckDB + +In addition to `as_duckdb_tibble()`, arbitrary DuckDB queries can be executed and the result can be converted to a duckplyr frame. +For this, [attach](https://duckdb.org/docs/sql/statements/attach.html) an existing DuckDB database first: + +```{r} +sql_attach <- paste0( + "ATTACH DATABASE '", + path_duckdb, + "' AS external (READ_ONLY)" +) +db_exec(sql_attach) +``` + +Then, use `read_sql_duckdb()` to execute a query and return a duckplyr frame: + +```{r} +read_sql_duckdb("SELECT * FROM external.data") +``` + +## Materialization + +In dbplyr, `compute()` is used to materialize a lazy table in a temporary table on the database, and `collect()` is used to bring the data into R memory. +This interface works exactly the same in duckplyr: + +```{r} +simple_data <- + duckdb_tibble(a = 1) |> + mutate(b = 2) + +simple_data |> + explain() + +simple_data_computed <- + simple_data |> + compute() +``` + +The `compute.duckplyr_df()` function returns a duckplyr frame that is materialized in a temporary table. +The return value of the function is a duckplyr frame that can be used in further computations. 
+The materialization is done in a temporary table, so the data is not persisted after the session ends: + +```{r} +simple_data_computed |> + explain() +``` + +The `collect()` function brings the data into R memory and returns a plain tibble: + +```{r} +duckdb_tibble(a = 1) |> + mutate(b = 2) |> + collect() +``` + +## To files + +To materialize data in a persistent file, the `compute_csv()` and `compute_parquet()` functions can be used. +The `compute_csv()` function writes the data to a CSV file: + +```{r} +path_csv_out <- tempfile(fileext = ".csv") +duckdb_tibble(a = 1) |> + mutate(b = 2) |> + compute_csv(path_csv_out) +writeLines(readLines(path_csv_out)) +``` + +The `compute_parquet()` function writes the data to a Parquet file: + +```{r} +path_parquet_out <- tempfile(fileext = ".parquet") +duckdb_tibble(a = 1) |> + mutate(b = 2) |> + compute_parquet(path_parquet_out) |> + explain() +``` + +Just like with `compute.duckplyr_df()`, the return value of `compute_csv()` and `compute_parquet()` is a duckplyr frame that uses the created CSV or Parquet file and can be used in further computations. +At the time of writing, direct JSON export is not supported. + +## The big picture + +The functions shown in this vignette allow the construction of data transformation pipelines spanning multiple data sources and data that is too large to fit into memory. +Full compatibility with dplyr is provided, so existing code can be used with duckplyr with minimal changes. +The lazy evaluation of duckplyr frames allows for efficient data processing, as only the required data is read from disk. +The materialization functions allow the data to be persisted in temporary tables or files, depending on the use case. 
+A typical workflow might look like this:
+
+- Prepare all data sources as duckplyr frames: local data frames and files
+- Combine the data sources using dplyr verbs
+- Preview intermediate results as usual: the computation will be faster because only the first few rows are requested
+- To avoid rerunning the whole pipeline over and over, use `compute.duckplyr_df()` or `compute_parquet()` to materialize any intermediate result that is too large to fit into memory
+- Collect the final result using `collect.duckplyr_df()` or write it to a file using `compute_csv()` or `compute_parquet()`
+
+There is a caveat: due to the design of duckplyr, if a dplyr verb is not supported or uses a function that is not supported, the data will be read into memory before being processed further.
+By default, if the data pipeline starts with an ingestion function, the data will only be read into memory if it is not "too large" (currently defined as 1 million cells or values in the table):
+
+```{r error = TRUE}
+flights_parquet |>
+  group_by(Month)
+```
+
+Because `group_by()` is not supported, duckplyr attempts to read the data into memory before executing the `group_by()` operation.
+When the data is small enough to fit into memory, this works transparently.
+
+```{r}
+flights_parquet |>
+  count(Month, DayofMonth) |>
+  group_by(Month)
+```
+
+See `vignette("funnel")` for the concepts and mechanisms at play.
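+The building blocks above can be combined into a single pipeline that never holds the full data in R memory.
+The sketch below is illustrative only; the file paths are placeholders:
+
+```{r eval = FALSE}
+# Ingest lazily: only metadata is read at this point
+large <- read_parquet_duckdb("large-input.parquet")
+
+# Aggregate, then persist the intermediate result to a Parquet file
+counts <-
+  large |>
+  count(Month, DayofMonth) |>
+  compute_parquet("counts.parquet")
+
+# Bring only the small final result into R memory
+counts |>
+  filter(Month == 1) |>
+  collect()
+```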
diff --git a/vignettes/limits.Rmd b/vignettes/limits.Rmd
index 88d0df092..34fe089f6 100644
--- a/vignettes/limits.Rmd
+++ b/vignettes/limits.Rmd
@@ -2,7 +2,7 @@
 title: "Translations"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{translations}
+  %\VignetteIndexEntry{20 Translations}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
diff --git a/vignettes/telemetry.Rmd b/vignettes/telemetry.Rmd
new file mode 100644
index 000000000..2f85cb739
--- /dev/null
+++ b/vignettes/telemetry.Rmd
@@ -0,0 +1,66 @@
+---
+title: "Telemetry"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{80 Telemetry}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  eval = identical(Sys.getenv("IN_PKGDOWN"), "true") || (getRversion() >= "4.1" && rlang::is_installed(c("conflicted", "nycflights13"))),
+  comment = "#>"
+)
+
+Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
+
+options(conflicts.policy = list(warn = FALSE))
+```
+
+```{r attach}
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr")
+```
+
+As a drop-in replacement for dplyr, duckplyr uses DuckDB for an operation only if it can, and falls back to dplyr otherwise.
+A fallback will not change the correctness of the results, but it may be slower or consume more memory.
+We would like to guide our efforts towards improving duckplyr, focusing on the features with the most impact.
+To this end, duckplyr collects and uploads telemetry data about fallback situations, but only if permitted by the user:
+
+- Collection is on by default, but can be turned off.
+- Uploads are done upon request only.
+- There is an option to automatically upload when the package is loaded; this is also opt-in.
+ +The data collected contains: + +- The package version +- The error message +- The operation being performed, and the arguments + - For the input data frames, only the structure is included (column types only), no column names or data + +```{r include = FALSE} +Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = "") +Sys.setenv(DUCKPLYR_FALLBACK_AUTOUPLOAD = "") +fallback_purge() +``` + +Fallback is silent by default, but can be made verbose. + +```{r} +Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) +out <- + nycflights13::flights %>% + duckplyr::as_duckdb_tibble() %>% + mutate(inflight_delay = arr_delay - dep_delay) +``` + +After logs have been collected, the upload options are displayed the next time the duckplyr package is loaded in an R session. + +```{r, echo = FALSE} +duckplyr:::fallback_autoupload() +``` + +The `fallback_sitrep()` function describes the current configuration and the available options.
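+The environment variables shown in this vignette can also be set ahead of time, e.g. in a startup file, to make the choice permanent.
+The values below are a sketch: `0` disables collection as shown above, and the autoupload variable is assumed to accept a similar on/off value.
+
+```{r eval = FALSE}
+# Opt out of telemetry collection entirely
+Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
+
+# Opt in to automatic uploads when the package is loaded (assumed value)
+Sys.setenv(DUCKPLYR_FALLBACK_AUTOUPLOAD = 1)
+
+# Print a message whenever a fallback to dplyr occurs
+Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE)
+```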