Add unnest() #266

Open
mgirlich opened this issue Jul 2, 2021 · 3 comments
Labels: feature (a feature request or enhancement)

@mgirlich
Collaborator

mgirlich commented Jul 2, 2021

I tried to add unnest(), but it seems to be quite difficult. This issue acts more as a reminder of the problems I encountered. Maybe at some point data.table will get its own implementation.

The standard solution is something like

dt[, lapply(.SD, unlist, recursive = FALSE), by = ...]

but this has the following problems:

  • When using by, the result of j for the first group must not be NULL, so the user would have to specify .ptype.
  • data.table doesn't support data frame columns.
  • Unnesting a list of data frames in data.table syntax seems to be quite tricky.
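
For reference, here is a minimal sketch of the standard solution in the simplest case: a single list column of vectors with no NULL entries. The table dt and its columns id and vals are made up for illustration.

```r
library(data.table)

# Hypothetical example data: one list column of integer vectors
dt <- data.table(
  id = c("a", "b"),
  vals = list(1:2, 3:5)
)

# Within each group, .SD holds a one-element list column;
# unlist(recursive = FALSE) flattens it into a plain vector
res <- dt[, lapply(.SD, unlist, recursive = FALSE), by = id]
res
```

The problems listed above only show up once NULL entries or data frame elements are involved.
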
@markfairbanks
Collaborator

I don't know the full solution, but here are a few more thoughts/notes on this.

The unlist() syntax has been superseded by using list_col[[1]].

Note: If we use it, we should bump the data.table dependency to v1.13.2, because this syntax has performance issues in 1.13.0. Relevant issue

library(data.table)

nest_df <- data.table(y = 1:2)
nest_list <- list(nest_df, nest_df)

test_df <- data.table(
  x = c("a", "b"),
  df_list = nest_list
)

test_df[, df_list[[1]], by = x]
#>    x y
#> 1: a 1
#> 2: a 2
#> 3: b 1
#> 4: b 2

We could build out unnesting multiple list columns by creating a call to c() like this:

library(data.table)

nest_df <- data.table(y = 1:2)
nest_list <- list(nest_df, nest_df)
nest_list2 <- lapply(nest_list, setNames, "z")

test_df <- data.table(
  x = c("a", "b"),
  df_list = nest_list,
  df_list2 = nest_list2
)

test_df[, c(df_list[[1]], df_list2[[1]]), by = x]
#>    x y z
#> 1: a 1 1
#> 2: a 2 2
#> 3: b 1 1
#> 4: b 2 2

You can unnest vectors this way, however they come out auto-named unless you slightly change the syntax. This auto-naming also occurs using the unlist() syntax.

library(data.table)

test_df <- data.table(
  x = c("a", "b"),
  vec_list = list(1:2, 1:2)
)

# Auto named
test_df[, vec_list[[1]], by = x]
#>    x V1
#> 1: a  1
#> 2: a  2
#> 3: b  1
#> 4: b  2

# Assigning a name
test_df[, .(vec_list = vec_list[[1]]), by = x]
#>    x vec_list
#> 1: a        1
#> 2: a        2
#> 3: b        1
#> 4: b        2

In tidytable I handled this with a simple if statement, but in dtplyr we won't know the type of the data until evaluation occurs. Simply renaming the V1 column afterwards is possible, but issues can occur if the user already has a column named V1:

library(data.table)

test_df <- data.table(
  V1 = c("a", "b"),
  vec_list = list(1:2, 1:2)
)

test_df[, vec_list[[1]], by = V1]
#>    V1 V1
#> 1:  a  1
#> 2:  a  2
#> 3:  b  1
#> 4:  b  2
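
To illustrate the "if statement" idea, here is a rough sketch of an eager helper. The name unnest_one and its arguments are hypothetical; this is not the tidytable implementation and not part of dtplyr. It peeks at the first element of the list column, which is exactly the information dtplyr does not have before evaluation, so it only shows the idea.

```r
library(data.table)

# Hypothetical helper: unnest a single list column, naming vector results
unnest_one <- function(dt, col, by) {
  # Build the expression col[[1]] for the requested list column
  inner <- substitute(COL[[1]], list(COL = as.name(col)))
  if (is.data.frame(dt[[col]][[1]])) {
    # Data frames already carry their own column names
    j <- inner
  } else {
    # Vectors get the list column's name instead of the automatic V1
    j <- as.call(setNames(list(quote(list), inner), c("", col)))
  }
  eval(substitute(dt[, J, by = BY], list(J = j, BY = by)))
}

vec_dt <- data.table(
  V1 = c("a", "b"),
  vec_list = list(1:2, 1:2)
)
res_vec <- unnest_one(vec_dt, "vec_list", by = "V1")

df_dt <- data.table(
  x = c("a", "b"),
  df_list = list(data.table(y = 1:2), data.table(y = 1:2))
)
res_df <- unnest_one(df_dt, "df_list", by = "x")
```

Because the vector branch names its result explicitly, the duplicated V1 column from the example above should not appear.
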

And the last one I can think of at the moment: unnesting lists of data frames and lists of vectors at the same time causes some issues where data.table tries to recycle the vectors and creates unnamed columns (this one might be worth opening a separate data.table issue):

library(data.table)

nest_df <- data.table(y = 1:2)
nest_list <- list(nest_df, nest_df)

test_df <- data.table(
  x = c("a", "b"),
  df_list = nest_list,
  vec_list = list(1:2, 1:2)
)

test_df[, c(df_list[[1]], vec_list[[1]]), by = x]
#>    x y    
#> 1: a 1 1 2
#> 2: a 2 1 2
#> 3: b 1 1 2
#> 4: b 2 1 2
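
As far as I can tell, the unnamed columns come from c() itself: it treats the data.table as a list of columns and appends each element of the vector as its own unnamed list element. A possible workaround, sketched below as an assumption rather than a tested dtplyr translation, is to wrap the vector in a named list first.

```r
library(data.table)

nest_df <- data.table(y = 1:2)

# c() flattens the vector into separate list elements:
# one element for column y plus one per vector value
combined <- c(nest_df, 1:2)
str(combined)

test_df <- data.table(
  x = c("a", "b"),
  df_list = list(nest_df, nest_df),
  vec_list = list(1:2, 1:2)
)

# Wrapping the vector in a named list keeps it as a single named column
res <- test_df[, c(df_list[[1]], list(vec_list = vec_list[[1]])), by = x]
res
```

This again requires knowing which list columns hold vectors, so it runs into the same "type is unknown until evaluation" problem.
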

@mgirlich
Collaborator Author

mgirlich commented Jul 5, 2021

@markfairbanks thanks for your notes and thoughts on this! Do you know any nice way to handle NULL?

library(data.table)

nest_df <- data.table(y = 1:2)
test_df <- data.table(
  x = c("a", "b"),
  df_list = list(nest_df, NULL),
  df_list2 = list(nest_df, nest_df)
)

test_df[, c(df_list[[1]], df_list2[[1]]), by = x]
#> Error in `[.data.table`(test_df, , c(df_list[[1]], df_list2[[1]]), by = x): j doesn't evaluate to the same number of columns for each group

tidyr::unnest(test_df, c(df_list, df_list2), names_repair = "unique")
#> New names:
#> * y -> y...2
#> * y -> y...3
#> # A tibble: 4 x 3
#>   x     y...2 y...3
#>   <chr> <int> <int>
#> 1 a         1     1
#> 2 a         2     2
#> 3 b        NA     1
#> 4 b        NA     2

Created on 2021-07-05 by the reprex package (v2.0.0)
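
One possible workaround, sketched here under the assumption that the caller can supply a prototype (the ptype object below is hypothetical and would have to come from the user), is to replace NULL entries with a one-row NA prototype before unnesting. The second list column is renamed to z here to keep the output names unique.

```r
library(data.table)

nest_df <- data.table(y = 1:2)
test_df <- data.table(
  x = c("a", "b"),
  df_list = list(nest_df, NULL),
  df_list2 = list(data.table(z = 1:2), data.table(z = 1:2))
)

# Assumed prototype: one row of NA with the expected column name and type
ptype <- data.table(y = NA_integer_)

# Swap each NULL entry for the prototype so every group has the same columns
filled_list <- lapply(
  test_df$df_list,
  function(el) if (is.null(el)) ptype else el
)

filled <- data.table(
  x = test_df$x,
  df_list = filled_list,
  df_list2 = test_df$df_list2
)

# The length-1 NA column is recycled to the length of the other column
res <- filled[, c(df_list[[1]], df_list2[[1]]), by = x]
res
```

This gives NA rows for group b, similar in shape to the tidyr result, but it does not remove the need for the user to specify the prototype.
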

@markfairbanks
Collaborator

I don't, unfortunately. NULL list values might have to be a limitation of dtplyr for now.

markfairbanks added the feature (a feature request or enhancement) label Jun 21, 2022