filter error on data with lubridate intervals #475

Closed
jakeybob opened this issue Jul 15, 2024 · 2 comments

@jakeybob

Hi -- I've encountered an issue where dtplyr seems to fail when filtering data that has a lubridate::interval() column. I first saw this on a tibble of ~50 columns of various data types (including several lubridate date/time types); dropping the single interval() column fixed it, so the problem does seem to be specific to interval data.

I've submitted it here (rather than as a lubridate issue) because the error occurs even when the filter condition only involves another column (here an integer column).

It's easy enough to work around, but figured I'd raise an issue as the behaviour seems unexpected. Any thoughts appreciated! 😃

library(dplyr)
library(dtplyr)
library(lubridate)

# dummy data
df <- tibble(a = 1:3) |> 
  mutate(interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01"))) 

# expected filter result using dplyr
df |> 
  filter(a == max(a))

# dtplyr filter result throws error
df |> 
  dtplyr::lazy_dt() |> 
  filter(a == max(a))

# dtplyr filter result (also throws error -- so nothing to do with max())
df |> 
  dtplyr::lazy_dt() |> 
  filter(a == 3)

# Error in `[<-`:
# ! Assigned data `map(.subset(x, unname), vectbl_set_names, NULL)` must be compatible with existing
#   data.
# ✖ Existing data has 1 row.
# ✖ Element 2 of assigned data has 3 rows.
# ℹ Row updates require a list value. Do you need `list()` or `as.list()`?
# Caused by error in `vectbl_recycle_rhs_rows()`:
# ! Can't recycle input of size 3 to size 1.

# dtplyr filter works when dropping lubridate::interval col
df |> 
  select(-interval) |> 
  dtplyr::lazy_dt() |> 
  filter(a == max(a))

sessionInfo()
─ Session info────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       macOS Sonoma 14.5
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2024-07-15
 pandoc   2.12 @ /Users/xxx/opt/anaconda3/bin/pandoc
─ Packages───────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
 data.table    1.15.4  2024-03-30 [1] CRAN (R 4.4.0)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
 dtplyr      * 1.3.1   2023-03-22 [1] CRAN (R 4.4.0)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
 tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
 tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
 timechange    0.3.0   2024-01-18 [1] CRAN (R 4.4.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
 withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
@eutwt
Collaborator

eutwt commented Jul 21, 2024

Period objects and similar "multi-column" structures are not supported by data.table, as described in Rdatatable/data.table#4415. I don't think there's anything we can do on the dtplyr end.

Notice the length of the "start" slot when subsetting a data frame vs when subsetting a data.table. Subsetting the data.table (rather than just a column) produces an error.

suppressPackageStartupMessages({
library(lubridate)
library(data.table)
library(dplyr)
})

df <- tibble(a = 1:3) |> 
  mutate(interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01"))) 
dt <- as.data.table(df)

str(df[3, 'interval', drop = TRUE])
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#>   ..@ .Data: num 259200
#>   ..@ start: POSIXct[1:1], format: "2023-12-29"
#>   ..@ tzone: chr "UTC"
str(dt[3, interval])
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#>   ..@ .Data: num 259200
#>   ..@ start: POSIXct[1:3], format: "2023-12-31" "2023-12-30" ...
#>   ..@ tzone: chr "UTC"
dt[3]
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent

Created on 2024-07-21 with reprex v2.0.2
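
For anyone landing here, one possible workaround can be sketched as follows. It is not an official dtplyr or data.table recommendation, only a minimal sketch that assumes the interval can be represented by its start and end times. The helper column names `interval_start` and `interval_end` are made up for illustration. The idea is to store the two endpoints as plain POSIXct columns before calling `lazy_dt()`, and to rebuild the interval only after collecting the result back into a tibble.

```r
library(dplyr)
library(dtplyr)
library(lubridate)

# same dummy data as in the original report
df <- tibble(a = 1:3) |>
  mutate(interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01")))

result <- df |>
  # split the S4 Interval column into two ordinary POSIXct columns
  # (interval_start / interval_end are hypothetical helper names)
  mutate(
    interval_start = int_start(interval),
    interval_end   = int_end(interval)
  ) |>
  select(-interval) |>
  lazy_dt() |>
  filter(a == max(a)) |>
  # collect back into a tibble before touching Interval objects again
  as_tibble() |>
  mutate(interval = interval(interval_start, interval_end)) |>
  select(-interval_start, -interval_end)

result
```

The Interval column never passes through data.table in this version, so the row subsetting that failed above is only applied to atomic columns.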

@jakeybob
Author

OK, thanks, appreciate the reply -- I wasn't aware of the underlying workings and multi-col structures etc; good to know!
