-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
first() doesn't work in summaries #129
Comments
Closely related: tidyverse/dplyr#2290 |
Also reported in library(sparklyr)
sc <- spark_connect(mater = "local")
batting_tbl <- copy_to(sc, Lahman::Batting)
batting_tbl %>% head() %>%
group_by(yearID) %>%
summarise(meansting = mean(stint), firstname = first(playerID)) %>%
dbplyr::sql_render()
|
I thinks this is because library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)
t <- batting_tbl %>%
head(100) %>%
group_by(teamID) %>%
mutate(x = first(playerID)) %>%
select(teamID, x)
t
#> # Source: lazy query [?? x 2]
#> # Database: spark_connection
#> # Groups: teamID
#> teamID x
#> <chr> <chr>
#> 1 BS1 barnero01
#> 2 BS1 barnero01
#> 3 BS1 barnero01
#> 4 BS1 barnero01
#> 5 BS1 barnero01
#> 6 BS1 barnero01
#> 7 BS1 barnero01
#> 8 BS1 barnero01
#> 9 BS1 barnero01
#> 10 CH1 brannmi01
#> # ... with more rows
show_query(t)
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `playerID`, `yearID`, `stint`, `teamID`, `lgID`, `G`, `AB`, `R`, `H`, `X2B`, `X3B`, `HR`, `RBI`, `SB`, `CS`, `BB`, `SO`, `IBB`, `HBP`, `SH`, `SF`, `GIDP`, first_value(`playerID`) OVER (PARTITION BY `teamID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_35d8306e463f`
#> LIMIT 100) `jyepziefva`) `bhsbzffqdj`
spark_disconnect(sc) If you try with library(sparklyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)
t <- batting_tbl %>%
head(100) %>%
group_by(teamID) %>%
summarise(x = max(playerID)) %>%
select(teamID, x)
t
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> # Source: lazy query [?? x 2]
#> # Database: spark_connection
#> teamID x
#> <chr> <chr>
#> 1 BS1 spaldal01
#> 2 CH1 simmojo01
#> 3 CL1 questjo01
#> 4 FW1 selmafr01
#> 5 NY2 startjo01
#> 6 PH1 senseco01
#> 7 RC1 sagerpo01
#> 8 TRO pikeli01
#> 9 WS3 nortofr01
show_query(t)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `teamID`, MAX(`playerID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_1d1870e96305`
#> LIMIT 100) `gpdizhnmrf`
#> GROUP BY `teamID`) `mgevjkueih`
spark_disconnect(sc) Created on 2018-09-20 by the reprex package (v0.2.0). |
Minimal reprex: library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)
mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_djljnlxwsc`
mf %>% summarise(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, '`x`' AS `f`
#> FROM `dbplyr_djljnlxwsc`
#> GROUP BY `g` The source of the bug appears to be that However, this is only going to deliver a better error message because in SQL |
library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)
mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
mf %>% summarise(f = first(x)) %>% show_query()
#> Error: `first()` is only available in a windowed (`mutate()`) context |
Does It mean that it will never be possible to aggregate using first()?
|
You can use mutate instead of summarise, then deal with the (potentially) multiple rows.
|
The
first
function in dbplyr returns whatever expression it's given as a string, instead of returning the first value (which should respect previousgroup_by()
andarrange()
.It was discovered in this Stack Overflow question. That table (and the version I reproduced locally) is PostgreSQL, but the problem can be reproduced in SQLlite:
Returns:
This matches with the result of
show_query()
on the above.It has a similar behavior:
first(blabla)
, or another expression that's not a columngroup_by()
(it can thus be reproduced with justtbl(dbplyr::lahman_sqlite(), "Batting") %>% summarize(first(G))
)The text was updated successfully, but these errors were encountered: