Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first() doesn't work in summaries #129

Closed
dgrtwo opened this issue Jul 17, 2018 · 7 comments
Closed

first() doesn't work in summaries #129

dgrtwo opened this issue Jul 17, 2018 · 7 comments
Labels
bug an unexpected problem or unintended behavior func trans 🌍 Translation of individual functions to SQL
Milestone

Comments

@dgrtwo
Copy link
Member

dgrtwo commented Jul 17, 2018

The first function in dbplyr returns whatever expression it's given as a string, instead of returning the first value (which should respect previous group_by() and arrange().

It was discovered in this Stack Overflow question. That table (and the version I reproduced locally) is PostgreSQL, but the problem can be reproduced in SQLlite:

library(dplyr)

batting <- tbl(dbplyr::lahman_sqlite(), "Batting")

tbl(dbplyr::lahman_sqlite(), "Batting") %>%
  arrange(yearID) %>%
  group_by(playerID) %>%
  summarize(first = first(G))

Returns:

# Source:   lazy query [?? x 2]
# Database: sqlite 3.22.0 [/var/folders/8p/xzrrqphx2qb3d2s_fgqrk5xr0000gn/T//Rtmpe0JeOa/lahman.sqlite]
   playerID  first
   <chr>     <chr>
 1 aardsda01 `G`  
 2 aaronha01 `G`  
 3 aaronto01 `G`  
 4 aasedo01  `G`  
 5 abadan01  `G`  
 6 abadfe01  `G`  
 7 abadijo01 `G`  
 8 abbated01 `G`  
 9 abbeybe01 `G`  
10 abbeych01 `G`  
# ... with more rows

This matches with the result of show_query() on the above.

<SQL>
SELECT `playerID`, '`G`' AS `first`
FROM (SELECT *
FROM `Batting`
ORDER BY `yearID`)
GROUP BY `playerID`

It has a similar behavior:

  • If it's given first(blabla), or another expression that's not a column
  • If it's done without an arrange or a group_by() (it can thus be reproduced with just tbl(dbplyr::lahman_sqlite(), "Batting") %>% summarize(first(G)))
@karldw
Copy link
Contributor

karldw commented Aug 10, 2018

Closely related: tidyverse/dplyr#2290

@javierluraschi
Copy link
Contributor

Also reported in sparklyr under sparklyr/sparklyr#1504. sparklyr reprex:

library(sparklyr)

sc <- spark_connect(mater = "local")
batting_tbl <- copy_to(sc, Lahman::Batting)

batting_tbl %>% head() %>%
  group_by(yearID) %>%
  summarise(meansting = mean(stint), firstname = first(playerID)) %>%
  dbplyr::sql_render()
<SQL> SELECT `yearID`, AVG(`stint`) AS `meansting`, '`playerID`' AS `firstname`
FROM (SELECT *
FROM `sparklyr_88f216c7aaf6`
LIMIT 6) `ypzthwdkmo`
GROUP BY `yearID`

@edgararuiz-zz
Copy link
Contributor

I thinks this is because first is considered a window function, activated by mutate(), as oppossed to an grouped/aggregate function, activated by summarize. I post two examples, one with first and the other with max:

library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  mutate(x = first(playerID)) %>%
  select(teamID, x)
t
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#> # Groups:   teamID
#>    teamID x        
#>    <chr>  <chr>    
#>  1 BS1    barnero01
#>  2 BS1    barnero01
#>  3 BS1    barnero01
#>  4 BS1    barnero01
#>  5 BS1    barnero01
#>  6 BS1    barnero01
#>  7 BS1    barnero01
#>  8 BS1    barnero01
#>  9 BS1    barnero01
#> 10 CH1    brannmi01
#> # ... with more rows
show_query(t)
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `playerID`, `yearID`, `stint`, `teamID`, `lgID`, `G`, `AB`, `R`, `H`, `X2B`, `X3B`, `HR`, `RBI`, `SB`, `CS`, `BB`, `SO`, `IBB`, `HBP`, `SH`, `SF`, `GIDP`, first_value(`playerID`) OVER (PARTITION BY `teamID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_35d8306e463f`
#> LIMIT 100) `jyepziefva`) `bhsbzffqdj`
spark_disconnect(sc)

If you try with max(), it will return one row per group, which is what I usually do in MS SQL

library(sparklyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  summarise(x = max(playerID)) %>%
  select(teamID, x)
t
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#>   teamID x        
#>   <chr>  <chr>    
#> 1 BS1    spaldal01
#> 2 CH1    simmojo01
#> 3 CL1    questjo01
#> 4 FW1    selmafr01
#> 5 NY2    startjo01
#> 6 PH1    senseco01
#> 7 RC1    sagerpo01
#> 8 TRO    pikeli01 
#> 9 WS3    nortofr01
show_query(t)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `teamID`, MAX(`playerID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_1d1870e96305`
#> LIMIT 100) `gpdizhnmrf`
#> GROUP BY `teamID`) `mgevjkueih`
spark_disconnect(sc)

Created on 2018-09-20 by the reprex package (v0.2.0).

@hadley hadley added bug an unexpected problem or unintended behavior func trans 🌍 Translation of individual functions to SQL labels Jan 2, 2019
@hadley hadley added this to the v1.4.0 milestone Jan 9, 2019
@hadley
Copy link
Member

hadley commented Jan 10, 2019

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_djljnlxwsc`
mf %>% summarise(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, '`x`' AS `f`
#> FROM `dbplyr_djljnlxwsc`
#> GROUP BY `g`

The source of the bug appears to be that first appears in the window functions, but not in the aggregate functions, so somehow dplyr::first() is getting called locally. I can fix that.

However, this is only going to deliver a better error message because in SQL FIRST_VALUE() is only window function, not also an aggregation function as it is R (because, I think, it depends on a notion of order that does not exist in SQL, except in windows).

@hadley
Copy link
Member

hadley commented Jan 10, 2019

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
mf %>% summarise(f = first(x)) %>% show_query()
#> Error: `first()` is only available in a windowed (`mutate()`) context

@hadley hadley closed this as completed in d4d8f2a Jan 10, 2019
@Athospd
Copy link

Athospd commented Jun 28, 2020

Does It mean that it will never be possible to aggregate using first()?
e.g.

#> <SQL>
#> SELECT `h`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
#> GROUP BY `h`

@MarkMc1089
Copy link

Does It mean that it will never be possible to aggregate using first()? e.g.

#> <SQL>
#> SELECT `h`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
#> GROUP BY `h`

You can use mutate instead of summarise, then deal with the (potentially) multiple rows.

test_data <- tribble(
  ~ID, ~YEAR, ~TERM, ~EVENT_CODE,
  1, 2020, 1, 1,
  1, 2020, 2, 2,
  1, 2020, 3, 2,
  1, 2021, 1, 2,
  1, 2021, 2, 2,
  1, 2021, 3, 2,
  2, 2020, 1, 3,
  2, 2020, 2, 3,
  2, 2020, 3, 3
)

first_and_last <- test_data %>%
  # Would need window order if connecting to database
  # window_order(
  #   ID,
  #   YEAR,
  #   TERM
  # ) %>% 
  mutate(
    FIRST_EVENT = first(EVENT_CODE),
    LAST_EVENT  = last(EVENT_CODE),
    .by = c(ID, YEAR)
  )

# ID  YEAR  TERM EVENT_CODE FIRST_EVENT LAST_EVENT
# <dbl> <dbl> <dbl>      <dbl>       <dbl>      <dbl>
# 1     1  2020     1          1           1          2
# 2     1  2020     2          2           1          2
# 3     1  2020     3          2           1          2
# 4     1  2021     1          2           2          2
# 5     1  2021     2          2           2          2
# 6     1  2021     3          2           2          2
# 7     2  2020     1          3           3          3
# 8     2  2020     2          3           3          3
# 9     2  2020     3          3           3          3

first_and_last <- first_and_last %>% 
  slice_min(
    TERM,
    by = c(ID, YEAR)
  )

# ID  YEAR  TERM EVENT_CODE FIRST_EVENT LAST_EVENT
# <dbl> <dbl> <dbl>      <dbl>       <dbl>      <dbl>
# 1     1  2020     1          1           1          2
# 2     1  2021     1          2           2          2
# 3     2  2020     1          3           3          3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior func trans 🌍 Translation of individual functions to SQL
Projects
None yet
Development

No branches or pull requests

7 participants