improve spread, gather error message: Computation failed in `stat_*()`: Each row of output must be identified by a unique combination of keys. #55

guangingmai · 2020-04-16T09:32:01Z

I want to use ggalluvial on a bar plot, but I get a warning message when I run the following code. Does anyone know how to fix it?

p <- ggplot(data = physeq_phylum, aes(x=sampleid, y=Abundance, alluvium = Phylum, stratum = Phylum))
p + geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) + 
  geom_stratum(aes(fill = Phylum), width = .6)

Warning message:

## Warning message:
## Computation failed in `stat_alluvium()`:
## Each row of output must be identified by a unique combination of keys.
## Keys are shared for 8 rows:
## * 5, 6
## * 31, 32
## * 51, 52
## * 60, 61

corybrunson · 2020-04-17T16:27:16Z

Hi @guangingmai, thanks for raising the issue. It's difficult to know exactly what the problem is without a reproducible example. Would you be able to share a subset of the data you're using that produces the same error? Check out the reprex package for how to generate an example.

The error message comes from tidyr::spread(). It is not the most informative, but it has been discussed in this issue thread. Probably you can resolve it by creating a new column of unique row IDs in the data set and passing this new column to alluvium. (Phylum would still be passed to stratum.)

Please let me know if this doesn't help!
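
The row-ID suggestion above could be sketched as follows. This is only an illustration: `physeq_toy` is a made-up stand-in for the poster's `physeq_phylum` (which was not shared), and `row_id` is a hypothetical column name.

```r
# Hypothetical stand-in data: Firmicutes appears twice in sample S1,
# so (sampleid, Phylum) does not identify a row uniquely.
physeq_toy <- data.frame(
  sampleid  = c("S1", "S1", "S1", "S2", "S2"),
  Phylum    = c("Firmicutes", "Firmicutes", "Bacteroidetes",
                "Firmicutes", "Bacteroidetes"),
  Abundance = c(0.2, 0.3, 0.5, 0.6, 0.4)
)

# A non-zero result means at least one (sampleid, Phylum) pair is repeated
anyDuplicated(physeq_toy[c("sampleid", "Phylum")])

# Add a column of unique row IDs (hypothetical name) to pass to `alluvium`
physeq_toy$row_id <- seq_len(nrow(physeq_toy))

# With row IDs as the alluvium key, every (sampleid, row_id) pair is unique
anyDuplicated(physeq_toy[c("sampleid", "row_id")])

# The aesthetics would then be specified along these lines:
# aes(x = sampleid, y = Abundance, alluvium = row_id, stratum = Phylum)
```

Note that with fully unique IDs no alluvium appears at more than one sample, so this removes the duplicate keys but does not by itself link the samples.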

guangingmai · 2020-04-18T09:33:26Z

First of all, thanks for your reply.
I reshaped my dataset and found that it does not work when one group has two identical row IDs in one column; it works only when the row IDs within a group are unique. Why does the former not work?

corybrunson · 2020-04-18T10:10:45Z

I'm glad you've found a solution, at least! I don't know what the columns contain, so I can't be sure why one version works when the other doesn't. If you can't share your entire data set, see if you can boil it down to a small data set that hits the same problem and that you can share.

guangingmai · 2020-04-18T12:52:18Z

You can try my code where the dataset is stored on the website.

data <- read.table('dataset.txt', header = TRUE)
p <- ggplot(data = data, aes(x=Sample, y=Abundance, alluvium = Phylum, stratum = Phylum))
(p1 <- p + geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) + 
  geom_stratum(aes(fill = Phylum), width = .6))

Warning output:

## Warning message:
## Computation failed in `stat_alluvium()`:
## Each row of output must be identified by a unique combination of keys.
## Keys are shared for 3 rows:
## * 4, 5, 6

corybrunson · 2020-04-18T18:48:14Z

Could you say in more detail what sort of plot you're trying to produce? Most alluvial plots require three aesthetic specs: x (position along the horizontal axis), stratum (value in the stacked bar chart at each x value), and alluvium (identifier that links these position–value pairs for the same subject or observation). It looks like you've created two stacked bar plots, one for each sample—something that could be done with geom_bar(). What do you want the flows between them to represent?
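
To make the three aesthetics concrete, here is a minimal sketch on invented data. The data frame `toy` and its columns `subject`, `visit`, and `response` are hypothetical and not taken from the thread. Each subject is observed once per visit, so the alluvium identifier links exactly one row per axis.

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical long-format data: 4 subjects, each observed at 2 visits
toy <- data.frame(
  subject  = rep(1:4, times = 2),
  visit    = factor(rep(c("before", "after"), each = 4),
                    levels = c("before", "after")),
  response = c("A", "A", "B", "B",   # responses at "before"
               "A", "B", "B", "B")   # responses at "after"
)

# x        = position along the horizontal axis
# stratum  = value shown in the stacked bar at each x
# alluvium = identifier linking the same subject across x values
p <- ggplot(toy, aes(x = visit, stratum = response, alluvium = subject)) +
  geom_flow(aes(fill = response)) +
  geom_stratum(aes(fill = response))
p
```

Because each (`subject`, `visit`) pair occurs only once, the reshaping step inside the stat has unique keys to work with.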

mfoos · 2020-04-22T15:03:28Z

I hope it's okay if I piggyback here. I am trying to do a similar thing over a timecourse. I have multiple days and (for the reprex) multiple US states reporting some value (pct), but not every state reports every day, so there aren't always alluvia going between consecutive days. I've discovered that something about the shape of the data determines whether this fails or not, but I can't determine what, since the error message about duplicated rows is either misleading, or referring to the data in an in-between stage that is not exposed to me.

The difference between the plots below is just the sampling to generate the fake data. The second plot is exactly the output desired.

library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 3.6.2
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.3
library(ggalluvial)

set.seed(123) # fails
fake_tmp <- data.frame(rowname = 1:20,
                       date = c("Day 1", "Day 2", "Day 3", "Day 4", "Day 5"),
                       pct = rnorm(20, mean = 5, sd = 2),
                       gene = sample(state.abb[1:20], 20, replace = TRUE))
tmp2 <- fake_tmp %>%
  gather(key, stratum, -rowname, -date, -pct)

ggplot(tmp2, aes(x = date, 
                 y = pct,
                 stratum = stratum,
                 alluvium = stratum)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum(aes(fill = stratum)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#> Warning: Computation failed in `stat_alluvium()`:
#> Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 7, 8

# Error refers to rows 7 & 8
tmp2[7:8,]
#>   rowname  date      pct  key stratum
#> 7       7 Day 2 5.921832 gene      GA
#> 8       8 Day 3 2.469878 gene      CT


set.seed(464) # succeeds
fake_tmp <- data.frame(rowname = 1:20,
                       date = c("Day 1", "Day 2", "Day 3", "Day 4", "Day 5"),
                       pct = rnorm(20, mean = 5, sd = 2),
                       gene = sample(state.abb[1:20], 20, replace = TRUE))
tmp2 <- fake_tmp %>%
  gather(key, stratum, -rowname, -date, -pct)

ggplot(tmp2, aes(x = date, 
                       y = pct,
                       stratum = stratum,
                       alluvium = stratum)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum(aes(fill = stratum)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Created on 2020-04-22 by the reprex package (v0.3.0)

corybrunson · 2020-04-22T16:39:42Z

@mfoos absolutely fine. Thanks for bringing it up.

First, an apology: I have not yet learned how to produce the intelligent and informative warning and error messages of other packages, in particular ggplot2 and its tidyverse siblings. I should probably create an issue and invite help on that.

The error message that identifies rows 7 and 8 in your first example was spit out by tidyr::spread(), which is used internally by to_alluvia_form(), which is in turn used by StatAlluvium$compute_panel(). By the time it's used, though, the data set has been reordered, so the row numbers in the message don't correspond to those of the input data set. It turns out that they refer to two rows with the same values of date and stratum. That is, one state has been measured twice for the same axis. You can identify these directly with this line:

count(tmp2, date, stratum)

Please check back if this doesn't resolve the issue. I'll at least have the next version check for this sort of problem and throw an error earlier, since I still run into the same issue from time to time.
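
The `count()` check above can be narrowed to show only the offending combinations. The sketch below uses a small invented data frame, `dat`, in place of `tmp2`. The final step, collapsing duplicates by summing `pct`, is just one possible remedy and an assumption here; whether summing is appropriate depends on what the values mean.

```r
library(dplyr)

# Hypothetical data: GA is reported twice on Day 2
dat <- data.frame(
  date    = c("Day 1", "Day 1", "Day 2", "Day 2", "Day 2"),
  stratum = c("GA", "CT", "GA", "GA", "CT"),
  pct     = c(5.1, 4.2, 5.9, 2.5, 6.3)
)

# Keep only the (date, stratum) combinations that occur more than once
dupes <- dat %>%
  count(date, stratum) %>%
  filter(n > 1)
dupes

# One possible remedy (an assumption, not the package's behavior):
# collapse each duplicated combination into a single row by summing pct
collapsed <- dat %>%
  group_by(date, stratum) %>%
  summarise(pct = sum(pct)) %>%
  ungroup()
```

After collapsing, each `date` and `stratum` pair identifies exactly one row, which is the condition the error message is complaining about.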

mfoos · 2020-04-22T16:49:29Z

awesome awesome awesome, this is super helpful, thank you!

Andreas-Bio · 2020-05-03T17:49:35Z

@corybrunson The spread function is "Retired lifecycle".

Quote: "Development on spread() is complete, and for new code we recommend switching to pivot_wider()"

corybrunson · 2020-05-03T19:57:11Z

@andzandz11 thanks for mentioning this. A future major release, probably the one after next, will indeed replace gather() and spread() with pivot_longer() and pivot_wider(). The switch is underway, and the release will include some new features that the switch enables; check out the pivot and pivot-params branches if you're interested. Meanwhile, the retired functions will remain exported in tidyr, so I haven't bumped the switch up to the next release (the devel and devel-parsimony branches).
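
For readers unfamiliar with the newer functions, here is a rough sketch of how `pivot_wider()` can handle the duplicate keys that make `spread()` fail. It is not ggalluvial's code. The data frame `long` is invented, and the example assumes tidyr 1.1.0 or later, where `values_fn` accepts a single function.

```r
library(tidyr)

# Hypothetical long data: id 1 has two values for key "a",
# the situation that triggers the "unique combination of keys" error in spread()
long <- data.frame(
  id    = c(1, 1, 2),
  key   = c("a", "a", "b"),
  value = c(1, 2, 3)
)

# values_fn states explicitly how to combine duplicated cells (here: sum)
wide <- pivot_wider(long,
                    names_from  = key,
                    values_from = value,
                    values_fn   = sum)
wide
```

Summing is only one choice of aggregation. The point is that the caller decides explicitly what happens to duplicated keys instead of getting an error from deep inside the reshaping step.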

corybrunson changed the title ~~Computation failed in stat_alluvium()~~ improve spread, gather error message: Computation failed in stat_*(): Each row of output must be identified by a unique combination of keys. Jun 15, 2020

corybrunson added the documentation label Jul 23, 2020
