improve spread, gather error message: Computation failed in stat_*(): Each row of output must be identified by a unique combination of keys. #55

Open
guangingmai opened this issue Apr 16, 2020 · 10 comments

Comments

@guangingmai

I want to use ggalluvial on a bar plot, but I get a warning message when I run the following code. Does anyone know how to fix it?

p <- ggplot(data = physeq_phylum, aes(x=sampleid, y=Abundance, alluvium = Phylum, stratum = Phylum))
p + geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) + 
  geom_stratum(aes(fill = Phylum), width = .6) 

Warning message:

## Warning message:
## Computation failed in `stat_alluvium()`:
## Each row of output must be identified by a unique combination of keys.
## Keys are shared for 8 rows:
## * 5, 6
## * 31, 32
## * 51, 52
## * 60, 61
@corybrunson
Owner

Hi @guangingmai, thanks for raising the issue. It's difficult to know exactly what the problem is without a reproducible example. Would you be able to share a subset of the data you're using that produces the same error? Check out the reprex package for how to generate an example.

The error message comes from tidyr::spread(). It is not the most informative, but it has been discussed in this issue thread. Probably you can resolve it by creating a new column of unique row IDs in the data set and passing this new column to alluvium. (Phylum would still be passed to stratum.)
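A minimal sketch of the row-ID workaround suggested above, using a made-up data frame. The `toy` object, its values, and the `row_id` column name are assumptions for illustration; only the `sampleid`, `Phylum`, and `Abundance` column names come from the original post. The plotting step runs only if ggalluvial is installed.

```r
# Toy data: Firmicutes appears twice in sample S1, the kind of duplicate
# that makes (x, alluvium) non-unique when alluvium = Phylum.
toy <- data.frame(
  sampleid  = c("S1", "S1", "S1", "S2", "S2"),
  Phylum    = c("Firmicutes", "Firmicutes", "Bacteroidetes",
                "Firmicutes", "Bacteroidetes"),
  Abundance = c(10, 5, 20, 12, 18)
)

# With Phylum as the alluvium, one (sampleid, Phylum) pair is duplicated.
dup_before <- anyDuplicated(toy[, c("sampleid", "Phylum")]) > 0

# Hypothetical fix: a unique ID per row, to be passed to `alluvium`.
toy$row_id <- seq_len(nrow(toy))
dup_after <- anyDuplicated(toy[, c("sampleid", "row_id")]) > 0

# Build the plot only if ggalluvial is available; Phylum still defines strata.
if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p <- ggplot(toy, aes(x = sampleid, y = Abundance,
                       alluvium = row_id, stratum = Phylum)) +
    geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) +
    geom_stratum(aes(fill = Phylum), width = .6)
}
```

With one ID per row, no ID is shared between samples, so this makes the keys unique but cannot by itself link the two bars. Whether that is acceptable depends on what the flows are meant to represent.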

Please let me know if this doesn't help!

@guangingmai
Author

First of all, thanks for your reply.
I reshaped my dataset and found that it fails when two rows in the same group share a row ID within one column, but works when the row IDs within each group are unique. Why does the former fail?

@corybrunson
Owner

I'm glad you've found a solution, at least! I don't know what the columns contain, so I can't be sure why one version works when the other doesn't. If you can't share your entire data set, see if you can boil it down to a small data set that hits the same problem and that you can share.

@guangingmai
Author

You can try my code where the dataset is stored on the website.

data <- read.table('dataset.txt', header = TRUE)
p <- ggplot(data = data, aes(x=Sample, y=Abundance, alluvium = Phylum, stratum = Phylum))
(p1 <- p + geom_alluvium(aes(fill = Phylum), alpha = .5, width = .6) + 
  geom_stratum(aes(fill = Phylum), width = .6)) 

Warning output:

## Warning message:
## Computation failed in `stat_alluvium()`:
## Each row of output must be identified by a unique combination of keys.
## Keys are shared for 3 rows:
## * 4, 5, 6

@corybrunson
Owner

Could you say in more detail what sort of plot you're trying to produce? Most alluvial plots require three aesthetic specs: x (position along the horizontal axis), stratum (value in the stacked bar chart at each x value), and alluvium (identifier that links these position–value pairs for the same subject or observation). It looks like you've created two stacked bar plots, one for each sample—something that could be done with geom_bar(). What do you want the flows between them to represent?
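To illustrate the three aesthetics just described, here is a small sketch with invented data. The `lodes` data frame and its `subject`, `visit`, and `group` columns are hypothetical names, not taken from anyone's data set in this thread. The point is that each subject appears at most once per axis, so every (x, alluvium) pair is unique.

```r
# Hypothetical long-format data: three subjects, each observed once
# at each of two time points.
lodes <- data.frame(
  subject = rep(1:3, times = 2),             # alluvium: links observations
  visit   = rep(c("t1", "t2"), each = 3),    # x: position on the axis
  group   = c("A", "A", "B", "A", "B", "B")  # stratum: value at each x
)

# Each subject appears at most once per visit, so (x, alluvium) is unique.
counts <- table(lodes$visit, lodes$subject)
all_unique <- all(counts <= 1)

# Build the plot only if ggalluvial is available.
if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p <- ggplot(lodes, aes(x = visit, stratum = group, alluvium = subject)) +
    geom_alluvium(aes(fill = group)) +
    geom_stratum()
}
```

Here subject 2 moves from group A to group B between visits, which is the kind of change a flow between the two stacked bars would show.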

@mfoos

mfoos commented Apr 22, 2020

I hope it's okay if I piggyback here. I am trying to do a similar thing over a timecourse. I have multiple days and (for the reprex) multiple US states reporting some value (pct), but not every state reports every day, so there aren't always alluvia going between consecutive days. I've discovered that something about the shape of the data determines whether this fails or not, but I can't determine what, since the error message about duplicated rows is either misleading, or referring to the data in an in-between stage that is not exposed to me.

The difference between the plots below is just the sampling to generate the fake data. The second plot is exactly the output desired.

library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 3.6.2
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.3
library(ggalluvial)

set.seed(123) # fails
fake_tmp <- data.frame(rowname = 1:20,
                       date = c("Day 1", "Day 2", "Day 3", "Day 4", "Day 5"),
                       pct = rnorm(20, mean = 5, sd = 2),
                       gene = sample(state.abb[1:20], 20, replace = TRUE))
tmp2 <- fake_tmp %>%
  gather(key, stratum, -rowname, -date, -pct)

ggplot(tmp2, aes(x = date, 
                 y = pct,
                 stratum = stratum,
                 alluvium = stratum)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum(aes(fill = stratum)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#> Warning: Computation failed in `stat_alluvium()`:
#> Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 7, 8

# Error refers to rows 7 & 8
tmp2[7:8,]
#>   rowname  date      pct  key stratum
#> 7       7 Day 2 5.921832 gene      GA
#> 8       8 Day 3 2.469878 gene      CT


set.seed(464) # succeeds
fake_tmp <- data.frame(rowname = 1:20,
                       date = c("Day 1", "Day 2", "Day 3", "Day 4", "Day 5"),
                       pct = rnorm(20, mean = 5, sd = 2),
                       gene = sample(state.abb[1:20], 20, replace = TRUE))
tmp2 <- fake_tmp %>%
  gather(key, stratum, -rowname, -date, -pct)

ggplot(tmp2, aes(x = date, 
                       y = pct,
                       stratum = stratum,
                       alluvium = stratum)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum(aes(fill = stratum)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Created on 2020-04-22 by the reprex package (v0.3.0)

@corybrunson
Owner

@mfoos absolutely fine. Thanks for bringing it up.

First, an apology: I have not yet learned how to produce the intelligent and informative warning and error messages of other packages, in particular ggplot2 and its tidyverse siblings. I should probably create an issue and invite help on that.

The error message that identifies rows 7 and 8 in your first example was spit out by tidyr::spread(), which is used internally by to_alluvia_form(), which is in turn used by StatAlluvium$compute_panel(). By the time it's used, though, the data set has been reordered, so the row numbers in the message don't correspond to those of the input data set. It turns out that they refer to two rows with the same values of date and stratum. That is, one state has been measured twice for the same axis. You can identify these directly with this line:

count(tmp2, date, stratum)

Please check back if this doesn't resolve the issue. I'll at least have the next version check for this sort of problem and throw an error earlier, since i still run into the same issue from time to time.
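As a rough sketch of the duplicate check above, the following uses base R on a small stand-in for `tmp2` (the `tmp2_toy` values are invented). It lists only the keys that occur more than once and then shows one possible way to collapse them. Summing `pct` within each key is an assumption about what the plot should show, not a recommendation from the package.

```r
# Stand-in for tmp2 (values invented): GA is reported twice on Day 2.
tmp2_toy <- data.frame(
  date    = c("Day 1", "Day 2", "Day 2", "Day 3"),
  stratum = c("GA", "GA", "GA", "CT"),
  pct     = c(4, 5, 3, 2)
)

# Base-R analogue of count(tmp2, date, stratum): rows per (date, stratum).
n_per_key <- aggregate(n ~ date + stratum,
                       data = transform(tmp2_toy, n = 1L),
                       FUN = sum)

# Keep only the keys that would trip the uniqueness check.
dups <- n_per_key[n_per_key$n > 1, ]

# One possible fix (an assumption): sum pct within each duplicated key.
tmp2_agg <- aggregate(pct ~ date + stratum, data = tmp2_toy, FUN = sum)
```

After aggregation each (date, stratum) pair occurs once, so `tmp2_agg` can be passed to `ggplot()` in place of the original data.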

@mfoos

mfoos commented Apr 22, 2020

awesome awesome awesome, this is super helpful, thank you!

@Andreas-Bio

@corybrunson The spread() function has the "retired" lifecycle status.

Quote: "Development on spread() is complete, and for new code we recommend switching to pivot_wider()"

@corybrunson
Owner

@andzandz11 thanks for mentioning this. A future major release, probably the one after next, will indeed replace gather() and spread() with pivot_longer() and pivot_wider(). The switch is underway, and the release will include some new features that the switch enables; check out the pivot and pivot-params branches if you're interested. Meanwhile, the retired functions will remain exported in tidyr, so i haven't bumped the switch up to the next release (the devel and devel-parsimony branches).

@corybrunson changed the title from "Computation failed in stat_alluvium()" to "improve spread, gather error message: Computation failed in stat_*(): Each row of output must be identified by a unique combination of keys." on Jun 15, 2020
4 participants