Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding matrix completion in rclr/rAitchison #570

Open
antagomir opened this issue Apr 2, 2023 · 17 comments
Open

Adding matrix completion in rclr/rAitchison #570

antagomir opened this issue Apr 2, 2023 · 17 comments

Comments

@antagomir
Copy link
Contributor

antagomir commented Apr 2, 2023

The issue #433 has received request (from @johannesbjork) to add robust matrix completion in the robust CLR (rclr) method in the vegan::decostand function, following the procedure in Martino et al. (2019). I agree with this suggestion.

The R package ROptSpace includes an implementation of the OptSpace robust matrix completion algorithm used by Martino et al. (2019). There are many variants of matrix completion for missing values but this solution has been recently promoted in microbial ecology, and I am not immediately aware of other solutions that should be prioritized over this one.

It would be easy to add matrix completion in the rclr transformation in vegan::decostand by using the ROptSpace functionality. However, this would introduce a new dependency in vegan, and this may not be ideal.

Fortunately, it seems that the ROptSpace R package has been released with the MIT license and vegan with GPL-2. License compatibility allows including MIT licensed code in GPL-2. It seems possible to add this function in vegan and cite the original source, then use this to add the matrix completion step in decostand rclr method .

I can implement this change and open a PR but would like to cross-check first if the vegan developers?

Edit: another solution is to use R function filling::fill.OptSpace but the license is here MIT as well, so it doesn't really change the discussion; Edit2: ROptSpace is primary since "filling" uses it as dependency.

@handibles
Copy link

Hey Devs & co.,

Thanks as ever for the hard work. Came across this issue today (#433, #570) - as stands decostand(x, "rclr") replaces the 0 values with the average abundance within that sample (line 191), as matrix completion is not implemented.

This seems fairly dangerous given that there's no warning in the output or documentation - the average vegan user, being trusting and kind, would have no reason to suspect their data was suddenly a strange, non-sparse beast.

Could a note be added in the meantime to flag this?... I could mock up a PR or similar, but not sure warranted.

@antagomir
Copy link
Contributor Author

Thanks @handibles - I agree, and planning to go through this during the Xmas vacations if the situation allows. If there are further bottlenecks we should add a warning.

@handibles
Copy link

Gentle post-yule pre-student bump.

@antagomir
Copy link
Contributor Author

Yes, almost getting there..! Just a bit more time.

@LukeLikesDirt
Copy link

Hi antagomir, and thanks for incorporating and developing the robust Aitichson method within the vegdist function. I appreciate your work! I have hesitated to apply the robust Aitichson in vegdist, as I observed a decrease in the explained variation in my models without matrix completion, but I prefer to keep my workflow in R. Have you been able to incorporate the matrix completion algorithm? I'm keen to give it a whirl!
Cheers
Luke

@antagomir
Copy link
Contributor Author

If I am lucky I might be able to do it next week.

@antagomir
Copy link
Contributor Author

antagomir commented Feb 3, 2024

Ok @LukeLikesDirt , experimental version for the matrix completion is now available for testing and feedback at https://github.com/antagomir/vegan.

I am preparing a PR after doing the following:

  • check the relevant documentation and references for decostand and vegdist Rd files
  • polish coding style in decostand.R
  • make sure ROptSpace code is properly acknowledged

To get started testing the code:

devtools::install_github("antagomir/vegan")
library(vegan)

# Test data
set.seed(252)
testdata <- matrix(round(runif(1000, 0, 100)), nrow=20)
testdata <- testdata - 50
testdata[testdata < 0] <- 0
rownames(testdata) <- paste0("row", seq_len(nrow(testdata)))
colnames(testdata) <- paste0("col", seq_len(ncol(testdata)))

# Aitchison equals to CLR + Euclid (pseudocount is necessary with clr)
a1 <- vegan::vegdist(testdata+1, method = "aitchison")
a2 <- vegan::vegdist(vegan::decostand(testdata+1, "clr"), method = "euclidean")
max(abs(a1-a2)) < 1e-6 # Tolerance

# Robust aitchison equals to rCLR + Euclid
# and works without pseudocount
a1 <- vegan::vegdist(testdata, method = "robust.aitchison")
a2 <- vegan::vegdist(vegan::decostand(testdata, "rclr"), method = "euclidean")
max(abs(a1-a2)) < 1e-6 # Tolerance

# Robust aitchison and aitchison are equal when there are no zeroes
a1 <- vegan::vegdist(testdata+1, method = "robust.aitchison")
a2 <- vegan::vegdist(testdata+1, method = "aitchison")
max(abs(a1-a2)) < 1e-6 # Tolerance

@LukeLikesDirt
Copy link

Hi @antagomir

Thanks for getting to this so quickly, and apologies for taking so long to get back to you.

Unfortunately, I can't provide feedback because I can't download your VEGAN version due to some kind of Gfortran comparability error. I have Mac Monterey and Gfortran v12.1 for intel (the latest version for Monterey). Not asking you to troubleshoot my issue, but I tried the older versions of Gfortran but still failed to download your version of VEGAN. I have limited coding ability - I'm an ecologist - and it looks like I'll need to wait until the update to go live before testing.

Thank you for your work. Much appreciated.
Luke

@antagomir
Copy link
Contributor Author

Noprob. I can try to check and finalize this soonish.

Are you able to specify which vegan version you had problems with? Just for my checking purposes.

@LukeLikesDirt
Copy link

The initial issue I had was in early 2022, when I was running R 4.0.something, and a version of VEGAN that I would have installed during late 2021 (sorry for the vague response). I was comparing my results from VEGAN to those of DEICODE and noted a stark variation in my results - results based on DEICODE had greater power compared to those from VEGAN. I then manually computed robust Aichison distances using rCLR + Euclid and realised that VAGAN was not implementing the matrix completion step, at which point I returned to using DEICODE.

Looking forward to the new release!

Luke

@antagomir
Copy link
Contributor Author

Ok I have some DEICODE installation issues on 3 different systems it seems but would be happy to hear if anyone has the chance to run comparisons (while I continue to troubleshoot that one..)

@LauGuaUuu
Copy link

Thank you for this so interesting chat and issue. As @LukeLikesDirt , I have been trying to obtain similar transformed matrices as those obtained with DEICODE (so RCLR + Matrix completion) in R, but seems like those with vegan in R do not perform the Matrix completion, and seem to provide less robust outcomes. In cross talks, also people working with DEICODE in QIIME2 were trying to obtain the RCLR+Matrix transformed data from the DEICODE pipeline, with indeed only provide as outputs the ordinationa the distance matrix, but not the transformed data applying RCLR + Matrix completion, see issues: biocore/DEICODE#59 and biocore/DEICODE#51.

I am wondering if any has at date gotten this wished matrix transform type (RCLR+Matrix Completion) resolved to be obtained in a straightforward way, for those not bioinformatic programers... I have tried both the R options here and those in the issue mentioned before. But, I am not sure I am doing things right.

In R: I installed Vegan 2.6.5, and still get diverse results, and less robust as with DEICODE. But, I am not sure if my functions are doing/calling the matrix completion, as if I copy and run the new scripts provided by @antagomir in the revised scritps for the function "vegan::decostand(otu_table(phy), "rclr")", I am not sure how to force vegan to call those functions with Matrix completion with the same names you provided, and not the ones coming in the original package vegan version 2.6.5. Not sure if I can explain well... as said, I am not a bioinformatics programer...

Same with trying to obtain the RCLR+Matrix completion on the DEICODE pipeline: Here, even if I substitute the rpca.py file by the one revised in Github which provides RCLR+matrix completion output object, I do not get this to work...

So, is there a way to obtain, either with R or with QIIME2 this RCLR+Matrix Completion transformed matrix data?...

Thank you and happy day!

LAU

@Ivy-ops
Copy link

Ivy-ops commented Oct 10, 2024

@antagomir Is this "robust.aitchison" distance suitable for humann3(https://huttenhower.sph.harvard.edu/humann/) output data, the CPM(counts per million reads) normalized data? Thank you!

@antagomir
Copy link
Contributor Author

Hi @LauGuaUuu and @LukeLikesDirt - these should now be solved in PR #667 , also thanks to great help from @cameronmartino . The PR is pending for review and these updates should become available through the pkg if the PR is accepted.

@antagomir
Copy link
Contributor Author

@Ivy-ops I think this should be discussed elsewhere but yes, in general these are applicable to relative abundance data.

@LukeLikesDirt
Copy link

Thanks for your work and for keeping me updated Leo! I am looking forward to giving this a whirl.

Cheers
Luke

@handibles
Copy link

Looking forward to whirling this also! Thanks as ever @antagomir .

Re using it on HUMAnN as per above:

@antagomir Is this "robust.aitchison" distance suitable for humann3(https://huttenhower.sph.harvard.edu/humann/) output data, the CPM(counts per million reads) normalized data? Thank you!
@Ivy-ops I think this should be discussed elsewhere but yes, in general these are applicable to relative abundance data.

As:

  1. pathways are reconstructed based on whether the genes required are present or not. Because the same gene can contribute to many pathways, these abundances are not mutually exclusive and can't sum to 100% in the same way as a mix of taxa.
  2. pathway abundances are further based on the abundance of the least abundant gene in that pathway. This would also not be constrained by a simplex.

, had thought that compositional analysis would not be suitable for pathway abundances (genes should be okay but those tables are exhaustive and often uninformative). However, the HUMAnN community regularly use CLR transforms for pathways 1 2, and can see some logic for it.

As per @antagomir this could do with discussion elsewhere, but if using CLR on the pathway abundances, I would make sure that you're using the right features (i.e. exclude UNINTEGRATED and UNMAPPED, then use either the community- (e.g. PWY0-1301: melibiose degradation 57.5) or species-level (PWY0-1301: melibiose degradation | ...) abundances when calculating CLR - not both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants