
eqtl analysis problem - decrease running time and memory burden - running mashr by chromosome #127

Open
jke20 opened this issue Aug 5, 2024 · 6 comments

jke20 commented Aug 5, 2024

Hi dear authors, thank you so much for developing mashr.
Recently, I have been trying to apply mashr to my eQTL pipeline outputs to discover tissue-specific and tissue-shared effects (the conditions here are different tissues in the human brain).
As you know, there are many gene-variant pairs, and our study includes over 100 brain tissues. To reduce the computational burden, can I run mashr chromosome by chromosome while reusing the same covariance matrices (estimated from a strong matrix that takes the most significant eQTL for each gene across all chromosomes)? I don't know how the final results would be affected if I did that.
Thank you for your help in advance!

pcarbo (Member) commented Aug 6, 2024

@jke20 Thanks for your feedback. Could you tell us a little bit more about the inputs you are providing to mash? If I understand correctly, your Bhat is roughly 10,000 x 100 (one row for each gene, one column for each brain tissue)?

jke20 (Author) commented Aug 6, 2024

@pcarbo thank you for the reply. The matrix is 200,000,000 × 100: rows are gene-variant pairs and columns are tissues.

pcarbo (Member) commented Aug 6, 2024

@jke20 Potentially you could fit the mash model on a random subset of the gene-variant pairs, then rerun mash a second time with fixg = TRUE on each chromosome, for example; see help(mash) for details.
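
To make this concrete, here is a minimal sketch of that two-step approach (Bhat, Shat, and chrom are placeholder names for your effect estimates, standard errors, and per-row chromosome labels; the subset size is arbitrary):

library(mashr)

# Step 1: fit the mash model on a random subset of the gene-variant pairs.
subset = sample(nrow(Bhat), 1e6)
data.random = mash_set_data(Bhat[subset, ], Shat[subset, ])
U.c = cov_canonical(data.random)
m = mash(data.random, Ulist = U.c, outputlevel = 1)

# Step 2: apply the fitted model to each chromosome with fixg = TRUE.
g = get_fitted_g(m)
out.by.chr = lapply(split(seq_len(nrow(Bhat)), chrom), function(i) {
  data.chr = mash_set_data(Bhat[i, ], Shat[i, ])
  mash(data.chr, g = g, fixg = TRUE)
})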

jke20 (Author) commented Aug 28, 2024

Hi, thank you very much for the help; I think mashr is running nicely now.
Here is a follow-up question.
Below I run mashr with two types of covariances:

# Data-driven covariances: initialize with PCA, then refine with
# extreme deconvolution.
U.pca = cov_pca(data.strong, 5)
U.ed = cov_ed(data.strong, U.pca)
# Canonical covariances.
U.c = cov_canonical(data.random)
# Fit the mash model (estimate the mixture weights) on the random subset.
m = mash(data.random, Ulist = c(U.ed, U.c), outputlevel = 1)
# Rerun mash on the strong subset with the fitted model.
m2 = mash(data.strong, g = get_fitted_g(m), fixg = TRUE)

I wonder what the difference is between the results above and the results if I run mash with only one type of covariance (e.g., m = mash(data.random, Ulist = U.ed, outputlevel = 1)). Thanks!

surbut (Collaborator) commented Aug 28, 2024 via email

pcarbo (Member) commented Aug 28, 2024

Thanks, Sarah.

Just to add to what Sarah said: in general, mash will be faster with fewer matrices, but more matrices give you more flexibility to model different sharing patterns, so there is a tradeoff. In practice, as Sarah said, the data-driven matrices (U.ed) in your code are more adaptable, so Ulist = U.ed could be a convenient (i.e., slightly faster) option.
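
One way to see what the extra matrices buy you (a sketch using mashr's get_loglik on the objects defined in your code above) is to compare the fit of the two models on the same random subset:

m.ed = mash(data.random, Ulist = U.ed, outputlevel = 1)
m.both = mash(data.random, Ulist = c(U.ed, U.c), outputlevel = 1)
# A higher log-likelihood for m.both suggests the canonical matrices
# capture sharing patterns that U.ed alone misses.
print(get_loglik(m.both) - get_loglik(m.ed))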
