Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parallel implementation #31

Merged
merged 3 commits into from
Jan 30, 2023
Merged

parallel implementation #31

merged 3 commits into from
Jan 30, 2023

Conversation

jiwen90
Copy link
Contributor

@jiwen90 jiwen90 commented Jan 28, 2023

This resolves #6 and #17.

@jackbibby1
Copy link
Owner

Hi. Thanks for this -- have been meaning to implement some parallelisation of the pathways, but haven't had the time yet. I'm currently getting an error when I run this on my machine, using the naive_cd4.rds dataset from this tutorial.

df <- readRDS("naive_cd4.rds")

pathways <- msigdbr("Homo sapiens", "H") %>%
  format_pathways()

p1 <- seurat_extract(df, meta1 = "Hour", value_meta1 = "0")
p2 <- seurat_extract(df, meta1 = "Hour", value_meta1 = "24")

scpa_out <- compare_pathways_parallel(samples = list(p1, p2), pathways = pathways, cores = 2)

--------

Cell numbers in population 1 = 4428
Cell numbers in population 2 = 5919
- If greater than 500 cells, these populations will be downsampled

All 50 pathways passed the min/max genes threshold

Calculating pathway fold changes...

Performing a two-sample analysis with SCPA...
Error in { : task 1 failed - "multi-argument returns are not permitted"

Is this running fine on your end?

Jack

@jiwen90
Copy link
Contributor Author

jiwen90 commented Jan 30, 2023

Hi,

It works for me. The "multi-argument returns are not permitted" is likely due to the tryCatch error message, which I just removed. I'm not sure what error was thrown on your end -- can you run it again, optionally without the tryCatch, to see what's the problem?

> df <- readRDS("naive_cd4.rds")

> pathways <- msigdbr("Homo sapiens", "H") %>%
  format_pathways()

> p1 <- seurat_extract(df, meta1 = "Hour", value_meta1 = "0")
> p2 <- seurat_extract(df, meta1 = "Hour", value_meta1 = "24")

> scpa_out <- compare_pathways_parallel(samples = list(p1, p2), pathways = pathways, cores = 2)

Cell numbers in population 1 = 4428
Cell numbers in population 2 = 5919
- If greater than 500 cells, these populations will be downsampled

All 50 pathways passed the min/max genes threshold

Calculating pathway fold changes...

Performing a two-sample analysis with SCPA...
Loading required package: doParallel
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel

> scpa_out
                                      Pathway          Pval       adjPval
1          HALLMARK_INTERFERON_GAMMA_RESPONSE 5.341958e-106 2.670979e-104
2                HALLMARK_ALLOGRAFT_REJECTION 1.253504e-102 6.267519e-101
3          HALLMARK_INTERFERON_ALPHA_RESPONSE  2.589021e-99  1.294511e-97
4              HALLMARK_INFLAMMATORY_RESPONSE  1.061001e-89  5.305003e-88
5            HALLMARK_TNFA_SIGNALING_VIA_NFKB  3.796084e-88  1.898042e-86
6            HALLMARK_IL6_JAK_STAT3_SIGNALING  4.522036e-82  2.261018e-80
7                     HALLMARK_MYC_TARGETS_V1  4.522036e-82  2.261018e-80
8                     HALLMARK_UV_RESPONSE_UP  1.379483e-80  6.897414e-79
9                          HALLMARK_APOPTOSIS  4.076176e-79  2.038088e-77
10                        HALLMARK_COMPLEMENT  3.234390e-76  1.617195e-74
...

@jackbibby1
Copy link
Owner

Hmm, odd. New error, but this time:

df <- readRDS("naive_cd4.rds")

pathways <- msigdbr("Homo sapiens", "H") %>%
  format_pathways()

p1 <- seurat_extract(df, meta1 = "Hour", value_meta1 = "0")
p2 <- seurat_extract(df, meta1 = "Hour", value_meta1 = "24")

scpa_out <- compare_pathways_parallel(samples = list(p1, p2), pathways = pathways, cores = 2)

-------

Cell numbers in population 1 = 4428
Cell numbers in population 2 = 5919
- If greater than 500 cells, these populations will be downsampled

All 50 pathways passed the min/max genes threshold

Calculating pathway fold changes...

Performing a two-sample analysis with SCPA...
Error in `dplyr::arrange()`:
! Problem with the implicit `transmute()` step.
✖ Problem while computing `..1 = qval`.
Caused by error in `mask$eval_all_mutate()`:
! object 'qval' not found
Run `rlang::last_error()` to see where the error occurred.

Having a closer look, in the scpa_result <- foreach(pathway = pathways_filtered) %dopar% { ... }, error = function(e) { }) } part of the function I'm just getting a bunch of NULL values, which will be causing the error with arrange...

> scpa_result

[[1]]
NULL

[[2]]
NULL

...

[[49]]
NULL

[[50]]
NULL

@jiwen90
Copy link
Contributor Author

jiwen90 commented Jan 30, 2023

Can you

  1. Check that all required packages are installed (doParallel and others)
  2. Run without tryCatch? The NULL values are from the catch block, which indicates an error inside the main function.

Thanks!

@jackbibby1
Copy link
Owner

  1. Yup, my bad -- that was me being an idiot. I ran without the tryCatch and it's running fine now

@jiwen90
Copy link
Contributor Author

jiwen90 commented Jan 30, 2023

The tryCatch block itself shouldn't cause errors and should be included for safe cluster shutdown. If you can run it with the tryCatch block, then this should be good to merge. Otherwise we need to find the underlying problem.

@jackbibby1 jackbibby1 merged commit cc2ec22 into jackbibby1:main Jan 30, 2023
@jackbibby1
Copy link
Owner

Yup -- looks like I caused an issue with the %>% dependency when I was running it. Anyway, it all looks to be working well now, so have merged everything. Thanks a bunch for that, it was really useful

@jiwen90
Copy link
Contributor Author

jiwen90 commented Jan 31, 2023

It may be worth noting that the greatest speedup comes from running many pathways (eg 1000+). With only 50 pathways, as in the tutorial, the processes spawn time begins to exceed actual execution time.

@jackbibby1
Copy link
Owner

Yeah, good point -- I'll add some more benchmarking with increasing gene set sizes

@jiwen90
Copy link
Contributor Author

jiwen90 commented Feb 2, 2023

Small usability point - I would remove the parallel parameter and instead set cores=1 as default. If cores > 1, then use parallel processing. This avoids an extra parameter and user confusion.

@jackbibby1
Copy link
Owner

Hmm yeah, I think that's probably best on reflection. I'll change it up in the next version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

To do
2 participants