-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Allow concurrent and large refgenie downloads #1172
base: master
Are you sure you want to change the base?
Changes from 10 commits
61758c3
8eb7a1d
60f066f
1c99871
d0dcc61
4d3758c
562df5f
c074c68
ee3bad5
fce4bb0
73e3a02
a3b02b8
0d10e6a
fe7b05a
c2230b2
b3d832b
5078fa2
b927c4f
6b3d6e3
64bd5e6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,25 @@ | ||
rule obtain_asset: | ||
output: | ||
# the name refers to the refgenie seek key (see attributes on http://refgenomes.databio.org) | ||
fai="refs/genome.fasta" | ||
fai="refs/genome.fasta", | ||
# Multiple outputs/seek keys are possible here. | ||
params: | ||
genome="human_alu", | ||
asset="fasta", | ||
tag="default" | ||
tag="default", | ||
log: | ||
"logs/refgenie/obtain_large_asset.log", | ||
wrapper: | ||
"master/bio/refgenie" | ||
|
||
rule obtain_large_asset: | ||
output: | ||
star_index=directory("refs/star_index/hg38/star_index"), | ||
params: | ||
genome="hg38", | ||
asset="star_index", | ||
tag="default", | ||
log: | ||
"logs/refgenie/obtain_large_asset.log", | ||
wrapper: | ||
"master/bio/refgenie" |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,17 +5,27 @@ | |
|
||
import os | ||
import refgenconf | ||
from refgenconf.exceptions import RefgenconfError | ||
|
||
genome = snakemake.params.genome | ||
asset = snakemake.params.asset | ||
tag = snakemake.params.tag | ||
|
||
conf_path = os.environ["REFGENIE"] | ||
|
||
rgc = refgenconf.RefGenConf(conf_path, writable=True) | ||
|
||
# BUG If there are multiple concurrent refgenie commands, this will fail due to | ||
# unable to acquire lock of the config file. | ||
try: | ||
rgc = refgenconf.RefGenConf(conf_path, writable=True) | ||
except RefgenconfError: | ||
# If read lock timeout, attempt to skip the read lock | ||
rgc = refgenconf.RefGenConf( | ||
conf_path, writable=True, skip_read_lock=True, genome_exact=False | ||
) | ||
# pull asset if necessary | ||
gat, archive_data, server_url = rgc.pull(genome, asset, tag, force=False) | ||
gat, archive_data, server_url = rgc.pull( | ||
genome, asset, tag, force=False, force_large=True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that is a good alternative to implement. As is, there is no way to override this while using the wrapper. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you feel comfortable implementing this? I'd introduce an (optional) |
||
) | ||
|
||
for seek_key, out in snakemake.output.items(): | ||
path = rgc.seek(genome, asset, tag_name=tag, seek_key=seek_key, strict_exists=True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't really find out what the exact implications of
skip_read_lock=TRUE
are, but it seems dangerous to use, to me. Have you also tried increasingwait_max=
as an alternative?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't attempt to, but I suspect that this might not be a great choice either. If someone is downloading an asset over a slow connection, even setting wait_max from its default of 60 to 600 might not make a difference and result in a hard-to-diagnose timeout error.
I'm not sure if this was some sort of conflict with the snakemake locking system as well. If we rely on that to protect other files, then the result of the wrapper is it either produces the output file, or the rule fails with a RefgenconfError error and recommends setting the skip_read_lock=TRUE param to try to fix the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I gathered by poking around a little, I think that the lock only happens while something is written to the conf file. So I would think that this lock is not in place the whole time you are doing the download and that the
wait_max=
should already help. But the documentation on this is not very clear and I didn't immediately find the mechanism in the code, so I might be misunderstanding this lock.Do you have the possibility to try
wait_max=
in your use case and test whether this actually helps?