-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: get ensembl-reference wrapper to download more than one chromosome #3432
base: master
Are you sure you want to change the base?
Changes from 6 commits
8d30eea
dfdd743
8fae1dd
249d5e4
23b8788
19536bc
5545e40
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -47,33 +47,45 @@ | |||||||||||||||||||||||||||
if chromosome: | ||||||||||||||||||||||||||||
if not datatype == "dna": | ||||||||||||||||||||||||||||
raise ValueError( | ||||||||||||||||||||||||||||
"invalid datatype, to select a single chromosome the datatype must be dna" | ||||||||||||||||||||||||||||
"Invalid datatype. To select individual chromosomes, the datatype must be dna." | ||||||||||||||||||||||||||||
) | ||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||
url = snakemake.params.get("url", "ftp://ftp.ensembl.org/pub") | ||||||||||||||||||||||||||||
spec = spec.format(build=build, release=release) | ||||||||||||||||||||||||||||
url_prefix = f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}" | ||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||
success = False | ||||||||||||||||||||||||||||
for suffix in suffixes: | ||||||||||||||||||||||||||||
success = False | ||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Track download success for each chromosome separately. The current implementation resets the success flag for each suffix, which could mask failures of individual chromosome downloads. Consider tracking success per chromosome: -success = False
+successes = set() # Track successful downloads Then update the success tracking after download: -success = True
+successes.add(suffix) # Record successful download And modify the final check: -if not success:
+if not successes: This change will help identify which specific chromosomes failed to download.
|
||||||||||||||||||||||||||||
url = f"{url_prefix}.{suffix}" | ||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||
try: | ||||||||||||||||||||||||||||
shell("curl -sSf {url} > /dev/null 2> /dev/null") | ||||||||||||||||||||||||||||
except sp.CalledProcessError: | ||||||||||||||||||||||||||||
continue | ||||||||||||||||||||||||||||
if chromosome: | ||||||||||||||||||||||||||||
print( | ||||||||||||||||||||||||||||
f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.", | ||||||||||||||||||||||||||||
file=sys.stderr, | ||||||||||||||||||||||||||||
) | ||||||||||||||||||||||||||||
break | ||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||
Comment on lines
+64
to
+70
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Critical: Remove break statement and use proper logging. The current implementation has two issues:
Apply these changes: if chromosome:
- print(
- f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
- file=sys.stderr,
- )
- break
+ shell.logger.error(
+ f"Unable to download chromosome sequence from: {url_prefix}.{suffix}"
+ )
+ continue
else:
continue 📝 Committable suggestion
Suggested change
|
||||||||||||||||||||||||||||
continue | ||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||
shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}") | ||||||||||||||||||||||||||||
success = True | ||||||||||||||||||||||||||||
break | ||||||||||||||||||||||||||||
if not chromosome: | ||||||||||||||||||||||||||||
break | ||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||
if not success: | ||||||||||||||||||||||||||||
if len(suffixes) > 1: | ||||||||||||||||||||||||||||
url = f"{url_prefix}.[{'|'.join(suffixes)}]" | ||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||
url = f"{url_prefix}.{suffixes[0]}" | ||||||||||||||||||||||||||||
if not chromosome: | ||||||||||||||||||||||||||||
if len(suffixes) > 1: | ||||||||||||||||||||||||||||
url = f"{url_prefix}.[{'|'.join(suffixes)}]" | ||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||
url = f"{url_prefix}.{suffixes[0]}" | ||||||||||||||||||||||||||||
print( | ||||||||||||||||||||||||||||
f"Unable to download the requested reference sequence data from Ensembl at: {url}.", | ||||||||||||||||||||||||||||
file=sys.stderr, | ||||||||||||||||||||||||||||
) | ||||||||||||||||||||||||||||
Comment on lines
+79
to
+87
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Improve error reporting with Snakemake logging. The error reporting can be improved by:
-if not success:
+if not successes:
if not chromosome:
if len(suffixes) > 1:
url = f"{url_prefix}.[{'|'.join(suffixes)}]"
else:
url = f"{url_prefix}.{suffixes[0]}"
- print(
+ shell.logger.error(
f"Unable to download the requested reference sequence data from Ensembl at: {url}.",
- file=sys.stderr,
)
+else:
+ failed = set(suffixes) - successes
+ if failed:
+ shell.logger.error(
+ f"Failed to download the following chromosomes: {', '.join(failed)}"
+ )
-print(
+shell.logger.error(
"Please check whether above URL is currently available (might be a temporal server issue). "
"Apart from that, did you check that this combination of species, build, and release is actually provided?",
- file=sys.stderr,
)
|
||||||||||||||||||||||||||||
print( | ||||||||||||||||||||||||||||
f"Unable to download requested sequence data from Ensembl ({url}). " | ||||||||||||||||||||||||||||
"Please check whether above URL is currently available (might be a temporal server issue). " | ||||||||||||||||||||||||||||
"Apart from that, did you check that this combination of species, build, and release is actually provided?", | ||||||||||||||||||||||||||||
file=sys.stderr, | ||||||||||||||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be outside the loop to check if at least one suffix was successful? This way it will only check the last suffix, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is outside of the loop, and we are requesting multiple chromosomes, this will turn true on any working chromosome, and then stay that way. So we will not get any debugging output and error thrown, in case any of the chromosomes is not available. So for the chromosomes case, we should reset this for every
suffix in suffixes
. For the other case, checking whether"dna.primary_assembly.fa.gz"
or"dna.toplevel.fa.gz"
is available, it willbreak
out of thesuffix in suffixes
loop right after settingsuccess = True
and will otherwise be left withsuccess = False
after the last suffix that runs into theexcept:
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this way it only checks if the last chromosome was available, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, you are right. Very good catch. Let me think about what the best solution is...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe moving the error checking directly to the try/except?