fix: get ensembl-reference wrapper to download more than one chromosome #3432

Open
wants to merge 7 commits into
base: master
7 changes: 4 additions & 3 deletions bio/reference/ensembl-sequence/wrapper.py
@@ -47,15 +47,15 @@
 if chromosome:
     if not datatype == "dna":
         raise ValueError(
-            "invalid datatype, to select a single chromosome the datatype must be dna"
+            "Invalid datatype. To select individual chromosomes, the datatype must be dna."
         )

 url = snakemake.params.get("url", "ftp://ftp.ensembl.org/pub")
 spec = spec.format(build=build, release=release)
 url_prefix = f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

-success = False
 for suffix in suffixes:
+    success = False
Collaborator @fgvieira, Nov 7, 2024:

Shouldn't this be outside the loop to check if at least one suffix was successful? This way it will only check the last suffix, no?

Contributor Author:

If it is outside the loop and we are requesting multiple chromosomes, it will turn true on the first working chromosome and then stay that way. So we will get no debugging output and no error if any of the other chromosomes is unavailable. For the chromosomes case, we should therefore reset it for every suffix in suffixes. For the other case, which checks whether "dna.primary_assembly.fa.gz" or "dna.toplevel.fa.gz" is available, the loop breaks out of for suffix in suffixes right after setting success = True, and is otherwise left with success = False after the last suffix runs into the except:.

Collaborator:

But this way it only checks if the last chromosome was available, no?
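
The point above can be reproduced with a small, self-contained sketch. It does not use the wrapper's code: `last_flag_only` and the `available` set are hypothetical stand-ins for the loop and for the curl availability check, and the chromosome file names are only illustrative.

```python
def last_flag_only(suffixes, available):
    """Mimic the loop with the flag reset inside it and no break (chromosome case)."""
    for suffix in suffixes:
        success = False  # reset on every iteration, as in the diff
        if suffix not in available:
            continue  # stands in for the failed curl check
        success = True
    return success  # reflects only the last suffix


suffixes = ["dna.chromosome.1.fa.gz", "dna.chromosome.2.fa.gz"]

# First chromosome missing, last one present: the failure goes unnoticed.
missed_failure = last_flag_only(suffixes, available={"dna.chromosome.2.fa.gz"})

# Last chromosome missing: this is the only case the final check would catch.
caught_failure = last_flag_only(suffixes, available={"dna.chromosome.1.fa.gz"})
```

Here `missed_failure` ends up true even though one download failed, so a final `if not success:` check would stay silent.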

Contributor Author:

Ah yes, you are right. Very good catch. Let me think about what the best solution is...

Collaborator:

Maybe move the error checking directly into the try/except?
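
One possible shape for that idea, as a rough sketch rather than the wrapper's actual code: `download_all`, `probe`, and `fake_probe` are hypothetical names, and `probe` stands in for the curl availability check so the example runs without network access or Snakemake.

```python
import subprocess as sp


def download_all(url_prefix, suffixes, chromosome, probe):
    """Return the suffixes that would be downloaded, or raise on failure.

    `probe` is a hypothetical callable that raises sp.CalledProcessError
    when a URL is not reachable (the role curl plays in the wrapper).
    """
    downloaded = []
    for suffix in suffixes:
        url = f"{url_prefix}.{suffix}"
        try:
            probe(url)
        except sp.CalledProcessError as err:
            if chromosome:
                # Every requested chromosome is required: fail immediately.
                raise ValueError(f"Requested file not available: {url}") from err
            continue  # otherwise try the next candidate suffix
        downloaded.append(suffix)  # the real wrapper would download here
        if not chromosome:
            break  # a single assembly file is enough
    if not downloaded:
        raise ValueError(f"No file available for any of: {', '.join(suffixes)}")
    return downloaded


def fake_probe(available):
    """Build a probe that only accepts URLs ending in one of `available`."""
    def probe(url):
        if not any(url.endswith(s) for s in available):
            raise sp.CalledProcessError(22, ["curl", url])
    return probe
```

With this layout no success flag is needed for the chromosome case, because a missing chromosome raises at the point where it is detected.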

Contributor:

🛠️ Refactor suggestion

Track download success for each chromosome separately.

The current implementation resets the success flag for each suffix, which could mask failures of individual chromosome downloads. Consider tracking success per chromosome:

-success = False
+successes = set()  # Track successful downloads

Then update the success tracking after download:

-success = True
+successes.add(suffix)  # Record successful download

And modify the final check:

-if not success:
+if not successes:

This change will help identify which specific chromosomes failed to download.
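
Note that `if not successes:` on its own only fires when nothing at all was downloaded. To name the chromosomes that failed, the set would have to be compared against the requested suffixes. A minimal sketch, assuming a hypothetical helper `missing_suffixes` and illustrative file names:

```python
def missing_suffixes(suffixes, successes):
    """Return requested suffixes with no recorded successful download, in order."""
    return [s for s in suffixes if s not in successes]


suffixes = [f"dna.chromosome.{c}.fa.gz" for c in ("1", "2", "MT")]
successes = {"dna.chromosome.1.fa.gz", "dna.chromosome.MT.fa.gz"}

# Only chromosome 2 has no recorded download.
missing = missing_suffixes(suffixes, successes)
```

A final check along the lines of `if missing: raise ValueError(...)` could then list exactly the files that were unavailable.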

     url = f"{url_prefix}.{suffix}"

     try:
@@ -65,7 +65,8 @@

     shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}")
     success = True
-    break
+    if not chromosome:
+        break

 if not success:
     if len(suffixes) > 1: