Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Update to the latest version of nextclade. #1701

Merged
merged 5 commits into from
Aug 16, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/backend/Dockerfile.lineage_qc
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ RUN update-ca-certificates

# install nextclade, check it installed correctly
RUN apt-get --yes install curl
RUN cd /usr/local/bin && curl -fsSL "https://github.com/nextstrain/nextclade/releases/download/2.14.0/nextclade-x86_64-unknown-linux-gnu" -o "nextclade" && chmod +x nextclade
RUN cd /usr/local/bin && curl -fsSL "https://github.com/nextstrain/nextclade/releases/download/3.8.2/nextclade-x86_64-unknown-linux-gnu" -o "nextclade" && chmod +x nextclade
RUN nextclade --version

# Poetry: install app
Expand Down
8 changes: 8 additions & 0 deletions src/backend/aspen/workflows/nextclade/prep_samples.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
if you're tweaking it to support multiple tools. Especially look closely at
the function `get_sample_ids_to_refresh` in here.
"""

import io
import json
import subprocess
Expand Down Expand Up @@ -127,6 +128,13 @@ def cli(
# generalized case and we'll need to figure out how to handle that,
# but right now the workflow is hardcoded to always expecting dataset.
nextclade_dataset_name = target_pathogen.nextclade_dataset_name
# Nextclade 3.2.8 has new names for datasets vs the 2.1 names in the db.
new_nextclade_dataset_names = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if normalizing the new names back to whatever the old one was is the right choice. I don't remember how the dataset name gets used, but if there's no logic based on the value elsewhere and its just being held so we know what reference was used, I think we shouldn't standardize to the old name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, whoops, I see I misread the direction of the lookup var.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think this is just the minimal change possible -- we only use this value to dowload the right dataset via the nextclade cli, and nothing else changes anywhere in our system

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't know: is there a reason we can't modify the row for the pathogens table so the nextclade_dataset_name for MPX is the new value instead? If that's doable, it seems preferable to go that way, but I also don't know if the refresh logic would freak out if old MPX and new MPX referenced different things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a skim through the code and it looks like this is the one-and-only place where the nextclade dataset name value is used, so it should be safe to update the db values instead - I'll update the PR

"SARS-CoV": "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
"hMPXV": "nextstrain/mpox/all-clades",
}
if nextclade_dataset_name in new_nextclade_dataset_names:
nextclade_dataset_name = new_nextclade_dataset_names[nextclade_dataset_name]
if not nextclade_dataset_name:
print("No nextclade_dataset_name for this pathogen in the DB.")
if run_type == RunType.REFRESH_STALE:
Expand Down
2 changes: 1 addition & 1 deletion src/backend/aspen/workflows/nextclade/run_nextclade.sh
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ shopt -s inherit_errexit # no silent breaking
# This is where we will store Nextclade's dataset for the target pathogen.
NEXTCLADE_DATASET_DIR=nextclade_dataset_bundle
# Inside the dataset, Nextclade uses this file to tag the dataset.
NEXTCLADE_TAG_FILENAME=tag.json
NEXTCLADE_TAG_FILENAME=pathogen.json

# Certain bits of info need to be passed around during the workflow.
# Using JSON file as an easy way to pass them around to various processes.
Expand Down
Loading
Loading