Optional enhancement: Revisit current way of merging private data (via annotations.tsv) #65

Open
j23414 opened this issue Feb 5, 2025 · 3 comments · May be fixed by #68
Labels
enhancement New feature or request

Comments

@j23414
Collaborator

j23414 commented Feb 5, 2025

Context

Optional future work was to revisit how we merge private data. Currently we merge private information during the ingest workflow, incorporating it via an annotations.tsv file.

However, since then, there has been discussion of providing a more consistent pattern of incorporating private user data:

I was personally curious about the config.additional_inputs method proposed in nextstrain/avian-flu#106 but was open to discussion of other methods. I understand if there are more pressing priorities, so just logging the potential future work here.

Description

Examples

Possible solution

@j23414 j23414 added the enhancement New feature or request label Feb 5, 2025
@j23414
Collaborator Author

j23414 commented Feb 19, 2025

Revisit refactoring out, adding columns

WNV/ingest/rules/curate.smk

Lines 100 to 101 in 9a49047

| ./scripts/add-field-names \
--metadata-columns {params.metadata_columns} \

@j23414
Collaborator Author

j23414 commented Feb 20, 2025

Just connecting to docs for "additional-metadata" here:

https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences

@jameshadfield
Member

Right now the WNV phylo interface defines inputs via:

# Sequences must be FASTA and metadata must be TSV
# Both files must be zstd compressed
sequences_url: "https://data.nextstrain.org/files/workflows/WNV/sequences.fasta.zst"
metadata_url: "https://data.nextstrain.org/files/workflows/WNV/metadata.tsv.zst"

# Pull in metadata and sequences from the ingest workflow
input_metadata: "data/metadata.tsv"
input_sequences: "data/sequences.fasta"

It's not clear to me which one is used where. Reading through the code (without running it), it seems rule decompress takes (e.g.) metadata_url and produces the hardcoded "data/metadata.tsv" output. Then rule filter_manual and rule subsample use config["input_metadata"], which is going to be the decompressed version of the S3 data.

The avian-flu interface for multiple data inputs - and the one I would like to become the nextstrain standard - would be a list of dictionaries:

inputs:
  - name: <input name>
    metadata: <local path, HTTP[S], S3>
    sequences: <local path, HTTP[S], S3>

These sources would all be merged (via augur merge and seqkit rmdup) at the very start of the workflow.
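As a rough sketch of that merge step, the helper below builds the two command lines from an inputs list. It assumes local paths only, that `augur merge` accepts `--metadata NAME=FILE ...` with `--output-metadata`, and that `seqkit rmdup` accepts several FASTA files with `-o`; check these flags against the installed versions. The function name, input names, and output paths are hypothetical.

```python
def merge_commands(inputs, out_metadata, out_sequences):
    """Build argv lists for merging metadata and de-duplicating sequences.

    `inputs` is a list of dicts with "name", "metadata" and "sequences" keys,
    mirroring the proposed config structure. Nothing is executed here.
    """
    if not inputs:
        raise ValueError("at least one input is required")

    # augur merge takes named tables as NAME=FILE pairs (assumed CLI shape).
    metadata_args = [f"{i['name']}={i['metadata']}" for i in inputs]
    augur_merge = [
        "augur", "merge",
        "--metadata", *metadata_args,
        "--output-metadata", out_metadata,
    ]

    # seqkit rmdup drops records with duplicate IDs across the given files.
    sequence_files = [i["sequences"] for i in inputs]
    seqkit_rmdup = ["seqkit", "rmdup", *sequence_files, "-o", out_sequences]

    return augur_merge, seqkit_rmdup


# Hypothetical inputs: one public source and one private source.
example_inputs = [
    {"name": "ncbi", "metadata": "data/metadata.tsv", "sequences": "data/sequences.fasta"},
    {"name": "private", "metadata": "private/metadata.tsv", "sequences": "private/sequences.fasta"},
]

augur_cmd, seqkit_cmd = merge_commands(
    example_inputs, "results/metadata_merged.tsv", "results/sequences_merged.fasta"
)
```

Each list could be passed to `subprocess.run(cmd, check=True)`, or joined into the `shell:` block of a Snakemake rule.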

If config overlays are used, then they can use an additional config key with the same structure:

additional_inputs:
  - name: <input name>
    metadata: <local path, HTTP[S], S3>
    sequences: <local path, HTTP[S], S3>

Which is (hopefully!) self-explanatory. We use this rather than inputs because Snakemake "merges" lists by overwriting the original list, so it's hard to specify "additional" data that way. We can use this behaviour to our advantage: use inputs in the overlay YAML when we want to replace the default inputs, and additional_inputs when we want to add to them.
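To illustrate the overwrite behaviour, here is a minimal self-contained sketch. The `overlay` function only mimics how Snakemake combines config files (nested mappings merged key by key, lists replaced wholesale) and is not Snakemake's own code; `resolve_inputs` and the input names and private paths are hypothetical.

```python
from collections.abc import Mapping


def overlay(base, extra):
    """Return `base` updated with `extra`.

    Nested mappings are merged key by key; any other value, lists
    included, is replaced rather than appended to.
    """
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_inputs(config):
    """Return the default inputs followed by any additional inputs."""
    inputs = list(config.get("inputs", [])) + list(config.get("additional_inputs", []))
    names = [entry["name"] for entry in inputs]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"input names must be unique, repeated: {duplicates}")
    return inputs


# Default config shipped with the workflow (input name is hypothetical).
defaults = {
    "inputs": [
        {"name": "ncbi", "metadata": "data/metadata.tsv", "sequences": "data/sequences.fasta"},
    ],
}

# Hypothetical private dataset supplied by a user.
private = {"name": "private", "metadata": "private/metadata.tsv", "sequences": "private/sequences.fasta"}

# Overlay using `additional_inputs`: the defaults are kept and extended.
extended = resolve_inputs(overlay(defaults, {"additional_inputs": [private]}))

# Overlay using `inputs`: the default list is overwritten entirely.
replaced = resolve_inputs(overlay(defaults, {"inputs": [private]}))
```

Here `extended` holds both the default and private entries, while `replaced` holds only the private one, which is the distinction between the two config keys.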

2 participants