Optional enhancement: Revisit current way of merging private data (via annotations.tsv) #65

j23414 · 2025-02-05T17:50:41Z

Context

Optional future work was to revisit how we merge private data. Currently we merge private information during the ingest workflow, incorporating it via an annotations.tsv file.

However, since then, there has been discussion of providing a more consistent pattern of incorporating private user data:

Provide a generic pattern for including additional user data alongside curated data pathogen-repo-guide#72

I was personally curious about the config.additional_inputs method proposed in nextstrain/avian-flu#106 but was open to discussion of other methods. I understand if there are more pressing priorities, so just logging the potential future work here.



j23414 · 2025-02-19T22:29:17Z

Revisit refactoring out, adding columns

WNV/ingest/rules/curate.smk

Lines 100 to 101 in 9a49047

    
    | ./scripts/add-field-names \
        --metadata-columns {params.metadata_columns} \

j23414 · 2025-02-20T20:38:38Z

Just connecting to docs for "additional-metadata" here:

https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences

jameshadfield · 2025-02-21T01:43:20Z

Right now the WNV phylo interface defines inputs via:

# Sequences must be FASTA and metadata must be TSV
# Both files must be zstd compressed
sequences_url: "https://data.nextstrain.org/files/workflows/WNV/sequences.fasta.zst"
metadata_url: "https://data.nextstrain.org/files/workflows/WNV/metadata.tsv.zst"

# Pull in metadata and sequences from the ingest workflow
input_metadata: "data/metadata.tsv"
input_sequences: "data/sequences.fasta"

It's not clear to me which one is used where... looking through the code (but not running) it seems rule decompress takes the (e.g.) metadata_url and produces the (hardcoded) "data/metadata.tsv" output. Then rule filter_manual and rule subsample use config["input_metadata"], which is going to be the decompressed version of the S3 data.

The avian-flu interface for multiple data inputs - and the one I would like to become the nextstrain standard - would be a list of dictionaries:

inputs:
  - name: <input name>
    metadata: <local path, HTTP[S], S3>
    sequences: <local path, HTTP[S], S3>

These sources would all be merged (via augur merge and seqkit rmdup) at the very start of the workflow.
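To make the intended merge semantics concrete, here is a minimal Python sketch. It is a simplified stand-in for the behaviour I'd expect from that step, not augur or seqkit code: it assumes metadata is joined on a `strain` column with non-empty values from later inputs taking priority, and that duplicate sequences are dropped by ID, keeping the first one seen. The function names and example rows are made up for illustration.

```python
# Simplified stand-in for the merge step; NOT augur or seqkit code.
# Assumption: "strain" is the ID column shared by all metadata tables.

def merge_metadata(tables, id_column="strain"):
    """Join tables on id_column; non-empty values from later tables win."""
    merged = {}
    for rows in tables:
        for row in rows:
            record = merged.setdefault(row[id_column], {})
            for column, value in row.items():
                # Keep an existing value unless the later table has a non-empty one.
                if value != "" or column not in record:
                    record[column] = value
    return list(merged.values())


def dedupe_sequences(records):
    """Keep the first (name, sequence) pair seen for each sequence ID."""
    seen = {}
    for name, sequence in records:
        seen.setdefault(name, sequence)
    return list(seen.items())


# Hypothetical example rows: one public input, one private input.
public = [{"strain": "A", "country": "USA", "host": ""}]
private = [
    {"strain": "A", "country": "", "host": "Culex"},
    {"strain": "B", "country": "Canada", "host": "Corvus"},
]

metadata = merge_metadata([public, private])
sequences = dedupe_sequences([("A", "ACGT"), ("B", "GGCC"), ("A", "ACGA")])
```

In this sketch the private input fills in the empty `host` for strain A without blanking the public `country`, and the second copy of sequence A is discarded.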

If config overlays are used, then they can use an additional config key with the same structure:

additional_inputs:
  - name: <input name>
    metadata: <local path, HTTP[S], S3>
    sequences: <local path, HTTP[S], S3>

This is (hopefully!) self-explanatory. We use additional_inputs rather than inputs because Snakemake will "merge" lists by overwriting the original list, so it's hard to specify "additional" data that way. We can use this behaviour to our advantage by using inputs in the overlay YAML when we want to replace the default inputs.
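A rough Python sketch of the two overlay cases, for illustration only. The `overlay` function imitates the list-overwriting behaviour described above and is not Snakemake's actual code; `resolve_inputs`, the input names and the file paths are all hypothetical.

```python
import copy


def overlay(base, update):
    """Recursively merge dicts; any other value (including lists) is replaced."""
    result = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = overlay(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def resolve_inputs(config):
    """Hypothetical helper: default inputs followed by any additional inputs."""
    return config.get("inputs", []) + config.get("additional_inputs", [])


# Hypothetical default config and two overlay configs.
default_config = {
    "inputs": [
        {"name": "ncbi", "metadata": "data/metadata.tsv", "sequences": "data/sequences.fasta"},
    ],
}
private_input = {
    "name": "private",
    "metadata": "private/metadata.tsv",
    "sequences": "private/sequences.fasta",
}
extend_overlay = {"additional_inputs": [private_input]}  # add to the defaults
replace_overlay = {"inputs": [private_input]}            # replace the defaults

extended_names = [i["name"] for i in resolve_inputs(overlay(default_config, extend_overlay))]
replaced_names = [i["name"] for i in resolve_inputs(overlay(default_config, replace_overlay))]
```

With the `additional_inputs` overlay the workflow would see both `ncbi` and `private`; with the `inputs` overlay it would see only `private`.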

j23414 added the enhancement New feature or request label Feb 5, 2025

j23414 linked a pull request Feb 24, 2025 that will close this issue

WIP: Refactor to merge additional data during phylogenetic (instead of ingest) workflow #68

Draft

1 task
