ingest/ncbi: Replace "invalid" characters from strain
Replace what iqtree considers "invalid" characters with "_" in `strain`
so that augur tree/iqtree does not change the strain name in the
phylogenetic workflow and cause an error in augur refine.

Similar to the changes made for the curate-andersen-lab-data script in
<b6f9b56>.
joverlee521 committed Feb 7, 2025
1 parent 94d2a92 commit 7d86b03
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion ingest/build-configs/ncbi/bin/transform-to-match-fauna
@@ -4,6 +4,7 @@ Transforms to specific fields in the NDJSON record to match the output
metdata from fauna for easier downstream use in the phylogenetic workflow
"""
import json
+import re
from sys import stdin, stdout


@@ -31,7 +32,12 @@ if __name__ == "__main__":
# Keep a copy of the original strain name since we are editing it below
record["original_strain"] = record["strain"]
# Remove spaces from strain names since they are not allowed in our phylo workflow.
-record["strain"] = record["original_strain"].replace(" ", "")
+# Replace invalid characters with `_` to match iqtree so augur tree will not modify strain
+# <https://github.com/iqtree/iqtree2/blob/74da454bbd98d6ecb8cb955975a50de59785fbde/utils/tools.cpp#L607>
+# Similar to the changes made for the curate-andersen-lab-data script in
+# <https://github.com/nextstrain/avian-flu/commit/b6f9b561afc4e73e8f3a14c4925aa874325f04d9>.
+strain = record["original_strain"].replace(" ", "")
+record["strain"] = re.sub(r'[^\w\_\-\.\|\/]', '_', strain)

json.dump(record, stdout, allow_nan=False, indent=None, separators=',:')
print()
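
A minimal sketch of what the new substitution does to a strain name, using the same regular expression as the diff. The helper name `sanitize_strain` and the example strain names are hypothetical and only for illustration; they do not appear in the script.

```python
import re

# Same pattern as in the diff: every character that is not a word character,
# "_", "-", ".", "|", or "/" is treated as invalid.
INVALID_CHARS = re.compile(r'[^\w\_\-\.\|\/]')


def sanitize_strain(original_strain: str) -> str:
    """Hypothetical helper mirroring the two steps applied to record["strain"]."""
    # Step 1: remove spaces, as the script already did before this commit.
    strain = original_strain.replace(" ", "")
    # Step 2: replace each remaining invalid character with "_".
    return INVALID_CHARS.sub("_", strain)


# Made-up strain names, not taken from NCBI data.
for name in ["A/chicken/New York/1(2)/2024", "A/duck/Ohio/AB-12'3/2025"]:
    print(name, "->", sanitize_strain(name))
# A/chicken/New York/1(2)/2024 -> A/chicken/NewYork/1_2_/2024
# A/duck/Ohio/AB-12'3/2025 -> A/duck/Ohio/AB-12_3/2025
```

Because spaces are stripped before the substitution runs, they are removed rather than turned into `_`. Note also that in Python 3 `\w` matches Unicode word characters by default, so accented letters pass through unchanged; whether iqtree treats those characters the same way is not established here.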