Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many sequences in an output tree #358

Open
imdanique opened this issue Nov 6, 2023 · 1 comment
Open

Too many sequences in an output tree #358

imdanique opened this issue Nov 6, 2023 · 1 comment

Comments

@imdanique
Copy link

Hi, thanks for developing Usher.

I'm encountering an issue. My goal is to find the 10 nearest neighbors for each sequence in a FASTA file and build a tree with input sequences + neighbors. Here's what I did:

   #Alignment
    mafft --thread 16  --keeplength --addfragments all_seq.fasta ref.fa > aligned_seq.fa

    # Ensuring that first seq is ref
    head -1 aligned_seq.fa

    # Downloads the problematic sites in ref genome
    wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > problematic_sites_sarsCov2.vcf

    # Converts fasta to vcf with correction for the problematic sites
    faToVcf -includeNoAltN -maskSites=problematic_sites_sarsCov2.vcf aligned_seq.fa aligned_seq.vcf

    # Downloads the latest global lineages
    wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz" 

    # Runs usher for lineage assignment
    usher -i public-latest.all.masked.pb \
    -v aligned_seq.vcf -k 10  -T 16 -d result 

Then, I tried to calculate all IDs in output subtrees:

from Bio import Phylo
import glob
# Initialize an empty list to store all the trees
trees = []

# Loop through all the files containing the subtrees
for file in glob.glob("result/*.nh"):
    tree = Phylo.read(file, "newick")
    # Extract sequence IDs from the tree and add them to the list
    sequence_ids.extend([leaf.name for leaf in tree.get_terminals()])

# Open a text file to write the sequence IDs
with open('result/sequence_ids.txt', 'w') as f:
    for id in sequence_ids:
        f.write(id + '\n')

Calculated all unique entries:

sort usher_results_k10/sequence_ids.txt | uniq | grep -v "node" | wc -l
>  2595922

In my input I had 1909 sequences. I expected a maximum of 19,090 output sequences (10 x input sequences), but I ended up with 2,595,922 sequences in 1152 output trees. How do I modify the Usher command to provide only 10 neighbors? Any insights on what went wrong would be greatly appreciated.

@AngieHinrichs
Copy link
Contributor

Hi @imdanique. In addition to the -k output files subtree-*.nh, usher also makes an output file for the whole tree, final-tree.nh, and the glob("result/*.nh") in your Python script might be picking that up as well. Try changing that to glob("result/subtree-*.nh") to match only the subtree output files, and let us know if that doesn't fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants