Hi, thanks for developing Usher.
I'm encountering an issue. My goal is to find the 10 nearest neighbors for each sequence in a FASTA file and build a tree with the input sequences plus their neighbors. Here's what I did:
#Alignment
mafft --thread 16 --keeplength --addfragments all_seq.fasta ref.fa > aligned_seq.fa
# Ensuring that first seq is ref
head -1 aligned_seq.fa
# Downloads the problematic sites in ref genome
wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > problematic_sites_sarsCov2.vcf
# Converts fasta to vcf with correction for the problematic sites
faToVcf -includeNoAltN -maskSites=problematic_sites_sarsCov2.vcf aligned_seq.fa aligned_seq.vcf
# Downloads the latest global lineages
wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz"
# Runs usher for lineage assignment
usher -i public-latest.all.masked.pb \
-v aligned_seq.vcf -k 10 -T 16 -d result
Then, I tried to calculate all IDs in output subtrees:
from Bio import Phylo
import glob
# Initialize an empty list to store the sequence IDs
sequence_ids = []
# Loop through all the files containing the subtrees
for file in glob.glob("result/*.nh"):
    tree = Phylo.read(file, "newick")
    # Extract sequence IDs from the tree and add them to the list
    sequence_ids.extend([leaf.name for leaf in tree.get_terminals()])
# Open a text file to write the sequence IDs
with open('result/sequence_ids.txt', 'w') as f:
    for id in sequence_ids:
        f.write(id + '\n')
In my input I had 1909 sequences. I expected a maximum of 19,090 output sequences (10 x input sequences), but I ended up with 2,595,922 sequences in 1152 output trees. How do I modify the Usher command to provide only 10 neighbors? Any insights on what went wrong would be greatly appreciated.
Hi @imdanique. In addition to the -k output files subtree-*.nh, usher also makes an output file for the whole tree, final-tree.nh, and the glob("result/*.nh") in your Python script might be picking that up as well. Try changing that to glob("result/subtree-*.nh") to match only the subtree output files, and let us know if that doesn't fix it.
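As a rough illustration of that suggestion, the sketch below collects unique leaf names from only the files matching subtree-*.nh, so final-tree.nh in the same directory is skipped. To stay dependency-free it swaps Bio.Phylo for a small regular expression, which assumes unquoted leaf labels that contain no parentheses, commas, colons, or semicolons; the helper names leaf_names, collect_ids, and write_ids are hypothetical and not part of usher or Biopython.

```python
import glob
import os
import re

# A leaf label directly follows "(" or "," and runs until ":", ",", ")" or ";".
# Internal node labels follow ")" and branch lengths follow ":", so neither matches.
# Assumption: labels are unquoted and contain none of ( ) , : ;
LEAF_RE = re.compile(r"(?<=[(,])\s*([^(),:;\s][^(),:;]*)")


def leaf_names(newick):
    """Return leaf labels found in a Newick string, in order of appearance."""
    return [name.strip() for name in LEAF_RE.findall(newick)]


def collect_ids(outdir, pattern="subtree-*.nh"):
    """Collect unique leaf names from files in outdir that match pattern."""
    ids = set()
    for path in sorted(glob.glob(os.path.join(outdir, pattern))):
        with open(path) as handle:
            ids.update(leaf_names(handle.read()))
    return ids


def write_ids(ids, path):
    """Write one ID per line, sorted so the output is reproducible."""
    with open(path, "w") as out:
        for name in sorted(ids):
            out.write(name + "\n")
```

Calling write_ids(collect_ids("result"), "result/sequence_ids.txt") would mirror the script in the question while ignoring the whole-tree file. Using a set also means an ID that appears in several subtrees is counted once.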