Reformat discards intermediate noranks #109

Open
4 tasks done
fgvieira opened this issue Dec 4, 2024 · 5 comments

@fgvieira
Contributor

fgvieira commented Dec 4, 2024

Prerequisites

  • make sure you're using the latest version by running taxonkit version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

When running:

$ echo -e "2387\n2399\n2646461" | taxonkit reformat --taxid-field 1 --format '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}' 

I get:

taxid k p c o f g s t
2387
2399 Transposon Tn2921
2646461 Eukaryota Rhodophyta Bangiophyceae Porphyridiales Porphyridiaceae Porphyridium

However, the taxon 2399 has other ranks:

1 (root) -> 2787854 (other_entries) -> 28384 (other sequences) -> 2387 (transposons) -> 2399

and taxon 2646461 is below Porphyridium with name unclassified Porphyridium.

Would it be possible to get:

taxid k p c o f g s t
2387 unclassified other entries superkingdom unclassified other sequences phylum unclassified transposons class
2399 unclassified other entries superkingdom unclassified other sequences phylum unclassified transposons class Transposon Tn2921
2646461 Eukaryota Rhodophyta Bangiophyceae Porphyridiales Porphyridiaceae Porphyridium unclassified Porphyridium

I tried --pseudo-strain, but it only adds the last node (if it has no rank) to the subspecies column. That works fine for node 2646461, but the other nodes are above species, so it does not make much sense there (and that information is lost from the child nodes):

$ echo -e "2387\n2399\n2646461" | taxonkit reformat --taxid-field 1 --format '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}' --pseudo-strain
taxid k p c o f g s t
2387 transposons
2399 Transposon Tn2921
2646461 Eukaryota Rhodophyta Bangiophyceae Porphyridiales Porphyridiaceae Porphyridium unclassified Porphyridium

I also tried the --fill-miss-rank option, but it fills in from the top down and all intermediate ranks are lost:

$ echo -e "2387\n2399\n2646461" | taxonkit reformat --taxid-field 1 --format '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}' --fill-miss-rank
taxid k p c o f g s t
2387 unclassified other entries superkingdom unclassified other entries phylum unclassified other entries class unclassified other entries order unclassified other entries family unclassified other entries genus unclassified other entries species unclassified other entries subspecies/strain
2399 unclassified other entries superkingdom unclassified other entries phylum unclassified other entries class unclassified other entries order unclassified other entries family unclassified other entries genus Transposon Tn2921 unclassified Transposon Tn2921 subspecies/strain
2646461 Eukaryota Rhodophyta Bangiophyceae Porphyridiales Porphyridiaceae Porphyridium unclassified Porphyridium species unclassified Porphyridium subspecies/strain

Thanks,

@shenwei356
Owner

Wow, I've never met this kind of taxon 😢

$ echo 2387 \
    | taxonkit lineage -t \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
2787854   no rank   other entries  
28384     no rank   other sequences
2387      no rank   transposons 

--fill-miss-rank needs at least one node with one of the seven ranks, so I can't figure out a way to handle this right now. Sorry.
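The precondition described above can be illustrated with a small check. This is only a sketch: `has_standard_rank` is a hypothetical helper, not part of TaxonKit, and the set of seven ranks is an assumption inferred from the rank names that appear in the outputs in this thread.

```python
# Assumed list of the "seven ranks" (taken from the rank names printed in this thread).
STANDARD_RANKS = {
    "superkingdom", "phylum", "class", "order", "family", "genus", "species",
}


def has_standard_rank(lineage):
    """Return True if any node in the lineage carries one of the seven ranks.

    lineage: list of (taxid, rank, name) tuples, ordered from root to leaf.
    """
    return any(rank in STANDARD_RANKS for _, rank, _ in lineage)


# Lineage of taxid 2387 as printed by the csvtk pipeline above.
lineage_2387 = [
    (2787854, "no rank", "other entries"),
    (28384, "no rank", "other sequences"),
    (2387, "no rank", "transposons"),
]

print(has_standard_rank(lineage_2387))  # False
```

Every node of this lineage is "no rank", so there is no ranked node to anchor the fill on.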

@fgvieira
Contributor Author

fgvieira commented Dec 4, 2024

I was thinking that it could assign each missing rank to the highest (i.e. closest to root) possible.
For example:

  • Assign the highest rank between phylum and genus:
    root / superkingdom / phylum / norank / genus: norank => class

  • Assign the highest rank after genus:
    root / superkingdom / phylum / class / genus / norank: norank => species

  • Assign the highest rank after species:
    root / superkingdom / phylum / class / genus / species / norank: norank => subspecies

  • Assign the highest rank after root (superkingdom) to norank1, then assign the highest rank after superkingdom (phylum) to norank2, and then assign the highest rank after phylum (class) to norank3:
    root / norank1 / norank2 / norank3 / species: norank1 => superkingdom; norank2 => phylum; norank3 => class

PS - I am assuming --format '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}'. So, the highest rank would have to be one of those included in the format.
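The four cases above could be sketched roughly as follows. This is an illustration of the proposal only, not TaxonKit's implementation: `assign_noranks` and `FORMAT_RANKS` are hypothetical names, the eighth slot is called "subspecies" as in the third case, and ranks that are neither in the format nor "no rank" are simply skipped.

```python
# Assumed rank order for the format {k} {p} {c} {o} {f} {g} {s} {t}.
FORMAT_RANKS = [
    "superkingdom", "phylum", "class", "order",
    "family", "genus", "species", "subspecies",
]


def assign_noranks(lineage, format_ranks=FORMAT_RANKS):
    """Place each "no rank" node in the highest free format rank.

    lineage: list of (rank, name) tuples from just below the root to the leaf.
    Returns a dict mapping format rank -> name.
    """
    index = {rank: i for i, rank in enumerate(format_ranks)}
    assigned = {}
    next_free = 0  # highest (closest to root) format rank still unused

    for pos, (rank, name) in enumerate(lineage):
        if rank in index:
            # A node with a known rank keeps it and closes all ranks above it.
            assigned[rank] = name
            next_free = index[rank] + 1
        elif rank == "no rank":
            # The slot must stay above the next ranked descendant, if any.
            limit = len(format_ranks)
            for later_rank, _ in lineage[pos + 1:]:
                if later_rank in index:
                    limit = index[later_rank]
                    break
            if next_free < limit:
                assigned[format_ranks[next_free]] = name
                next_free += 1
            # Otherwise no slot is left and the node is dropped.
    return assigned


# Fourth case, using the names from the lineage of taxid 2387.
print(assign_noranks([
    ("no rank", "other entries"),
    ("no rank", "other sequences"),
    ("no rank", "transposons"),
]))
# {'superkingdom': 'other entries', 'phylum': 'other sequences', 'class': 'transposons'}
```

The choice of the highest free rank is arbitrary, as the proposal itself says; any other rank inside the same gap would be equally defensible.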

@shenwei356
Owner

For cases 1, 2, and 3, that's how it currently works.

But for case 4, how do you know norank1 is a superkingdom, or norank2 is a phylum? No one can tell, I think.

How many taxids of this kind are there? If there are only a few of them, editing them manually would be faster. 😞

@fgvieira
Contributor Author

fgvieira commented Dec 4, 2024

In case 1, why should norank be a class? The only thing we know is that it is between phylum and genus, but it could also be an order or a family (there is no way to know).

It is the same logic when assigning norank1 to superkingdom and norank2 to phylum (or, for that matter, norank1 to phylum and norank2 to class). We only know that they are between root and species, so we just assign one of the possible ranks.
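The ambiguity can be made concrete by listing the ranks a norank node could take given its nearest ranked neighbours. A minimal sketch, assuming the eight-rank format from this thread; `candidate_ranks` is a hypothetical helper, not a TaxonKit function.

```python
# Assumed rank order for the format {k} {p} {c} {o} {f} {g} {s} {t}.
FORMAT_RANKS = [
    "superkingdom", "phylum", "class", "order",
    "family", "genus", "species", "subspecies",
]


def candidate_ranks(upper, lower, format_ranks=FORMAT_RANKS):
    """Format ranks strictly between two known ranks.

    upper: rank of the nearest ranked ancestor, or None if only root is above.
    lower: rank of the nearest ranked descendant, or None if there is none.
    """
    start = format_ranks.index(upper) + 1 if upper else 0
    end = format_ranks.index(lower) if lower else len(format_ranks)
    return format_ranks[start:end]


print(candidate_ranks("phylum", "genus"))
# ['class', 'order', 'family']
print(candidate_ranks(None, "species"))
# ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus']
```

In both situations there are several equally valid candidates, and picking the first one is just a convention.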

@fgvieira
Contributor Author

fgvieira commented Dec 6, 2024

Just a follow-up: as far as I can see, right now --fill-miss-rank discards all non-standard ranks (so all norank info is discarded) and replaces them with the highest known rank (with a prefix and suffix).

What I suggest is to try to place the noranks info into some likely/reasonable ranks given the other known ranks. So, if a norank is between phylum and order (and class is in the output format) then I think we can assign it to class.

EDIT: it is a bit tricky to explain the issue but, in case it helps, it seems TaxAllnomy (paper) does it (among other things) and has an interactive website to visualize the "fixed" NCBI taxonomy.
