No taxIDs for proteomes #36

Open
bheimbu opened this issue Jan 30, 2025 · 10 comments

@bheimbu

bheimbu commented Jan 30, 2025

Dear @josuebarrera,

Great software, many thanks for this. I have a question about unpublished proteomes without taxIDs: is it possible to use the taxID of the genus instead, or is it simply not possible to use such proteomes?

Cheers Bastian

@josuebarrera
Owner

Dear @bheimbu,

Thank you for reaching out! If you have a proteome from a species that is not yet described in the NCBI Taxonomy, you can "borrow" the taxID of another species from the same genus that has no data in the NR database. Let me give you an example:

Let's say that you have a proteome from an as yet undescribed species of the genus Arabidopsis. Since it doesn't have an NCBI Taxonomy ID, you could borrow the taxID of Arabidopsis lyrata x Arabidopsis halleri, which doesn't have any published proteins in the NR database. This way you can still make use of all the proteomic data available within the genus and find species-specific genes. The results will be the same as if you had a taxID for your undescribed species. Just bear in mind that all the results will carry the name Arabidopsis lyrata x Arabidopsis halleri, which you can change later.

I hope this helps, let me know if you need anything else.

Cheers,
Josué

@bheimbu
Author

bheimbu commented Jan 30, 2025

Hi @josuebarrera,

That should actually be no problem for me. I'm working with oribatid mites, which means there is not much protein data available anyway. I'll try your suggestion though.

Anyway, can I ask you an unrelated question here or should I open a new thread?

Cheers Bastian

@josuebarrera
Owner

Dear @bheimbu,

I'm glad this workaround works for you!
We can use the same thread for any remaining questions you might have. Please ask away!

@josuebarrera josuebarrera self-assigned this Jan 30, 2025
@bheimbu
Author

bheimbu commented Jan 30, 2025

Awesome.

First, some background info: I'm working with oribatid mites, as mentioned before. I've found orthologs of some very interesting proteins throughout Oribatida, and I presume that these genes were acquired via horizontal gene transfer from either bacteria or fungi, probably a long, long time ago since oribatids are an ancient group.

So I took these orthologs (protein sequences) and ran GenEra. Most of them are assigned to "cellular organisms", but some are flagged as probable "contamination or HGT". To be honest, I don't understand the idea behind the latter in detail. What exactly does it mean if homologs can only be found, e.g., from cellular organisms to Opisthokonta and then again in the species of interest? Does this imply a probable HGT event from an organism within or below Opisthokonta to my species of interest?

And regarding "cellular organisms": does it mean that these protein-coding genes were already present in LUCA? So these genes are ancient and have been neofunctionalized? I really need some help here. Please find my gene_ages.tsv attached.


Sorry for all these questions, I'm really new to phylostratigraphy.

Cheers Bastian

@josuebarrera
Owner

Dear @bheimbu,

Sorry for the delayed response. That is a very good question. We developed a "taxonomic representativeness" score in GenEra, knowing that these types of cases would arise. If you look at the fourth column of gene_ages.tsv, you'll notice a number that ranges from 100 down to 6. This number is calculated in the following way:

L = 100 x (RP/(AP-1))

Where:

  • L is the taxonomic representativeness score.
  • RP is the number of internode taxonomic levels where GenEra found gene homologs for that protein.
  • AP is the total number of taxonomic levels that separate the most distantly related match (e.g., a bacterium) from the gene of your species of interest (your oribatid mites).

Therefore, a taxonomic representativeness score of 100 means that GenEra was able to find homologs at every taxonomic level, all the way from your species of interest to cellular organisms, whereas a score of 6 means that GenEra only found homologs in a very distantly related species (e.g., a bacterium) and in your species of interest. By default, GenEra flags every gene with a taxonomic representativeness score lower than 30 as "Possible contamination or HGT". This threshold can be modified according to the user's needs (argument -l), in case they want to be more stringent with this parameter. GenEra also generates a file named [TAXID]_ambiguous_phylostrata.tsv with all the genes that were labeled as HGT or contamination, giving a list of possible taxonomic levels to which these genes could be assigned, which I think will be useful for you.
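As a small illustration of the formula above, here is a minimal Python sketch of the score and the default flagging rule. The function names are hypothetical (they are not part of GenEra), and the RP and AP values in the usage lines are made up for illustration only.

```python
def taxonomic_representativeness(rp, ap):
    """Compute L = 100 x (RP / (AP - 1)) as defined in this thread.

    rp: number of internode taxonomic levels with homologs (RP)
    ap: total number of taxonomic levels between the most distant
        match and the species of interest (AP)
    """
    if ap < 2:
        raise ValueError("AP must be at least 2 for the score to be defined")
    return 100 * rp / (ap - 1)


def is_flagged(score, threshold=30):
    """Return True if the score is strictly lower than the threshold.

    30 is the default described in this thread; it is changed with -l.
    """
    return score < threshold


# Hypothetical values: 18 levels in total, homologs found at all 17 internodes.
full = taxonomic_representativeness(17, 18)
# Hypothetical values: same 18 levels, homologs at only one internode level.
patchy = taxonomic_representativeness(1, 18)

print(full, is_flagged(full))
print(round(patchy), is_flagged(patchy))
```

With these made-up inputs, the first gene has homologs at every level and is not flagged, while the second has a patchy distribution and falls below the default threshold.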

Given your results, I would probably set -l 65 or something similar, so you can properly evaluate all the proteins with "patchy" matches across the NCBI Taxonomy by looking at the ambiguous_phylostrata table. Bear in mind that this score is a bit simplistic for the complex nature of HGT, so I would definitely complement any potential results with a gene tree analysis or something similar.
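Before re-running with a different -l value, it can help to preview which genes would fall below a given threshold. The following is a rough Python sketch, not part of GenEra: it only assumes that the file is tab-separated and that the fourth column holds the score, as stated above. The header names and the sample rows are invented for illustration.

```python
import csv
import io


def genes_below_threshold(handle, threshold=65.0):
    """Return (gene, score) pairs whose fourth-column score is below threshold."""
    flagged = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment/header lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < 4:
            continue
        try:
            score = float(row[3])
        except ValueError:
            continue  # non-numeric fourth column, e.g. an unmarked header
        if score < threshold:
            flagged.append((row[0], score))
    return flagged


# Invented sample content; real column names and values may differ.
SAMPLE = (
    "#gene\tphylostratum\trank\ttaxonomic_representativeness\n"
    "geneA\tcellular organisms\t1\t100.0\n"
    "geneB\tcellular organisms\t1\t40.0\n"
    "geneC\tcellular organisms\t1\t5.88\n"
)

for gene, score in genes_below_threshold(io.StringIO(SAMPLE), threshold=65.0):
    print(gene, score)
```

For a real run, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="")` pointing at your own gene_ages.tsv.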

Cheers,
Josué

@bheimbu
Author

bheimbu commented Feb 2, 2025

Dear @josuebarrera,

Many thanks for your reply. I'm relieved that I'm not completely wrong here ;)

Thanks for the clarification. I'm wondering whether I could settle for a different value for '-l' (perhaps lower than 65) when I use the option '-a'?

I could also try the option '-s', as I have a phylogenetic tree that includes all my oribatid mite species of interest. This should work then, right?

Cheers Bastian

@bheimbu
Author

bheimbu commented Feb 3, 2025

Just a quick question: must the phylogenetic tree be based on the same protein data used for the analysis of gene-founder events? Or can it be any tree that includes all species of interest? I highly doubt the latter, though.

Cheers Bastian

@josuebarrera
Owner

Dear @bheimbu,

You don't need the phylogeny of your mites to add the proteomes to your analysis. As you mentioned, you can use the option -a to include all the desired proteomes in your analysis. Since you specify their taxonomic IDs with this option, GenEra will automatically consider their phylogenetic relationships for the analysis (see this thread from another user for a detailed explanation on this issue). In the unlikely scenario that all your mites lack an NCBI Taxonomy ID, you can still use the same workaround that I described earlier, so there's no need for you to worry about generating a species phylogeny!

Once you finish your initial GenEra run, you can play around with the -l parameter. You would just need to re-run the analysis from step 3, which should run fairly fast since you can reuse the files from steps 1 and 2 of the pipeline.

Hope this helps, let me know if you need anything else!

Cheers,
Josué

@bheimbu
Author

bheimbu commented Feb 3, 2025

Dear @josuebarrera,

So there would be no additional benefit to using -s (a phylogenetic tree)?

Cheers Bastian

@josuebarrera
Owner

Dear @bheimbu,

The parameter -s was introduced to deal with homology detection failure, which is a classic critique of phylostratigraphy. Basically, it allows GenEra to discern whether the lack of homologs outside a group of organisms has a biological meaning or whether it can be explained by the limited sensitivity of pairwise alignment methods (see this paper for more info on the issue). If you would like to use this option, you can use any species tree that has branch lengths in substitutions per site and includes all your species of interest and an outgroup.
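A quick sanity check of a tree against these requirements could look like the Python sketch below. It is only an illustration under some assumptions: the tree is in plain Newick format with unquoted leaf names, the function name and taxon labels are hypothetical, and the check cannot tell whether the branch lengths really are in substitutions per site.

```python
import re

# A leaf is a name that directly follows "(" or ",", optionally followed by
# ":length", and is itself followed by "," or ")". Internal node labels and
# lengths follow ")" and are therefore not matched.
LEAF = re.compile(r"[(,]\s*([^\s():,;]+)\s*(?::\s*([^\s():,;]+))?\s*(?=[,)])")


def check_species_tree(newick, required_taxa):
    """Return a list of problems found; an empty list means the checks passed."""
    leaves = dict(LEAF.findall(newick))  # leaf name -> branch length text
    problems = []
    for taxon in required_taxa:
        if taxon not in leaves:
            problems.append(f"missing taxon: {taxon}")
    for name, length in leaves.items():
        try:
            float(length)
        except ValueError:
            problems.append(f"no numeric branch length for leaf: {name}")
    return problems


# Hypothetical tree with two species of interest and one outgroup.
tree = "((Oribatid_A:0.12,Oribatid_B:0.10):0.05,Outgroup_sp:0.40);"
print(check_species_tree(tree, ["Oribatid_A", "Oribatid_B", "Outgroup_sp"]))
```

An empty list means every required taxon is present and every leaf has a numeric branch length. Trees with quoted names or embedded comments would need a proper Newick parser instead of this regular expression.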

The other option to add a phylogenetic tree is using parameter -z, which is used to integrate infraspecies-level relationships into GenEra (e.g., gene ages across subspecies, varieties, or strains). Since you're likely dealing with different species, this won't be of any use to you.

Hope this info helps!

Cheers,
Josué
