Fix: UpdateSymbolList incorrectly renames genes #8179

Open: wants to merge 2 commits into base: develop

@samuel-marsh
Collaborator

samuel-marsh commented Dec 12, 2023

Hi Seurat Team,

This is a PR that builds on the previous fix described in #4545. It is by no means a perfect fix (explanation below), so I leave it to you to decide the path forward. If you decide a different solution is warranted, I'm happy to help with a PR.

The issue is the potential for UpdateSymbolList to inappropriately rename genes. In the original fix, the search for alias symbols was removed from the internals of UpdateSymbolList by manually restricting the lookup to previous symbols only. Unfortunately, that still causes problems, because a number of previous symbols are now the approved symbols of different genes. For instance, MCM2, MCM7, and CCNL1 are all currently approved gene symbols, yet in its current form UpdateSymbolList still renames two of them:

> UpdateSymbolList(symbols = c("MCM2", "MCM7", "CCNL1"))
  |==============================================================================================================| 100%
Found updated symbols for 2 symbols
MCM2 -> MCM7
CCNL1 -> MCM2
[1] "MCM7" "MCM7" "MCM2"

Using the most recent 10X human reference genome (which is filtered, so this is not the full extent of potential issues), I have found >100 genes which would be inappropriately swapped.

The "simple" solution in this PR adds a parameter that requires an object to be specified, so that synonyms are only used if they are genes not already found in the object. This limits the function to use with a Seurat object but protects against inappropriate renames (though not completely).

The reason it's not a complete solution is that most Seurat objects are filtered versions of the count matrix, which often results in objects containing half of the genes present in the annotation file. The function therefore still leaves open the possibility of inappropriately renaming a gene if the conflicting gene was filtered out during object creation. Avoiding this completely would require a different fix in which Seurat stores the full feature list from the counts input somewhere in the object, so that symbols can be checked against that list rather than against the current features returned by Features.

Again, if the current PR solution is not desired, that is totally fine, but I wanted to alert you to the issue.
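The guard and its remaining gap can be sketched in a few lines of base R. This is only an illustration, not the code in this PR: `update_symbols_guarded` and `previous_to_current` are hypothetical names, the lookup is a toy vector built from the two renames shown in the output above, and it assumes one reading of the guard, namely that a rename is applied only when the replacement symbol is not already a feature of the object.

```r
# Previous symbol -> currently approved symbol, copied from the
# UpdateSymbolList output quoted above (toy lookup, not a live HGNC query).
previous_to_current <- c(MCM2 = "MCM7", CCNL1 = "MCM2")

# Hypothetical helper: apply a rename only when the replacement symbol is
# not already a feature of the object.
update_symbols_guarded <- function(symbols, object_features, lookup) {
  candidate <- unname(lookup[symbols])  # NA where there is no previous-symbol hit
  rename <- !is.na(candidate) & !(candidate %in% object_features)
  symbols[rename] <- candidate[rename]
  symbols
}

# All three genes present: every candidate is already a feature, so nothing changes.
full <- c("MCM2", "MCM7", "CCNL1")
update_symbols_guarded(full, full, previous_to_current)

# MCM7 was filtered out during object creation: MCM2 is still renamed to MCM7.
filtered <- c("MCM2", "CCNL1")
update_symbols_guarded(filtered, filtered, previous_to_current)
```

The second call shows why checking against the current features alone is not sufficient: the guard can only see genes that survived filtering.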

Best,
Sam

Note/Edit: The CI failure appears to be related to a BiocManager install error, not this PR.

@samuel-marsh
Collaborator Author

samuel-marsh commented Dec 19, 2023

Hi Seurat Team,

I still don't have a perfect answer, but I thought I would provide an update here on an alternative (though more conservative) method for updating genes.

I have been testing a function in a dev branch (branch: file_cache_dev; https://github.com/samuel-marsh/scCustomize/blob/c14063f8cd34f3f9f94903c7e82980c68cbd3a84/R/Utilities.R#L1868) of scCustomize that handles things slightly differently.

The function I wrote pulls the entire HGNC data csv (stored as a cache using BiocFileCache to avoid the need to download it every time, while still allowing updates via a cache update). It then filters the input symbols down to only those which are NOT currently approved symbols, and checks only those unapproved symbols against the list of previous symbols. If a match is found, it provides the updated approved symbol.

I say it's more conservative because, while it prevents mis-naming of genes, it does allow for the possibility that some gene names are not updated. There are examples of genes which have swapped symbols with each other. Because both symbols are still approved, such genes would be filtered out of the check and not updated.

How conservative this is in practice probably depends on the age of the input gene set, as more recent gene sets are less likely than older ones to contain genes which go un-renamed.

Again, I don't really know what the perfect solution is when the only input is gene symbols and not Entrez/Ensembl IDs. But I thought I would make you aware of this potential solution in addition to the problems described in the first post here.
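The filtering step described above can be sketched as follows. This is a simplified illustration rather than the scCustomize implementation: it skips the HGNC download and the BiocFileCache caching entirely, `update_symbols_conservative`, `approved_symbols`, and `previous_to_approved` are hypothetical names, and `OLDSYM1`/`NEWSYM1` are made-up placeholder symbols. The MCM2, MCM7, and CCNL1 entries follow the output quoted in the first post.

```r
# Toy stand-ins for the HGNC table (illustrative values only).
approved_symbols <- c("MCM2", "MCM7", "CCNL1", "NEWSYM1")
previous_to_approved <- c(MCM2 = "MCM7", CCNL1 = "MCM2", OLDSYM1 = "NEWSYM1")

# Hypothetical helper: only symbols that are NOT currently approved are
# looked up among previous symbols; approved symbols are left untouched.
update_symbols_conservative <- function(symbols, approved, lookup) {
  idx <- match(symbols, names(lookup))            # position in lookup, NA if absent
  do_update <- !(symbols %in% approved) & !is.na(idx)
  symbols[do_update] <- unname(lookup[idx[do_update]])
  symbols
}

input <- c("MCM2", "MCM7", "CCNL1", "OLDSYM1", "UNKNOWN1")
update_symbols_conservative(input, approved_symbols, previous_to_approved)
```

In this sketch the three approved symbols pass through unchanged, the placeholder previous symbol is updated, and a symbol with no match is returned as-is.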

Best,
Sam

@mschilli87
Contributor

Is there any update on this front? Is this still maintained/recommended or has this been solved in a different way meanwhile?

@samuel-marsh
Collaborator Author

Hi @mschilli87,

I’ve created a function in my package scCustomize which can handle this now. It also works offline after the first use with an internet connection.

https://samuel-marsh.github.io/scCustomize/articles/Update_Gene_Symbols.html

Best,
Sam

@mschilli87
Contributor

mschilli87 commented May 22, 2024

So maybe this PR should be replaced by one removing Seurat's own implementation and importing that one instead then? If this got merged, the code would be duplicated and need to be maintained in two places.

@samuel-marsh
Collaborator Author

This PR doesn’t implement that function from scCustomize, as I was aiming to minimize dependencies and keep the output format as close as possible to that of the existing function. I leave it to the Seurat team to decide how they want to implement or change it.

Best,
Sam
