Breaking change: one organism per dataset with matching feature reference #1074
Comments
From my perspective, (2) is strongly preferable. I'd be curious to understand for which fraction of datasets it would be feasible and what the estimate of the re-curation costs would be to help inform our decision. Thinking out loud, might we make a first attempt using one of the following approaches and provide authors an opt-in opportunity to review? Approach ideas:
It depends on what our schema says at that point (so I hesitate to dive too deep into the discussion while that's still unknown). But I think it would be perfectly reasonable to say "a Dataset with only 1 organism_ontology_term_id value MUST have feature_reference to match (SARS-COV-2 & spike-ins are also allowed)". So we either get updated submissions for those or they're removed. Reversing the mapping is not a viable option in my opinion. There was already 1 mapping that obscured things and removed many genes that had no mapping. Reversing would only add another layer of obscurity (and there's no telling if it reverses accurately compared to the initial conversion) and you won't recover the removed genes.
We would also need to address the case where a dataset contains multiple organisms such as https://cellxgene.cziscience.com/collections/367d95c0-0eb0-4dae-8276-9407239421ee or https://cellxgene.cziscience.com/collections/e1fa9900-3fc9-4b57-9dce-c95724c88716.
My proposal is to aim for these eventual requirements:
This is similar to our philosophy with Visium: the aggregated Datasets are of value to contributors but hold no reuse value, so they're permitted as long as we get that data in a reusable format. So as we get schema for each of these (allow their gene IDs and understand other standards like dev_stage) and have a timeline for the schema release, we reach out to the contributor and request their assistance in revising their Datasets. I am hopeful for a high response rate given that most find the current representation inadequate.
Browsed the Datasets a little...
Design
In 2025, the multiple species roadmap requires the following policy changes:
- One obs['organism_ontology_term_id'] per dataset, previously requested in Handling of multi organism datasets.
- var['feature_reference'] must match that obs['organism_ontology_term_id'] (with SARS-COV-2 & spike-ins also allowed)

Impact
Datasets with aggregated human and mouse
There is one aggregated dataset (Combined_human_and_mouse_limb_scRNAseq) that's Human+Mouse - https://cellxgene.cziscience.com/collections/4fefa187-5d14-4f1e-915b-c892ed320aab. The individual datasets are available in the collection.
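To help estimate impact, the two policy changes listed under Design could be checked per dataset along the following lines. This is only a sketch, not the validator's actual logic: the function name `check_single_organism` is hypothetical, and the exempt term IDs (NCBITaxon:2697049 for SARS-CoV-2 and NCBITaxon:32630 for spike-ins) are an assumption about how the "also allowed" references are encoded.

```python
from typing import Iterable, List

# Assumption: feature_reference values that are always allowed,
# SARS-CoV-2 and spike-ins ("synthetic construct").
EXEMPT_REFERENCES = {"NCBITaxon:2697049", "NCBITaxon:32630"}


def check_single_organism(
    organism_ids: Iterable[str], feature_references: Iterable[str]
) -> List[str]:
    """Return human-readable violations; an empty list means the dataset passes.

    organism_ids: values of obs['organism_ontology_term_id']
    feature_references: values of var['feature_reference']
    """
    organisms = set(organism_ids)
    if len(organisms) != 1:
        # Policy 1: exactly one organism per dataset.
        return [f"expected 1 organism, found {len(organisms)}: {sorted(organisms)}"]

    (organism,) = organisms
    # Policy 2: every feature reference is the organism itself or an exemption.
    unexpected = set(feature_references) - EXEMPT_REFERENCES - {organism}
    if unexpected:
        return [f"feature_reference {sorted(unexpected)} does not match {organism}"]
    return []
```

With an AnnData object loaded as `adata`, the call would be `check_single_organism(adata.obs['organism_ontology_term_id'], adata.var['feature_reference'])`. An aggregated human and mouse dataset would fail the first check, and a macaque dataset with human orthologs as its feature reference would fail the second.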
Datasets with orthologous gene references
The following organisms appear in the corpus:
allowed by the current requirement:
Organisms. Data MUST be from a Metazoan organism or SARS-COV-2 and defined in the NCBI organismal classification. For data that is neither Human, Mouse, nor SARS-COV-2, features MUST be translated into orthologous genes from the pinned Human and Mouse gene annotations.
- data from the dlPFC: Homo sapiens, Macaca mulatta, Pan troglodytes
- spanning the adult rhesus macaque brain (cell class "medium spiny neurons" subset)
- spanning the adult rhesus macaque brain (cell class "microglia" subset)
- spanning the adult rhesus macaque brain (cell class "dopaminergic neurons" subset)
- spanning the adult rhesus macaque brain
- spanning the adult rhesus macaque brain (cell class "serotonergic neurons" subset)
- spanning the adult rhesus macaque brain (cell class "basket cells" subset)
- spanning the adult rhesus macaque brain (cell class "ependymal cells" subset)
- spanning the adult rhesus macaque brain (cell class "cerebellar neurons" subset)
- spanning the adult rhesus macaque brain (cell class "oligodendrocytes" subset)
- spanning the adult rhesus macaque brain (1.5 million cell subset)
- spanning the adult rhesus macaque brain (cell class "glutamatergic neurons" subset)
- spanning the adult rhesus macaque brain (cell class "vascular cells" subset)
- spanning the adult rhesus macaque brain (cell class "astrocytes" subset)
- spanning the adult rhesus macaque brain (cell class "GABAergic neurons" subset)
- spanning the adult rhesus macaque brain (cell class "oligodendrocyte precursor cells" subset)
- pancreatic beta cells: Mus musculus, Sus scrofa domesticus
- pancreatic alpha cells: Mus musculus, Sus scrofa domesticus
- premature Rhesus macaques exposed to intra-uterine inflammation
- diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration inhibitory neurons: Homo sapiens, Mus musculus
- diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration non-nuerons: Homo sapiens, Mus musculus
- diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons: Homo sapiens, Mus musculus
- diversity in primary motor cortex of human, marmoset monkey, and mouse 4-species integration excitory neurons: Homo sapiens, Macaca mulatta, Mus musculus