Breaking change: one organism per dataset with matching feature reference #1074

Open
brianraymor opened this issue Oct 28, 2024 · 5 comments
Labels: 6.0 (Next major CELLxGENE schema version), multispecies discovery (Adding new species to CELLxGENE), schema (CELLxGENE Discover dataset schema)

Comments

@brianraymor
Contributor

brianraymor commented Oct 28, 2024

Design

In 2025, the multiple species roadmap requires the following policy changes:

  1. There MUST be one obs['organism_ontology_term_id'] per dataset, previously requested in Handling of multi organism datasets.
  2. The var['feature_reference'] MUST match that obs['organism_ontology_term_id'] (SARS-COV-2 and spike-ins are also allowed).
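
As a rough illustration of the two policy changes above, a validator could look like the sketch below. It is not the actual cellxgene-schema implementation. The function name `check_dataset` is hypothetical, the inputs are plain lists standing in for the `obs` and `var` columns, and the two exempt taxon IDs (SARS-CoV-2 and spike-ins) are assumptions that should be confirmed against the schema.

```python
# Assumed NCBITaxon IDs for the exempt feature references; confirm against the schema.
SARS_COV_2 = "NCBITaxon:2697049"
SPIKE_INS = "NCBITaxon:32630"
EXEMPT_REFERENCES = {SARS_COV_2, SPIKE_INS}


def check_dataset(obs_organisms, var_feature_references):
    """Return a list of violations (empty if the dataset passes both rules).

    obs_organisms: values of obs['organism_ontology_term_id'], one per cell.
    var_feature_references: values of var['feature_reference'], one per feature.
    """
    errors = []
    organisms = set(obs_organisms)

    # Rule 1: exactly one organism per dataset.
    if len(organisms) != 1:
        errors.append(f"expected exactly one organism, found {sorted(organisms)}")
        return errors

    # Rule 2: every non-exempt feature reference must equal that organism.
    (organism,) = organisms
    mismatched = set(var_feature_references) - EXEMPT_REFERENCES - {organism}
    if mismatched:
        errors.append(
            f"feature_reference values {sorted(mismatched)} do not match {organism}"
        )
    return errors
```

Under this sketch, a macaque dataset whose features still reference the human annotation would fail the second rule, which is the situation described for the datasets listed under Impact.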

Impact

Datasets with aggregated human and mouse

One aggregated dataset, Combined_human_and_mouse_limb_scRNAseq, combines human and mouse data: https://cellxgene.cziscience.com/collections/4fefa187-5d14-4f1e-915b-c892ed320aab. The individual datasets are available in the same collection.

Datasets with orthologous gene references

The following organisms appear in the corpus:

  • Callithrix jacchus
  • Gorilla gorilla
  • Macaca mulatta
  • Pan troglodytes
  • Sus scrofa domesticus

These organisms are allowed by the current requirement:

Organisms. Data MUST be from a Metazoan organism or SARS-COV-2 and defined in the NCBI organismal classification. For data that is neither Human, Mouse, nor SARS-COV-2, features MUST be translated into orthologous genes from the pinned Human and Mouse gene annotations.

| Collection | Dataset | Organism(s) |
| --- | --- | --- |
| e1fa9900-3fc9-4b57-9dce-c95724c88716 | Single-nucleus transcriptome data from the dlPFC | Callithrix jacchus, Homo sapiens, Macaca mulatta, Pan troglodytes |
| 4dca242c-d302-4dba-a68f-4c61e7bad553 | Gorilla: Great apes study | Gorilla gorilla |
| 4dca242c-d302-4dba-a68f-4c61e7bad553 | Macaque: Great apes study | Macaca mulatta |
| 4dca242c-d302-4dba-a68f-4c61e7bad553 | Marmoset: Great apes study | Callithrix jacchus |
| 4dca242c-d302-4dba-a68f-4c61e7bad553 | Chimpanzee: Great apes study | Pan troglodytes |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "medium spiny neurons" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "microglia" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "dopaminergic neurons" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "serotonergic neurons" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "basket cells" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "ependymal cells" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "cerebellar neurons" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "oligodendrocytes" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (1.5 million cell subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "glutamatergic neurons" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "vascular cells" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "astrocytes" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "GABAergic neurons" subset) | Macaca mulatta |
| 8c4bcf0d-b4df-45c7-888c-74fb0013e9e7 | A single-cell multi-omic atlas spanning the adult rhesus macaque brain (cell class "oligodendrocyte precursor cells" subset) | Macaca mulatta |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | amygdala.GAD | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | all.EPENDYMA.comp | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | hypothalamus.NEURONS | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | cortex.GAD.all | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | cortex.ASTROCYTES | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | all.MICROGLIA_MACROPHAGE.comp | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | pfc.a25.GLUT | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | hippocampus.GLUT | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | all.ASTROCYTES.comp | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | cbm.all | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | cortex.OLIGO | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | bf.NEURONS | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | striatum.SPN | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | brainstem.all | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | striatum.GAD | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | cortex.GLUT.all | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | pfc.a25.GAD | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | all.OLIGO_OPC.comp | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | hippocampus.GAD | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | amygdala.GLUT | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | thalamus.NEURONS | Callithrix jacchus |
| 0fd39ad7-5d2d-41c2-bda0-c55bde614bdb | all.ENDOTHELIA.nocbm | Callithrix jacchus |
| 0a77d4c0-d5d0-40f0-aa1a-5e1429bcbd7e | pig pancreatic islet cells | Sus scrofa domesticus |
| 0a77d4c0-d5d0-40f0-aa1a-5e1429bcbd7e | Cross species map of pancreatic beta cells | Homo sapiens, Mus musculus, Sus scrofa domesticus |
| 0a77d4c0-d5d0-40f0-aa1a-5e1429bcbd7e | Cross species map of pancreatic alpha cells | Homo sapiens, Mus musculus, Sus scrofa domesticus |
| 6e067060-f7e4-466c-86f3-ec3dd33c0381 | Inflammatory blockade in premature Rhesus macaques exposed to intra-uterine inflammation | Macaca mulatta |
| 367d95c0-0eb0-4dae-8276-9407239421ee | Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration inhibitory neurons | Callithrix jacchus, Homo sapiens, Mus musculus |
| 367d95c0-0eb0-4dae-8276-9407239421ee | Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration non-nuerons | Callithrix jacchus, Homo sapiens, Mus musculus |
| 367d95c0-0eb0-4dae-8276-9407239421ee | Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons | Callithrix jacchus, Homo sapiens, Mus musculus |
| 367d95c0-0eb0-4dae-8276-9407239421ee | Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 4-species integration excitory neurons | Callithrix jacchus, Homo sapiens, Macaca mulatta, Mus musculus |
@brianraymor brianraymor added schema CELLxGENE Discover dataset schema multispecies discovery Adding new species to CELLxGENE labels Oct 28, 2024
@brianraymor brianraymor self-assigned this Oct 28, 2024
@ambrosejcarr
Contributor

From my perspective, (2) is strongly preferable. I'd be curious to understand for which fraction of datasets it would be feasible and what the estimate of the re-curation costs would be to help inform our decision.

Thinking out loud, might we make a first attempt using one of the following approaches and provide authors an opt-in opportunity to review?

Approach ideas:

  1. Use an orthology service and reverse the mapping (presumably this is very straightforward, but might create errors)
  2. Grab the count matrices from GEO or other archive they were submitted to, create a gene-gene correlation matrix, transfer labels where we have correlation ~= 1 (more accurate, but could fail if authors did different curation for archive submissions)
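
To make the second idea more concrete, here is a minimal sketch of correlation-based label transfer. It assumes the curated and archived matrices contain the same cells in the same order, which may not hold in practice. The names `pearson` and `transfer_labels`, the threshold, and the dictionary input format are all illustrative choices, not an existing tool.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation is undefined, treat as no match
    return cov / math.sqrt(var_x * var_y)


def transfer_labels(curated, archived, threshold=0.999):
    """Map each curated gene label to the one archived gene whose counts match.

    curated / archived: dict of gene label -> per-cell counts.
    Genes with zero or several candidate matches are left out for manual review.
    """
    mapping = {}
    for curated_label, curated_counts in curated.items():
        matches = [
            archived_label
            for archived_label, archived_counts in archived.items()
            if pearson(curated_counts, archived_counts) >= threshold
        ]
        if len(matches) == 1:
            mapping[curated_label] = matches[0]
    return mapping
```

The all-pairs loop is quadratic in the number of genes, so a real attempt would likely need a vectorized implementation; the sketch only shows the matching logic.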

@jahilton
Collaborator

It depends on what our schema says at that point (so I hesitate to dive too deep into the discussion while that's still unknown). But I think it would be perfectly reasonable to say "a Dataset with only 1 organism_ontology_term_id value MUST have feature_reference to match (SARS-COV-2 & spike-ins are also allowed)". So we either get updated submissions for those or they're removed.

Reversing the mapping is not a viable option in my opinion. There was already 1 mapping that obscured things and removed many genes that had no mapping. Reversing would only add another layer of obscurity (and no telling if it reverses accurately compared to the initial conversion) and you won't recover the removed genes.
I anticipate the contributors will be responsive/motivated as they will get a more accurate representation of their data.
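
A toy sketch of why the reversal is lossy, using made-up gene labels. The function `invert_orthology` is hypothetical, and the forward mapping stands in for whatever conversion the contributor originally applied (which we generally do not have).

```python
from collections import defaultdict


def invert_orthology(forward):
    """Invert a native-gene -> human-gene mapping.

    forward: dict of native gene -> human ortholog, or None if no ortholog existed.
    Returns (reversible, ambiguous, dropped):
      reversible: human gene -> native gene, where exactly one native gene mapped to it
      ambiguous:  human gene -> list of native genes that collapsed onto it
      dropped:    native genes that had no ortholog and cannot be recovered
    """
    candidates = defaultdict(list)
    for native_gene, human_gene in forward.items():
        if human_gene is not None:
            candidates[human_gene].append(native_gene)

    reversible = {h: n[0] for h, n in candidates.items() if len(n) == 1}
    ambiguous = {h: sorted(n) for h, n in candidates.items() if len(n) > 1}
    dropped = sorted(g for g, h in forward.items() if h is None)
    return reversible, ambiguous, dropped


# Made-up labels: two native genes collapse onto one human gene, one has no ortholog.
forward = {"native_1": "human_A", "native_2": "human_B",
           "native_3": "human_B", "native_4": None}
reversible, ambiguous, dropped = invert_orthology(forward)
```

Here only `human_A` can be reversed unambiguously. `human_B` could have come from either `native_2` or `native_3`, and `native_4` is absent from the converted matrix entirely, so no reverse mapping can bring it back.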

@brianraymor
Contributor Author

But I think it would be perfectly reasonable to say "a Dataset with only 1 organism_ontology_term_id value MUST have feature_reference to match (SARS-COV-2 & spike-ins are also allowed)".

We would also need to address the case where a dataset contains multiple organisms such as https://cellxgene.cziscience.com/collections/367d95c0-0eb0-4dae-8276-9407239421ee or https://cellxgene.cziscience.com/collections/e1fa9900-3fc9-4b57-9dce-c95724c88716.

@jahilton
Collaborator

jahilton commented Nov 8, 2024

My proposal is to aim for these eventual requirements:

  • If only 1 organism in obs, then the feature_reference must match (SARS-COV-2 & spike-ins are also allowed)
  • If multiple organisms in obs, then the feature_reference must match 1 of them. All obs in such a Dataset must be is_primary_data:False, and each obs must also be represented in a single-organism Dataset.

This is similar to our philosophy with Visium: the aggregated Datasets are of value to contributors but hold no reuse value, so they're permitted as long as we also get that data in a reusable format.

So as we get schema for each of these (allow their gene IDs and understand other standards like dev_stage) & have a timeline for the schema release, we reach out to the contributors & request their assistance in Revising their Datasets. I am hopeful for a high response rate given that most find the current representation inadequate.
Once we get the non-human/-mouse Datasets listed above up to organism-specific standards, we can lock down the single-organism restriction.
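
The two requirements proposed above could be checked roughly as follows. This is a sketch under assumptions, not the validator: `check_proposed_rules` is a hypothetical name, obs is represented as (organism, is_primary_data) pairs, and the exempt references (SARS-COV-2, spike-ins) are passed in by the caller. The last condition, that every obs also appears in a single-organism Dataset, cannot be verified from one Dataset alone and is left out.

```python
def check_proposed_rules(obs_rows, var_feature_references, exempt=frozenset()):
    """Return a list of violations of the proposed requirements.

    obs_rows: iterable of (organism_ontology_term_id, is_primary_data) pairs.
    var_feature_references: values of var['feature_reference'].
    exempt: feature references that are always allowed (assumed to be supplied
            by the caller, e.g. the SARS-COV-2 and spike-in identifiers).
    """
    obs_rows = list(obs_rows)
    organisms = {organism for organism, _ in obs_rows}
    references = set(var_feature_references) - set(exempt)
    errors = []

    # Single- and multi-organism cases share this check: the non-exempt
    # features must use exactly one reference, and it must be an organism in obs.
    if len(references) != 1 or not references <= organisms:
        errors.append(
            f"feature_reference {sorted(references)} must match exactly one of "
            f"the organisms in obs {sorted(organisms)}"
        )

    # Multi-organism Datasets are aggregates, so no obs may be primary data.
    if len(organisms) > 1 and any(is_primary for _, is_primary in obs_rows):
        errors.append("multi-organism Dataset has obs with is_primary_data True")

    return errors
```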

@jahilton
Collaborator

Browsed the Datasets a little...

  • All in the table above have human feature reference
  • The largest dataset in 8c4bcf0d Collection has a mmulatta_homolog_ensembl_gene column in var w/ ENSMMUG.... IDs
  • The lone dataset in 6e067060 Collection has a author_Rh_Gene_stable_ID column in var w/ ENSMMUG.... IDs
  • The largest dataset in 0fd39ad7 Collection has a marmoset_gene_stable_id column in var w/ ENSCJAG... IDs
  • The pig-only dataset in 0a77d4c0 Collection has a pig_ensembl_ID column in var w/ ENSSSCG00000027257... IDs
  • The 2 x-species datasets in 0a77d4c0 Collection have mouse_ensembl_ID-mouse (ENSMUSG...) & pig_ensembl_ID-pig (ENSSSCG...) columns in var
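
One way to survey such columns is to tally Ensembl ID prefixes. The sketch below is illustrative only: `organisms_in_column` is a hypothetical helper, the prefix table covers just the prefixes mentioned above plus human (ENSG), and the organism names follow the table earlier in this issue.

```python
import re
from collections import Counter

# Ensembl stable gene ID prefixes seen in this thread, plus human.
PREFIX_TO_ORGANISM = {
    "ENSG": "Homo sapiens",
    "ENSMUSG": "Mus musculus",
    "ENSMMUG": "Macaca mulatta",
    "ENSCJAG": "Callithrix jacchus",
    "ENSSSCG": "Sus scrofa domesticus",
}

# "ENS", an optional species code, "G" for gene, then digits.
_GENE_ID = re.compile(r"^(ENS[A-Z]*G)\d+")


def organisms_in_column(gene_ids):
    """Count how many IDs in a var column belong to each known organism."""
    counts = Counter()
    for gene_id in gene_ids:
        match = _GENE_ID.match(str(gene_id))
        prefix = match.group(1) if match else None
        counts[PREFIX_TO_ORGANISM.get(prefix, "unknown")] += 1
    return counts


# Example with made-up IDs standing in for a column such as pig_ensembl_ID.
example = organisms_in_column(
    ["ENSSSCG00000000001", "ENSSSCG00000000002", "ENSG00000000003", "not_an_id"]
)
```

A column whose IDs almost all resolve to one organism would suggest the native gene IDs survived the orthology conversion and could be used when the Dataset is revised.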

@brianraymor brianraymor added the 5.4 Next minor CELLxGENE schema version after 5.3 label Nov 18, 2024
@brianraymor brianraymor changed the title Address datasets with orthologous gene references Breaking change: one organism per dataset with matching feature reference Nov 18, 2024
@brianraymor brianraymor added 6.0 Next major CELLxGENE schema version and removed 5.4 Next minor CELLxGENE schema version after 5.3 labels Nov 19, 2024