Draft relaxed schema compliance #1025

Open
brianraymor opened this issue Sep 27, 2024 · 2 comments
Assignees
Labels
discovery schema CELLxGENE Discover dataset schema

Comments

@brianraymor
Contributor

brianraymor commented Sep 27, 2024

Context

There are emerging requirements for reusing the cellxgene-schema CLI (schema and validator) in scenarios that are more relaxed than CELLxGENE Discover's current requirements.

Relaxation

The following sections blue-sky possible approaches to documenting relaxed requirements; however, the solution should be driven by concrete scenarios and not theory.

Fine Granularity: Per Schema variant

A limited number of schema variants could be documented, such as the "cross modality schema". The curator could reuse schema_reference to declare the preferred schema for validation.
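A minimal sketch of how a validator might dispatch on schema_reference, assuming each variant is identified by its own reference string. The two URLs, the `VALIDATORS` table, and the per-variant functions are hypothetical placeholders, not part of cellxgene-schema:

```python
from typing import Callable, Dict, List

# Hypothetical variant identifiers; the real schema_reference values are
# chosen by the schema authors and are not reproduced here.
STRICT = "https://example.org/schema/strict"
CROSS_MODALITY = "https://example.org/schema/cross-modality"


def validate_strict(uns: dict) -> List[str]:
    """Placeholder for the current (strict) checks."""
    return []


def validate_cross_modality(uns: dict) -> List[str]:
    """Placeholder for a hypothetical cross modality variant."""
    return []


# One validator per documented schema variant.
VALIDATORS: Dict[str, Callable[[dict], List[str]]] = {
    STRICT: validate_strict,
    CROSS_MODALITY: validate_cross_modality,
}


def validate(uns: dict) -> List[str]:
    """Pick the validator named by uns['schema_reference']."""
    ref = uns.get("schema_reference")
    validator = VALIDATORS.get(ref)
    if validator is None:
        return [f"unknown schema_reference: {ref!r}"]
    return validator(uns)
```

With this shape, adding a variant means adding one entry to the table, which keeps the number of variants explicit and small.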


Fine Granularity: Per Metadata field

For each metadata field, the schema defines separate requirements for strict and relaxed. Generally, relaxed will indicate that the field MUST NOT be present, but it's also possible to relax other requirements.


uns (Dataset Metadata)

relaxed

Key: relaxed
Annotator: Curator MAY annotate.
Value: list[str]. str values MUST match one or more of the values in the set:
  • "obs['cell_type_ontology_term_id']"
  • "obs['development_stage_ontology_term_id']"
  • ...

If present, relaxed validation MUST be performed on the specified metadata field.
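As an illustration, the value check for uns['relaxed'] could look like the sketch below. It works on a plain dict rather than an AnnData object, and `KNOWN_RELAXABLE` holds only the two values named above because the rest of the set is elided in the draft. The function name and error messages are made up for this example:

```python
from typing import List

# Only the two values listed in the draft; the remainder of the set is
# elided there and is deliberately not guessed at here.
KNOWN_RELAXABLE = {
    "obs['cell_type_ontology_term_id']",
    "obs['development_stage_ontology_term_id']",
}


def check_relaxed_value(uns: dict) -> List[str]:
    """Return error messages for uns['relaxed']; empty list means OK."""
    if "relaxed" not in uns:
        return []  # Curator MAY annotate, so absence is fine.
    value = uns["relaxed"]
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        return ["uns['relaxed'] MUST be a list[str]"]
    return [
        f"uns['relaxed'] contains unsupported value {v!r}"
        for v in value
        if v not in KNOWN_RELAXABLE
    ]
```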


Concrete example: Suppose the assay is silver tier Visium Spatial Gene Expression and cell_type_ontology_term_id defines its relaxed validation as:

  1. cell_type_ontology_term_id MUST NOT be present in obs
  2. "cell_type_ontology_term_id" MUST be annotated in uns['relaxed']

Then the silver tier dataset would simply meet those requirements.
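The two rules in this example could be checked roughly as follows. This is a sketch under assumptions: obs is represented by its column names, uns by a plain dict, and the entry in uns['relaxed'] is assumed to use the "obs['...']" form from the value set above. `check_relaxed_cell_type` is a hypothetical helper name:

```python
from typing import Iterable, List

FIELD = "cell_type_ontology_term_id"
# Assumption: uns['relaxed'] entries use the "obs['<field>']" spelling.
RELAXED_KEY = f"obs['{FIELD}']"


def check_relaxed_cell_type(obs_columns: Iterable[str], uns: dict) -> List[str]:
    """Apply the two relaxed rules for cell_type_ontology_term_id."""
    errors = []
    # Rule 2: the field MUST be annotated in uns['relaxed'].
    if RELAXED_KEY not in uns.get("relaxed", []):
        errors.append(f"{RELAXED_KEY} MUST be annotated in uns['relaxed']")
    # Rule 1: the field MUST NOT be present in obs.
    if FIELD in set(obs_columns):
        errors.append(f"{FIELD} MUST NOT be present in obs")
    return errors
```

A silver tier dataset that omits the column and lists it in uns['relaxed'] produces no errors.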


Coarse Granularity: Per Dataset

The schema documents a relaxed subset of the currently required fields. This subset might omit cell_type_ontology_term_id or perhaps development_stage_ontology_term_id. If a currently required field is not included in the relaxed subset, then it MUST NOT be present in the dataset.

Curators annotate whether strict or relaxed validation is desired.


uns (Dataset Metadata)

strict

Key: strict
Annotator: Curator MUST annotate.
Value: bool. This MUST be True for strict validation and MUST be False for relaxed validation.
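A rough sketch of the per-dataset approach, assuming the schema publishes one strict field list and one relaxed subset. The two sets below are illustrative only (the draft does not fix the subset), and `validate_dataset` is a hypothetical name:

```python
from typing import Iterable, List

# Illustrative field lists; the actual relaxed subset is not decided
# in the draft.
STRICT_REQUIRED = {
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "tissue_ontology_term_id",
}
RELAXED_REQUIRED = STRICT_REQUIRED - {
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
}


def validate_dataset(obs_columns: Iterable[str], uns: dict) -> List[str]:
    """Validate obs columns against the strict or relaxed field list."""
    strict = uns.get("strict")
    if not isinstance(strict, bool):
        return ["uns['strict'] MUST be annotated as a bool"]
    present = set(obs_columns)
    required = STRICT_REQUIRED if strict else RELAXED_REQUIRED
    errors = [f"obs['{f}'] MUST be present" for f in sorted(required - present)]
    if not strict:
        # Fields dropped from the relaxed subset MUST NOT appear at all.
        forbidden = STRICT_REQUIRED - RELAXED_REQUIRED
        errors += [
            f"obs['{f}'] MUST NOT be present under relaxed validation"
            for f in sorted(forbidden & present)
        ]
    return errors
```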

References

Compliance to the MiAIRR Data Standard is currently a binary state, i.e., a data either is or is not compliant, there are not “grades” of compliance. However, additional requirements for specific use cases might be defined in the future.

@brianraymor brianraymor added schema CELLxGENE Discover dataset schema discovery labels Sep 27, 2024
@brianraymor brianraymor self-assigned this Sep 27, 2024
@brianraymor brianraymor changed the title from "Drafting relaxed schema compliance" to "Draft relaxed schema compliance" Oct 4, 2024
@nayib-jose-gloria
Contributor

I'd prefer not to overload "relaxed" to mean anything besides "MUST NOT contain". If we want to "relax" in some other way, it should probably be a new schema variant or additional flag.

@nayib-jose-gloria
Contributor

I like the idea of using a combination of schema_reference to point to variant schemas, and uns.relaxed to point to which requirements to ignore in that given schema reference.

We may have dependent columns that need to be relaxed, like tissue_type and tissue_ontology_term_id. Just wanted to note that we'll have to account for that dependency either by logging an error if tissue_ontology_term_id is relaxed and tissue_type is not, or automatically relaxing dependent columns of relaxed columns.
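Both options could be sketched as below, assuming relaxed columns are tracked as a set of bare column names. The `DEPENDENTS` map and both function names are hypothetical, and the direction of the dependency (tissue_type follows tissue_ontology_term_id) is taken from the example in this comment:

```python
from typing import Dict, List, Set

# Hypothetical map: relaxing the key requires relaxing each listed column.
DEPENDENTS: Dict[str, Set[str]] = {
    "tissue_ontology_term_id": {"tissue_type"},
}


def find_dependency_errors(relaxed: Set[str]) -> List[str]:
    """Option 1: report relaxed columns whose dependents are not relaxed."""
    errors = []
    for column in sorted(relaxed):
        for dep in sorted(DEPENDENTS.get(column, set()) - relaxed):
            errors.append(f"{column} is relaxed but dependent column {dep} is not")
    return errors


def expand_relaxed(relaxed: Set[str]) -> Set[str]:
    """Option 2: automatically relax dependents, following chains."""
    expanded = set(relaxed)
    stack = list(relaxed)
    while stack:
        column = stack.pop()
        for dep in DEPENDENTS.get(column, set()):
            if dep not in expanded:
                expanded.add(dep)
                stack.append(dep)
    return expanded
```

Option 1 keeps the curator's annotation authoritative and explicit; option 2 is more forgiving but means the effective relaxed set can differ from what is written in uns.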
