Luke Zappia edited this page Mar 7, 2023 · 6 revisions

This page describes the interface between different sections of the benchmarking pipeline. For more detail on the overall structure see the Pipeline page.

Datasets

Raw datasets

Each dataset script should produce a .h5ad file containing an AnnData object with the following structure:

  • adata.X should contain raw counts
  • adata.obs should include a column containing batch labels for each cell (with any name)
  • adata.obs should include a column containing annotation labels for each cell (with any name)

The preparation script performs only basic quality control, so if the source provides unfiltered cells, any more thorough filtering of cells should be done in the dataset script.

The saved .h5ad file can contain any other information (embeddings, other .obs columns etc.) but this will be removed during the preparation step.

Prepared datasets

The dataset preparation step produces two .h5ad files, one containing the reference subset and one containing the query subset (split according to the provided query batches). These files contain ONLY the following:

  • adata.X a sparse matrix containing raw counts
  • adata.obs["Batch"] containing batch labels for each cell
  • adata.obs["Label"] containing annotation labels for each cell
  • adata.obs["Unseen"] containing unseen population labels for each cell
  • adata.uns["Species"] containing the species of the dataset
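The structure above can be checked with a small helper. This is only a sketch: `check_prepared` is a hypothetical function that is not part of the pipeline, it inspects only `.obs` and `.uns`, and it relies on both exposing a `keys()` method (true for a pandas DataFrame and for a dict). The species value in the usage example is illustrative.

```python
from types import SimpleNamespace

# Required fields, taken from the list above
REQUIRED_OBS = {"Batch", "Label", "Unseen"}
REQUIRED_UNS = {"Species"}


def check_prepared(adata):
    """Return a list of problems found in a prepared dataset (empty if none).

    Hypothetical helper: accepts any object with dict-like ``obs`` and ``uns``
    attributes, which includes an AnnData object.
    """
    obs_keys = set(adata.obs.keys())
    uns_keys = set(adata.uns.keys())
    problems = []
    for key in sorted(REQUIRED_OBS - obs_keys):
        problems.append(f"missing obs column: {key}")
    # Prepared files should contain ONLY the listed fields
    for key in sorted(obs_keys - REQUIRED_OBS):
        problems.append(f"unexpected obs column: {key}")
    for key in sorted(REQUIRED_UNS - uns_keys):
        problems.append(f"missing uns entry: {key}")
    for key in sorted(uns_keys - REQUIRED_UNS):
        problems.append(f"unexpected uns entry: {key}")
    return problems


# Usage with a stand-in object (a real AnnData would be passed the same way)
example = SimpleNamespace(
    obs={"Batch": [], "Label": [], "Unseen": []},
    uns={"Species": "human"},
)
print(check_prepared(example))
```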

The preparation script also performs minimal filtering of the dataset, removing cells with fewer than 100 counts or fewer than 100 expressed features, and features with zero counts in the reference. Labels with fewer than 20 cells are also removed from both the reference and the query. The query cannot contain labels that are not present in the reference unless they are explicitly marked as unseen populations. The output of this step is the input to both the feature selection methods and the integration steps.
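The filtering thresholds can be illustrated with a plain-Python sketch. It is not the preparation script itself: `filter_reference` is a hypothetical function, it handles only the reference, and the order of the steps (cells, then labels, then features) is an assumption, since the description above does not state it.

```python
from collections import Counter


def filter_reference(counts, labels, min_counts=100, min_features=100, min_cells=20):
    """Return (kept cell indices, kept feature indices) for a reference.

    counts: list of per-cell lists of raw counts (cells x features)
    labels: list with one annotation label per cell
    """
    # Keep cells with enough total counts AND enough expressed features
    keep_cells = [
        i
        for i, row in enumerate(counts)
        if sum(row) >= min_counts and sum(1 for x in row if x > 0) >= min_features
    ]
    # Drop labels with fewer than min_cells remaining cells
    label_sizes = Counter(labels[i] for i in keep_cells)
    keep_cells = [i for i in keep_cells if label_sizes[labels[i]] >= min_cells]
    # Drop features with zero counts across the remaining reference cells
    n_features = len(counts[0]) if counts else 0
    keep_features = [
        j for j in range(n_features) if any(counts[i][j] > 0 for i in keep_cells)
    ]
    return keep_cells, keep_features


# Toy example with lowered thresholds so the effect is visible
toy_counts = [[5, 0, 3], [0, 0, 1], [4, 0, 2], [6, 0, 1]]
toy_labels = ["A", "A", "A", "B"]
print(filter_reference(toy_counts, toy_labels, min_counts=3, min_features=2, min_cells=2))
```

In the toy example the second cell fails the count threshold, label "B" is left with a single cell and is dropped, and the middle feature has no counts in the remaining cells.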

Methods

The method scripts take the reference AnnData from the preparation step and produce a TSV file with a column named "Feature" containing the names of the selected features. Other columns containing information from the method are allowed but will not be used by later steps.
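A minimal sketch of writing and reading such a file with Python's standard `csv` module follows. The function names, the gene names and the extra "Score" column are hypothetical; only the "Feature" column is required by the interface.

```python
import csv
import io


def write_features(handle, rows):
    """Write selected features as TSV.

    rows: list of dicts, each with a "Feature" key; any other keys are
    written as extra columns after "Feature".
    """
    extra = sorted({key for row in rows for key in row} - {"Feature"})
    writer = csv.DictWriter(
        handle, fieldnames=["Feature"] + extra, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def read_features(handle):
    """Read back only the "Feature" column, ignoring any extra columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Feature"] for row in reader]


# Usage with an in-memory buffer instead of a real file
buffer = io.StringIO()
write_features(
    buffer,
    [{"Feature": "GeneA", "Score": 0.9}, {"Feature": "GeneB", "Score": 0.5}],
)
buffer.seek(0)
print(read_features(buffer))
```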

Integration

Reference building

scVI

The scVI integration step takes the reference dataset from the preparation step and produces a directory containing the scVI model and a .h5ad file with the following structure:

  • adata.obs["Batch"] containing batch labels for each cell
  • adata.obs["Label"] containing annotation labels for each cell
  • adata.obs["Unseen"] containing unseen population labels for each cell
  • adata.obsm["X_emb"] containing the integrated embedding

Note that the integration output does not contain any expression data. This is to save disk space by not duplicating data. If a metric or another stage requires expression data it needs to accept both the integration output and the prepared dataset as input. PNG files showing plots of the unintegrated and integrated UMAPs coloured by batch and label are also produced.

scANVI

The scANVI integration step takes the integrated scVI model and produces a directory containing the scANVI model and a .h5ad file with the following fields IN ADDITION to those from scVI:

  • adata.obs["ReferenceLabel"] containing the labels used in training the scANVI model

Query mapping

The query mapping steps take the reference model from scVI or scANVI and produce a directory containing the corresponding query model and a .h5ad file with the same structure as the integration steps. NOTE that the embeddings here contain only the query data, NOT the reference.

Similar plots to the integration steps are also produced, with additional panels showing the dataset (reference/query) and unseen population label.

Label prediction

The label prediction step takes the output of the integration step, trains a classifier on the integrated embedding and predicts labels for the mapped dataset. The output is a TSV file with the following columns for each query cell:

  • ID containing a unique cell ID
  • Label containing the ground truth cell label
  • Unseen containing unseen population labels for each cell
  • PredLabel containing the predicted cell label
  • MaxProb containing the probability for the predicted label
  • Prob_{label} columns containing the probability for each label in the reference dataset
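The column layout above can be sketched as follows. The helper names, the cell ID, the label names and the use of a boolean for Unseen are assumptions made for illustration; the classifier itself is not shown, only how one row is assembled from per-label probabilities.

```python
import csv
import io


def prediction_row(cell_id, label, unseen, probs):
    """Build one output row for a query cell.

    probs: dict mapping each reference label to its predicted probability
    """
    pred_label = max(probs, key=probs.get)  # label with the highest probability
    row = {
        "ID": cell_id,
        "Label": label,
        "Unseen": unseen,
        "PredLabel": pred_label,
        "MaxProb": probs[pred_label],
    }
    # One Prob_{label} column per reference label
    for name in sorted(probs):
        row[f"Prob_{name}"] = probs[name]
    return row


def write_predictions(handle, rows):
    """Write rows as TSV, taking the column order from the first row."""
    writer = csv.DictWriter(
        handle, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Usage with an in-memory buffer instead of a real file
buffer = io.StringIO()
write_predictions(buffer, [prediction_row("cell_1", "B", False, {"B": 0.7, "T": 0.3})])
print(buffer.getvalue())
```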

Metrics

The metrics scripts take the output file produced by the integration, mapping or label prediction step (depending on the type of metric) and produce a TSV file with a SINGLE ROW and the following columns:

  • Dataset containing the name of the dataset that was evaluated
  • Method containing the name of the feature selection method that was evaluated
  • Integration containing the name of the integration method that was evaluated ("scVI" or "scANVI")
  • Type containing the type of the metric (e.g. "Integration", "Classification" etc.)
  • Metric containing the name of the metric
  • Value containing the calculated metric score. If necessary, scores should be adjusted so that 1 is the best possible score and 0 is the worst possible score.
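
A single-row metric file could be produced along these lines. This is a sketch under assumptions: `rescale` and `write_metric` are hypothetical helpers, the linear rescaling with clipping is just one possible way to adjust a score to the 0 to 1 range, and the dataset, method and metric names are placeholders.

```python
import csv
import io

COLUMNS = ["Dataset", "Method", "Integration", "Type", "Metric", "Value"]


def rescale(score, worst, best):
    """Linearly map a score so that worst -> 0 and best -> 1, clipped to [0, 1].

    Works for lower-is-better metrics too (pass worst > best).
    """
    scaled = (score - worst) / (best - worst)
    return min(1.0, max(0.0, scaled))


def write_metric(handle, dataset, method, integration, metric_type, metric, value):
    """Write the header and the single result row as TSV."""
    if integration not in ("scVI", "scANVI"):
        raise ValueError(f"unknown integration: {integration}")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerow([dataset, method, integration, metric_type, metric, value])


# Usage: a lower-is-better raw score of 0.25 on a 0 to 1 scale becomes 0.75
buffer = io.StringIO()
value = rescale(0.25, worst=1.0, best=0.0)
write_metric(buffer, "pancreas", "random", "scVI", "Integration", "ExampleMetric", value)
print(buffer.getvalue())
```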