OpenPipelines.bio v2.0.0

@DriesSchaumont DriesSchaumont released this 17 Dec 13:28
· 1461 commits to main since this release
60028cb

BREAKING CHANGES

  • velocity/scvelo: update scvelo to 0.3.3, which also removes support for using loom input files. The component now uses a MuData object as input. Several arguments were added to support selecting different inputs from the MuData file: counts_layer, modality, layer_spliced, layer_unspliced, layer_ambiguous. An output_h5mu argument has also been added (PR #932).

  • src/annotate/onclass and src/annotate/celltypist: The input parameters for the gene name layers of the input datasets have been updated to --input_var_gene_names and --reference_var_gene_names (PR #919).

  • Several components under src/scgpt (cross_check_genes, tokenize_pad, binning) now process the input (query) datasets differently. Instead of subsetting datasets based on genes in the model vocabulary and/or highly variable genes, these components require an input .var column with a boolean mask specifying this information. The results are written back to the original input data, preserving the dataset structure (PR #832).

  • query/cellxgene_census: The default output layer has been changed from .layers["counts"] to .X to be more aligned with the standard OpenPipelines format (PR #933).
    Use argument --output_layer_counts counts to revert the behaviour to the previous default.

  • Added cell multiplexing support to the from_cellranger_multi_to_h5mu component and the cellranger_multi workflow. For the from_cellranger_multi_to_h5mu component, the output argument now requires a value containing a wildcard character *, which will be replaced by the sample ID to form the final output file names. Additionally, a sample_csv argument is added to the from_cellranger_multi_to_h5mu component, which describes the sample name per output file. No change is required for the output_h5mu argument of the cellranger_multi workflow: in case of a multiplexed run, the workflow will emit multiple events, one for each sample. The id of each event (and the default output file name) is set by --sample_ids (in case of cell multiplexing), or (as before) by the user-provided id for the input (PR #803 and PR #902).

  • demux/bcl_convert: update BCL convert from 3.10 to 4.2 (PR #774).

  • demux/cellranger_mkfastq, mapping/cellranger_count, mapping/cellranger_multi and reference/build_cellranger_reference: update cellranger to 8.0.1 (PR #774 and PR #811).

  • Removed --disable_library_compatibility_check in favour of --check_library_compatibility for the mapping/cellranger_multi component and the ingestion/cellranger_multi workflow (PR #818).

  • lianapy: bumped version to 1.3.0 (PR #827 and PR #862). Additionally, groupby is now a required argument.

  • concat: this component was deprecated and has now been removed, use concatenate_h5mu instead (PR #796).

  • The workflows folder in the root of the project no longer contains symbolic links to the build workflows in target.
    Using any workflow that was previously linked in this directory will now result in an error that indicates
    the location of the workflow to be used instead (PR #796).

  • XGBoost: bump version to 2.0.3 (PR #646).

  • Several components: update anndata to 0.11.1 and mudata to 0.3.1 (PR #645 and PR #901), and scanpy to 1.10.4 (PR #901).

  • filter/filter_with_hvg: this component was deprecated and has now been removed. Use feature_annotation/highly_variable_features_scanpy instead (PR #843).

  • dataflow/concat: this component was deprecated and has now been removed. Use dataflow/concatenate_h5mu instead (PR #857).

  • convert/from_h5mu_to_seurat: bump seurat to latest version (PR #850).

  • workflows/ingestion/bd_rhapsody: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).

  • mapping/bd_rhapsody: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).

  • reference/make_bdrhap_reference: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).

  • reference/build_star_reference: Rename mapping/star_build_reference to reference/build_star_reference (PR #846).

  • reference/cellranger_mkgtf: Rename reference/mkgtf to reference/cellranger_mkgtf (PR #846).

  • labels_transfer/xgboost: Align interface with new annotation workflow

    • Store label probabilities instead of uncertainties
    • Take .h5mu format as an input instead of .h5ad

  • reference/build_cellranger_arc_reference: a default value of "output" is now specified for the argument --genome, in line with the reference/build_cellranger_reference component. Additionally, providing a value for --organism is no longer required and its default value of Homo Sapiens has been removed (PR #864).
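
As an illustration of the new wildcard behaviour of the output argument of from_cellranger_multi_to_h5mu described above, the sketch below shows how a template containing * could be expanded into one file name per sample. This is not the component's actual code: the function name expand_output_paths, the example path and the sample IDs are hypothetical, and rejecting templates without a wildcard is an assumption based on the description.

```python
from __future__ import annotations


def expand_output_paths(output_template: str, sample_ids: list[str]) -> dict[str, str]:
    """Map each sample ID to an output path by filling in the '*' wildcard.

    Hypothetical helper: it only mirrors the behaviour described in the
    release notes, it is not taken from the component itself.
    """
    # Assumption: a template without a wildcard is treated as an error.
    if "*" not in output_template:
        raise ValueError("output must contain the wildcard character '*'")
    return {
        sample_id: output_template.replace("*", sample_id)
        for sample_id in sample_ids
    }


# Hypothetical template and sample IDs for a multiplexed run.
paths = expand_output_paths("results/*.h5mu", ["sample_a", "sample_b"])
for sample_id, path in paths.items():
    print(sample_id, path)
```

With a template such as results/*.h5mu, each sample of a multiplexed run ends up in its own .h5mu file.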

NEW FUNCTIONALITY

Important

Workflows from the workflows/annotation and workflows/integration/scgpt_leiden namespaces, plus their newly implemented dependencies, are not yet considered to be part of the stable public API. Their functionality and interface may be subject to change.

  • velocyto_to_h5mu: now writes counts to .X (PR #932).

  • qc/calculate_atac_qc_metrics: new component for calculating ATAC QC metrics (PR #868).

  • workflows/annotation/scgpt_integration_knn workflow: Cell-type annotation based on scGPT integration with KNN label transfer (PR #875).

  • CI: Use params.resources_test in test workflows in order to point to an alternative location (e.g. a cache) (PR #889).

  • Added demux/cellranger_atac_mkfastq component: demultiplex raw sequencing data for ATAC experiments (PR #726).

  • process_samples, process_batches and rna_multisample workflows: added functionality to scale the log-normalized
    gene expression data to unit variance and zero mean. The scaled data will be output to a different layer and the
    representation with reduced dimensions will be created and stored in addition to the non-scaled data (PR #733).

  • transform/scaling: add --input_layer and --output_layer arguments (PR #733).

  • CI: added checking of mudata contents for multiple workflows (PR #783).

  • Added multiple arguments to the cellranger_multi workflow in order to maintain feature parity with the mapping/cellranger_multi component (PR #803).

  • convert/from_cellranger_to_h5mu: add support for antigen analysis.

  • Added reference/build_cellranger_reference component: build reference file compatible with ATAC and ATAC+GEX experiments (PR #726).

  • demux/bcl_convert: add support for no lane splitting (PR #804).

  • reference/cellranger_mkgtf component: Added cellranger mkgtf as a standalone component (PR #771).

  • scgpt/cross_check_genes component: Added a gene-model cross check component for scGPT (PR #758).

  • scgpt/embedding component: Added scGPT embedding component (PR #761).

  • scgpt/tokenize_pad component: Added scGPT padding and tokenization component (PR #754).

  • scgpt/binning component: Added a scGPT pre-processing binning component (PR #765).

  • workflows/integration/scgpt_leiden workflow with scGPT integration followed by Leiden clustering (PR #794).

  • scgpt/cell_type_annotation component: Added scGPT cell type annotation component (PR #798).

  • resources_test_scripts/scGPT.sh: Added script to include scGPT test resources (PR #800).

  • transform/clr component: Added the option to set the axis along which to apply CLR. The axis can also be
    overridden at the workflow level (PR #767).

  • annotate/celltypist component: Added a CellTypist annotation component (PR #825).

  • dataflow/split_h5mu component: Added a component to split a single h5mu file into multiple h5mu files based on the values of an .obs column (PR #824).

  • workflows/test_workflows/ingestion components & workflows/ingestion: Added standalone components for integration testing of ingestion workflows (PR #801).

  • workflows/ingestion/make_reference: Add additional arguments passed through to the STAR and BD Rhapsody reference components (PR #846).

  • annotate/random_forest_annotation component: Added a random forest cell type annotation component (PR #848).

  • dataflow/concatenate_h5mu: data from .uns, both originating from the global and per-modality slots, is now retained in the final concatenated output object. Additionally, added the uns_merge_mode argument in order to tune the behavior when conflicting keys are detected across samples (PR #859).

  • dimred/densmap component: Added a densMAP dimensionality reduction component (PR #748).

  • annotate/scanvi component: Added a component to annotate cells using scANVI (PR #833).

  • transform/bpcells_regress_out component: Added a component to regress out effects of confounding variables in the count matrix using BPCells (PR #863).

  • transform/regress_out: Allow providing 'input' and 'output' layers for scanpy regress_out functionality (PR #863).

  • workflows/ingestion/make_reference: add possibility to build CellRanger ARC references. Added --motifs_file, --non_nuclear_contigs and --output_cellranger_arc arguments (PR #864).

  • Test resources (reference_gencodev41_chr1): switch reference genome for CellRanger to ARC variant (PR #864).

  • Added transform/tfidf component: normalize ATAC data with TF-IDF (PR #870).

  • Added dimred/lsi component (PR #552).

  • metadata/duplicate_obs component: Added a component to make a copy from one .obs field or index to another .obs field within the same MuData object (PR #874, PR #899).

  • annotate/onclass component: Added a component to annotate cell types using OnClass (PR #844).

  • annotate/svm component: Added a component to annotate cell types using support vector machine (SVM) (PR #845).

  • metadata/duplicate_var component: Added a component to make a copy from one .var field or index to another .var field within the same MuData object (PR #877, PR #899).

  • filter/subset_obsp component: Added a component to subset an .obsp matrix by column based on the value of an .obs field. The resulting subset is moved to an .obsm field (PR #888).

  • labels_transfer/knn component: Enable using additional distance functions for KNN classification (PR #830) and allow performing KNN classification based on a pre-calculated neighborhood graph (PR #890).
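
To make the filter/subset_obsp entry above more concrete: an .obsp matrix is square (observations by observations), so selecting only some of its columns yields a matrix that still has one row per observation but fewer columns, which is why the result fits in .obsm rather than .obsp. The sketch below shows this with plain Python lists. It is not the component's implementation; the function name, the example values and the use of lists instead of a MuData object are assumptions made for illustration.

```python
def subset_obsp_columns(obsp, obs_values, keep_value):
    """Keep only the columns of a square matrix whose observation has keep_value.

    Hypothetical helper: obsp is a list of rows (n_obs x n_obs) and
    obs_values holds one .obs entry per observation.
    """
    if any(len(row) != len(obs_values) for row in obsp):
        raise ValueError("obsp must have one column per observation")
    # Indices of the observations (columns) to retain.
    keep = [j for j, value in enumerate(obs_values) if value == keep_value]
    # All rows are kept, so the result is n_obs x len(keep).
    return [[row[j] for j in keep] for row in obsp]


# Hypothetical pairwise distances between three cells.
distances = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
dataset = ["reference", "query", "reference"]

subset = subset_obsp_columns(distances, dataset, "reference")
print(subset)  # 3 rows (all cells) x 2 columns (reference cells only)
```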

MAJOR CHANGES

  • Bump popv to 0.4.2 (PR #901).

MINOR CHANGES

  • Pin scikit-learn for labels_transfer/xgboost to <1.6 (PR #931).

  • filter/filter_with_scrublet: provide cleaner error message when running scrublet on an empty modality (PR #929).

  • Several components (cleanup): remove the workaround that was needed to use shared utility functions with Nextflow Fusion (PR #920).

  • scgpt/cell_type_annotation component update: Added support for multi-processing (PR #832).

  • Several annotation (src/annotate/) components (onclass, celltypist, random_forest_annotation, scanvi, svm_annotation): Updated input parameters to ensure uniformity across components, implemented functionality to cross-check the overlap of genes between query and reference (model) datasets and implemented logic to allow for subsetting of genes (PR #919).

  • workflows/annotation/scgpt_annotation workflow: Added a scGPT transformer-based cell type annotation workflow (PR #832).

  • scgpt/cross_check_genes component update: Highly variable genes are now cross-checked based on the boolean mask in var_input. The filtering information is stored in the --output_var_filter .var field instead of subsetting the dataset (PR #832).

  • scgpt/binning component update: This component now requires the --var_input parameter to provide gene filtering information. Binned data is written to the --output_obsm_binned_counts .obsm field in the original input data (PR #832).

  • scgpt/pad_tokenize component update: Genes are padded and tokenized based on filtering information in --var_input and --input_obsm_binned_counts (PR #832).

  • resources_test_scripts/scgpt.sh: Update scGPT test resources to avoid subsetting of datasets (PR #926).

  • workflows/integration/scgpt_leiden workflow update: Update workflow such that input dataset is not subsetted for HVG but uses boolean masks in .var field instead (PR #875).

  • dataflow/split_h5mu: Optimize resource usage of the component (PR #913).

  • Several components: bump python version (PR #901).

  • resources_test_scripts/cellranger_atac_tiny_bcl.sh script: generate counts from fastq files using CellRanger atac count (PR #726).

  • cellbender_remove_background_v0_2: update base image to nvcr.io/nvidia/pytorch:23.12-py3 (PR #646).

  • Bump scvelo to 0.3.2 (PR #828).

  • Pin numpy<2 for several components (PR #815).

  • Added resources_test_scripts/cellranger_atac_tiny_bcl.sh script: download tiny bcl file with an ATAC experiment, download a motifs file, demultiplex bcl files to reads in fastq format (PR #726).

  • mapping/cellranger_multi component now outputs logs on failure of the cellranger multi process (PR #766).

  • Bump viash-actions to v6 (PR #821).

  • reference/make_reference: Do not try to extract genome fasta and transcriptome gtf if they are not gzipped (PR #856).

  • Changes related to syncing the test resources (PR #867):

    • Add .info.test_resources to _viash.yaml to specify where test resources need to be synced from.
    • download/sync_test_resources: Use .info.test_resources in _viash.yaml to detect where test resources need to be synced from.
    • Update CI to use project/sync-and-cache instead of project/sync-and-cache-s3.
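
The scGPT component updates listed earlier in this section replace dataset subsetting with boolean masks stored in .var. The sketch below illustrates the idea for a gene cross-check: instead of dropping genes, it computes a mask that is true only for genes that pass the input filter and are present in the model vocabulary. It is a sketch under stated assumptions, not the scgpt/cross_check_genes code: the function name, gene names and vocabulary are hypothetical, and plain lists stand in for .var columns.

```python
def cross_check_mask(gene_names, vocabulary, input_mask=None):
    """Return one boolean per gene without removing any gene.

    Hypothetical helper: a gene is flagged True when it passes the optional
    input mask (for example a highly-variable-gene flag) and is part of the
    model vocabulary.
    """
    if input_mask is None:
        input_mask = [True] * len(gene_names)
    if len(input_mask) != len(gene_names):
        raise ValueError("input_mask must have one entry per gene")
    return [
        bool(keep) and (gene in vocabulary)
        for gene, keep in zip(gene_names, input_mask)
    ]


# Hypothetical example data.
genes = ["CD3E", "MS4A1", "GENE_X", "NKG7"]
highly_variable = [True, False, True, True]
vocabulary = {"CD3E", "MS4A1", "NKG7"}

mask = cross_check_mask(genes, vocabulary, highly_variable)
print(dict(zip(genes, mask)))
```

Because the mask has the same length as the gene list, it can be stored next to the data and the dataset keeps its original shape.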

BUG FIXES

  • Fix failing tests for ingestion/cellranger_postprocessing, ingestion/conversion and multiomics/process_batches (PR #869).

  • convert/from_10xh5_to_h5mu: add .uns slot to mdata root when metrics file is provided (PR #887).

  • Fix ingestion components not working when optional arguments are unset (PR #894).

  • transform/normalize_total component: pass the target_sum argument to sc.pp.normalize_total() (PR #823).

  • from_cellranger_multi_to_h5mu: fix missing pytest dependency (PR #897).

  • scvi_leiden workflow: fix the input layer argument of the workflow not being passed to the scVI component (PR #936 and PR #938).
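
For context on the transform/normalize_total fix above: in scanpy, target_sum sets the total count each cell has after normalization, and when it is left unset the median of the per-cell totals is used instead. A component-level target_sum therefore only has an effect if it reaches sc.pp.normalize_total(). The sketch below reproduces the arithmetic with plain Python lists. It does not call scanpy, the function name is hypothetical, and leaving all-zero cells unchanged is an assumption of this sketch.

```python
def normalize_rows_to_target(counts, target_sum):
    """Scale each row (cell) so that its counts add up to target_sum.

    Hypothetical stand-in for the arithmetic behind total-count
    normalization; it is not scanpy's implementation.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Assumption: cells without any counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in cell])
    return normalized


# Two hypothetical cells with the same composition but different depth.
counts = [[1, 3], [10, 30]]
result = normalize_rows_to_target(counts, target_sum=100)
print(result)
```

After normalization both cells have the same total, so differences in sequencing depth no longer dominate the comparison.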