Cell type annotation: Harmony/KNN workflow #836

Open
wants to merge 48 commits into
base: main
Commits
776aa42
resolve conflicts
VladimirShitov Mar 25, 2024
6f49154
resolve conflicts
DriesSchaumont Mar 27, 2024
0aead50
resolve merge conflicts
dorien-er Mar 28, 2024
3173799
resolve merge conflicts
jakubmajercik Mar 29, 2024
42628e9
Implement cellranger mkgtf (#771)
jakubmajercik Apr 25, 2024
1ed4338
resolve merge conflicts
DriesSchaumont May 13, 2024
c1bb7dd
resolve merge conflicts
dorien-er May 15, 2024
0ca63dd
initial script
dorien-er May 31, 2024
02b6d67
update params
dorien-er Jun 4, 2024
3f97c73
add unit tests pynndescent knn
dorien-er Jul 8, 2024
1367a25
undo unrequired changes
dorien-er Jul 8, 2024
725cb22
undo unrequired changes
dorien-er Jul 8, 2024
aab55fc
update changelog
dorien-er Jul 8, 2024
ed65300
update for github runners
dorien-er Jul 8, 2024
6c42512
add common params and utilities
dorien-er Jul 9, 2024
64b5df4
add harmony knn annotation subworkflow
dorien-er Jul 10, 2024
27a6509
remove split modalities
dorien-er Jul 15, 2024
acb909b
Remove muon as test dependency for concatenate_h5mu. (#773)
DriesSchaumont Mar 27, 2024
5b55c59
scGPT binning component (#765)
dorien-er Mar 28, 2024
ab5c8a2
Resolve merge conflicts
jakubmajercik Mar 29, 2024
f468f3b
Implement cellranger mkgtf (#771)
jakubmajercik Apr 25, 2024
855cb7f
resolve merge conflicts
DriesSchaumont May 13, 2024
966e9b9
Resolve merge conflicts
dorien-er May 15, 2024
e2049f1
update changelog
dorien-er Jul 8, 2024
68f9446
resolve conflicts
dorien-er Jul 15, 2024
adfb3f3
resolve conflicts
dorien-er Jul 15, 2024
08f6b60
resolve conflicts
dorien-er Jul 15, 2024
ee5e9bf
add integration tests
dorien-er Jul 15, 2024
1648952
update dependencies
dorien-er Jul 16, 2024
23d3c26
update changelog
dorien-er Jul 16, 2024
40f7ab2
generate common params, add leiden clustering
dorien-er Jul 17, 2024
89c4242
update common params
dorien-er Jul 17, 2024
a6e49a3
merge remote tracking branch main into harmony-knn
dorien-er Sep 4, 2024
71fa4bd
remove outdated files
dorien-er Sep 4, 2024
3ccbb81
Merge remote-tracking branch 'origin/main' into harmony-knn-annoation…
dorien-er Sep 6, 2024
5850c1a
update to viash 9
dorien-er Sep 6, 2024
c20a529
refactor
dorien-er Sep 6, 2024
8f1a64e
add test workflow to harmony knn
dorien-er Sep 9, 2024
7304634
move obsm_integrated to output argument
dorien-er Sep 9, 2024
c67a4ff
cleanup
dorien-er Sep 10, 2024
3f36a07
remove unused common args
dorien-er Sep 10, 2024
cf8bc4d
update changelog
dorien-er Sep 10, 2024
f4efba8
update changelog
dorien-er Sep 10, 2024
8a893fb
Merge branch 'main' into harmony-knn-annoation-workflow
dorien-er Nov 18, 2024
3b399b2
Merge branch 'main' into harmony-knn-annoation-workflow
dorien-er Nov 22, 2024
99b66a4
cleanup
dorien-er Nov 22, 2024
8993be5
Merge remote-tracking branch 'origin/main' into harmony-knn-annoation…
dorien-er Dec 4, 2024
9b9f7de
cleanup
dorien-er Dec 4, 2024
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,9 @@
# openpipelines x.x.x

# NEW FUNCTIONALITY

* `workflows/annotation/harmony_knn` workflow: Cell-type annotation based on harmony integration with KNN label transfer (PR #836).

# MINOR CHANGES

* Several component (cleanup): remove workaround for using being able to use shared utility functions with Nextflow Fusion (PR #920).
156 changes: 156 additions & 0 deletions src/workflows/annotation/harmony_knn/config.vsh.yaml
@@ -0,0 +1,156 @@
name: "harmony_knn"
namespace: "workflows/annotation"
description: "Cell type annotation workflow that performs Harmony integration of the reference and query datasets, followed by KNN label transfer."
authors:
- __merge__: /src/authors/dorien_roosen.yaml
roles: [ author, maintainer ]
- __merge__: /src/authors/weiwei_schultz.yaml
roles: [ contributor ]

Member

Could you add the test_dependencies to info?

argument_groups:
- name: Query Input
arguments:
- name: "--id"
required: true
type: string
description: ID of the sample.
example: foo
- name: "--input"
required: true
type: file
description: Input dataset consisting of the (unlabeled) query observations. The dataset is expected to be pre-processed in the same way as --reference.
example: input.h5mu
- name: "--modality"
description: Which modality to process. Should match the modality of the --reference dataset.
type: string
default: "rna"
required: false
- name: "--input_obsm_embedding"
example: "X_pca"
type: string
description: Embedding .obsm column to use as input for integration. Should match the embedding .obsm column of the --reference dataset.
- name: "--input_obs_batch_label"
type: string
description: "The .obs field in the input (query) dataset containing the batch labels."
example: "sample"
required: true
- name: "--overwrite_existing_key"
type: boolean_true
description: If provided, will overwrite existing fields in the input dataset when data are copied during the reference alignment process.

- name: Reference input
arguments:
- name: "--reference"
required: true
type: file
description: Reference dataset consisting of the labeled observations to train the KNN classifier on. The dataset is expected to be pre-processed in the same way as the --input query dataset.
example: reference.h5mu
- name: "--reference_obs_targets"
type: string
example: [ ann_level_1, ann_level_2, ann_level_3, ann_level_4, ann_level_5, ann_finest_level ]
required: true
multiple: true
description: The `.obs` key(s) of the target labels to transfer.
- name: "--reference_obs_batch_label"
type: string
description: "The .obs field in the reference dataset containing the batch labels."
example: "sample"
required: true

- name: Harmony integration options
arguments:
- name: "--theta"
type: double
description: |
Diversity clustering penalty parameter. Specify for each variable in group.by.vars.
Member

When reading this, I am not sure that I know what group.by.vars means here? Is it related to another argument?

A value of theta=0 does not encourage any diversity. Larger values of theta result in more diverse clusters.
min: 0
default: [2]
multiple: true

- name: Leiden clustering options
arguments:
- name: "--leiden_resolution"
type: double
description: Control the coarseness of the clustering. Higher values lead to more clusters.
min: 0
default: [1]
multiple: true

- name: Neighbor classifier arguments
arguments:
- name: "--weights"
type: string
default: "uniform"
choices: ["uniform", "distance"]
description: |
Weight function used in prediction. Possible values are:
`uniform` (all points in each neighborhood are weighted equally) or
`distance` (weight points by the inverse of their distance)
- name: "--n_neighbors"
type: integer
default: 15
Member

Add min?

required: false
description: |
The number of neighbors to use in the k-neighbor graph structure used for fast approximate nearest neighbor search with PyNNDescent.
Larger values will result in more accurate search results at the cost of computation time.

- name: "Outputs"
arguments:
- name: "--output"
type: file
required: true
direction: output
description: The query data in .h5mu format with labels predicted by the classifier trained on the reference.
example: output.h5mu
- name: "--output_obs_predictions"
type: string
required: false
multiple: true
description: |
In which `.obs` slots to store the predicted cell labels.
If provided, must have the same length as `--reference_obs_targets`.
If empty, will default to the `reference_obs_targets` combined with the `"_pred"` suffix.
- name: "--output_obs_probability"
type: string
required: false
multiple: true
description: |
In which `.obs` slots to store the probability of the predictions.
If provided, must have the same length as `--reference_obs_targets`.
If empty, will default to the `reference_obs_targets` combined with the `"_probability"` suffix.
- name: "--output_obsm_integrated"
type: string
default: "X_integrated_harmony"
required: false
description: "In which .obsm slot to store the integrated embedding."
- name: "--output_compression"
type: string
description: |
The compression format to be used on the output h5mu object.
choices: ["gzip", "lzf"]
required: false
example: "gzip"

dependencies:
- name: workflows/integration/harmony_leiden
alias: harmony_leiden_workflow
- name: labels_transfer/knn
- name: dataflow/split_h5mu
- name: dataflow/concatenate_h5mu
- name: metadata/add_id
- name: metadata/duplicate_obs

resources:
- type: nextflow_script
path: main.nf
entrypoint: run_wf

test_resources:
- type: nextflow_script
path: test.nf
entrypoint: test_wf
- path: /resources_test/scgpt

runners:
- type: nextflow
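
The `--weights` and `--n_neighbors` options in the config above follow the usual k-nearest-neighbor voting scheme. As a rough illustration only (this is not the code of the `labels_transfer/knn` component, which uses PyNNDescent for approximate neighbor search), a brute-force sketch in plain Python might look like the following. The function name `knn_vote` and the toy coordinates are hypothetical, and reporting the winning label's share of the total vote weight as its "probability" is an assumption made for this example.

```python
import math
from collections import defaultdict


def knn_vote(query, ref_points, ref_labels, n_neighbors=15, weights="uniform"):
    """Return (label, probability) for one query point using brute-force KNN voting.

    Hypothetical helper for illustration; exact search instead of PyNNDescent.
    """
    if weights not in ("uniform", "distance"):
        raise ValueError("weights must be 'uniform' or 'distance'")
    # Distance from the query to every reference point, nearest first.
    dists = sorted(
        (math.dist(query, point), label)
        for point, label in zip(ref_points, ref_labels)
    )
    votes = defaultdict(float)
    for dist, label in dists[:n_neighbors]:
        # 'uniform': every neighbor counts equally.
        # 'distance': weight each neighbor by the inverse of its distance.
        votes[label] += 1.0 if weights == "uniform" else 1.0 / max(dist, 1e-12)
    best = max(votes, key=votes.get)
    # Assumption: probability = winning label's share of the total vote weight.
    return best, votes[best] / sum(votes.values())


# Toy reference embedding: two nearby "T cell" points, three distant "B cell" points.
ref_points = [(0, 0), (0, 1), (5, 5), (5, 6), (6, 5)]
ref_labels = ["T cell", "T cell", "B cell", "B cell", "B cell"]

label_u, prob_u = knn_vote((0.5, 0.5), ref_points, ref_labels, n_neighbors=5, weights="uniform")
label_d, prob_d = knn_vote((0.5, 0.5), ref_points, ref_labels, n_neighbors=5, weights="distance")
print(label_u, round(prob_u, 2))  # B cell 0.6
print(label_d, round(prob_d, 2))  # T cell 0.87
```

With `uniform` weights the three distant neighbors outvote the two close ones, while `distance` weighting lets the close neighbors dominate. This is one reason the choice of `--weights` can matter when `--n_neighbors` is large relative to the size of a cell population in the reference.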
16 changes: 16 additions & 0 deletions src/workflows/annotation/harmony_knn/integration_test.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

nextflow \
run . \
-main-script src/workflows/annotation/harmony_knn/test.nf \
-entry test_wf \
-resume \
-profile docker,no_publish \
-c src/workflows/utils/labels_ci.config \
-c src/workflows/utils/integration_tests.config
162 changes: 162 additions & 0 deletions src/workflows/annotation/harmony_knn/main.nf
@@ -0,0 +1,162 @@
workflow run_wf {
take:
input_ch

main:

output_ch = input_ch
// Set aside the output for this workflow to avoid conflicts
| map {id, state ->
def new_state = state + ["workflow_output": state.output]
[id, new_state]
}
Comment on lines +9 to +12
Member

Suggested change
| map {id, state ->
def new_state = state + ["workflow_output": state.output]
[id, new_state]
}

// Add the id as the _meta join id so the output can be merged with the source channel at the end of the workflow
| map{ id, state ->
def new_state = state + ["_meta": ["join_id": id]]
[id, new_state]
}
Comment on lines +13 to +17
Member

Is _meta required here? I think the number of input and output events from this workflow are the same and that the IDs of the events match?

| view {"After adding join_id: $it"}
// Add 'query' id to .obs columns of query dataset
| add_id.run(
Member

add_id and duplicate_obs could be performed in parallel here by:

  • splitting the input channel into two channels: one for the reference and one for the query
  • performing add_id and duplicate_obs
  • joining the channel back together

And since you have add_id and duplicate_obs twice, please add a key argument to .run (e.g. key: "add_id_query" and key: "add_id_reference"). This makes sure that the process names remain unique.

fromState: [
"input": "input",
],
args:[
"input_id": "query",
"obs_output": "dataset",
],
toState: ["input": "output"]
)
// Add 'reference' id to .obs columns of reference dataset
| add_id.run(
fromState:[
"input": "reference",
],
args:[
"input_id": "reference",
"obs_output": "dataset"
],
toState: ["reference": "output"]
)
// Make sure that query and reference dataset have batch information in the same .obs column
// By copying the respective .obs columns to the obs column "batch_label"
| duplicate_obs.run(
fromState: [
"input": "input",
"modality": "modality",
"input_obs_key": "input_obs_batch_label",
"overwrite_existing_key": "overwrite_existing_key"
],
args: [
"output_obs_key": "batch_label"
],
toState: [
"input": "output"
]
)
| duplicate_obs.run(
fromState: [
"input": "reference",
"modality": "modality",
"input_obs_key": "reference_obs_batch_label",
"overwrite_existing_key": "overwrite_existing_key"
],
args: [
"output_obs_key": "batch_label"
],
toState: [
"reference": "output"
]
)
// Concatenate query and reference datasets prior to integration
| concatenate_h5mu.run(
fromState: { id, state -> [
"input": [state.input, state.reference]
]
},
args: [
"input_id": ["query", "reference"],
"other_axis_mode": "move"
],
toState: ["input": "output"]
)
| view {"After concatenation: $it"}
// Run harmony integration with leiden clustering
| harmony_leiden_workflow.run(
fromState: { id, state ->
[
"id": id,
"input": state.input,
"modality": state.modality,
"embedding": state.input_obsm_embedding,
"obsm_integrated": state.output_obsm_integrated,
"theta": state.theta,
"leiden_resolution": state.leiden_resolution,
]
},
args: [
"uns_neighbors": "harmonypy_integration_neighbors",
"obsp_neighbor_distances": "harmonypy_integration_distances",
"obsp_neighbor_connectivities": "harmonypy_integration_connectivities",
"obs_cluster": "harmony_integration_leiden",
"obsm_umap": "X_leiden_harmony_umap",
"obs_covariates": "batch_label"
],
toState: ["input": "output"]
)
| view {"After integration: $it"}
// Split integrated dataset back into a separate reference and query dataset
| split_h5mu.run(
fromState: [
"input": "input",
"modality": "modality"
],
args: [
"obs_feature": "dataset",
"output_files": "sample_files.csv",
"drop_obs_nan": "true",
"output": "ref_query"
],
toState: [
"output": "output",
"output_files": "output_files"
],
auto: [ publish: true ]
)
| view {"After sample splitting: $it"}
// map the integrated query and reference datasets back to the state
| map {id, state ->
def outputDir = state.output
def files = readCsv(state.output_files.toUriString())
def query_file = files.findAll{ dat -> dat.name == 'query' }
assert query_file.size() == 1, 'there should only be one query file'
def reference_file = files.findAll{ dat -> dat.name == 'reference' }
assert reference_file.size() == 1, 'there should only be one reference file'
def integrated_query = outputDir.resolve(query_file[0].filename)
def integrated_reference = outputDir.resolve(reference_file[0].filename)
def newKeys = ["integrated_query": integrated_query, "integrated_reference": integrated_reference]
[id, state + newKeys]
}
| view {"After splitting query: $it"}
// Perform KNN label transfer from integrated reference to integrated query
| knn.run(
fromState: [
"input": "integrated_query",
"modality": "modality",
"input_obsm_features": "output_obsm_integrated",
"reference": "integrated_reference",
"reference_obsm_features": "output_obsm_integrated",
"reference_obs_targets": "reference_obs_targets",
"output_obs_predictions": "output_obs_predictions",
"output_obs_probability": "output_obs_probability",
"output_compression": "output_compression",
"weights": "weights",
"n_neighbors": "n_neighbors",
"output": "workflow_output"
],
toState: {id, output, state -> ["output": output.output]},
)

emit:
output_ch
}
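
The map step that follows `split_h5mu` in the workflow above reads the CSV written to `output_files`, picks exactly one row for `query` and one for `reference`, and resolves each file name against the output directory. A minimal Python sketch of that lookup is given here for illustration. It assumes the CSV has `name` and `filename` columns, as the `dat.name` and `.filename` accesses in the Groovy code suggest; the helper `resolve_split_file` and the file names in the sample CSV are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def resolve_split_file(rows, name, output_dir):
    """Return the path of the single split file whose 'name' column equals `name`."""
    matches = [row for row in rows if row["name"] == name]
    # Mirrors the asserts in the workflow: exactly one file per dataset is expected.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {name!r} file, found {len(matches)}")
    return PurePosixPath(output_dir) / matches[0]["filename"]


# Hypothetical contents of the CSV produced by the split step.
csv_text = "name,filename\nquery,query.h5mu\nreference,reference.h5mu\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

integrated_query = resolve_split_file(rows, "query", "ref_query")
integrated_reference = resolve_split_file(rows, "reference", "ref_query")
print(integrated_query)      # ref_query/query.h5mu
print(integrated_reference)  # ref_query/reference.h5mu
```

Selecting the single matching row explicitly, rather than passing a one-element list on to the path resolution, keeps the "exactly one file" expectation visible at the point where the path is built.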
10 changes: 10 additions & 0 deletions src/workflows/annotation/harmony_knn/nextflow.config
@@ -0,0 +1,10 @@
manifest {
nextflowVersion = '!>=20.12.1-edge'
}

params {
rootDir = java.nio.file.Paths.get("$projectDir/../../../../").toAbsolutePath().normalize().toString()
}

// include common settings
includeConfig("${params.rootDir}/src/workflows/utils/labels.config")