This repository has been archived by the owner on Aug 20, 2024. It is now read-only.

Add samplesheet examples #82

Merged
merged 7 commits into from
Aug 17, 2023
191 changes: 152 additions & 39 deletions docs/samplesheets/examples.md
@@ -5,79 +5,192 @@ description: Examples of advanced sample sheet creation techniques.

# Sample sheet channel manipulation examples

## Introduction

Understanding channel structure and manipulation is critical for getting the most out of Nextflow. nf-validation helps initialise your channels from the text inputs to get you started, but further work might be required to fit your exact use case. On this page we run through some common cases for transforming the output of `.fromSamplesheet()`.

### Glossary

- A channel is the Nextflow object, referenced in the code
- An item is each thing passing through the channel, equivalent to one row in the samplesheet
- An element is each thing in the item, e.g., the meta value, fastq_1 etc. It may be a file or value

## Default mode

Each item in the channel emitted by `.fromSamplesheet()` is a flat tuple corresponding to one row of the samplesheet. Each item is composed of a meta value (if present) followed by any additional elements from the columns of the samplesheet. For example, consider this samplesheet:

```csv
sample,fastq_1,fastq_2,bed
sample1,fastq1.R1.fq.gz,fastq1.R2.fq.gz,sample1.bed
sample2,fastq2.R1.fq.gz,fastq2.R2.fq.gz,
```

This might create a channel where each item consists of four elements, a map value followed by three files:

```groovy
// Columns (assuming the schema stores the `sample` column under the meta key `id`):
[ val([ id: sample ]), file(fastq_1), file(fastq_2), file(bed) ]

// Resulting in:
[ [ id: "sample1" ], fastq1.R1.fq.gz, fastq1.R2.fq.gz, sample1.bed ]
[ [ id: "sample2" ], fastq2.R1.fq.gz, fastq2.R2.fq.gz, [] ] // A missing value from the samplesheet is an empty list
```

This channel can be used as input of a process where the input declaration is:

```nextflow
tuple val(meta), path(fastq_1), path(fastq_2), path(bed)
```

Review discussion on the input declaration above:

> **Collaborator:** I'm not entirely convinced this is the cleanest and most clear way of representing the channel structure, maybe something like [meta, fastq_1.fastq, fastq_2.fastq, sample1.bed] will be more clear for beginners?
>
> **Collaborator Author:** I've tried to match an input/output declaration for a process. Perhaps I could include both. What does the .fastq or .bed mean in your example?
>
> **Collaborator:** It was to emulate a filename, but yeah this is just a matter of representation. I'm not sure which way is the best or most clear to beginners. (although both should be fine I think)
>
> **Collaborator Author:** Take a look now, I've added a clarifying statement about what the channel is and how you would use it.

It may be necessary to manipulate this channel to fit your process inputs. The [Nextflow operator docs](https://www.nextflow.io/docs/latest/operator.html) cover every operator in detail; the sections below show some common use cases with `.fromSamplesheet()`.
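If you want to try the operators on this page without a samplesheet and schema, one option is to emulate the output of `.fromSamplesheet()` with `Channel.of()`. The following is a minimal sketch under assumptions: the values are hard-coded, the meta key `id` is assumed, and the file names are plain strings rather than real files, so it is only suitable for inspecting channel structure.

```nextflow
// Hypothetical stand-in for Channel.fromSamplesheet("input").
// Each list passed to Channel.of() becomes one item in the channel.
Channel.of(
        [ [ id: 'sample1' ], 'fastq1.R1.fq.gz', 'fastq1.R2.fq.gz', 'sample1.bed' ],
        [ [ id: 'sample2' ], 'fastq2.R1.fq.gz', 'fastq2.R2.fq.gz', [] ] // empty list = missing BED
    )
    .set { input }

// Print each item: a flat tuple with four elements
input.view()
```

Because the elements are strings, this stand-in should not be passed to a process that declares `path` inputs.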

## Changing the structure of channel items

Each item in the channel will be a flat tuple, but some processes expect multiple files as a single list in their input channel. This is common in nf-core modules. For example, consider the following input declaration in a process, where `fastq` could be more than one file:

```nextflow
process ZCAT_FASTQS {
    input:
    tuple val(meta), path(fastq) // fastq may be one file or a list of files

    script:
    """
    zcat $fastq
    """
}
```

The output of `.fromSamplesheet()` can be used by default with a process with the following input declaration:

```nextflow
tuple val(meta), path(fastq_1), path(fastq_2)
```

To manipulate each item within a channel, you should use the [Nextflow `.map()` operator](https://www.nextflow.io/docs/latest/operator.html#map). This will apply a function to each item of the channel in turn. Here, we convert the flat tuple into a tuple composed of a meta and a list of FASTQ files:

```nextflow
Channel.fromSamplesheet("input")
    .map { meta, fastq_1, fastq_2 -> tuple(meta, [ fastq_1, fastq_2 ]) }
    .set { input }

input.view() // Each item has 2 elements: meta, fastqs
```

This is now compatible with the process defined above and will not raise a warning about input cardinality:

```nextflow
ZCAT_FASTQS(input)
```
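If `fastq_2` is optional in your samplesheet, a missing value arrives as an empty list (as noted in the default mode section), and the `.map()` above would place that empty list inside the FASTQ list. One possible way to handle this is to include `fastq_2` only when it is set. This is a sketch that uses a hard-coded stand-in channel instead of `.fromSamplesheet()`; the sample values are made up for illustration.

```nextflow
// Hypothetical stand-in for the samplesheet channel; sample2 has no second read
Channel.of(
        [ [ id: 'sample1' ], 'fastq1.R1.fq.gz', 'fastq1.R2.fq.gz' ],
        [ [ id: 'sample2' ], 'fastq2.R1.fq.gz', [] ]
    )
    .map { meta, fastq_1, fastq_2 ->
        // An empty list is falsy in Groovy, so fastq_2 is kept only when it has a value
        def fastqs = fastq_2 ? [ fastq_1, fastq_2 ] : [ fastq_1 ]
        tuple(meta, fastqs)
    }
    .set { input }

input.view() // Each item has 2 elements: meta, fastqs (a list of one or two entries)
```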

## Removing elements in channel items

To remove an element from each item, leave it out of the value returned by `.map()`. For example, to remove the BED file from the channel created above, note the absence of the `bed` element in the return value of the closure below:

```nextflow
Channel.fromSamplesheet("input")
    .map { meta, fastq_1, fastq_2, bed -> tuple(meta, fastq_1, fastq_2) }
    .set { input }

input.view() // Each item has 3 elements: meta, fastq_1, fastq_2
```

In this way you can drop elements from every item in a channel.

## Separating channel items

We could perform this twice to create one channel containing the FASTQs and one containing the BED files. However, Nextflow has a native operator for separating channels called [`.multiMap()`](https://www.nextflow.io/docs/latest/operator.html#multimap). Here, we separate the FASTQs and BEDs into two separate channels using `multiMap`. Note that both channels are contained in `input` and each is accessed as an attribute using dot notation:

```nextflow
Channel.fromSamplesheet("input")
    .multiMap { meta, fastq_1, fastq_2, bed ->
        fastq: tuple(meta, fastq_1, fastq_2)
        bed:   tuple(meta, bed)
    }
    .set { input }
```

The channel has two attributes, `fastq` and `bed`, which can be accessed separately.

```nextflow
input.fastq.view() // Each item has 3 elements: meta, fastq_1, fastq_2
input.bed.view()   // Each item has 2 elements: meta, bed
```

Importantly, `multiMap` applies to every item in the channel and returns an item to both output channels for every input item. In other words, the original channel, `input.fastq` and `input.bed` all contain the same number of items, but the items in each channel have a different structure.
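To check this for yourself, you could count the items in each output channel. The sketch below assumes a hard-coded stand-in channel in place of `.fromSamplesheet()` and uses the Nextflow `count` operator; the sample values are made up for illustration.

```nextflow
// Hypothetical stand-in for the samplesheet channel; sample2 has no BED file
Channel.of(
        [ [ id: 'sample1' ], 'fastq1.R1.fq.gz', 'fastq1.R2.fq.gz', 'sample1.bed' ],
        [ [ id: 'sample2' ], 'fastq2.R1.fq.gz', 'fastq2.R2.fq.gz', [] ]
    )
    .multiMap { meta, fastq_1, fastq_2, bed ->
        fastq: tuple(meta, fastq_1, fastq_2)
        bed:   tuple(meta, bed)
    }
    .set { input }

// Both outputs receive one item per input item, so both counts are 2
input.fastq.count().view { n -> "fastq items: ${n}" }
input.bed.count().view { n -> "bed items: ${n}" }
```

Note that `input.bed` still contains an item for `sample2`, whose `bed` element is the empty list, because `multiMap` does not filter anything out.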

## Separate items based on a condition

You can use the [`.branch()` operator](https://www.nextflow.io/docs/latest/operator.html#branch) to separate the channel entries based on a condition. This is especially useful when you can get multiple types of input data.

This example shows a channel which can have entries for WES or WGS data. WES data includes a BED file denoting the target regions, but WGS data does not. These analyses are different, so we want to separate the WES and WGS entries from each other. We can separate the two using `.branch()` based on the presence of the BED file:

```nextflow
// Channel with four elements - see docs for examples
params.input = "samplesheet.csv"

Channel.fromSamplesheet("input")
    .branch { meta, fastq_1, fastq_2, bed ->
        // If BED does not exist
        WGS: !bed
            return [meta, fastq_1, fastq_2]
        // If BED exists
        WES: bed
            // The original channel structure will be used when no return statement is used.
    }
    .set { input }

input.WGS.view() // Each item has 3 elements: meta, fastq_1, fastq_2
input.WES.view() // Each item has 4 elements: meta, fastq_1, fastq_2, bed
```

Unlike `multiMap`, the channels produced by `.branch()` can contain different numbers of items, because each item is sent to only one branch.
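As a rough illustration of this difference, the sketch below counts the items in each branch. It assumes a hard-coded stand-in channel with made-up sample values in place of `.fromSamplesheet()`, with one WES sample and two WGS samples.

```nextflow
// Hypothetical stand-in for the samplesheet channel: only sample1 has a BED file
Channel.of(
        [ [ id: 'sample1' ], 'fastq1.R1.fq.gz', 'fastq1.R2.fq.gz', 'sample1.bed' ],
        [ [ id: 'sample2' ], 'fastq2.R1.fq.gz', 'fastq2.R2.fq.gz', [] ],
        [ [ id: 'sample3' ], 'fastq3.R1.fq.gz', 'fastq3.R2.fq.gz', [] ]
    )
    .branch { meta, fastq_1, fastq_2, bed ->
        // If BED does not exist
        WGS: !bed
            return [meta, fastq_1, fastq_2]
        // If BED exists
        WES: bed
    }
    .set { input }

// Each input item lands in exactly one branch, so the counts differ
input.WGS.count().view { n -> "WGS items: ${n}" }
input.WES.count().view { n -> "WES items: ${n}" }
```

With this input, the WGS branch should receive two items and the WES branch one.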

## Combining a channel

After splitting a channel, it may be necessary to rejoin the pieces. There are many ways to combine channels, but here we demonstrate the simplest, which uses the [Nextflow join operator](https://www.nextflow.io/docs/latest/operator.html#join) to rejoin any of the channels from above based on the first element in each item, the `meta` value.

```nextflow
input.fastq.view() // Each item has 3 elements: meta, fastq_1, fastq_2
input.bed.view()   // Each item has 2 elements: meta, bed

input.fastq
    .join( input.bed )
    .set { input_joined }

input_joined.view() // Each item has 4 elements: meta, fastq_1, fastq_2, bed
```
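By default, `join` only emits items whose key is present in both channels, so items without a partner are silently dropped. If that is not what you want, the `remainder` option of `join` keeps them. The sketch below assumes two hard-coded stand-in channels with made-up values (`ch_fastq` and `ch_bed` are illustrative names), where `sample2` has no matching BED item.

```nextflow
// Hypothetical stand-in channels; ch_bed has no item for sample2
ch_fastq = Channel.of(
    [ [ id: 'sample1' ], 'fastq1.R1.fq.gz', 'fastq1.R2.fq.gz' ],
    [ [ id: 'sample2' ], 'fastq2.R1.fq.gz', 'fastq2.R2.fq.gz' ]
)
ch_bed = Channel.of(
    [ [ id: 'sample1' ], 'sample1.bed' ]
)

// Default behaviour: only sample1 is emitted because sample2 has no match
ch_fastq
    .join( ch_bed )
    .view { item -> "default: ${item}" }

// With remainder: true, sample2 is also emitted, with null in place of the BED
ch_fastq
    .join( ch_bed, remainder: true )
    .view { item -> "remainder: ${item}" }
```

The join works on the `meta` map because two maps with the same keys and values compare as equal, so any change to `meta` in one channel but not the other will prevent a match.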

## Count items with a common value

This example is based on this [code](https://github.com/mribeirodantas/NextflowSnippets/blob/main/snippets/countBy.md) from [Marcel Ribeiro-Dantas](https://github.com/mribeirodantas).

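It's useful to determine the count of channel entries with similar values when you want to merge them later on. With a known group size, `.groupTuple()` can emit each group as soon as it is complete instead of waiting for the whole channel to finish, which prevents pipeline bottlenecks.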

This example contains a channel where multiple samples can be in the same family. Later on in the pipeline we want to merge the analyzed files so one file gets created for each family. The result will be a channel with an extra meta field containing the count of channel entries with the same family name.

```nextflow
// channel created by fromSamplesheet() previous to modification:
// [[id:example1, family:family1], example1.txt]
// [[id:example2, family:family1], example2.txt]
// [[id:example3, family:family2], example3.txt]

params.input = "samplesheet.csv"

Channel.fromSamplesheet("input")
    .tap { ch_raw }                   // Create a copy of the original channel
    .map { meta, txt -> meta.family } // Isolate the value to count on
    .reduce([:]) { counts, family ->  // Creates a map like this: [family1:2, family2:1]
        counts[family] = (counts[family] ?: 0) + 1
        counts
    }
    .combine(ch_raw)                  // Add the count map to the original channel
    .map { counts, meta, txt ->       // Add the counts of the current family to the meta
        def new_meta = meta + [count:counts[meta.family]]
        [ new_meta, txt ]
    }
    .set { input }

input.view()
// [[id:example1, family:family1, count:2], example1.txt]
// [[id:example2, family:family1, count:2], example2.txt]
// [[id:example3, family:family2, count:1], example3.txt]
```
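One way to use the count afterwards is to pass it to `groupKey()`, so that `.groupTuple()` knows how many items to expect for each family. The following is a sketch, not part of the original example: it assumes a hard-coded stand-in for the channel produced above and groups on the family name only.

```nextflow
// Hypothetical stand-in for the channel created in the example above
Channel.of(
        [ [ id: 'example1', family: 'family1', count: 2 ], 'example1.txt' ],
        [ [ id: 'example2', family: 'family1', count: 2 ], 'example2.txt' ],
        [ [ id: 'example3', family: 'family2', count: 1 ], 'example3.txt' ]
    )
    .map { meta, txt ->
        // groupKey() attaches the expected group size to the grouping key
        tuple(groupKey(meta.family, meta.count), txt)
    }
    .groupTuple() // Emits each family as soon as all of its items have arrived
    .set { ch_families }

ch_families.view() // Each item has 2 elements: family key, list of txt files
```

Grouping on the family name rather than the whole `meta` is a design choice here: the `id` differs between samples of the same family, so the full `meta` map would never match across them.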

## Split into multiple channels

Sometimes you don't want all inputs to remain in the same channel (e.g. when the files need to be pre-processed separately).

The following code shows an example where a `cram` file and a `bed` file are given in the samplesheet. The result contains two channels: one with the `cram` file and one with the `bed` file.

```groovy
// channel created by fromSamplesheet() previous to modification:
// [[id:example], example.cram, example.cram.crai, example.bed]
params.input = "samplesheet.csv"
Channel.fromSamplesheet("input")
    .multiMap { meta, cram, crai, bed ->
        cram: [meta, cram, crai]
        bed:  [meta, bed]
    }
.set { input }

input.cram.view() // [[id:example], example.cram, example.cram.crai]
input.bed.view() // [[id:example], example.bed]
```