[DOC] Add a raptor layout tutorial
Signed-off-by: Lydia Buntrock <[email protected]>
Irallia committed Dec 6, 2022
1 parent 6f587e7 commit aed342e
Showing 2 changed files with 218 additions and 5 deletions.
140 changes: 140 additions & 0 deletions doc/tutorial/02_layout/index.md
@@ -0,0 +1,140 @@
# Create a layout with Raptor {#tutorial_layout}

You will learn how to construct a Raptor layout of large collections of nucleotide sequences.

\attention
This tutorial is only relevant if you want to use the advanced HIBF option.
You can skip this chapter if you want to use Raptor with the default IBF.

\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
[<b>Interleaved Bloom Filter (IBF)</b>](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}

[TOC]

# Create a Layout of large collections of nucleotide sequences

To realise this distinction between user bins and technical bins, a layout must be calculated. For this purpose we have
developed our own tool [Chopper](https://www.seqan.de/apps/chopper.html) and integrated it into Raptor. You can
therefore simply call it with `raptor layout` without having to install Chopper separately.

## General Idea & main parameters

\image html hibf.svg width=40%

The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout.
The first step is to estimate the number of (representative) k-mers per user bin by computing HyperLogLog (HLL) sketches
[1] of the input data. These HLL sketches are stored in a directory and will be used in computing an HIBF layout. The
HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated using a k-mer
count per user bin which represents the potential density in a technical bin in an interleaved Bloom filter.

Using all default values a first call will look like:

```bash
raptor layout --input-file all_bin_path.txt --tmax 64
```

The `input-file` looks exactly as in our previous calls of `raptor index`; it contains all the paths of our database
files.

/todo Chopper expects an `input_data.tsv` input; since it currently has only one column (containing the paths), a `.txt` file works as well.

The parameter `tmax` limits the number of technical bins on each level of the HIBF. Choosing a good tmax is not trivial.
The smaller tmax, the more levels the layout needs to represent the data. This results in a higher space consumption of
the index. While querying each individual level is cheap, querying many levels might also lead to an increased runtime.
A good tmax is usually the square root of the number of user bins rounded to the next multiple of 64. Note that your
tmax will be rounded to the next multiple of 64 anyway.
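
The rule of thumb above can be sketched as a small shell helper. `suggest_tmax` is a hypothetical name and not part of Raptor; it only evaluates the square root and rounds it up to the next multiple of 64, as described in the paragraph above.

```bash
# Hypothetical helper, not part of Raptor: suggest a tmax for a given number of user bins.
suggest_tmax() {
    awk -v user_bins="$1" 'BEGIN {
        root = sqrt(user_bins)
        multiples = int(root / 64)
        if (multiples * 64 < root) multiples++   # round up to the next multiple of 64
        if (multiples < 1) multiples = 1         # never suggest less than 64
        print multiples * 64
    }'
}

suggest_tmax 1024   # sqrt(1024) = 32, rounded up to 64
```

Assuming the input file contains one path per line, the number of user bins could be taken from it, for example with `suggest_tmax "$(wc -l < all_bin_path.txt)"`.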

\note
At the expense of a longer runtime, you can enable the statistic mode that determines the best tmax using the option
`--determine-best-tmax`.
When this flag is set, the program computes multiple layouts for tmax in [64, 128, 256, ..., tmax] as well as for
tmax = sqrt(number of user bins). The layout algorithm itself only optimizes the space consumption. When determining the
best layout, we additionally keep track of the average number of queries needed to traverse each layout. This query cost
is taken into account when determining the best tmax for your data.
Note that the option `--tmax` serves as an upper bound. Once the layout quality starts dropping, the computation is
stopped. To compute all layouts up to `--tmax` regardless of the layout quality, pass the flag `--force-all-binnings`.
If the flag `--determine-best-tmax` is not set, `--force-all-binnings` is ignored and has no effect.
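
To make the enumeration in the note more concrete, the following sketch lists the tmax values that would be tried for a given upper bound. It assumes that the sequence 64, 128, 256, ... keeps doubling, which is read off the enumeration above and not taken from the implementation; `tmax_candidates` is a hypothetical helper. The additional candidate sqrt(number of user bins) is not included.

```bash
# Sketch only: print the tmax candidates 64, 128, 256, ... up to the given upper bound.
# The doubling step is an assumption based on the enumeration in the note.
tmax_candidates() {
    upper_bound=$1
    candidate=64
    while [ "$candidate" -le "$upper_bound" ]; do
        printf '%s\n' "$candidate"
        candidate=$((candidate * 2))
    done
}

tmax_candidates 256
```

With `--tmax 256 --determine-best-tmax`, up to three layouts from this list would be computed under this assumption, and fewer if the layout quality starts dropping earlier.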

By default, we then get `binning.out` as the output file.

Now we can pass the resulting layout to raptor to build the index.

\note
Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`.


## Additional parameters

Now let's look at the additional parameters of the layout:

- `--output-filename`
A file name for the resulting layout. Default: "binning.out".
- `--threads`
The number of threads to be used for parallel processing (the computation of the k-mer hashes is parallelised).

Parameter Tweaking:
- `--kmer-size`
The k-mer size influences the estimated counts. Choosing a k-mer size that is too small for your data will
result in files appearing more similar than they really are. Likewise, a large k-mer size might miss out on
certain similarities. For DNA sequences, a k-mer size in the range [16,32] has proven to work well. Default: 19.
- `--sketch-bits`
The number of bits the HyperLogLog sketch should use to distribute the values into bins. Default: 12. Value
must be in range [5,32].
- `--disable-sketch-output`
Although the sketches will improve the layout, you might want to disable writing the sketch files to disk.
Doing so will save disk space. However, you cannot use either `--estimate-union` or `--rearrange-user-bins` in
`raptor layout` without the sketches. Note that this option does not decrease run time as sketches have to be
computed either way.

- `--num-hash-functions` (unsigned long)
The number of hash functions to use when building the HIBF from the resulting layout. This parameter is
needed to correctly estimate the index size when computing the layout. Default: 2.
- `--false-positive-rate` (double)
The false positive rate you aim for when building the HIBF from the resulting layout. This parameter is
needed to correctly estimate the index size when computing the layout. Default: 0.05.
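
To get a feeling for how `--num-hash-functions` and `--false-positive-rate` influence the estimated size, the following sketch evaluates the textbook Bloom filter relation p = (1 - e^(-h*n/m))^h, solved for the number of bits m, where n is the number of k-mers, h the number of hash functions and p the false positive rate. This is the standard formula and is used here as an assumption; the exact computation inside Raptor may differ. `bloom_filter_bits` is a hypothetical helper.

```bash
# Sketch only: bits needed by a Bloom filter holding n elements with h hash functions
# and a target false positive rate p, using the textbook formula
#   m = -h * n / ln(1 - p^(1/h))
bloom_filter_bits() {
    awk -v n="$1" -v h="$2" -v p="$3" 'BEGIN {
        bits = -h * n / log(1 - exp(log(p) / h))   # exp(log(p) / h) is p^(1/h)
        printf "%.0f\n", bits                      # rounded to the nearest whole bit
    }'
}

bloom_filter_bits 1000000 2 0.05   # default settings: 2 hash functions, FPR 0.05
bloom_filter_bits 1000000 2 0.25   # a more permissive FPR needs fewer bits
```

Under this formula, relaxing the false positive rate shrinks the estimated size considerably, which is why the layout needs to know the values you intend to use when building the index.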

HyperLogLog Sketches:
To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches.
- `--estimate-union`
Use sketches to estimate the sequence similarity among a set of user bins. This will improve the layout
computation as merging user bins that do not increase technical bin sizes will be preferred. Attention: Only
possible if the directory [INPUT-PREFIX]_sketches is present.
- `--rearrange-user-bins`
As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may
lead to favourable small unions and thus a smaller index. Attention: Also enables --estimate-union and is
only possible if the directory [INPUT-PREFIX]_sketches is present.

Both options improve the layout but increase the runtime.

Parameter Tweaking:
- `--alpha`
The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of
optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins
influences the query time because a merged bin always triggers another search on a lower level. To influence
this ratio, alpha can be used. The higher alpha, the fewer merged bins are chosen in the layout. This
improves query times but leads to a bigger index. Default: 1.2.
- `--max-rearrangement-ratio`
When the option --rearrange-user-bins is set, this option can influence the rearrangement algorithm. The
algorithm only rearranges the order of user bins in fixed intervals. The higher --max-rearrangement-ratio,
the larger the intervals. This potentially improves the layout, but increases the runtime of the layout
algorithm. Default: 0.5. Value must be in range [0,1].

A call could then look like this:
```bash
raptor layout --input-file all_bin_path.txt \
--tmax 64 \
--kmer-size 16 \
--sketch-bits 5 \
--num-hash-functions 3 \
--false-positive-rate 0.25 \
--estimate-union \
--rearrange-user-bins \
--alpha 1.5 \
--max-rearrangement-ratio 0.25 \
--threads 4 \
--output-filename binning.layout
```

An assignment follows in the index tutorial on the HIBF.
83 changes: 78 additions & 5 deletions doc/tutorial/03_index/index.md
@@ -2,7 +2,7 @@

You will learn how to construct a Raptor index of large collections of nucleotide sequences.

\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
\tutorial_head{Easy, 30 min, \ref tutorial_first_steps \, for using HIBF: \ref tutorial_layout,
[<b>Interleaved Bloom Filter (IBF)</b>](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}

@@ -240,22 +240,86 @@ does not need any special parameters.
/todo This is a placeholder section and needs more information including the chopper layout.

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf` can be used. This uses a more
(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf`. This uses a more
space-saving method of storing the bins. It distinguishes between the user bins, which reflect the individual samples as
before, and the so-called technical bins, which may merge several user bins. This is especially useful when there are
samples of very different sizes.

To use the HIBF, a layout must be created before creating an index. We have written a separate tutorial for this:
\ref tutorial_layout.

### HIBF indexing with the use of the layout

The layout file replaces `all_bin_path.txt` and is passed via the HIBF parameter instead: `--hibf binning.out`.

Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. But we can
offer the option to name the desired false positive rate with `--fpr`.
offer the option to name the desired false positive rate with `--fpr`. Thus, for example, a call looks like this:

```bash
raptor build --kmer 19 --hash 3 --hibf --fpr 0.1 --output raptor.index all_bin_paths.txt
raptor build --hibf binning.layout \
--kmer 16 \
--window 20 \
--hash 3 \
--fpr 0.25 \
--threads 2 \
--output hibf.index
```
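
The values in the call above match those of the extended `raptor layout` call from the layout tutorial (k-mer size 16, 3 hash functions, false positive rate 0.25). One way to keep both calls consistent is to define each value once. The following sketch only assembles and prints the two command lines, using options that appear in these tutorials; the variable names are assumptions.

```bash
# Sketch only: define the shared parameters once and reuse them in both calls.
kmer_size=16
num_hash_functions=3
false_positive_rate=0.25
layout_file=binning.layout

# Options of `raptor layout` as used in the layout tutorial.
layout_call="raptor layout --input-file all_bin_path.txt --tmax 64"
layout_call="$layout_call --kmer-size $kmer_size"
layout_call="$layout_call --num-hash-functions $num_hash_functions"
layout_call="$layout_call --false-positive-rate $false_positive_rate"
layout_call="$layout_call --output-filename $layout_file"

# Options of `raptor build` as used in the call above.
build_call="raptor build --hibf $layout_file"
build_call="$build_call --kmer $kmer_size --window 20"
build_call="$build_call --hash $num_hash_functions"
build_call="$build_call --fpr $false_positive_rate"
build_call="$build_call --output hibf.index"

# Dry run: print both command lines for inspection instead of executing them.
printf '%s\n' "$layout_call" "$build_call"
```

Printing the commands first makes it easy to check that the layout and the index are computed with the same values before starting a long-running job.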

\assignment{Assignment 4: A default HIBF}
This time we start with the default parameters and afterwards look at the options for improving the index.

With our small example we can see neither the advantages of the HIBF nor the differences that arise when we change
the parameters. So instead of returning to the small example from above, let's use the one from the introduction:

```console
$ tree -L 2 example_data
example_data
├── 1024
│   ├── bins
│   └── reads
└── 64
    ├── bins
    └── reads
```
Use the data in the `1024` folder.

\hint
To create the `all_bin_paths.txt` you can use:
```
seq -f "example_data/1024/bins/bin_%02g.fasta" 0 1 1023 > all_bin_paths.txt
```
\endhint

Now run `raptor layout` and `raptor build` with their default parameters and call the new index `hibf_raptor.index`.

\endassignment

\solution
You should have run:
```bash
raptor layout --input-file all_bin_paths.txt --tmax 64
raptor build --hibf binning.out --output hibf_raptor.index
```

\hint
Your `tmax` is the square root of `1024`, which is `32`. Rounded up to the next multiple of `64`, this gives `64`.
\endhint

Your directory should look like this:
```bash
tmp$ ls -la
...
```
\endsolution

\note Important: the `--false-positive-rate` passed to `raptor layout` must be the same as the `--fpr` passed to `raptor build`. The same applies to the k-mer size. Open question: does this also hold for the number of hash functions?
\warning test

\note
For a detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the
`raptor::hierarchical_interleaved_bloom_filter` API.

\assignment{Assignment 4: HIBF}
\assignment{Assignment 5: HIBF with useful parameters}
Let's use the HIBF for our small example. Thus run the example above with a false positive rate of 0.05 and call the new
index `hibf_raptor.index`.
As our example is small, we will keep the kmer size of 4
@@ -267,7 +331,16 @@ As our example is small, we will keep the kmer size of 4
\solution
You should have run:
```bash
raptor layout --input-file all_bin_path.txt --tmax 64
raptor build --hibf binning.out --fpr 0.1 --output raptor.index
raptor build --kmer 4 --hibf --fpr 0.05 --output hibf_raptor.index all_paths.txt
raptor build --kmer "$kmer_size" \
--window "$kmer_size" \
--hash $num_hash_fn \
--fpr "$fpr" \
--threads $task.cpus \
--output hibf.index \
--hibf $layout
```
/todo Currently not working: `[Error] The list of input files cannot be empty.`
