[DOC] Add a raptor layout tutorial #201 (Merged, 7 commits, Dec 13, 2022)

doc/tutorial/02_layout/index.md (282 additions, 0 deletions)
# Create a layout with Raptor {#tutorial_layout}

You will learn how to construct a Raptor layout of large collections of nucleotide sequences.

\attention
This tutorial is only relevant if you want to use the advanced option `--hibf` of `raptor index`.
You can skip this chapter if you want to use Raptor with the default IBF.

\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
[<b>Interleaved Bloom Filter (IBF)</b>](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}

[TOC]

# IBF vs HIBF

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). It uses a more space-saving method of storing the bins and is,
in many cases, also faster to query, especially when the number of bins is large. It distinguishes between the user
bins, which reflect the individual samples as before, and the so-called technical bins, which may merge several user
bins or split a single user bin into several parts. This is especially useful when the samples are of very different
sizes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which throw some bins together

You could add that it also splits bins. Maybe merge is better than throw together.


To use the HIBF, a layout must be created first.

# Create a Layout of the HIBF

To realise this distinction between user bins and technical bins, a layout must be computed before creating an index.
For this purpose we have developed our own tool [Chopper](https://www.seqan.de/apps/chopper.html) and integrated it into
Raptor, so you can simply call it with `raptor layout` without having to install Chopper separately.

## General Idea & main parameters

\image html hibf.svg width=40%

The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout.
The first step is to estimate the number of distinct k-mers per user bin by computing
[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
sketches are stored in a directory and will be used in computing an HIBF layout. We will go into more detail later
(\ref HLL). The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated
using a k-mer count per user bin, which represents the potential density in a technical bin in an Interleaved Bloom
Filter.

Using all default values, a first call will look like:

```bash
raptor layout --input-file all_bin_path.txt --tmax 64
```

The `input-file` looks exactly as in our previous calls of `raptor index`; it contains all the paths of our database
files.

The parameter `--tmax` limits the number of technical bins on each level of the HIBF. Choosing a good \f$t_{max}\f$ is
not trivial. The smaller \f$t_{max}\f$, the more levels the layout needs to represent the data. This results in a higher
space consumption of the index. While querying each individual level is cheap, querying many levels might also lead to
an increased runtime. A good \f$t_{max}\f$ is usually the square root of the number of user bins rounded to the next
multiple of `64`. Note that your \f$t_{max}\f$ will be rounded to the next multiple of 64 anyway.
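
As a sketch, assuming `all_bin_paths.txt` lists one database file per line, this suggested \f$t_{max}\f$ can be
computed like this (a hypothetical helper, not part of Raptor):

```bash
bins=$(wc -l < all_bin_paths.txt)
awk -v b="$bins" 'BEGIN {
    t = sqrt(b)            # ~32.0 for the 1023 bins of the example data
    m = int(t / 64) * 64   # round down to a multiple of 64 ...
    if (m < t) m += 64     # ... then up to the next one if needed
    print "suggested tmax:", m
}'
```

For the `1023` user bins of the example data below, this prints `suggested tmax: 64`.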

\note
At the expense of a longer runtime, you can enable the statistic mode that determines the best \f$t_{max}\f$ using the
option `--determine-best-tmax`.
When this flag is set, the program will compute multiple layouts for \f$t_{max}\f$ in `[64, 128, 256, ..., tmax]` as
well as `tmax = sqrt(number of user bins)`. The layout algorithm itself only optimizes the space consumption. When
determining the best layout, we additionally keep track of the average number of queries needed to traverse each layout.
This query cost is taken into account when determining the best \f$t_{max}\f$ for your data.
Note that the option `--tmax` serves as an upper bound. Once the layout quality starts dropping, the computation is
stopped. The flag `--force-all-binnings` forces all layouts up to `--tmax` to be computed, regardless of the layout
quality. If the flag `--determine-best-tmax` is not set, `--force-all-binnings` is ignored and has no effect.
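
For example, a statistics run that evaluates every layout for \f$t_{max}\f$ in `[64, 128, 256]`, regardless of the
layout quality, could look like this:

```bash
raptor layout --input-file all_bin_paths.txt \
              --tmax 256 \
              --determine-best-tmax \
              --force-all-binnings
```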

We then get the resulting layout (default: `binning.out`) as an output file, which we then pass to Raptor to create the
index. You can change this default with `--output-filename`.

\note
Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`.

\assignment{Assignment 1: Create a first layout}
Let's take the bigger example from the introduction \ref tutorial_first_steps and create a layout for it.
```console
$ tree -L 2 example_data
example_data
├── 1024
│   ├── bins
│   └── reads
└── 64
├── bins
└── reads
```
We use the data in the `1024` folder.

\hint
First we need a file with all paths to the fasta files. For this use the command:
```bash
seq -f "1024/bins/bin_%04g.fasta" 0 1 1023 > all_bin_paths.txt
```
\endhint

First determine the best \f$t_{max}\f$ value, then calculate a layout with default values and this \f$t_{max}\f$.
\endassignment

\solution
Your `all_bin_paths.txt` should look like:
```txt
1024/bins/bin_0001.fasta
1024/bins/bin_0002.fasta
1024/bins/bin_0003.fasta
...
1024/bins/bin_1021.fasta
1024/bins/bin_1022.fasta
1024/bins/bin_1023.fasta
```

\note
Sometimes it may be better to use absolute paths instead.

And you should have run:
```bash
raptor layout --input-file all_bin_paths.txt --determine-best-tmax --tmax 64
```
With the output:
```bash
## ### Parameters ###
## number of user bins = 1023
## number of hash functions = 2
## false positive rate = 0.05
## ### Notation ###
## X-IBF = An IBF with X number of bins.
## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
## ### Column Description ###
## tmax : The maximum number of technical bin on each level
## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
## (l*m)_tmax : Computed by l_tmax * m_tmax
## size : The expected total size of an tmax-HIBF
# tmax c_tmax l_tmax m_tmax (l*m)_tmax size
64 1.00 2.00 1.00 2.00 12.8MiB
# Best t_max (regarding expected query runtime): 64
```
And afterwards:
```bash
raptor layout --input-file all_bin_paths.txt --tmax 64
```
Your directory should look like this:
```bash
$ ls
1024/ all_bin_paths.txt chopper_sketch.count mini/
64/ binning.out chopper_sketch_sketches/
```

\note
We will use this mini-example in the following, both with further parameters and later for `raptor index --hibf`.
Therefore, we recommend not deleting the files, including the built indexes.

\endsolution

## Additional parameters

To create an index, and thus a layout, the individual samples of the data set are chopped up into k-mers. Each k-mer of
sample `i` is passed through `j` hash functions and sets `j` bits in bin `i` of the Bloom Filter to `1`. When a query is
searched, its k-mers are passed through the same hash functions, and we check in which bins all resulting positions
point to ones. Since this can also produce false positives, the result only indicates that the query is probably part
of a sample.

For example, consider the query `ACGT` with the 3-mers `ACG` and `CGT`, the samples `AATGT`, `ACCGT` and `ACGTA`, and
`2` hash functions. The bits for `ACG` in bins 1 to 3 could look like `|0000|0000|0101|`, and the bits for `CGT` like
`|0000|0110|1100|`. Only in bin 3 do both k-mers point exclusively to ones, so the query seems to match sample 3.

This principle also applies to the Hierarchical Interleaved Bloom Filter, except that the bins are stored even more
efficiently, as described above, and this arrangement is described by the layout. This means that you already have to
know some parameters for the layout that you would otherwise only specify when building the index:

With `--kmer-size` you can specify the length of the k-mers, which should be long enough to avoid random hits.
By using multiple hash functions, you can sometimes further reduce the possibility of false positives
(`--num-hash-functions`). We found the [Bloom Filter Calculator](https://hur.st/bloomfilter/) useful for getting a
feeling for how these parameters interact. As it is not ours, we do not guarantee its accuracy.
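
To illustrate, the following sketch applies the textbook Bloom Filter formulas, which is essentially what the linked
calculator computes; the values `n` (number of k-mers per bin) and `p` (false positive rate) are made-up examples, and
Raptor's internal size estimation may differ:

```bash
# bits = -n * ln(p) / (ln 2)^2 for n elements at false positive rate p;
# k = bits/n * ln 2 is the optimal number of hash functions.
awk -v n=1000000 -v p=0.05 'BEGIN {
    bits = -n * log(p) / (log(2) ^ 2)
    k = bits / n * log(2)
    printf "bits per bin: %.0f (%.2f MiB), optimal hash functions: %.1f\n",
           bits, bits / 8 / 1024 / 1024, k
}'
```

For one million k-mers at a false positive rate of `0.05`, this yields roughly 6.2 million bits (about 0.74 MiB) per
bin and about 4 hash functions.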

Each Bloom Filter has a bit vector length that, across all Bloom Filters, determines the size of the Interleaved Bloom
Filter; in the IBF case, we can specify this size directly. Since the HIBF calculates the size of the index itself, it
is no longer possible to specify a size here. Instead, we offer the option to set the desired false positive rate with
`--false-positive-rate`.

\note These parameters must be set identically for `raptor index`.

A call could then look like this:
```bash
raptor layout --input-file all_bin_path.txt \
--tmax 64 \
--kmer-size 17 \
--num-hash-functions 4 \
--false-positive-rate 0.25 \
--output-filename binning.layout
```

### Parallelization

Raptor supports parallelization. By specifying `--threads`, for example, the k-mer hashes are computed in parallel.
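
For example, a call using four threads (assuming four cores are available on your machine) could look like this:

```bash
raptor layout --input-file all_bin_paths.txt --tmax 64 --threads 4
```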


\assignment{Assignment 2: Create a more specific layout}
Now let's run the above example with more parameters.

Use the same `all_bin_paths.txt` and create a `binning2.layout`. Use a k-mer size of `16`, `3` hash functions, a false
positive rate of `0.1`, and `2` threads.
\endassignment

\solution
You should have run:
```bash
raptor layout --input-file all_bin_paths.txt \
--tmax 64 \
--kmer-size 16 \
--num-hash-functions 3 \
--false-positive-rate 0.1 \
--threads 2 \
--output-filename binning2.layout
```
Your directory should look like this:
```bash
$ ls
1024/ all_bin_paths.txt binning2.layout chopper_sketch_sketches/
64/ binning.out chopper_sketch.count mini/
```
\endsolution

### HyperLogLog sketch {#HLL}

The first step is to estimate the number of distinct k-mers per user bin by computing
[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
sketches are stored in a directory and will be used in computing an HIBF layout.

We also give a short explanation of the HLL sketches here in order to explain the related parameters. Note that each
user bin is sketched individually.

\note
Most parameters are advanced and only need to be changed if the calculation takes significantly too long or the memory
usage is too high.

So the question is: how many distinct elements does our multiset contain?
An exact calculation needs more storage space and runtime than the HLL estimate.
To find this out, we compute (binary) 64 bit hash values of the data. These are equally distributed over all possible
hash values. If you go through this hashed data, you can then estimate how many distinct elements you have seen so far
just by reading the number of leading zeros. (For an element whose hash value has `p` leading zeros, it is estimated
that you have seen \f$2^p\f$ distinct elements.) You then simply save the maximum of these (\f$2^{p_{max}}\f$ distinct
elements).

<!-- The probability of observing a hash value with (at least) p leading zeros is exactly 1/(2^p). -->
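
The following toy sketch illustrates the leading-zeros idea; it is not Raptor's implementation and uses 32-bit
pseudo-random numbers as stand-ins for the hash values:

```bash
# Draw n pseudo-random values, track the maximum number of leading zeros
# p_max, and compare the estimate 2^p_max with the true count.
awk 'BEGIN {
    srand(42)
    n = 100000
    pmax = 0
    for (i = 0; i < n; i++) {
        h = int(rand() * 2 ^ 32)            # stand-in for a uniform hash value
        p = 0
        for (bit = 31; bit >= 0; bit--) {   # count leading zeros of h
            if (int(h / 2 ^ bit) % 2 == 1) break
            p++
        }
        if (p > pmax) pmax = p
    }
    printf "true count: ~%d, estimate 2^p_max: %d\n", n, 2 ^ pmax
}'
```

A single maximum is quite noisy, which is exactly why the stream of hash values is split into several substreams, as
described next.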

However, if we are unlucky and come across a hash value that consists of only `0`s, then \f$p_{max}\f$ is maximal, no
matter how many distinct elements are actually present.
To make the estimate robust against such outliers, we cut the stream of hash values into `m` substreams, using the
first `b` bits of each hash value to determine into which substream it belongs, and compute \f$p_{max}\f$ over each
substream separately. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.

We can influence this `m` with `--sketch-bits`. Since the first `b` bits of a hash value select the substream, `m` must
be a power of two: `--sketch-bits` sets the `b` with \f$m = 2^b\f$. For example, `--sketch-bits 12` yields
\f$m = 2^{12} = 4096\f$ substreams.

If we choose our `b` (`m`) to be very large, then we need more memory but get higher accuracy. (Storage consumption
grows exponentially in `b`.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many
user bins and observe a long runtime, it is worth choosing a somewhat smaller `b` (`m`). Conversely, the relative error
of the HLL estimate increases with a decreasing `b` (`m`); we believe that anything above \f$m = 512\f$ should be fine.


#### Advanced options for HLL sketches

The following options should only be touched if the calculation takes a long time.

We have implemented another preprocessing step that groups the technical bins even better with regard to the
similarities of the input data. It can be switched off with the flag `--skip-similarity-preprocessing` if it costs too
much runtime.

With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`).
Setting it to `1` switches the rearrangement off. If you set it to a very small value, you will need more runtime and
memory. If it is close to `1`, only little rearranging is done, which could result in a less memory-efficient layout.
In our benchmarks, however, we did not observe a great influence of this value, so we recommend changing it only if
the layout computation takes too much memory or time, where it has a large influence.


One last observation about these advanced options: If you expect hardly any similarity in the data set, then the
similarity preprocessing makes very little difference.

### Another advanced option: alpha

You should only touch the parameter `--alpha` if you have understood very well how the layout works and you are
dissatisfied with the resulting index, e.g. if there is still plenty of RAM available but the index is very slow.

The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of optimizing the
runtime for querying such an HIBF. In general, the ratio of merged bins and split bins influences the query time because
a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used.

Alpha is a parameter for weighting the storage space calculation of the lower-level IBFs. It functions as a lower-level
penalty, i.e. if alpha is large, the DP algorithm tries to avoid lower levels, which at the same time leads to the
top-level IBF becoming somewhat larger. This improves query times but leads to a bigger index.