
[DOC] Tutorial: small corrections #205

Merged: 4 commits on Jan 18, 2023
1 change: 1 addition & 0 deletions doc/tutorial/01_introduction/index.md
@@ -91,6 +91,7 @@ the corresponding bins in which they were found:
For a list of options, see the help pages:
```console
raptor --help
raptor layout --help
raptor build --help
raptor search --help
raptor upgrade --help
```
42 changes: 31 additions & 11 deletions doc/tutorial/02_layout/index.md
@@ -15,11 +15,13 @@ You can skip this chapter if you want to use raptor with the default IBF.
# IBF vs HIBF

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
which throw some bins together. This is especially useful when there are samples of very different sizes.
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
*Technical bins* may store a single user bin, a split part of a user bin, or several (merged) user bins. This is
especially useful when samples vary dramatically in size.

To use the HIBF, a layout must be created
To use the HIBF, a layout must be created.

# Create a Layout of the HIBF

@@ -36,9 +38,13 @@ The first step is to estimate the number of (representative) k-mers per user bin
[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
sketches are stored in a directory and will be used in computing an HIBF layout. We will go into more detail later
\ref HLL. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated
using a k-mer count per user bin which represents the potential denisity in a technical bin in an Interleaved Bloom
using a k-mer count per user bin which represents the potential density in a technical bin in an Interleaved Bloom
filter.

\note
The term representative indicates that the k-mer content could be transformed by a function which reduces its size and
distribution, e.g. by using minimizers.
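
As a concrete illustration of such a reduction, here is a minimal sketch, assuming a made-up sequence, `k = 4`, and a
window of `w = 3` consecutive k-mers. It is not raptor's implementation, and real minimizer schemes usually compare
hash values rather than the k-mers themselves:

```cpp
// A minimal sketch (not raptor's implementation): one representative k-mer
// (the "minimizer") is kept per window of w consecutive k-mers, which reduces
// the number of stored k-mers.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main()
{
    std::string const seq{"ACGTACGTTGCAACGT"}; // made-up input sequence
    std::size_t const k{4};                    // k-mer size (assumed)
    std::size_t const w{3};                    // k-mers per window (assumed)

    std::vector<std::string> kmers;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
        kmers.push_back(seq.substr(i, k));

    std::set<std::string> minimizers;
    for (std::size_t i = 0; i + w <= kmers.size(); ++i) // slide a window over the k-mers
        minimizers.insert(*std::min_element(kmers.begin() + i, kmers.begin() + i + w));

    std::cout << kmers.size() << " k-mers reduced to " << minimizers.size() << " minimizers\n";
}
```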

Using all default values a first call will look like:

```bash
```

@@ -157,6 +163,15 @@ Therefore, we recommend not deleting the files including the built indexes.
To create an index and thus a layout, the individual samples of the data set are chopped up into k-mers, which are
passed through hash functions to determine the bit setting of the Bloom Filter in their so-called bin. This means that
a k-mer from sample `i` sets, with `j` hash functions, `j` bits to `1` in bin `i`.

Example:
Query ACGT with k-mers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 hash functions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3
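
To make the bit-setting concrete, here is a toy sketch under the same assumptions as the example above (3 samples,
3-mers, 2 hash functions, 4 bits per bin). It is not raptor's `raptor::interleaved_bloom_filter`, and the improvised
hash functions will produce different bit patterns than the hand-drawn ones, but the build and query logic is the same:

```cpp
// Toy version of the example above: 3 bins of 4 bits each, 2 hash functions,
// 3-mers. Not raptor's raptor::interleaved_bloom_filter.
#include <array>
#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

constexpr std::size_t bin_size = 4; // bits per bin (toy value)

std::array<std::size_t, 2> hash_positions(std::string const & kmer)
{
    std::size_t const h = std::hash<std::string>{}(kmer);
    return {h % bin_size, (h / bin_size) % bin_size}; // stand-in for 2 hash functions
}

int main()
{
    std::vector<std::string> const samples{"AATGT", "ACCGT", "ACGTA"};
    std::size_t const k = 3;
    std::vector<std::bitset<bin_size>> bins(samples.size());

    // Build: every k-mer of sample i sets up to 2 bits in bin i.
    for (std::size_t i = 0; i < samples.size(); ++i)
        for (std::size_t j = 0; j + k <= samples[i].size(); ++j)
            for (std::size_t pos : hash_positions(samples[i].substr(j, k)))
                bins[i].set(pos);

    // Query: a sample may contain the query if, for every query k-mer,
    // all hashed positions in that sample's bin are set (false positives possible).
    std::string const query{"ACGT"};
    for (std::size_t i = 0; i < bins.size(); ++i)
    {
        bool all_kmers_hit = true;
        for (std::size_t j = 0; j + k <= query.size(); ++j)
            for (std::size_t pos : hash_positions(query.substr(j, k)))
                all_kmers_hit = all_kmers_hit && bins[i].test(pos);
        std::cout << "Sample " << i + 1 << ": " << (all_kmers_hit ? "possible hit\n" : "no hit\n");
    }
}
```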

Comment on lines +166 to +174
Contributor Author

@eseiler do you have an image for this?

Member

Is this the first time we explain how this works? Otherwise, I would just reference the other part.

This also seems a bit too detailed?
If we do it here, we could also think about just explaining a plain Bloom Filter. The principle is the same.

Contributor Author

I added this example, because of this discussion: #201 (comment)
We have also a really short explanation of a BF and IBF here: https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters.
Should we merge these explanations? Or would you leave just this one out as it's too detailed?

If a query is then searched, its k-mers are passed through the hash functions, and we check in which bins they point
only to ones. This can also result in false positives. Thus, the result only indicates that the query is probably part
of a sample.

Member

Down below when mentioning the https://hur.st/bloomfilter/ website, we should mention that the number of inserted elements is the number of kmers in a single bin, and you want to use the biggest bin to be sure.

Thinking about it, does it fit here? Because the layout kinda does all the computations for you?
It does fit, if we rephrase it a bit like: You can look for optimal parameter settings, because we don't optimize the number of hash functions

Contributor Author

I copied this paragraph from the index tutorial because I thought you might need it here too. But I can also take it out completely. -> https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters

@@ -169,6 +184,8 @@ With `--kmer-size` you can specify the length of the k-mers, which should be lon
By using multiple hash functions, you can sometimes further reduce the possibility of false positives
(`--num-hash-functions`). We found a useful [Bloom Filter Calculator](https://hur.st/bloomfilter/) to check whether
this could help. As it is not ours, we do not guarantee its accuracy.
To use this calculator, the number of inserted elements is the number of k-mers in a single bin, and you should use the
biggest bin to be sure.
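
For reference, the calculator evaluates the textbook Bloom filter estimate \f$p = (1 - e^{-hn/m})^h\f$, where \f$n\f$
is the number of inserted elements (the k-mers of the biggest bin), \f$m\f$ the number of bits per bin, and \f$h\f$ the
number of hash functions. A small sketch with made-up numbers:

```cpp
// Evaluates the textbook Bloom filter estimate p = (1 - e^(-h * n / m))^h.
// All numbers below are made up for illustration.
#include <cmath>
#include <cstdio>

int main()
{
    double const n = 1'000'000; // k-mers in the biggest user bin (assumed)
    double const m = 8'388'608; // bits per bin, here 1 MiB (assumed)

    for (int h = 1; h <= 4; ++h) // compare a few numbers of hash functions
    {
        double const p = std::pow(1.0 - std::exp(-h * n / m), h);
        std::printf("h = %d  ->  estimated false positive rate %.4f\n", h, p);
    }
}
```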

Each Bloom Filter has a bit vector length that, across all Bloom Filters, gives the size of the Interleaved Bloom
Filter, which we can specify in the IBF case. Since the HIBF calculates the size of the index itself, it is no longer
@@ -242,15 +259,17 @@ only by reading leading zeros. (For the i'th element with `p` leading zeros, it

However, if we are unlucky and come across a hash value that consists of only `0`'s, then \f$p_{max}\f$ is of course the
maximum of all possible hash values, no matter how many different elements are actually present.
To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
To avoid this, we cut the stream of hash values into `m` substreams and use the first `b` bits of each hash value to
determine into which substream it belongs. From these, we calculate the *harmonic mean* as the total \f$p_{max}\f$.

We can influence this `m` with `--sketch-bits`. `m` must be a power of two so that we can divide the `64` bits evenly, so
we use `--sketch-bits` to set a `b` with \f$m = 2^b\f$.
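
The following simplified sketch (not raptor's HLL implementation; it needs C++20, uses a deliberately tiny `b`, and
hard-codes the bias-correction constant for \f$m = 16\f$) shows the substream, leading-zero, and harmonic-mean idea:

```cpp
// Simplified HyperLogLog sketch: the first b bits of each 64-bit hash select
// one of m = 2^b registers (substreams); each register stores the maximum
// number of leading zeros seen in the remaining bits; the registers are
// combined with a harmonic mean.
#include <algorithm>
#include <bit>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

std::uint64_t mix(std::uint64_t x) // splitmix64 finalizer as a stand-in hash
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

int main()
{
    unsigned const b = 4;                       // --sketch-bits (tiny demo value)
    std::size_t const m = std::size_t{1} << b;  // number of registers / substreams
    std::vector<std::uint8_t> reg(m, 0);

    for (std::uint64_t i = 0; i < 100'000; ++i) // pretend these are distinct k-mers
    {
        std::uint64_t const h = mix(i);
        std::size_t const idx = h >> (64 - b);                        // first b bits pick the substream
        std::uint8_t const rank = std::countl_zero((h << b) | 1) + 1; // leading zeros of the rest
        reg[idx] = std::max(reg[idx], rank);
    }

    double harmonic_sum = 0.0;
    for (std::uint8_t r : reg)
        harmonic_sum += std::pow(2.0, -static_cast<double>(r));

    double const estimate = 0.673 * m * m / harmonic_sum; // 0.673 = alpha_m for m = 16
    std::cout << "estimated number of distinct elements: " << estimate << '\n';
}
```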

If we choose our `b` (`m`) to be very large, then we need more memory but get higher accuracy. (Storage consumption is
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long layout computation time, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore,
the relative error of the HLL estimate increases with a decreasing `b` (`m`). Based on our benchmarks, we believe that
anything above `m = 512` should be fine.

#### Advanced options for HLL sketches

@@ -262,8 +281,9 @@ runtime.

With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
memory. If it is close to `1`, however, only little re-arranging is done, which could result in a less memory-efficient
layout. This parameter should only be changed if the layouting takes too much memory or time, because that is where it
can have a large influence.

One last observation about these advanced options: If you expect hardly any similarity in the data set, then the
similarity preprocessing makes very little difference.
2 changes: 2 additions & 0 deletions doc/tutorial/03_index/index.md
@@ -118,6 +118,8 @@ With `--kmer` you can specify the length of the k-mers, which should be long eno
By using multiple hash functions, you can sometimes further reduce the possibility of false positives (`--hash`). We
found a useful [Bloom Filter Calculator](https://hur.st/bloomfilter/) to check whether it could help. As it is
not ours, we do not guarantee its accuracy.
To use this calculator, the number of inserted elements is the number of k-mers in a single bin, and you should use the
biggest bin to be sure.

Each Bloom Filter has a bit vector length, which over all Bloom Filters gives the size of the Interleaved Bloom Filter,
which we specify with `--size`. We can therefore specify how much space the bins take up in total, whereby the following
4 changes: 2 additions & 2 deletions doc/tutorial/04_search/index.md
@@ -302,8 +302,8 @@ Two files are stored:
- `correction_*.bin`: Depends on pattern, window, kmer/shape, p_max, and false positive rate.

\assignment{Assignment 3: Search with minimizers.}
We want to use the `minimiser.index` from the index tutorial assignment again and use the same queries `CGCGTTCATT` and
`CGCGTCATT` to search in it.
We want to use the `minimiser.index` from the index tutorial assignment (\ref tutorial_index ) again and use the same
queries `CGCGTTCATT` and `CGCGTCATT` to search in it.

Let's search now with a tau of `0.9` and a pmax of `0.9`, creating a `search5.output`.
\endassignment