From aed342eacd378a591085a20f19dcda45d43aec6d Mon Sep 17 00:00:00 2001
From: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
Date: Tue, 15 Nov 2022 14:42:59 +0100
Subject: [PATCH] [DOC] Add a raptor layout tutorial

Signed-off-by: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
---
 doc/tutorial/02_layout/index.md | 140 ++++++++++++++++++++++++++++++++
 doc/tutorial/03_index/index.md  |  83 +++++++++++++++++--
 2 files changed, 218 insertions(+), 5 deletions(-)
 create mode 100644 doc/tutorial/02_layout/index.md
diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md
new file mode 100644
index 00000000..a2381c52
--- /dev/null
+++ b/doc/tutorial/02_layout/index.md
@@ -0,0 +1,140 @@
+# Create a layout with Raptor {#tutorial_layout}
+
+You will learn how to construct a Raptor layout of large collections of nucleotide sequences.
+
+\attention
+This tutorial is only relevant if you want to use the advanced option, the Hierarchical Interleaved Bloom Filter (HIBF).
+You can skip this chapter if you want to use Raptor with the default IBF.
+
+\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
+[<b>Interleaved Bloom Filter (IBF)</b>](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
+\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}
+
+[TOC]
+
+# Create a Layout of large collections of nucleotide sequences
+
+The HIBF distinguishes between user bins, which reflect the individual samples, and technical bins, which may combine
+several user bins. To realise this distinction, a layout must be computed. For this purpose we have developed our own
+tool, [Chopper](https://www.seqan.de/apps/chopper.html), and integrated it into Raptor. You can therefore simply call
+it with `raptor layout` without having to install Chopper separately.
+
+## General Idea & main parameters
+
+\image html hibf.svg width=40%
+
+The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout.
+The first step is to estimate the number of (representative) k-mers per user bin by computing HyperLogLog (HLL) sketches
+[1] of the input data. These HLL sketches are stored in a directory and will be used in computing an HIBF layout. The
+HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated from the k-mer
+count per user bin, which represents the potential density of a technical bin in an interleaved Bloom filter.
+
+Using all default values, a first call looks like this:
+
+```bash
+raptor layout --input-file all_bin_paths.txt --tmax 64
+```
+
+The file passed to `--input-file` has the same format as the input of `raptor build`: it contains the paths of all our
+database files, one per line.
+
+/todo Chopper expects an `input_data.tsv` as input. Since it currently has only one column (the paths), a `.txt` file
+works as well.
+
+The parameter `tmax` limits the number of technical bins on each level of the HIBF. Choosing a good tmax is not trivial.
+The smaller tmax, the more levels the layout needs to represent the data. This results in a higher space consumption of
+the index. While querying each individual level is cheap, querying many levels might also lead to an increased runtime.
+A good tmax is usually the square root of the number of user bins rounded to the next multiple of 64. Note that your
+tmax will be rounded to the next multiple of 64 anyway.
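+
+The rule of thumb above can be written as a formula. This is only a sketch: the symbol \f$n\f$ for the number of user
+bins is our own notation, and we assume that "rounded to the next multiple" means rounding up.
+
+\f[
+t_{\max} = 64 \cdot \left\lceil \frac{\sqrt{n}}{64} \right\rceil
+\f]
+
+For example, \f$n = 1024\f$ user bins give \f$\sqrt{1024} = 32\f$ and thus
+\f$t_{\max} = 64 \cdot \lceil 32 / 64 \rceil = 64\f$, while \f$n = 10000\f$ user bins give \f$\sqrt{10000} = 100\f$ and
+thus \f$t_{\max} = 64 \cdot \lceil 100 / 64 \rceil = 128\f$.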
+
+\note
+At the expense of a longer runtime, you can enable the statistic mode that determines the best tmax using the option
+`--determine-best-tmax`.
+When this flag is set, the program computes multiple layouts for tmax in [64, 128, 256, ..., tmax] as well as
+tmax=sqrt(number of user bins). The layout algorithm itself only optimizes the space consumption. When determining the
+best layout, we additionally keep track of the average number of queries needed to traverse each layout. This query cost
+is taken into account when determining the best tmax for your data.
+Note that the option `--tmax` serves as an upper bound. Once the layout quality starts dropping, the computation is
+stopped. To compute all layouts up to `--tmax` regardless of the layout quality, pass the flag `--force-all-binnings`.
+This flag has no effect unless `--determine-best-tmax` is set.
+
+By default, we then get `binning.out` as the output file.
+
+Now we can pass the resulting layout to raptor to build the index.
+
+\note
+Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`.
+
+
+## Additional parameters
+
+Now let's look at the additional parameters of the layout:
+
+- `--output-filename`
+      A file name for the resulting layout. Default: "binning.out".
+- `--threads`
+      The number of threads to be used for parallel processing. (The computation of the k-mer hashes is parallelized.)
+
+Parameter Tweaking:
+- `--kmer-size`
+      The k-mer size influences the estimated counts. Choosing a k-mer size that is too small for your data will
+      result in files appearing more similar than they really are. Likewise, a large k-mer size might miss out on
+      certain similarities. For DNA sequences, a k-mer size in the range [16,32] has proven to work well. Default: 19.
+- `--sketch-bits`
+      The number of bits the HyperLogLog sketch should use to distribute the values into bins. Default: 12. Value
+      must be in range [5,32].
+- `--disable-sketch-output`
+      Although the sketches will improve the layout, you might want to disable writing the sketch files to disk.
+      Doing so will save disk space. However, you cannot use either --estimate-union or --rearrange-user-bins in
+      the layout computation without the sketches. Note that this option does not decrease run time as sketches
+      have to be computed either way.
+
+- `--num-hash-functions` (unsigned long)
+      The number of hash functions to use when building the HIBF from the resulting layout. This parameter is
+      needed to correctly estimate the index size when computing the layout. Default: 2.
+- `--false-positive-rate` (double)
+      The false positive rate you aim for when building the HIBF from the resulting layout. This parameter is
+      needed to correctly estimate the index size when computing the layout. Default: 0.05.
+
+HyperLogLog Sketches:
+To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches.
+- `--estimate-union`
+      Use sketches to estimate the sequence similarity among a set of user bins. This will improve the layout
+      computation as merging user bins that do not increase technical bin sizes will be preferred. Attention: Only
+      possible if the directory [INPUT-PREFIX]_sketches is present.
+- `--rearrange-user-bins`
+      As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may
+      lead to favourable small unions and thus a smaller index. Attention: Also enables --estimate-union and is
+      only possible if the directory [INPUT-PREFIX]_sketches is present.
+
+Both of these options improve the layout but increase the runtime of the layout computation.
+
+Parameter Tweaking:
+- `--alpha`
+      The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of
+      optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins
+      influences the query time because a merged bin always triggers another search on a lower level. To influence
+      this ratio, alpha can be used. The higher alpha, the fewer merged bins are chosen in the layout. This
+      improves query times but leads to a bigger index. Default: 1.2.
+- `--max-rearrangement-ratio`
+      When the option --rearrange-user-bins is set, this option can influence the rearrangement algorithm. The
+      algorithm only rearranges the order of user bins in fixed intervals. The higher --max-rearrangement-ratio,
+      the larger the intervals. This potentially improves the layout, but increases the runtime of the layout
+      algorithm. Default: 0.5. Value must be in range [0,1].
+
+A call could then look like this:
+```bash
+raptor layout --input-file all_bin_paths.txt \
+              --tmax 64 \
+              --kmer-size 16 \
+              --sketch-bits 5 \
+              --num-hash-functions 3 \
+              --false-positive-rate 0.25 \
+              --estimate-union \
+              --rearrange-user-bins \
+              --alpha 1.5 \
+              --max-rearrangement-ratio 0.25 \
+              --threads 4 \
+              --output-filename binning.layout
+```
+
+An assignment follows in the index tutorial on the HIBF.
diff --git a/doc/tutorial/03_index/index.md b/doc/tutorial/03_index/index.md
index 9d0e4e51..cb9faa3b 100644
--- a/doc/tutorial/03_index/index.md
+++ b/doc/tutorial/03_index/index.md
@@ -2,7 +2,7 @@
 
 You will learn how to construct a Raptor index of large collections of nucleotide sequences.
 
-\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
+\tutorial_head{Easy, 30 min, \ref tutorial_first_steps \, for using HIBF: \ref tutorial_layout,
 [<b>Interleaved Bloom Filter (IBF)</b>](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
 \ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}
 
@@ -240,22 +240,86 @@ does not need any special parameters.
 /todo This is a placeholder section and needs more information including the chopper layout.
 
 Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
-(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf` can be used. This uses a more
+(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf`. This uses a more
 space-saving method of storing the bins. It distinguishes between the user bins, which reflect the individual samples as
 before, and the so-called technical bins, which throw some bins together. This is especially useful when there are
 samples of very different sizes.
+
+To use the HIBF, a layout must be created before creating an index. We have written a separate tutorial for this:
+\ref tutorial_layout.
+
+### HIBF indexing with the use of the layout
+
+The layout file replaces `all_bin_paths.txt` as the input and is passed via the HIBF parameter: `--hibf binning.out`.
+
 Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. But we can
-offer the option to name the desired false positive rate with `--fpr`.
+offer the option to specify the desired false positive rate with `--fpr`. For example, a call could look like this:
 
 ```bash
-raptor build --kmer 19 --hash 3 --hibf --fpr 0.1 --output raptor.index all_bin_paths.txt
+raptor build --hibf binning.layout \
+             --kmer 16 \
+             --window 20 \
+             --hash 3 \
+             --fpr 0.25 \
+             --threads 2 \
+             --output hibf.index
+```
+
+\assignment{Assignment 4: A default HIBF}
+This time we start with the default parameters. Afterwards, we will look at the possibilities to improve the index.
+
+With our small example, we can see neither the advantages of the HIBF nor the differences caused by changing the
+parameters. So let's not go back to our small example from above, but use the one from the introduction:
+
+```console
+$ tree -L 2 example_data
+example_data
+├── 1024
+│   ├── bins
+│   └── reads
+└── 64
+    ├── bins
+    └── reads
 ```
+Use the data in the `1024` folder.
+
+\hint
+To create the `all_bin_paths.txt` you can use:
+```bash
+seq -f "example_data/1024/bins/bin_%02g.fasta" 0 1 1023 > all_bin_paths.txt
+```
+\endhint
+
+Now run `raptor layout` and `raptor build` with their default parameters and call the new index `hibf_raptor.index`.
+
+\endassignment
+
+\solution
+You should have run:
+```bash
+raptor layout --input-file all_bin_paths.txt --tmax 64
+raptor build --hibf binning.out --output hibf_raptor.index
+```
+
+\hint
+Your `tmax` is the square root of `1024`, which is `32`, rounded up to the next multiple of `64`, so we take `64`.
+\endhint
+
+Your directory should look like this:
+```bash
+tmp$ ls -la
+...
+```
+\endsolution
+
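+\note
+Important: The `--false-positive-rate` passed to `raptor layout` must be the same as the `--fpr` passed to
+`raptor build`. The same holds for the k-mer size.
+
+/todo Check whether this also applies to the number of hash functions.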
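+
+Since the layout and the index should be computed with matching parameters (k-mer size, false positive rate), one way
+to avoid mismatches is to define them once and reuse them in both calls. The following is only a sketch: the variable
+and function names are made up for illustration, the values are examples, and keeping the number of hash functions
+identical is an assumption.
+
+```bash
+# Shared settings, defined once (example values only).
+kmer_size=19
+num_hash_fn=2   # assumption: also kept identical for both calls
+fpr=0.05
+
+# Hypothetical helper: compute the layout with the shared settings.
+run_layout() {
+    raptor layout --input-file all_bin_paths.txt \
+                  --tmax 64 \
+                  --kmer-size "$kmer_size" \
+                  --num-hash-functions "$num_hash_fn" \
+                  --false-positive-rate "$fpr" \
+                  --output-filename binning.out
+}
+
+# Hypothetical helper: build the index from that layout with the same settings.
+run_build() {
+    raptor build --hibf binning.out \
+                 --kmer "$kmer_size" \
+                 --hash "$num_hash_fn" \
+                 --fpr "$fpr" \
+                 --output hibf_raptor.index
+}
+```
+
+Calling `run_layout && run_build` then runs both steps with matching parameters.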
 
 \note
 For a detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the
 `raptor::hierarchical_interleaved_bloom_filter` API.
 
-\assignment{Assignment 4: HIBF}
+\assignment{Assignment 5: HIBF with useful parameters}
 Lets use the HIBF for our small example. Thus run the example above with a false positive rate of 0.05 and call the new
 index `hibf_raptor.index`.
 As our example is small, we will keep the kmer size of 4
@@ -267,7 +331,16 @@ As our example is small, we will keep the kmer size of 4
 \solution
 You should have run:
 ```bash
+raptor layout --input-file all_paths.txt --tmax 64 --kmer-size 4 --false-positive-rate 0.05
+raptor build --hibf binning.out --kmer 4 --fpr 0.05 --output hibf_raptor.index
 raptor build --kmer 4 --hibf --fpr 0.05 --output hibf_raptor.index all_paths.txt
+# General form: the shell variables are placeholders that must be set beforehand.
+raptor build --kmer "$kmer_size" \
+             --window "$kmer_size" \
+             --hash "$num_hash_fn" \
+             --fpr "$fpr" \
+             --threads "$threads" \
+             --output hibf.index \
+             --hibf "$layout"
 ```
 /todo Currently not working: `[Error] The list of input files cannot be empty.`