From aed342eacd378a591085a20f19dcda45d43aec6d Mon Sep 17 00:00:00 2001
From: Lydia Buntrock
Date: Tue, 15 Nov 2022 14:42:59 +0100
Subject: [PATCH] [DOC] Add a raptor layout tutorial

Signed-off-by: Lydia Buntrock
---
 doc/tutorial/02_layout/index.md | 140 ++++++++++++++++++++++++++++++++
 doc/tutorial/03_index/index.md  |  83 +++++++++++++++++--
 2 files changed, 218 insertions(+), 5 deletions(-)
 create mode 100644 doc/tutorial/02_layout/index.md

diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md
new file mode 100644
index 00000000..a2381c52
--- /dev/null
+++ b/doc/tutorial/02_layout/index.md
@@ -0,0 +1,140 @@
+# Create a layout with Raptor {#tutorial_layout}
+
+You will learn how to construct a Raptor layout of large collections of nucleotide sequences.
+
+\attention
+This tutorial is only relevant if you want to use the advanced HIBF option.
+You can skip this chapter if you want to use Raptor with the default IBF.
+
+\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
+[Interleaved Bloom Filter (IBF)](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
+\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}
+
+[TOC]
+
+# Create a Layout of large collections of nucleotide sequences
+
+To realise the HIBF's distinction between user bins and technical bins, a layout must be calculated. For this purpose,
+we have developed our own tool, [Chopper](https://www.seqan.de/apps/chopper.html), and integrated it into Raptor. You can
+simply call it with `raptor layout` without having to install Chopper separately.
+
+## General Idea & main parameters
+
+\image html hibf.svg width=40%
+
+The figure above shows how the user bins are stored in the technical bins. The resulting tree represents the layout.
+The first step is to estimate the number of (representative) k-mers per user bin by computing HyperLogLog (HLL) sketches
+[1] of the input data. These HLL sketches are stored in a directory and are used to compute an HIBF layout. The
+HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated using a k-mer
+count per user bin, which represents the potential density in a technical bin in an interleaved Bloom filter.
+
+Using all default values, a first call looks like this:
+
+```bash
+raptor layout --input-file all_bin_path.txt --tmax 64
+```
+
+The `input-file` looks exactly as in our previous calls of `raptor build`; it contains all the paths of our database
+files.
+
+/todo Chopper expects an `input_data.tsv` input; since it currently has only one column (with the paths), `.txt` works too.
+
+The parameter `tmax` limits the number of technical bins on each level of the HIBF. Choosing a good tmax is not trivial.
+The smaller tmax, the more levels the layout needs to represent the data. This results in a higher space consumption of
+the index. While querying each individual level is cheap, querying many levels might also lead to an increased runtime.
+A good tmax is usually the square root of the number of user bins rounded to the next multiple of 64. Note that your
+tmax will be rounded to the next multiple of 64 anyway.
+
+\note
+At the expense of a longer runtime, you can enable the statistic mode that determines the best tmax using the option
+`--determine-best-tmax`.
+When this flag is set, the program will compute multiple layouts for tmax in [64, 128, 256, ..., tmax] as well as
+tmax=sqrt(number of user bins). The layout algorithm itself only optimizes the space consumption. When determining the
+best layout, we additionally keep track of the average number of queries needed to traverse each layout. This query cost
+is taken into account when determining the best tmax for your data.
+Note that the option `--tmax` serves as an upper bound. Once the layout quality starts dropping, the computation is
+stopped. To run all layout computations anyway, pass the flag `--force-all-binnings`.
+This flag forces all layouts up to `--tmax` to be computed, regardless of the layout quality. If
+the flag `--determine-best-tmax` is not set, `--force-all-binnings` is ignored and has no effect.
+
+By default, we then get `binning.out` as the output file.
+
+Now we can pass the resulting layout to Raptor to build the index.
+
+\note
+Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`.
+
+
+## Additional parameters
+
+Now let's look at the additional parameters of the layout:
+
+- `--output-filename`
+  A file name for the resulting layout. Default: "binning.out".
+- `--threads`
+  The number of threads to be used for parallel processing. (The k-mer hashing is parallelised.)
+
+Parameter Tweaking:
+- `--kmer-size`
+  The k-mer size influences the estimated counts. Choosing a k-mer size that is too small for your data will
+  result in files appearing more similar than they really are. Likewise, a large k-mer size might miss out on
+  certain similarities. For DNA sequences, a k-mer size in the range [16,32] has proven to work well. Default: 19.
+- `--sketch-bits`
+  The number of bits the HyperLogLog sketch should use to distribute the values into bins. Default: 12. Value
+  must be in range [5,32].
+- `--disable-sketch-output`
+  Although the sketches will improve the layout, you might want to disable writing the sketch files to disk.
+  Doing so will save disk space. However, you cannot use either --estimate-union or --rearrange-user-bins in
+  `raptor layout` without the sketches. Note that this option does not decrease run time as sketches have to be
+  computed either way.
+
+- `--num-hash-functions` (unsigned long)
+  The number of hash functions to use when building the HIBF from the resulting layout. This parameter is
+  needed to correctly estimate the index size when computing the layout. Default: 2.
+- `--false-positive-rate` (double)
+  The false positive rate you aim for when building the HIBF from the resulting layout. This parameter is
+  needed to correctly estimate the index size when computing the layout. Default: 0.05.
+
+HyperLogLog Sketches:
+To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches.
+- `--estimate-union`
+  Use sketches to estimate the sequence similarity among a set of user bins. This will improve the layout
+  computation as merging user bins that do not increase technical bin sizes will be preferred. Attention: Only
+  possible if the directory [INPUT-PREFIX]_sketches is present.
+- `--rearrange-user-bins`
+  As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may
+  lead to favourable small unions and thus a smaller index. Attention: Also enables --estimate-union and is
+  only possible if the directory [INPUT-PREFIX]_sketches is present.
+
+Both options improve the layout but increase the runtime.
+
+Parameter Tweaking:
+- `--alpha`
+  The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of
+  optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins
+  influences the query time because a merged bin always triggers another search on a lower level. To influence
+  this ratio, alpha can be used. The higher alpha, the fewer merged bins are chosen in the layout. This
+  improves query times but leads to a bigger index. Default: 1.2.
+- `--max-rearrangement-ratio`
+  When the option --rearrange-user-bins is set, this option can influence the rearrangement algorithm. The
+  algorithm only rearranges the order of user bins in fixed intervals. The higher --max-rearrangement-ratio,
+  the larger the intervals. This potentially improves the layout, but increases the runtime of the layout
+  algorithm. Default: 0.5. Value must be in range [0,1].
+
+A call could then look like this:
+```bash
+raptor layout --input-file all_bin_path.txt \
+              --tmax 64 \
+              --kmer-size 16 \
+              --sketch-bits 5 \
+              --num-hash-functions 3 \
+              --false-positive-rate 0.25 \
+              --estimate-union \
+              --rearrange-user-bins \
+              --alpha 1.5 \
+              --max-rearrangement-ratio 0.25 \
+              --threads 4 \
+              --output-filename binning.layout
+```
+
+An assignment follows in the index tutorial on the HIBF.
diff --git a/doc/tutorial/03_index/index.md b/doc/tutorial/03_index/index.md
index 9d0e4e51..cb9faa3b 100644
--- a/doc/tutorial/03_index/index.md
+++ b/doc/tutorial/03_index/index.md
@@ -2,7 +2,7 @@
 
 You will learn how to construct a Raptor index of large collections of nucleotide sequences.
 
-\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
+\tutorial_head{Easy, 30 min, \ref tutorial_first_steps \, for using HIBF: \ref tutorial_layout,
 [Interleaved Bloom Filter (IBF)](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
 \ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}
 
@@ -240,22 +240,86 @@ does not need any special parameters.
 /todo This is a placeholder section and needs more information including the chopper layout.
 
 Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
-(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf` can be used. This uses a more
+(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf`. This uses a more
 space-saving method of storing the bins. It distinguishes between the user bins, which reflect the individual samples as
 before, and the so-called technical bins, which throw some bins together. This is especially useful when there are
 samples of very different sizes.
+
+To use the HIBF, a layout must be created before creating an index. We have written a separate tutorial for this:
+\ref tutorial_layout.
+
+### HIBF indexing with the use of the layout
+
+The layout file replaces `all_bin_paths.txt` and is passed via the HIBF parameter instead: `--hibf binning.out`.
+
 Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. But we can
-offer the option to name the desired false positive rate with `--fpr`.
+offer the option to name the desired false positive rate with `--fpr`. For example, a call looks like this:
 
 ```bash
-raptor build --kmer 19 --hash 3 --hibf --fpr 0.1 --output raptor.index all_bin_paths.txt
+raptor build --hibf binning.layout \
+             --kmer 16 \
+             --window 20 \
+             --hash 3 \
+             --fpr 0.25 \
+             --threads 2 \
+             --output hibf.index
+```
+
+\assignment{Assignment 4: A default HIBF}
+This time, we start with the default parameters and then look at the possibilities to improve the
+index.
+
+With our small example, we cannot see the advantages of the HIBF, and certainly not the differences when we change
+the parameters. So let's not go back to our small example from above, but to the one from the introduction:
+
+```console
+$ tree -L 2 example_data
+example_data
+├── 1024
+│   ├── bins
+│   └── reads
+└── 64
+    ├── bins
+    └── reads
 ```
 
+We use the data in the `1024` folder.
+
+\hint
+To create the `all_bin_paths.txt` you can use:
+```
+seq -f "example_data/1024/bins/bin_%02g.fasta" 0 1 1023 > all_bin_paths.txt
+```
+\endhint
+
+Now run `raptor layout` and `raptor build` with their default parameters and call the new index `hibf_raptor.index`.
+
+\endassignment
+
+\solution
+You should have run:
+```bash
+raptor layout --input-file all_bin_paths.txt --tmax 64
+raptor build --hibf binning.out --output hibf_raptor.index
+```
+
+/hint
+Your `tmax` is the square root of `1024`, which is `32`. Rounded up to the next multiple of `64`, this gives `64`.
+
+Your directory should look like this:
+```bash
+tmp$ ls -la
+...
+```
+\endsolution
+
+\note Important!!!
+`--false-positive-rate "$fpr"` must be the same as the one passed to raptor. The same goes for the k-mer size. The hash functions too, right?
+\warning test
 \note
 For a detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the
 `raptor::hierarchical_interleaved_bloom_filter` API.
 
-\assignment{Assignment 4: HIBF}
+\assignment{Assignment 5: HIBF with useful parameters}
 Lets use the HIBF for our small example.
 Thus run the example above with a false positive rate of 0.05 and call the new index `hibf_raptor.index`.
 As our example is small, we will keep the kmer size of 4
@@ -267,7 +331,16 @@ As our example is small, we will keep the kmer size of 4
 \solution
 You should have run:
 ```bash
+raptor layout --input-file all_bin_path.txt --tmax 64
+raptor build --hibf binning.out --fpr 0.1 --output raptor.index
 raptor build --kmer 4 --hibf --fpr 0.05 --output hibf_raptor.index all_paths.txt
+    raptor build --kmer "$kmer_size" \
+                 --window "$kmer_size" \
+                 --hash $num_hash_fn \
+                 --fpr "$fpr" \
+                 --threads $task.cpus \
+                 --output hibf.index \
+                 --hibf $layout
 ```
 
 /todo Currently not working: `[Error] The list of input files cannot be empty.`