From e91139a194d4d374fba9c344ba9cd617eb0fc5b0 Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Tue, 15 Nov 2022 14:42:59 +0100 Subject: [PATCH 1/7] [DOC] Add a raptor layout tutorial Signed-off-by: Lydia Buntrock --- doc/tutorial/02_layout/index.md | 165 ++++++++++++++++++++++++++++++++ doc/tutorial/03_index/index.md | 94 ++++++++++++++++-- 2 files changed, 249 insertions(+), 10 deletions(-) create mode 100644 doc/tutorial/02_layout/index.md diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md new file mode 100644 index 00000000..faa0f9e9 --- /dev/null +++ b/doc/tutorial/02_layout/index.md @@ -0,0 +1,165 @@ +# Create a layout with Raptor {#tutorial_layout} + +You will learn how to construct a Raptor layout of large collections of nucleotide sequences. + +\attention +This is a tutorial only if you want to use the advanced option `--hibf` for the `raptor index`. +You can skip this chapter if you want to use raptor with the default IBF. + +\tutorial_head{Easy, 30 min, \ref tutorial_first_steps, +[Interleaved Bloom Filter (IBF)](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\, +\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"} + +[TOC] + +# IBF vs HIBF + +Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter +(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It +distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins, +which throw some bins together. This is especially useful when there are samples of very different sizes. + +To use the HIBF, a layout must be created + +# Create a Layout of the HIBF + +To realise this distinction between user bins and technical bins, a layout must be calculated before creating an index. +For this purpose we have developed our own tool [Chopper](https://www.seqan.de/apps/chopper.html) and integrated it into +raptor. So you can simply call it up with `raptor layout` without having to install Chopper separately. + +## General Idea & main parameters + +\image html hibf.svg width=40% + +The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout. +The first step is to estimate the number of (representative) k-mers per user bin by computing +[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL +sketches are stored in a directory and will be used in computing an HIBF layout. The HIBF layout tries to minimize the +disk space consumption of the resulting index. The space is estimated using a k-mer count per user bin which represents +the potential denisity in a technical bin in an interleaved Bloom filter. + +Using all default values a first call will look like: + +```bash +raptor layout --input-file all_bin_path.txt --tmax 64 +``` + +The `input-file` looks exactly as in our previous calls of `raptor index`; it contains all the paths of our database +files. + +\todo Chopper braucht ein `input_data.tsv` input, wobei es momentan nur eine Spalte (mit den Pfaden) gibt, also geht auch `.txt`. + +The parameter `--tmax` limits the number of technical bins on each level of the HIBF. Choosing a good `tmax` is not +trivial. The smaller `tmax`, the more levels the layout needs to represent the data. This results in a higher space +consumption of the index. While querying each individual level is cheap, querying many levels might also lead to an +increased runtime. A good `tmax` is usually the square root of the number of user bins rounded to the next multiple of +`64`. Note that your `tmax` will be rounded to the next multiple of 64 anyway. + +\note +At the expense of a longer runtime, you can enable the statistic mode that determines the best `tmax` using the option +`--determine-best-tmax`. +When this flag is set, the program will compute multiple layouts for `tmax` in `[64 , 128, 256, ... , tmax]` as well as +`tmax = sqrt(number of user bins)`. The layout algorithm itself only optimizes the space consumption. When determining +the best layout, we additionally keep track of the average number of queries needed to traverse each layout. This query +cost is taken into account when determining the best `tmax` for your data. +Note that the option `--tmax` serves as upper bound. Once the layout quality starts dropping, the computation is +stopped. To run all layout computations, pass the flag `--force-all-binnings`. +The flag `--force-all-binnings` forces all layouts up to `--tmax` to be computed, regardless of the layout quality. If +the flag `--determine-best-tmax` is not set, this flag is ignored and has no effect. + +We then get the resulting layout (default: `binning.out`) as an output file, which we then pass to Raptor to create the +index. You can change this default with `--output-filename`. + +\note +Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`. + +## Additional parameters + +To create an index and thus a layout, the individual samples of the data set are chopped up into k-mers and determine in +their so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions. This means that +a k-mer from sample `i` marks in bin `i` with `j` hash functions `j` bits with a `1`. +If a query is then searched, its k-mers are thrown into the hash functions and looked at in which bins it only points +to ones. This can also result in false positives. Thus, the result only indicates that the query is probably part of a +sample. + +This principle also applies to the Hierarchical Interleaved Bloom Filter, except that the bins are then stored even more +efficiently as described above and this is described by the layout. This means that you already have to know some +parameters for the layout, which you would otherwise specify in the index: + +With `--kmer-size` you can specify the length of the k-mers, which should be long enough to avoid random hits. +By using multiple hash functions, you can sometimes further reduce the possibility of false positives +(`--num-hash-functions`). We found a useful [Bloom Filter Calculator](https://hur.st/bloomfilter/) to get a calculation +if it could help. As it is not ours, we do not guarantee its accuracy. + +Each Bloom filter has a bit vector length that, across all Bloom filters, gives the size of the Interleaved Bloom +filter, which we can specify in the IBF case. Since the HIBF calculates the size of the index itself, it is no longer +possible to specify a size here. But we can offer the option to name the desired false positive rate with +`--false-positive-rate`. + +\note These parameters must be set identically for `raptor index`. + +\todo This is not checked at the moment? + + + +Thus, for example, a call looks like this: + + +Parameter Tweaking: +- `--sketch-bits` + The number of bits the HyperLogLog sketch should use to distribute the values into bins. Default: 12. Value + must be in range [5,32]. +- `--disable-sketch-output` + Although the sketches will improve the layout, you might want to disable writing the sketch files to disk. + Doing so will save disk space. However, you cannot use either --estimate-unions or --rearrange-user-bins in + chopper layout without the sketches. Note that this option does not decrease run time as sketches have to be + computed either way. + +HyperLogLog Sketches: +To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches. +- `--estimate-union` + Use sketches to estimate the sequence similarity among a set of user bins. This will improve the layout + computation as merging user bins that do not increase technical bin sizes will be preferred. Attention: Only + possible if the directory [INPUT-PREFIX]_sketches is present. +- `--rearrange-user-bins` + As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may + lead to favourable small unions and thus a smaller index. Attention: Also enables --estimate-union and is + only possible if the directory [INPUT-PREFIX]_sketches is present. + +<- die machen das layout besser dauern aber laenger + + Parameter Tweaking: +- `--alpha` + The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of + optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins + influences the query time because a merged bin always triggers another search on a lower level. To influence + this ratio, alpha can be used. The higher alpha, the less merged bins are chosen in the layout. This + improves query times but leads to a bigger index. Default: 1.2. +- `--max-rearrangement-ratio` + When the option --rearrange-user-bins is set, this option can influence the rearrangement algorithm. The + algorithm only rearranges the order of user bins in fixed intervals. The higher --max-rearrangement-ratio, + the larger the intervals. This potentially improves the layout, but increases the runtime of the layout + algorithm. Default: 0.5. Value must be in range [0,1]. + +A call could then look like this: +```bash +raptor layout --input-file all_bin_path.txt \ + --tmax 64 \ + --kmer-size 16 \ + --sketch-bits 5 \ + --num-hash-functions 3 \ + --false-positive-rate 0.25 \ + --estimate-union \ + --rearrange-user-bins \ + --alpha 1.5 \ + --max-rearrangement-ratio 0.25 \ + --threads 4 \ + --output-filename binning.layout +``` + +An assignment follows in the index tutorial on the HIBF. + + +### Parallelization + +Raptor supports parallelization. By specifying `--threads`, for example, the k-mer hashes are processed simultaneously. diff --git a/doc/tutorial/03_index/index.md b/doc/tutorial/03_index/index.md index 87c0c881..7ae7fc0f 100644 --- a/doc/tutorial/03_index/index.md +++ b/doc/tutorial/03_index/index.md @@ -2,7 +2,7 @@ You will learn how to construct a Raptor index of large collections of nucleotide sequences. -\tutorial_head{Easy, 30 min, \ref tutorial_first_steps, +\tutorial_head{Easy, 30 min, \ref tutorial_first_steps \, for using HIBF: \ref tutorial_layout, [Interleaved Bloom Filter (IBF)](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\, \ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"} @@ -192,7 +192,7 @@ tmp$ ls -la ... 107B 14 Nov 16:11 mini_1.fasta ... 107B 14 Nov 16:11 mini_2.fasta ... 107B 14 Nov 16:11 mini_3.fasta -... 8,0M 14 Nov 16:21 minimiser_raptor.index +... 8,0M 14 Nov 16:21 minimiser.index ... 256B 14 Nov 16:20 precomputed_minimisers/ ... 19B 10 Nov 16:40 query.fasta ... 1,2K 14 Nov 16:15 raptor.index @@ -237,27 +237,91 @@ does not need any special parameters. ## IBF vs HIBF {#hibf} -/todo This is a placeholder section and needs more information including the chopper layout. - Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter -(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf` can be used. This uses a more +(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf`. This uses a more space-saving method of storing the bins. It distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins, which throw some bins together. This is especially useful when there are samples of very different sizes. + +To use the HIBF, a layout must be created before creating an index. We have written an extra tutorial for this +\ref tutorial_layout. + +### HIBF indexing with the use of the layout + +The layout replaces the `all_bin_path.txt` and is given instead with the HIBF parameter: `--hibf binning.out`. + Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. But we can -offer the option to name the desired false positive rate with `--fpr`. +offer the option to name the desired false positive rate with `--fpr`. Thus, for example, a call looks like this: ```bash -raptor build --kmer 19 --hash 3 --hibf --fpr 0.1 --output raptor.index all_bin_paths.txt +raptor build --hibf binning.layout \ + --kmer 16 \ + --window 20 \ + --hash 3 \ + --fpr 0.25 \ + --threads 2 \ + --output hibf.index +``` + +\assignment{Assignment 4: A default HIBF} +We want to start this time with the default parameters and then we will look at all the possibilities to improve the +index. + +Since we cannot see the advantages of the hibf with our small example. And certainly not the differences when we change +the parameters. Let's not go back to our small example from above, but to the one from the introduction: + +```console +$ tree -L 2 example_data +example_data +├── 1024 +│   ├── bins +│   └── reads +└── 64 + ├── bins + └── reads +``` +And use the data of the `1024` Folder. + +Lets use the HIBF with the default parameters and call the new index `hibf.index`. + +/hint +To create the `all_bin_paths.txt` you can use: ``` +seq -f "example_data/1024/bins/bin_%02g.fasta" 0 1 1023 > all_bin_paths.txt +``` +\endhint + +Now run `raptor layout`and `raptor build` with its default parameters and call the new index `hibf_raptor.index`. + +\endassignment + +\solution +You should have run: +```bash +raptor layout --input-file all_bin_path.txt --tmax 64 +raptor build --hibf binning.out --output hibf.index +``` + +/hint +Your `tmax` is the squareroot of `1024`, which is `32`. Round this to a multiple of `64`, so we take `64`. + +Your directory should look like this: +```bash +tmp$ ls -la +... +``` +\endsolution + +\note wichtig!!! --false-positive-rate "$fpr" muss die selbe sein wie die von raptor Genauso die k-mer size. Hash functions auch oder? +\warning test \note For a detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the `raptor::hierarchical_interleaved_bloom_filter` API. -\assignment{Assignment 4: HIBF} +\assignment{Assignment 5: HIBF with usefull parameters} Lets use the HIBF for our small example. Thus run the example above with a false positive rate of 0.05 and call the new -index `hibf.index`. +index `hibf2.index`. As our example is small, we will keep the kmer size of 4 \hint ... @@ -267,7 +331,17 @@ As our example is small, we will keep the kmer size of 4 \solution You should have run: ```bash -raptor build --kmer 4 --hibf --fpr 0.05 --output hibf.index all_paths.txt +raptor layout --input-file all_bin_path.txt --output-filename binning2.out --tmax 64 +raptor build --hibf binning2.out --fpr 0.1 --output hibf2.index all_paths.txt + + raptor build --kmer 4 \ + --window "$kmer_size" \ + --hash 3 \ + --fpr 0.05 \ + --threads 4 \ + --output hibf2.index \ + --hibf binning2.out \ + all_paths.txt ``` /todo Currently not working: `[Error] The list of input files cannot be empty.` From b16f7a8e9ca85e29d19c36b09e87050b8ab1795e Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Wed, 7 Dec 2022 17:20:51 +0100 Subject: [PATCH 2/7] [DOC] Explain HLL sketches Signed-off-by: Lydia Buntrock --- doc/tutorial/02_layout/index.md | 122 +++++++++++++++++++++----------- 1 file changed, 80 insertions(+), 42 deletions(-) diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md index faa0f9e9..fb612f8f 100644 --- a/doc/tutorial/02_layout/index.md +++ b/doc/tutorial/02_layout/index.md @@ -34,9 +34,10 @@ raptor. So you can simply call it up with `raptor layout` without having to inst The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout. The first step is to estimate the number of (representative) k-mers per user bin by computing [HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL -sketches are stored in a directory and will be used in computing an HIBF layout. The HIBF layout tries to minimize the -disk space consumption of the resulting index. The space is estimated using a k-mer count per user bin which represents -the potential denisity in a technical bin in an interleaved Bloom filter. +sketches are stored in a directory and will be used in computing an HIBF layout. We will go into more detail later +\ref HLL. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated +using a k-mer count per user bin which represents the potential denisity in a technical bin in an Interleaved Bloom +filter. Using all default values a first call will look like: @@ -49,19 +50,19 @@ files. \todo Chopper braucht ein `input_data.tsv` input, wobei es momentan nur eine Spalte (mit den Pfaden) gibt, also geht auch `.txt`. -The parameter `--tmax` limits the number of technical bins on each level of the HIBF. Choosing a good `tmax` is not -trivial. The smaller `tmax`, the more levels the layout needs to represent the data. This results in a higher space -consumption of the index. While querying each individual level is cheap, querying many levels might also lead to an -increased runtime. A good `tmax` is usually the square root of the number of user bins rounded to the next multiple of -`64`. Note that your `tmax` will be rounded to the next multiple of 64 anyway. +The parameter `--tmax` limits the number of technical bins on each level of the HIBF. Choosing a good \f$t_{max}\f$ is +not trivial. The smaller \f$t_{max}\f$, the more levels the layout needs to represent the data. This results in a higher +space consumption of the index. While querying each individual level is cheap, querying many levels might also lead to +an increased runtime. A good \f$t_{max}\f$ is usually the square root of the number of user bins rounded to the next +multiple of `64`. Note that your \f$t_{max}\f$ will be rounded to the next multiple of 64 anyway. \note -At the expense of a longer runtime, you can enable the statistic mode that determines the best `tmax` using the option -`--determine-best-tmax`. -When this flag is set, the program will compute multiple layouts for `tmax` in `[64 , 128, 256, ... , tmax]` as well as -`tmax = sqrt(number of user bins)`. The layout algorithm itself only optimizes the space consumption. When determining -the best layout, we additionally keep track of the average number of queries needed to traverse each layout. This query -cost is taken into account when determining the best `tmax` for your data. +At the expense of a longer runtime, you can enable the statistic mode that determines the best \f$t_{max}\f$ using the +option `--determine-best-tmax`. +When this flag is set, the program will compute multiple layouts for \f$t_{max}\f$ in `[64 , 128, 256, ... , tmax]` as +well as `tmax = sqrt(number of user bins)`. The layout algorithm itself only optimizes the space consumption. When +determining the best layout, we additionally keep track of the average number of queries needed to traverse each layout. +This query cost is taken into account when determining the best \f$t_{max}\f$ for your data. Note that the option `--tmax` serves as upper bound. Once the layout quality starts dropping, the computation is stopped. To run all layout computations, pass the flag `--force-all-binnings`. The flag `--force-all-binnings` forces all layouts up to `--tmax` to be computed, regardless of the layout quality. If @@ -91,8 +92,8 @@ By using multiple hash functions, you can sometimes further reduce the possibili (`--num-hash-functions`). We found a useful [Bloom Filter Calculator](https://hur.st/bloomfilter/) to get a calculation if it could help. As it is not ours, we do not guarantee its accuracy. -Each Bloom filter has a bit vector length that, across all Bloom filters, gives the size of the Interleaved Bloom -filter, which we can specify in the IBF case. Since the HIBF calculates the size of the index itself, it is no longer +Each Bloom Filter has a bit vector length that, across all Bloom Filters, gives the size of the Interleaved Bloom +Filter, which we can specify in the IBF case. Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. But we can offer the option to name the desired false positive rate with `--false-positive-rate`. @@ -100,33 +101,74 @@ possible to specify a size here. But we can offer the option to name the desired \todo This is not checked at the moment? +### HyperLogLog sketch {#HLL} +The first step is to estimate the number of (representative) k-mers per user bin by computing +[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL +sketches are stored in a directory and will be used in computing an HIBF layout. + +We will also give a short explanation of the HLL sketches here to explain the possible parameters. + +\note +Most parameters are advanced and only need to be changed if the calculation takes significantly too long or the memory +usage is too high. + +So the question is how many elements of our multiset are identical? +With exact calculation, we need more storage space and runtime than with the HLL estimate. +So, to find this out, we form (binary) 64 bit hash values of the data. These are equally distributed over all possible +hash values. If you go through this hashed data, you can then estimate how many different elements you have seen so far +only by reading leading zeros. (For the i'th element with `p` leading zeros, it is estimated that you have seen +\f$2^p\f$ different elements). You then simply save the maximum of these (\f$2^{p_{max}}\f$ different elements). + + + +However, if we are unlucky and come across a hash value that consists of only `0`'s, then \f$p_{max}\f$ is of course +maximum of all possible hash values, no matter how many different elements are actually present. +To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From +these we then calculate the *harmonic mean* as the total \f$p_{max}\f$. + +We can influence this m with `--sketch-bits`. `m` must be a power of two so that we can divide the `64` bit evenly, so +we use `--sketch-bits` to set a `b` with \f$m = 2^b\f$. -Thus, for example, a call looks like this: +If we choose our `b` (`m`) to be very large, then we need more memory but get higher accuracy. (Storage consumption is +growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user +bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). +\todo +Wird bisher ein sketch über alles berechnet oder einzelne sketches die gemerged werden? Laut Felix ist das mergen ebenfalls sehr schnell. (zb 10 genome sketchen und dann mergen) -Parameter Tweaking: -- `--sketch-bits` - The number of bits the HyperLogLog sketch should use to distribute the values into bins. Default: 12. Value - must be in range [5,32]. -- `--disable-sketch-output` - Although the sketches will improve the layout, you might want to disable writing the sketch files to disk. - Doing so will save disk space. However, you cannot use either --estimate-unions or --rearrange-user-bins in - chopper layout without the sketches. Note that this option does not decrease run time as sketches have to be - computed either way. +#### Advanced options for HLL sketches -HyperLogLog Sketches: -To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches. -- `--estimate-union` - Use sketches to estimate the sequence similarity among a set of user bins. This will improve the layout - computation as merging user bins that do not increase technical bin sizes will be preferred. Attention: Only - possible if the directory [INPUT-PREFIX]_sketches is present. -- `--rearrange-user-bins` - As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may - lead to favourable small unions and thus a smaller index. Attention: Also enables --estimate-union and is - only possible if the directory [INPUT-PREFIX]_sketches is present. +The following options should only be touched if the calculation takes a long time. -<- die machen das layout besser dauern aber laenger +We have implemented another preprocessing that summarises the technical bins even better with regard to the similarities +of the input data. This can be switched off with the flag `--skip-similarity-preprocessing` if it costs too much +runtime. + +\todo +Add parameter `--skip-similarity-preprocessing` instead of `--estimate-union` and `--rearrange-user-bins`. Set it as +advanced. + +\todo +`--disable-sketch-output` wahrscheinlich unsinnig, da nur der zwischenstand zwischen count und layout + +With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If +you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and +memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however, +we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning. + +\todo +Wenn r=1, dann `--rearrange-user-bins` aus, daher die flag nicht nötig. + +One last observation about these advanced options: If you expect hardly any similarity in the data set, then the +similarity preprocessing makes very little difference. + +### Others + +- beeinflusst die laufzeit gar nicht, ist für DP +- alpha ist ein schätzer für das mergen +- alpha macht mergen teurer +- kleines alpha macht es günstiger Parameter Tweaking: - `--alpha` @@ -135,11 +177,7 @@ To improve the layout, you can estimate the sequence similarities using HyperLog influences the query time because a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used. The higher alpha, the less merged bins are chosen in the layout. This improves query times but leads to a bigger index. Default: 1.2. -- `--max-rearrangement-ratio` - When the option --rearrange-user-bins is set, this option can influence the rearrangement algorithm. The - algorithm only rearranges the order of user bins in fixed intervals. The higher --max-rearrangement-ratio, - the larger the intervals. This potentially improves the layout, but increases the runtime of the layout - algorithm. Default: 0.5. Value must be in range [0,1]. + A call could then look like this: ```bash From 1054d9d9613410bf8e3c13fc7840674e52044a27 Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Mon, 12 Dec 2022 11:59:30 +0100 Subject: [PATCH 3/7] [DOC] Add explanation for advanced option alpha. Signed-off-by: Lydia Buntrock --- doc/tutorial/02_layout/index.md | 62 ++++++++++++++------------------- 1 file changed, 27 insertions(+), 35 deletions(-) diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md index fb612f8f..b2a9907d 100644 --- a/doc/tutorial/02_layout/index.md +++ b/doc/tutorial/02_layout/index.md @@ -101,6 +101,10 @@ possible to specify a size here. But we can offer the option to name the desired \todo This is not checked at the moment? +### Parallelization + +Raptor supports parallelization. By specifying `--threads`, for example, the k-mer hashes are processed simultaneously. + ### HyperLogLog sketch {#HLL} The first step is to estimate the number of (representative) k-mers per user bin by computing @@ -137,6 +141,20 @@ bins and observe a long runtime, then it is worth choosing a somewhat smaller `b \todo Wird bisher ein sketch über alles berechnet oder einzelne sketches die gemerged werden? Laut Felix ist das mergen ebenfalls sehr schnell. (zb 10 genome sketchen und dann mergen) +A call could then look like this: +```bash +raptor layout --input-file all_bin_path.txt \ + --tmax 64 \ + --kmer-size 16 \ + --sketch-bits 5 \ + --num-hash-functions 3 \ + --false-positive-rate 0.25 \ + --threads 4 \ + --output-filename binning.layout +``` + +An assignment follows in the index tutorial on the HIBF. + #### Advanced options for HLL sketches The following options should only be touched if the calculation takes a long time. @@ -163,41 +181,15 @@ Wenn r=1, dann `--rearrange-user-bins` aus, daher die flag nicht nötig. One last observation about these advanced options: If you expect hardly any similarity in the data set, then the similarity preprocessing makes very little difference. -### Others - -- beeinflusst die laufzeit gar nicht, ist für DP -- alpha ist ein schätzer für das mergen -- alpha macht mergen teurer -- kleines alpha macht es günstiger +### Another advanced Option: alpha - Parameter Tweaking: -- `--alpha` - The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of - optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins - influences the query time because a merged bin always triggers another search on a lower level. To influence - this ratio, alpha can be used. The higher alpha, the less merged bins are chosen in the layout. This - improves query times but leads to a bigger index. Default: 1.2. +You should only touch the parameter `--alpha` if you have understood very well how the layout works and you are +dissatisfied with the resulting index, e.g. there is still a lot of space in RAM but the index is very slow. +The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of optimizing the +runtime for querying such an HIBF. In general, the ratio of merged bins and split bins influences the query time because +a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used. -A call could then look like this: -```bash -raptor layout --input-file all_bin_path.txt \ - --tmax 64 \ - --kmer-size 16 \ - --sketch-bits 5 \ - --num-hash-functions 3 \ - --false-positive-rate 0.25 \ - --estimate-union \ - --rearrange-user-bins \ - --alpha 1.5 \ - --max-rearrangement-ratio 0.25 \ - --threads 4 \ - --output-filename binning.layout -``` - -An assignment follows in the index tutorial on the HIBF. - - -### Parallelization - -Raptor supports parallelization. By specifying `--threads`, for example, the k-mer hashes are processed simultaneously. +Alpha is a parameter for weighting the storage space calculation of the lower-level IBFs. It functions as a lower-level +penalty, i.e. if alpha is large, the DP algorithm tries to avoid lower levels, which at the same time leads to the +top-level IBF becoming somewhat larger. This improves query times but leads to a bigger index. From 6ac3143121b1573c9ea689cc2cd10f2a234754ab Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Mon, 12 Dec 2022 13:28:36 +0100 Subject: [PATCH 4/7] [DOC] Add some assignments to the layout docu Signed-off-by: Lydia Buntrock --- doc/tutorial/02_layout/index.md | 131 ++++++++++++++++++++++++++++---- 1 file changed, 117 insertions(+), 14 deletions(-) diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md index b2a9907d..4fb02dd8 100644 --- a/doc/tutorial/02_layout/index.md +++ b/doc/tutorial/02_layout/index.md @@ -74,6 +74,86 @@ index. You can change this default with `--output-filename`. \note Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`. +\assignment{Assignment 1: Create a first layout} +Lets take the bigger example form the introduction \ref tutorial_first_steps and create a layout for it. +```console +$ tree -L 2 example_data +example_data +├── 1024 +│   ├── bins +│   └── reads +└── 64 + ├── bins + └── reads +``` +And use the data of the `1024` Folder. + +\hint +First we need a file with all paths to the fasta files. For this use the command: +```bash +for i in {0001..1023}; do echo "1024/bins/bin_$i.fasta" >> all_bin_paths.txt; done +``` +\endhint + +Then first determine the best tmax value and calculate a layout with default values and this tmax. +\endassignment + +\solution +Your `all_bin_paths.txt` should look like: +```txt +1024/bins/bin_0001.fasta +1024/bins/bin_0002.fasta +1024/bins/bin_0003.fasta +... +1024/bins/bin_1021.fasta +1024/bins/bin_1022.fasta +1024/bins/bin_1023.fasta +``` + +/note +Sometimes it would be better to use the absolute paths instead. + +And you should have run: +```bash +raptor layout --input-file all_bin_paths.txt --determine-best-tmax --tmax 64 +``` +With the output: +```bash +## ### Parameters ### +## number of user bins = 1023 +## number of hash functions = 2 +## false positive rate = 0.05 +## ### Notation ### +## X-IBF = An IBF with X number of bins. +## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level. +## ### Column Description ### +## tmax : The maximum number of technical bin on each level +## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF +## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF +## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF +## (l*m)_tmax : Computed by l_tmax * m_tmax +## size : The expected total size of an tmax-HIBF +# tmax c_tmax l_tmax m_tmax (l*m)_tmax size +64 1.00 2.00 1.00 2.00 12.8MiB +# Best t_max (regarding expected query runtime): 64 +``` +And afterwards: +```bash +raptor layout --input-file all_bin_paths.txt --tmax 64 +``` +Your directory should look like this: +```bash +$ ls +1024/ all_bin_paths.txt chopper_sketch.count mini/ +64/ binning.out chopper_sketch_sketches/ +``` + +\note +We will use this mini-example in the following, both with further parameters and then for `raptor index --hibf`. +Therefore, we recommend not deleting the files including the built indexes. + +\endsolution + ## Additional parameters To create an index and thus a layout, the individual samples of the data set are chopped up into k-mers and determine in @@ -101,10 +181,47 @@ possible to specify a size here. But we can offer the option to name the desired \todo This is not checked at the moment? +A call could then look like this: +```bash +raptor layout --input-file all_bin_path.txt \ + --tmax 64 \ + --kmer-size 17 \ + --num-hash-functions 4 \ + --false-positive-rate 0.25 \ + --output-filename binning.layout +``` + ### Parallelization Raptor supports parallelization. By specifying `--threads`, for example, the k-mer hashes are processed simultaneously. + +\assignment{Assignment 2: Create a more specific layout} +Now lets run the above example with more parameters. + +Use the same `all_bin_paths.txt` and create a `binning2.out`. Take a kmer size of `16`, `3` hash functions, a false +positive rate of `0.1` and use `2` threads. +\endassignment + +\solution +And you should have run: +```bash +raptor layout --input-file all_bin_paths.txt \ + --tmax 64 \ + --kmer-size 16 \ + --num-hash-functions 3 \ + --false-positive-rate 0.1 \ + --threads 2 \ + --output-filename binning2.layout +``` +Your directory should look like this: +```bash +$ ls +1024/ all_bin_paths.txt binning2.layout chopper_sketch_sketches/ +64/ binning.out chopper_sketch.count mini/ +``` +\endsolution + ### HyperLogLog sketch {#HLL} The first step is to estimate the number of (representative) k-mers per user bin by computing @@ -141,20 +258,6 @@ bins and observe a long runtime, then it is worth choosing a somewhat smaller `b \todo Wird bisher ein sketch über alles berechnet oder einzelne sketches die gemerged werden? Laut Felix ist das mergen ebenfalls sehr schnell. (zb 10 genome sketchen und dann mergen) -A call could then look like this: -```bash -raptor layout --input-file all_bin_path.txt \ - --tmax 64 \ - --kmer-size 16 \ - --sketch-bits 5 \ - --num-hash-functions 3 \ - --false-positive-rate 0.25 \ - --threads 4 \ - --output-filename binning.layout -``` - -An assignment follows in the index tutorial on the HIBF. - #### Advanced options for HLL sketches The following options should only be touched if the calculation takes a long time. From 2d3f3d946709ae67b0425aad9dd646fb38313612 Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Mon, 12 Dec 2022 13:50:36 +0100 Subject: [PATCH 5/7] [DOC] Rework the hibf example for the index tutorial using the new layouts. Signed-off-by: Lydia Buntrock --- doc/tutorial/03_index/index.md | 83 +++++++++++----------------------- 1 file changed, 26 insertions(+), 57 deletions(-) diff --git a/doc/tutorial/03_index/index.md b/doc/tutorial/03_index/index.md index 7ae7fc0f..b47fd574 100644 --- a/doc/tutorial/03_index/index.md +++ b/doc/tutorial/03_index/index.md @@ -41,7 +41,7 @@ raptor build --size 1k --output raptor.index all_bin_paths.txt ``` \assignment{Assignment 1: Create example files and index them} -Copy and paste the following FASTA files to some location, e.g. the tmp directory: +Copy and paste the following FASTA files to some location, e.g. a tmp directory: mini_1.fasta: ```fasta @@ -263,10 +263,10 @@ raptor build --hibf binning.layout \ --output hibf.index ``` -\assignment{Assignment 4: A default HIBF} -We want to start this time with the default parameters and then we will look at all the possibilities to improve the -index. +\note wichtig!!! --false-positive-rate "$fpr" muss die selbe sein wie die von raptor Genauso die k-mer size. Hash functions auch oder? +\warning test +\assignment{Assignment 4: A default HIBF} Since we cannot see the advantages of the hibf with our small example. And certainly not the differences when we change the parameters. Let's not go back to our small example from above, but to the one from the introduction: @@ -280,74 +280,43 @@ example_data ├── bins └── reads ``` -And use the data of the `1024` Folder. +And use the data of the `1024` Folder and the two layouts we've created in the \ref tutorial_layout tutorial : +`binning.out` and `binning2.layout`. -Lets use the HIBF with the default parameters and call the new index `hibf.index`. +Lets use the HIBF with the default parameters and call the new indexes `hibf.index` and `hibf2.index`. +\note +Your kmer size, number of hash functions and the false positive rate must be the same as in the layout. -/hint -To create the `all_bin_paths.txt` you can use: -``` -seq -f "example_data/1024/bins/bin_%02g.fasta" 0 1 1023 > all_bin_paths.txt -``` +\hint +For the second layout we took a kmer size of `16`, `3` hash functions and a false positive rate of `0.1`. \endhint - -Now run `raptor layout`and `raptor build` with its default parameters and call the new index `hibf_raptor.index`. - \endassignment \solution You should have run: ```bash -raptor layout --input-file all_bin_path.txt --tmax 64 raptor build --hibf binning.out --output hibf.index +raptor build --hibf binning2.layout --kmer 16 --hash 3 --fpr 0.1 --output hibf2.index ``` -/hint -Your `tmax` is the squareroot of `1024`, which is `32`. Round this to a multiple of `64`, so we take `64`. - Your directory should look like this: ```bash -tmp$ ls -la -... +$ ls -la +... 384B 12 Dez 13:44 ./ +... 544B 12 Dez 12:14 ../ +... 128B 23 Sep 2020 1024/ +... 128B 23 Sep 2020 64/ +... 25K 12 Dez 12:52 all_bin_paths.txt +... 37K 12 Dez 12:56 binning.out +... 37K 12 Dez 13:26 binning2.layout +... 55K 12 Dez 13:26 chopper_sketch.count +... 32K 12 Dez 12:52 chopper_sketch_sketches/ +... 13M 12 Dez 13:44 hibf.index +... 7,2M 12 Dez 13:44 hibf2.index +... 64B 8 Dez 16:24 mini/ ``` \endsolution -\note wichtig!!! --false-positive-rate "$fpr" muss die selbe sein wie die von raptor Genauso die k-mer size. Hash functions auch oder? -\warning test - \note -For a detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the +For a more detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the `raptor::hierarchical_interleaved_bloom_filter` API. - -\assignment{Assignment 5: HIBF with usefull parameters} -Lets use the HIBF for our small example. Thus run the example above with a false positive rate of 0.05 and call the new -index `hibf2.index`. -As our example is small, we will keep the kmer size of 4 -\hint -... -\endhint -\endassignment - -\solution -You should have run: -```bash -raptor layout --input-file all_bin_path.txt --output-filename binning2.out --tmax 64 -raptor build --hibf binning2.out --fpr 0.1 --output hibf2.index all_paths.txt - - raptor build --kmer 4 \ - --window "$kmer_size" \ - --hash 3 \ - --fpr 0.05 \ - --threads 4 \ - --output hibf2.index \ - --hibf binning2.out \ - all_paths.txt -``` -/todo Currently not working: `[Error] The list of input files cannot be empty.` - -Your directory should look like this: -```bash -tmp$ ls -la -... -``` -\endsolution From c538eb1983d2e9f85106ad321a121e5c60dc20d3 Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Mon, 12 Dec 2022 16:21:38 +0100 Subject: [PATCH 6/7] [DOC] Add a search assignment for the hibf. Signed-off-by: Lydia Buntrock --- doc/tutorial/04_search/index.md | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/doc/tutorial/04_search/index.md b/doc/tutorial/04_search/index.md index b2f0d54b..262fdfa4 100644 --- a/doc/tutorial/04_search/index.md +++ b/doc/tutorial/04_search/index.md @@ -351,18 +351,31 @@ If no minimizers are used, our thresholding ensures that the (H)IBF gives no fal there are few false negatives, because the minimizer compression is not lossless. \assignment{Assignment 4: Search with hibf.} -We want to use the `hibf.index` from the index tutorial assignment again and use ... +We want to use the `hibf.index` from the index tutorial assignment again. -Lets search now with ..., with creating a `search6.output`. +Lets search now for the `1024/reads/mini.fastq` querries with creating a `search6.output`. \endassignment \solution You should have run ```bash -raptor search --index hibf.index --query query.fasta --output search6.output +raptor search --hibf --index hibf.index --query 1024/reads/mini.fastq --output search6.output ``` Your `search6.output` should look like: ```text +#0 1024/bins/bin_0712.fasta +#1 1024/bins/bin_0406.fasta ... +#1021 1024/bins/bin_0533.fasta +#1022 1024/bins/bin_0624.fasta +#QUERY_NAME USER_BINS +0 +1 +... +1047555 +1047556 ``` +\note +You can also calculate the second hibf index B and you will see that they will give slightly different results +(`diff search6.output search7.output`). \endsolution From b3e12204f347e30b3d611bdfc9a6e4805fb417e9 Mon Sep 17 00:00:00 2001 From: Lydia Buntrock Date: Tue, 13 Dec 2022 11:19:00 +0100 Subject: [PATCH 7/7] [DOC] Apply review Signed-off-by: Lydia Buntrock --- doc/tutorial/02_layout/index.md | 22 +++------------------- doc/tutorial/03_index/index.md | 3 --- doc/tutorial/04_search/index.md | 2 -- 3 files changed, 3 insertions(+), 24 deletions(-) diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md index 4fb02dd8..6fba8379 100644 --- a/doc/tutorial/02_layout/index.md +++ b/doc/tutorial/02_layout/index.md @@ -48,8 +48,6 @@ raptor layout --input-file all_bin_path.txt --tmax 64 The `input-file` looks exactly as in our previous calls of `raptor index`; it contains all the paths of our database files. -\todo Chopper braucht ein `input_data.tsv` input, wobei es momentan nur eine Spalte (mit den Pfaden) gibt, also geht auch `.txt`. - The parameter `--tmax` limits the number of technical bins on each level of the HIBF. Choosing a good \f$t_{max}\f$ is not trivial. The smaller \f$t_{max}\f$, the more levels the layout needs to represent the data. This results in a higher space consumption of the index. While querying each individual level is cheap, querying many levels might also lead to @@ -91,7 +89,7 @@ And use the data of the `1024` Folder. \hint First we need a file with all paths to the fasta files. For this use the command: ```bash -for i in {0001..1023}; do echo "1024/bins/bin_$i.fasta" >> all_bin_paths.txt; done +seq -f "1024/bins/bin_%04g.fasta" 0 1 1023 > all_bin_paths.txt ``` \endhint @@ -179,8 +177,6 @@ possible to specify a size here. But we can offer the option to name the desired \note These parameters must be set identically for `raptor index`. -\todo This is not checked at the moment? - A call could then look like this: ```bash raptor layout --input-file all_bin_path.txt \ @@ -228,7 +224,8 @@ The first step is to estimate the number of (representative) k-mers per user bin [HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL sketches are stored in a directory and will be used in computing an HIBF layout. -We will also give a short explanation of the HLL sketches here to explain the possible parameters. +We will also give a short explanation of the HLL sketches here to explain the possible parameters, whereby each bin is +sketched individually. \note Most parameters are advanced and only need to be changed if the calculation takes significantly too long or the memory @@ -255,9 +252,6 @@ If we choose our `b` (`m`) to be very large, then we need more memory but get hi growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). -\todo -Wird bisher ein sketch über alles berechnet oder einzelne sketches die gemerged werden? Laut Felix ist das mergen ebenfalls sehr schnell. (zb 10 genome sketchen und dann mergen) - #### Advanced options for HLL sketches The following options should only be touched if the calculation takes a long time. @@ -266,21 +260,11 @@ We have implemented another preprocessing that summarises the technical bins eve of the input data. This can be switched off with the flag `--skip-similarity-preprocessing` if it costs too much runtime. -\todo -Add parameter `--skip-similarity-preprocessing` instead of `--estimate-union` and `--rearrange-user-bins`. Set it as -advanced. - -\todo -`--disable-sketch-output` wahrscheinlich unsinnig, da nur der zwischenstand zwischen count und layout - With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however, we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning. -\todo -Wenn r=1, dann `--rearrange-user-bins` aus, daher die flag nicht nötig. - One last observation about these advanced options: If you expect hardly any similarity in the data set, then the similarity preprocessing makes very little difference. diff --git a/doc/tutorial/03_index/index.md b/doc/tutorial/03_index/index.md index b47fd574..b04d67ce 100644 --- a/doc/tutorial/03_index/index.md +++ b/doc/tutorial/03_index/index.md @@ -263,9 +263,6 @@ raptor build --hibf binning.layout \ --output hibf.index ``` -\note wichtig!!! --false-positive-rate "$fpr" muss die selbe sein wie die von raptor Genauso die k-mer size. Hash functions auch oder? -\warning test - \assignment{Assignment 4: A default HIBF} Since we cannot see the advantages of the hibf with our small example. And certainly not the differences when we change the parameters. Let's not go back to our small example from above, but to the one from the introduction: diff --git a/doc/tutorial/04_search/index.md b/doc/tutorial/04_search/index.md index 262fdfa4..3cd5e9b8 100644 --- a/doc/tutorial/04_search/index.md +++ b/doc/tutorial/04_search/index.md @@ -124,8 +124,6 @@ and apply a thresholding step to determine the membership of a query in a bin. The term representative indicates that the k-mer content could be transformed by a function which reduces its size and distribution, e.g. using minimizers. -\todo Mehr auf das image eingehen. - There are also a few parameters for the search, which we will now explain. Instead of an exact search, we can also allow errors in the query, this we achieve with a positive value of `--error`.