generated from seqan/app-template
Signed-off-by: Lydia Buntrock <[email protected]>
Showing 2 changed files with 218 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,140 @@ | ||
# Create a layout with Raptor {#tutorial_layout} | ||
|
||
You will learn how to construct a Raptor layout of large collections of nucleotide sequences. | ||
|
||
\attention | ||
This tutorial is only relevant if you want to use the advanced HIBF option. | ||
You can skip this chapter if you want to use Raptor with the default IBF. | ||
|
||
\tutorial_head{Easy, 30 min, \ref tutorial_first_steps, | ||
[<b>Interleaved Bloom Filter (IBF)</b>](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\, | ||
\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"} | ||
|
||
[TOC] | ||
|
||
# Create a Layout of large collections of nucleotide sequences | ||
|
||
To realise the distinction between user bins and technical bins, a layout must be computed. For this purpose, we have | ||
developed our own tool, [Chopper](https://www.seqan.de/apps/chopper.html), and integrated it into Raptor. You can | ||
simply call it with `raptor layout` without having to install Chopper separately. | ||
|
||
## General Idea & main parameters | ||
|
||
\image html hibf.svg width=40% | ||
|
||
The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout. | ||
The first step is to estimate the number of (representative) k-mers per user bin by computing HyperLogLog (HLL) sketches | ||
[1] of the input data. These HLL sketches are stored in a directory and will be used in computing an HIBF layout. The | ||
HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated using a k-mer | ||
count per user bin, which represents the potential density in a technical bin of an Interleaved Bloom Filter. | ||
|
||
Using all default values, a first call looks like this: | ||
|
||
```bash | ||
raptor layout --input-file all_bin_path.txt --tmax 64 | ||
``` | ||
|
||
The `input-file` looks exactly as in our previous calls of `raptor index`; it contains all the paths of our database | ||
files. | ||
|
||
\todo Chopper expects an `input_data.tsv` input. Since it currently has only one column (with the paths), a `.txt` file works as well. | ||
|
||
The parameter `tmax` limits the number of technical bins on each level of the HIBF. Choosing a good tmax is not trivial. | ||
The smaller tmax, the more levels the layout needs to represent the data. This results in a higher space consumption of | ||
the index. While querying each individual level is cheap, querying many levels might also lead to an increased runtime. | ||
A good tmax is usually the square root of the number of user bins rounded to the next multiple of 64. Note that your | ||
tmax will be rounded to the next multiple of 64 anyway. | ||
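
The rule of thumb above can be written down as a small calculation. This is only a sketch of the heuristic as described in the text, not Raptor's own code: the function name `suggested_tmax` is hypothetical, and it assumes that "rounded to the next multiple of 64" means rounding up.

```python
import math


def suggested_tmax(num_user_bins: int, multiple: int = 64) -> int:
    """Heuristic from the text: sqrt(number of user bins), rounded up to a multiple of 64.

    Assumption: "next multiple" is interpreted as rounding up.
    """
    if num_user_bins < 1:
        raise ValueError("need at least one user bin")
    root = math.isqrt(num_user_bins)
    if root * root < num_user_bins:
        root += 1  # ceiling of the square root
    # Round up to the next multiple; the smallest possible result is one multiple.
    return max(multiple, math.ceil(root / multiple) * multiple)


for bins in (1_000, 10_000, 100_000):
    print(bins, suggested_tmax(bins))
```

Under these assumptions, 10,000 user bins give a square root of 100 and therefore a suggested tmax of 128, which could then be passed to `--tmax`.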
|
||
\note | ||
At the expense of a longer runtime, you can enable the statistics mode, which determines the best tmax, using the option | ||
`--determine-best-tmax`. | ||
When this flag is set, the program computes multiple layouts for tmax in [64, 128, 256, ..., tmax] as well as for | ||
tmax=sqrt(number of user bins). The layout algorithm itself only optimizes the space consumption. When determining the | ||
best layout, we additionally keep track of the average number of queries needed to traverse each layout. This query cost | ||
is taken into account when determining the best tmax for your data. | ||
Note that the option `--tmax` serves as an upper bound. Once the layout quality starts dropping, the computation is | ||
stopped. To compute all layouts up to `--tmax` regardless of the layout quality, pass the flag `--force-all-binnings`. | ||
If the flag `--determine-best-tmax` is not set, `--force-all-binnings` is ignored and has no effect. | ||
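
To get a feeling for how many layouts the statistics mode may compute, the candidate values can be enumerated. The following is a rough sketch, not the actual implementation: `candidate_tmax_values` is a hypothetical helper, and it assumes that the square-root candidate is rounded up to a multiple of 64 and is only considered if it does not exceed `--tmax`. The early stop on dropping layout quality is not modelled.

```python
import math


def candidate_tmax_values(tmax: int, num_user_bins: int) -> list[int]:
    """Enumerate 64, 128, 256, ... up to tmax, plus the sqrt-based candidate."""
    candidates = set()
    t = 64
    while t <= tmax:
        candidates.add(t)
        t *= 2  # doubling sequence as listed in the text
    # Assumption: the sqrt candidate is rounded up to a multiple of 64
    # and bounded by tmax like every other candidate.
    root = math.ceil(math.sqrt(num_user_bins))
    sqrt_candidate = max(64, math.ceil(root / 64) * 64)
    if sqrt_candidate <= tmax:
        candidates.add(sqrt_candidate)
    return sorted(candidates)


print(candidate_tmax_values(512, 100_000))
```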
|
||
By default, we then get `binning.out` as the output file. | ||
|
||
Now we can pass the resulting layout to raptor to build the index. | ||
|
||
\note | ||
Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`. | ||
|
||
|
||
## Additional parameters | ||
|
||
Now let's look at the additional parameters of the layout: | ||
|
||
- `--output-filename` | ||
A file name for the resulting layout. Default: "binning.out". | ||
- `--threads` | ||
The number of threads to be used for parallel processing. (The k-mer hashes are computed in parallel.) | ||
|
||
Parameter Tweaking: | ||
- `--kmer-size` | ||
The k-mer size influences the estimated counts. Choosing a k-mer size that is too small for your data will | ||
result in files appearing more similar than they really are. Likewise, a large k-mer size might miss out on | ||
certain similarities. For DNA sequences, a k-mer size in the range [16,32] has proven to work well. Default: 19. | ||
- `--sketch-bits` | ||
The number of bits the HyperLogLog sketch should use to distribute the values into bins. Default: 12. Value | ||
must be in range [5,32]. | ||
- `--disable-sketch-output` | ||
Although the sketches will improve the layout, you might want to disable writing the sketch files to disk. | ||
Doing so will save disk space. However, you cannot use either `--estimate-union` or `--rearrange-user-bins` in | ||
the layout computation without the sketches. Note that this option does not decrease the runtime, as sketches have to be | ||
computed either way. | ||
|
||
- `--num-hash-functions` (unsigned long) | ||
The number of hash functions to use when building the HIBF from the resulting layout. This parameter is | ||
needed to correctly estimate the index size when computing the layout. Default: 2. | ||
- `--false-positive-rate` (double) | ||
The false positive rate you aim for when building the HIBF from the resulting layout. This parameter is | ||
needed to correctly estimate the index size when computing the layout. Default: 0.05. | ||
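
To see why `--num-hash-functions` and `--false-positive-rate` matter for the size estimate, it helps to look at the standard Bloom filter sizing formula. The sketch below uses that textbook formula; whether the layout computation uses exactly this expression is an assumption, and `bloom_filter_bits` is a hypothetical name.

```python
import math


def bloom_filter_bits(num_kmers: int, num_hash_functions: int = 2,
                      false_positive_rate: float = 0.05) -> int:
    """Bits needed so that num_kmers elements reach the target false positive rate.

    Derived from fpr = (1 - exp(-h * n / m)) ** h, solved for m.
    """
    h, p = num_hash_functions, false_positive_rate
    if not 0.0 < p < 1.0:
        raise ValueError("false positive rate must be in (0, 1)")
    return math.ceil(-h * num_kmers / math.log(1.0 - p ** (1.0 / h)))


# Compare the defaults from the text with a stricter false positive rate.
n = 1_000_000
print(bloom_filter_bits(n))                            # defaults: 2 hash functions, fpr 0.05
print(bloom_filter_bits(n, false_positive_rate=0.01))  # stricter fpr needs more bits
```

A lower false positive rate requires more bits per bin, so the values given here should match those used later when building the index.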
|
||
HyperLogLog Sketches: | ||
To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches. | ||
- `--estimate-union` | ||
Use sketches to estimate the sequence similarity among a set of user bins. This will improve the layout | ||
computation as merging user bins that do not increase technical bin sizes will be preferred. Attention: Only | ||
possible if the directory [INPUT-PREFIX]_sketches is present. | ||
- `--rearrange-user-bins` | ||
As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may | ||
lead to favourable small unions and thus a smaller index. Attention: Also enables --estimate-union and is | ||
only possible if the directory [INPUT-PREFIX]_sketches is present. | ||
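
The benefit of union estimation can be illustrated with exact sets. This is a toy sketch only: the layout works on HyperLogLog estimates rather than exact k-mer sets, and `merged_bin_size` as well as the tiny example k-mer sets are made up for illustration.

```python
def merged_bin_size(user_bins: list[set[str]]) -> int:
    """Size of a merged technical bin: the number of distinct k-mers in the union."""
    return len(set().union(*user_bins))


similar_a = {"ACGT", "CGTA", "GTAC"}
similar_b = {"ACGT", "CGTA", "TACG"}   # shares two k-mers with similar_a
distinct_c = {"TTTT", "GGGG", "CCCC"}  # shares nothing with similar_a

print(merged_bin_size([similar_a, similar_b]))   # 4
print(merged_bin_size([similar_a, distinct_c]))  # 6
```

Merging the two similar user bins yields a smaller technical bin than merging the dissimilar ones, which is the effect that `--estimate-union` and `--rearrange-user-bins` try to exploit.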
|
||
These two options improve the layout but take longer to run. | ||
|
||
Parameter Tweaking: | ||
- `--alpha` | ||
The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of | ||
optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins | ||
influences the query time because a merged bin always triggers another search on a lower level. To influence | ||
this ratio, alpha can be used. The higher alpha, the fewer merged bins are chosen in the layout. This | ||
improves query times but leads to a bigger index. Default: 1.2. | ||
- `--max-rearrangement-ratio` | ||
When the option --rearrange-user-bins is set, this option can influence the rearrangement algorithm. The | ||
algorithm only rearranges the order of user bins in fixed intervals. The higher --max-rearrangement-ratio, | ||
the larger the intervals. This potentially improves the layout, but increases the runtime of the layout | ||
algorithm. Default: 0.5. Value must be in range [0,1]. | ||
|
||
A call could then look like this: | ||
```bash | ||
raptor layout --input-file all_bin_path.txt \ | ||
--tmax 64 \ | ||
--kmer-size 16 \ | ||
--sketch-bits 5 \ | ||
--num-hash-functions 3 \ | ||
--false-positive-rate 0.25 \ | ||
--estimate-union \ | ||
--rearrange-user-bins \ | ||
--alpha 1.5 \ | ||
--max-rearrangement-ratio 0.25 \ | ||
--threads 4 \ | ||
--output-filename binning.layout | ||
``` | ||
|
||
An assignment follows in the index tutorial on the HIBF. |