[DOC] Add a raptor layout tutorial #201
Conversation
Codecov Report — Base: 0.00% // Head: 100.00% // Increases project coverage by +100.00%.
Coverage diff (main → #201): Coverage 0% → 100.00% (+100.00%), Files 0 → 53 (+53), Lines 0 → 1593 (+1593), Hits 0 → 1593 (+1593).
View full report at Codecov.
force-pushed from aed342e to efddb9b
force-pushed from efddb9b to cca7793
force-pushed from 531a50a to 70f8fba
force-pushed from 70f8fba to fa02cc3
force-pushed from edffaf1 to d315bf0
force-pushed from d315bf0 to 23133de
force-pushed from 23133de to c538eb1
doc/tutorial/02_layout/index.md (Outdated)
advanced.

\todo
`--disable-sketch-output` probably pointless, since it is only the intermediate state between count and layout
`raptor layout` does both, so probably not that useful? It probably stores the sketches? Might be useful if you want to run `raptor layout` multiple times.
I had understood the flag to mean that you save the intermediate state so that you pass your result from `chopper count` to `chopper layout`. And @feldroop said that this has to happen here anyway. So you don't want to/can't switch this off.
Then we can just remove it :D
I opened an issue and removed this todo: #204
force-pushed from 5ad8eda to f7e105e
force-pushed from f7e105e to b3e1220
Found a bit of stuff. Some things you can maybe just ignore, because you know better than I do for which audience this tutorial is intended. Overall I think it is very nice that we have this :)
# IBF vs HIBF

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
This uses a more space-saving method of storing the bins
You could add here that the HIBF is also faster than the IBF in many cases. This is dependent on the search parameters (number of errors, thresholding), but I would expect that the HIBF is faster most of the time, especially when the number of bins is large. Is this correct @eseiler @smehringer ?
Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
which throw some bins together. This is especially useful when there are samples of very different sizes.
which throw some bins together
You could add that it also splits bins. Maybe merge is better than throw together.
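To make the merge/split idea concrete, here is a hypothetical sketch (invented file names and k-mer counts, and not chopper's actual layout format): small user bins can share one technical bin, while a very large user bin is spread over several technical bins, so that all technical bins end up with roughly similar k-mer counts.

```cpp
// Hypothetical illustration of an HIBF layout, not chopper's actual format.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct technical_bin
{
    std::vector<std::string> user_bins; // several entries -> merged; repeated user bin -> split
    std::size_t kmer_count;
};

int main()
{
    // Invented user bins with very different sizes:
    // tiny.fa (2'000 k-mers), small.fa (5'000), medium.fa (40'000), huge.fa (120'000).
    std::vector<technical_bin> layout{
        {{"tiny.fa", "small.fa"}, 7'000}, // merged: two small samples share one technical bin
        {{"medium.fa"}, 40'000},          // stored as-is
        {{"huge.fa"}, 60'000},            // split: first half of huge.fa
        {{"huge.fa"}, 60'000}};           // split: second half of huge.fa

    for (std::size_t i = 0; i < layout.size(); ++i)
    {
        std::cout << "technical bin " << i << " (" << layout[i].kmer_count << " k-mers):";
        for (std::string const & user_bin : layout[i].user_bins)
            std::cout << ' ' << user_bin;
        std::cout << '\n';
    }
}
```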
\image html hibf.svg width=40%

The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout.
The first step is to estimate the number of (representative) k-mers per user bin by computing
number of (representative) k-mers
Does this mean unique k-mers, i.e. the set cardinality? I feel like representative is a bit ambiguous in its meaning.
[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
sketches are stored in a directory and will be used in computing an HIBF layout. We will go into more detail later
\ref HLL. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated
using a k-mer count per user bin which represents the potential denisity in a technical bin in an Interleaved Bloom
denisity → density
## Additional parameters

To create an index and thus a layout, the individual samples of the data set are chopped up into k-mers and determine in
their so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions. This means that
so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions
I don't understand this.
I could add an example like this:
Query ACGT with k-mers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 hash functions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3
@eseiler do you have an image for this?
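A runnable toy version of this example might look as follows (a sketch, not raptor's actual code; `hash_kmer`, the 4-bit bins, and k = 3 are made-up illustration values): every k-mer of a sample sets `num_hashes` bits in that sample's bin, and a query k-mer is reported for a bin only if all of its bits are set there.

```cpp
// Toy illustration only: this is not raptor's implementation; hash_kmer and the
// 4-bit bins are invented to mirror the example above.
#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

constexpr std::size_t bits_per_bin = 4; // deliberately tiny; real filters are much larger
constexpr std::size_t num_hashes = 2;   // "2 hash functions" from the example

// Hypothetical helper: hash a k-mer with one of `num_hashes` seeds into a bit position.
std::size_t hash_kmer(std::string const & kmer, std::size_t seed)
{
    return std::hash<std::string>{}(kmer + std::to_string(seed)) % bits_per_bin;
}

// All k-mers (k = 3) of a sequence.
std::vector<std::string> kmers(std::string const & sequence, std::size_t k = 3)
{
    std::vector<std::string> result;
    for (std::size_t i = 0; i + k <= sequence.size(); ++i)
        result.push_back(sequence.substr(i, k));
    return result;
}

int main()
{
    std::vector<std::string> samples{"AATGT", "ACCGT", "ACGTA"};
    std::vector<std::bitset<bits_per_bin>> bins(samples.size()); // one bit vector per sample

    // Build: every k-mer of a sample sets `num_hashes` bits in that sample's bin.
    for (std::size_t b = 0; b < samples.size(); ++b)
        for (std::string const & kmer : kmers(samples[b]))
            for (std::size_t h = 0; h < num_hashes; ++h)
                bins[b].set(hash_kmer(kmer, h));

    // Query: a k-mer is (probably) in a bin only if all of its bits are set there.
    for (std::string const & kmer : kmers("ACGT"))
    {
        std::cout << kmer << ':';
        for (std::size_t b = 0; b < samples.size(); ++b)
        {
            bool hit = true;
            for (std::size_t h = 0; h < num_hashes; ++h)
                hit = hit && bins[b].test(hash_kmer(kmer, h));
            std::cout << " bin" << b + 1 << '=' << hit;
        }
        std::cout << '\n';
    }
}
```

With bins this small, false positives are of course likely; the point is only to show the mechanism of setting bits per bin during build and testing them during query.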
However, if we are unlucky and come across a hash value that consists of only `0`'s, then \f$p_{max}\f$ is of course
maximum of all possible hash values, no matter how many different elements are actually present.
To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
To avoid this, we cut each hash value into `m` parts
This is not correct. We cut the stream of hash values into m substreams and use the first b bits of each hash value to determine into which substream it belongs.
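A minimal sketch of that substream idea (toy code, not chopper's HLL implementation; the register layout and std::hash-based hashing are assumptions for illustration): the first `b` bits of each 64-bit hash select one of `m = 2^b` registers, and each register keeps the maximum number of leading zeros seen in the remaining bits.

```cpp
// Toy HyperLogLog-style register update, not chopper's implementation.
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    constexpr unsigned b = 4;          // number of index bits
    constexpr std::size_t m = 1u << b; // number of registers ("substreams")
    std::vector<std::uint8_t> registers(m, 0);

    for (std::string const element : {"ACG", "CGT", "GTA", "ACG"}) // duplicates do not change the sketch
    {
        std::uint64_t const hash = std::hash<std::string>{}(element);
        std::size_t const index = hash >> (64 - b); // first b bits pick the substream
        std::uint64_t const rest = hash << b;       // remaining 64 - b bits
        std::uint8_t const rank = static_cast<std::uint8_t>(std::countl_zero(rest) + 1);
        registers[index] = std::max(registers[index], rank); // keep p_max per substream
    }

    for (std::size_t i = 0; i < m; ++i)
        std::cout << "register " << i << ": " << static_cast<int>(registers[i]) << '\n';
}
```

A pathological hash value then only inflates one of the m registers instead of dominating the whole estimate.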
If we choose our `b` (`m`) to be very large, then we need more memory but get higher accuracy. (Storage consumption is
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
I think you should mention that the relative error of the HLL estimate increases with decreasing b (m) and that we believe that anything above m=512 should be fine.
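For reference (this number comes from the HyperLogLog paper linked above, not from chopper-specific benchmarks), the relative standard error of the estimate is approximately

\f$ \mathrm{RSE} \approx \frac{1.04}{\sqrt{m}} \f$

so `m = 512` corresponds to roughly \f$1.04 / \sqrt{512} \approx 4.6\,\%\f$, which fits the rule of thumb that anything above m = 512 should be fine.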
With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
I would instead write that this value should only be used if the layouting takes too much memory or time, because that is where it has a large influence.
With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
which could be bad
I suggest instead:
which could result in a less memory-efficient layout.
Lydia will do another PR for Felix's suggestions.