[DOC] Tutorial: small corrections #205
Example:
Query ACGT with kmers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 Hash functions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3
@eseiler do you have an image for this?
Is this the first time we explain how this works? Otherwise, I would just reference the other part.
This also seems a bit too detailed?
If we do it here, we could also think about just explaining a plain Bloom Filter. The principle is the same.
I added this example, because of this discussion: #201 (comment)
We also have a very short explanation of a BF and an IBF here: https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters.
Should we merge these explanations? Or would you just leave this one out, as it's too detailed?
doc/tutorial/02_layout/index.md
Outdated
distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
which throw some bins together. This is especially useful when there are samples of very different sizes.
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
Suggested change:
- the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
+ the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which
doc/tutorial/02_layout/index.md
Outdated
which throw some bins together. This is especially useful when there are samples of very different sizes.
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits
Suggested change:
- reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits
+ reflect the individual input samples, and the *technical bins*, which are physical storage units within the HIBF. *Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins.
doc/tutorial/02_layout/index.md
Outdated
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error
Suggested change:
- bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error
+ bins and observe a long layout computation time, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error
Codecov Report
Base: 100.00% // Head: 100.00% // No change to project coverage 👍

| | main | #205 | +/- |
|---|---|---|---|
| Coverage | 100.00% | 100.00% | |
| Files | 53 | 53 | |
| Lines | 1595 | 1599 | +4 |
| Hits | 1595 | 1599 | +4 |
Another idea, but I would have to re-read the whole tutorial to be certain, and we should ask someone who hasn't seen it yet:
I like that we also explain how everything works, e.g. IBF, HIBF, HyperLogLog.
However, it doesn't always seem to be necessary to understand how everything works.
What I'm trying to say is that we may have two types of information in our tutorial:
- Here is how you use our tool.
- Here is how our tool works.
So, we may want to "mark" the detailed/how everything works stuff as such.
Which would mean that someone who just wants to use the tool doesn't have to read everything.
We would have to decide what that should look like. I wouldn't make it collapsible, but maybe use a different background color?
@Irallia @smehringer Thoughts?
@feldroop Thoughts, if you have time? :)
doc/tutorial/02_layout/index.md
Outdated
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
Suggested change:
- *Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
+ *Technical bins* may store a single user bin, a split part of a user bin, or several (merged) user bins. This is
doc/tutorial/02_layout/index.md
Outdated
the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
especially useful when there are samples of very different sizes.
Suggested change:
- especially useful when there are samples of very different sizes.
+ especially useful when samples vary dramatically in size.
doc/tutorial/02_layout/index.md
Outdated
filter.

\note
The term representative indicates that the k-mer content could be transformed by a function which reduces its size and
distribution, e.g. using minimizers.
Suggested change:
- distribution, e.g. using minimizers.
+ distribution, e.g. by using minimizers.
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3

If a query is then searched, its k-mers are thrown into the hash functions and looked at in which bins it only points
to ones. This can also result in false positives. Thus, the result only indicates that the query is probably part of a
sample.
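For illustration, the lookup described in this hunk can be sketched in a few lines of Python. This is only a toy model, not raptor's implementation: the class name `ToyIBF`, the seeded SHA-256 hash family, and the row count of 64 are assumptions made for the sketch, so the resulting bit patterns will differ from the "could look like" patterns in the example.

```python
import hashlib


def kmers(sequence, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class ToyIBF:
    """Toy interleaved Bloom filter: one Bloom filter per bin, stored row-wise."""

    def __init__(self, num_bins, rows=64, num_hashes=2):
        self.num_bins = num_bins
        self.rows = rows
        self.num_hashes = num_hashes
        # table[row][bin]: the bits of all bins for one hash position sit side by side.
        self.table = [[0] * num_bins for _ in range(rows)]

    def _rows_for(self, kmer):
        # Hypothetical hash family: SHA-256 with a seed prefix, reduced modulo the row count.
        result = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            result.append(int.from_bytes(digest[:8], "big") % self.rows)
        return result

    def insert(self, kmer, bin_index):
        for row in self._rows_for(kmer):
            self.table[row][bin_index] = 1

    def query(self, sequence, k=3):
        """Count, per bin, how many k-mers of the query point only to ones."""
        counts = [0] * self.num_bins
        for kmer in kmers(sequence, k):
            hit_rows = [self.table[row] for row in self._rows_for(kmer)]
            for bin_index in range(self.num_bins):
                if all(row[bin_index] for row in hit_rows):
                    counts[bin_index] += 1
        return counts


samples = ["AATGT", "ACCGT", "ACGTA"]
ibf = ToyIBF(num_bins=len(samples))
for bin_index, sample in enumerate(samples):
    for kmer in kmers(sample):
        ibf.insert(kmer, bin_index)

counts = ibf.query("ACGT")
print(counts)  # the last entry is 2: Sample 3 contains both ACG and CGT
```

A bin can only over-report, never under-report: every inserted k-mer is found again, while hash collisions may add false positive hits. That is why the result only says the query is probably part of a sample.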
Down below, when mentioning the https://hur.st/bloomfilter/ website, we should mention that the number of inserted elements is the number of kmers in a single bin, and you want to use the biggest bin to be sure.
Thinking about it, does it fit here? Because the layout kinda does all the computations for you?
It does fit, if we rephrase it a bit like: You can look for optimal parameter settings, because we don't optimize the number of hash functions.
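To make the "use the biggest bin" point concrete, here is a small Python sketch based on the standard Bloom filter approximation p = (1 - e^(-kn/m))^k, where n is the number of inserted elements, m the number of bits, and k the number of hash functions. The function names and the bin sizes are hypothetical and only serve the illustration.

```python
import math


def bloom_false_positive_rate(num_bits, num_elements, num_hashes):
    """Approximate false positive rate of a Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_elements / num_bits)) ** num_hashes


def bits_for_target_rate(num_elements, num_hashes, target_rate):
    """Invert the approximation above to get the required number of bits."""
    return math.ceil(
        -num_hashes * num_elements / math.log(1.0 - target_rate ** (1.0 / num_hashes))
    )


# Hypothetical k-mer counts of three bins; the biggest bin determines the size.
kmers_per_bin = [1_000, 50_000, 200_000]
largest_bin = max(kmers_per_bin)

bin_size = bits_for_target_rate(largest_bin, num_hashes=2, target_rate=0.05)
for count in kmers_per_bin:
    rate = bloom_false_positive_rate(bin_size, count, num_hashes=2)
    print(f"{count:>7} k-mers -> false positive rate {rate:.6f}")
```

Sizing for the largest bin means every smaller bin ends up with a false positive rate below the target, which is the reason for choosing the biggest bin as the number of inserted elements.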
I copied this paragraph from the index tutorial because I thought you might need it here too. But I can also take it out completely. -> https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters
doc/tutorial/02_layout/index.md
Outdated
To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
To avoid this, we cut the stream of hash values into `m` substreams and use the first `b` bits of each hash value to
determine into which substream it belongs. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
Suggested change:
- determine into which substream it belongs. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
+ determine into which substream it belongs. From these, we calculate the *harmonic mean* as the total \f$p_{max}\f$.
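The substream idea in this hunk can be sketched as a textbook HyperLogLog in Python. This is an assumption-laden illustration, not the code raptor uses: the 64-bit hash taken from SHA-256, the bias constant for `m >= 128`, and the small-range correction follow the commonly published HyperLogLog description, and the per-substream register plays the role of \f$p_{max}\f$.

```python
import hashlib
import math


def hash64(item):
    """Hypothetical 64-bit hash: the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")


def hyperloglog_estimate(items, b=10):
    """Estimate the number of distinct items using m = 2**b substreams."""
    m = 1 << b
    registers = [0] * m  # one maximum per substream
    for item in items:
        h = hash64(item)
        index = h >> (64 - b)                 # first b bits select the substream
        rest = h & ((1 << (64 - b)) - 1)      # remaining 64 - b bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - b) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction, valid for m >= 128
    # m / sum(2**-r) is the harmonic mean of 2**r over all substreams.
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting while registers are empty.
    empty = registers.count(0)
    if estimate <= 2.5 * m and empty:
        estimate = m * math.log(m / empty)
    return estimate


items = [f"kmer{i}" for i in range(10_000)]
print(round(hyperloglog_estimate(items)))
```

A larger `b` gives more substreams and a smaller relative error (roughly 1.04 / sqrt(m)), at the cost of more registers to maintain, which matches the trade-off discussed for `b` (`m`) in the tutorial.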
doc/tutorial/02_layout/index.md
Outdated
@@ -262,8 +279,9 @@ runtime.

With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient
Suggested change:
- memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient
+ memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient
Signed-off-by: Lydia Buntrock <[email protected]>
ec0d117 to b676bee
I'll have another look once I'm done with updating raptor. Some things will change then.
This PR contains corrections of some smaller mistakes I found in the overall tutorial, as well as the review suggestions from @feldroop in #201.