Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Tutorial: small corrections #205

Merged
merged 4 commits into from
Jan 18, 2023

Conversation

Irallia
Copy link
Contributor

@Irallia Irallia commented Dec 14, 2022

This PR contains corrections of some smaller mistakes I found in the overall tutorial. And the review suggestions from @feldroop in #201.

@vercel
Copy link

vercel bot commented Dec 14, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
raptor ✅ Ready (Inspect) Visit Preview Jan 10, 2023 at 5:44PM (UTC)

@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 14, 2022
Comment on lines +165 to +174

Example:
Query ACGT with kmers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 Hash funktions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eseiler do you have an image for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the first time, we explain how this work? Otherwise, I would just reference the other part.

This also seems a bit too detailed?
If we do it here, we could also think about just explaining a plain Bloom Filter. The principle is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this example, because of this discussion: #201 (comment)
We have also a really short explanation of a BF and IBF here: https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters.
Should we merge these explanations? Or would you leave just this one out as its too detailed?

distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
which throw some bins together. This is especially useful when there are samples of very different sizes.
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which

which throw some bins together. This is especially useful when there are samples of very different sizes.
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits
reflect the individual input samples, and the *technical bins*, which are physical storage units within the HIBF. *Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins.

growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error
bins and observe a long layout computation time, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error

@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 15, 2022
@codecov
Copy link

codecov bot commented Dec 15, 2022

Codecov Report

Base: 100.00% // Head: 100.00% // No change to project coverage 👍

Coverage data is based on head (b676bee) compared to base (66ed903).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #205   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           53        53           
  Lines         1595      1599    +4     
=========================================
+ Hits          1595      1599    +4     
Impacted Files Coverage Δ
src/argument_parsing/search_parsing.cpp 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea, but I would have to re-read the whole tutorial to be certain, and we should ask someone who hasn't seen it yet:

I like that we also explain how everything works, e.g. IBF, HIBF, HyperLogLog.
However, it doesn't always seem to be necessary to understand how everything works.
What I'm trying to say is that we may have to types of information in our tutorial:

  • Here is how you use our tool.
  • Here is how our tool works.

So, we may want to "mark" the detailed/how everything works stuff as such.
Which would mean that someone who just wants to use the tool doesn't have to read everything.

We would have to decide on how that should look like. I wouldn't make collapsible, but maybe use a different background color?

@Irallia @smehringer Thoughts?
@feldroop Thoughts, if you have time? :)

(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
*Technical bins* may store a single user bin, a split part of a user bin, or several (merged) user bins. This is

the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
especially useful when there are samples of very different sizes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
especially useful when there are samples of very different sizes.
especially useful when samples vary dramatically in size.

filter.

\note
The term representative indicates that the k-mer content could be transformed by a function which reduces its size and
distribution, e.g. using minimizers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
distribution, e.g. using minimizers.
distribution, e.g. by using minimizers.

Comment on lines +165 to +174

Example:
Query ACGT with kmers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 Hash funktions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the first time, we explain how this work? Otherwise, I would just reference the other part.

This also seems a bit too detailed?
If we do it here, we could also think about just explaining a plain Bloom Filter. The principle is the same.

-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3

If a query is then searched, its k-mers are thrown into the hash functions and looked at in which bins it only points
to ones. This can also result in false positives. Thus, the result only indicates that the query is probably part of a
sample.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down below when mentioning the https://hur.st/bloomfilter/ website, we should mention that the number of inserted elements is the number of kmers in a single bin, and you want to use the biggest bin to be sure.

Thinking about it, does it fit here? Because the layout kinda does all the computations for you?
It does fit, if we rephrase it a bit like: You can look for optimal parameter settings, because we don't optimize the number of hash functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this paragraph from the index tutorial because I thought you might need it here too. But I can also take it out completely. -> https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters

To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
To avoid this, we cut the stream of hash values into `m` substreams and use the first `b` bits of each hash value to
determine into which substream it belongs. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
determine into which substream it belongs. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
determine into which substream it belongs. From these, we calculate the *harmonic mean* as the total \f$p_{max}\f$.

@@ -262,8 +279,9 @@ runtime.

With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient
memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient

@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Jan 5, 2023
Signed-off-by: Lydia Buntrock <[email protected]>
Signed-off-by: Lydia Buntrock <[email protected]>
Signed-off-by: Lydia Buntrock <[email protected]>
@eseiler eseiler force-pushed the DOC/tutorial/small_corrections branch from ec0d117 to b676bee Compare January 10, 2023 17:40
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Jan 10, 2023
@Irallia Irallia requested a review from eseiler January 11, 2023 14:02
Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have another look once I'm done with updating raptor. Some things will change then

@eseiler eseiler merged commit 2c60c2d into seqan:main Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants