[DOC] Tutorial: small corrections #205

Irallia · 2022-12-14T13:45:48Z

This PR contains corrections of some smaller mistakes I found in the overall tutorial. And the review suggestions from @feldroop in #201.

vercel · 2022-12-14T13:45:52Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
raptor	✅ Ready (Inspect)	Visit Preview	Jan 10, 2023 at 5:44PM (UTC)

Irallia · 2022-12-14T13:46:49Z

doc/tutorial/02_layout/index.md

+
+Example:
+Query ACGT with kmers ACG, CGT.
+Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
+2 Hash funktions
+-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
+   Bins 1 to 3 for CGT could look like |0000|0110|1100|
+-> The query seems to match Sample 3
+


@eseiler do you have an image for this?

Is this the first time, we explain how this work? Otherwise, I would just reference the other part.

This also seems a bit too detailed?
If we do it here, we could also think about just explaining a plain Bloom Filter. The principle is the same.

I added this example, because of this discussion: #201 (comment)
We have also a really short explanation of a BF and IBF here: https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters.
Should we merge these explanations? Or would you leave just this one out as its too detailed?

doc/tutorial/02_layout/index.md

smehringer · 2022-12-15T08:15:24Z

doc/tutorial/02_layout/index.md

-distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
-which throw some bins together. This is especially useful when there are samples of very different sizes.
+(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
+the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which


Suggested change

the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which

the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which

smehringer · 2022-12-15T08:17:59Z

doc/tutorial/02_layout/index.md

-which throw some bins together. This is especially useful when there are samples of very different sizes.
+(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
+the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
+reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits


Suggested change

reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits

reflect the individual input samples, and the *technical bins*, which are physical storage units within the HIBF. *Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins.

smehringer · 2022-12-15T08:19:59Z

doc/tutorial/02_layout/index.md

-growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
-bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
+growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`).  If we have many user
+bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error


Suggested change

bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error

bins and observe a long layout computation time, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error

codecov · 2022-12-15T15:42:47Z

Codecov Report

Base: 100.00% // Head: 100.00% // No change to project coverage 👍

Coverage data is based on head (b676bee) compared to base (66ed903).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #205   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           53        53           
  Lines         1595      1599    +4     
=========================================
+ Hits          1595      1599    +4

Impacted Files	Coverage Δ
src/argument_parsing/search_parsing.cpp	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

eseiler

Another idea, but I would have to re-read the whole tutorial to be certain, and we should ask someone who hasn't seen it yet:

I like that we also explain how everything works, e.g. IBF, HIBF, HyperLogLog.
However, it doesn't always seem to be necessary to understand how everything works.
What I'm trying to say is that we may have to types of information in our tutorial:

Here is how you use our tool.
Here is how our tool works.

So, we may want to "mark" the detailed/how everything works stuff as such.
Which would mean that someone who just wants to use the tool doesn't have to read everything.

We would have to decide on how that should look like. I wouldn't make collapsible, but maybe use a different background color?

@Irallia @smehringer Thoughts?
@feldroop Thoughts, if you have time? :)

eseiler · 2022-12-19T11:40:27Z

doc/tutorial/02_layout/index.md

+(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses an almost always more space-saving method of storing
+the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
+the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
+*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is


Suggested change

*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is

*Technical bins* may store a single user bin, a split part of a user bin, or several (merged) user bins. This is

eseiler · 2022-12-19T11:40:50Z

doc/tutorial/02_layout/index.md

+the bins (except if the input samples are all of the same size). It distinguishes between the *user bins*, which reflect
+the individual input samples, and the *technical bins*, which are physical storage units within the HIBF.
+*Technical bins* may store a single user bin, a split part of a user bin or several (merged) user bins. This is
+especially useful when there are samples of very different sizes.


Suggested change

especially useful when there are samples of very different sizes.

especially useful when samples vary dramatically in size.

eseiler · 2022-12-19T11:41:09Z

doc/tutorial/02_layout/index.md

 filter.

+\note
+The term representative indicates that the k-mer content could be transformed by a function which reduces its size and
+distribution, e.g. using minimizers.


Suggested change

distribution, e.g. using minimizers.

distribution, e.g. by using minimizers.

eseiler · 2022-12-19T11:43:30Z

doc/tutorial/02_layout/index.md

+
+Example:
+Query ACGT with kmers ACG, CGT.
+Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
+2 Hash funktions
+-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
+   Bins 1 to 3 for CGT could look like |0000|0110|1100|
+-> The query seems to match Sample 3
+


Is this the first time, we explain how this work? Otherwise, I would just reference the other part.

This also seems a bit too detailed?
If we do it here, we could also think about just explaining a plain Bloom Filter. The principle is the same.

eseiler · 2022-12-19T11:46:22Z

doc/tutorial/02_layout/index.md

+-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
+   Bins 1 to 3 for CGT could look like |0000|0110|1100|
+-> The query seems to match Sample 3
+
 If a query is then searched, its k-mers are thrown into the hash functions and looked at in which bins it only points
 to ones. This can also result in false positives. Thus, the result only indicates that the query is probably part of a
 sample.


Down below when mentioning the https://hur.st/bloomfilter/ website, we should mention that the number of inserted elements is the number of kmers in a single bin, and you want to use the biggest bin to be sure.

Thinking about it, does it fit here? Because the layout kinda does all the computations for you?
It does fit, if we rephrase it a bit like: You can look for optimal parameter settings, because we don't optimize the number of hash functions

I copied this paragraph from the index tutorial because I thought you might need it here too. But I can also take it out completely. -> https://github.com/seqan/raptor/blob/main/doc/tutorial/03_index/index.md#general-idea--main-parameters

eseiler · 2022-12-19T11:47:02Z

doc/tutorial/02_layout/index.md

-To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
-these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.
+To avoid this, we cut the stream of hash values into `m` substreams and use the first `b` bits of each hash value to
+determine into which substream it belongs. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.


Suggested change

determine into which substream it belongs. From these we then calculate the *harmonic mean* as the total \f$p_{max}\f$.

determine into which substream it belongs. From these, we calculate the *harmonic mean* as the total \f$p_{max}\f$.

eseiler · 2022-12-19T11:47:13Z

doc/tutorial/02_layout/index.md

@@ -262,8 +279,9 @@ runtime.

 With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
 you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
-memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
-we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
+memory. If it is close to `1`, however, just little re-arranging is done,  which could result in a less memory-efficient


Suggested change

memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient

memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient

Signed-off-by: Lydia Buntrock <[email protected]>

eseiler

I'll have another look once I'm done with updating raptor. Some things will change then

Irallia requested review from feldroop and smehringer December 14, 2022 13:45

seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 14, 2022

Irallia commented Dec 14, 2022

View reviewed changes

smehringer reviewed Dec 15, 2022

View reviewed changes

doc/tutorial/02_layout/index.md Show resolved Hide resolved

smehringer reviewed Dec 15, 2022

View reviewed changes

seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 15, 2022

eseiler reviewed Dec 19, 2022

View reviewed changes

seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Jan 5, 2023

Irallia added 4 commits January 10, 2023 18:40

[DOC] Some tiny changes

73dbea5

Signed-off-by: Lydia Buntrock <[email protected]>

[DOC] Apply review from seqan#201 PR from @feldroop.

ddea261

Signed-off-by: Lydia Buntrock <[email protected]>

[DOC] Apply review

c9c19c4

Signed-off-by: Lydia Buntrock <[email protected]>

[MISC] Apply second review

b676bee

Signed-off-by: Lydia Buntrock <[email protected]>

eseiler force-pushed the DOC/tutorial/small_corrections branch from ec0d117 to b676bee Compare January 10, 2023 17:40

seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Jan 10, 2023

Irallia requested a review from eseiler January 11, 2023 14:02

eseiler approved these changes Jan 18, 2023

View reviewed changes

eseiler merged commit 2c60c2d into seqan:main Jan 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DOC] Tutorial: small corrections #205

[DOC] Tutorial: small corrections #205

Irallia commented Dec 14, 2022

vercel bot commented Dec 14, 2022 •

edited

Loading

Irallia Dec 14, 2022

eseiler Dec 19, 2022

Irallia Jan 5, 2023

smehringer Dec 15, 2022

smehringer Dec 15, 2022

smehringer Dec 15, 2022

codecov bot commented Dec 15, 2022 •

edited

Loading

eseiler left a comment •

edited

Loading

eseiler Dec 19, 2022

eseiler Dec 19, 2022

eseiler Dec 19, 2022

eseiler Dec 19, 2022

eseiler Dec 19, 2022

Irallia Jan 5, 2023

eseiler Dec 19, 2022

eseiler Dec 19, 2022

eseiler left a comment

	the bins (the HIBF is only not smaller if all bins are the same size). It distinguishes between the user bins, which
	the bins (except if the input samples are all of the same size). It distinguishes between the user bins, which

	reflect the individual samples as before, and the so-called technical bins, which merges some bins together and splits
	reflect the individual input samples, and the technical bins, which are physical storage units within the HIBF. Technical bins may store a single user bin, a split part of a user bin or several (merged) user bins.

	bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error
	bins and observe a long layout computation time, then it is worth choosing a somewhat smaller `b` (`m`). Furthermore, the relative error

	Technical bins may store a single user bin, a split part of a user bin or several (merged) user bins. This is
	Technical bins may store a single user bin, a split part of a user bin, or several (merged) user bins. This is

	especially useful when there are samples of very different sizes.
	especially useful when samples vary dramatically in size.

	distribution, e.g. using minimizers.
	distribution, e.g. by using minimizers.

	determine into which substream it belongs. From these we then calculate the harmonic mean as the total \f$p_{max}\f$.
	determine into which substream it belongs. From these, we calculate the harmonic mean as the total \f$p_{max}\f$.

	memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient
	memory. If it is close to `1`, however, just little re-arranging is done, which could result in a less memory-efficient

[DOC] Tutorial: small corrections #205

[DOC] Tutorial: small corrections #205

Conversation

Irallia commented Dec 14, 2022

vercel bot commented Dec 14, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Dec 15, 2022 • edited Loading

Codecov Report

eseiler left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eseiler left a comment

Choose a reason for hiding this comment

vercel bot commented Dec 14, 2022 •

edited

Loading

codecov bot commented Dec 15, 2022 •

edited

Loading

eseiler left a comment •

edited

Loading