[DOC] Add a raptor layout tutorial #201
Conversation
Codecov Report — Base: 0.00% // Head: 100.00% // Increases project coverage by +100.00%.
Coverage diff (main → #201): Coverage 0% → 100.00% (+100.00%), Files 0 → 53 (+53), Lines 0 → 1593 (+1593), Hits 0 → 1593 (+1593).
View full report at Codecov.
force-pushed from aed342e to efddb9b
force-pushed from efddb9b to cca7793
force-pushed from 531a50a to 70f8fba
force-pushed from 70f8fba to fa02cc3
force-pushed from edffaf1 to d315bf0
force-pushed from d315bf0 to 23133de
force-pushed from 23133de to c538eb1
doc/tutorial/02_layout/index.md (Outdated)
advanced.

\todo
`--disable-sketch-output` probably pointless, since it is only the intermediate state between count and layout
`raptor layout` does both, so probably not that useful? It probably stores the sketches? Might be useful if you want to run `raptor layout` multiple times.
I had understood the flag to mean that you save the intermediate state so that you pass your result from `chopper count` to `chopper layout`. And @feldroop said that this has to happen here anyway. So you don't want to/can't switch this off.
Then we can just remove it :D
I opened an issue and removed this todo: #204
force-pushed from 5ad8eda to f7e105e
force-pushed from f7e105e to b3e1220
Found a bit of stuff. Some things you can maybe just ignore, because you know better than I do for which audience this tutorial is intended. Overall I think it is very nice that we have this :)
# IBF vs HIBF

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
This uses a more space-saving method of storing the bins
You could add here that the HIBF is also faster than the IBF in many cases. This is dependent on the search parameters (number of errors, thresholding), but I would expect that the HIBF is faster most of the time, especially when the number of bins is large. Is this correct @eseiler @smehringer ?
Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
which throw some bins together. This is especially useful when there are samples of very different sizes.
which throw some bins together
You could add that it also splits bins. Maybe merge is better than throw together.
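To make the merge/split idea concrete, here is a hypothetical sketch (invented file names and k-mer counts, and not chopper's actual layout format): small user bins can share one technical bin, while a very large user bin is spread over several technical bins, so that all technical bins end up with roughly similar k-mer counts.

```cpp
// Hypothetical illustration of an HIBF layout, not chopper's actual format.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct technical_bin
{
    std::vector<std::string> user_bins; // several entries -> merged; repeated user bin -> split
    std::size_t kmer_count;
};

int main()
{
    // Invented user bins with very different sizes:
    // tiny.fa (2'000 k-mers), small.fa (5'000), medium.fa (40'000), huge.fa (120'000).
    std::vector<technical_bin> layout{
        {{"tiny.fa", "small.fa"}, 7'000}, // merged: two small samples share one technical bin
        {{"medium.fa"}, 40'000},          // stored as-is
        {{"huge.fa"}, 60'000},            // split: first half of huge.fa
        {{"huge.fa"}, 60'000}};           // split: second half of huge.fa

    for (std::size_t i = 0; i < layout.size(); ++i)
    {
        std::cout << "technical bin " << i << " (" << layout[i].kmer_count << " k-mers):";
        for (std::string const & user_bin : layout[i].user_bins)
            std::cout << ' ' << user_bin;
        std::cout << '\n';
    }
}
```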
\image html hibf.svg width=40%

The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout.
The first step is to estimate the number of (representative) k-mers per user bin by computing
number of (representative) k-mers
Does this mean unique k-mers, i.e. the set cardinality? I feel like representative is a bit ambiguous in its meaning.
[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
sketches are stored in a directory and will be used in computing an HIBF layout. We will go into more detail later
\ref HLL. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated
using a k-mer count per user bin which represents the potential denisity in a technical bin in an Interleaved Bloom
denisity → density
## Additional parameters

To create an index and thus a layout, the individual samples of the data set are chopped up into k-mers and determine in
their so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions. This means that
so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions
I don't understand this.
I could add an example like this:
Query ACGT with k-mers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 hash functions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3
@eseiler do you have an image for this?
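A runnable toy version of this example might look as follows (a sketch, not raptor's actual code; `hash_kmer`, the 4-bit bins, and k = 3 are made-up illustration values): every k-mer of a sample sets `num_hashes` bits in that sample's bin, and a query k-mer is reported for a bin only if all of its bits are set there.

```cpp
// Toy illustration only: this is not raptor's implementation; hash_kmer and the
// 4-bit bins are invented to mirror the example above.
#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

constexpr std::size_t bits_per_bin = 4; // deliberately tiny; real filters are much larger
constexpr std::size_t num_hashes = 2;   // "2 hash functions" from the example

// Hypothetical helper: hash a k-mer with one of `num_hashes` seeds into a bit position.
std::size_t hash_kmer(std::string const & kmer, std::size_t seed)
{
    return std::hash<std::string>{}(kmer + std::to_string(seed)) % bits_per_bin;
}

// All k-mers (k = 3) of a sequence.
std::vector<std::string> kmers(std::string const & sequence, std::size_t k = 3)
{
    std::vector<std::string> result;
    for (std::size_t i = 0; i + k <= sequence.size(); ++i)
        result.push_back(sequence.substr(i, k));
    return result;
}

int main()
{
    std::vector<std::string> samples{"AATGT", "ACCGT", "ACGTA"};
    std::vector<std::bitset<bits_per_bin>> bins(samples.size()); // one bit vector per sample

    // Build: every k-mer of a sample sets `num_hashes` bits in that sample's bin.
    for (std::size_t b = 0; b < samples.size(); ++b)
        for (std::string const & kmer : kmers(samples[b]))
            for (std::size_t h = 0; h < num_hashes; ++h)
                bins[b].set(hash_kmer(kmer, h));

    // Query: a k-mer is (probably) in a bin only if all of its bits are set there.
    for (std::string const & kmer : kmers("ACGT"))
    {
        std::cout << kmer << ':';
        for (std::size_t b = 0; b < samples.size(); ++b)
        {
            bool hit = true;
            for (std::size_t h = 0; h < num_hashes; ++h)
                hit = hit && bins[b].test(hash_kmer(kmer, h));
            std::cout << " bin" << b + 1 << '=' << hit;
        }
        std::cout << '\n';
    }
}
```

With bins this small, false positives are of course likely; the point is only to show the mechanism of setting bits per bin during build and testing them during query.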
However, if we are unlucky and come across a hash value that consists of only `0`'s, then \f$p_{max}\f$ is of course
maximum of all possible hash values, no matter how many different elements are actually present.
To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
To avoid this, we cut each hash value into `m` parts
This is not correct. We cut the stream of hash values into m substreams and use the first b bits of each hash value to determine into which substream it belongs.
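A minimal sketch of that substream idea (toy code, not chopper's HLL implementation; the register layout and std::hash-based hashing are assumptions for illustration): the first `b` bits of each 64-bit hash select one of `m = 2^b` registers, and each register keeps the maximum number of leading zeros seen in the remaining bits.

```cpp
// Toy HyperLogLog-style register update, not chopper's implementation.
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    constexpr unsigned b = 4;          // number of index bits
    constexpr std::size_t m = 1u << b; // number of registers ("substreams")
    std::vector<std::uint8_t> registers(m, 0);

    for (std::string const element : {"ACG", "CGT", "GTA", "ACG"}) // duplicates do not change the sketch
    {
        std::uint64_t const hash = std::hash<std::string>{}(element);
        std::size_t const index = hash >> (64 - b); // first b bits pick the substream
        std::uint64_t const rest = hash << b;       // remaining 64 - b bits
        std::uint8_t const rank = static_cast<std::uint8_t>(std::countl_zero(rest) + 1);
        registers[index] = std::max(registers[index], rank); // keep p_max per substream
    }

    for (std::size_t i = 0; i < m; ++i)
        std::cout << "register " << i << ": " << static_cast<int>(registers[i]) << '\n';
}
```

A pathological hash value then only inflates one of the m registers instead of dominating the whole estimate.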
If we choose our `b` (`m`) to be very large, then we need more memory but get higher accuracy. (Storage consumption is
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
I think you should mention that the relative error of the HLL estimate increases with decreasing b (m) and that we believe that anything above m=512 should be fine.
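For reference (this number comes from the HyperLogLog paper linked above, not from chopper-specific benchmarks), the relative standard error of the estimate is approximately

\f$ \mathrm{RSE} \approx \frac{1.04}{\sqrt{m}} \f$

so `m = 512` corresponds to roughly \f$1.04 / \sqrt{512} \approx 4.6\,\%\f$, which fits the rule of thumb that anything above m = 512 should be fine.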
With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
I would instead write that this value should only be used if the layouting takes too much memory or time, because that is where it has a large influence.
With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
which could be bad
I suggest instead:
which could result in a less memory-efficient layout.
Lydia will do another PR for Felix's suggestions.