
[DOC] Add a raptor layout tutorial #201

Merged
merged 7 commits into from
Dec 13, 2022

Conversation

Irallia
Contributor

@Irallia Irallia commented Dec 6, 2022

No description provided.

@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 6, 2022

@codecov

codecov bot commented Dec 6, 2022

Codecov Report

Base: 0.00% // Head: 100.00% // Increases project coverage by +100.00% 🎉

Coverage data is based on head (b3e1220) compared to base (8e0bd82).
Patch has no changes to coverable lines.

Additional details and impacted files
@@             Coverage Diff             @@
##           main      #201        +/-   ##
===========================================
+ Coverage      0   100.00%   +100.00%     
===========================================
  Files         0        53        +53     
  Lines         0      1593      +1593     
===========================================
+ Hits          0      1593      +1593     
Impacted Files Coverage Δ
src/search/raptor_search.cpp 100.00% <0.00%> (ø)
src/threshold/one_error_model.cpp 100.00% <0.00%> (ø)
include/raptor/upgrade/upgrade_index.hpp 100.00% <0.00%> (ø)
src/threshold/precompute_correction.cpp 100.00% <0.00%> (ø)
src/build/hibf/chopper_build.cpp 100.00% <0.00%> (ø)
include/raptor/search/do_parallel.hpp 100.00% <0.00%> (ø)
src/argument_parsing/upgrade_parsing.cpp 100.00% <0.00%> (ø)
src/argument_parsing/init_shared_meta.cpp 100.00% <0.00%> (ø)
src/search/search_multiple.cpp 100.00% <0.00%> (ø)
include/raptor/argument_parsing/validators.hpp 100.00% <0.00%> (ø)
... and 43 more


@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from aed342e to efddb9b Compare December 6, 2022 15:10
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 6, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from efddb9b to cca7793 Compare December 7, 2022 12:28
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 7, 2022
@eseiler eseiler removed the lint [INTERNAL] used for linting label Dec 7, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from 531a50a to 70f8fba Compare December 8, 2022 16:25
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 8, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from 70f8fba to fa02cc3 Compare December 12, 2022 11:01
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 12, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from edffaf1 to d315bf0 Compare December 12, 2022 12:50
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 12, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from d315bf0 to 23133de Compare December 12, 2022 13:04
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 12, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from 23133de to c538eb1 Compare December 12, 2022 15:47
@Irallia Irallia requested a review from eseiler December 12, 2022 15:47
@Irallia Irallia marked this pull request as ready for review December 12, 2022 15:47
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 12, 2022
doc/tutorial/02_layout/index.md — 4 outdated review threads (resolved)
advanced.

\todo
`--disable-sketch-output` is probably pointless, since it is only the intermediate state between count and layout
Member

raptor layout does both, so probably not that useful?
It probably stores the sketches? Might be useful if you want to run raptor layout multiple times

Contributor Author

I had understood the flag to mean that you save the intermediate state so that you can pass your result from chopper count to chopper layout. And @feldroop said that this has to happen here anyway, so you don't want to/can't switch this off.

Member

Then we can just remove it :D

Contributor Author

I opened an issue and removed this todo: #204

doc/tutorial/02_layout/index.md — 2 outdated review threads (resolved)
doc/tutorial/03_index/index.md — outdated review thread (resolved)
@Irallia Irallia requested a review from eseiler December 13, 2022 10:19
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 13, 2022
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from 5ad8eda to f7e105e Compare December 13, 2022 14:01
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 13, 2022
Signed-off-by: Lydia Buntrock <[email protected]>
@Irallia Irallia force-pushed the DOC/tutorial/raptor_layout branch from f7e105e to b3e1220 Compare December 13, 2022 14:05
@seqan-actions seqan-actions added lint [INTERNAL] used for linting and removed lint [INTERNAL] used for linting labels Dec 13, 2022
Member

@feldroop feldroop left a comment

Found a bit of stuff. Some things you can maybe just ignore, because you know better than me which audience this tutorial is intended for. Overall I think it is very nice that we have this :)

# IBF vs HIBF

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
Member

This uses a more space-saving method of storing the bins

You could add here that the HIBF is also faster than the IBF in many cases. This depends on the search parameters (number of errors, thresholding), but I would expect the HIBF to be faster most of the time, especially when the number of bins is large. Is this correct @eseiler @smehringer ?

Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
(HIBF) (raptor::hierarchical_interleaved_bloom_filter). This uses a more space-saving method of storing the bins. It
distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical bins,
which throw some bins together. This is especially useful when there are samples of very different sizes.
Member

which throw some bins together

You could add that it also splits bins. Maybe merge is better than throw together.

\image html hibf.svg width=40%

The figure above shows the storage of the user bins in the technical bins. The resulting tree represents the layout.
The first step is to estimate the number of (representative) k-mers per user bin by computing
Member

number of (representative) k-mers

Does this mean unique k-mers, i.e. the set cardinality? I feel like representative is a bit ambiguous in its meaning.

doc/tutorial/02_layout/index.md — review thread (resolved)
[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
sketches are stored in a directory and will be used in computing an HIBF layout. We will go into more detail later
\ref HLL. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated
using a k-mer count per user bin which represents the potential denisity in a technical bin in an Interleaved Bloom
Member

denisity

density

## Additional parameters

To create an index and thus a layout, the individual samples of the data set are chopped up into k-mers and determine in
their so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions. This means that
Member

so-called bin the specific bit setting of the Bloom Filter by passing them through hash functions

I don't understand this.

Contributor Author

I could add an example like this:

Query ACGT with k-mers ACG, CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA
2 hash functions
-> Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> The query seems to match Sample 3

@eseiler do you have an image for this?
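For illustration, a minimal, self-contained sketch of the idea in the example above (not raptor's actual implementation, which interleaves the bins; the filter size, k = 3 and the double-hashing scheme are made-up toy values):

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

constexpr std::size_t filter_size = 16; // bits per sample's Bloom filter (tiny, toy value)
constexpr std::size_t num_hashes = 2;   // two hash functions, as in the example above

// Derive `num_hashes` bit positions for one k-mer via double hashing.
std::vector<std::size_t> bit_positions(std::string const & kmer)
{
    std::uint64_t const h1 = std::hash<std::string>{}(kmer);
    std::uint64_t const h2 = h1 * 0x9E3779B97F4A7C15ULL + 1u; // cheap derived second hash
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < num_hashes; ++i)
        positions.push_back((h1 + i * h2) % filter_size);
    return positions;
}

int main()
{
    std::vector<std::string> const samples{"AATGT", "ACCGT", "ACGTA"};
    std::vector<std::bitset<filter_size>> bins(samples.size());

    // Build: chop every sample into 3-mers and set the corresponding bits in its bin.
    for (std::size_t b = 0; b < samples.size(); ++b)
        for (std::size_t i = 0; i + 3 <= samples[b].size(); ++i)
            for (std::size_t pos : bit_positions(samples[b].substr(i, 3)))
                bins[b].set(pos);

    // Query ACGT: a bin is reported if all bits of all query k-mers are set
    // (false positives are possible, exactly as in a real Bloom filter).
    std::string const query{"ACGT"};
    for (std::size_t b = 0; b < bins.size(); ++b)
    {
        bool all_set = true;
        for (std::size_t i = 0; i + 3 <= query.size(); ++i)
            for (std::size_t pos : bit_positions(query.substr(i, 3)))
                all_set = all_set && bins[b].test(pos);
        std::cout << "Sample " << b + 1 << ": " << (all_set ? "possible match\n" : "no match\n");
    }
}
```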


However, if we are unlucky and come across a hash value that consists of only `0`'s, then \f$p_{max}\f$ is of course
maximum of all possible hash values, no matter how many different elements are actually present.
To avoid this, we cut each hash value into `m` parts and calculate the \f$p_{max}\f$ over each of these parts. From
Member

To avoid this, we cut each hash value into m parts

This is not correct. We cut the stream of hash values into m substreams and use the first b bits of each hash value to determine into which substream it belongs.
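For reference, a minimal sketch of that register update, assuming the standard HyperLogLog scheme from the linked paper (this is not chopper's/raptor's actual code; `b = 9` is just an example value):

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

constexpr unsigned b = 9;          // number of index bits
constexpr std::size_t m = 1u << b; // m = 2^b = 512 registers (one per substream)

// Process one 64-bit hash value from the stream.
// `registers` is expected to hold m entries, initialised to 0:
//   std::vector<std::uint8_t> registers(m, 0);
void update(std::vector<std::uint8_t> & registers, std::uint64_t hash)
{
    std::size_t const index = hash >> (64 - b); // first b bits choose the substream/register
    std::uint64_t const rest = hash << b;       // the remaining 64 - b bits
    // p = position of the leftmost 1-bit in the remaining bits (capped if they are all zero).
    int const p = std::min<int>(std::countl_zero(rest) + 1, 64 - b + 1);
    registers[index] = std::max<std::uint8_t>(registers[index], static_cast<std::uint8_t>(p));
}
```

The cardinality estimate is then derived from the harmonic mean of 2^(-registers[i]) over all m registers.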


If we choose our `b` (`m`) to be very large, then we need more memory but get higher accuracy. (Storage consumption is
growing exponentially.) In addition, calculating the layout can take longer with a high `b` (`m`). If we have many user
bins and observe a long runtime, then it is worth choosing a somewhat smaller `b` (`m`).
Member

I think you should mention that the relative error of the HLL estimate increases with a decreasing b (m) and that we believe that anything above m=512 should be fine.
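As a back-of-the-envelope check (using the standard HyperLogLog error bound from the linked Flajolet et al. paper, not a raptor-specific measurement): the relative standard error of the estimate is roughly 1.04 / sqrt(m), so m = 512 (b = 9) gives about 4.6 % and m = 4096 (b = 12) about 1.6 %.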

With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
we were not able to determine a too great influence, so we recommend that this value only be used for fine tuning.
Member

I would instead write that this value should only be used if the layouting takes too much memory or time, because that is where it has a large influence.


With `--max-rearrangement-ratio` you can further influence a part of the preprocessing (value between `0` and `1`). If
you set this value to `1`, it is switched off. If you set it to a very small value, you will also need more runtime and
memory. If it is close to `1`, however, just little re-arranging is done, which could be bad. In our benchmarks, however,
Member

which could be bad

I suggest instead:
which could result in a less memory-efficient layout.

@eseiler
Member

eseiler commented Dec 13, 2022

Lydia will do another PR for Felix's suggestions

@eseiler eseiler merged commit 4bbcd74 into seqan:main Dec 13, 2022
Irallia added a commit to Irallia/raptor that referenced this pull request Dec 14, 2022
eseiler pushed a commit to Irallia/raptor that referenced this pull request Jan 10, 2023