Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: SortFastq #38

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

kockan
Copy link

@kockan kockan commented Aug 21, 2023

Initial attempt at #3
Tested on an empty FASTQ as well as a ~10GB one. It seems to be working.

max-records fgbio fqtk
500000 186.07s user 19.88s system 130% cpu 2:37.46 total 221.10s user 27.20s system 82% cpu 4:59.44 total
1000000 194.21s user 21.99s system 114% cpu 3:08.19 total 226.42s user 26.46s system 84% cpu 4:57.96 total
2000000 196.37s user 13.59s system 170% cpu 2:02.82 total 228.77s user 24.76s system 89% cpu 4:44.37 total
4000000 212.98s user 13.57s system 202% cpu 1:52.13 total 230.88s user 20.35s system 119% cpu 3:29.84 total

Looks like fgbio is consistently faster. Should look into why. Library/sorting algorithm/my usage?

Should add unit tests at minimum + any feedback if dev interest in merging.

impl Command for SortFastq {
/// Executes the sort_fastq command
fn execute(&self) -> Result<()> {
let mut fq_reader = FastqReader::from_path(&self.input)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to use FastqReader::with_capacity(1024 * 1024) or something large like we do in the demux tool

LimitedBufferBuilder,
> = ExternalSorterBuilder::new()
.with_tmp_dir(Path::new("./"))
.with_buffer(LimitedBufferBuilder::new(self.max_records, true))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. any reason not to expose the tmp dir on the command line, and have it by default be None (use the system default)?
  2. I would think a big win for this tool would be to use with_threads_number. Shall we use multiple threads?
  3. I would also try to use with_rw_buf_size with 1024 * 1024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants