Replace bedtools sort with unix sort in BEDTOOLS_GENOMECOV #6063

adamrtalbot · 2024-07-30T09:01:11Z

`bedtools sort` uses a large amount of CPU and memory, but this module doesn't need the additional genome-based features of `bedtools`. Replacing it with Unix `sort` should speed up the process and make it many times more efficient.

First discovered by @pabloaledo

JoseEspinosa

We actually had issues with `bedtools sort` in nf-core/atacseq, see this and this PR. In our case, we switched to GNU sort so that we could use the `--parallel` and `--buffer-size` arguments, see here. The reason for these changes was that the previous implementation of the module was very memory greedy when dealing with big files resulting from merging files at the library level, which caused the pipeline to fail for many users.

adamrtalbot · 2024-07-30T09:13:54Z

Thanks @JoseEspinosa, do you think we should use --parallel and --buffer-size here as well? Will GNU sort be sufficient for big files?

JoseEspinosa · 2024-07-30T09:14:23Z

So I am totally in favour of this change. Also, I would suggest using GNU sort and adding an `args2` so that the above-mentioned arguments can be used to tune sort.

adamrtalbot · 2024-07-30T09:38:03Z

> Also, I would suggest using GNU sort and adding an `args2` so that the above-mentioned arguments can be used to tune sort.

Rather than complicate things with args2, perhaps we should just add those features if they don't cause any harm?

JoseEspinosa

If I am not wrong, the `sort` in `biocontainers/bedtools:2.31.1--hf5e1c6e_0` is not GNU sort, and thus the `--parallel` and `--buffer-size` arguments won't work. This should do the trick. Also, I'm not sure whether we should use `LC_ALL=C sort` to make sure that the sort order remains the same across different systems. Finally, for reference, our implementation in nf-core/atacseq was inspired by this implementation in hicar of a single sort process.
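To make the compatibility concern concrete, here is a small shell sketch that probes whether the `sort` on the PATH accepts `--parallel` and `--buffer-size` before using them. The helper names `sort_supports_gnu_flags` and `sort_tuning_opts`, the thread count, and the buffer size are illustrative assumptions, not what the module does.

```shell
#!/bin/sh
# Sketch only: both helper names below are hypothetical.

sort_supports_gnu_flags() {
    # Try the flags on empty input; a sort that lacks them is expected
    # to exit with a non-zero status.
    sort --parallel=1 --buffer-size=1M </dev/null >/dev/null 2>&1
}

sort_tuning_opts() {
    # $1 = number of threads, $2 = buffer size (for example 2G).
    # Prints the tuning flags only when the local sort accepts them.
    if sort_supports_gnu_flags; then
        printf '%s %s' "--parallel=$1" "--buffer-size=$2"
    fi
}

# Usage: the command substitution is left unquoted on purpose so the two
# flags are split into separate arguments (or vanish when unsupported).
printf 'chr2\t5\t9\t1\nchr1\t7\t8\t2\n' |
    LC_ALL=C sort $(sort_tuning_opts 2 16M) -k1,1 -k2,2n
```

Probing by running the command avoids parsing version strings, and `LC_ALL=C` keeps the byte-wise ordering identical whichever `sort` is found.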

adamrtalbot · 2024-07-30T09:45:50Z

> Rather than complicate things with args2, perhaps we should just add those features if they don't cause any harm?

The default biocontainer comes with a version of `sort` that does not support `--parallel`, so that won't work. The Seqera container (`community.wave.seqera.io/library/bedtools:2.31.1--8fd0e3802b0dc02e`) does, so we could switch to that, but I'm not certain what the current thinking is on switching containers. I notice that @JoseEspinosa switched to that container in the PRs above.

JoseEspinosa · 2024-07-30T09:49:05Z

> > Rather than complicate things with args2, perhaps we should just add those features if they don't cause any harm?
>
> The default biocontainer comes with a version of `sort` that does not support `--parallel`, so that won't work. The Seqera container (`community.wave.seqera.io/library/bedtools:2.31.1--8fd0e3802b0dc02e`) does, so we could switch to that, but I'm not certain what the current thinking is on switching containers. I notice that @JoseEspinosa switched to that container in the PRs above.

For me it is OK to add them as defaults; this is what I actually did in the atacseq module.
As for the container, I thought it was OK to use Wave for mulled containers, since if you want to use `sort` from conda-forge you will need to build a mulled container.

adamrtalbot · 2024-07-30T09:52:23Z

OK, let's go for the Seqera container. I'll wait to see if anyone else raises an objection before merging.

JoseEspinosa

Wonderful! 🚀

JoseEspinosa

One last thought: should we check whether `task.memory` is null (not set)?

adamrtalbot · 2024-07-30T10:26:43Z

> One last thought: should we check whether `task.memory` is null (not set)?

Good point. It should always be set, but I've just moved the logic to handle the case where it has been explicitly set to null.

SPPearce

Can you just add a comment as to why it has been changed please, so nobody decides to change it back in future ;)

ewels · 2024-07-30T13:37:32Z

> > One last thought: should we check whether `task.memory` is null (not set)?
>
> Good point. It should always be set, but I've just moved the logic to handle the case where it has been explicitly set to null.

Some HPC clusters refuse job submissions where memory has been set, so there is a valid use case for this being set to null deliberately.
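One way to picture the null-memory handling discussed here is the shell sketch below: emit a `--buffer-size` option only when a memory value is available. The helper name `buffer_size_opt`, the whole-gigabyte input, and the choice of half the allocation are assumptions for illustration; the module itself is written in Nextflow and its exact logic is not shown in this thread.

```shell
#!/bin/sh
# Sketch only: `buffer_size_opt` is a hypothetical helper, not module code.
# $1 = memory allocated to the task in whole GB, or empty if memory is unset.
buffer_size_opt() {
    mem_gb="$1"
    # No memory value known: print nothing so sort uses its own default.
    [ -n "$mem_gb" ] || return 0
    # Assumption: give sort half of the allocation, but never less than 1 GB.
    half=$(( mem_gb / 2 ))
    [ "$half" -ge 1 ] || half=1
    printf '%s\n' "--buffer-size=${half}G"
}

buffer_size_opt 8     # prints --buffer-size=4G
buffer_size_opt ""    # prints nothing
```

Leaving the flag out entirely when memory is unset means the same command still runs on clusters where memory is deliberately left null.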

adamrtalbot · 2024-07-30T13:45:13Z

We still have an open discussion around Seqera containers and private registries. Clearly, people are using them (great!) but support for private or offline registries isn't solved and it makes me wary of going too far down this route.

SPPearce · 2024-07-30T14:07:14Z

> We still have an open discussion around Seqera containers and private registries. Clearly, people are using them (great!) but support for private or offline registries isn't solved and it makes me wary of going too far down this route.

Are we actually having that discussion anywhere?

edmundmiller · 2024-07-30T15:45:06Z

Bedtools recommends this method itself https://bedtools.readthedocs.io/en/latest/content/tools/sort.html

So method-wise, no controversy.
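For readers who have not seen the approach, a minimal sketch of coordinate-sorting a bedGraph with plain `sort` might look like the following. The function name `sort_bedgraph` and the sample records are made up for illustration; the sort keys follow the `sort -k1,1 -k2,2n` pattern given in the bedtools documentation linked above.

```shell
#!/bin/sh
# Sketch only: `sort_bedgraph` is a hypothetical wrapper name.
# Reads tab-separated records on stdin and sorts by chromosome (column 1,
# byte-wise) and then by start position (column 2, numeric).
sort_bedgraph() {
    # LC_ALL=C forces byte-wise comparison so the order does not depend
    # on the locale of the machine running the job.
    LC_ALL=C sort -k1,1 -k2,2n
}

# Made-up example input.
printf 'chr2\t100\t200\t1\nchr10\t50\t60\t2\nchr1\t300\t400\t3\nchr1\t20\t30\t4\n' |
    sort_bedgraph
```

Note that byte-wise ordering places `chr10` before `chr2`, so the result is lexicographic by chromosome name rather than karyotypic.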

As for Seqera containers, @maxulysse has been copying them over to quay.io for the simplicity of having all the containers in one place (why that makes it easier I still don't understand). So I think that's your quick fix there.

> Are we actually having that discussion anywhere?

#5832

FriederikeHanssen · 2024-07-30T15:48:56Z

> As for Seqera containers, @maxulysse has been copying them over to quay.io for the simplicity of having all the containers in one place (why that makes it easier I still don't understand). So I think that's your quick fix there.

If people set the registry because they have their own, it's a lot easier because you don't need to overwrite random containers everywhere.

FriederikeHanssen · 2024-07-30T15:55:56Z

> Are we actually having that discussion anywhere?

@SPPearce several POCs here: seqeralabs/nf-aggregate#43, seqeralabs/nf-aggregate#44, seqeralabs/nf-aggregate#45, seqeralabs/nf-aggregate#46, and here: #5832, nf-core/tools#2408.

adamrtalbot requested review from edmundmiller, sruthipsuresh, drpatelh, sidorov-si and chris-cheshire as code owners July 30, 2024 09:01

JoseEspinosa reviewed Jul 30, 2024

add args2 for for customisation of GNU sort command

78647c0

Allows customisation of GNU

quoting for args2

73543a9

JoseEspinosa reviewed Jul 30, 2024

Use LC_ALL and default options for performance and consistency

b4b7466

JoseEspinosa approved these changes Jul 30, 2024

JoseEspinosa reviewed Jul 30, 2024

adamrtalbot added 2 commits July 30, 2024 11:25

Handle null memory value

c1204ba

Remove tags.yml

402c5bb

SPPearce approved these changes Jul 30, 2024

adamrtalbot added this pull request to the merge queue Aug 7, 2024

Merged via the queue into master with commit 9ba6b02 Aug 7, 2024
12 checks passed

adamrtalbot deleted the replace_bedtools_sort_with_unix_sort branch August 7, 2024 08:31

pinin4fjords mentioned this pull request Aug 7, 2024

Reduce resource usage for sort process in bedtools/genomecov nf-core/rnaseq#1350

Merged

11 tasks
