Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace bedtools sort with unix sort in BEDTOOLS_GENOMECOV #6063

Merged
merged 6 commits into from
Aug 7, 2024

Conversation

adamrtalbot
Copy link
Contributor

bedtools sort uses a large amount of CPUs and memory, but when using it here it doesn't require the additional genome based features of bedtools. Replacing it should speed up the process and make it many times more efficient.

First discovered by @pabloaledo

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

`bedtools sort` uses a large amount of CPUs and memory, but when using it here it doesn't require the  additional genome based features of `bedtools`. Replacing it should speed up the process and make it many times more efficient.
Copy link
Member

@JoseEspinosa JoseEspinosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually had issues with bedtools sort in nf-core/atacseq, see this and this PR. Actually, in our case, we switch to use gnu sort and be able to use the --parallel and --buffer-size arguments, see here. The reason for these changes was that the previous implementation of the module was very memory gready when dealing with big files resulting for the merge of the files at the library level, making the pipeline to fail for many users.

@adamrtalbot
Copy link
Contributor Author

Thanks @JoseEspinosa, do you think we should use --parallel and --buffer-size here as well? Will GNU sort be sufficient for big files?

@JoseEspinosa
Copy link
Member

We actually had issues with bedtools sort in nf-core/atacseq, see this and this PR. Actually, in our case, we switch to use gnu sort and be able to use the --parallel and --buffer-size arguments, see here. The reason for these changes was that the previous implementation of the module was very memory gready when dealing with big files resulting for the merge of the files at the library level, making the pipeline to fail for many users.

So I am totally for this change. Also, I would suggest to use gnu sort and add an args2 to be able to use the above-mentioned arguments to tune sort.

@adamrtalbot
Copy link
Contributor Author

Also, I would suggest to use gnu sort and add an args2 to be able to use the above-mentioned arguments to tune sort.

Rather than complicate things with args2, perhaps we should just add those features if they don't cause any harm?

Copy link
Member

@JoseEspinosa JoseEspinosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not wrong sort in biocontainers/bedtools:2.31.1--hf5e1c6e_0 is not gnu sort and thus, the arguments --parallel and the --buffer-size won't work. This should do the trick. Also, not sure if we should also include LC_ALL=C sort to be sure that the sort order remains the same across different systems. Finally for reference, our implementation in nf-core/atacseq was inspired by this implementation on hicar of single sort process.

@adamrtalbot
Copy link
Contributor Author

Rather than complicate things with args2, perhaps we should just add those features if they don't cause any harm?

the default biocontainer comes with a version of sort that does not support --parallel, so that won't work. The Seqera container (community.wave.seqera.io/library/bedtools:2.31.1--8fd0e3802b0dc02e) does, so we could switch to that but I'm not certain what the current thinking is regarding switching containers. I notice that @JoseEspinosa switched to using that container in the PRs above.

@JoseEspinosa
Copy link
Member

Rather than complicate things with args2, perhaps we should just add those features if they don't cause any harm?

the default biocontainer comes with a version of sort that does not support --parallel, so that won't work. The Seqera container (community.wave.seqera.io/library/bedtools:2.31.1--8fd0e3802b0dc02e) does, so we could switch to that but I'm not certain what the current thinking is regarding switching containers. I notice that @JoseEspinosa switched to using that container in the PRs above.

For me is OK to add them as default, this is what I actually did in the atacseq module.
For the container, I thought that for mulled containers it was ok to use wave, since if you want to use sort from conda-forge you will need to construct one mulled container.

@adamrtalbot
Copy link
Contributor Author

OK let's go for the Seqera container. I'll wait for someone else to throw an objection before merging.

Copy link
Member

@JoseEspinosa JoseEspinosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful! 🚀

Copy link
Member

@JoseEspinosa JoseEspinosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thought, should we control for task.memory being not null (not set) ?

@adamrtalbot
Copy link
Contributor Author

One last thought, should we control for task.memory being not null (not set) ?

Good point. It should always be set but I've just moved the logic to handle the case where it has been explicitly set to null.

Copy link
Contributor

@SPPearce SPPearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just add a comment as to why it has been changed please, so nobody decides to change it back in future ;)

@ewels
Copy link
Member

ewels commented Jul 30, 2024

One last thought, should we control for task.memory being not null (not set) ?

Good point. It should always be set but I've just moved the logic to handle the case where it has been explicitly set to null.

Some HPC clusters refuse job submissions where memory has been set, so there is a valid use case for this being set to null deliberately.

@adamrtalbot
Copy link
Contributor Author

We still have an open discussion around Seqera containers and private registries. Clearly, people are using them (great!) but support for private or offline registries isn't solved and it makes me wary of going too far down this route.

@SPPearce
Copy link
Contributor

We still have an open discussion around Seqera containers and private registries. Clearly, people are using them (great!) but support for private or offline registries isn't solved and it makes me wary of going too far down this route.

Are we actually having that discussion anywhere?

@edmundmiller
Copy link
Contributor

Bedtools recommends this method itself https://bedtools.readthedocs.io/en/latest/content/tools/sort.html

So method wise no controversy.

Seqera containers, @maxulysse has been copying them over to quay.io for the simplicity of having all the containers in one place (Why that makes it easier I still don't understand). So I think that's your quick fix there.

Are we actually having that discussion anywhere?

#5832

@FriederikeHanssen
Copy link
Contributor

Seqera containers, @maxulysse has been copying them over to quay.io for the simplicity of having all the containers in one place (Why that makes it easier I still don't understand). So I think that's your quick fix there.

if people set the registry because they have their own it's a lot easier because you don't need to overwrite random containers everywhere

@FriederikeHanssen
Copy link
Contributor

Are we actually having that discussion anywhere?

@SPPearce several POCs here seqeralabs/nf-aggregate#43 seqeralabs/nf-aggregate#44 seqeralabs/nf-aggregate#45 seqeralabs/nf-aggregate#46 and here: #5832, nf-core/tools#2408

@adamrtalbot adamrtalbot added this pull request to the merge queue Aug 7, 2024
Merged via the queue into master with commit 9ba6b02 Aug 7, 2024
12 checks passed
@adamrtalbot adamrtalbot deleted the replace_bedtools_sort_with_unix_sort branch August 7, 2024 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants