Check that GZI index provided if using a gzipped FASTA #1741

Open
pontushojer opened this issue Nov 29, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

pontushojer commented Nov 29, 2024

Description of feature

I started a sarek run (v3.4.2) with a custom reference supplied as a bgzipped FASTA. The run started normally from FASTQs and hit no errors until the MarkDuplicates step. I had forgotten to copy the index file (*.fasta.gz.gzi) into the same folder as the FASTA, which caused the step to fail just before finishing 🤦. See the error message below.

  [Thu Nov 28 20:40:47 GMT 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 134.77 minutes.
  Runtime.totalMemory()=285212672
  [E::bgzf_index_load] Error opening GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz.gzi : No such file or directory
  [E::bgzf_open_ref] Unable to load .gzi index 'GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz.gzi'
  [E::refs_load_fai] Failed to open reference file 'GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz'
  [E::hts_open_format] Failed to open file "OPM2.md.cram" : Invalid argument
  samtools view: failed to open "OPM2.md.cram" for writing: Invalid argument

This is the relevant part of my parameter file:

fasta: /proj/ngi2016004/private/strategic_proj/SR_23_02_Element_vs_Illumina/resources/GRCh38_GIABv3/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz
fasta_fai: /proj/ngi2016004/private/strategic_proj/SR_23_02_Element_vs_Illumina/resources/GRCh38_GIABv3/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz.fai
igenomes_ignore: true
bwa: /proj/ngi2016004/private/strategic_proj/SR_23_02_Element_vs_Illumina/resources/GRCh38_GIABv3/BWAIndex

It would be great to have a parameter check at startup so that, when a bgzipped FASTA (*.fasta.gz) is provided, the pipeline verifies that the corresponding index (*.fasta.gz.gzi) is present.
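To illustrate the kind of check I mean, here is a minimal Python sketch. It is not sarek code (the pipeline itself is Nextflow), the function name `check_gzi_present` is made up for this example, and it assumes the index sits next to the FASTA with `.gzi` appended to the full file name.

```python
from pathlib import Path


def check_gzi_present(fasta: str) -> None:
    """Raise early if a compressed FASTA has no .gzi index beside it.

    Hypothetical helper: assumes the index is named <fasta>.gzi and lives
    in the same folder as the FASTA.
    """
    path = Path(fasta)
    # Uncompressed FASTA files do not need a .gzi index.
    if path.suffix != ".gz":
        return
    gzi = Path(f"{path}.gzi")
    if not gzi.is_file():
        raise FileNotFoundError(
            f"Compressed FASTA '{path.name}' was provided, "
            f"but the index '{gzi.name}' was not found next to it."
        )
```

Run once at startup with the value of `fasta`, this would have stopped my run before any alignment work was done.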

Some other considerations if this is hard to implement:

  • If the .gzi file is missing, it could be generated automatically as part of the build process.
  • The MarkDuplicates step should check, before it starts, that an index file is present whenever a bgzipped FASTA is provided, so that the run fails earlier.
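For the first option, a rough Python sketch of deciding whether an index needs to be built is below. The helper name `gzi_build_command` is hypothetical, and I am assuming htslib's `bgzip` is on the PATH; its `-r` (reindex) option writes a `.gzi` for an existing bgzipped file.

```python
from pathlib import Path
from typing import Optional


def gzi_build_command(fasta: str) -> Optional[list[str]]:
    """Return a command that builds the missing .gzi, or None if not needed.

    Hypothetical helper: assumes the index is named <fasta>.gzi and that
    the input really is bgzip-compressed (plain gzip will not work).
    """
    path = Path(fasta)
    # Nothing to do for uncompressed FASTA or when the index already exists.
    if path.suffix != ".gz" or Path(f"{path}.gzi").is_file():
        return None
    # `bgzip -r` (re)indexes an existing BGZF file, writing <fasta>.gzi.
    return ["bgzip", "-r", str(path)]
```

If the function returns a command, it could be run with `subprocess.run(cmd, check=True)` before the first step that needs the reference.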

Edit: forgot to add info about sarek version
Edit2: gzip --> bgzip

pontushojer added the enhancement label Nov 29, 2024
pontushojer (Author) commented

An update on this: the .gzi is now in the folder with the bgzipped FASTA reference, but I still run into this error. It seems that samtools specifically requires this .gzi file when converting the output to CRAM; see the related issue samtools/samtools#804.

Looking at the relevant code (linked below), it seems that the .gzi index is not included in the work folder, which causes the issue.

https://github.com/nf-core/sarek/blob/22c7315e9c9ccccf7658e9f18e36f99cd67ebfb9/modules/nf-core/gatk4/markduplicates/main.nf#L10C1-L14C1
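To show what I think is going wrong, here is a small Python sketch of the staging idea, outside Nextflow. It assumes a task only sees files that are explicitly linked into its work folder, so samtools looks for the .gzi beside the staged FASTA rather than in the original resource folder. The function `stage_reference` is hypothetical and only illustrates that the FASTA, .fai and .gzi need to travel together; the real fix would presumably be in the module's inputs, which I have not tried.

```python
import os
from pathlib import Path


def stage_reference(fasta: str, work_dir: str) -> list[str]:
    """Symlink a FASTA and its existing sidecar indices into work_dir.

    Hypothetical helper: returns the names of the files that were staged,
    so a caller can verify that the .gzi made it into the work folder.
    """
    src = Path(fasta).resolve()
    dest = Path(work_dir)
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    # "" is the FASTA itself; the other suffixes are its index files.
    for suffix in ("", ".fai", ".gzi"):
        candidate = Path(f"{src}{suffix}")
        if candidate.is_file():
            link = dest / candidate.name
            if not link.exists():
                os.symlink(candidate, link)
            staged.append(link.name)
    return staged
```

If only the FASTA and .fai are linked, the work folder looks exactly like the one in my failed task, and the `bgzf_index_load` error above is what I would expect.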
