--mirna_gtf for organism with no miRBase GFF file #329

Closed
OliverH96 opened this issue Mar 21, 2024 · 7 comments

@OliverH96

OliverH96 commented Mar 21, 2024

Description of the bug

I'm using sheep miRNA data. miRBase contains a few entries for sheep miRNAs but does not provide a GFF file on its download page. I instead used a GFF of sheep miRNAs from the RumimiR database (https://rumimir.sigenae.org/), but hit an error at the mirtop_quant step.

I've uploaded the GFF file I used, with the file extension changed to .txt to allow uploading.

My params file:
input: '/gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/rawData/Fastq/F1_SeminalPlasma_Samplesheet.csv'
outdir: '/gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/smrnaseq_output'
with_umi: false
mirtrace_species: 'oar'
fasta: '/gpfs01/home/sbzoh//refGenome/Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.dna.toplevel.fasta'
mirna_gtf: '/gpfs01/home/sbzoh//refGenome/rumimir_sheep.gff'
mature: '/gpfs01/home/sbzoh//refGenome/mature.fa'
hairpin: '/gpfs01/home/sbzoh//refGenome/hairpin.fa'
filter_contamination: false
skip_mirdeep: true

Command used and terminal output

## Command used
nextflow run nf-core/smrnaseq -profile singularity -params-file params.yaml

## Tail of output containing error
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/smrnaseq] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT'

Caused by:
  Process `NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT` terminated with an error exit status (1)

Command executed:

  #Cleanup the GTF if mirbase html form is broken
  GTF="rumimir_sheep.gff"
  sed 's/&gt;/>/g' $GTF | sed 's#<br>#\n#g' | sed 's#</p>##g' | sed 's#<p>##g' | sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' > ${GTF}_html_cleaned.gtf
  mirtop gff --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar ./bams/*
  mirtop counts --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar --add-extra --gff mirtop/mirtop.gff
  mirtop export --format isomir --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf --sps oar -o mirtop mirtop/mirtop.gff
  mirtop stats mirtop/mirtop.gff --out mirtop/stats
  mv mirtop/stats/mirtop_stats.log mirtop/stats/full_mirtop_stats.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT":
      mirtop: $(echo $(mirtop --version 2>&1) | sed 's/^.*mirtop //')
  END_VERSIONS

Command exit status:
  1

Command output:
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']

Command error:
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  03/20/2024 06:34:35 INFO Run annotation
  03/20/2024 06:34:35 ERROR Database not found in --mirna rumimir_sheep.gff_html_cleaned.gtf. Use --database argument to add a custom source.
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']
  Traceback (most recent call last):
    File "/usr/local/bin/mirtop", line 10, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/mirtop/command_line.py", line 31, in main
      reader(kwargs["args"])
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/__init__.py", line 24, in reader
      database = mapper.guess_database(args)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 23, in guess_database
      return _guess_database_file(args.gtf, args.database)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 40, in _guess_database_file
      raise ValueError("Database not found in %s header" % gff)
  ValueError: Database not found in rumimir_sheep.gff_html_cleaned.gtf header

Work dir:
  /gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/work/d8/42e9ee613e17eb83f5262cfae51a33

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Relevant files

nextflow.log
rumimir_sheep.txt

System information

Nextflow version (23.10.1)
Hardware (HPC)
Executor (slurm)
Container engine: (Singularity)
OS (CentOS Linux)
Version of nf-core/smrnaseq (2.3.0)

@OliverH96 OliverH96 added the bug Something isn't working label Mar 21, 2024
@OliverH96
Author

I tried again on the latest version (2.3.1) and am getting the same error.

@christopher-mohr
Contributor

Hi @OliverH96,
for now you could try to pass the additional argument --database to mirtop using a custom config. This would require adding something like:

process {
    withName: 'MIRTOP_QUANT' {
        ext.args = "--database RumimiR"
    }
}

You have to check if RumimiR is the term used in your provided gff. As far as I understand, mirtop searches for known tags in the gff file and therefore fails in your case.
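
To illustrate the idea of searching the GFF header for a known database tag, here is a minimal Python sketch. It is not mirtop's actual code: the function name `guess_database`, the tuple `KNOWN_DATABASES`, and the case-insensitive matching are assumptions made for illustration. The two database names are the ones mirtop mentions in its log output later in this thread.

```python
from typing import Iterable, Optional

# Names taken from the mirtop log message quoted in this thread;
# the matching logic below is an illustrative assumption, not mirtop's code.
KNOWN_DATABASES = ("miRBase", "MirGeneDB")


def guess_database(header_lines: Iterable[str],
                   override: Optional[str] = None) -> Optional[str]:
    """Return `override` if given, else the first known database
    named in the leading '#' comment lines, else None."""
    if override:
        # Plays the role of an explicit --database value.
        return override
    for line in header_lines:
        if not line.startswith("#"):
            break  # the header ends at the first feature line
        for db in KNOWN_DATABASES:
            if db.lower() in line.lower():
                return db
    return None
```

Under these assumptions, a GFF whose comment lines never mention one of the known names yields `None` unless an explicit database name is supplied, which matches the "Database not found" failure described above.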

@OliverH96
Author

OliverH96 commented May 14, 2024

Apologies for getting back to you so late. This did seem to advance the pipeline slightly, but I am now getting a different error:

ERROR ~ Error executing process > 'NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT'

Caused by:
  Process `NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT` terminated with an error exit status (1)

Command executed:

  #Cleanup the GTF if mirbase html form is broken
  GTF="rumimir_sheep.gff"
  sed 's/&gt;/>/g' $GTF | sed 's#<br>#\n#g' | sed 's#</p>##g' | sed 's#<p>##g' | sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' > ${GTF}_html_cleaned.gtf
  mirtop gff --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar ./bams/*
  mirtop counts --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar --add-extra --gff mirtop/mirtop.gff
  mirtop export --format isomir --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf --sps oar -o mirtop mirtop/mirtop.gff
  mirtop stats mirtop/mirtop.gff --out mirtop/stats
  mv mirtop/stats/mirtop_stats.log mirtop/stats/full_mirtop_stats.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT":
      mirtop: $(echo $(mirtop --version 2>&1) | sed 's/^.*mirtop //')
  END_VERSIONS

Command exit status:
  1

Command output:
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']

Command error:
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  05/02/2024 05:12:45 INFO Run annotation
  05/02/2024 05:12:45 INFO Database different than miRBase or MirGeneDB
  05/02/2024 05:12:45 INFO If you get an error when loading,
  05/02/2024 05:12:45 INFO report it to https://github.com/miRTop/mirtop/issues
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']
  Traceback (most recent call last):
    File "/usr/local/bin/mirtop", line 10, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/mirtop/command_line.py", line 31, in main
      reader(kwargs["args"])
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/__init__.py", line 28, in reader
      matures = mapper.read_gtf_to_precursor(args.gtf)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 172, in read_gtf_to_precursor
      mapped = read_gtf_to_precursor_mirbase(gtf)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 333, in read_gtf_to_precursor_mirbase
      id_dict[idname[0]] = name[0]
  IndexError: list index out of range

Work dir:
  /gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/work/bb/f388eaca99ec7268114f74a3fb2490

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
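
The `IndexError` at `id_dict[idname[0]] = name[0]` suggests that at least one of the two lists built from a feature line was empty. The following Python sketch is a diagnostic under the assumption that the miRBase-style reader expects `ID=` and `Name=` attributes in the last column of each feature line. The helper names `parse_attributes` and `find_lines_missing_keys` are hypothetical and are not part of mirtop.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def parse_attributes(column: str) -> Dict[str, str]:
    """Split a GFF3-style attribute column ('key=value;key=value') into a dict."""
    attrs: Dict[str, str] = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_lines_missing_keys(
    gff_lines: Iterable[str],
    required: Sequence[str] = ("ID", "Name"),  # assumed requirement, see lead-in
) -> List[Tuple[int, List[str]]]:
    """Return (line number, missing keys) for each feature line lacking a required key."""
    problems: List[Tuple[int, List[str]]] = []
    for number, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        # A feature line should have nine tab-separated columns.
        attrs = parse_attributes(cols[-1]) if len(cols) >= 9 else {}
        missing = [key for key in required if key not in attrs]
        if missing:
            problems.append((number, missing))
    return problems
```

Running `find_lines_missing_keys(open("rumimir_sheep.gff"))` would list the lines whose attribute column does not follow that layout. An empty result would mean the cause lies elsewhere.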

@nschcolnicov
Contributor

Hi @OliverH96, this one is a bit tough to debug without being able to run the pipeline with the exact settings that you used. Would it be possible for you to share a tar.gz file with the input files, or truncated versions of them, here?
The ones I need are input, fasta, mirna_gtf, mature, and hairpin.

@atrigila
Contributor

atrigila commented Oct 1, 2024

Please also use the latest dev version, because this issue might have been solved when updating to the latest mirtop.

@nschcolnicov
Contributor

I recently tried to debug this using the following command:

nextflow run nf-core/smrnaseq -r dev -latest -profile docker --input https://github.com/nf-core/test-datasets/raw/smrnaseq/samplesheet/v2.0/samplesheet-full.csv --outdir results --with_umi false --mirtrace_species oar --fasta ../files/Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.dna.toplevel.fa.gz --mirna_gtf ../files/rumimir_sheep.gff --mature https://github.com/nf-core/test-datasets/raw/smrnaseq/miRBase/mature.fa --hairpin https://github.com/nf-core/test-datasets/raw/smrnaseq/miRBase/hairpin.fa --filter_contamination false --skip_mirdeep true -c ../files/rumimir.config -resume

The fasta was downloaded from https://ftp.ensembl.org/pub/release-112/fasta/ovis_aries_rambouillet/dna/Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.dna.toplevel.fa.gz
The gff was downloaded from https://rumimir.sigenae.org/

The rumimir.config file contains the following:

process {
    withName: 'NFCORE_SMRNASEQ:MIRNA_QUANT:BAM_STATS_MIRNA_MIRTOP:MIRTOP_GFF' {
        ext.args = "--database RumimiR"
    }
}

However, I am still getting the missing-database error, so I opened a ticket for it in mirtop: miRTop/mirtop#90

@nschcolnicov
Contributor

I created a PR to fix the issue with the database argument, but I noticed that even when the database argument is parsed properly, mirtop has no parser capable of reading the GFF file from RumimiR. So currently, databases other than MirGeneDB and miRBase are not supported by the pipeline.
miRTop/mirtop#92

@github-project-automation github-project-automation bot moved this from On Hold to Done in smrnaseq Nov 1, 2024