Formatting a GTF annotation to work with TALON

Fairlie Reese edited this page Mar 19, 2020 · 2 revisions

GTFs are notoriously difficult to work with because the format varies so much from source to source. TALON was developed to work with GENCODE GTF annotations, but we know that there are many other flavors of GTF out there, and we have tried to accommodate them. Listed here are the varieties of GTF we've encountered, and how to format them to work with TALON.
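If you're not sure which variety of GTF you have, one quick check is to count the feature types in the third column. The sketch below is not part of TALON; the function name `gtf_feature_counts` and the file name `annotation.gtf` are made up for illustration.

```shell
# Count how many lines of each feature type (column 3) a GTF contains.
# Header/comment lines starting with '#' are skipped.
gtf_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }
                END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Usage (annotation.gtf is a placeholder path):
#   gtf_feature_counts annotation.gtf
```

If the output lists only `exon`, the GTF is the first variety below; if `gene` or `transcript` is missing, it is one of the partial varieties.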

  1. GTF has only exon entries (no gene or transcript entries)

The script talon_reformat_gtf will fix this type of GTF. It infers the transcript and gene entries from the last column of each exon entry (the attribute column, which holds fields such as gene_id and transcript_id). You can run it as follows:

talon_reformat_gtf -g <gtf to fix>
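To give a feel for what "inferring" an entry from exon lines means, here is a rough sketch that derives one span per transcript by taking the smallest start and largest end among its exons. This only illustrates the idea and is an assumption about the general approach, not the actual code inside talon_reformat_gtf; the function name `infer_transcript_spans` is hypothetical.

```shell
# Sketch: derive one span per transcript from the exon lines of a GTF.
# Prints: transcript_id, chromosome, strand, start, end (tab-separated).
infer_transcript_spans() {
    awk -F'\t' '
        !/^#/ && $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # match() sets RSTART and RLENGTH; skip the key and both quotes
            id = substr($9, RSTART + 15, RLENGTH - 16)
            if (!(id in lo) || $4 + 0 < lo[id]) lo[id] = $4 + 0
            if (!(id in hi) || $5 + 0 > hi[id]) hi[id] = $5 + 0
            chrom[id] = $1
            strand[id] = $7
        }
        END {
            for (id in lo)
                print id "\t" chrom[id] "\t" strand[id] "\t" lo[id] "\t" hi[id]
        }
    ' "$1" | sort
}

# Usage (annotation.gtf is a placeholder path):
#   infer_transcript_spans annotation.gtf
```

The same grouping on gene_id instead of transcript_id would give a span per gene.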

  2. GTF has only transcript and exon entries (no gene entries)

Currently, talon_reformat_gtf only fixes GTFs that have neither gene nor transcript entries. Therefore, the hacky way to format these GTFs is to first remove the gene lines, then run talon_reformat_gtf as follows:

awk '($3 != "gene")' <gtf to fix> > <gtf to fix_no_genes>
talon_reformat_gtf -gtf <gtf to fix_no_genes>
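Since talon_reformat_gtf expects a GTF without gene or transcript entries, a more general variant of the awk filter above is to keep only the exon lines, whatever else the file contains. This is a sketch under that assumption and is not part of TALON; the function name `keep_exons_only` and the file names are placeholders. Note that it also drops other feature types such as CDS or UTR lines, which may or may not be what you want.

```shell
# Keep header/comment lines and exon lines; drop every other feature type.
keep_exons_only() {
    awk -F'\t' '/^#/ || $3 == "exon"' "$1"
}

# Usage (file names are placeholders):
#   keep_exons_only annotation.gtf > annotation_exons_only.gtf
#   talon_reformat_gtf -g annotation_exons_only.gtf
```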

Currently, these are the only supported GTF formats. If neither approach works for your annotation, don't hesitate to reach out and we can help support more of your data!

Below is the official help message from talon_reformat_gtf:

talon_reformat_gtf --h

Usage: talon_reformat_gtf [options]

Options:
-h, --help           Show help message and exit
-g, -gtf             gtf to fix