
Output file documentation

The following document provides further details on key output files generated by the pipeline. The primary focus is on the outputs storing novel last exons, summarised quantifications and differential usage results. The outputs of Salmon (<output_dir>/salmon) and StringTie (<output_dir>/stringtie) are the standard outputs of the respective tools and are not discussed here. For further details on these outputs, please consult the documentation of the respective tools (Salmon, StringTie).

Sections:

StringTie

tx_filtering

$ tree test_data_output/tx_filtering/
test_data_output/tx_filtering/
├── all_conditions.merged_last_exons.3p_end_filtered.gtf
├── group1
│   ├── group1.all_samples
│   ├── group1.all_samples.loci
│   ├── group1.all_samples.tracking
│   ├── group1.merged_last_exons.3p_end_filtered.gtf
│   ├── group1.merged_last_exons.3p_end_filtered.match_stats.tsv
│   ├── group1.merged_last_exons.3p_end_filtered.not_valid.class_summary_counts.tsv
│   ├── group1.merged_last_exons.3p_end_filtered.valid.class_summary_counts.tsv
│   ├── group1.merged_last_exons.gtf
│   ├── group1_sample_01.last_exons.gtf
│   ├── group1_sample_02.last_exons.gtf
│   ├── group1_sample_03.last_exons.gtf
│   └── gtf_list_group1.txt
├── group2
│   ├── group2.all_samples
│   ├── group2.all_samples.loci
│   ├── group2.all_samples.tracking
│   ├── group2.merged_last_exons.3p_end_filtered.gtf
│   ├── group2.merged_last_exons.3p_end_filtered.match_stats.tsv
│   ├── group2.merged_last_exons.3p_end_filtered.not_valid.class_summary_counts.tsv
│   ├── group2.merged_last_exons.3p_end_filtered.valid.class_summary_counts.tsv
│   ├── group2.merged_last_exons.gtf
│   ├── group2_sample_04.last_exons.gtf
│   ├── group2_sample_05.last_exons.gtf
│   ├── group2_sample_06.last_exons.gtf
│   └── gtf_list_group2.txt
├── novel_ref_combined.info.tsv
├── novel_ref_combined.last_exons.gtf
├── novel_ref_combined.le2gene.tsv
├── novel_ref_combined.le2genename.tsv
├── novel_ref_combined.quant.last_exons.gtf
├── novel_ref_combined.tx2gene.tsv
└── novel_ref_combined.tx2le.tsv

GTF of predicted last exons passing filtering across all conditions - all_conditions.merged_last_exons.3p_end_filtered.gtf

This GTF file contains all last exons passing the reference polyA site and polyA signal filtering, pooled from all conditions. It is the combination of group1/group1.merged_last_exons.3p_end_filtered.gtf and group2/group2.merged_last_exons.3p_end_filtered.gtf. Note that these are complete novel last exon sequences; for the regions used for quantification, refer to novel_ref_combined.quant.last_exons.gtf.

The coordinates follow standard GTF convention. The attribute fields are described below. Note that I consider this to be overly populated, and future releases are likely to simplify this output:

  • gene_id - gene ID generated by StringTie.
    • PAPA.group1_sample_01.1 - 'PAPA' is hardcoded. 2nd field corresponds to sample name and 3rd field to unique gene ID computed by StringTie
  • transcript_id - transcript ID generated by StringTie.
    • PAPA.group1_sample_01.1.7 - 'PAPA' is hardcoded. 2nd field corresponds to sample name and 3rd field to unique gene ID computed by StringTie. Final field corresponds to unique transcript 'number' computed by StringTie
  • cov - coverage value computed by StringTie
  • exon_number - left-to-right (i.e. non strand aware) exon number as calculated by StringTie (TODO: DOUBLE CHECK THIS)
  • Start_ref - Start coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
  • End_ref - End coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
  • gene_name_ref - gene name of overlapping annotated gene
  • transcript_id_ref - transcript ID of matching reference exon (if an extension event) or intron (if a 'spliced' event)
  • gene_id_ref - gene_id of overlapping annotated gene
  • 3p_extension_length - Length of extension relative to 3'end of annotated exon. 'NULL' if event is not an extension of an annotated exon (i.e. is a spliced event)
  • event_type - inferred 'event type' for novel last exon isoform. Can take on the following values:
    • first_exon_spliced - most gene-proximal distinct last exon (i.e. it is the 'shortest' last exon isoform)
    • internal_exon_spliced - distinct last exon that is not most proximal/distal with respect to gene
    • last_exon_spliced - most gene-distal last exon
    • first_exon_extension - novel 3'end extension of an annotated first exon (i.e. a 'composite' or 'bleedthrough' event)
    • internal_exon_extension - novel 3'end extension of an annotated internal exon (i.e. a 'composite' or 'bleedthrough' event)
    • last_exon_extension - novel 3'end extension of an annotated last exon (i.e. a 3'UTR extension)
  • sample_id - sample in which event was identified. Corresponds to sample_name from sample table
  • last_exon_id - intermediate last exon isoform identifier grouping overlapping last exons.
  • atlas_filter - whether event passes (1) or fails (0) maximum distance from annotated polyA site (PolyASite) filter
  • nearest_atlas_distance - distance from last exon 3'end to the nearest annotated polyA site (PolyASite)
  • Name - ID for nearest annotated polyA site (PolyASite). Corresponds to the 'Name' field in provided reference polyA site BED file
  • Start_atlas - Start coordinate from nearest annotated polyA site from reference BED file
  • End_atlas - End coordinate from nearest annotated polyA site from reference BED file
  • motif_filter - whether event passes (1) or fails (0) polyA signal motif filter
  • pas_motifs - identified polyA signal motifs. 'not_found' if no motifs found in specified terminal window of predicted last exons
    • -18_AATAAA - <distance_from_3'end>_<polyA_signal_sequence>. '-' denotes that position is upstream of predicted 3'end ('End' coordinate of GTF entry). If multiple motifs found they are separated by comma characters
  • min_motif_3p_deviation - Distance deviation of identified motifs from the expected distance upstream of 3'ends (user-specified). Missing if motifs are not found in user-specified terminal window.
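    • min(|distance_from_3'end - expected_distance|), where expected_distance is user-defined and distance_from_3'end is the position of each identified motif relative to the predicted 3'end (as reported in pas_motifs)
    • Where multiple motifs are identified, the minimum distance deviation is reported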
  • min_motif_3p_distance - Distance from predicted 3'end of identified motif reported in min_motif_3p_deviation. Missing if motifs are not found in user-specified terminal window.
    • e.g. if -18_AATAAA was the reported motif, -18 would be value for this key
  • condition_id - condition key from the 'condition' column of the sample table for the corresponding sample_id
  • End_le - appended if atlas_filter = 1 and nearest_atlas_distance != 0. Corresponds to the original End coordinate of predicted last exon prior to updating to matched reference polyA site
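
As a rough illustration of how pas_motifs, min_motif_3p_deviation and min_motif_3p_distance relate to one another, the sketch below parses a pas_motifs value and picks the motif closest to the expected position. It is not the pipeline's own code: the function names are hypothetical, and it assumes expected_distance is supplied on the same signed scale as the motif positions (negative = upstream of the predicted 3'end).

```python
def parse_pas_motifs(value):
    """Parse a pas_motifs string such as '-18_AATAAA,-30_ATTAAA'.

    Returns a list of (distance_from_3p_end, motif_sequence) tuples,
    or an empty list if the value is 'not_found'.
    """
    if value == "not_found":
        return []
    motifs = []
    for entry in value.split(","):
        distance, sequence = entry.split("_", 1)
        motifs.append((int(distance), sequence))
    return motifs


def min_motif_deviation(value, expected_distance):
    """Return (deviation, distance) for the motif closest to expected_distance.

    ASSUMPTION: expected_distance is signed like the motif positions.
    Returns None when no motif was found.
    """
    motifs = parse_pas_motifs(value)
    if not motifs:
        return None
    distance, _ = min(motifs, key=lambda m: abs(m[0] - expected_distance))
    return abs(distance - expected_distance), distance


# Illustrative values only (expected distance of -21 is made up)
result = min_motif_deviation("-18_AATAAA,-30_ATTAAA", -21)
print(result)
```

Here the first element of the returned pair plays the role of min_motif_3p_deviation and the second that of min_motif_3p_distance.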

Assorted notes:

  • This file can contain duplicated intervals if exactly the same coordinate is predicted across multiple replicates (or even multiple conditions). I am likely to address this in future releases.
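
Until duplicated intervals are handled by the pipeline, they can be collapsed downstream. The following is a minimal sketch, assuming duplicates are defined by identical chromosome, start, end and strand in the standard GTF columns; the function name and the example lines are made up for illustration. Note that keeping only the first occurrence discards the attributes (e.g. sample_id) of the later copies.

```python
def unique_intervals(gtf_lines):
    """Keep the first GTF line seen for each (chrom, start, end, strand)."""
    seen = set()
    kept = []
    for line in gtf_lines:
        # Skip comment and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF columns: 0=seqname, 3=start, 4=end, 6=strand
        key = (fields[0], fields[3], fields[4], fields[6])
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Made-up example lines: the first two share identical coordinates
lines = [
    'chr1\tStringTie\texon\t1400\t1600\t.\t+\t.\tsample_id "group1_sample_01";\n',
    'chr1\tStringTie\texon\t1400\t1600\t.\t+\t.\tsample_id "group2_sample_06";\n',
    'chr1\tStringTie\texon\t1950\t2599\t.\t+\t.\tsample_id "group1_sample_03";\n',
]
deduplicated = unique_intervals(lines)
print(len(deduplicated))
```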

Combined novel and annotated last exon references - novel_ref_combined.last_exons.gtf and novel_ref_combined.quant.last_exons.gtf

GTF files containing combination of novel and annotated last exons. novel_ref_combined.quant.last_exons.gtf represents the reference GTF used to construct the Salmon index. novel_ref_combined.last_exons.gtf instead represents the complete last exon sequences prior to trimming to unique regions (i.e. no overlaps with annotated internal exons).

The coordinates follow standard GTF convention. The attribute fields are identical between the two files and as described below. Note that I also consider these fields to be overly populated, and future releases are likely to simplify this output:

  • gene_id - gene ID for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
  • transcript_id - transcript ID for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
  • exon_number - exon number for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
  • Start_ref - Start coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
  • End_ref - End coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
  • event_type - inferred 'event type' for last exon isoform. See above mentions for description of possible values
  • gene_name - Only present if interval originates from annotation. Redundant with ref_gene_name
  • region_rank - takes on first, internal or last, based on the position of the last exon isoform in the gene (first = 5'most, last = 3'most)
  • transcript_id_ref - Only present if interval originates from annotation. Redundant with transcript_id
  • ref_gene_id - reference gene ID for last exon isoform. Propagated into ID mapping tables (see below)
  • ref_gene_name - reference gene name for last exon isoform. Propagated into ID mapping tables (see below)
  • le_number - corresponds to 5'-3' rank of unique last exon isoforms within gene
  • le_id - unique identifier for last exon isoform. Any last exon intervals with overlapping sequence are grouped into a common identifier.
    • id_ref_gene_2_1 - <ref_gene_id>_<le_number> - reference gene ID suffixed with 'last exon number' (5'-3' rank within gene)

Assorted notes:

  • This file can contain duplicated intervals (annotated or novel), including those with closely spaced 3'ends (likely the result of imprecise cleavage). I am likely to address this in future releases.

ID mapping files - novel_ref_combined.tx2gene.tsv, novel_ref_combined.tx2le.tsv, novel_ref_combined.le2gene.tsv, novel_ref_combined.le2genename.tsv

All the following files follow a common structure, with columns denoting the 'identifier type':

  • novel_ref_combined.tx2gene.tsv - transcript ID and gene ID
    • transcript_id - corresponds to transcript_id key from attribute field in combined GTFs
    • gene_id - corresponds to ref_gene_id key from attribute field in combined GTFs
  • novel_ref_combined.tx2le.tsv - transcript ID and 'le_id'
    • transcript_id - corresponds to transcript_id key from attribute field in combined GTFs
    • le_id - corresponds to le_id in combined GTFs
  • novel_ref_combined.le2gene.tsv - le_id and gene ID
    • le_id - corresponds to le_id in combined GTFs
    • gene_id - corresponds to ref_gene_id key from attribute field in combined GTFs
  • novel_ref_combined.le2genename.tsv - le_id and gene name
    • le_id - corresponds to le_id in combined GTFs
    • gene_name - corresponds to ref_gene_name key from attribute field in combined GTFs
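
The mapping tables can be chained, e.g. going from transcript to last exon isoform to gene. The sketch below assumes each file is tab-separated with a header row using the column names listed above (an assumption worth checking against your own output); in-memory strings stand in for the files, and read_mapping is a hypothetical helper.

```python
import csv
import io


def read_mapping(handle, key_col, value_col):
    """Read a two-column TSV (with header) into a dict of key -> value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[key_col]: row[value_col] for row in reader}


# Stand-ins for novel_ref_combined.tx2le.tsv and novel_ref_combined.le2gene.tsv
tx2le_text = (
    "transcript_id\tle_id\n"
    "PAPA.group2_sample_04.1.1\tid_ref_gene_1_1\n"
    "ref_g1_tr_1\tid_ref_gene_1_4\n"
)
le2gene_text = (
    "le_id\tgene_id\n"
    "id_ref_gene_1_1\tid_ref_gene_1\n"
    "id_ref_gene_1_4\tid_ref_gene_1\n"
)

tx2le = read_mapping(io.StringIO(tx2le_text), "transcript_id", "le_id")
le2gene = read_mapping(io.StringIO(le2gene_text), "le_id", "gene_id")

# Chain the two tables: transcript -> le_id -> gene
tx2gene = {tx: le2gene[le] for tx, le in tx2le.items()}
print(tx2gene)
```

With real files, replace the io.StringIO objects with open(path, newline="").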

Metadata for all last exon isoforms - novel_ref_combined.info.tsv

Simplified table containing metadata for each unique transcript in the combined GTF. Note that the columns present are not considered stable and may change in future releases:

  • transcript_id - assigned unique transcript ID. Corresponds to transcript_id key from attribute field in combined GTFs
  • le_id - assigned last exon identifier. Corresponds to le_id key from attribute field in combined GTFs
  • gene_id - assigned reference gene ID. Corresponds to ref_gene_id key from attribute field in combined GTFs
  • gene_name - assigned reference gene name. Corresponds to ref_gene_name key from attribute field in combined GTFs
  • event_type - inferred 'event type' for last exon isoform. See above mentions for description of possible values
  • Chromosome - reference chromosome sourced from combined GTF
  • Start - Start coordinate sourced from combined GTF
  • End - End coordinate sourced from combined GTF
  • Strand - genomic strand sourced from combined GTF
  • annot_status - Whether event originates from reference annotation ('annotated') or predicted last exons ('novel')

differential_apa

$ tree test_data_output/differential_apa/
test_data_output/differential_apa/
├── dexseq_apa.image.RData
├── dexseq_apa.results.processed.tsv
├── dexseq_apa.results.tsv
├── formulas.txt
├── summarised_pas_quantification.counts.tsv
├── summarised_pas_quantification.gene_tpm.tsv
├── summarised_pas_quantification.ppau.tsv
└── summarised_pas_quantification.tpm.tsv

The main output files of note are the metadata-augmented DEXSeq results table (dexseq_apa.results.processed.tsv) and the various count/quantification matrices (summarised_pas_quantification.*.tsv).

Differential usage output table - dexseq_apa.results.processed.tsv

Example eg_dexseq_apa.results.processed.tsv:

binID groupID featureID exonBaseMean dispersion stat pvalue padj group1 group2 log2fold_group2_group1 le_id gene.qvalue contrast_name gene_id gene_name event_type annot_status transcript_id chromosome strand start end mean_PPAU_group1 mean_PPAU_group2 delta_PPAU_group2_group1
id_ref_gene_1:E001 id_ref_gene_1 E001 102.30010013797774 0.08728199388325558 14.779434541157954 1.2084626559287708e-4 3.625387967786313e-4 1.970004973787062 2.046000549463702 0.25495812632124615 id_ref_gene_1_1 0 group2vsgroup1 id_ref_gene_1 ref_gene_1 first_exon_spliced novel PAPA.group2_sample_04.1.1 chr1 + 500 700 0.571113312433762 0.819125243817251 0.24801193138349
id_ref_gene_1:E002 id_ref_gene_1 E002 53.162865498165885 0.0290240675573647 114.49484271761776 1.01529962365023e-26 6.091797741901381e-26 1.8403255217654115 1.3779809537657772 -1.576609330145999 id_ref_gene_1_2 0 group2vsgroup1 id_ref_gene_1 ref_gene_1 internal_exon_spliced novel PAPA.group1_sample_01.1.7,PAPA.group2_sample_06.1.6 chr1 + 1400,1400 1600,1600 0.427395249868125 0.169998726991834 -0.257396522876291
id_ref_gene_1:E003 id_ref_gene_1 E003 173.7374193477014 0.164070451327647 2.6797725917421076 0.10163024034168788 0.10163024034168788 1.7553054456349897 2.5546210217424665 2.6768091091447417 id_ref_gene_1_3 0 group2vsgroup1 id_ref_gene_1 ref_gene_1 internal_exon_spliced novel PAPA.group1_sample_03.1.6,PAPA.group2_sample_05.1.8 chr1 + 1950,1950 2599,2598 7.2652715448876e-4 0.00516402410597214 0.00443749695148337
id_ref_gene_1:E004 id_ref_gene_1 E004 306.7360492958544 0.2524001172499824 4.248342859211704 0.03928865011600179 0.04714638013920215 1.979697794910254 2.8059821452923033 2.757797344758746 id_ref_gene_1_4 0 group2vsgroup1 id_ref_gene_1 ref_gene_1 last_exon_spliced annotated ref_g1_tr_1 chr1 + 3000 3400 7.6491054362457e-4 0.00571200508494262 0.00494709454131804

Column descriptions:

The following columns are all generated from DEXSeq's results table. Please consult the DEXSeq documentation for column descriptions:

  • binID
  • groupID
  • featureID
  • exonBaseMean
  • dispersion
  • stat
  • pvalue
  • padj
  • group1
  • group2
  • log2fold_group2_group1
  • gene.qvalue

The following metadata columns are appended by the pipeline:

  • le_id - last exon identifier constructed by PAPA
  • contrast_name - Name of experimental contrast tested, corresponds to <numerator_condition>vs<denominator_condition> where numerator and denominator conditions correspond to keys in the 'condition' column of the sample table
  • gene_id - gene ID extracted from reference annotation
  • gene_name - gene name extracted from reference annotation
  • event_type - inferred 'event type' for last exon isoform. See above mentions for description of possible values
  • annot_status - whether event originates from reference annotation or novel events
  • transcript_id - transcript ID extracted from reference annotation (if annotated) or defined by StringTie
  • chromosome - chromosome of origin
  • strand - genomic strand of origin
  • start - genomic start coordinate for last exon isoform. Can be multiple comma separated values if distinct last exons are collapsed into a single last exon isoform
  • end - genomic end coordinate for last exon isoform. Can be multiple comma separated values if distinct last exons are collapsed into a single last exon isoform. Indexes in start and end coordinates are matching (i.e. 1st start and 1st end coordinate correspond to a specific event)
  • mean_PPAU_group1 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group1. Group name suffix is determined by 'condition' column in sample table.
  • mean_PPAU_group2 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group2. Group name suffix is determined by 'condition' column in sample table.
  • delta_PPAU_group2_group1 - difference in mean % poly(A) site usage (expressed as a fraction) of last exon isoforms between group2 and group1 (mean_PPAU_group2 - mean_PPAU_group1). Group name suffixes are determined by 'condition' column in sample table.
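
Since the start and end columns can hold comma-separated values with matching indexes, they need to be paired up before use. A minimal sketch (paired_intervals is a hypothetical helper), using the start and end values from the E003 row of the example table:

```python
def paired_intervals(start, end):
    """Pair comma-separated start/end values by index into (start, end) tuples."""
    starts = [int(s) for s in str(start).split(",")]
    ends = [int(e) for e in str(end).split(",")]
    if len(starts) != len(ends):
        raise ValueError("start and end must contain the same number of values")
    return list(zip(starts, ends))


# start = '1950,1950', end = '2599,2598' in the example row
print(paired_intervals("1950,1950", "2599,2598"))
```

Converting the inputs with str() means the helper also accepts single coordinates that a table reader has already parsed as integers.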

Count, TPM and % PolyA usage matrices - summarised_pas_quantification.*.tsv

All isoform-level matrices follow the same basic structure. Example eg_summarised_pas_quantification.counts.tsv:

le_id gene_id group1_sample_01 group1_sample_02 group1_sample_03 group2_sample_04 group2_sample_05 group2_sample_06
id_ref_gene_1_1 id_ref_gene_1 153.270694438624 87.9224799846888 73.5793392383333 87.085465994054 95.2339837457173 107.058789738493
id_ref_gene_1_2 id_ref_gene_1 99.2776088978098 62.6447669892074 66.5717831203075 18.9491523227997 18.858214603034 22.1795102246696
id_ref_gene_1_3 id_ref_gene_1 37.353397764602 86.2499524563182 54.5226552431889 192.816927844777 276.881950557951 480.112214730231
id_ref_gene_1_4 id_ref_gene_1 40.8670692723053 92.300837007847 160.353380468839 721.378148575543 522.447234953964 397.961898154793

The first two columns are always:

  • le_id - last exon isoform identifier assigned by PAPA
  • gene_id - gene_id extracted from reference annotation and used to group last exon isoforms according to parent gene

The subsequent columns correspond to sample names extracted from the 'sample_name' column of the sample sheet. Depending on the file name, the columns are populated with different values:

  • summarised_pas_quantification.counts.tsv - contains estimated counts for each isoform as calculated by tximport
  • summarised_pas_quantification.ppau.tsv - contains calculated % polyA site usage values with respect to the gene for each isoform (expressed as fractions)
  • summarised_pas_quantification.tpm.tsv - contains TPM values for each isoform as calculated by tximport

The 'PPAU' matrix (summarised_pas_quantification.ppau.tsv) is appended with additional summary columns. These columns are identical to those that end up in the final differential usage output table (dexseq_apa.results.processed.tsv):

  • mean_PPAU_group1 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group1. Group name suffix is determined by 'condition' column in sample table.
  • mean_PPAU_group2 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group2. Group name suffix is determined by 'condition' column in sample table.
  • delta_PPAU_group2_group1 - difference in mean % poly(A) site usage (expressed as a fraction) of last exon isoforms between group2 and group1 (mean_PPAU_group2 - mean_PPAU_group1). Group name suffixes are determined by 'condition' column in sample table.
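
To make the usage values and summary columns concrete, the sketch below computes per-sample usage fractions, group means and the delta from an isoform-level matrix. It is an illustration under stated assumptions rather than the pipeline's code: it assumes usage is the isoform's value divided by the gene total within each sample, it does not assume which input matrix (TPM or counts) the pipeline uses, and all names and numbers are made up.

```python
from collections import defaultdict


def usage_fractions(values, le2gene):
    """values: {le_id: {sample: value}} -> {le_id: {sample: fraction of gene total}}."""
    gene_totals = defaultdict(lambda: defaultdict(float))
    for le_id, per_sample in values.items():
        for sample, value in per_sample.items():
            gene_totals[le2gene[le_id]][sample] += value
    fractions = {}
    for le_id, per_sample in values.items():
        totals = gene_totals[le2gene[le_id]]
        fractions[le_id] = {
            # Undefined (NaN) when the gene has no signal in a sample
            sample: value / totals[sample] if totals[sample] > 0 else float("nan")
            for sample, value in per_sample.items()
        }
    return fractions


def group_mean(per_sample, samples):
    """Mean of the values for the given sample names."""
    return sum(per_sample[s] for s in samples) / len(samples)


# Toy values: one gene with two last exon isoforms, one sample per group
values = {
    "gene_a_1": {"s1": 30.0, "s2": 10.0},
    "gene_a_2": {"s1": 70.0, "s2": 30.0},
}
le2gene = {"gene_a_1": "gene_a", "gene_a_2": "gene_a"}

ppau = usage_fractions(values, le2gene)
mean_group1 = group_mean(ppau["gene_a_1"], ["s1"])
mean_group2 = group_mean(ppau["gene_a_1"], ["s2"])
delta = mean_group2 - mean_group1  # analogous to delta_PPAU_group2_group1
print(mean_group1, mean_group2, delta)
```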

The 'gene-level expression' file - summarised_pas_quantification.gene_tpm.tsv - follows a similar structure to above files, just that the 'le_id' column is omitted. The column values correspond to the sum of TPM expression for all last exon isoforms of a given gene. Additionally, the following columns are appended:

  • mean_gene_TPM_group1 - mean gene-level TPM values for samples in group1. Group name suffix is determined by 'condition' column in sample table.
  • mean_gene_TPM_group2 - mean gene-level TPM values for samples in group2. Group name suffix is determined by 'condition' column in sample table.
  • median_gene_TPM_group1 - median gene-level TPM values for samples in group1. Group name suffix is determined by 'condition' column in sample table.
  • median_gene_TPM_group2 - median gene-level TPM values for samples in group2. Group name suffix is determined by 'condition' column in sample table.
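
The gene-level values described above (sum of TPM across all last exon isoforms of a gene, followed by per-group mean and median) can be sketched as follows. The function name, sample names and numbers are invented for illustration, and the pipeline's actual implementation may differ.

```python
from collections import defaultdict
from statistics import mean, median


def gene_level_tpm(tpm, le2gene):
    """Sum isoform TPMs per gene: {le_id: {sample: tpm}} -> {gene_id: {sample: tpm}}."""
    gene_tpm = defaultdict(lambda: defaultdict(float))
    for le_id, per_sample in tpm.items():
        for sample, value in per_sample.items():
            gene_tpm[le2gene[le_id]][sample] += value
    # Convert nested defaultdicts to plain dicts for predictable behaviour
    return {gene: dict(per_sample) for gene, per_sample in gene_tpm.items()}


# Toy values: two last exon isoforms of one gene, two samples in group1
tpm = {
    "gene_a_1": {"s1": 10.0, "s2": 20.0},
    "gene_a_2": {"s1": 5.0, "s2": 15.0},
}
le2gene = {"gene_a_1": "gene_a", "gene_a_2": "gene_a"}
group1_samples = ["s1", "s2"]

gene_tpm = gene_level_tpm(tpm, le2gene)
group1_values = [gene_tpm["gene_a"][s] for s in group1_samples]
mean_gene_tpm_group1 = mean(group1_values)
median_gene_tpm_group1 = median(group1_values)
print(gene_tpm, mean_gene_tpm_group1, median_gene_tpm_group1)
```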

Other output files

  • dexseq_apa.image.RData - workspace/saved environment for the DEXSeq run.
  • dexseq_apa.results.tsv - standard results dataframe output by a DEXSeq differential analysis
  • formulas.txt - two-line text file containing the full (first line) and reduced (second line) model formulas input to DEXSeq's likelihood ratio test