Output file documentation

This document provides further details on the key output files generated by the pipeline. The primary focus is on the outputs storing novel last exons, summarised quantifications and differential usage results. The outputs of Salmon (<output_dir>/salmon) and StringTie (<output_dir>/stringtie) are the standard outputs of the respective tools and are not discussed here. For further details on these outputs, please consult the documentation of the respective tools (Salmon, StringTie).

Sections:

StringTie
tx_filtering
differential_apa

StringTie

tx_filtering

$ tree test_data_output/tx_filtering/
test_data_output/tx_filtering/
├── all_conditions.merged_last_exons.3p_end_filtered.gtf
├── group1
│   ├── group1.all_samples
│   ├── group1.all_samples.loci
│   ├── group1.all_samples.tracking
│   ├── group1.merged_last_exons.3p_end_filtered.gtf
│   ├── group1.merged_last_exons.3p_end_filtered.match_stats.tsv
│   ├── group1.merged_last_exons.3p_end_filtered.not_valid.class_summary_counts.tsv
│   ├── group1.merged_last_exons.3p_end_filtered.valid.class_summary_counts.tsv
│   ├── group1.merged_last_exons.gtf
│   ├── group1_sample_01.last_exons.gtf
│   ├── group1_sample_02.last_exons.gtf
│   ├── group1_sample_03.last_exons.gtf
│   └── gtf_list_group1.txt
├── group2
│   ├── group2.all_samples
│   ├── group2.all_samples.loci
│   ├── group2.all_samples.tracking
│   ├── group2.merged_last_exons.3p_end_filtered.gtf
│   ├── group2.merged_last_exons.3p_end_filtered.match_stats.tsv
│   ├── group2.merged_last_exons.3p_end_filtered.not_valid.class_summary_counts.tsv
│   ├── group2.merged_last_exons.3p_end_filtered.valid.class_summary_counts.tsv
│   ├── group2.merged_last_exons.gtf
│   ├── group2_sample_04.last_exons.gtf
│   ├── group2_sample_05.last_exons.gtf
│   ├── group2_sample_06.last_exons.gtf
│   └── gtf_list_group2.txt
├── novel_ref_combined.info.tsv
├── novel_ref_combined.last_exons.gtf
├── novel_ref_combined.le2gene.tsv
├── novel_ref_combined.le2genename.tsv
├── novel_ref_combined.quant.last_exons.gtf
├── novel_ref_combined.tx2gene.tsv
└── novel_ref_combined.tx2le.tsv

GTF of predicted last exons passing filtering across all conditions - `all_conditions.merged_last_exons.3p_end_filtered.gtf`

This GTF file contains all last exons passing the reference polyA site and polyA signal filtering across all conditions. It is the combination of group1/group1.merged_last_exons.3p_end_filtered.gtf and group2/group2.merged_last_exons.3p_end_filtered.gtf. Note that these are complete novel last exon sequences. For the regions used for quantification, refer to novel_ref_combined.quant.last_exons.gtf.

The coordinates follow standard GTF convention. The attribute fields are described below. Note that I consider this to be overly populated, and future releases are likely to simplify this output:

gene_id - gene ID generated by StringTie.
- PAPA.group1_sample_01.1 - 'PAPA' is hardcoded. 2nd field corresponds to sample name and 3rd field to unique gene ID computed by StringTie
transcript_id - transcript ID generated by StringTie.
- PAPA.group1_sample_01.1.7 - 'PAPA' is hardcoded. 2nd field corresponds to sample name and 3rd field to unique gene ID computed by StringTie. Final field corresponds to unique transcript 'number' computed by StringTie
cov - coverage value computed by StringTie
exon_number - left-to-right (i.e. non strand aware) exon number as calculated by StringTie (TODO: DOUBLE CHECK THIS)
Start_ref - Start coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
End_ref - End coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
gene_name_ref - gene name of overlapping annotated gene
transcript_id_ref - transcript ID of matching reference exon (if an extension event) or intron (if a 'spliced' event)
gene_id_ref - gene_id of overlapping annotated gene
3p_extension_length - Length of extension relative to 3'end of annotated exon. 'NULL' if event is not an extension of an annotated exon (i.e. is a spliced event)
event_type - inferred 'event type' for novel last exon isoform. Can take on the following values:
- first_exon_spliced - most gene-proximal distinct last exon (i.e. it is the 'shortest' last exon isoform)
- internal_exon_spliced - distinct last exon that is not most proximal/distal with respect to gene
- last_exon_spliced - most gene-distal last exon
- first_exon_extension - novel 3'end extension of an annotated first exon (i.e. a 'composite' or 'bleedthrough' event)
- internal_exon_extension - novel 3'end extension of an annotated internal exon (i.e. a 'composite' or 'bleedthrough' event)
- last_exon_extension - novel 3'end extension of an annotated last exon (i.e. a 3'UTR extension)
sample_id - sample in which event was identified. Corresponds to sample_name from sample table
last_exon_id - intermediate last exon isoform identifier grouping overlapping last exons.
atlas_filter - whether event passes (1) or fails (0) maximum distance from annotated polyA site (PolyASite) filter
nearest_atlas_distance - distance from last exon 3'end to the nearest annotated polyA site (PolyASite)
Name - ID for nearest annotated polyA site (PolyASite). Corresponds to the 'Name' field in provided reference polyA site BED file
Start_atlas - Start coordinate from nearest annotated polyA site from reference BED file
End_atlas - End coordinate from nearest annotated polyA site from reference BED file
motif_filter - whether event passes (1) or fails (0) polyA signal motif filter
pas_motifs - identified polyA signal motifs. 'not_found' if no motifs found in specified terminal window of predicted last exons
- -18_AATAAA - <distance_from_3'end>_<polyA_signal_sequence>. '-' denotes that position is upstream of predicted 3'end ('End' coordinate of GTF entry). If multiple motifs found they are separated by comma characters
min_motif_3p_deviation - Distance deviation of identified motifs from the expected distance upstream of 3'ends (user-specified). Missing if motifs are not found in user-specified terminal window.
- min(|distance_from_3'end - expected_distance|), where expected_distance is user-defined and distance_from_3'end is the distance of an identified motif from the predicted 3'end
- Where multiple motifs are identified, the minimum distance deviation is reported
min_motif_3p_distance - Distance from predicted 3'end of identified motif reported in min_motif_3p_deviation. Missing if motifs are not found in user-specified terminal window.
- e.g. if -18_AATAAA was the reported motif, -18 would be value for this key
condition_id - condition key from 'condition' column in sample table for corresponding sample_id
End_le - appended if atlas_filter = 1 and nearest_atlas_distance != 0. Corresponds to the original End coordinate of predicted last exon prior to updating to matched reference polyA site
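
The relationship between pas_motifs, min_motif_3p_deviation and min_motif_3p_distance can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names, the second motif in the example string and the expected distance of -21 are hypothetical, and the sketch assumes the expected distance uses the same sign convention as the reported distances (negative = upstream of the 3'end).

```python
def parse_pas_motifs(value):
    """Split a pas_motifs string into (distance_from_3p_end, motif) pairs."""
    if value == "not_found":
        return []
    pairs = []
    for item in value.split(","):
        # Each item looks like '-18_AATAAA': <distance_from_3'end>_<polyA_signal_sequence>
        distance, motif = item.split("_", 1)
        pairs.append((int(distance), motif))
    return pairs


def min_motif_deviation(value, expected_distance):
    """Return (min_motif_3p_deviation, min_motif_3p_distance), or None if no motif was found."""
    pairs = parse_pas_motifs(value)
    if not pairs:
        return None
    # Pick the motif whose distance deviates least from the expected distance
    distance, _ = min(pairs, key=lambda pair: abs(pair[0] - expected_distance))
    return abs(distance - expected_distance), distance


# Hypothetical attribute value and expected distance (assumed sign convention: negative = upstream)
result = min_motif_deviation("-18_AATAAA,-30_ATTAAA", expected_distance=-21)
print(result)  # (3, -18)
```

Here the motif at -18 deviates by 3 from the expected distance while the motif at -30 deviates by 9, so the first is the one reported under this reading of the two keys.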

Assorted notes:

This file can contain duplicated intervals if exactly the same coordinate is predicted across multiple replicates (or even multiple conditions). I am likely to address this in future releases.
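
Until duplicates are handled upstream, they can be collapsed after the fact. The sketch below is one possible approach in Python, not part of the pipeline: it parses the attribute column with a regular expression and groups records by (chromosome, start, end, strand), keeping track of the samples that predicted each interval. The two GTF lines are hypothetical and carry only a subset of the attributes described above.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one tab-separated GTF line into coordinates plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attrs": dict(ATTR_PATTERN.findall(fields[8])),
    }


def unique_intervals(lines):
    """Map each distinct interval to the list of sample_id values that predicted it."""
    seen = {}
    for line in lines:
        record = parse_gtf_line(line)
        key = (record["chrom"], record["start"], record["end"], record["strand"])
        seen.setdefault(key, []).append(record["attrs"].get("sample_id"))
    return seen


# Hypothetical GTF lines: the same last exon predicted in two different samples
gtf_lines = [
    'chr1\tStringTie\texon\t1400\t1600\t.\t+\t.\tgene_id "PAPA.group1_sample_01.1"; '
    'transcript_id "PAPA.group1_sample_01.1.7"; sample_id "group1_sample_01";',
    'chr1\tStringTie\texon\t1400\t1600\t.\t+\t.\tgene_id "PAPA.group2_sample_06.1"; '
    'transcript_id "PAPA.group2_sample_06.1.6"; sample_id "group2_sample_06";',
]

intervals = unique_intervals(gtf_lines)
print(intervals)
```

Grouping rather than dropping rows keeps the per-sample provenance, which may matter when checking how reproducibly an event was predicted.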

Combined novel and annotated last exon references - novel_ref_combined.last_exons.gtf and novel_ref_combined.quant.last_exons.gtf

GTF files containing combination of novel and annotated last exons. novel_ref_combined.quant.last_exons.gtf represents the reference GTF used to construct the Salmon index. novel_ref_combined.last_exons.gtf instead represents the complete last exon sequences prior to trimming to unique regions (i.e. no overlaps with annotated internal exons).

The coordinates follow standard GTF convention. The attribute fields are identical between the two files and as described below. Note that I also consider these fields to be overly populated, and future releases are likely to simplify this output:

gene_id - gene ID for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
transcript_id - transcript ID for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
exon_number - exon number for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
Start_ref - Start coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
End_ref - End coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
event_type - inferred 'event type' for last exon isoform. See above mentions for description of possible values
gene_name - Only present if interval originates from annotation. Redundant with ref_gene_name
region_rank - takes on first, internal or last, based on the position of the last exon isoform in the gene (first = 5'most, last = 3'most)
transcript_id_ref - Only present if interval originates from annotation. Redundant with transcript_id
ref_gene_id - reference gene ID for last exon isoform. Propagated into ID mapping tables (see below)
ref_gene_name - reference gene name for last exon isoform. Propagated into ID mapping tables (see below)
le_number - corresponds to 5'-3' rank of unique last exon isoforms within gene
le_id - unique identifier for last exon isoform. Any last exon intervals with overlapping sequence are grouped into a common identifier.
- id_ref_gene_2_1 - <ref_gene_id>_<le_number> - reference gene ID suffixed with 'last exon number' (5'-3' rank within gene)
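
Because reference gene IDs may themselves contain underscores, recovering the two parts of an le_id is safest when splitting from the right. A minimal Python sketch, assuming le_number is always a trailing integer (the helper name is hypothetical):

```python
def split_le_id(le_id):
    """Split '<ref_gene_id>_<le_number>' on the last underscore only."""
    ref_gene_id, le_number = le_id.rsplit("_", 1)
    return ref_gene_id, int(le_number)


print(split_le_id("id_ref_gene_2_1"))  # ('id_ref_gene_2', 1)
```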

Assorted notes:

This file can contain duplicated intervals (annotated or novel), including those with closely spaced 3'ends (likely the result of imprecise cleavage). I am likely to address this in future releases.

ID mapping files - novel_ref_combined.tx2gene.tsv, novel_ref_combined.tx2le.tsv, novel_ref_combined.le2gene.tsv, novel_ref_combined.le2genename.tsv

All the following files follow a common structure, with columns denoting the 'identifier type':

novel_ref_combined.tx2gene.tsv - transcript ID and gene ID
- transcript_id - corresponds to transcript_id key from attribute field in combined GTFs
- gene_id - corresponds to ref_gene_id key from attribute field in combined GTFs
novel_ref_combined.tx2le.tsv - transcript ID and 'le_id'
- transcript_id - corresponds to transcript_id key from attribute field in combined GTFs
- le_id - corresponds to le_id in combined GTFs
novel_ref_combined.le2gene.tsv - le_id and gene ID
- le_id - corresponds to le_id in combined GTFs
- gene_id - corresponds to ref_gene_id key from attribute field in combined GTFs
novel_ref_combined.le2genename.tsv - le_id and gene name
- le_id - corresponds to le_id in combined GTFs
- gene_name - corresponds to ref_gene_name key from attribute field in combined GTFs
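
The mapping tables are related to one another: chaining tx2le and le2gene should give the same transcript-to-gene assignments as tx2gene. The Python sketch below illustrates this with small in-memory tables. It assumes each file is tab-separated with a header row named after the columns listed above, which is not confirmed here, and the rows are illustrative values taken from the example IDs used elsewhere in this document.

```python
import csv
import io

# Hypothetical file contents (assumed: tab-separated, header row matching documented column names)
tx2le_tsv = (
    "transcript_id\tle_id\n"
    "PAPA.group2_sample_04.1.1\tid_ref_gene_1_1\n"
    "ref_g1_tr_1\tid_ref_gene_1_4\n"
)
le2gene_tsv = (
    "le_id\tgene_id\n"
    "id_ref_gene_1_1\tid_ref_gene_1\n"
    "id_ref_gene_1_4\tid_ref_gene_1\n"
)


def read_mapping(text, key_col, value_col):
    """Read a two-column TSV into a dict keyed on key_col."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[key_col]: row[value_col] for row in reader}


tx2le = read_mapping(tx2le_tsv, "transcript_id", "le_id")
le2gene = read_mapping(le2gene_tsv, "le_id", "gene_id")

# Chain transcript -> last exon isoform -> gene
tx2gene = {tx: le2gene[le] for tx, le in tx2le.items()}
print(tx2gene)
```

To read the real files, `io.StringIO(text)` would be replaced by an open file handle.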

Metadata for all last exon isoforms - novel_ref_combined.info.tsv

Simplified table containing metadata for each unique transcript in combined GTF. Note that present columns are not considered stable and may change in future releases:

transcript_id - assigned unique transcript ID. Corresponds to transcript_id key from attribute field in combined GTFs
le_id - assigned last exon identifier. Corresponds to le_id key from attribute field in combined GTFs
gene_id - assigned reference gene ID. Corresponds to ref_gene_id key from attribute field in combined GTFs
gene_name - assigned reference gene name. Corresponds to ref_gene_name key from attribute field in combined GTFs
event_type - inferred 'event type' for last exon isoform. See above mentions for description of possible values
Chromosome - reference chromosome sourced from combined GTF
Start - Start coordinate sourced from combined GTF
End - End coordinate sourced from combined GTF
Strand - genomic strand sourced from combined GTF
annot_status - Whether event originates from reference annotation ('annotated') or predicted last exons ('novel')

differential_apa

$ tree test_data_output/differential_apa/
test_data_output/differential_apa/
├── dexseq_apa.image.RData
├── dexseq_apa.results.processed.tsv
├── dexseq_apa.results.tsv
├── formulas.txt
├── summarised_pas_quantification.counts.tsv
├── summarised_pas_quantification.gene_tpm.tsv
├── summarised_pas_quantification.ppau.tsv
└── summarised_pas_quantification.tpm.tsv

The main output files of note are the metadata-augmented DEXSeq results table (dexseq_apa.results.processed.tsv) and the various count/quantification matrices (summarised_pas_quantification.*.tsv).

Differential usage output table - dexseq_apa.results.processed.tsv

Example eg_dexseq_apa.results.processed.tsv:

binID	groupID	featureID	exonBaseMean	dispersion	stat	pvalue	padj	group1	group2	log2fold_group2_group1	le_id	contrast_name	gene_id	gene_name	event_type	annot_status	transcript_id	chromosome	strand	start	end	mean_PPAU_group1	mean_PPAU_group2	delta_PPAU_group2_group1
id_ref_gene_1:E001	id_ref_gene_1	E001	102.30010013797774	0.08728199388325558	14.779434541157954	1.2084626559287708e-4	3.625387967786313e-4	1.970004973787062	2.046000549463702	0.25495812632124615	id_ref_gene_1_1	group2vsgroup1	id_ref_gene_1	ref_gene_1	first_exon_spliced	novel	PAPA.group2_sample_04.1.1	chr1	+	500	700	0.571113312433762	0.819125243817251	0.24801193138349
id_ref_gene_1:E002	id_ref_gene_1	E002	53.162865498165885	0.0290240675573647	114.49484271761776	1.01529962365023e-26	6.091797741901381e-26	1.8403255217654115	1.3779809537657772	-1.576609330145999	id_ref_gene_1_2	group2vsgroup1	id_ref_gene_1	ref_gene_1	internal_exon_spliced	novel	PAPA.group1_sample_01.1.7,PAPA.group2_sample_06.1.6	chr1	+	1400,1400	1600,1600	0.427395249868125	0.169998726991834	-0.257396522876291
id_ref_gene_1:E003	id_ref_gene_1	E003	173.7374193477014	0.164070451327647	2.6797725917421076	0.10163024034168788	0.10163024034168788	1.7553054456349897	2.5546210217424665	2.6768091091447417	id_ref_gene_1_3	group2vsgroup1	id_ref_gene_1	ref_gene_1	internal_exon_spliced	novel	PAPA.group1_sample_03.1.6,PAPA.group2_sample_05.1.8	chr1	+	1950,1950	2599,2598	7.2652715448876e-4	0.00516402410597214	0.00443749695148337
id_ref_gene_1:E004	id_ref_gene_1	E004	306.7360492958544	0.2524001172499824	4.248342859211704	0.03928865011600179	0.04714638013920215	1.979697794910254	2.8059821452923033	2.757797344758746	id_ref_gene_1_4	group2vsgroup1	id_ref_gene_1	ref_gene_1	last_exon_spliced	annotated	ref_g1_tr_1	chr1	+	3000	3400	7.6491054362457e-4	0.00571200508494262	0.00494709454131804

Column descriptions:

The following columns are all generated from DEXSeq's results table. Please consult the DEXSeq documentation for column descriptions:

binID
groupID
featureID
exonBaseMean
dispersion
stat
pvalue
padj
group1
group2
log2fold_group2_group1
gene.qvalue

The following metadata columns are appended by the pipeline:

le_id - last exon identifier constructed by PAPA
contrast_name - Name of experimental contrast tested, corresponds to <numerator_condition>vs<denominator_condition> where numerator and denominator conditions correspond to keys in the 'condition' column of the sample table
gene_id - gene ID extracted from reference annotation
gene_name - gene name extracted from reference annotation
event_type - inferred 'event type' for last exon isoform. See above mentions for description of possible values
annot_status - whether event originates from reference annotation or novel events
transcript_id - transcript ID extracted from reference annotation (if annotated) or defined by StringTie
chromosome - chromosome of origin
strand - genomic strand of origin
start - genomic start coordinate for last exon isoform. Can be multiple comma separated values if distinct last exons are collapsed into a single last exon isoform
end - genomic end coordinate for last exon isoform. Can be multiple comma separated values if distinct last exons are collapsed into a single last exon isoform. Indexes in start and end coordinates are matching (i.e. 1st start and 1st end coordinate correspond to a specific event)
mean_PPAU_group1 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group1. Group name suffix is determined by 'condition' column in sample table.
mean_PPAU_group2 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group2. Group name suffix is determined by 'condition' column in sample table.
delta_PPAU_group2_group1 - difference in mean % poly(A) site usage (expressed as a fraction) of last exon isoforms between group2 and group1 (mean_PPAU_group2 - mean_PPAU_group1). Group name suffixes are determined by 'condition' column in sample table.
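
As a worked example of the start/end pairing and the delta_PPAU definition, the Python sketch below takes the relevant fields from the third row (E003) of the example table and checks them. The helper name is hypothetical, and the values are read as strings, as they would be from the TSV.

```python
import math

# Selected fields from the E003 row of the example table
row = {
    "le_id": "id_ref_gene_1_3",
    "start": "1950,1950",
    "end": "2599,2598",
    "mean_PPAU_group1": "7.2652715448876e-4",
    "mean_PPAU_group2": "0.00516402410597214",
    "delta_PPAU_group2_group1": "0.00443749695148337",
}


def paired_intervals(row):
    """Pair comma-separated start and end values by index."""
    starts = [int(value) for value in row["start"].split(",")]
    ends = [int(value) for value in row["end"].split(",")]
    if len(starts) != len(ends):
        raise ValueError("start and end must contain the same number of values")
    return list(zip(starts, ends))


intervals = paired_intervals(row)
print(intervals)  # [(1950, 2599), (1950, 2598)]

# delta_PPAU is mean_PPAU_group2 - mean_PPAU_group1
delta = float(row["mean_PPAU_group2"]) - float(row["mean_PPAU_group1"])
matches = math.isclose(delta, float(row["delta_PPAU_group2_group1"]), rel_tol=1e-9)
print(matches)  # True
```

The two intervals share a start coordinate and differ by one base at the 3'end, which is the kind of closely spaced event that gets collapsed into a single last exon isoform.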

Count, TPM and % PolyA usage matrices - summarised_pas_quantification.*.tsv

All isoform-level matrices follow the same basic structure. Example eg_summarised_pas_quantification.counts.tsv:

le_id	gene_id	group1_sample_01	group1_sample_02	group1_sample_03	group2_sample_04	group2_sample_05	group2_sample_06
id_ref_gene_1_1	id_ref_gene_1	153.270694438624	87.9224799846888	73.5793392383333	87.085465994054	95.2339837457173	107.058789738493
id_ref_gene_1_2	id_ref_gene_1	99.2776088978098	62.6447669892074	66.5717831203075	18.9491523227997	18.858214603034	22.1795102246696
id_ref_gene_1_3	id_ref_gene_1	37.353397764602	86.2499524563182	54.5226552431889	192.816927844777	276.881950557951	480.112214730231
id_ref_gene_1_4	id_ref_gene_1	40.8670692723053	92.300837007847	160.353380468839	721.378148575543	522.447234953964	397.961898154793

The first two columns are always:

le_id - last exon isoform identifier assigned by PAPA
gene_id - gene_id extracted from reference annotation and used to group last exon isoforms according to parent gene

The subsequent columns correspond to sample names extracted from the 'sample_name' column of the sample sheet. Depending on the file name, the columns are populated with different values:

summarised_pas_quantification.counts.tsv - contains estimated counts for each isoform as calculated by tximport
summarised_pas_quantification.ppau.tsv - contains calculated % polyA site usage values with respect to the gene for each isoform (expressed as fractions)
summarised_pas_quantification.tpm.tsv - contains TPM values for each isoform as calculated by tximport
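
Fractional usage 'with respect to the gene' can be sketched as each isoform's value divided by the sum over all isoforms of the same gene, computed per sample. The Python example below assumes the calculation is made on TPM values, which is not stated above, and uses hypothetical numbers for a single sample. It is an illustration of the idea rather than the pipeline's implementation.

```python
from collections import defaultdict

# Hypothetical TPM values for one sample: (le_id, gene_id, tpm)
tpm_rows = [
    ("id_ref_gene_1_1", "id_ref_gene_1", 30.0),
    ("id_ref_gene_1_2", "id_ref_gene_1", 10.0),
    ("id_ref_gene_2_1", "id_ref_gene_2", 5.0),
]


def fractional_usage(rows):
    """Return le_id -> isoform value divided by the summed value of its gene."""
    gene_totals = defaultdict(float)
    for _, gene_id, value in rows:
        gene_totals[gene_id] += value

    usage = {}
    for le_id, gene_id, value in rows:
        total = gene_totals[gene_id]
        # Usage is undefined for a gene with no expression in this sample
        usage[le_id] = value / total if total > 0 else float("nan")
    return usage


ppau = fractional_usage(tpm_rows)
print(ppau)
# {'id_ref_gene_1_1': 0.75, 'id_ref_gene_1_2': 0.25, 'id_ref_gene_2_1': 1.0}
```

Under this definition the usage values of all isoforms of an expressed gene sum to 1 within a sample.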

The 'PPAU' matrix (summarised_pas_quantification.ppau.tsv) is appended with additional summary columns. These columns are identical to those that end up in the final differential usage output table (dexseq_apa.results.processed.tsv):

mean_PPAU_group1 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group1. Group name suffix is determined by 'condition' column in sample table.
mean_PPAU_group2 - mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group2. Group name suffix is determined by 'condition' column in sample table.
delta_PPAU_group2_group1 - difference in mean % poly(A) site usage (expressed as a fraction) of last exon isoforms between group2 and group1 (mean_PPAU_group2 - mean_PPAU_group1). Group name suffixes are determined by 'condition' column in sample table.

The 'gene-level expression' file - summarised_pas_quantification.gene_tpm.tsv - follows a similar structure to above files, just that the 'le_id' column is omitted. The column values correspond to the sum of TPM expression for all last exon isoforms of a given gene. Additionally, the following columns are appended:

mean_gene_TPM_group1 - mean gene-level TPM values for samples in group1. Group name suffix is determined by 'condition' column in sample table.
mean_gene_TPM_group2 - mean gene-level TPM values for samples in group2. Group name suffix is determined by 'condition' column in sample table.
median_gene_TPM_group1 - median gene-level TPM values for samples in group1. Group name suffix is determined by 'condition' column in sample table.
median_gene_TPM_group2 - median gene-level TPM values for samples in group2. Group name suffix is determined by 'condition' column in sample table.
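
The gene-level summary can be reproduced from an isoform-level TPM matrix along the following lines. This Python sketch uses hypothetical TPM values and a hypothetical helper name, and shows only group1. It sums isoform TPMs per gene and sample, then takes the mean and median across the samples of the group.

```python
from statistics import mean, median

group1_samples = ["group1_sample_01", "group1_sample_02", "group1_sample_03"]

# Hypothetical isoform-level TPM rows (structure mirrors summarised_pas_quantification.tpm.tsv)
isoform_tpm = [
    {"le_id": "id_ref_gene_1_1", "gene_id": "id_ref_gene_1",
     "group1_sample_01": 10.0, "group1_sample_02": 20.0, "group1_sample_03": 60.0},
    {"le_id": "id_ref_gene_1_2", "gene_id": "id_ref_gene_1",
     "group1_sample_01": 5.0, "group1_sample_02": 10.0, "group1_sample_03": 15.0},
]


def gene_level_tpm(rows, samples):
    """Sum TPM over all last exon isoforms of each gene, per sample."""
    totals = {}
    for row in rows:
        per_sample = totals.setdefault(row["gene_id"], {s: 0.0 for s in samples})
        for sample in samples:
            per_sample[sample] += row[sample]
    return totals


gene_tpm = gene_level_tpm(isoform_tpm, group1_samples)
values = [gene_tpm["id_ref_gene_1"][s] for s in group1_samples]  # [15.0, 30.0, 75.0]

mean_gene_tpm_group1 = mean(values)
median_gene_tpm_group1 = median(values)
print(mean_gene_tpm_group1, median_gene_tpm_group1)  # 40.0 30.0
```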

Other output files

dexseq_apa.image.RData - workspace/saved environment for the DEXSeq run.
dexseq_apa.results.tsv - standard results dataframe output by a DEXSeq differential analysis
formulas.txt - two-line text file containing the full (first line) and reduced models (second line) input to DEXSeq's likelihood ratio test
