This document provides further details on key output files generated by the pipeline. The primary focus is on the outputs storing novel last exons, summarised quantifications and differential usage results. The outputs of Salmon (<output_dir>/salmon) and StringTie (<output_dir>/stringtie) are the standard outputs of the respective tools and are not discussed here. For further details on these outputs, please consult the respective documentation (Salmon, StringTie).
$ tree test_data_output/tx_filtering/
test_data_output/tx_filtering/
├── all_conditions.merged_last_exons.3p_end_filtered.gtf
├── group1
│ ├── group1.all_samples
│ ├── group1.all_samples.loci
│ ├── group1.all_samples.tracking
│ ├── group1.merged_last_exons.3p_end_filtered.gtf
│ ├── group1.merged_last_exons.3p_end_filtered.match_stats.tsv
│ ├── group1.merged_last_exons.3p_end_filtered.not_valid.class_summary_counts.tsv
│ ├── group1.merged_last_exons.3p_end_filtered.valid.class_summary_counts.tsv
│ ├── group1.merged_last_exons.gtf
│ ├── group1_sample_01.last_exons.gtf
│ ├── group1_sample_02.last_exons.gtf
│ ├── group1_sample_03.last_exons.gtf
│ └── gtf_list_group1.txt
├── group2
│ ├── group2.all_samples
│ ├── group2.all_samples.loci
│ ├── group2.all_samples.tracking
│ ├── group2.merged_last_exons.3p_end_filtered.gtf
│ ├── group2.merged_last_exons.3p_end_filtered.match_stats.tsv
│ ├── group2.merged_last_exons.3p_end_filtered.not_valid.class_summary_counts.tsv
│ ├── group2.merged_last_exons.3p_end_filtered.valid.class_summary_counts.tsv
│ ├── group2.merged_last_exons.gtf
│ ├── group2_sample_04.last_exons.gtf
│ ├── group2_sample_05.last_exons.gtf
│ ├── group2_sample_06.last_exons.gtf
│ └── gtf_list_group2.txt
├── novel_ref_combined.info.tsv
├── novel_ref_combined.last_exons.gtf
├── novel_ref_combined.le2gene.tsv
├── novel_ref_combined.le2genename.tsv
├── novel_ref_combined.quant.last_exons.gtf
├── novel_ref_combined.tx2gene.tsv
└── novel_ref_combined.tx2le.tsv
GTF of predicted last exons passing filtering across all conditions - all_conditions.merged_last_exons.3p_end_filtered.gtf
This GTF file contains all last exons passing the reference polyA site and polyA signal filtering across all conditions. It is the combination of group1/group1.merged_last_exons.3p_end_filtered.gtf and group2/group2.merged_last_exons.3p_end_filtered.gtf. Note that these are complete novel last exon sequences; for the regions used for quantification, refer to novel_ref_combined.quant.last_exons.gtf.
The coordinates follow standard GTF convention. The attribute fields are described below. Note that I consider this to be overly populated, and future releases are likely to simplify this output:
gene_id
- gene ID generated by StringTie.
  - PAPA.group1_sample_01.1 - 'PAPA' is hardcoded. 2nd field corresponds to sample name and 3rd field to unique gene ID computed by StringTie
transcript_id
- transcript ID generated by StringTie.
  - PAPA.group1_sample_01.1.7 - 'PAPA' is hardcoded. 2nd field corresponds to sample name and 3rd field to unique gene ID computed by StringTie. Final field corresponds to unique transcript 'number' computed by StringTie
cov
- coverage value computed by StringTie
exon_number
- left-to-right (i.e. non strand aware) exon number as calculated by StringTie (TODO: DOUBLE CHECK THIS)
Start_ref
- Start coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
End_ref
- End coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
gene_name_ref
- gene name of overlapping annotated gene
transcript_id_ref
- transcript ID of matching reference exon (if an extension event) or intron (if a 'spliced' event)
gene_id_ref
- gene_id of overlapping annotated gene
3p_extension_length
- Length of extension relative to 3'end of annotated exon. 'NULL' if event is not an extension of an annotated exon (i.e. is a spliced event)
event_type
- inferred 'event type' for novel last exon isoform. Can take on the following values:
  - first_exon_spliced - most gene-proximal distinct last exon (i.e. it is the 'shortest' last exon isoform)
  - internal_exon_spliced - distinct last exon that is not most proximal/distal with respect to gene
  - last_exon_spliced - most gene-distal last exon
  - first_exon_extension - novel 3'end extension of an annotated first exon (i.e. a 'composite' or 'bleedthrough' event)
  - internal_exon_extension - novel 3'end extension of an annotated internal exon (i.e. a 'composite' or 'bleedthrough' event)
  - last_exon_extension - novel 3'end extension of an annotated last exon (i.e. a 3'UTR extension)
sample_id
- sample in which event was identified. Corresponds to sample_name from sample table
last_exon_id
- intermediate last exon isoform identifier grouping overlapping last exons.
atlas_filter
- whether event passes (1) or fails (0) maximum distance from annotated polyA site (PolyASite) filter
nearest_atlas_distance
- distance from last exon 3'end to the nearest annotated polyA site (PolyASite)
Name
- ID for nearest annotated polyA site (PolyASite). Corresponds to the 'Name' field in provided reference polyA site BED file
Start_atlas
- Start coordinate from nearest annotated polyA site from reference BED file
End_atlas
- End coordinate from nearest annotated polyA site from reference BED file
motif_filter
- whether event passes (1) or fails (0) polyA signal motif filter
pas_motifs
- identified polyA signal motifs. 'not_found' if no motifs found in specified terminal window of predicted last exons
  - -18_AATAAA - <distance_from_3'end>_<polyA_signal_sequence>. '-' denotes that position is upstream of predicted 3'end ('End' coordinate of GTF entry). If multiple motifs found they are separated by comma characters
min_motif_3p_deviation
- Distance deviation of identified motifs from the expected distance upstream of 3'ends (user-specified). Missing if motifs are not found in user-specified terminal window.
  - min(|distance_from_3'end - expected_distance|) where expected_distance is user defined and distance_from_3'end is the distance reported for each motif in pas_motifs
  - Where multiple motifs are identified, the minimum distance deviation is reported
min_motif_3p_distance
- Distance from predicted 3'end of identified motif reported in min_motif_3p_deviation. Missing if motifs are not found in user-specified terminal window.
  - e.g. if -18_AATAAA was the reported motif, -18 would be value for this key
condition_id
- condition key from 'condition' column in sample table for corresponding sample_id
End_le
- appended if atlas_filter = 1 and nearest_atlas_distance != 0. Corresponds to the original End coordinate of predicted last exon prior to updating to matched reference polyA site
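To illustrate how pas_motifs relates to min_motif_3p_deviation and min_motif_3p_distance, the sketch below parses a pas_motifs value and applies the min(|distance_from_3'end - expected_distance|) rule. The function name, the expected distance of -21 and the assumption that the expected distance shares the sign convention of the motif distances are all illustrative choices, not taken from the pipeline code.

```python
def summarise_pas_motifs(pas_motifs, expected_distance):
    """Return (min_motif_3p_deviation, min_motif_3p_distance) for one pas_motifs value.

    ASSUMPTION: expected_distance uses the same sign convention as the motif
    distances (negative = upstream of the predicted 3'end).
    Returns (None, None) when no motif was found.
    """
    if pas_motifs == "not_found":
        return None, None

    best = None
    for entry in pas_motifs.split(","):
        # Each entry looks like '-18_AATAAA': distance, underscore, motif sequence
        distance_str, _motif = entry.split("_", 1)
        distance = int(distance_str)
        deviation = abs(distance - expected_distance)
        # Keep the motif with the smallest deviation (first one wins on ties)
        if best is None or deviation < best[0]:
            best = (deviation, distance)
    return best


result = summarise_pas_motifs("-30_ATTAAA,-18_AATAAA", expected_distance=-21)
print(result)  # (3, -18)
```

How the pipeline itself breaks ties between equally deviating motifs is not documented here, so the first-wins behaviour above should be treated as a placeholder.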
Assorted notes:
- This file can contain duplicated intervals if exactly the same coordinate is predicted across multiple replicates (or even multiple conditions). I am likely to address this in future releases.
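Until duplicated intervals are handled upstream, they can be collapsed after the fact. The following is a minimal sketch that keeps the first record for each (chromosome, start, end, strand) combination; the helper name and the example records (including the source and feature columns) are made up for illustration.

```python
def drop_duplicate_intervals(gtf_lines):
    """Keep the first GTF record seen for each (seqname, start, end, strand)."""
    seen = set()
    kept = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        # GTF columns: seqname, source, feature, start, end, score, strand, frame, attribute
        key = (fields[0], fields[3], fields[4], fields[6])
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Hypothetical records: the first two share identical coordinates
example = [
    'chr1\tStringTie\texon\t1400\t1600\t.\t+\t.\tsample_id "group1_sample_01";',
    'chr1\tStringTie\texon\t1400\t1600\t.\t+\t.\tsample_id "group2_sample_06";',
    'chr1\tStringTie\texon\t1950\t2599\t.\t+\t.\tsample_id "group1_sample_03";',
]
unique = drop_duplicate_intervals(example)
print(len(unique))  # 2
```

Note that this discards the per-sample attributes (sample_id, condition_id, cov) of the dropped records, so it is only appropriate when the interval itself is of interest.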
Combined novel and annotated last exon references - novel_ref_combined.last_exons.gtf and novel_ref_combined.quant.last_exons.gtf
GTF files containing combination of novel and annotated last exons. novel_ref_combined.quant.last_exons.gtf represents the reference GTF used to construct the Salmon index. novel_ref_combined.last_exons.gtf instead represents the complete last exon sequences prior to trimming to unique regions (i.e. no overlaps with annotated internal exons).
The coordinates follow standard GTF convention. The attribute fields are identical between the two files and as described below. Note that I also consider these fields to be overly populated, and future releases are likely to simplify this output:
gene_id
- gene ID for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
transcript_id
- transcript ID for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
exon_number
- exon number for interval computed by StringTie (if novel) or extracted from reference GTF (if annotated)
Start_ref
- Start coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
End_ref
- End coordinate(s) of matching reference exon (if an extension event) or intron (if a 'spliced' event). If multiple matching reference exons, the coordinates are comma separated
event_type
- inferred 'event type' for last exon isoform. See above for description of possible values
gene_name
- Only present if interval originates from annotation. Redundant with ref_gene_name
region_rank
- takes on first, internal or last, based on the position of the last exon isoform in the gene (first = 5'most, last = 3'most)
transcript_id_ref
- Only present if interval originates from annotation. Redundant with transcript_id
ref_gene_id
- reference gene ID for last exon isoform. Propagated into ID mapping tables (see below)
ref_gene_name
- reference gene name for last exon isoform. Propagated into ID mapping tables (see below)
le_number
- corresponds to 5'-3' rank of unique last exon isoforms within gene
le_id
- unique identifier for last exon isoform. Any last exon intervals with overlapping sequence are grouped into a common identifier.
  - id_ref_gene_2_1 - <ref_gene_id>_<le_number> - reference gene ID suffixed with 'last exon number' (5'-3' rank within gene)
Assorted notes:
- This file can contain duplicated intervals (annotated or novel), including those with closely spaced 3'ends (likely the result of imprecise cleavage). I am likely to address this in future releases.
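Because le_id follows the <ref_gene_id>_<le_number> pattern, the gene ID and rank can be recovered from the identifier alone. The sketch below (helper names are illustrative) splits on the final underscore only, since reference gene IDs such as id_ref_gene_2 may themselves contain underscores.

```python
from collections import defaultdict


def split_le_id(le_id):
    """Split '<ref_gene_id>_<le_number>' on the LAST underscore only."""
    ref_gene_id, le_number = le_id.rsplit("_", 1)
    return ref_gene_id, int(le_number)


def group_by_gene(le_ids):
    """Map each reference gene ID to its sorted last exon numbers."""
    grouped = defaultdict(list)
    for le_id in le_ids:
        gene, number = split_le_id(le_id)
        grouped[gene].append(number)
    return {gene: sorted(numbers) for gene, numbers in grouped.items()}


print(split_le_id("id_ref_gene_2_1"))  # ('id_ref_gene_2', 1)
print(group_by_gene(["id_ref_gene_1_2", "id_ref_gene_1_1", "id_ref_gene_2_1"]))
# {'id_ref_gene_1': [1, 2], 'id_ref_gene_2': [1]}
```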
ID mapping files - novel_ref_combined.tx2gene.tsv, novel_ref_combined.tx2le.tsv, novel_ref_combined.le2gene.tsv, novel_ref_combined.le2genename.tsv
All the following files follow a common structure, with columns denoting the 'identifier type':
novel_ref_combined.tx2gene.tsv
- transcript ID and gene ID
  - transcript_id - corresponds to transcript_id key from attribute field in combined GTFs
  - gene_id - corresponds to ref_gene_id key from attribute field in combined GTFs
novel_ref_combined.tx2le.tsv
- transcript ID and 'le_id'
  - transcript_id - corresponds to transcript_id key from attribute field in combined GTFs
  - le_id - corresponds to le_id in combined GTFs
novel_ref_combined.le2gene.tsv
- le_id and gene ID
  - le_id - corresponds to le_id in combined GTFs
  - gene_id - corresponds to ref_gene_id key from attribute field in combined GTFs
novel_ref_combined.le2genename.tsv
- le_id and gene name
  - le_id - corresponds to le_id in combined GTFs
  - gene_name - corresponds to ref_gene_name key from attribute field in combined GTFs
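The mapping tables are related to one another: chaining tx2le with le2gene should reproduce tx2gene. A minimal sketch of loading and chaining them is shown below. It assumes the files are tab-separated with a header row using the column names listed above (not verified here), and it reads from in-memory strings with made-up rows rather than the real files.

```python
import csv
import io


def read_mapping(handle, key_col, value_col):
    """Read a two-column TSV (with header) into a dict of key_col -> value_col."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[key_col]: row[value_col] for row in reader}


# Illustrative stand-ins for novel_ref_combined.tx2le.tsv and .le2gene.tsv
tx2le_text = (
    "transcript_id\tle_id\n"
    "PAPA.group2_sample_04.1.1\tid_ref_gene_1_1\n"
    "ref_g1_tr_1\tid_ref_gene_1_4\n"
)
le2gene_text = (
    "le_id\tgene_id\n"
    "id_ref_gene_1_1\tid_ref_gene_1\n"
    "id_ref_gene_1_4\tid_ref_gene_1\n"
)

tx2le = read_mapping(io.StringIO(tx2le_text), "transcript_id", "le_id")
le2gene = read_mapping(io.StringIO(le2gene_text), "le_id", "gene_id")

# Chain transcript -> le_id -> gene to rebuild a tx2gene-style mapping
tx2gene = {tx: le2gene[le] for tx, le in tx2le.items()}
print(tx2gene)
```

With real files, replace the io.StringIO objects with open(path, newline="") handles.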
novel_ref_combined.info.tsv is a simplified table containing metadata for each unique transcript in the combined GTF. Note that the columns present are not considered stable and may change in future releases:
transcript_id
- assigned unique transcript ID. Corresponds to transcript_id key from attribute field in combined GTFs
le_id
- assigned last exon identifier. Corresponds to le_id key from attribute field in combined GTFs
gene_id
- assigned reference gene ID. Corresponds to ref_gene_id key from attribute field in combined GTFs
gene_name
- assigned reference gene name. Corresponds to ref_gene_name key from attribute field in combined GTFs
event_type
- inferred 'event type' for last exon isoform. See above for description of possible values
Chromosome
- reference chromosome sourced from combined GTF
Start
- Start coordinate sourced from combined GTF
End
- End coordinate sourced from combined GTF
Strand
- genomic strand sourced from combined GTF
annot_status
- Whether event originates from reference annotation ('annotated') or predicted last exons ('novel')
$ tree test_data_output/differential_apa/
test_data_output/differential_apa/
├── dexseq_apa.image.RData
├── dexseq_apa.results.processed.tsv
├── dexseq_apa.results.tsv
├── formulas.txt
├── summarised_pas_quantification.counts.tsv
├── summarised_pas_quantification.gene_tpm.tsv
├── summarised_pas_quantification.ppau.tsv
└── summarised_pas_quantification.tpm.tsv
The main output files of note are the metadata-augmented DEXSeq results table (dexseq_apa.results.processed.tsv) and the various count/quantification matrices (summarised_pas_quantification.*.tsv).
Example eg_dexseq_apa.results.processed.tsv:
binID | groupID | featureID | exonBaseMean | dispersion | stat | pvalue | padj | group1 | group2 | log2fold_group2_group1 | le_id | gene.qvalue | contrast_name | gene_id | gene_name | event_type | annot_status | transcript_id | chromosome | strand | start | end | mean_PPAU_group1 | mean_PPAU_group2 | delta_PPAU_group2_group1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id_ref_gene_1:E001 | id_ref_gene_1 | E001 | 102.30010013797774 | 0.08728199388325558 | 14.779434541157954 | 1.2084626559287708e-4 | 3.625387967786313e-4 | 1.970004973787062 | 2.046000549463702 | 0.25495812632124615 | id_ref_gene_1_1 | 0 | group2vsgroup1 | id_ref_gene_1 | ref_gene_1 | first_exon_spliced | novel | PAPA.group2_sample_04.1.1 | chr1 | + | 500 | 700 | 0.571113312433762 | 0.819125243817251 | 0.24801193138349 |
id_ref_gene_1:E002 | id_ref_gene_1 | E002 | 53.162865498165885 | 0.0290240675573647 | 114.49484271761776 | 1.01529962365023e-26 | 6.091797741901381e-26 | 1.8403255217654115 | 1.3779809537657772 | -1.576609330145999 | id_ref_gene_1_2 | 0 | group2vsgroup1 | id_ref_gene_1 | ref_gene_1 | internal_exon_spliced | novel | PAPA.group1_sample_01.1.7,PAPA.group2_sample_06.1.6 | chr1 | + | 1400,1400 | 1600,1600 | 0.427395249868125 | 0.169998726991834 | -0.257396522876291 |
id_ref_gene_1:E003 | id_ref_gene_1 | E003 | 173.7374193477014 | 0.164070451327647 | 2.6797725917421076 | 0.10163024034168788 | 0.10163024034168788 | 1.7553054456349897 | 2.5546210217424665 | 2.6768091091447417 | id_ref_gene_1_3 | 0 | group2vsgroup1 | id_ref_gene_1 | ref_gene_1 | internal_exon_spliced | novel | PAPA.group1_sample_03.1.6,PAPA.group2_sample_05.1.8 | chr1 | + | 1950,1950 | 2599,2598 | 7.2652715448876e-4 | 0.00516402410597214 | 0.00443749695148337 |
id_ref_gene_1:E004 | id_ref_gene_1 | E004 | 306.7360492958544 | 0.2524001172499824 | 4.248342859211704 | 0.03928865011600179 | 0.04714638013920215 | 1.979697794910254 | 2.8059821452923033 | 2.757797344758746 | id_ref_gene_1_4 | 0 | group2vsgroup1 | id_ref_gene_1 | ref_gene_1 | last_exon_spliced | annotated | ref_g1_tr_1 | chr1 | + | 3000 | 3400 | 7.6491054362457e-4 | 0.00571200508494262 | 0.00494709454131804 |
Column descriptions:
The following columns are all generated from DEXSeq's results table. Please consult the DEXSeq documentation for column descriptions:
- binID
- groupID
- featureID
- exonBaseMean
- dispersion
- stat
- pvalue
- padj
- group1
- group2
- log2fold_group2_group1
- gene.qvalue
The following metadata columns are appended by the pipeline:
le_id
- last exon identifier constructed by PAPA
contrast_name
- Name of experimental contrast tested. Corresponds to <numerator_condition>vs<denominator_condition>, where numerator and denominator conditions correspond to keys in the 'condition' column of the sample table
gene_id
- gene ID extracted from reference annotation
gene_name
- gene name extracted from reference annotation
event_type
- inferred 'event type' for last exon isoform. See above for description of possible values
annot_status
- whether event originates from reference annotation or novel events
transcript_id
- transcript ID extracted from reference annotation (if annotated) or defined by StringTie
chromosome
- chromosome of origin
strand
- genomic strand of origin
start
- genomic start coordinate for last exon isoform. Can be multiple comma separated values if distinct last exons are collapsed into a single last exon isoform
end
- genomic end coordinate for last exon isoform. Can be multiple comma separated values if distinct last exons are collapsed into a single last exon isoform. Indexes in start and end coordinates are matching (i.e. 1st start and 1st end coordinate correspond to a specific event)
mean_PPAU_group1
- mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group1. Group name suffix is determined by 'condition' column in sample table.
mean_PPAU_group2
- mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group2. Group name suffix is determined by 'condition' column in sample table.
delta_PPAU_group2_group1
- difference in mean % poly(A) site usage (expressed as a fraction) of last exon isoforms between group2 and group1 (mean_PPAU_group2 - mean_PPAU_group1). Group name suffixes are determined by 'condition' column in sample table.
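A few of the appended columns need light parsing before use. The sketch below (helper names are illustrative, not part of the pipeline) splits contrast_name into its conditions, pairs up comma-separated start/end values by index, and checks the delta_PPAU arithmetic against the first row of the example table.

```python
import math


def split_contrast(contrast_name, conditions):
    """Return (numerator, denominator) for '<numerator>vs<denominator>'.

    Matching against the known condition keys avoids mis-splitting
    condition names that happen to contain 'vs'.
    """
    for numerator in conditions:
        for denominator in conditions:
            if numerator != denominator and contrast_name == f"{numerator}vs{denominator}":
                return numerator, denominator
    raise ValueError(f"cannot resolve contrast: {contrast_name}")


def paired_coordinates(start, end):
    """Pair comma-separated start/end values by index."""
    starts = [int(value) for value in str(start).split(",")]
    ends = [int(value) for value in str(end).split(",")]
    if len(starts) != len(ends):
        raise ValueError("start and end hold different numbers of values")
    return list(zip(starts, ends))


print(split_contrast("group2vsgroup1", ["group1", "group2"]))  # ('group2', 'group1')
print(paired_coordinates("1950,1950", "2599,2598"))  # [(1950, 2599), (1950, 2598)]

# delta_PPAU_group2_group1 = mean_PPAU_group2 - mean_PPAU_group1 (first example row)
delta = 0.819125243817251 - 0.571113312433762
print(math.isclose(delta, 0.24801193138349, rel_tol=1e-9))  # True
```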
All isoform-level matrices follow the same basic structure. Example eg_summarised_pas_quantification.counts.tsv:
le_id | gene_id | group1_sample_01 | group1_sample_02 | group1_sample_03 | group2_sample_04 | group2_sample_05 | group2_sample_06 |
---|---|---|---|---|---|---|---|
id_ref_gene_1_1 | id_ref_gene_1 | 153.270694438624 | 87.9224799846888 | 73.5793392383333 | 87.085465994054 | 95.2339837457173 | 107.058789738493 |
id_ref_gene_1_2 | id_ref_gene_1 | 99.2776088978098 | 62.6447669892074 | 66.5717831203075 | 18.9491523227997 | 18.858214603034 | 22.1795102246696 |
id_ref_gene_1_3 | id_ref_gene_1 | 37.353397764602 | 86.2499524563182 | 54.5226552431889 | 192.816927844777 | 276.881950557951 | 480.112214730231 |
id_ref_gene_1_4 | id_ref_gene_1 | 40.8670692723053 | 92.300837007847 | 160.353380468839 | 721.378148575543 | 522.447234953964 | 397.961898154793 |
The first two columns are always:
le_id
- last exon isoform identifier assigned by PAPA
gene_id
- gene_id extracted from reference annotation and used to group last exon isoforms according to parent gene
The subsequent columns correspond to sample names extracted from the 'sample_name' column of the sample sheet. Depending on the file name, the columns are populated with different values:
summarised_pas_quantification.counts.tsv
- contains estimated counts for each isoform as calculated by tximport
summarised_pas_quantification.ppau.tsv
- contains calculated % polyA site usage values with respect to the gene for each isoform (expressed as fractions)
summarised_pas_quantification.tpm.tsv
- contains TPM values for each isoform as calculated by tximport
The 'PPAU' matrix (summarised_pas_quantification.ppau.tsv) is appended with additional summary columns. These columns are identical to those that end up in the final differential usage output table (dexseq_apa.results.processed.tsv):
mean_PPAU_group1
- mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group1. Group name suffix is determined by 'condition' column in sample table.
mean_PPAU_group2
- mean % poly(A) site usage (expressed as a fraction) of last exon isoform in samples from group2. Group name suffix is determined by 'condition' column in sample table.
delta_PPAU_group2_group1
- difference in mean % poly(A) site usage (expressed as a fraction) of last exon isoforms between group2 and group1 (mean_PPAU_group2 - mean_PPAU_group1). Group name suffixes are determined by 'condition' column in sample table.
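For intuition, the relationship between isoform-level values, PPAU and the summary columns can be sketched as follows. This assumes PPAU is the isoform's TPM divided by the summed TPM of all last exon isoforms of the same gene in that sample; the exact quantity used by the pipeline is not stated here, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean


def compute_ppau(isoform_tpm, le2gene):
    """Return {le_id: {sample: fraction of the gene total}}.

    ASSUMPTION: usage = isoform TPM / summed TPM of the gene's isoforms.
    """
    gene_totals = defaultdict(lambda: defaultdict(float))
    for le_id, per_sample in isoform_tpm.items():
        for sample, value in per_sample.items():
            gene_totals[le2gene[le_id]][sample] += value

    ppau = {}
    for le_id, per_sample in isoform_tpm.items():
        gene = le2gene[le_id]
        ppau[le_id] = {}
        for sample, value in per_sample.items():
            total = gene_totals[gene][sample]
            # Usage is undefined when the gene has no expression in a sample
            ppau[le_id][sample] = value / total if total > 0 else float("nan")
    return ppau


# Made-up TPM values for two isoforms of one gene across four samples
isoform_tpm = {
    "id_ref_gene_1_1": {"group1_sample_01": 6.0, "group1_sample_02": 3.0,
                        "group2_sample_04": 1.0, "group2_sample_05": 2.0},
    "id_ref_gene_1_2": {"group1_sample_01": 2.0, "group1_sample_02": 1.0,
                        "group2_sample_04": 3.0, "group2_sample_05": 2.0},
}
le2gene = {"id_ref_gene_1_1": "id_ref_gene_1", "id_ref_gene_1_2": "id_ref_gene_1"}

ppau = compute_ppau(isoform_tpm, le2gene)
usage = ppau["id_ref_gene_1_1"]
mean_group1 = mean(v for s, v in usage.items() if s.startswith("group1"))
mean_group2 = mean(v for s, v in usage.items() if s.startswith("group2"))
delta = mean_group2 - mean_group1
print(mean_group1, mean_group2, delta)  # 0.75 0.375 -0.375
```

Grouping samples by name prefix is a shortcut for the example; with real data the 'condition' column of the sample table should be used instead.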
The 'gene-level expression' file (summarised_pas_quantification.gene_tpm.tsv) follows a similar structure to the above files, except that the 'le_id' column is omitted. The column values correspond to the sum of TPM expression for all last exon isoforms of a given gene. Additionally, the following columns are appended:
mean_gene_TPM_group1
- mean gene-level TPM values for samples in group1. Group name suffix is determined by 'condition' column in sample table.
mean_gene_TPM_group2
- mean gene-level TPM values for samples in group2. Group name suffix is determined by 'condition' column in sample table.
median_gene_TPM_group1
- median gene-level TPM values for samples in group1. Group name suffix is determined by 'condition' column in sample table.
median_gene_TPM_group2
- median gene-level TPM values for samples in group2. Group name suffix is determined by 'condition' column in sample table.
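The gene-level values and their summary columns can be reproduced from the isoform-level TPM matrix along these lines. This is a sketch with made-up values and illustrative helper names, not the pipeline's own implementation.

```python
from collections import defaultdict
from statistics import mean, median


def gene_level_tpm(isoform_tpm, le2gene):
    """Sum isoform TPM per gene and sample."""
    totals = defaultdict(lambda: defaultdict(float))
    for le_id, per_sample in isoform_tpm.items():
        for sample, value in per_sample.items():
            totals[le2gene[le_id]][sample] += value
    return {gene: dict(per_sample) for gene, per_sample in totals.items()}


def summarise_by_condition(per_sample, sample2condition):
    """Build mean_gene_TPM_<condition> / median_gene_TPM_<condition> values."""
    by_condition = defaultdict(list)
    for sample, value in per_sample.items():
        by_condition[sample2condition[sample]].append(value)
    summary = {}
    for condition, values in by_condition.items():
        summary[f"mean_gene_TPM_{condition}"] = mean(values)
        summary[f"median_gene_TPM_{condition}"] = median(values)
    return summary


# Made-up isoform TPM values for one gene with two last exon isoforms
isoform_tpm = {
    "id_ref_gene_1_1": {"group1_sample_01": 10.0, "group1_sample_02": 20.0, "group1_sample_03": 60.0},
    "id_ref_gene_1_2": {"group1_sample_01": 5.0, "group1_sample_02": 10.0, "group1_sample_03": 0.0},
}
le2gene = {"id_ref_gene_1_1": "id_ref_gene_1", "id_ref_gene_1_2": "id_ref_gene_1"}
sample2condition = {name: "group1" for name in isoform_tpm["id_ref_gene_1_1"]}

gene_tpm = gene_level_tpm(isoform_tpm, le2gene)
summary = summarise_by_condition(gene_tpm["id_ref_gene_1"], sample2condition)
print(gene_tpm["id_ref_gene_1"])
print(summary)  # {'mean_gene_TPM_group1': 35.0, 'median_gene_TPM_group1': 30.0}
```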
dexseq_apa.image.RData
- workspace/saved environment for the DEXSeq run.
dexseq_apa.results.tsv
- standard results dataframe output by a DEXSeq differential analysis
formulas.txt
- two-line text file containing the full (first line) and reduced models (second line) input to DEXSeq's likelihood ratio test
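Since formulas.txt has a fixed two-line layout, it can be read back with a few lines of code, for example to record the models alongside downstream results. In the sketch below the helper name is illustrative and the example formulas are DEXSeq's documented default full and reduced models, used purely as placeholders; the actual contents of the file depend on the run.

```python
def read_formulas(lines):
    """Return the full (line 1) and reduced (line 2) model formulas."""
    formulas = [line.strip() for line in lines if line.strip()]
    if len(formulas) != 2:
        raise ValueError(f"expected 2 formulas, found {len(formulas)}")
    full, reduced = formulas
    return {"full": full, "reduced": reduced}


# Placeholder contents; with a real run, pass open("formulas.txt") instead
example_lines = ["~ sample + exon + condition:exon\n", "~ sample + exon\n"]
models = read_formulas(example_lines)
print(models["full"])     # ~ sample + exon + condition:exon
print(models["reduced"])  # ~ sample + exon
```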