Skip to content

R package for extracting and visualizing mutational patterns in base substitution catalogues

License

Notifications You must be signed in to change notification settings

malinlarsson/MutationalPatterns

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MutationalPatterns

The MutationalPatterns R package provides a comprehensive set of flexible functions for easy finding and plotting of mutational patterns in base substitution catalogues.

ChangeLog

v0.99.6

Renamed functions

  • mut_type_occurences to mut_type_occurrences
  • strand_occurences to strand_occurrences

v0.99.5

  • Added deprecation and defunct messages to functions that have changed since the v0.99.0.
  • Various small vignette and reference manual updates.

v0.99.4

  • Internal package loading changes.
  • Removed files that do not belong to the package.

v0.99.3

Renamed functions

  • get_mut_context to mutation_context
  • get_type_context to type_context
  • get_muts to mutations_from_vcf
  • get_strand to strand_from_vcf

Other changes

  • Added an explanation for the difference between SomaticSignatures and MutationalPatterns in the vignette.

v0.99.2

Renamed functions

  • vcf_to_granges to read_vcfs_as_granges
  • get_types to mutation_types

v0.99.1

Renamed functions

  • read_vcf to vcf_to_granges

Removed functions

  • bed_to_granges, estimate_rank, rename_chrom

Parameter changes

  • plot_rainfall, vcf_to_granges

v0.2.0-beta

Updated functions

  • Faster vcf file loading
  • Automatic check and exclusion of positions in vcf with indels and/or multiple alternative alleles
  • Default plotting colours to standard in the mutational signatures field

New features

  • Function to make chromosome names uniform according to e.g. UCSC standard
  • Transcriptional strand bias analysis
  • Signature extraction (NMF) with transcriptional strand information
  • Enrichment/depletion test for genomic annotations

Please give credit and cite MutationalPatterns R Package when you use it for your data analysis. A preprint of the article can be found on bioRxiv.

Table of Contents

Getting started

Installation

For this package, you need R version 3.3.0 or higher. Additionally, you need BiocInstaller and devtools to easily load this package.

Install BiocInstaller:

source("https://bioconductor.org/biocLite.R")
biocLite("BiocInstaller")
library("BiocInstaller")

Install devtools to load this package:

install.packages("devtools")
library(devtools)

Install and load MutationalPatterns:

options(unzip = 'internal')
install_github("CuppenResearch/MutationalPatterns")
library(MutationalPatterns)

Pick a reference genome

  1. List all available reference genomes (BSgenome)
library(BSgenome)
available.genomes()
  1. Download and load your reference genome of interest
ref_genome = "BSgenome.Hsapiens.UCSC.hg19"
biocLite(ref_genome)
library(ref_genome, character.only = TRUE)

Load data

Small data samples are included in the package. You can download larger samples from the MutationalPatterns-data repository.

The data in that repository consists of somatic mutation catalogues of nine normal human adult stem cells from three tissues (Blokzijl et al., 2016).

To load data, we need to locate it:

# Use the provides sample data
vcf_files <- list.files(system.file("extdata", package="MutationalPatterns"),
                        pattern = ".vcf", full.names = TRUE)

# Or list your own vcf files
vcf_files = list.files(your_dir, pattern = ".vcf", full.names = TRUE)

And define corresponding names for the datasets:

sample_names = c("colon1", "colon2", "colon3", 
                 "intestine1", "intestine2", "intestine3",
                 "liver1", "liver2", "liver3")

This package is for the analysis of patterns in base substitution data only, therefore indel positions and positions with multiple alternative alleles are discarded.

Load a single VCF file:

vcf = read_vcfs_as_granges(vcf_files[1], sample_names[1], genome = "hg19")

Or load a list of VCF files:

vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "hg19")

Include relevant metadata in your analysis, e.g. donor id, cell type, age, tissue type, mutant or wild type

tissue = c(rep("colon", 3), rep("intestine", 3), rep("liver", 3))

Take a subset of the chromosomes

You could, for example, select autosomal chromosomes:

auto = extractSeqlevelsByGroup(species="Homo_sapiens", 
                               style="UCSC",
                               group="auto")

vcfs = lapply(vcfs, function(x) keepSeqlevels(x, auto))

Mutation characteristics

Base substitution types

Retrieve base substitutions from VCF object as "REF>ALT".

mutations_from_vcf(vcfs[[1]])

Retrieve base substitutions from vcf and convert to the 6 types of base substitution types that are distinguished by convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, if the reference allele is G and the alternative allele is T (G>T), this functions returns the G:C>T:A mutation as a C>A mutation.

mutation_types(vcfs[[1]])

Retrieve the context (1 base upstream and 1 base downstream) of the positions in the vcf object from the reference genome.

mutation_context(vcfs[[1]], ref_genome)

Retrieve the types and context of the base substitution types for all positions in the vcf object. For the base substitutions that are converted to the conventional base substitution types, the reverse complement of the context is returned.

type_context(vcfs[[1]], ref_genome)

Count mutation type occurrences for one vcf object

type_occurrences = mut_type_occurrences(vcfs[1], ref_genome)

Count mutation type occurrences for all samples in a list of vcf objects

type_occurrences = mut_type_occurrences(vcfs, ref_genome)

Mutation spectrum

Plot mutation spectrum over all samples. Plots the mean relative contribution of each of the 6 base substitution types. Error bars indicate standard deviation over all samples. The n indicates the total number of mutations in the set.

plot_spectrum(type_occurrences)

Plot mutation spectrum with distinction between C>T at CpG sites

plot_spectrum(type_occurrences, CT = TRUE)

Plot spectrum without legend

plot_spectrum(type_occurrences, CT = TRUE, legend = FALSE)

spectra1

Plot spectrum for each tissue separately

plot_spectrum(type_occurrences, by = tissue, CT = TRUE)

Specify 7 colors for spectrum plotting

my_colors = c("pink", "orange", "blue", "lightblue", "green", "red", "purple")
plot_spectrum(type_occurrences, CT = T, legend = T, colors = my_colors)

spectra2

96 Mutation profile

Make 96 trinucleodide mutation count matrix

test_matrix = mut_matrix(vcf_list = vcfs, ref_genome = ref_genome)

Plot 96 profile of three samples

plot_96_profile(test_matrix[,c(1,4,7)])

96_mutation_profile

Mutational signatures

De novo mutational signature extraction using NMF

A critical parameter in NMF is the factorization rank, which is the number of mutational signatures. Determine the optimal factorization rank using the NMF package (Gaujoux and Seoighe, 2010). As described in their paper: "...a common way of deciding on the rank is to try different values, compute some quality measure of the results, and choose the best value according to this quality criteria. The most common approach is to choose the smallest rank for which cophenetic correlation coefficient starts decreasing. Another approach is to choose the rank for which the plot of the residual sum of squares (RSS) between the input matrix and its estimate shows an inflection point."

# Add a tiny psuedocount to avoid a 0 in the matrix.
mut_mat = mut_mat + 0.0001

# Use the NMF package to generate an estimate plot.
library("NMF")
estimate = nmf(mut_mat, rank=2:5, method="brunet", nrun=100, seed=123456)
plot(estimate)

estim_rank

Extract and plot 2 signatures

nmf_res = extract_signatures(test_matrix, rank = 3)
# provide signature names (optional)
colnames(nmf_res$signatures) = c("Signature A", "Signature B")
# plot signatures
plot_96_profile(nmf_res$signatures)

signatures

Plot signature contribution

# provide signature names (optional)
rownames(nmf_res$contribution) = c("Signature A", "Signature B")
# plot relative signature contribution
plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "relative")
# plot absolute signature contribution
plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute")

contribution1

# plot contribution of signatures for subset of samples with index parameter
plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute", index = c(1,2))
# flip X and Y coordinates
plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute", coord_flip = T)

contribution2

Compare reconstructed mutation profile with original mutation profile

plot_compare_profiles(test_matrix[,1], nmf_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"))

originalVSreconstructed

Fit 96 mutation profiles to known signatures

Download signatures from pan-cancer study (Alexandrov et al. 2013)

sp_url = "http://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt"
cancer_signatures = read.table(sp_url, sep = "\t", header = T)
# reorder (to make the order of the trinucleotide changes the same)
cancer_signatures = cancer_signatures[order(cancer_signatures[,1]),]
# only signatures in matrix
cancer_signatures = as.matrix(cancer_signatures[,4:33])

Fit mutation matrix to cancer signatures. This function finds the optimal linear combination of mutation signatures that most closely reconstructs the mutation matrix by solving nonnegative least-squares constraints problem

fit_res = fit_to_signatures(test_matrix, cancer_signatures)
# select signatures with some contribution
select = which(rowSums(fit_res$contribution) > 0)
# plot contribution
plot_contribution(fit_res$contribution[select,], cancer_signatures[,select], coord_flip = F, mode = "absolute")

signatures

Compare reconstructed mutation profile of sample 1 using cancer signatures with original profile

plot_compare_profiles(test_matrix[,1], fit_res$reconstructed[,1], profile_names = c("Original", "Reconstructed \n cancer signatures"))

contribution

Transcriptional strand bias

Strand bias analysis

For the mutations within genes it can be determined whether the mutation is on the transcribed or non-transcribed strand, which is interesting to study involvement of transcription-coupled repair. To this end, it is determined whether the "C" or "T" base (since by convention we regard base substitutions as C>X or T>X) are on the same strand as the gene definition. Base substitions on the same strand as the gene definitions are considered "untranscribed", and on the opposite strand of gene bodies as transcribed, since the gene definitions report the coding or sense strand, which is untranscribed. No strand information is reported for base substitution that overlap with more than one gene body.

Find gene definitions for your reference genome.

# get knowngenes table from UCSC for hg19
biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
genes_hg19 = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

Get transcriptional strand information for all positions in vcf 1. "Base substitions on the same strand as the gene definitions are considered. "-" for positions outside gene bodies, "U" for untranscribed/sense/coding strand, "T" for transcribed/anti-sense/non-coding strand.

strand_from_vcf(vcfs[[1]], genes_hg19)

Make mutation count matrix with transcriptional strand information (96 trinucleotides * 2 strands = 192 features). NB: only those mutations that are located within gene bodies are counted.

mut_mat_s = mut_matrix_stranded(vcfs, ref_genome, genes_hg19)

Perform strand bias analysis

strand_counts = strand_occurrences(mut_mat_s, by=tissue)
strand_plot = plot_strand(strand_counts, mode = "relative")

Perform poisson test for strand asymmetry significance testing

strand_bias = strand_bias_test(strand_counts)
strand_bias_plot = plot_strand_bias(strand_bias)

strand

Extract signatures with strand bias

# extract 2 signatures
nmf_res_strand = extract_signatures(mut_mat_s, rank = 2)

# provide signature names (optional)
colnames(nmf_res_strand$signatures) = c("Signature A", "Signature B")
# plot signatures with 192 features
plot_192_profile(nmf_res_strand$signatures)

# plot strand bias per mutation type for each signature with significance test
plot_signature_strand_bias(nmf_res_strand$signatures)

signatures_strand

Genomic distribution

Rainfall plot

A rainfall plot visualizes mutation types and intermutation distance. Rainfall plots can be used to visualize the distribution of mutations along the genome or a subset of chromosomes. The y-axis corresponds to the distance of a mutation with the previous mutation and is log10 transformed. Drop-downs from the plots indicate clusters or "hotspots" of mutations.

Make rainfall plot of sample 1 over all autosomal chromosomes

# define autosomal chromosomes
chromosomes = seqnames(get(ref_genome))[1:22]
# make rainfall plot
plot_rainfall(vcfs[[1]], title = names(vcfs[1]), ref_genome = ref_genome, chromosomes = chromosomes, cex = 1.5)

rainfall1

Make rainfall plot of sample 1 over chromosome 1

chromosomes = seqnames(get(ref_genome))[1]
plot_rainfall(vcfs[[1]], title = names(vcfs[1]), ref_genome = ref_genome, chromosomes = chromosomes[1], cex = 2)

rainfall2

Enrichment or depletion of mutations in genomic regions

Test for enrichment or depletion of mutations in certain genomic regions, such as promoters, CTCF binding sites and transcription factor binding sites. To use your own genomic region definitions (based on e.g. ChipSeq experiments) specify your genomic regions in a named list of GRanges objects. Alternatively, use publically available genomic annotation data, like in the example below.

Example: regulation annotation data from Ensembl using biomaRt

Here is an example of how to download regulation annotation data for genome build hg19. For other datasets, see biomaRt package documentation (Durinck et al. 2005).

Install and load biomaRt package

source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
library(biomaRt)
mart="ensembl"
# list datasets available from ensembl BiomaRt
listDatasets(useEnsembl(biomart="regulation"))
# Multicell regulatory features for hg19
regulation_regulatory = useEnsembl(biomart="regulation", dataset="hsapiens_regulatory_feature", GRCh = 37)
# list all possible filters
listFilters(regulation_regulatory)
# list all posible output attributes
listAttributes(regulation_regulatory)
# list all filter options for a specific attribute
filterOptions("regulatory_feature_type_name", regulation_regulatory)

Download data from Ensembl using biomaRt

# Promoters
promoter = getBM(attributes = c('chromosome_name', 'chromosome_start', 'chromosome_end', 'feature_type_name'), 
                 filters = "regulatory_feature_type_name", 
                 values = "Promoter", 
                 mart = regulation_regulatory)
promoter_g = reduce(GRanges(promoter$chromosome_name, IRanges(promoter$chromosome_start, promoter$chromosome_end)))

  # Promoter flanking regions
  promoter_flanking = getBM(attributes = c('chromosome_name', 'chromosome_start', 'chromosome_end', 'feature_type_name'), 
              filters = "regulatory_feature_type_name", 
              values = "Promoter Flanking Region", 
              mart = regulation_regulatory)
promoter_flanking_g = reduce(GRanges(promoter_flanking$chromosome_name, IRanges(promoter_flanking$chromosome_start, promoter_flanking$chromosome_end))) 

# CTCF binding sites
CTCF = getBM(attributes = c('chromosome_name', 'chromosome_start', 'chromosome_end', 'feature_type_name', 'cell_type_name'), 
              filters = "regulatory_feature_type_name", 
              values = "CTCF Binding Site", 
              mart = regulation_regulatory)
# convert to GRanges object
CTCF_g = reduce(GRanges(CTCF$chromosome_name, IRanges(CTCF$chromosome_start, CTCF$chromosome_end))) 

# Open chromatin
open = getBM(attributes = c('chromosome_name', 'chromosome_start', 'chromosome_end', 'feature_type_name'), 
              filters = "regulatory_feature_type_name", 
              values = "Open chromatin", 
              mart = regulation_regulatory)
open_g = reduce(GRanges(open$chromosome_name, IRanges(open$chromosome_start, open$chromosome_end))) 


# Transcription factor binding sites
TF_binding = getBM(attributes = c('chromosome_name', 'chromosome_start', 'chromosome_end', 'feature_type_name'), 
                          filters = "regulatory_feature_type_name", 
                          values = "TF binding site", 
                          mart = regulation_regulatory)
TF_binding_g = reduce(GRanges(TF_binding$chromosome_name, IRanges(TF_binding$chromosome_start, TF_binding$chromosome_end))) 

Combine all genomic regions (GRanges objects) in a GRangesList.

regions = list(promoter_g, promoter_flanking_g, CTCF_g, open_g, TF_binding_g)
names(regions) = c("Promoter", "Promoter flanking", "CTCF", "Open chromatin", "TF binding")

Test for significant depletion or enrichment in genomic regions

It is necessary to include a list with Granges of regions that were surveyed in your analysis for each sample, that is: positions in the genome at which you have enough high quality reads to call a mutation. This can for example be determined using CallableLoci tool by GATK. If you would not include the surveyed area in your analysis, you might for example see a depletion of mutations in a certain genomic region that is solely a result from a low coverage in that region, and therefore does not represent an actual depletion of mutations.

# Locate the file with surveyed/callable regions
surveyed_file <- list.files(system.file("extdata",
                          package = "MutationalPatterns"),
                          pattern = ".bed",
                          full.names = TRUE)

# Read BED file as GRanges object using rtracklayer
library(rtracklayer)
surveyed <- import(surveyed_file)

# Unify to a single seqlevel style
seqlevelsStyle(surveyed) <- "UCSC"

# For this example we use the same surveyed file for each sample
surveyed_list= rep(surveyed_list, 9)

Test for an enrichment or depletion of mutations in your defined genomic regions using a binomial test. For this test, the chance of observing a mutation is calculated as the total number of mutations, divided by the total number of surveyed bases.

# for each sample calculate the number of observed and expected number of mutations in each genomic regions
distr = genomic_distribution(vcfs, surveyed_list, regions)
# test for significant enrichment or depletion in the genomic regions
# samples can be collapsed into groups, here the analysis is performed per tissue type
distr_test = enrichment_depletion_test(distr, by = tissue)
# plot enrichment depletion test results
plot_enrichment_depletion(distr_test)

distr

About

R package for extracting and visualizing mutational patterns in base substitution catalogues

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • R 100.0%