# dataIntegration.R
#
#   +-----------------------------------------------------------------+
#   |                                                                 |
#   |  Do not edit this file! Edit "myDataIntegration.R" instead.     |
#   |                                                                 |
#   +-----------------------------------------------------------------+
#
# Purpose:
#
# Version: 1.1
#
# Date:    2019  05  12
# Author:  Boris Steipe (boris.steipe@utoronto.ca)
#
# V 1.1    2019 updates
# V 1.0    First code 2018
#
# TODO:
#
#
# == HOW TO WORK WITH THIS FILE ================================================
#
#  This file contains scenarios and tasks; we will discuss them in detail in
#  class. Edit profusely, write code, experiment with options, or just play.
#  Especially play.
#
#  If there is anything you don't understand, use R's help system,
#  Google for an answer, or ask. Especially ask. Don't continue if you don't
#  understand what's going on. That's not how it works ...
#
# ==============================================================================


#TOC> ==========================================================================
#TOC> 
#TOC>   Section  Title
#TOC> ---------------------------------------------------------
#TOC>   1        SCENARIO
#TOC>   2        READ DATA
#TOC>   3        EXPLORE DATA
#TOC>   4        INTEGRATE  DATA
#TOC>   4.1        BioMart provides integrated data
#TOC>   4.2        Put the data together
#TOC>   5        PLOT THE DATA
#TOC>   6        MORE PRACTICE
#TOC> 
#TOC> ==========================================================================


# =    1  SCENARIO  ============================================================

# Our data is becoming more and more high-dimensional, as every biomolecule
# has been observed in many variant states and has been richly annotated.
# In this unit we will retrieve some annotations for a protein encoded in the
# genomic region we worked with in the sequence analysis unit. We will
# integrate genome, transcript, variation and amino acid data, and we will
# design a visualization of the annotated data.

# We wish to create a plot that looks like this:

source("./sampleSolutions/dataIntegrationSampleSolutions-ShowPlot.R")

# This is an amino-acid level plot of cancer-related mutation types and
# frequencies on a gene found on Chromosome 20.


# =    2  READ DATA  ===========================================================

# Task 2.1: Open coordinates 58,815,001 to 58,915,000 of the hg38 assembly
#           of chromosome 20 in the Ensembl genome browser. What gene is
#           annotated to this region?

# Task 2.2: GNAS is a complex locus with multiple transcripts. Download the
#           transcript coordinates for protein coding genes. Hint: download
#           the data from the corresponding Ensembl gene page.
#           - Save the results page as "ENSG00000087460data.csv"
#           - Read the file into an R data frame called GNAStranscripts
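#
# A minimal sketch of the read step, assuming a comma-separated export with a
# header row. The file content and the column names (ID, biotype) are made up
# for illustration; inspect the real file to see what Ensembl provides.
toyCSV <- tempfile(fileext = ".csv")
writeLines(c("ID,biotype",
             "ENST_toy1,protein_coding",
             "ENST_toy2,retained_intron"), toyCSV)
toyTranscripts <- read.csv(toyCSV, stringsAsFactors = FALSE)
str(toyTranscripts)   # check column names and types before going on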

# Task 2.3  Remove all rows from GNAStranscripts that are not protein coding:
#           - what column are we looking at?
#           - what values exist in this column?
#           - how do we subset the data frame to the values we want?
#           - how many transcripts do we have? What are their IDs?
#           - restrict the rows to contain only Ensembl transcripts. How many
#             transcripts are left?
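#
# One possible pattern for the subsetting steps, shown on a toy data frame.
# The column names (ID, biotype) and all values are hypothetical.
toyT <- data.frame(ID = c("ENST_a", "ENST_b", "OTT_c"),
                   biotype = c("protein_coding", "retained_intron",
                               "protein_coding"),
                   stringsAsFactors = FALSE)
table(toyT$biotype)                                # what values exist?
toyT <- toyT[toyT$biotype == "protein_coding", ]   # keep protein coding rows
toyT <- toyT[grepl("^ENST", toyT$ID), ]            # keep Ensembl transcripts
nrow(toyT)                                         # how many are left?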

# Task 2.4  Calculate the transcript lengths for all transcripts. Store
#           them in a named vector called "tLengths".
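#
# A sketch for a named vector, assuming hypothetical columns "start" and "end"
# with inclusive coordinates. This gives the genomic span of each row; the
# real table may already hold a spliced length in a column of its own.
toySpan <- data.frame(ID = c("tA", "tB"),
                      start = c(101, 201),
                      end   = c(150, 230),
                      stringsAsFactors = FALSE)
toyLengths <- toySpan$end - toySpan$start + 1   # + 1: both ends are included
names(toyLengths) <- toySpan$ID
toyLengths["tB"]                                # fetch a value by name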
#
# Task 2.5  Find the GNAS page on the IntOGen cancer driver gene website.
#           Explore the page. To download the mutation distribution, you need
#           to register (databases need records of who uses them to compete
#           for funding.) You can register and download, or use the file
#           "./data/GNAS-distribution-data.tsv" instead.
#           - Read the file into a data frame called "GNASmutations".


# =    3  EXPLORE DATA  ========================================================
#
# Task 3.1  View GNASmutations. What do you see?
#           - How many observations of each transcript?
#           - Are there transcripts that are not in our Ensembl table?
#               (hint: use the %in% operator)
#           - How many of each mutation type? Plot that!
#
#           - Exclude the splice region variants, since their effect
#             is not predictable.
#
#           - Are the reference nucleotides correct for our GRCh38 data?
#             If not: how do we fix that?
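#
# A sketch of the first few checks on toy data. The column names (transcript,
# effect) and all values are made up; the real file may name things
# differently.
toyMut <- data.frame(transcript = c("tA", "tA", "tB", "tX"),
                     effect = c("missense_variant", "stop_gained",
                                "missense_variant", "splice_region_variant"),
                     stringsAsFactors = FALSE)
toyKnown <- c("tA", "tB")                    # IDs from the transcript table
table(toyMut$transcript)                     # observations per transcript
unique(toyMut$transcript[!(toyMut$transcript %in% toyKnown)])   # unknown IDs
barplot(table(toyMut$effect), las = 2, cex.names = 0.6)   # counts per type
toyMut <- toyMut[toyMut$effect != "splice_region_variant", ]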


# =    4  INTEGRATE  DATA  =====================================================

# The resulting data is all over the place: we have a table with transcript
# annotations and a derived vector of lengths, and some of our coordinates are
# from GRCh37 while others are from GRCh38. Integration is possible, but
# probably messy. We need to discuss first what a "proper" data model looks
# like in principle, then we'll explore BioMart, a versatile integration
# solution.
#
# Your ./assets folder contains a file: FND-CSC-Data_models.pdf ...
#
# Now: what do we need to integrate for our plot?
# - we need genomic coordinates, because that's what our sequencing
#   experiments and variant calling return;
# - we need the coding sequence;
# - we need the codon positions/translation;
# - we need the mutations that are mapped to the sequence of interest.
#
# ==   4.1  BioMart provides integrated data  ==================================
#
# Navigate to http://www.ensembl.org/. Click on BioMart. Getting data
# from BioMart involves four steps:
# - Choose the Database: here - choose Ensembl Genes 92
# - Choose the Dataset:  here - choose Human Genes (GRCh38.p13)
# - Choose Filters:      explore what's available. The Gene ID for GNAS is
#                          ENSG00000087460. Set this as the filter.
#
# - Choose Attributes:   explore what's available. Most importantly, we need a
#                        gene model. (Actually I haven't found a downloadable
#                        gene model for human genes anywhere else. Or do you
#                        know of a source?) How do we get a gene model from
#                        BioMart?
#
# Once you have selected what you need - or just to explore what you selected,
# as a preview - click "Results". Finally select ...
# "Export all results to" ... "File" "TSV", and "Go". Inspect the resulting
# file. But hold on ... are these the coordinates we need?
#
# Task 4.1.1  Save the correct gene model coordinates as GNASgeneModels.37.tsv
#             in your project folder.
#
#             - Read the data into a data frame, call it GNASmodels


# ==   4.2  Put the data together  =============================================
#
# Task 4.2.1 Create a data frame for a GNAS-2 gene model according to the
#            following specifications:
#
#     -  Choose data for ENST00000371095 (codes for ENSP00000360136 / NP_536351
#          / GNASS / GNAS-2 / isoform of P63092)
#     -  call the data frame GNAS2model and store columns "start" and "end" for
#        each CDS segment
#     -  Make sure the segments are in the correct order.
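#
# A sketch of selecting and ordering CDS segments, assuming hypothetical
# column names (transcriptID, cdsStart, cdsEnd) and toy coordinates. On the
# forward strand, ascending genomic start gives 5' to 3' order; a gene on the
# reverse strand would need decreasing order instead.
toyModels <- data.frame(transcriptID = c("tA", "tB", "tA", "tA"),
                        cdsStart = c(301, 120, 101, 201),
                        cdsEnd   = c(303, 140, 104, 205),
                        stringsAsFactors = FALSE)
# keep one transcript, and drop exons that carry no CDS coordinates
sel <- toyModels$transcriptID == "tA" & !is.na(toyModels$cdsStart)
toyModel <- data.frame(start = toyModels$cdsStart[sel],
                       end   = toyModels$cdsEnd[sel])
toyModel <- toyModel[order(toyModel$start), ]
rownames(toyModel) <- NULL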

# Task 4.2.2 Create a data frame for GNAS-2 protein annotations, according to
#            the following specifications:
#
#     -  Call it GNAS2protein
#     -  It should have one row for each nucleotide in the CDS
#     -  Give it the following columns:
#           GNAS2protein$coord     - the genomic coordinates
#           GNAS2protein$nuc       - the actual nucleotide
#           GNAS2protein$codonPos  - 1,2 or 3: the codon position
#           GNAS2protein$aa        - The amino acid (in codon position 1 only)
#           GNAS2protein$iCodon    - The codon index (in all three positions)
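#
# A sketch of the per-nucleotide table on toy data. The segments and the
# sequence are made up; in practice the nucleotides come from the CDS sequence
# of the transcript, and the amino acids from translating its codons.
toySeg <- data.frame(start = c(101, 201, 301), end = c(104, 205, 303))
toyCDS <- c("A", "T", "G",  "G", "C", "T",  "A", "A", "A",  "T", "G", "A")
toyAA  <- c("M", "A", "K", "*")
toyXY  <- unlist(mapply(seq, toySeg$start, toySeg$end, SIMPLIFY = FALSE))
toyIdx <- seq_along(toyXY) - 1             # 0-based index along the CDS
toyProt <- data.frame(coord    = toyXY,
                      nuc      = toyCDS,
                      codonPos = (toyIdx %% 3) + 1,
                      aa       = "",
                      iCodon   = (toyIdx %/% 3) + 1,
                      stringsAsFactors = FALSE)
toyProt$aa[toyProt$codonPos == 1] <- toyAA   # label first positions only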


# Task 4.2.3 Create a data frame for GNAS-2 protein mutations, according to
#            the following specifications:
#     -  Call it GNAS2mut
#     -  Get all rows from GNASmutations where the positions fall
#          into the GNAS-2 CDS
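#
# A sketch of the filter, assuming a hypothetical column "pos" that holds
# genomic positions in the same assembly as the CDS coordinates. All values
# are toy values.
toyCoord <- c(101:104, 201:205, 301:303)       # CDS coordinates
toyMuts  <- data.frame(pos = c(99, 102, 203, 250, 303),
                       effect = c("missense_variant", "missense_variant",
                                  "stop_gained", "synonymous_variant",
                                  "synonymous_variant"),
                       stringsAsFactors = FALSE)
toyInCDS <- toyMuts[toyMuts$pos %in% toyCoord, ]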


# =    5  PLOT THE DATA  =======================================================
#
# Time for a Lolliplot

# Task 5.1 What categories of effects do we have?

# Task 5.2 Define colors for the categories. (Hint: pick a palette, e.g. with
#          https://color.adobe.com/ . You are looking for a divergent spectrum
#          that emphasizes similar vs. different effects.)
#
#          e.g. "#D42823AA"      # "frameshift_variant"
#               "#FC7B14AA"      # "missense_variant"
#               "#ED69A7AA"      # "stop_gained"
#               "#CAD1FAAA"      # "synonymous_variant"
#
#          - also define a color for a rectangle that symbolizes the
#            protein:

#          - to work with the effect categories, put them into a
#            data frame: eff$effects - the effects
#                        eff$cols    - the colours
#                        eff$heights - the vertical positions
#          - give the data frame rownames of the effects, so it's easy
#            to fetch data by rowname
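#
# A sketch of such a table, using the example colours listed for this task.
# The heights are arbitrary placeholder values.
toyEff <- data.frame(effects = c("frameshift_variant", "missense_variant",
                                 "stop_gained", "synonymous_variant"),
                     cols    = c("#D42823AA", "#FC7B14AA",
                                 "#ED69A7AA", "#CAD1FAAA"),
                     heights = c(4, 3, 2, 1),
                     stringsAsFactors = FALSE)
rownames(toyEff) <- toyEff$effects
toyEff["stop_gained", "cols"]      # fetch a value by rowname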

# Task 5.3 Compile the mutations by amino acid and mutation type.
#          - define a matrix with rows for each mutated position,
#            columns for each effect category. Give it rownames() of
#            positions, colnames() of effects - so we can easily
#            access data by position and mutation type. Call the matrix
#            mMut
#
#          - iterate over all mutations, find which sequence position it
#            affects with which effect, and increment the value you find
#            in the mMut matrix.
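#
# A sketch of the counting step on toy data (hypothetical columns iCodon and
# effect). Indexing by character names avoids having to translate positions
# into row numbers.
toyHits <- data.frame(iCodon = c(2, 2, 4, 2),
                      effect = c("missense_variant", "missense_variant",
                                 "stop_gained", "synonymous_variant"),
                      stringsAsFactors = FALSE)
toyPos <- as.character(sort(unique(toyHits$iCodon)))
toyCat <- c("missense_variant", "stop_gained", "synonymous_variant")
toyM <- matrix(0, nrow = length(toyPos), ncol = length(toyCat),
               dimnames = list(toyPos, toyCat))
for (i in seq_len(nrow(toyHits))) {
    p <- as.character(toyHits$iCodon[i])     # row name: the position
    e <- toyHits$effect[i]                   # column name: the effect
    toyM[p, e] <- toyM[p, e] + 1
}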


# Task 5.4 Prepare for plotting.
#          - How do we draw circles on the plot?
#          - What size should the circles have?
#          - How do we put graphic elements on a plot in principle?
#              (Hint: draw an empty plot of the correct size, then add
#               lines(), points(), rect(), polygon() or text().
#               Also add an axis(). And a legend(). And a title().)

# Task 5.5 Define a layout - x and y dimensions

# Task 5.6 Plot ...
#          - an empty frame to set up the coordinates ...
#          - draw a rectangle for the protein ...
#          - and an axis at the bottom ...
#          - then plot the mutations for all positions and categories ...
#          - finally, plot a legend
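#
# A sketch of this plotting sequence on toy data. All values, colours and
# dimensions are placeholders, and toyLolli() is a hypothetical helper, not
# the sample solution.
toyLolli <- function(m, cols, heights, protLen) {
    # m: count matrix, rownames = positions, colnames = effect categories
    # cols, heights: vectors named by effect category
    plot(0, 0, type = "n", xlim = c(0, protLen),
         ylim = c(-1, max(heights) + 1), axes = FALSE,
         xlab = "amino acid position", ylab = "", main = "Toy lolliplot")
    rect(1, -0.5, protLen, 0, col = "#DDDDDD")       # the protein
    axis(1)                                          # axis at the bottom
    for (p in rownames(m)) {
        for (e in colnames(m)) {
            if (m[p, e] > 0) {
                x <- as.numeric(p)
                segments(x, 0, x, heights[e])        # the stick
                points(x, heights[e], pch = 21, bg = cols[e],
                       cex = 1 + sqrt(m[p, e]))      # the circle
            }
        }
    }
    legend("topright", legend = colnames(m), pt.bg = cols[colnames(m)],
           pch = 21, cex = 0.7, bty = "n")
    invisible(sum(m > 0))                            # circles drawn
}
toyCounts <- matrix(c(2, 0, 0, 1), nrow = 2,
                    dimnames = list(c("2", "4"), c("missense", "stop")))
toyLolli(toyCounts,
         cols    = c(missense = "#FC7B14AA", stop = "#ED69A7AA"),
         heights = c(missense = 2, stop = 1),
         protLen = 10)
# Scaling the circle with sqrt() of the count is one design choice among
# several; it keeps frequent mutations from covering their neighbours.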

# Done.
#

# =    6  MORE PRACTICE  =======================================================
#
# Is the observed ratio of missense/nonsense/synonymous variants for GNAS
# similar to what one would expect?

#  -  Write a function that executes a loop N times (e.g. N <- 100000) to
#     create a point mutation randomly in the GNAS gene. Keep track of the
#     number of missense, silent ("synonymous"), and nonsense ("truncating")
#     mutations you find. Count changes of the start codon and the stop
#     codon as "nonsense".

# Here is a header that specifies the function, its parameters and its value:

evalMut <- function(FA, N) {
    # Purpose: evaluate the distribution of silent, missense and nonsense
    # codon changes in cDNA read from FA for N random mutation trials.
    # Parameters:
    #     FA   chr      Filename of a FASTA formatted sequence file of cDNA
    #                     beginning with a start codon.
    #     N    integer  The number of point mutation trials to perform
    # Value:   list     List with the following elements:
    #                      FA         chr  the input file name
    #                      N          num  number of trials performed
    #                      nSilent    num  the number of silent mutations
    #                      nMissense  num  the number of missense mutations
    #                      nNonsense  num  the number of nonsense mutations

}
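
# A sketch of the simulation logic, not a finished solution. To stay
# self-contained, the hypothetical helper toyEvalMut() takes the cDNA as a
# string (parameter cds) instead of reading a FASTA file. It assumes the
# standard genetic code, a sequence of A, C, G and T only, and a uniform
# choice of position and replacement base.
toyEvalMut <- function(cds, N) {
    b <- c("T", "C", "A", "G")
    g <- expand.grid(b3 = b, b2 = b, b1 = b, stringsAsFactors = FALSE)
    aa <- strsplit(paste0("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTT",
                          "NNKKSSRRVVVVAAAADDEEGGGG"), "")[[1]]
    names(aa) <- paste0(g$b1, g$b2, g$b3)        # codon -> amino acid
    s <- strsplit(toupper(cds), "")[[1]]
    nCodons <- length(s) %/% 3
    res <- c(nSilent = 0, nMissense = 0, nNonsense = 0)
    for (i in seq_len(N)) {
        pos <- sample(nCodons * 3, 1)            # random nucleotide position
        iC  <- (pos - 1) %/% 3 + 1               # index of its codon
        old <- s[(iC * 3 - 2):(iC * 3)]
        new <- old
        k   <- (pos - 1) %% 3 + 1                # position within the codon
        new[k] <- sample(setdiff(b, old[k]), 1)  # one of the other bases
        aOld <- aa[paste(old, collapse = "")]
        aNew <- aa[paste(new, collapse = "")]
        if (aOld == aNew) {                      # includes stop -> stop
            res["nSilent"] <- res["nSilent"] + 1
        } else if (iC == 1 || aOld == "*" || aNew == "*") {
            res["nNonsense"] <- res["nNonsense"] + 1   # start or stop hit
        } else {
            res["nMissense"] <- res["nMissense"] + 1
        }
    }
    as.list(res)
}
set.seed(112358)
toyEvalMut("ATGGCTAAATGA", N = 300)
# Each trial mutates the original sequence, so trials are independent of
# each other. Whether a stop-to-stop change should count as silent is a
# judgment call that you may want to handle differently.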

#  -  Contrast your findings with the relative frequency of the mutations in
#     each category reported on the IntOGen Web page for GNAS.

#  -  Do you think there is an important difference between the expected
#     categories of mutations (i.e. the stochastic background that you
#     simulated), and categories of mutations that were observed in cancer
#     genomes? How could you quantify that?
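#
# One way to quantify the difference is a chi-squared goodness-of-fit test of
# the observed counts against the simulated background proportions. All
# numbers below are invented placeholders, not GNAS results.
toyExpected <- c(silent = 2300, missense = 7200, nonsense = 500)  # simulated
toyObserved <- c(silent = 10, missense = 85, nonsense = 5)        # invented
toyTest <- chisq.test(toyObserved, p = toyExpected / sum(toyExpected))
toyTest$p.value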


# [END]