# dataIntegration.R
#
# +-----------------------------------------------------------------+
# |                                                                 |
# |  Do not edit this file! Edit "myDataIntegration.R" instead.     |
# |                                                                 |
# +-----------------------------------------------------------------+
#
# Purpose:
#
# Version: 1.1
#
# Date: 2019 05 12
# Author: Boris Steipe ([email protected])
#
# V 1.1 2019 updates
# V 1.0 First code 2018
#
# TODO:
#
#
# == HOW TO WORK WITH THIS FILE ================================================
#
# This file contains scenarios and tasks; we will discuss them in detail in
# class. Edit profusely, write code, experiment with options, or just play.
# Especially play.
#
# If there is anything you don't understand, use R's help system,
# Google for an answer, or ask. Especially ask. Don't continue if you don't
# understand what's going on. That's not how it works ...
#
# ==============================================================================
#TOC> ==========================================================================
#TOC>
#TOC>   Section  Title
#TOC> ---------------------------------------------------------
#TOC>   1        SCENARIO
#TOC>   2        READ DATA
#TOC>   3        EXPLORE DATA
#TOC>   4        INTEGRATE DATA
#TOC>   4.1        BioMart provides integrated data
#TOC>   4.2        Put the data together
#TOC>   5        PLOT THE DATA
#TOC>   6        MORE PRACTICE
#TOC>
#TOC> ==========================================================================
# = 1 SCENARIO ============================================================
# Our data is becoming more and more high-dimensional, as every biomolecule
# has been observed in many variant states, and has been richly annotated.
# In this unit we will retrieve some annotations for a protein encoded in the
# genomic region we worked with in the sequence analysis unit, we will integrate
# genome, transcript, variation and amino acid data, and we will design a
# visualization of the annotated data.
# We wish to create a plot that looks like this:
source("./sampleSolutions/dataIntegrationSampleSolutions-ShowPlot.R")
# This is an amino-acid level plot of cancer-related mutation types and
# frequencies on a gene found on Chromosome 20.
# = 2 READ DATA ===========================================================
# Task 2.1: Open coordinates 58,815,001 to 58,915,000 of the hg38 assembly
# of chromosome 20 in the Ensembl genome browser. What gene is
# annotated to this region?
# Task 2.2: GNAS is a complex locus with multiple transcripts. Download the
# transcript coordinates for protein coding genes. Hint: download
# the data from the corresponding Ensembl gene page.
# - Save the results page as "ENSG00000087460data.csv"
# - Read the file into an R data frame called GNAStranscripts
# Task 2.3 Remove all rows from GNAStranscripts that are not protein coding:
# - what column are we looking at?
# - what values exist in this column?
# - how do we subset the data frame to the values we want?
# - how many transcripts do we have? What are their IDs?
# - restrict the rows to contain only Ensembl transcripts. How many
# transcripts are left?
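# A minimal sketch of the read-and-subset pattern behind Tasks 2.2 and 2.3,
# run on a small inline table rather than the real Ensembl download. The
# column names ("Transcript.ID", "Biotype") and all values are made up for
# illustration; check colnames(GNAStranscripts) for the real ones. To read a
# file, replace the text = ... argument with the file name.
toyCSV <- "Transcript.ID,Name,bp,Biotype
ENST00000000001.1,TOY-201,1900,Protein coding
ENST00000000002.1,TOY-202,850,Retained intron
ENST00000000003.1,TOY-203,2100,Protein coding
OTTHUMT00000000004.1,TOY-204,700,Protein coding"
toyTranscripts <- read.csv(text = toyCSV, stringsAsFactors = FALSE)
table(toyTranscripts$Biotype)                  # what values exist?
toySel <- toyTranscripts$Biotype == "Protein coding"
toyTranscripts <- toyTranscripts[toySel, ]     # keep protein coding rows
toySel <- grepl("^ENST", toyTranscripts$Transcript.ID)
toyTranscripts <- toyTranscripts[toySel, ]     # keep Ensembl transcript IDs
nrow(toyTranscripts)                           # how many are left?
toyTranscripts$Transcript.ID                   # what are their IDs?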
# Task 2.4 Calculate the transcript lengths for all transcripts. Store
# them in a named vector called "tLengths".
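# One possible way to build a named vector like tLengths, sketched on made-up
# values. The column names "ID", "start" and "end" are assumptions. Note that
# end - start + 1 is the genomic span, which includes introns; if your
# download has a column with the spliced length, that may be what you want.
toyT <- data.frame(ID = c("txA", "txB", "txC"),
                   start = c(1001, 1501, 1201),
                   end = c(1900, 2000, 1300),
                   stringsAsFactors = FALSE)
toyLengths <- toyT$end - toyT$start + 1        # vectorized over all rows
names(toyLengths) <- toyT$ID
toyLengths["txB"]                              # fetch a length by its ID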
#
# Task 2.5 Find the GNAS page on the IntOGen cancer driver gene website.
# Explore the page. To download the mutation distribution, you need
# to register (databases need records of who uses them to compete
# for funding). You can register and download, or use the file
# "./data/GNAS-distribution-data.tsv" instead.
# - Read the file into a data frame called "GNASmutations".
# = 3 EXPLORE DATA ========================================================
#
# Task 3.1 View GNASmutations. What do you see?
# - How many observations of each transcript?
# - Are there transcripts that are not in our Ensembl table?
# (hint: use the %in% operator)
# - How many of each mutation type? Plot that!
#
# - Exclude the splice region variants, since their effect
# is not predictable.
#
# - Are the reference nucleotides correct for our GRCh38 data?
# If not: how do we fix that?
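# A sketch of some of the exploration steps in Task 3.1, on made-up data. The
# column names "TRANSCRIPT" and "MOST_SEVERE" are placeholders and need not
# match the IntOGen file; toyKnown stands in for the IDs in our Ensembl table.
toyMut <- data.frame(TRANSCRIPT = c("txA", "txA", "txB", "txZ", "txA"),
                     MOST_SEVERE = c("missense_variant", "stop_gained",
                                     "missense_variant",
                                     "splice_region_variant",
                                     "synonymous_variant"),
                     stringsAsFactors = FALSE)
toyKnown <- c("txA", "txB", "txC")
table(toyMut$TRANSCRIPT)                       # observations per transcript
unique(toyMut$TRANSCRIPT[! toyMut$TRANSCRIPT %in% toyKnown])  # not in table
barplot(table(toyMut$MOST_SEVERE), las = 2, cex.names = 0.6)  # counts by type
toyMut <- toyMut[toyMut$MOST_SEVERE != "splice_region_variant", ]
nrow(toyMut)                                   # rows left after the filter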
# = 4 INTEGRATE DATA =====================================================
# The resulting data is all over the place. We have a table with transcript
# annotations and a derived vector of lengths; some of our coordinates are
# from GRCh37, some from GRCh38. Integration is possible - but probably messy.
# We need to discuss first what a "proper" data model looks like in principle,
# then we'll explore BioMart, a versatile integration solution.
#
# Your ./assets folder contains a file: FND-CSC-Data_models.pdf ...
#
# Now: what do we need to integrate for our plot?
# - we need genomic coordinates, because that's what our sequencing
# experiments and variant calling return;
# - we need the coding sequence;
# - we need the codon positions/translation;
# - we need the mutations that are mapped to the sequence of interest.
#
# == 4.1 BioMart provides integrated data ==================================
#
# Navigate to http://www.ensembl.org/. Click on BioMart. Getting data
# from BioMart involves four steps:
# - Choose the Database: here - choose Ensembl Genes 92
# - Choose the Dataset: here - choose Human Genes (GRCh38 p.13)
# - Choose Filters: explore what's available. The Gene ID for GNAS is
# ENSG00000087460. Set this as the filter.
#
# - Choose Attributes: explore what's available. Most importantly, we need a
# gene model. (Actually I haven't found a downloadable
# gene model for human genes anywhere else. Or do you
# know of a source?) How do we get a gene model from
# BioMart?
#
# Once you have selected what you need - or just to explore what you selected,
# as a preview - click "Results". Finally select ...
# "Export all results to" ... "File" "TSV" , and "Go". Inspect the resulting
# file. But hold on ... are these the coordinates we need?
#
# Task 4.1.1 Save the correct gene model coordinates as GNASgeneModels.37.tsv
# in your project folder.
#
# - Read the data into a data frame, call it GNASmodels
# == 4.2 Put the data together =============================================
#
# Task 4.2.1 Create a data frame for a GNAS-2 gene model according to the
# following specifications:
#
# - Choose data for ENST00000371095 (codes for ENSP00000360136 / NP_536351
# / GNASS / GNAS-2 / isoform of P63092)
# - call the data frame GNAS2model and store columns "start" and "end" for
# each CDS segment
# - Make sure the segments are in the correct order.
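# A sketch of the idea behind Task 4.2.1 on a made-up gene model. The column
# names below are assumptions about what a BioMart export might look like
# after reading it into R, and exons without coding sequence are assumed to
# show up as NA. Ascending order is only correct for a forward-strand gene.
toyModels <- data.frame(
  Transcript.stable.ID = c("txA", "txA", "txB", "txA", "txA"),
  Genomic.coding.start = c(501, 101, 101, NA, 301),
  Genomic.coding.end   = c(560, 130, 190, NA, 390),
  stringsAsFactors = FALSE)
toySel <- toyModels$Transcript.stable.ID == "txA" &
          ! is.na(toyModels$Genomic.coding.start)
toyModel <- data.frame(start = toyModels$Genomic.coding.start[toySel],
                       end   = toyModels$Genomic.coding.end[toySel])
toyModel <- toyModel[order(toyModel$start), ]  # 5' to 3' on forward strand
rownames(toyModel) <- NULL
toyModel
sum(toyModel$end - toyModel$start + 1) %% 3 == 0   # sanity: whole codons?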
# Task 4.2.2 Create a data frame for GNAS-2 protein annotations, according to
# the following specifications:
#
# - Call it GNAS2protein
# - It should have one row for each nucleotide in the CDS
# - Give it the following columns:
# GNAS2protein$coord - the genomic coordinates
# GNAS2protein$nuc - the actual nucleotide
# GNAS2protein$codonPos - 1,2 or 3: the codon position
# GNAS2protein$aa - The amino acid (in codon position 1 only)
# GNAS2protein$iCodon - The codon index (in all three positions)
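# A sketch of a one-row-per-nucleotide table as described above, for a
# made-up CDS of three codons split over two forward-strand segments. The
# sequence, the coordinates and all toy* names are invented; toyCode is the
# standard genetic code written out in base R, in T, C, A, G order.
toyNuc <- c("T", "C", "A", "G")
toyCode <- strsplit(paste0("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR",
                           "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"), "")[[1]]
names(toyCode) <- paste0(rep(toyNuc, each = 16),
                         rep(rep(toyNuc, each = 4), times = 4),
                         rep(toyNuc, times = 16))
toySeg <- data.frame(start = c(101, 201), end = c(104, 205))   # 4 + 5 nt
toyCDS <- "ATGGCTTGA"                                          # M, A, stop
toyCoord <- unlist(mapply(seq, toySeg$start, toySeg$end, SIMPLIFY = FALSE))
toyN <- length(toyCoord)
toyProtein <- data.frame(coord = toyCoord,
                         nuc = strsplit(toyCDS, "")[[1]],
                         codonPos = rep(1:3, length.out = toyN),
                         aa = "",
                         iCodon = rep(seq_len(toyN / 3), each = 3),
                         stringsAsFactors = FALSE)
toyFirst <- which(toyProtein$codonPos == 1)    # rows that start a codon
toyTriplets <- paste0(toyProtein$nuc[toyFirst],
                      toyProtein$nuc[toyFirst + 1],
                      toyProtein$nuc[toyFirst + 2])
toyProtein$aa[toyFirst] <- unname(toyCode[toyTriplets])
toyProtein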
# Task 4.2.3 Create a data frame for GNAS-2 protein mutations, according to
# the following specifications:
# - Call it GNAS2mut
# - Get all rows from GNASmutations where the positions fall
# into the GNAS-2 CDS
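# A sketch of selecting mutations that fall into a CDS, on made-up values.
# "POS" and "EFFECT" are placeholder column names. The comparison only makes
# sense if both sets of coordinates come from the same genome assembly.
toyCDScoord <- c(101:104, 201:205)             # coordinate of each CDS nt
toyHits <- data.frame(POS = c(99, 102, 150, 203, 203),
                      EFFECT = c("missense_variant", "missense_variant",
                                 "synonymous_variant", "stop_gained",
                                 "missense_variant"),
                      stringsAsFactors = FALSE)
toyHits <- toyHits[toyHits$POS %in% toyCDScoord, ]   # drop non-CDS rows
# match() gives the index of each position within the CDS; integer division
# by 3 turns that into the index of the affected codon.
toyHits$iCodon <- (match(toyHits$POS, toyCDScoord) - 1) %/% 3 + 1
toyHits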
# = 5 PLOT THE DATA =======================================================
#
# Time for a Lolliplot
# Task 5.1 What categories of effects do we have?
# Task 5.2 Define colors for the categories. (Hint: pick a palette, e.g. at
# https://color.adobe.com/ , looking for a divergent spectrum
# that emphasizes similar vs. different effects.)
#
# e.g. "#D42823AA" # "frameshift_variant"
# "#FC7B14AA" # "missense_variant"
# "#ED69A7AA" # "stop_gained"
# "#CAD1FAAA" # "synonymous_variant"
#
# - also define a color for a rectangle that symbolizes the
# protein.
# - to work with the effect categories, put them into a
# data frame: eff$effects - the effects
# eff$cols - the colours
# eff$heights - the vertical positions
# - give the data frame rownames of the effects, so it's easy
# to fetch data by rowname
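# A sketch of such a lookup table, using the example colours listed above.
# The heights are arbitrary values chosen for illustration.
toyEff <- data.frame(effects = c("frameshift_variant", "missense_variant",
                                 "stop_gained", "synonymous_variant"),
                     cols = c("#D42823AA", "#FC7B14AA",
                              "#ED69A7AA", "#CAD1FAAA"),
                     heights = c(4, 3, 2, 1),
                     stringsAsFactors = FALSE)
rownames(toyEff) <- toyEff$effects
toyEff["missense_variant", "cols"]             # fetch a value by rowname
toyEff["stop_gained", "heights"]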
# Task 5.3 Compile the mutations by amino acid and mutation type.
# - define a matrix with rows for each mutated position,
# columns for each effect category. Give it rownames() of
# positions, colnames() of effects - so we can easily
# access data by position and mutation type. Call the matrix
# mMut
#
# - iterate over all mutations, find which sequence position it
# affects with which effect, and increment the value you find
# in the mMut matrix.
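# A sketch of the counting pattern on made-up data: five mutations that hit
# three codon positions. All toy* names and values are invented.
toyPos  <- c(1, 3, 3, 3, 7)
toyType <- c("missense_variant", "missense_variant", "missense_variant",
             "stop_gained", "synonymous_variant")
toyEffNames <- c("frameshift_variant", "missense_variant",
                 "stop_gained", "synonymous_variant")
toyRows <- as.character(sort(unique(toyPos)))
toyM <- matrix(0, nrow = length(toyRows), ncol = length(toyEffNames),
               dimnames = list(toyRows, toyEffNames))
for (i in seq_along(toyPos)) {
  r <- as.character(toyPos[i])                 # row name, not row index
  toyM[r, toyType[i]] <- toyM[r, toyType[i]] + 1
}
toyM
# table(toyPos, toyType) produces similar counts in one step, but has
# columns only for the effect types that actually occur in the data.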
# Task 5.4 Prepare for plotting.
# - How do we draw circles on the plot?
# - What size should the circles have?
# - How do we put graphic elements on a plot in principle?
# (Hint: draw an empty plot of the correct size, then add
# lines(), points(), rect(), polygon() or text().
# Also add an axis(). And a legend(). And a title().)
# Task 5.5 Define a layout - x and y dimensions
# Task 5.6 Plot ...
# - an empty frame to setup the coordinates...
# - draw a rectangle for the protein ...
# - and an axis at the bottom ...
# - then plot the mutations for all positions and categories ...
# - finally, plot a legend
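# A sketch of the plotting steps above with base graphics, on a made-up count
# matrix. Layout, colours and the size scaling are arbitrary choices for
# illustration, not the layout of the sample solution.
toyCounts <- matrix(c(1, 2, 0,
                      0, 1, 1), nrow = 3,
                    dimnames = list(c("10", "45", "80"),
                                    c("missense_variant", "stop_gained")))
toyCols <- c(missense_variant = "#FC7B14AA", stop_gained = "#ED69A7AA")
toyY    <- c(missense_variant = 2, stop_gained = 3)   # vertical positions
toyLen  <- 100                                 # protein length (made up)
plot(0, 0, type = "n", xlim = c(0, toyLen), ylim = c(0, 4), axes = FALSE,
     xlab = "amino acid position", ylab = "", main = "Toy lolliplot")
rect(1, 0.8, toyLen, 1.2, col = "#DDDDDD", border = NA)   # the protein
axis(1)                                        # axis at the bottom
for (eff in colnames(toyCounts)) {
  for (pos in rownames(toyCounts)) {
    n <- toyCounts[pos, eff]
    if (n > 0) {
      x <- as.numeric(pos)
      segments(x, 1.2, x, toyY[eff], col = "#888888")     # the stick
      points(x, toyY[eff], pch = 21, bg = toyCols[eff], col = "#555555",
             cex = 1 + sqrt(n))                # larger symbol for more hits
    }
  }
}
legend("topright", legend = names(toyCols), pt.bg = toyCols, pch = 21,
       bty = "n")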
# Done.
#
# = 6 MORE PRACTICE =======================================================
#
# Is the observed ratio of missense/nonsense/synonymous variants for GNAS
# similar to what one would expect?
# - Write a function that executes a loop N times (for N <- 100000) to create
# a point mutation randomly in the GNAS gene. Keep track of the
# number of missense, silent ("synonymous"), and nonsense ("truncating")
# mutations you find. Count changes of the start codon and the stop
# codon as "nonsense".
# Here is a header that specifies the function, its parameters and its value:
evalMut <- function(FA, N) {
  # Purpose: evaluate the distribution of silent, missense and nonsense
  #          codon changes in cDNA read from FA for N random mutation trials.
  # Parameters:
  #   FA  chr      Filename of a FASTA formatted sequence file of cDNA
  #                beginning with a start codon.
  #   N   integer  The number of point mutation trials to perform
  # Value:  list   List with the following elements:
  #   FA         chr  the input file name
  #   N          num  number of trials performed
  #   nSilent    num  the number of silent mutations
  #   nMissense  num  the number of missense mutations
  #   nNonsense  num  the number of nonsense mutations
}
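# A simplified sketch of the simulation idea, not a solution for evalMut():
# the hypothetical toyEvalMut() takes the sequence as a string instead of a
# FASTA file name, assumes its length is a multiple of three, and treats all
# substitutions as equally likely. The toy CDS below is made up.
toyNuc <- c("T", "C", "A", "G")
toyCode <- strsplit(paste0("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR",
                           "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"), "")[[1]]
names(toyCode) <- paste0(rep(toyNuc, each = 16),
                         rep(rep(toyNuc, each = 4), times = 4),
                         rep(toyNuc, times = 16))
toyEvalMut <- function(cds, N, code) {
  s <- strsplit(cds, "")[[1]]
  n <- c(nSilent = 0, nMissense = 0, nNonsense = 0)
  for (i in seq_len(N)) {
    pos <- sample(length(s), 1)                # position to mutate
    iCodon <- (pos - 1) %/% 3 + 1              # codon that contains it
    idx <- (3 * iCodon - 2):(3 * iCodon)       # the codon's three positions
    codOld <- s[idx]
    codNew <- codOld
    k <- pos - idx[1] + 1                      # position within the codon
    codNew[k] <- sample(setdiff(c("A", "C", "G", "T"), codOld[k]), 1)
    aaOld <- code[[paste(codOld, collapse = "")]]
    aaNew <- code[[paste(codNew, collapse = "")]]
    if (aaNew == aaOld) {
      n["nSilent"] <- n["nSilent"] + 1
    } else if (iCodon == 1 || aaOld == "*" || aaNew == "*") {
      n["nNonsense"] <- n["nNonsense"] + 1     # start lost, stop lost or gained
    } else {
      n["nMissense"] <- n["nMissense"] + 1
    }
  }
  return(c(list(N = N), as.list(n)))
}
set.seed(112358)                               # make the trials reproducible
toyRes <- toyEvalMut("ATGGCTAAAGGTTGA", 1000, toyCode)
unlist(toyRes)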
# - Contrast your findings with the relative frequency of the mutations in
# each category reported on the IntOGen Web page for GNAS.
# - Do you think there is an important difference between the expected
# categories of mutations (i.e. the stochastic background that you
# simulated), and categories of mutations that were observed in cancer
# genomes? How could you quantify that?
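# One option among several for quantifying such a difference is a chi-squared
# goodness-of-fit test of the observed counts against the simulated
# proportions. A sketch with made-up numbers for both:
toySim <- c(silent = 2300, missense = 7200, nonsense = 500)   # "simulated"
toySeen <- c(silent = 24, missense = 160, nonsense = 16)      # "observed"
toyTest <- chisq.test(x = toySeen, p = toySim / sum(toySim))
toyTest$expected                               # counts expected by chance
toyTest$p.value                                # small: distributions differ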
# [END]