% Marieke Kuijjer
% 2013-02-15
% chapter 01
%\documentclass[12pt,b5paper]{book}
%\setcounter{secnumdepth}{0}
%\setcounter{tocdepth}{1}
%\usepackage[hidelinks]{hyperref}
% \pagestyle{fancy}
% \renewcommand{\chaptermark}[1]{\markboth{#1}{}}
% \renewcommand{\sectionmark}[1]{\markright{\thesection\ #1}} \fancyhf{}
% \fancyhead[LE,RO]{\bfseries\thepage}
% \fancyhead[LO]{\bfseries\ Chapter 1}
% \fancyhead[RE]{\bfseries\ Chapter 1}
% \renewcommand{\headrulewidth}{0.5pt}
% \renewcommand{\footrulewidth}{0pt}
% \addtolength{\headheight}{0.5pt}
%\begin{document}
%%% title page
\chapter{General introduction}\label{ch1}
\thispagestyle{empty} %%% to remove page number from first page of chapter, must be placed after calling the chapter
%\vfill
\newpage
%%% main document
%
\section{Cancer genomics}\label{cancer1}
Cancer develops through the acquisition of genomic alterations, {\it i.e.} changes in the DNA sequence and chromosomal numerical content of a cell, such as point mutations, insertions, deletions, amplifications, and translocations. Such alterations may alter protein expression and/or function of oncogenes---genes promoting cancer---and tumor suppressors---genes protecting from cancer. Other factors, caused by mechanisms which do not change the underlying DNA sequence ({\it e.g.} DNA methylation and histone modification), may also alter the expression of genes, and may thereby play a role in cancer as well. As different biological processes, the so-called hallmarks of cancer~\cite{hanahan2000hallmarks,hanahan2011hallmarks}, need to be deregulated before a cancer can develop, a combination of events is needed to change a normal cell into a cancerous cell. Cancer is thought to arise through the stepwise acquisition of such events~\cite{fearon1990genetic}, although it has become clear that different alterations may be caused by a single event~\cite{stephens2011massive,yates2012evolution}, and particularly oncogene\hyp{}activating translocations seem to be sufficient for oncogenesis in some types of leukemias, lymphomas, and sarcomas~\cite{knudson2004human}.
In cancer genomics, germline and somatic aberrations, {\it i.e.} aberrations present in the germline of the patient and acquired aberrations, are studied in order to identify genes and biological processes which are important in the development and progression of cancer. Determining aberrations that are crucial for a cancer cell to survive, identifying defective tumor suppressors, and identifying biological processes which facilitate tumor progression are tremendously important for diagnostics and prognostics, and for the identification of targeted treatments. In the late 1990s, high\hyp{}throughput methods that can be utilized in studying cancer genomics, so-called microarrays, were developed. In this thesis, we have used these high\hyp{}throughput techniques in order to study high\hyp{}grade osteosarcoma genomics, aiming to learn more about osteosarcoma biology, and to identify possible targets for treatment.
%
\section{High-grade osteosarcoma}\label{high1}
High-grade osteosarcoma is a primary malignant tumor consisting of mesenchymal tumor cells producing osteoid. The tumor is rare, with an approximate incidence of 5--6 patients in a population of one million per year. The incidence is higher in adolescents and young adults, and shows a second peak at middle age~\cite{mirabello2009osteosarcoma}. Osteosarcoma developing later in adult life is thought to be partially secondary, and may be caused by previous treatment with radiation or by an underlying Paget's disease of bone. Males are more often affected by osteosarcoma than females (with a ratio of 3:2). High\hyp{}grade osteosarcoma most frequently develops in the long bones of patients, with the metaphysis as the most frequent (91\%), and the diaphysis as the second most frequent site ($<9\%$). Most often, the tumor develops in the region around the knee (distal femur and proximal tibia), followed by the proximal humerus~\cite{raymond2002conventional}. Osteosarcoma is rarely seen in the axial bones of the patient. The incidence pattern of osteosarcoma suggests a link between the development of the disease and growth~\cite{mirabello2011height} (this will be further discussed in Chapter~\ref{ch5}).
High-grade osteosarcoma is a very aggressive tumor. Patients are usually treated with several series of neoadjuvant chemotherapy consisting of a combination of different chemotherapeutic drugs, especially cisplatin, doxorubicin, and high\hyp{}dose methotrexate~\cite{raymond2002conventional}. The tumor is then removed by limb\hyp{}salvage surgery, although sometimes amputation is needed. Afterwards, a second series of adjuvant therapy is given to the patient. Despite this intensive treatment schedule, a significant number of patients die due to the development of distant metastases, which are most often pulmonary. The tumor metastasizes in approximately 45\% of all patients~\cite{pakos2009prognostic}. Overall survival of patients with resectable metastatic disease is roughly 20\%~\cite{buddingh2010prognostic}. Neoadjuvant treatment was introduced in the 1970s, and improved overall survival from 10--20\% to approximately 60\%. However, except for macrophage\hyp{}activating and recruiting agents, such as L-MTP-PE (discussed in Chapter~\ref{ch4} of this thesis), no new treatment options have been developed that can raise survival significantly. The many caveats and challenges hampering osteosarcoma research, which might explain why osteosarcoma patients still have no other treatment options, are discussed in Chapter~\ref{ch2}.
Known genes involved in osteosarcomagenesis have essential roles in cell cycle progression~\cite{cleton2005central}. The tumor suppressor {\it TP53}, which can induce cell cycle arrest or apoptosis in response to cellular stress, such as DNA damage, is mutated in approximately 20\% of high\hyp{}grade osteosarcomas and is also often located in regions of copy number loss. {\it MDM2}, which targets the p53 protein for degradation, is amplified in 6--15\% of the tumors. {\it TP53} and {\it MDM2} aberrations have been described to be mutually exclusive~\cite{overholtzer2003presence}, although in our dataset, one sample (the osteosarcoma cell line HAL) had copy number loss of {\it TP53} and gain of {\it MDM2}. Inactivating somatic mutations of {\it RB1}, a negative regulator of the cell cycle, are also often found in osteosarcoma, and this gene is present in regions of copy number loss in over 60\% of osteosarcomas~\cite{thomas2003role,kuijjer2012identification}. Other players of the Rb pathway have been described in osteosarcoma as well, for instance {\it CDKN2A} deletions, which present homozygously and occur in approximately 25\% of all patients~\cite{mohseny2010small}. {\it TP53} and {\it RB1} mutations are not always somatic---a small percentage of osteosarcoma is hereditary, with mutations present in the germline of patients. The associated hereditary syndromes, Li\hyp{}Fraumeni and Retinoblastoma for mutations in {\it TP53} and {\it RB1}, respectively, give a strong predisposition to develop osteosarcoma. A third hereditary syndrome that is thought to predispose to osteosarcoma is Rothmund\hyp{}Thomson syndrome, where {\it RECQL4}, a gene encoding a DNA helicase, is mutated~\cite{calvert2012risk}; however, in contrast to {\it TP53} and {\it RB1}, the gene is not a frequent target for sporadic mutations in osteosarcoma~\cite{nishijo2004mutation}.
%
\section{The EuroBoNeT high-grade osteosarcoma database}\label{eurobonet1}
The aim of this thesis was to study osteosarcomagenesis by bioinformatics analysis of a high\hyp{}throughput dataset consisting of microarray data from high\hyp{}grade osteosarcoma specimens. A relatively large cohort of na\"{\i}ve, preoperative diagnostic osteosarcoma biopsies was collected as a collaborative effort by EuroBoNeT, a European Network of Excellence for studying primary bone tumors. This clinically well defined cohort consisted of samples from 84 patients. For most of these patients, clinical data were available on patient sex, age at diagnosis of the primary tumor (in months), tumor location, histological subtype of the tumor, and response to neoadjuvant chemotherapy (Huvos grade)~\cite{huvos1991bone}. Follow\hyp{}up data (metastasis\hyp{}free survival and overall survival, measured in months from diagnosis) was available for 83/84 patients. Clinical characteristics of this cohort can be found in Table~\ref{tab7.1}. In addition to the clinical samples, we used data from two osteosarcoma model systems---osteosarcoma cell lines (characterized and published by Ottaviano {\it et al}.~\cite{ottaviano2010molecular}) and xenografts~\cite{mayordomo2010tissue}, see Table~\ref{tab3.1} for clinical characteristics of the original tumors of these model systems. The entire osteosarcoma database consisted of data obtained from three different microarray platforms---genome\hyp{}wide gene expression data, data obtained with a kinome screen, and Single Nucleotide Polymorphism (SNP) microarrays. Table~\ref{tab1.1} illustrates the different data types, numbers of osteosarcoma and control samples, and the different comparative analyses which are described in this thesis. Raw and processed data are deposited in online databases~\cite{edgar2002gene,r2microarray}.
%
%%% table tab1.1
\newcolumntype{x}[1]{>{\raggedright\arraybackslash}p{#1}}
\begin{table}[htbp]
\centering
\small
\begin{tabular}[c]{|ll >{\raggedright}p{1.2in} >{\raggedright}p{1.55in}|}
\hline
Data type & mRNA & Kinome & SNP\tabularnewline
\hline
Company & Illumina & PamGene & Affymetrix\tabularnewline
Array & Human-6 v2.0 & Ser/Thr kinase PamChip & Genome\hyp{}wide Human SNP Array 6.0\tabularnewline
Software & Bioconductor & BioNavigator, & Genotyping Console,\tabularnewline
& & Bioconductor & Nexus Copy Number\tabularnewline
OS samples & 84 diagnostic biopsies, & 2 cell lines & 32 diagnostic biopsies,\tabularnewline
& 19 cell lines, & & 12 cell lines\tabularnewline
& 18 xenografts & &\tabularnewline
Control samples & 12 MSC cultures, & 12 MSC cultures & 27 normal samples\tabularnewline
& 3 osteoblast cultures & & \tabularnewline
Analysis methods & {\it LIMMA}, {\it pamr} & {\it LIMMA} & Cut-off for aberration frequency\tabularnewline
Comparative & Clinical parameters, & Tumor {\it vs} controls & Clinical parameters,\tabularnewline
analyses & Tumor {\it vs} controls & & Tumor {\it vs} controls\tabularnewline
\hline
\end{tabular}
\caption{Layout of the high-grade osteosarcoma database. MSC: mesenchymal stem cell.}
\label{tab1.1}
\end{table}
%
%
\section{High\hyp{}throughput platforms to study osteosarcoma}\label{platforms1}
Genome\hyp{}wide gene/mRNA expression profiling can be performed using RNA isolated from a sample, such as tumor tissue or a cell culture. Generally, cDNA or cRNA is prepared from the RNA and is labeled with a fluorescent dye. This is then hybridized to a microarray chip containing oligonucleotide probes, which are short sequences of DNA, complementary to most or all specific transcripts, capable of binding the labeled cDNA/cRNA. For measuring genome\hyp{}wide gene expression, single- and dual-channel microarrays are available. With dual-channel microarrays, samples, {\it e.g.} paired tumor samples and normal tissues, can be directly compared on one chip, by labeling the cDNA/cRNA with two different fluorescent dyes. For the research described in this thesis (Chapters~\ref{ch3}--\ref{ch8}), single-channel microarrays were used, which means that control samples were hybridized on different chips. We used Illumina Human-6 v2.0 BeadChips (Illumina, San Diego, CA). These microarrays contain over $48,000$ probes, approximately half of which correspond to well\hyp{}annotated Reference Sequence (RefSeq) genes~\cite{pruitt2002reference}. Illumina BeadChips have a special structure: probes are present on beads, which are randomly arranged on the chip. Every bead type is replicated on each chip approximately 35--40 times on average~\cite{oliphant2002beadarray,barbosa2010re} (see Figure~\ref{fig1.1}A).
%
\begin{figure}[htbp]
\centering
\includegraphics[width=1.0\textwidth]{figs01/fig1bw.pdf} % pdf version also bw
\caption{Schematic overview of {\it A}, the Illumina BeadChip and {\it B}, the Affymetrix SNP 6.0 array. Figure adapted from Hup\'e, P., \url{http://commons.wikimedia.org}.}
\label{fig1.1}
\end{figure}
%
Both the random position and the large number of replicated beads make robust measurements possible~\cite{dunning2007beadarray}. The software designed by Illumina for data analysis, BeadStudio, does not take advantage of the large number of replicated beads present on these chips. Therefore, various methods have been specifically developed for analyzing Illumina BeadChips, such as the Bioconductor~\cite{gentleman2004bioconductor} packages {\it beadarray}~\cite{dunning2007beadarray}, {\it beadarraySNP}~\cite{oosting2010beadarraysnp} (specifically for Illumina SNP data), and {\it lumi}~\cite{du2008lumi}, which will be described in the next section.
Peptide microarrays can be used for studying kinase activity in a sample. For the research performed in Chapter~\ref{ch6} of this thesis, we used PamGene\textregistered~serine/threonine (Ser/Thr) PamChips (PamGene, 's-Hertogenbosch, the Netherlands). These chips consist of porous membranes, which contain 142 different peptides derived from phosphorylation sites for Ser/Thr kinases of the human proteome. Cell or tissue lysates are supplemented with ATP and subsequently pumped through these membranes, so that kinases in the lysates have access to, and can phosphorylate the peptides on the chip. Phosphorylation is measured over a time span of 30 to 60 minutes by the detection of light emitted by fluorescently\hyp{}labeled, phospho\hyp{}specific antibodies. Figure~\ref{fig1.2} gives an overview of the experimental workflow of PamGene.
%
\begin{figure}[htbp]
\centering
\begin{minipage}[b]{0.50\linewidth}
\includegraphics[height=1\textheight]{figs01/fig2bw.pdf} % OBS! print version bw
% \includegraphics[height=1\textheight]{figs01/fig2rgb.pdf} % OBS! pdf version rgb
\end{minipage}
\hfill
\begin{minipage}[b]{0.46\linewidth}
\caption{Peptides can serve as substrates for kinases present in the sample. Phosphorylation is detected by fluorescently labeled phospho\hyp{}specific antibodies ({\it A}). The microarrays consist of a porous ceramic membrane ({\it B}), on which 142 different peptide substrates are present ({\it C}). Four arrays are combined into one chip ({\it D}). The phosphorylation reaction occurs by an up and down movement of the sample solution through the array, giving the kinases maximal opportunity to phosphorylate the peptides on each array ({\it E}). When the solution is underneath the array, the CCD camera in the workstation takes an image of each array, which is later used by the software to generate kinetic data curves ({\it F}). The incubation, washing, dispensing of reagents, and imaging of the arrays are done in fully automated workstations ({\it G}). Figure adapted from PamGene\textregistered.}
\label{fig1.2}
\end{minipage}
\end{figure}
%
Single Nucleotide Polymorphisms (SNPs) are genetic changes or variations of a single base pair, which occur in at least 1\% of the population~\cite{gibbs2003international}. SNP microarrays contain so-called allele\hyp{}specific oligonucleotide probes (Figure~\ref{fig1.1}B), which are used to discriminate between specific SNPs in the sample, because of the different binding properties of the sample DNA, which is again labeled with a fluorescent dye. SNP microarrays can be employed to genotype a sample, which is used to identify small variations between genomes (to determine {\it e.g.} disease susceptibility), but can also be utilized to infer copy number aberrations and allelic states of regions in the genome. The SNP microarrays used in this thesis (Chapters~\ref{ch7}--\ref{ch8}) are Affymetrix Genome\hyp{}Wide Human SNP Array 6.0 chips (Affymetrix, Santa Clara, CA). These high\hyp{}density chips contain over $900,000$ SNPs and over $900,000$ probes for the detection of copy number variation.
%
\section{Microarray data preprocessing}\label{preprocessing1}
The three different platforms described above have in common that, after hybridization of DNA/cDNA/cRNA to the chip, or after phosphorylation of peptides on the microarray, a fluorescent signal is emitted, which is measured by a scanner. The image files that are returned by the scanner can be utilized for deducing intensity signals and the location of the specific spots/beads. This is usually performed directly by the software provided by the company which distributes the arrays, which generally overlays a grid and returns median intensity signals for each spot/bead. Alternatively, raw image files can be analyzed (for example using {\it beadarray}~\cite{dunning2007beadarray}), thereby allowing additional methods of data processing. In the following paragraphs, we will discuss data preprocessing and subsequent analysis of data generated with the microarray chips described above.
Preprocessing of microarray data is performed in order to correct for experimental bias and to improve the signal-to-noise ratio. Numerous methods of microarray data preprocessing exist, and specific methods may differ per data type and platform. Preprocessing of microarray data can be performed using the software provided by the company that produced the arrays, or with open source programs, such as the statistical software R~\cite{r2.15.0}, for which several packages have been made available in the Bioconductor~\cite{gentleman2004bioconductor} framework to specifically analyze the raw data of various microarray platforms.
An optional start of preprocessing the raw data is a global or local background subtraction step. This can eliminate signals due to nonspecific binding, thereby reducing noise in the data. However, when applying this step, probes of low signal will be discarded, resulting in missing values. Some researchers convert these missing values into zero expression. Illumina's scanner software, BeadScan, automatically subtracts local background measures from the foreground intensities to generate bead level text files---files including intensities and location information obtained from the original .tiff files produced by the scanner software. These {\it bead level files} can be used for downstream data analysis. The standard local background subtraction method provided by Illumina results in a very low estimate of the background, which is thought to be mostly related to the optical properties of the array surface~\cite{dunning2008statistical}. Additional background subtracting methods can be applied, such as {\it background normalization} in BeadStudio, which subtracts the mean intensity of negative control beads from the foreground intensities. This method increases variability, and also introduces a significant number of negative values~\cite{dunning2008statistical}. Especially for small sample sizes it is crucial to achieve a homogeneous variance, and thus, as background subtraction introduces additional variation in the data, this may not be beneficial for the detection of differences between two or more groups~\cite{schmid2010comparison}. Apart from the local background subtraction by BeadScan (for mRNA expression data), we did not use other background subtraction methods in the preprocessing of our microarray data.
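As a rough illustration of why an additional background subtraction can inflate variability on the log scale, consider a first-order (delta method) approximation. The symbols $\mu$, $\sigma$, and $b$ and the numbers used here are illustrative assumptions, not values taken from our data. For an intensity $X$ with mean $\mu$ and standard deviation $\sigma$,
\[
\mathrm{Var}(\log_2 X) \approx \left(\frac{\sigma}{\mu \ln 2}\right)^{2} .
\]
Subtracting a background estimate $b$ (treated here as a fixed constant) lowers the mean to $\mu - b$ but leaves $\sigma$ unchanged, so the approximate log-scale variance grows by a factor of $\left(\mu/(\mu-b)\right)^{2}$. For example, with $\mu = 120$ and $b = 100$ this factor is $(120/20)^{2} = 36$, whereas for a bright probe with $\mu = 10{,}000$ it is only about $1.02$. Any bead with a foreground intensity below $b$ becomes negative and cannot be log transformed at all.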
Data transformation is necessary because of the complicated error structure of microarray data, which is intensity\hyp{}dependent and nonlinear~\cite{durbin2002variance}. Often, a simple log transformation is used, but other methods exist that are milder in transforming signals near background, which are inflated by standard log transformations. Examples of such methods are variance stabilizing normalization ({\it vsn})~\cite{huber2002variance}, which both transforms the data and performs normalization of the data between the different arrays, and variance stabilizing transformation ({\it vst})~\cite{lin2008model}, a method similar to {\it vsn}, specifically developed for preprocessing Illumina BeadChips. {\it vst} has been shown to be advantageous over log transformation when large changes in expression are expected~\cite{dunning2008spike,du2010evaluation}. Normalization of the data is applied to reduce bias that may arise due to differences in sample preparation, and production (batch effects) and processing of the arrays. Various normalization methods exist, of which complete data methods, such as quantile normalization, are preferred over methods that use a baseline array in order to normalize the data~\cite{bolstad2003comparison}. We used {\it vst} and robust spline normalization ({\it rsn}), a normalization method specifically designed to normalize variance stabilization transformed data, on mRNA expression data (Chapters~\ref{ch3}--\ref{ch8}). Transformation and normalization of peptide chips (Chapter~\ref{ch5}) was performed using {\it vsn}, while SNP microarray data were $\log_2$ transformed and quantile normalized. SNP microarray data (Chapters~\ref{ch7}--\ref{ch8}) were further corrected for the guanine\hyp{}cytosine (GC) content, as different percentages in GC content can cause waviness in the $\log_2$ ratio data, which can increase false positive and false negative segment calls.
We used the Regional GC correction algorithm in Genotyping Console to correct for this waviness~\cite{gcwaviness}.
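To make the idea of quantile normalization mentioned above concrete, the following toy example (three probes on two arrays, with made-up intensities and no ties) sketches the usual procedure: sort each array, average the values that share a rank across arrays, and assign these rank means back to the probes according to their original ranks.
\[
\begin{array}{c|cc}
 & \mathrm{array\ 1} & \mathrm{array\ 2} \\ \hline
\mathrm{probe\ 1} & 5 & 4 \\
\mathrm{probe\ 2} & 2 & 1 \\
\mathrm{probe\ 3} & 3 & 6
\end{array}
\quad\longrightarrow\quad
\begin{array}{c|cc}
 & \mathrm{array\ 1} & \mathrm{array\ 2} \\ \hline
\mathrm{probe\ 1} & 5.5 & 3.5 \\
\mathrm{probe\ 2} & 1.5 & 1.5 \\
\mathrm{probe\ 3} & 3.5 & 5.5
\end{array}
\]
The sorted columns are $(2,3,5)$ and $(1,4,6)$, giving rank means $(2+1)/2 = 1.5$, $(3+4)/2 = 3.5$, and $(5+6)/2 = 5.5$. After normalization both arrays share exactly the same distribution of values, while the ordering of probes within each array is preserved.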
%%%
\section{Quality control}\label{quality1}
A very important microarray data preprocessing step is outlier detection. When correctly performed, this step can significantly improve data quality and thereby improve the outcome of the experiment, {\it e.g.} the detection of differential expression~\cite{allison2006microarray,kauffmann2010microarray}. Defective probes from Affymetrix chips can be detected and subsequently removed~\cite{li2001model}. In Illumina data, spatial artifacts can be detected and removed using BeadArray Subversion of Harshlight, or {\it BASH}~\cite{cairns2008bash}. Although the detection of large spatial artifacts may be helpful for determining whole outlier chips, the {\it BASH} algorithm only improves results very mildly. This can be ascribed to the extremely robust structure of the Illumina BeadChips (tested for Human-6 and GoldenGate BeadChips, Kuijjer {\it et al}., {\it unpublished results}). The more recently developed HumanHT-12 Expression BeadChips contain fewer replicates per bead type, and this preprocessing step may therefore be valuable for removing outliers in these newer chips. Other artifacts in Illumina data have been reported, such as particularly bright beads showing a bleed-over effect on neighboring beads, raising their associated values~\cite{smith2010identification}. One can adjust for such spatial artifacts by masking affected beads using the {\it beadarray} package~\cite{dunning2007beadarray}.
Regularly, it is necessary to remove entire chips of poor quality, since such chips can impair overall statistical and biological significance~\cite{kauffmann2010microarray}. Poor quality chips can be identified by visually checking the scanner images, the distribution of both raw and normalized data ({\it e.g.} by plotting density plots, boxplots, and MA-plots), and by performing unsupervised hierarchical clustering or visualizing the data using principal components analysis (PCA, reducing the data dimensionality to {\it e.g.} its first two or three principal components). Such methods can for example be applied using the Bioconductor package {\it arrayQualityMetrics}~\cite{kauffmann2009arrayqualitymetrics} (used for quality control of mRNA and kinome profiling in this thesis) or using quality control functions in the package {\it affy}~\cite{gautier2004affy}. Another method to control the influence of poor quality chips is assigning weights to all chips, so that arrays of better quality will have a higher influence on the analysis than poor quality arrays ({\it arrayWeights}~\cite{ritchie2006empirical}). Such an approach is, however, not intended to replace a quality check identifying catastrophically poor quality chips, and these should still be discarded. In a comparative study of removing poor quality chips with {\it arrayQualityMetrics}, assigning {\it arrayWeights} to the data, or applying both methods on the {\it LIMMA} analysis described in Chapter~\ref{ch4}, we detected more differentially expressed probes at a false\hyp{}discovery rate (FDR, see next section for an explanation) of 0.05 without assigning weights, but this depended on the FDR (for $0.05<$ FDR $\le0.1$ {\it arrayWeights} or a combination of both methods performed slightly better, Kuijjer {\it et al}., {\it unpublished results}).
In SNP microarray quality control, one can determine the ability of an experiment to resolve SNP signals into three genotype clusters (AA, AB, BB). The Affymetrix Genotyping Console Contrast Quality Control test metric is a measure for this ability~\cite{qualitycontrol}, and was used in this thesis (Chapters~\ref{ch7}--\ref{ch8}). This test uses $10,000$ random SNPs to measure the difference between peaks in the distributions of homozygote genotypes (AA and BB), and the valleys these distributions share with the heterozygote peak (AB). When this difference approaches zero, the experiment poorly distinguishes between homozygous and heterozygous genotypes. Such chips should be removed from further data analysis.
%
\section{Microarray data analysis}\label{analysis1}
After having performed the preprocessing steps necessary for the specific type and platform of microarray data, the actual data analysis can be performed.
Unsupervised hierarchical clustering of microarray data may not only be used as a quality check (as described in the previous section), but can also be applied to detect different subgroups of samples, which may be associated with a clinical feature. In a supervised approach, differences between groups of samples can be determined using a moderated t-test, such as the {\it LIMMA} analysis (used in this thesis for detection of differential expression and phosphorylation)~\cite{smyth2004linear}. It is important to note that with the testing of multiple hypotheses, the number of true null hypotheses that are falsely rejected will increase. In microarray experiments, often large numbers of probes/peptides are tested for differential expression or phosphorylation, and therefore, an excessive number of false positives may be returned from conventional statistical tests. Hence, a correction for multiple testing should be performed~\cite{allison2006microarray}. Examples of such methods are conservative familywise error rate procedures, such as the Bonferroni method~\cite{weisstein2006bonferroni}, or the less stringent false discovery rate (FDR) controlling methods, {\it e.g.} the Benjamini and Hochberg~\cite{benjamini1995controlling}, and Benjamini and Yekutieli~\cite{benjamini2001control} approaches. Other methods use permutations to estimate the FDR, such as Significance Analysis of Microarrays ({\it SAM})~\cite{tusher2001significance}.
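As a small worked example of the difference between a familywise error rate correction and an FDR controlling method (the five p-values are invented for illustration), consider $m = 5$ tests with ordered p-values $p_{(1)} \le \dots \le p_{(5)}$ equal to $0.001$, $0.008$, $0.025$, $0.039$, and $0.20$. At a level of $0.05$, the Bonferroni method rejects a hypothesis only if its p-value is at most $0.05/m = 0.01$, so the first two hypotheses are rejected. The Benjamini and Hochberg procedure at an FDR of $q = 0.05$ instead finds the largest rank $i$ for which
\[
p_{(i)} \le \frac{i}{m}\, q ,
\]
and rejects all hypotheses up to that rank. The thresholds are $0.01$, $0.02$, $0.03$, $0.04$, and $0.05$; the largest rank meeting its threshold is $i = 4$ (as $0.039 \le 0.04$), so four hypotheses are rejected.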
SNP data are analyzed in a different manner. Genotyping can be performed by specific genotyping algorithms, such as the Birdseed v2 algorithm in Genotyping Console, which uses unsupervised learning to fit the data, producing genotype calls and returning confidence scores for each SNP~\cite{genotypingconsole}. Copy number data analysis is performed by comparing the intensity signals for each marker and each sample against a reference genome, which usually consists of a set of in-house or publicly available control samples. A cut-off for gains and losses is used to determine whether probes are present in a region of amplification or deletion (in this thesis, an absolute log$_2$ ratio cut-off of 0.2, equivalent to an absolute fold change of approximately 1.15, was used). Using the genotyping information, calls can also be made for allelic ratios. In Nexus Copy Number software, this is done by determining the B-allele frequency. Regions of the genome which show LOH will not reveal any AB signals, {\it i.e.} no B-allele frequencies of 0.5 (at least in theory, if there are no normal cells present in the tumor tissue). This also makes the identification of allelic imbalance possible, which, over a genomic region, will show multiple B-allele frequencies in between 0 and 1, depending on the copy number of each allele.
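The correspondence between the log$_2$ ratio cut-off and fold change, and what such a cut-off implies, can be sketched as follows. The interpretation in terms of cell fractions assumes a diploid reference, a single-copy change, and a linear signal response, which are simplifications made for illustration only.
\[
2^{0.2} \approx 1.15, \qquad 2^{-0.2} \approx 0.87 .
\]
If a fraction $f$ of the cells in a sample carries a single-copy gain, the expected ratio is $(2+f)/2$, and for a single-copy loss it is $(2-f)/2$. Setting these equal to the cut-off values gives $f \approx 0.30$ for gains and $f \approx 0.26$ for losses. For comparison, a single-copy gain or loss present in all cells would give $\log_2(3/2) \approx 0.58$ or $\log_2(1/2) = -1$, respectively. Similarly, the B-allele frequency of a heterozygous SNP with $n_A$ copies of the A allele and $n_B$ copies of the B allele is $n_B/(n_A+n_B)$ in a pure sample, {\it e.g.} $1/3$ or $2/3$ for a three-copy region (AAB or ABB).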
A drawback in SNP data analysis is that copy number changes are detected relative to the overall DNA content in the sample~\cite{attiyeh2009genomic}. In addition, normal cell populations, such as stromal and inflammatory cells, and heterogeneity within the tumor itself can further impede the detection of the true copy number alterations in the tumor cell. In epithelial tumors, a DNA index can be determined by flow\hyp{}sorting tumor cells, which can separate these from mesenchymal cells, and which can identify subpopulations of tumor cells with different chromosomal aberrations. To infer true copy numbers and allelic states, the lesser allele intensity ratio (LAIR) algorithm (included in {\it beadarraySNP}~\cite{oosting2010beadarraysnp}) integrates the DNA index in the analysis of SNP data~\cite{corver2008genome}. Unfortunately, this approach cannot be applied to SNP data analysis of high\hyp{}grade osteosarcoma samples, as osteosarcoma is a mesenchymal tumor for which no specific markers are available. However, the amount of stroma in osteosarcoma is not as extensive as in epithelial tumors, and the percentage of stroma as determined by the pathologist could in principle be used in order to approximate the DNA index of these tumors.
SNP microarray data show a high degree of noise, and not all markers reflect the true copy number of the region. Segmentation is performed in order to identify the chromosomal segments with actual copy number aberrations. The most frequently used algorithms for segmentation are Circular binary segmentation (CBS)\hyp{}based~\cite{olshen2004circular} or Hidden Markov Model (HMM)\hyp{}based methods. CBS\hyp{}based methods divide the genome into ever smaller segments until no region can be further segmented, taking into account a minimum number of probes per segment. The SNPRank segmentation algorithm in Nexus Copy Number software is CBS\hyp{}based, and ranks log ratio probe values and B-allele frequencies in a segment. If the distribution of these probe ranks is significantly different from that of an adjacent segment, the region is segmented out, meaning the region probably has a different median copy number than that of the adjacent segment. HMM\hyp{}based methods, such as the SNP-FASST segmentation algorithm in Nexus Copy Number software, perform faster than CBS\hyp{}based methods, but require an estimate of the signal--copy number relationship, as they work with integer copy numbers. Because of the heterogeneity present in tumor samples, this is probably not an optimal way to segment tumor data~\cite{rasmussen2011allele}. We used SNPRank segmentation to segment the copy number data, with a minimum of 5 probes per segment. After segmentation, a cut-off for frequency of copy number changes can be set, so that the most recurrent alterations will be detected. One can also specifically look for focal or broad events, as is described in Chapter~\ref{ch9} of this thesis. As with the analysis of other microarray data types, permutations can be used to determine whether there are significant differences in copy number or LOH profiles of groups with different features ({\it e.g.} in Nexus Copy Number software).
%
\section{Downstream data analysis}\label{downstream1}
Deducing a biological interpretation from large lists of significant genes may be challenging, and validation of all significant genes is often very labor intensive. Several methods have been developed which determine whether specific signal transduction pathways, biological processes, or other groups of genes with similar functions, are affected. Genes making up such pathways or processes are often taken from public databases, such as the Gene Ontology (GO)~\cite{ashburner2000gene} or the Kyoto Encyclopedia of Genes and Genomes (KEGG)~\cite{kanehisa2000kegg}, or from commercial software, such as Ingenuity Pathways Analysis (IPA, Ingenuity Systems), which is manually curated. The hypergeometric test (a one\hyp{}tailed Fisher's exact test) is most often used to obtain information on the enrichment of significant genes in specific pathways or biological processes. This test determines whether there is more overlap between the list of significant genes and the set of genes of interest ({\it e.g.} the pathway) than would be expected by chance. The hypergeometric test can be applied to microarray data in IPA (used in Chapter~\ref{ch6}) and in the Bioconductor {\it topGO} package (used in Chapters~\ref{ch3} and~\ref{ch7})~\cite{alexa2006improved}. A disadvantage of this simple test is that it requires a hard definition of significance ({\it e.g.} a p-value cut-off), and discards information on the exact p-values of the genes tested. The hypergeometric test also assumes independence of genes, which does not accurately represent the biology of a cell, since the expression of functionally related genes is often correlated. Because of this assumption, the hypergeometric test may underestimate the true p-values. It is therefore recommended to use a very low p-value ({\it e.g.} $0.001$ or lower) as cut-off for significance when applying this test. Another problem of the hypergeometric test is that it assumes independence of categories.
GO terms are certainly not independent, as these terms are set up in a hierarchical structure of nodes, with parent terms representing a broader GO term, and child terms a more specific subset of its parent terms~\cite{ashburner2000gene,rhee2008use}. Algorithms which can identify the GO term which better represents the biological situation (significantly affected genes) than other terms from its neighborhood have been developed, such as the {\it weight} algorithm in the {\it topGO} package~\cite{alexa2006improved}.
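To make the hypergeometric test described above concrete, the following sketch uses notation that is introduced here for illustration only: $N$ genes are measured on the array, $K$ of which belong to the gene set of interest, and $n$ genes are called significant, $k$ of which fall in the gene set. The one\hyp{}tailed p-value is the probability of observing an overlap of at least $k$ genes when $n$ genes are drawn at random without replacement:
\[
p = \sum_{i=k}^{\min(n,K)} \frac{{K \choose i}{N-K \choose n-i}}{{N \choose n}} .
\]
For a toy example with $N=20$, $K=5$, $n=4$, and $k=3$, this gives $p = (10 \cdot 15 + 5 \cdot 1)/4845 = 155/4845 \approx 0.032$.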
A method which takes into account a continuous measure of significance is gene set enrichment analysis ({\it GSEA})~\cite{subramanian2005gene}. This method ranks genes based on their associated p-values and subsequently determines an enrichment score based on the rank of the genes present and not present in a specific pathway or category. The significance of this enrichment score is subsequently tested by permuting phenotype labels to determine the null distribution of the enrichment score.
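The running\hyp{}sum statistic underlying this enrichment score can be sketched as follows, after~\cite{subramanian2005gene}; the symbols are chosen here for illustration. Let the $N$ genes be ordered by a ranking metric $r_j$, and let $S$ be a gene set containing $N_H$ genes. Walking down the ranked list to position $i$, the fractions of genes inside and outside the set encountered so far are
\[
P_{\mathrm{hit}}(S,i) = \sum_{g_j \in S,\; j \leq i} \frac{|r_j|^{w}}{N_R}, \qquad N_R = \sum_{g_j \in S} |r_j|^{w}, \qquad P_{\mathrm{miss}}(S,i) = \sum_{g_j \notin S,\; j \leq i} \frac{1}{N - N_H} .
\]
The enrichment score is the maximum deviation from zero of $P_{\mathrm{hit}}(S,i) - P_{\mathrm{miss}}(S,i)$ over all positions $i$. The exponent $w$ weights each gene in the set by the magnitude of its ranking metric; for $w = 0$ all genes in the set contribute equally and the score reduces to a Kolmogorov--Smirnov\hyp{}type statistic.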
Another approach to determine which biological pathways are significantly affected is the {\it globaltest} (used in Chapter~\ref{ch5}). Based on a logistic regression model, this test determines whether a prespecified group of genes is differentially expressed, and thus tests groups of genes instead of single genes~\cite{goeman2004global}. This test is particularly intended for identifying gene sets for which many genes are associated with a phenotype in a small way. Using this approach may be especially fruitful in case no overall differential expression is detected due to small sample sizes, as this approach significantly reduces the multiple testing problem~\cite{goeman2005testing}. As a self\hyp{}contained test, the {\it globaltest} generally has more power than competitive tests (tests which compare a gene set with its complement), such as the hypergeometric test~\cite{goeman2007analyzing}. To apply the {\it globaltest} to GO terms, Goeman {\it et al}. also developed a method that preserves the specific graph structure of the Gene Ontology~\cite{goeman2008multiple}. In addition, this algorithm can be used in combination with follow\hyp{}up data~\cite{goeman2005testing}.
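In outline, and with notation introduced here for illustration, the model underlying the {\it globaltest}~\cite{goeman2004global} relates the phenotype $Y_i$ of sample $i$ to the expression $x_{ij}$ of the $m$ genes in the gene set through a generalized linear model,
\[
E(Y_i \mid \beta) = h^{-1}\Bigl( \alpha + \sum_{j=1}^{m} x_{ij} \beta_j \Bigr),
\]
where $h$ is the link function (the logit for a two\hyp{}class phenotype). The null hypothesis that none of the genes in the set is associated with the phenotype, $H_0: \beta_1 = \cdots = \beta_m = 0$, is tested as a single hypothesis by assuming that the $\beta_j$ are drawn from a common distribution with mean zero and variance $\tau^2$, and testing whether $\tau^2 = 0$. Because one test is performed per gene set rather than per gene, the number of tests, and hence the multiple testing burden, is reduced.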
A final method to extract biological information from lists of significantly affected genes is performing network analysis. Networks are assembled {\it de novo}, based on connectivity ({\it e.g.} binding or functional properties) between affected molecules. In IPA, networks are assembled using decreasingly connected molecules from the significant genes in the dataset which is analyzed, and are annotated with functional categories, which are manually curated. In contrast to pathway analysis, these IPA networks do not have directionality (but network analysis methods which include directionality between molecules also exist). We used network analysis to interpret differential gene expression between various histological subtypes of osteosarcoma (Chapter~\ref{ch3}).
%
\section{Supervised learning}\label{supervised1}
Generating a prediction profile which can classify tumors based on mRNA expression or specific copy number aberrations may also be used in microarray analysis of a cancer dataset. Classification may for example help to diagnose a tumor based on its microarray data profile, or may predict event\hyp{}free or overall survival of patients. Some examples of supervised learning approaches are nearest shrunken centroids classification ({\it e.g.} available in R package {\it pamr}~\cite{tibshirani2002diagnosis}), support vector machine (SVM) learning ({\it e.g.} available in R package {\it e1071}~\cite{dimitriadou2008misc}), and random forest classification ({\it e.g.} available in R package {\it varSelRF}~\cite{diaz2007genesrf}).
In this thesis, we used nearest shrunken centroids classification to develop a classifier of the main histological subtype of conventional osteosarcoma. We validated this classifier on an independent dataset, and applied it to data obtained from osteosarcoma model systems (Chapter~\ref{ch3}). Nearest centroid classification computes a standardized centroid for each class by dividing the average expression of each gene in the signature by its within\hyp{}class standard deviation. New samples are assigned to the class whose centroid is closest, in squared distance, to the expression of the genes in the prediction profile. Nearest shrunken centroids is an adaptation of this method: it shrinks each centroid toward the overall centroid for all classes by a certain threshold. This shrinkage automatically selects genes and reduces the effect of noisy genes. The profile with the lowest prediction error is then selected as the final classifier. Internal cross\hyp{}validation, which divides the training set into different parts, is subsequently used to compute a cross\hyp{}validated error. This approach, however, leads to an underestimation of the error rate, as the same data are used to select features and to estimate the error rate. An extra external cross\hyp{}validation step would thus be appropriate, or, given that there is often only a limited number of samples available for training, the feature selection (genes to include in the profile) should be newly computed for each separate cross\hyp{}validation step~\cite{wood2007classification,ambroise2002selection}. External cross\hyp{}validation is performed in order to correct for overfitting of the data by the model. This can be done on an independent validation set, or by using methods such as leave\hyp{}one\hyp{}out cross\hyp{}validation~\cite{simon2003pitfalls}. Regularization may also be used to prevent overfitting, but this is not often used in microarray data analysis, and is therefore beyond the scope of this thesis.
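The shrinkage step can be written out as follows, after~\cite{tibshirani2002diagnosis}; the notation is introduced here for illustration. Let $\bar{x}_{ik}$ be the mean expression of gene $i$ in class $k$, $\bar{x}_i$ its overall mean, $s_i$ its pooled within\hyp{}class standard deviation, $s_0$ a small positive constant, and $m_k$ a normalizing constant that depends on the class sizes. The standardized difference between the class centroid and the overall centroid,
\[
d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)},
\]
is shrunken toward zero by soft thresholding with threshold $\Delta$:
\[
d'_{ik} = \mathrm{sign}(d_{ik}) \, \max(|d_{ik}| - \Delta,\, 0), \qquad \bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0)\, d'_{ik} .
\]
A gene for which $d'_{ik} = 0$ in all classes no longer contributes to the classification, which is how gene selection comes about. A new sample with expression values $x^*_i$ is assigned to the class $k$ that minimizes the discriminant score
\[
\delta_k(x^*) = \sum_{i} \frac{(x^*_i - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k ,
\]
where $\pi_k$ is the prior probability of class $k$.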
In prediction profiling, the way the distance between the actual sample and the class is calculated may differ considerably between methods, and this has important consequences for biological interpretation of the profile. In a prediction profile where the magnitude ({\it e.g.} of gene expression) is important, Euclidean distance is best used, while correlation ({\it e.g.} Pearson or Spearman) coefficients are more useful when the way the genes depend on each other, {\it i.e.} the pattern of expression, is important for the specific gene list~\cite{quackenbush2006microarray}. This may be one of the reasons why the CINSARC profile, a gene expression signature which was generated on sarcomas~\cite{chibon2010validated} and which uses Spearman correlation as a measurement for distance, did not show significant results on our osteosarcoma dataset (centroids for the classifier needed to be retrained, because we used data from a different platform than the original CINSARC signature, Kuijjer {\it et al}., {\it unpublished results}), while the Carter signature~\cite{carter2006signature}, which classifies data based on average expression of genomic instability genes, could predict metastasis\hyp{}free survival in our data (as shown in Chapter~\ref{ch7}).
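The difference between the two kinds of distance can be made explicit with a small sketch (notation introduced here for illustration). For a sample profile $x = (x_1,\ldots,x_p)$ and a class centroid $y = (y_1,\ldots,y_p)$ over $p$ genes,
\[
d_{\mathrm{E}}(x,y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}, \qquad d_{r}(x,y) = 1 - r(x,y),
\]
where $r(x,y)$ is the Pearson correlation coefficient. If the sample follows the same pattern as a (non\hyp{}constant) centroid but at a uniformly higher level, $x_i = y_i + c$, then $d_{r} = 0$ whereas $d_{\mathrm{E}} = |c| \sqrt{p}$. A correlation\hyp{}based classifier thus ignores the overall shift in expression that a magnitude\hyp{}based classifier would pick up.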
%
\section{Data integration}\label{integration1}
As explained in the next chapter, the integration of different data types is particularly relevant when studying a highly genomically unstable tumor. We used superimposed integration of mRNA expression and kinome profiling data in Chapter~\ref{ch6}. This approach was taken because kinase activity usually does not have a direct downstream effect on mRNA expression (generally, there are several intermediate molecules which confer signaling), and vice versa. It may therefore be more relevant to determine how these data complement each other, instead of identifying only overlapping genes.
For integration of copy number and gene expression data (Chapters~\ref{ch7}--\ref{ch8}), we identified genes with aberrations occurring in both data types, as the copy number state of a gene can have a direct effect on its expression. We specifically chose to identify cooccurrence and not correlation of copy number and expression signals, because these signals do not have to show a linear correlation: a correlation\hyp{}based approach would miss genes which are also regulated at other levels, such as epigenetics and feedback mechanisms.
A conservative approach was taken---only genes which were significantly differentially expressed between osteosarcoma tumors and presumed osteosarcoma progenitors were analyzed, and the cut-off for recurrence was set to 35\%. We tested this approach in a paired and nonpaired way to determine cooccurrence of copy number aberrations and differential expression in Chapter~\ref{ch7}, and used paired analysis of cooccurrence of LOH, copy number gains, and differential expression in Chapter~\ref{ch8}.
%
\section{Aims and outline of this thesis}\label{aims1}
In this thesis, a systems biology approach to study high\hyp{}grade osteosarcoma is described. Chapter~\ref{ch1} (this chapter) starts with an introduction to cancer genomics and high\hyp{}grade osteosarcoma, and introduces the EuroBoNeT high\hyp{}grade osteosarcoma database, on which the research in the following chapters is based. In addition, different platforms used in this thesis are described, and different types of high\hyp{}throughput data analyses are explained.
In Chapter~\ref{ch2}, published literature on microarray studies on high\hyp{}grade osteosarcoma is reviewed. This review also discusses challenges in high\hyp{}throughput data analysis of osteosarcoma and introduces different model systems which have been used in osteosarcoma research. In addition, information on different comparative analyses and a rationale for integrating different data types are given. The review concludes with a section on how bioinformatics can be translated into functional studies.
The following six chapters of the thesis describe the work which has been performed to answer different research questions regarding osteosarcoma biology and possible targets for therapy. Specifically, we aimed to study molecular differences between clinically different tumors, such as tumors of different histological subtypes, and of tumors with different metastasis\hyp{}free survival profiles. These research questions are answered in Chapters~\ref{ch3} and~\ref{ch4}, respectively. In addition, in Chapter~\ref{ch3}, a histological subtype\hyp{}specific gene expression profile is tested on osteosarcoma model systems. High\hyp{}grade osteosarcoma is also compared with controls, in order to detect what signal transduction pathways may be targeted in osteosarcoma to identify potential adjuvant drugs for treatment of this aggressive tumor (Chapters~\ref{ch5} and~\ref{ch6}). Chapter~\ref{ch5} reports on the analysis of gene expression data, while Chapter~\ref{ch6} determines active pathways based on kinome profiling, and integrates gene expression data with kinome profiling results. Finally, we performed integrative data analysis of SNP and gene expression data, to detect osteosarcoma driver genes (Chapters~\ref{ch7} and~\ref{ch8}). In Chapter~\ref{ch7}, copy number aberrations are integrated with overexpression and downregulation, while in Chapter~\ref{ch8} we specifically look at the combination of Loss of Heterozygosity (LOH), DNA copy number gain, and differential mRNA expression.
In Chapter~\ref{ch9}, results described in Chapters~\ref{ch3} to~\ref{ch8} are discussed and future perspectives for high\hyp{}throughput data analysis on high\hyp{}grade osteosarcoma are given. Chapter~\ref{ch10} includes a Dutch summary, Curriculum Vitae, and a list of publications.
%%% references
\begin{small}
\begin{singlespace}
\bibliographystyle{unsrtnatshort} % sorted as referenced, was unsrtnat, but unsrtnatshort gives shorter output
\bibliography{biblio}
\end{singlespace}
\end{small}
%\end{document}