-
Notifications
You must be signed in to change notification settings - Fork 0
/
bioc-workflow-annotated.tex
2897 lines (2572 loc) · 135 KB
/
bioc-workflow-annotated.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\section*{Introduction}
Quantitative mass spectrometry based spatial proteomics involves
elaborate, expensive and time consuming experimental protocols and
considerable effort is invested in the generation of such data.
Multiple research groups have described a variety of approaches to
establish high quality proteome-wide datasets (see for example
\cite{Gatto:2010} for a review, and
\cite{hyper,Itzhak:2016,Jean_Beltran:2016} for recent
examples). However, data analysis is as critical as data production
for reliable and insightful biological interpretation. Here, we walk
the reader through a typical pipeline for the analysis of such data
using several Bioconductor~\cite{Huber:2015} packages for the R
statistical programming environment.
The main package to analyse protein localisation data is
\Biocpkg{pRoloc}, which offers a set of dedicated functions for the
analysis of such data. \Biocpkg{pRoloc} itself relies on
\Biocpkg{MSnbase} to manipulate and process quantitative proteomics
data. Many other packages are used by \Biocpkg{pRoloc} for clustering,
classification and visualisation. Support for interactive
visualisation is offered by the \Biocpkg{pRolocGUI} package.
In this workflow, we will describe how to prepare the spatial
proteomics data starting from a spreadsheet containing quantitative
mass spectrometry data, through to some essential data processing
steps, and finish with different applications of machine learning
(Figure \ref{fig:overview}). We focus on a recent pluripotent mouse
embryonic stem cells experiment \cite{hyper}. These data, as well as
additional annotated and pre-formatted datasets from various species
are readily available in the \Biocexptpkg{pRolocdata} package.
Installation of Bioconductor package is documented in detail on the
\href{http://bioconductor.org/install/#install-bioconductor-packages}{Bioconductor
installation help page}. Below, we show how to install the four main
packages used in this workflow:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{source}\hlstd{(}\hlstr{"https://bioconductor.org/biocLite.R"}\hlstd{)}
\hlkwd{biocLite}\hlstd{(}\hlkwd{c}\hlstd{(}\hlstr{"MSnase"}\hlstd{,} \hlstr{"pRoloc"}\hlstd{,} \hlstr{"pRolocdata"}\hlstd{,} \hlstr{"pRolocGUI"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
This procedure is also applicable to any packages, from
\href{https://cran.r-project.org/}{CRAN} as well as GitHub. Once a
package has been installed, it needs to be loaded for its
functionality to become available in the R session; this is done with
the \texttt{library} function e.g. to load the \Biocpkg{pRoloc} package one
would type \texttt{library("pRoloc")} after installation.
If you have questions about this workflow in particular, or about
other Bioconductor packages in general, they are best asked on the
\href{https://support.bioconductor.org/}{Bioconductor support site}
following the
\href{http://www.bioconductor.org/help/support/posting-guide/}{posting
guidelines}. Questions can be tagged with specific package names or
keywords. For more general information about mass spectrometry and
proteomics, the readers are invited to read the
\Biocexptpkg{RforProteomics} package vignettes and associated
papers \cite{Gatto:2014,Gatto:2015}.
\begin{figure}[!ht]
\centering
\includegraphics[width=.5\textwidth]{./Figures/overview.pdf}
\caption{Schematic overview of the pRoloc pipeline from data import, through to data processing, machine learning and data export.}
\label{fig:overview}
\end{figure}
\section*{Reading and processing spatial proteomics data}
\subsection*{The use-case: predicting sub-cellular localisation in pluripotent embryonic mouse stem cells}
As a use-case, we analyse a recent high-throughput spatial proteomics
dataset from pluripotent mouse embryonic stem cells (E14TG2a)
\cite{hyper}. The data was generated using hyperplexed LOPIT
(hyperLOPIT), a state-of-the-art method relying on improved
sub-cellular fractionation and more accurate quantitation, leading to
more reliable classification of protein localisation across the whole
sub-cellular space. The method uses an elaborate sub-cellular
fractionation scheme, enabled by the use of Tandem Mass Tag (TMT)
\cite{Thompson:2003} 10-plex and application of the MS data
acquisition technique named synchronous precursor selection MS$^3$
(SPS-MS$^3$) \cite{McAlister:2014}, for TMT quantification with high
accuracy and precision. Three biological replicates were generated
from the E14TG2a experiment, the first was to target low density
fractions and the second and third were to emphasis separation of the
denser organelles. The intersect of replicates 1 and 2 was treated as
a 20-plex dataset for the analysis. As discussed in the publication
\cite{hyper}, it has been shown that combining replicates from
different gradients can increase spatial resolution
\cite{Trotter:2010}. The combination of replicates resulted in 5032
proteins common to both experiments.
These, as well as many other data are directly available as properly
structured and annotated datasets from the \Biocexptpkg{pRolocdata}
experiment package. In this workflow, we will start with a description
of how to generate these ad hoc data objects starting from an
arbitrary spreadsheet, as produced by many popular third-party
applications.
While we focus here on a LOPIT-type dataset, these analyses are
\hl{relevent} for any quantitative spatial proteomics data,
irrespective of the fractionation (i.e. density gradient or
differential centrifugation \cite{Itzhak:2016}) or quantitation
(i.e. labelled or label-free) methods.
\subsection*{The infrastructure: \Biocpkg{pRoloc} and \Biocpkg{MSnbase} packages}
To make use of the full functionality of the \Biocpkg{pRoloc} software
one needs to import their data into R and prepare them as an
\texttt{MSnSet}. The \texttt{MSnSet} is a dedicated data structure for
the efficient manipulation and processing of mass spectrometry and
proteomics data in R. Figure \ref{fig:msnset} illustrates a simplified view of the
\texttt{MSnSet} structure; there exists 3 key sub-parts (termed slots)
to such a data object: (1) the \texttt{exprs} (short for
\textit{expression} data) slot for storing the quantitation data, (2)
the \texttt{fData} slot (short for \textit{feature}-metadata) for
storing the feature meta-data, and finally (3) the \texttt{pData} slot
(short for \textit{pheno}-metadata, i.e. sample phenotypic data) for
storing the sample meta-data.
\begin{figure}[!ht]
\centering
\includegraphics[width=.5\textwidth]{./Figures/msnset.png}
\caption{Simplified representation of the \texttt{MSnSet} data
structure (reproduced with permission from the \Biocpkg{MSnbase}
vignette)}
\label{fig:msnset}
\end{figure}
Feature metadata typically contains general annotation about the
proteins (accession numbers, description, \ldots), information related
to the identification search (confidence scores, number of peptides,
\ldots) as well as annotation about known sub-cellular location (see in
particular the \textit{Markers} section) and results from data
analysis. The sample metadata would, for example, record what stable
isotope labels were used for the respective fraction (when labelled
quantitation is used), replicate number, fraction number along the
gradient and pooling information.
Another slot of interest is \texttt{processingData}, that logs the
processing \texttt{MSnSet} objects undergo. The processing log can be
accessed with the \texttt{processingData} function and is displayed
under \textit{Processing information} in the textual object summary
when an \texttt{MSnSet}'s name is typed in the R console.
\subsection*{Importing data}
There are a number of ways to import quantitation data and create an
\texttt{MSnSet} instance. All methods are described in the
\Biocpkg{MSnbase}
\href{http://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf}{input/output
capabilities vignette}. One suggested simple method is to use the
function \texttt{readMSnSet2}. The function takes a single spreadsheet
file name as input and extracts the columns containing the
quantitation data, as identified by the argument \texttt{ecol}, to
create the expression data, while the other columns in the spreadsheet
are appended to the feature meta-data slot. By example, in the code
chunk below we read in the \texttt{csv} spreadsheet containing the
quantitation data from the intersect of replicates 1 and 2 of the
mouse map \cite{hyper}, using the \texttt{readMSnSet2} function. The
data is as available online with the manuscript (see tab 2 of the
\texttt{xlsx} supplementary data set 1 in \cite{hyper}, which should
be exported as a text-based spreadsheet). It is also available as a
\texttt{csv} in the Bioconductor \Biocexptpkg{pRolocdata} data
package, which we make use of below.
To use the \texttt{readMSnSet2} function, as a minimum one must
specify the file path to the data and which columns of the spreadsheet
contain quantitation data. In the code chunk below, we start by
identifying the file that we want to use. The \texttt{system.file}
function is used to return the path to the \texttt{extdata} directory
from the \Biocexptpkg{pRolocdata} package, which is where our file of
interest resides. We then use the \texttt{dir} function to list the
content of that directory and store the path that matches the file
name of interest in the \texttt{csvfile}. Note that these two lines
are only needed here to locate a file in a package; in a more general
use case, the user would define the \texttt{csvfile} variable
containing the file name of interest directly.
A common pitfall here is to provide only the file name, rather than
full path to the file (which is what is shown below with
\texttt{basename}; we don't print the full path, as it will vary from
computer to computer). Note that only specifying the name of the file
is sufficient when it exists in the working directory (i.e. the
directory in which R is running, which can be queried and changed with
the \texttt{getwd} and \texttt{setwd} functions respectively).
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{extdatadir} \hlkwb{<-} \hlkwd{system.file}\hlstd{(}\hlstr{"extdata"}\hlstd{,} \hlkwc{package} \hlstd{=} \hlstr{"pRolocdata"}\hlstd{)}
\hlstd{csvfile} \hlkwb{<-} \hlkwd{dir}\hlstd{(extdatadir,} \hlkwc{full.names} \hlstd{=} \hlnum{TRUE}\hlstd{,}
\hlkwc{pattern} \hlstd{=} \hlstr{"hyperLOPIT-SIData-ms3-rep12-intersect.csv"}\hlstd{)}
\hlkwd{basename}\hlstd{(csvfile)}
\end{alltt}
\begin{verbatim}
## [1] "hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"
\end{verbatim}
\end{kframe}
\end{knitrout}
Note that the file is compressed (as indicated by the \texttt{gz}, for
\texttt{gzip}, extension), and will be decompressed on-the-fly when
read into R.
Next, we need to identify which columns in the spreadsheet contain the
quantitation data. This can be done using the \texttt{getEcols}
function inside R. The spreadsheet deposited by the authors contains
two headers, with the second header containing information about where
the quantitation data is stored.
\begin{figure}[!ht]
\centering
\includegraphics[width=.85\textwidth]{./Figures/spreadsheet-screenshot.png}
\caption{A screenshot of the data in the spreadsheet.}
\label{fig:spreadsheet}
\end{figure}
We can display the names of the second header by calling the
\texttt{getEcols} function with the argument \texttt{n = 2} (the
default value is \texttt{n = 1}), to specify that we wish to display
the column names of the second line. We also specify the name of the
spreadsheet file (defined as \texttt{csvfile} above) and the
separator that splits cells.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{getEcols}\hlstd{(csvfile,} \hlkwc{split} \hlstd{=} \hlstr{","}\hlstd{,} \hlkwc{n} \hlstd{=} \hlnum{2}\hlstd{)}
\end{alltt}
\begin{verbatim}
## [1] ""
## [2] ""
## [3] ""
## [4] "Experiment 1"
## [5] "Experiment 2"
## [6] "Experiment 1"
## [7] "Experiment 2"
## [8] "126"
## [9] "127N"
## [10] "127C"
## [11] "128N"
## [12] "128C"
## [13] "129N"
## [14] "129C"
## [15] "130N"
## [16] "130C"
## [17] "131"
## [18] "126"
## [19] "127N"
## [20] "127C"
## [21] "128N"
## [22] "128C"
## [23] "129N"
## [24] "129C"
## [25] "130N"
## [26] "130C"
## [27] "131"
## [28] "phenoDisco Input"
## [29] "phenoDisco Output"
## [30] "Curated phenoDisco Output"
## [31] "SVM marker set"
## [32] "SVM classification"
## [33] "SVM score"
## [34] "SVM classification (top quartile)"
## [35] "Final Localization Assignment"
## [36] "First localization evidence?"
## [37] "Curated Organelles"
## [38] "Cytoskeletal Components"
## [39] "Trafficking Proteins"
## [40] "Protein Complexes"
## [41] "Signaling Cascades"
## [42] "Oct4 Interactome"
## [43] "Nanog Interactome"
## [44] "Sox2 Interactome"
## [45] "Cell Surface Proteins"
\end{verbatim}
\end{kframe}
\end{knitrout}
It is now easy for one to identify that the quantitation data,
corresponding to the 10 TMT isobaric tags, is located in columns 8
to 27. We now have the two mandatory arguments to \texttt{readMSnSet2},
namely the file name (stored in the \texttt{csvfile} variable) and the
quantitation column indices. In addition to these, it is also possible
to pass the optional argument \texttt{fnames} to indicate which column to use
as the labels by which to identify each protein in the sample. Here,
we use \texttt{fnames = 1} to use the UniProt identifiers contained in the
first (unnamed) column of the spreadsheet. We also need to specify to
skip the first line of the file (for the same reason that we used
\texttt{n = 2} in \texttt{getEcols} above) to read the \texttt{csv} data and convert it to an
\texttt{MSnSet} object, named \texttt{hl} (for hyperLOPIT).
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hl} \hlkwb{<-} \hlkwd{readMSnSet2}\hlstd{(csvfile,} \hlkwc{ecol} \hlstd{=} \hlnum{8}\hlopt{:}\hlnum{27}\hlstd{,} \hlkwc{fnames} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{skip} \hlstd{=} \hlnum{1}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
Below, we display a short summary of the data. The data contains
5032 proteins/features common across the 2 biological replicates
for the respective 2 x 10-plex reporter tags (20
columns or samples), along with associated feature meta-data such as
protein markers, protein description, number of quantified peptides
etc (see below).
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hl}
\end{alltt}
\begin{verbatim}
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5032 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData
## featureNames: Q9JHU4 Q9QXS1-3 ... Q9Z2R6 (5032 total)
## fvarLabels: X X.1 ... Cell.Surface.Proteins (25 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## MSnbase version: 2.7.1
\end{verbatim}
\end{kframe}
\end{knitrout}
Below, we examine the quantitative information along the whole
gradient for first 5 proteins. It is also possible to access specific
rows and columns by naming the proteins and TMT tag channels of
interest.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{exprs}\hlstd{(hl)[}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{, ]}
\end{alltt}
\begin{verbatim}
## X126 X127N X127C X128N X128C X129N X129C X130N X130C X131
## Q9JHU4 0.028 0.034 0.024 0.014 0.026 0.045 0.107 0.341 0.059 0.321
## Q9QXS1-3 0.039 0.134 0.095 0.053 0.084 0.121 0.107 0.128 0.122 0.117
## Q9ERU9 0.021 0.013 0.014 0.009 0.024 0.054 0.116 0.257 0.209 0.284
## P26039 0.120 0.255 0.148 0.091 0.135 0.095 0.041 0.057 0.014 0.043
## Q8BTM8 0.055 0.139 0.078 0.050 0.077 0.098 0.093 0.171 0.079 0.160
## X126.1 X127N.1 X127C.1 X128N.1 X128C.1 X129N.1 X129C.1 X130N.1
## Q9JHU4 0.037 0.064 0.058 0.059 0.067 0.078 0.140 0.208
## Q9QXS1-3 0.033 0.073 0.074 0.062 0.081 0.142 0.190 0.069
## Q9ERU9 0.026 0.017 0.023 0.029 0.039 0.071 0.105 0.171
## P26039 0.111 0.181 0.141 0.144 0.152 0.119 0.075 0.028
## Q8BTM8 0.062 0.108 0.091 0.086 0.099 0.111 0.117 0.095
## X130C.1 X131.1
## Q9JHU4 0.141 0.147
## Q9QXS1-3 0.151 0.125
## Q9ERU9 0.304 0.215
## P26039 0.017 0.033
## Q8BTM8 0.144 0.087
\end{verbatim}
\begin{alltt}
\hlkwd{exprs}\hlstd{(hl)[}\hlkwd{c}\hlstd{(}\hlstr{"Q9ERU9"}\hlstd{,} \hlstr{"Q9Z2R6"}\hlstd{),} \hlkwd{c}\hlstd{(}\hlstr{"X126"}\hlstd{,} \hlstr{"X131.1"}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## X126 X131.1
## Q9ERU9 0.021 0.215
## Q9Z2R6 0.563 0.000
\end{verbatim}
\end{kframe}
\end{knitrout}
The feature meta-data is stored in the \texttt{fData} slot and can be
accessed by \texttt{fData(hl)}. When using \texttt{readMSnSet2}, automatically,
everything that is not defined as quantitation data by \texttt{ecol} or the
feature names by \texttt{fnames} is deposited to the \texttt{fData} slot.
We see the \texttt{fData} contains 25 columns describing information such as
the number of peptides, associated markers, machine learning results
etc. To identify the feature variable names we can use the function
\texttt{fvarLabels}. We see that the first 6 feature variable names contain
non-discriminatory label names, so we relabel them to help us identify
what feature data information is stored in the associated columns.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{fvarLabels}\hlstd{(hl)}
\end{alltt}
\begin{verbatim}
## [1] "X"
## [2] "X.1"
## [3] "X.2"
## [4] "Experiment.1"
## [5] "Experiment.2"
## [6] "Experiment.1.1"
## [7] "Experiment.2.1"
## [8] "phenoDisco.Input"
## [9] "phenoDisco.Output"
## [10] "Curated.phenoDisco.Output"
## [11] "SVM.marker.set"
## [12] "SVM.classification"
## [13] "SVM.score"
## [14] "SVM.classification..top.quartile."
## [15] "Final.Localization.Assignment"
## [16] "First.localization.evidence."
## [17] "Curated.Organelles"
## [18] "Cytoskeletal.Components"
## [19] "Trafficking.Proteins"
## [20] "Protein.Complexes"
## [21] "Signaling.Cascades"
## [22] "Oct4.Interactome"
## [23] "Nanog.Interactome"
## [24] "Sox2.Interactome"
## [25] "Cell.Surface.Proteins"
\end{verbatim}
\begin{alltt}
\hlkwd{fvarLabels}\hlstd{(hl)[}\hlnum{1}\hlopt{:}\hlnum{3}\hlstd{]} \hlkwb{<-} \hlkwd{c}\hlstd{(}\hlstr{"uniprot.accession"}\hlstd{,} \hlstr{"uniprot.id"}\hlstd{,} \hlstr{"description"}\hlstd{)}
\hlkwd{fvarLabels}\hlstd{(hl)[}\hlnum{4}\hlopt{:}\hlnum{6}\hlstd{]} \hlkwb{<-} \hlkwd{paste0}\hlstd{(}\hlstr{"peptides.expt"}\hlstd{,} \hlnum{1}\hlopt{:}\hlnum{3}\hlstd{)}
\hlcom{## feature vars 1, 2, and 4 to 6}
\hlkwd{fData}\hlstd{(hl)[}\hlnum{1}\hlopt{:}\hlnum{4}\hlstd{,} \hlkwd{c}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{2}\hlstd{,} \hlnum{4}\hlopt{:}\hlnum{6}\hlstd{)]}
\end{alltt}
\begin{verbatim}
## uniprot.accession uniprot.id peptides.expt1 peptides.expt2
## Q9JHU4 Q9JHU4 DYHC1_MOUSE 175 166
## Q9QXS1-3 Q9QXS1-3 PLEC_MOUSE 123 150
## Q9ERU9 Q9ERU9 RBP2_MOUSE 101 90
## P26039 P26039 TLN1_MOUSE 101 94
## peptides.expt3
## Q9JHU4 322
## Q9QXS1-3 174
## Q9ERU9 181
## P26039 167
\end{verbatim}
\end{kframe}
\end{knitrout}
Note that when using the simple \texttt{readMSnSet2} procedure, the
\texttt{pData} slot which is used to store information about the
samples/channels is kept empty. As illustrated below, one can use the
\texttt{\$} operator to access (or create) individual columns in the
metadata slot. It is advised to annotate the channels as well. Below,
we annotate the replicate from which the profiles originate and the
TMT tag (extracted from the sample/channel names). To do so, we use
the sample names that were assigned automatically using the
quantiation column names and remove leading \texttt{X} and trailing
\texttt{.1} using the \texttt{sub} function.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{pData}\hlstd{(hl)}\hlopt{$}\hlstd{Replicate} \hlkwb{<-} \hlkwd{rep}\hlstd{(}\hlnum{1}\hlopt{:}\hlnum{2}\hlstd{,} \hlkwc{each} \hlstd{=} \hlnum{10}\hlstd{)}
\hlkwd{pData}\hlstd{(hl)}\hlopt{$}\hlstd{Tag} \hlkwb{<-} \hlkwd{sub}\hlstd{(}\hlstr{"\textbackslash{}\textbackslash{}.1$"}\hlstd{,} \hlstr{""}\hlstd{,} \hlkwd{sub}\hlstd{(}\hlstr{"^X"}\hlstd{,} \hlstr{""}\hlstd{,} \hlkwd{sampleNames}\hlstd{(hl)))}
\hlkwd{pData}\hlstd{(hl)}
\end{alltt}
\begin{verbatim}
## Replicate Tag
## X126 1 126
## X127N 1 127N
## X127C 1 127C
## X128N 1 128N
## X128C 1 128C
## X129N 1 129N
## X129C 1 129C
## X130N 1 130N
## X130C 1 130C
## X131 1 131
## X126.1 2 126
## X127N.1 2 127N
## X127C.1 2 127C
## X128N.1 2 128N
## X128C.1 2 128C
## X129N.1 2 129N
## X129C.1 2 129C
## X130N.1 2 130N
## X130C.1 2 130C
## X131.1 2 131
\end{verbatim}
\end{kframe}
\end{knitrout}
Throughout this workflow we refer to the different columns that
are found in the \texttt{exprs} (expression data) slot as channels (short for TMT channels).
In the frame of LOPIT and hyperLOPIT these channels constitute the
relative abundance of each protein (along the rows) in the channel of
interest. Each TMT channel originates from fractions collected from
the density gradient, or a set of pooled fractions or may be a sample
originating from an alternative preparation e.g. such as from the
chromatin enrichment performed in Christoforou et al \cite{hyper}.
Information about which gradient fractions were used for which tag
should also be stored in the sample meta-data \texttt{pData} slot.
The sample meta-data that is distributed with the
\Biocexptpkg{pRolocdata} package for Christoforou's
hyperLOPIT experiment and (as above) the quantitation data file, are
located in the \texttt{extdata} in the \Biocexptpkg{pRolocdata}
package on the hard drive.
In the code chunk below we again use the
\texttt{dir} function to locate the filepath to the meta-data \texttt{csv} file and
then read it into R using \texttt{read.csv}. We then append the meta-data to
the \texttt{pData} slot. Information about the gradient fractions used and
the associated subcellular fraction densities in % w/v Iodixanol for
each replicate are stored here.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{expinfo} \hlkwb{<-} \hlkwd{dir}\hlstd{(extdatadir,} \hlkwc{full.names} \hlstd{=} \hlnum{TRUE}\hlstd{,}
\hlkwc{pattern} \hlstd{=} \hlstr{"hyperLOPIT-SIData-fraction-info.csv"}\hlstd{)}
\hlstd{fracinfo} \hlkwb{<-} \hlkwd{read.csv}\hlstd{(expinfo,} \hlkwc{row.names}\hlstd{=}\hlnum{1}\hlstd{,} \hlkwc{skip} \hlstd{=} \hlnum{2}\hlstd{,}
\hlkwc{header} \hlstd{=} \hlnum{FALSE}\hlstd{,} \hlkwc{stringsAsFactors} \hlstd{=} \hlnum{FALSE}\hlstd{)}
\hlkwd{pData}\hlstd{(hl)}\hlopt{$}\hlstd{Gradient.Fraction} \hlkwb{<-} \hlkwd{c}\hlstd{(fracinfo[,} \hlnum{1}\hlstd{], fracinfo[,} \hlnum{2}\hlstd{])}
\hlkwd{pData}\hlstd{(hl)}\hlopt{$}\hlstd{Iodixonal.Density} \hlkwb{<-} \hlkwd{c}\hlstd{(fracinfo[,} \hlnum{4}\hlstd{], fracinfo[,} \hlnum{5}\hlstd{])}
\hlkwd{pData}\hlstd{(hl)}
\end{alltt}
\begin{verbatim}
## Replicate Tag Gradient.Fraction Iodixonal.Density
## X126 1 126 Cytosol 0.0
## X127N 1 127N 1 to 6 (pooled) 6.0
## X127C 1 127C 8 to 9 (pooled) 11.0
## X128N 1 128N 10 to 11 (pooled) 13.3
## X128C 1 128C 12 14.6
## X129N 1 129N 14 17.4
## X129C 1 129C 16 20.1
## X130N 1 130N 18 26.8
## X130C 1 130C Chromatin NA
## X131 1 131 19 34.5
## X126.1 2 126 Cytosol 0.0
## X127N.1 2 127N 1 to 6 (pooled) 5.2
## X127C.1 2 127C 7 to 9 (pooled) 10.0
## X128N.1 2 128N 10 to 11 (pooled) 12.5
## X128C.1 2 128C 12 14.0
## X129N.1 2 129N 14 to 15 (pooled) 17.3
## X129C.1 2 129C 17 20.9
## X130N.1 2 130N 18 to 19 (pooled) 24.7
## X130C.1 2 130C Chromatin NA
## X131.1 2 131 20 31.9
\end{verbatim}
\end{kframe}
\end{knitrout}
\subsection*{Data processing}
\subsubsection*{Normalisation}
There are two aspects related to data normalisation that are \hl{relevent}
to spatial proteomics data processing. The first one focuses on
reducing purely technical variation between channels without affecting
biological variability (i.e. the shape of the quantitatives
profiles). This normalisation will depend on the underlying
quantitative technology and the experimental design, and will not be
addressed in this workflow. The second aspect, and more specific to
spatial proteomics data, is scaling all the organelle-specific
profiles into \hl{the} same intensity interval (typically 0 and 1) by, for
example, dividing each intensity by the sum of the intensities for
that quantitative feature. This is not necessary in this example as
the intensities for each replicate have already been re-scaled to 1 in
Proteome Discoverer v1.4 Thermo Fisher. However, if the data require
normalisation, the user can execute the \texttt{normalise} function as
demonstrated in the below code chunk.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hl} \hlkwb{<-} \hlkwd{normalise}\hlstd{(hl,} \hlkwc{method} \hlstd{=} \hlstr{"sum"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
This transformation of the data assures cancellation of the effect of
the absolute intensities of the quantitative features along the rows,
and focus subsequent analyses on the relative profiles along the
sub-cellular channels.
The same \texttt{normalise} function (or \texttt{normalize}, both
spellings are supported) can also be applied in the first case
described above. Different normalisation methods, such as mean or
median scaling, variance stabilisation or quantile normalisation, to
cite a few, can be applied to accomodate different needs (see
\texttt{?normalise} for available options).
As previously mentioned, before combination, the two replicates in the
\texttt{hl} data that we read into R were separately normalised by sum (i.e.
to 1) across the 10 channels for each replicate respectively. We can
verify this by summing each rows for each replicate:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{summary}\hlstd{(}\hlkwd{rowSums}\hlstd{(}\hlkwd{exprs}\hlstd{(hl[, hl}\hlopt{$}\hlstd{Replicate} \hlopt{==} \hlnum{1}\hlstd{])))}
\end{alltt}
\begin{verbatim}
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.997 0.999 1.000 1.000 1.001 1.003
\end{verbatim}
\begin{alltt}
\hlkwd{summary}\hlstd{(}\hlkwd{rowSums}\hlstd{(}\hlkwd{exprs}\hlstd{(hl[, hl}\hlopt{$}\hlstd{Replicate} \hlopt{==} \hlnum{2}\hlstd{])))}
\end{alltt}
\begin{verbatim}
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.997 0.999 1.000 1.000 1.001 1.003
\end{verbatim}
\end{kframe}
\end{knitrout}
We see that some features do not add up exactly to 1 due to rounding
errors after exporting to intermediate files. These small deviations
do not bear any consequences here.
\subsubsection*{Combining acquisitions}
The spreadsheet that was used to create the \texttt{hl}
\texttt{MSnSet} included the two replicates within one .csv file. We
also provide individual replicates in the \Biocexptpkg{pRolocdata}
package. Below, we show how to combine \texttt{MSnSet} objects and,
subsequently, how to filter and handle missing values. We start by
loading the \Biocexptpkg{pRolocdata} package and the equivalent
replicates using the \texttt{data} function.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(}\hlstr{"pRolocdata"}\hlstd{)}
\hlkwd{data}\hlstd{(hyperLOPIT2015ms3r1)}
\hlkwd{data}\hlstd{(hyperLOPIT2015ms3r2)}
\end{alltt}
\end{kframe}
\end{knitrout}
At the R prompt, typing
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{pRolocdata}\hlstd{()}
\end{alltt}
\end{kframe}
\end{knitrout}
will list the 75 datasets that are
available in \Biocexptpkg{pRolocdata}.
%% $
Combining data is performed with the \texttt{combine} function. This
function will inspect the feature and sample names to identify how to
combine the data. As we want our replicates to be combined along the
columns (same proteins, different sets of channels), we need to assure
that the respective sample names differ so they can be identified from
one another. The function \texttt{updateSampleNames} can be used do
this.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{identical}\hlstd{(}\hlkwd{sampleNames}\hlstd{(hyperLOPIT2015ms3r1),} \hlkwd{sampleNames}\hlstd{(hyperLOPIT2015ms3r2))}
\end{alltt}
\begin{verbatim}
## [1] TRUE
\end{verbatim}
\begin{alltt}
\hlstd{hyperLOPIT2015ms3r1} \hlkwb{<-} \hlkwd{updateSampleNames}\hlstd{(hyperLOPIT2015ms3r1,} \hlnum{1}\hlstd{)}
\hlstd{hyperLOPIT2015ms3r2} \hlkwb{<-} \hlkwd{updateSampleNames}\hlstd{(hyperLOPIT2015ms3r2,} \hlnum{2}\hlstd{)}
\hlkwd{sampleNames}\hlstd{(hyperLOPIT2015ms3r1)}
\end{alltt}
\begin{verbatim}
## [1] "X126.1" "X127N.1" "X127C.1" "X128N.1" "X128C.1" "X129N.1" "X129C.1"
## [8] "X130N.1" "X130C.1" "X131.1"
\end{verbatim}
\begin{alltt}
\hlkwd{sampleNames}\hlstd{(hyperLOPIT2015ms3r2)}
\end{alltt}
\begin{verbatim}
## [1] "X126.2" "X127N.2" "X127C.2" "X128N.2" "X128C.2" "X129N.2" "X129C.2"
## [8] "X130N.2" "X130C.2" "X131.2"
\end{verbatim}
\end{kframe}
\end{knitrout}
In addition to matching names, the content of the feature metadata for
identical feature annotations must match exactly across the data to be
combined. In particular for these data, we expect the same proteins in
each replicate to be annotated with the same UniProt entry names and
descriptions, but not with the same coverage of number of peptides or
peptide-spectrum matches (PSMs).
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{fvarLabels}\hlstd{(hyperLOPIT2015ms3r1)}
\end{alltt}
\begin{verbatim}
## [1] "EntryName" "ProteinDescription" "Peptides"
## [4] "PSMs" "ProteinCoverage" "markers"
\end{verbatim}
\begin{alltt}
\hlkwd{fvarLabels}\hlstd{(hyperLOPIT2015ms3r2)}
\end{alltt}
\begin{verbatim}
## [1] "EntryName" "ProteinDescription" "Peptides"
## [4] "PSMs" "ProteinCoverage" "markers"
\end{verbatim}
\end{kframe}
\end{knitrout}
Below, we update the replicate specific feature variable names and
remove the shared annotation. In the first line, we update only the
feature variable names 3 to 5 (by appending a \texttt{1}) and in the
second line, we apply the \texttt{updateFvarLabels} function to update
all feature variable names (by appending a \texttt{2}).
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{fvarLabels}\hlstd{(hyperLOPIT2015ms3r1)[}\hlnum{3}\hlopt{:}\hlnum{5}\hlstd{]} \hlkwb{<-} \hlkwd{paste0}\hlstd{(}\hlkwd{fvarLabels}\hlstd{(hyperLOPIT2015ms3r1)[}\hlnum{3}\hlopt{:}\hlnum{5}\hlstd{],} \hlnum{1}\hlstd{)}
\hlstd{hyperLOPIT2015ms3r2} \hlkwb{<-} \hlkwd{updateFvarLabels}\hlstd{(hyperLOPIT2015ms3r2,} \hlstr{"2"}\hlstd{,} \hlkwc{sep} \hlstd{=} \hlstr{""}\hlstd{)}
\hlkwd{fData}\hlstd{(hyperLOPIT2015ms3r1)} \hlkwb{<-} \hlkwd{fData}\hlstd{(hyperLOPIT2015ms3r1)[}\hlnum{1}\hlopt{:}\hlnum{5}\hlstd{]}
\hlkwd{fData}\hlstd{(hyperLOPIT2015ms3r2)} \hlkwb{<-} \hlkwd{fData}\hlstd{(hyperLOPIT2015ms3r2)[}\hlnum{3}\hlopt{:}\hlnum{5}\hlstd{]}
\hlkwd{fvarLabels}\hlstd{(hyperLOPIT2015ms3r1)}
\end{alltt}
\begin{verbatim}
## [1] "EntryName" "ProteinDescription" "Peptides1"
## [4] "PSMs1" "ProteinCoverage1"
\end{verbatim}
\begin{alltt}
\hlkwd{fvarLabels}\hlstd{(hyperLOPIT2015ms3r2)}
\end{alltt}
\begin{verbatim}
## [1] "Peptides2" "PSMs2" "ProteinCoverage2"
\end{verbatim}
\end{kframe}
\end{knitrout}
We can now combine the two experiments into a single \texttt{MSnSet}:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{combined} \hlkwb{<-} \hlstd{MSnbase}\hlopt{::}\hlkwd{combine}\hlstd{(hyperLOPIT2015ms3r1, hyperLOPIT2015ms3r2)}
\hlstd{combined}
\end{alltt}
\begin{verbatim}
## MSnSet (storageMode: lockedEnvironment)
## assayData: 6725 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: X126.1 X127N.1 ... X131.2 (20 total)
## varLabels: Replicate TMT.Reagent ... Iodixonal.Density (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: Q9JHU4 Q9QXS1-3 ... Q9Z2Y3-3 (6725 total)
## fvarLabels: EntryName ProteinDescription ... ProteinCoverage2 (8
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 26754106
## Annotation:
## - - - Processing information - - -
## Combined [5489,10] and [6268,10] MSnSets Tue May 22 16:06:39 2018
## MSnbase version: 2.5.9
\end{verbatim}
\end{kframe}
\end{knitrout}
More details about combining data are given in the dedicated
\textit{Combining MSnSet instances} section of the \Biocpkg{MSnbase}
\href{http://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.pdf}{tutorial
vignette}.
\subsubsection*{Missing data}
Missing data are a recurrent issue in mass spectrometry applications,
and should be addressed independently of this workflow
\cite{Webb-Robertson:2015,Lazar:2016}. In \cite{Gatto:2014b}, we have
described how a high content in missing values in spatial proteomics
data and their inappropriate handling leads to a reduction of
sub-cellular resolution. \st{Missing data can be imputated} \hl{We can impute missing data} using
\Biocpkg{MSnbase}'s \texttt{impute} function. The method underlying
the imputation method is then determined by a \texttt{methods}
parameter (see \texttt{?impute} for available options). To impute
missing values using nearest neighbour imputation, one would
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{hl} \hlkwb{<-} \hlkwd{impute}\hlstd{(hl,} \hlkwc{method} \hlstd{=} \hlstr{"knn"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
In our particular case, missing values are indicative of protein
groups that were not acquired in both replicates (Figure
\ref{fig:namap}).
\begin{figure}[!ht]
\centering
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{image2}\hlstd{(}\hlkwd{is.na}\hlstd{(combined),} \hlkwc{col} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"black"}\hlstd{,} \hlstr{"white"}\hlstd{),}
\hlkwc{main} \hlstd{=} \hlstr{"Missing values (white cells) after combining replicates"}\hlstd{)}
\end{alltt}
\end{kframe}
\includegraphics[width=.65\textwidth]{figure/namap-1}
\end{knitrout}
\caption{Heatmap of missing values. Note that the features are
re-ordered to highlight \hl{clusters} of proteins with similar numbers of
missing values.}
\label{fig:namap}
\end{figure}
We prefer to remove proteins that were not assayed in both replicated
experiments. This is done with the \texttt{filterNA} function that
removes features (proteins) that contain more than a certain
proportion (default is 0) missing values. The \textit{Processing
information} section summarises how combining and filtering missing
values (subsetting) changed the dimensions of the data.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{combined} \hlkwb{<-} \hlkwd{filterNA}\hlstd{(combined)}
\hlstd{combined}
\end{alltt}
\begin{verbatim}
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5032 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: X126.1 X127N.1 ... X131.2 (20 total)
## varLabels: Replicate TMT.Reagent ... Iodixonal.Density (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: Q9JHU4 Q9QXS1-3 ... Q9Z2R6 (5032 total)
## fvarLabels: EntryName ProteinDescription ... ProteinCoverage2 (8
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 26754106
## Annotation:
## - - - Processing information - - -
## Combined [5489,10] and [6268,10] MSnSets Tue May 22 16:06:39 2018
## Subset [6725,20][5032,20] Tue May 22 16:06:39 2018
## Removed features with more than 0 NAs: Tue May 22 16:06:39 2018
## Dropped featureData's levels Tue May 22 16:06:39 2018
## MSnbase version: 2.5.9
\end{verbatim}
\end{kframe}
\end{knitrout}
When more than 2 datasets are to be combined and too many proteins
have not been consistently assayed, leading to too many proteins being
filtered out, we suggest to implement an ensemble of classifiers
voting on protein-sub-cellular niche membership over the output of
several experiments (see section \textit{Supervised machine learning}
for the description of sub-cellular assignments).
\section*{Quality Control}
Data quality is routinely examined through visualisation to verify
that sub-cellular niches have been separated along the gradient. Based
on De Duve's principle \cite{DeDuve:1981} proteins that co-localise in
a cell, exhibit similar quantitation profiles across the gradient
fractions employed. One approach that has been widely used to
visualise and inspect high throughput mass spectrometry-based
proteomics data is principal components analysis (PCA). PCA is one of
many dimensionality reduction methods, that allows one to effectively
summarise multi-dimensional data in to 2 or 3 dimensions to enable
visualisation. Very generally, the original continuous
multi-dimensional data is transformed into a set of orthogonal
components ordered according to the amount of variability that they
describe. The \texttt{plot2D} and \texttt{plot3D} methods in
\Biocpkg{pRoloc} allows one to plot the principal components (PCs) of
a dataset against one another, by default the first two components are
plotted on the x- and y-axis for the \texttt{plot2D} function, and
first three components are loaded for the \texttt{plot3D} function,
respectively (the \texttt{dims} argument can be used to plot other
PCs). If distinct clusters are observed, we assume that there is
organellar separation present in the data. Although, representing the
multi-dimensional data along a limited set of PCs does not give us a
hard quantitative measure of separation, it is extremely useful
summarising complex experimental information in one figure, to get a
simplified overview of the data.
In the code chunk below we produce a 2-dimensional PCA plot of the
mouse stem cell dataset (Figure \ref{fig:pcahl}). Each point on the
plot represents one protein. We can indeed see several distinct
protein clusters. We specify \texttt{fcol = NULL} to ignore feature
metadata columns and not annotate any feature (protein) with a
colour. We will see later how to use this argument to annotate the PCA
plot with prior information about sub-cellular localisation.
\begin{figure}[!ht]
\centering
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(}\hlstr{"pRoloc"}\hlstd{)}
\hlkwd{plot2D}\hlstd{(hl,} \hlkwc{fcol} \hlstd{=} \hlkwa{NULL}\hlstd{,} \hlkwc{col} \hlstd{=} \hlstr{"black"}\hlstd{)}
\hlkwd{plot2D}\hlstd{(hl,} \hlkwc{method} \hlstd{=} \hlstr{"hexbin"}\hlstd{)}
\end{alltt}
\end{kframe}
\includegraphics[width=.45\textwidth]{figure/qcplot-1}
\includegraphics[width=.45\textwidth]{figure/qcplot-2}
\end{knitrout}
\caption{PCA plot of the mouse stem cell data \texttt{hl}. Each dot
represents a single protein, and cluster of proteins represent
proteins residing in the same sub-cellular niche. The figure on the
right bins proteins and represent the bins density to highlight the
presence of protein clusters. }
\label{fig:pcahl}
\end{figure}
In the first instance we advise one to visualise their data without
any annotation (i.e. with \texttt{fcol = NULL}), before proceeding with data
annotation. The identification of well resolved clusters in the data,
constitutes an unbiased assessment of the data structure,
demonstrating the successful separation of sub-cellular clusters.
It is also useful to visualise the relative intensities along the
gradient to identify channels displaying particularly low yield. This
can be done using the \texttt{plotDist} and \texttt{boxplot}
functions, that plot the protein profiles occupancy along the gradient
(we also display the mean channel intensities) and a \texttt{boxplot}
of the column intensities. In the two plots displayed on figure
\ref{fig:qcbx}, we re-order the TMT channels to pair corresponding
channels in the two replicates (rather than ordering the channels by
replicate).
\begin{figure}[!ht]
\centering
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{par}\hlstd{(}\hlkwc{mfrow} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{1}\hlstd{,} \hlnum{2}\hlstd{),} \hlcom{## creates a two-panel figure}
\hlkwc{las} \hlstd{=} \hlnum{2}\hlstd{,} \hlcom{## axis labels orientation}
\hlkwc{cex.axis} \hlstd{=} \hlnum{.7}\hlstd{)} \hlcom{## axis label size}
\hlstd{o} \hlkwb{<-} \hlkwd{order}\hlstd{(hl}\hlopt{$}\hlstd{Iodixonal.Density)}
\hlkwd{plotDist}\hlstd{(hl[, o],} \hlkwc{pcol} \hlstd{=} \hlstr{"#00000010"}\hlstd{,} \hlkwc{xlab} \hlstd{=} \hlstr{""}\hlstd{)}
\hlkwd{lines}\hlstd{(}\hlkwd{colMeans}\hlstd{(}\hlkwd{exprs}\hlstd{(hl[, o])),} \hlkwc{col} \hlstd{=} \hlstr{"red"}\hlstd{,} \hlkwc{type} \hlstd{=} \hlstr{"b"}\hlstd{)}
\hlkwd{boxplot}\hlstd{(}\hlkwd{exprs}\hlstd{(hl[, o]))}
\end{alltt}
\end{kframe}
\includegraphics[width=\textwidth]{figure/qcbx-1}
\end{knitrout}
\caption{Protein profiles and distribution of channel intensities. The
red dots represent the mean relative intensity for each channel. }
\label{fig:q cbx}
\end{figure}
\pagebreak
\section*{Markers}
In the context of spatial proteomics, a marker protein is defined as a
well-known resident of a specific sub-cellular niche in a species
\textit{and} condition of interest. Applying this to machine learning
(ML), and specifically supervised learning, for the task of protein
localisation prediction, these markers constitute the labelled
training data to use as input to a classification analyses. Defining
well-known residents, and obtaining labelled training data for ML
analyses can be time consuming, but it is important to define markers
that are representative of the multivariate data space and on which a
classifier will be trained and generated. \Biocpkg{pRoloc} provides a
convenience function, \texttt{addMarkers}, to directly add markers to
an \texttt{MSnSet} object, as demonstrated in the code chunk
below. These marker sets can be accessed using the
\texttt{pRolocmarkers()} function. Marker sets are stored as a simple
named vector in R, and originate from in-house user-defined sets of
markers or from previous published studies \cite{Gatto:2014b}, which
are continuosly updated and integrated.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{## List available marker sets}
\hlkwd{pRolocmarkers}\hlstd{()}
\end{alltt}
\begin{verbatim}