-
Notifications
You must be signed in to change notification settings - Fork 1
/
AnRCheatsheet.txt
672 lines (530 loc) · 30.9 KB
/
AnRCheatsheet.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
An R Overview & Cheatsheet
==========================
Harry Mangalam <[email protected]>
v1.61, Nov 15, 2014
// export bfp="/home/hjm/nacs"; export fileroot="${bfp}/AnRCheatsheet"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* ${bfp}/baye.p_binned2_s.png ${bfp}/Denatured_Ratio_Distribution_s.png moo:~/public_html;
//update svn from BDUC
// scp ${fileroot}.txt hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to AnRCheatSheet"'
The latest version of this document http://moo.nac.uci.edu/~hjm/AnRCheatsheet.html[will always be found here].
Note that this is an extremely brief and limited introduction to R, with the
examples writ with the UC Irvine BDUC system in mind. There are many more
complete introductory docs, some of the best listed in the section below.
This is only to get your toes wet.
[NOTE]
.Some assumptions
============================================================================
The following tutorial assumes that you're using a Linux system from the
bash shell, you're familiar navigating directories on a Linux system
using *cd*, using the basic shell utilities such as *head, tail, less, and
grep*, and the minimum R system has been installed on your system. If not,
you should peruse a basic introduction to the bash shell. The bash prompt
is shown as *bash %*. The R shell prompt is *>* and R commands will be
prefixed by the *>* to denote it. Inline comments are prefixed by *#* and
can be copied into your shell along with the R or bash commands they
comment - the *#* shields them from being executed, but do NOT copy in the
R or bash shell prompt. For a quick introduction to basic data manipulation
with Linux, please refer to the imaginatively named
http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html[Manipulating
Data on Linux].
============================================================================
R's operational modes
---------------------
R is a programming language designed explicitly for statistical computation.
As such, it has many of the characteristics of a general-purpose language
including iterators, control loops, network and database operations,
many of which are useful, but in general not as easy to use as the more general
http://en.wikipedia.org/wiki/Python_(programming_language)[Python] or
http://en.wikipedia.org/wiki/Perl[Perl]
R can operate in 2 modes, as an interactive interpreter (the R shell) and as a
scripting language, much like Perl or Python. Typically the R shell is used to
try things out and the serial commands written in the R language and saved in a
file are used to automate operations once the sequence is well-defined and debugged.
While it is not a great language for procedural programming, it does excel at
mathematical and statistical manipulations. However, it does so in an odd way,
especially for those who have done procedural programming before. R is quite
'object-oriented' in that you tend to deal with 'data objects' rather than
with individual integers, floats, arrays, etc. The best way to think of data
if you have programmed in 'C' is to think of R data (typically termed 'tables'
or 'frames') as C 'structs', arbitrary constructs that can be dealt with by name.
If you haven't programmed in a procedural language, it may actually be easier
for you. R manipulates chunks of data similar to how you might think of them.
For example, "multiply that column by 3.54" or "pivot that spreadsheet".
For a still-brief but more complete overview of R, see
http://en.wikipedia.org/wiki/R_(programming_language)[Wikipedia's entry for R].
Graphics and Graphical User Interfaces for R
--------------------------------------------
R was developed as a commandline language. However, it has gained progressively
more graphics capabilities and graphical user interfaces (GUIs). Some notable
examples are the de facto standard R GUI http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/[R Commander],
the elegant and powerful interactive graphics utility http://www.ggobi.org[ggobi],
and the fully graphical statistics package http://gretl.sourceforge.net/[gretl]
which was developed external to R for time-series econometrics but which now
supports R as an external module.
In addition, many routines in R are packaged with their own GUIs so that when
called from the commandline interface, the GUI will pop up and allow the user
to interact with a mouse rather than the keyboard.
As opposed to fully integrated commercial applications which have a cohesive
interface, these packages differ slightly in the way that they approach getting
things done, but in general the mechanisms follow generally accepted user
interface conventions.
Getting help on R
-----------------
After starting R, you can usually start R's internal help with:
-----------------------------------------------------------------
bash % R # more comments on R's startup state below
# R startup messages deleted
> help.start().
-----------------------------------------------------------------
This will start a browser page with a lot of links to R documentation, both introductory and advanced.
If that help is not installed, most R help can be found in PDF and HTML at
http://cran.r-project.org/manuals.html[the R documentation page].
There is a very useful 100 page PDF book called:
http://cran.r-project.org/doc/manuals/R-intro.pdf[An Introduction to R]
(also in http://cran.r-project.org/doc/manuals/R-intro.html[HTML]).
Especially note http://cran.r-project.org/doc/manuals/R-intro.html#Introduction-and-preliminaries[Chapter 1]
and http://cran.r-project.org/doc/manuals/R-intro.html#A-sample-session[Appendix A]
on page 78, which is a more sophisticated, (but less gentle) tutorial than this.
Note also that if you have experience in SAS or SPSS, there is a good introduction
written especially with this experience in mind. It has been expanded into a
large book (http://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/0387094172[to be released imminently]),
but much of the core is still available for free as http://oit.utk.edu/scc/RforSAS%26SPSSusers.pdf[R for SAS and SPSS Users]
There is another nice Web page that provides a quick, broad intro to R appropriately
called http://www.personality-project.org/r/[Using R for psychological research: A simple guide to an elegant package]
which manages to compress a great deal of what you need to know about R into about 30 screens of well-cross-linked HTML.
There is also a website that expands on simple tutorials and examples, so after
buzzing thru this very simple example, please continue to the http://www.statmethods.net/[QuickR website]
mailto:[email protected][Thomas Girke] at UC Riverside has a very nice HTML
http://faculty.ucr.edu/\~tgirke/Documents/R_BioCond/R_BioCondManual.html[Introduction to R and Bioconductor]
as part of his Bioinformatics Core, which also runs
http://faculty.ucr.edu/\~tgirke/Workshops.htm[frequent R-related courses] which
may be of interest for the Southern California area.
Start the R interpreter
-----------------------
On the BDUC cluster, we host several versions of R. You if you want a particular
version, you'll have to specify it via the module command:
-----------------------------------------------------------------
bash % module load R/2.12.2
-----------------------------------------------------------------
which causes the module info to be spat out and then you can start R with:
-----------------------------------------------------------------
bash % R # simple, no?
-----------------------------------------------------------------
# if you get the line shown below:
-----------------------------------------------------------------
[Previously saved workspace restored]
-----------------------------------------------------------------
you had saved the previous data environment and if you type
*ls()* you'll see all the previously created variables:
-----------------------------------------------------------------
> ls()
[1] "aa" "data.matrix" "fil_ss" "i" "ma5"
[6] "mn" "mn_tss" "norm.data" "sg5" "ss"
[11] "ss_mn" "sss" "tss" "t_ss" "tss_mn"
[16] "x"
-----------------------------------------------------------------
If the previous session was not saved, there will be no saved
variables in the environment
-----------------------------------------------------------------
> ls()
character(0) # denotes an empty character (= nothing there)
-----------------------------------------------------------------
NB: In many cases the data objects can be manipulated as you would files in the
Unix shell, with the takes-some-getting-used-to of adding the
almost-always-required parens afterwards:
-----------------------------------------------------------------
> ls() # lists all data objects in view
> ls(someframe) # may give additional inforamtion on the frame
> rm(someframe) # deletes the 'someframe' data object
-----------------------------------------------------------------
To determine what an object is, you can use a variety of methods, but one of the
most useful is *str* (get the structure of an abject).
-----------------------------------------------------------------
> str(aa) # aa is a table or data.frame resulting from reading in a 25MB file.
'data.frame': 385238 obs. of 6 variables:
$ ID : Factor w/ 1649 levels "ENSXETG00000000007_SCAFFOLD_1356_47264_67263",..: 1 1 1 1 1 1 1 1 1 1 ...
$ BEGIN : int 1 70 139 232 301 370 439 508 577 658 ...
$ END : int 50 119 188 281 350 419 488 557 626 707 ...
$ Blue : num -0.67 -0.47 0.36 0.42 0.09 0.22 -0.42 0.61 0.71 1.61 ...
$ Red : num 2.8 2.56 2.96 -0.76 -2.6 -1.52 -0.36 -3.24 -1.8 -3.36 ...
$ RedNeg: num -0.7 -0.64 -0.74 0.19 0.65 0.38 0.09 0.81 0.45 0.84 ...
> str(y) # y is a vector of size 50
num [1:50] -0.202 -0.418 0.325 -0.377 -0.405 ...
-----------------------------------------------------------------
You can also use dim() to get the bare dimensions of the data structure. Always
a good idea before you try to view it by referencing the data directly without limits.
-----------------------------------------------------------------
> dim (aa) # aa is a table or data.frame resulting from reading in a 25MB file.
[1] 385238 6
-----------------------------------------------------------------
And to address the individual vectors composing the structure, just use the form 'Structure$Vector'.
ie: 'aa$Blue' refers to the structure 'aa', vector 'Blue' (see above).
Some basic R data types
------------------------
Before we get too far, note that R has a variety of data structures that may be required for different functions.
Some of the main ones are:
* 'numeric' - (aka double) the default double-precision (64bit) float representation for a number.
* 'integer' - a single-precision (32bit) integer.
* 'single' - a single precision (32bit) float
* 'string' - any character string ("harry", "r824_09_hust", "888764"). Note the
last is a number defined as a string so you couldn't add "888764" + 765.
* 'vector' - a series of data types of the same type. (3, 5, 6, 2, 15) is a
vector of numerics. ("have", "third", "start", "grep) is a vector of strings.
* 'matrix' - an array of identical data types - all numerics, all strings, all booleans, etc.
* 'data.frame' - a table (another name for it) or array of mixed data types.
A column of strings, a column of integers, a column of booleans, 3 columns of numerics.
* 'list' - a concatenated series of data objects.
Loading data into R
-------------------
This process is described in much more detail in the document
http://cran.r-project.org/doc/manuals/R-data.pdf[R Data Import/Export],
but this will give you a short example of one of the most popular ways of loading data into R.
The following reads the example data file above (red+blue_all.txt) into an R table
called 'aa'; note the ',header=TRUE' option which uses the file
header line to name the columns. In order for this to work, the column
headers have to be separated by the same delimiter as the data (usually
a space or a tab).
-----------------------------------------------------------------
> aa <- read.table(file="red+blue_all.txt",header=TRUE) # try it
-----------------------------------------------------------------
I have run into situations where 'read.table()' would simply not read in a file which I had validated with an external Perl script:
-----------------------------------------------------------------
> em <- read.table(file="long_complicated_data_file",header=TRUE,sep="\t",comment.char="#") # try it
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 134 did not have 29 elements
-----------------------------------------------------------------
In these cases, using 'read.delim()' may be more successful:
-----------------------------------------------------------------
em <- read.delim(file="long_complicated_data_file",na.strings = "", fill=TRUE, header=T, sep="\t")
>
-----------------------------------------------------------------
(see also http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#R_read_write[Thomas Girke's excellent R/BioC Guide].)
Note that in R, you can work left to right or right to left,
altho the left pointing arrow is the usual syntax.
The above command could also have been given as :
-----------------------------------------------------------------
> read.table(file="red+blue_all.txt",header=TRUE) -> aa # try it
-----------------------------------------------------------------
Did we get the col names labeled correctly?
-----------------------------------------------------------------
> names(aa) # try it
[1] "ID" "BEGIN" "END" "Blue" "Red"
-----------------------------------------------------------------
Many functions will require the data as a matrix (a data structure whose components
are identical data types, usually strings. The following will load the file
into a matrix, doing the conversions along the way. This command also shows the
use of the *sep="\t"* option which explicitly sets the data delimiter to
the *TAB* character.
-----------------------------------------------------------------
Data <- as.matrix(read.table("red+blue_all.txt", sep="\t", header=TRUE))
-----------------------------------------------------------------
Viewing data
------------
To view some of the table, we only need to reference it. R assumes that a naked
variable is a request to view it. An R table is referenced in [rows,column] order.
That is, the index is [row indices, col indices], so that a reference like
aa[1:6,1:5]
will show a 'square' window of the table; rows 1-6, columns 1-5
Also: R parameters are 1-indexed, not 0-indexed, tho it seems pretty forgiving (if
you ask for [0-6,0-5], R will assume you meant [1-6,1-5].
also note that R requires acknowledging that the data structures are n-dimensional:
to see rows 1-11, you can't just type:
-----------------------------------------------------------------
> aa[1:11] # try it!
-----------------------------------------------------------------
but must instead type:
-----------------------------------------------------------------
> aa[1:11,] # another dimension implied by the ','
-----------------------------------------------------------------
To slice the table even further, you can view only those elements that interest you
-----------------------------------------------------------------
# R will provide row and column headers if they were provided.
# if they weren't provided, it will provide column indices.
> aa[3:12,4:5] # try it.
-----------------------------------------------------------------
[[hdf5]]
== Using HDF5 files with R
http://www.hdfgroup.org/HDF5/[HDF5] (see also the http://goo.gl/26dvHq[Wikipedia entry])
is a very sophisticated format for data which is being used increasingly in
domains that were not previously using it, chiefly in Financial Record Archiving,
and Bioinformatics.
As the name suggests, internally, it is a hierarchically organized 'hyperslab',
a structure of multi-dimensional arrays of the same data type (ints, floats, bools, etc),
sort of like the file equivalent of a http://goo.gl/EMgF[C struct]. There can be
multiple different such arrays but each contains data of the same type.
R supports reading and writing HDF5 files with the
http://goo.gl/RtyXfl[rhdf5 library], among others.
=== Reading HDF5 files into R
If you are only going to be reading HDF5 files provided by others, the approach is very easy:
-----------------------------------------------------------------
# assuming the rhdf5 library is installed
> library(rhdf5)
> h5 <- h5read("SVM_noaa_ops.h5") # a 12MB HDF5 file
> h5ls("SVM_noaa_ops.h5") # describe the structure of the HDF5 file
group name otype dclass
0 / All_Data H5I_GROUP
1 /All_Data VIIRS-M1-SDR_All H5I_GROUP
2 /All_Data/VIIRS-M1-SDR_All ModeGran H5I_DATASET INTEGER
3 /All_Data/VIIRS-M1-SDR_All ModeScan H5I_DATASET INTEGER
4 /All_Data/VIIRS-M1-SDR_All NumberOfBadChecksums H5I_DATASET INTEGER
5 /All_Data/VIIRS-M1-SDR_All NumberOfDiscardedPkts H5I_DATASET INTEGER
6 /All_Data/VIIRS-M1-SDR_All NumberOfMissingPkts H5I_DATASET INTEGER
7 /All_Data/VIIRS-M1-SDR_All NumberOfScans H5I_DATASET INTEGER
8 /All_Data/VIIRS-M1-SDR_All PadByte1 H5I_DATASET INTEGER
9 /All_Data/VIIRS-M1-SDR_All QF1_VIIRSMBANDSDR H5I_DATASET INTEGER
10 /All_Data/VIIRS-M1-SDR_All QF2_SCAN_SDR H5I_DATASET INTEGER
11 /All_Data/VIIRS-M1-SDR_All QF3_SCAN_RDR H5I_DATASET INTEGER
12 /All_Data/VIIRS-M1-SDR_All QF4_SCAN_SDR H5I_DATASET INTEGER
13 /All_Data/VIIRS-M1-SDR_All QF5_GRAN_BADDETECTOR H5I_DATASET INTEGER
14 /All_Data/VIIRS-M1-SDR_All Radiance H5I_DATASET INTEGER
15 /All_Data/VIIRS-M1-SDR_All RadianceFactors H5I_DATASET FLOAT
16 /All_Data/VIIRS-M1-SDR_All Reflectance H5I_DATASET INTEGER
17 /All_Data/VIIRS-M1-SDR_All ReflectanceFactors H5I_DATASET FLOAT
18 / Data_Products H5I_GROUP
19 /Data_Products VIIRS-M1-SDR H5I_GROUP
20 /Data_Products/VIIRS-M1-SDR VIIRS-M1-SDR_Aggr H5I_DATASET REFERENCE
21 /Data_Products/VIIRS-M1-SDR VIIRS-M1-SDR_Gran_0 H5I_DATASET REFERENCE
dim
0
1
2 1
3 48
4 48
5 48
6 48
7 1
8 3
9 3200 x 768
10 48
11 48
12 768
13 16
14 3200 x 768
15 2
16 3200 x 768
17 2
18
19
20 16
21 16
# in a slightly different format, 'h5dump' provides similar information:
> h5dump(file="SVM_noaa_ops.h5", load=FALSE)
$All_Data
$All_Data$`VIIRS-M1-SDR_All`
$All_Data$`VIIRS-M1-SDR_All`$ModeGran
group name otype dclass dim
1 / ModeGran H5I_DATASET INTEGER 1
$All_Data$`VIIRS-M1-SDR_All`$ModeScan
group name otype dclass dim
1 / ModeScan H5I_DATASET INTEGER 48
... <etc>
-----------------------------------------------------------------
Other R data manipulations are fairly easy as well.
-----------------------------------------------------------------
# you can read parts of the HDF5 hyperslab into an R variable as well
# the following reads a subgroup into an R object
> h5_aa <- h5read("SVM_noaa_ops.h5", "/All_Data/VIIRS-M1-SDR_All")
# and this is what h5_aa contains.
> names(h5_aa)
[1] "ModeGran" "ModeScan" "NumberOfBadChecksums"
[4] "NumberOfDiscardedPkts" "NumberOfMissingPkts" "NumberOfScans"
[7] "PadByte1" "QF1_VIIRSMBANDSDR" "QF2_SCAN_SDR"
[10] "QF3_SCAN_RDR" "QF4_SCAN_SDR" "QF5_GRAN_BADDETECTOR"
[13] "Radiance" "RadianceFactors" "Reflectance"
[16] "ReflectanceFactors"
# see each sub-object by referencing it in the standard R way:
> h5_aa$NumberOfScans
[1] 48
-----------------------------------------------------------------
=== Writing HDF5 files from R
Writing HDF5 files is slightly more difficult, but only because you have to be
careful that you're writing to the right group and that there is congruence between
R and HDF5 data types and attributes. The http://goo.gl/RtyXfl[rhdf5 docs] have
good examples for this.
[[ggobi]]
Visualizing Data with ggobi
---------------------------
While there are many ways in R to http://www.statmethods.net/graphs/index.html[plot your data],
one unique benefit that R provides is its interface to http://www.ggobi.org[ggobi],
an advanced visualization tool for multivariate data, which for the sake of
argument is more than 3 variables.
Operations on Data
------------------
To operate on a column, ie: to multiply it by some value, you only need to reference the
column ID, not each element. This is a key idea behind R's object-oriented approach.
The following multiplies the column 'Red' of the table 'aa' by -1 and stores the result
in the column 'RedNeg. There's no need to pre-define the RedNeg column; R will
allocate space for it as needed.
-----------------------------------------------------------------
> aa$RedNeg = aa$Red * -1 # try it
# the table now has a new column called 'RedNeg'
> aa[1:4,] # try it
-----------------------------------------------------------------
If you want to have the col overwritten by the new value, simply use the original
col name as the left-hand value. You can also combine statements by separating them
with a ';'
-----------------------------------------------------------------
> aa$Red = aa$Red * -2; aa[1:4,] # mul aa$Red by -2 and print the 1st 4 lines
-----------------------------------------------------------------
To transpose (aka pivot) a table:
-----------------------------------------------------------------
> aacopy <- aa # copy the table
> aac_t <- t(aacopy) # transpose the table
# show the all the rows & 1st 3 columns of the transposed table
> aac_t[,1:3]
-----------------------------------------------------------------
To bin a vector and visualize the binning:
-----------------------------------------------------------------
> bp <- read.table(file="just.bayesp",header=TRUE) # 36K fp values
> hist(bp[1:36000,],breaks=50,plot="True")
# see result below
-----------------------------------------------------------------
.Result of binning 36K points with hist()
image:baye.p_binned2_s.png[binned data]
To draw a histogram of a data vector and
http://www.stat.wisc.edu/~xie/smooth_spline_tutorial.html[spline]
the distribution curve, using the file 'MS21_Denat_C_Native_E.txt'
-----------------------------------------------------------------
# read the file into a dataframe
$ cat MS21_Denat_C_Native_E.txt
Denat Native
1.83 4.80
1.78 3.95
1.75 3.72
1.61 3.60
1.60 3.32
...
# start R
$ R
> bth <- read.table(file="MS21_Denat_C_Native_E.txt",header=TRUE,sep="\t",comment.char="#")
Check it
> str(bth)
'data.frame': 233 obs. of 2 variables:
$ Denat : num 1.83 1.78 1.75 1.61 1.6 1.6 1.58 1.55 1.51 1.5 ...
$ Native: num 4.8 3.95 3.72 3.6 3.32 3.23 3.1 2.9 2.7 2.66 ...
# Create a histogram for the 'Denat' column. The following cmd not only plots the histogram
# but also populates the 'hhde' histogram object.
# breaks = how many bins there should be (controls the size of the bins
# plot = to plot or not to plot
# labels = if T, each bar is labeled with the numeric value
# main, xlab, ylab = plot title, axis labels
# col = color of hist bars
# ylim, xlim = axis limits; this was set to allow another histogram to fit inside
# the defined limits; if not set, the limits will be those of the data.
> hhde <-hist(bth$Denat,breaks=16,plot="True",labels=F, main="Ratio Distribution (Denatured)",xlab="Ratio",ylab="Frequency", col="lightblue",ylim=c(0,70), xlim=c(0,5) )
# now plot a smoothed distribution using the hhde$mids (x) and hhde$counts (y)
# df = degrees of freedom, how much smoothing is applied;
# (how closely the line will follow the values)
> lines(sm.spline(hhde$mids,hhde$counts, df=10), lty=1, col = "green")
-----------------------------------------------------------------
.The image generated by the above sequence is shown below.
image:Denatured_Ratio_Distribution_s.png[Ratio Histogram with Distribution curve]
Exporting Data
--------------
Ok, you've imported your data, analyzed it, generated some nice plots and dataframes.
How do you export it to a file so it can be read by other applications?
As you might expect, R provides some excellent documentation for this as well.
The previously noted document http://cran.r-project.org/doc/manuals/R-data.pdf[R Data Import/Export]
describes it well. It's definitely worthwhile reading it in full, but the very
short version is that the most frequently used export function is to write
dataframes into tabular files, essentially the reverse of reading Excel spreadsheets, b
ut in text form. For this, we most often use the *write.table()* function.
Using the *aa* dataframe already created, let's output some of it:
-----------------------------------------------------------------
> aa[1:3,] # lets see what it looks like 'raw'
ID BEGIN END Blue Red
1 ENSXETG00000000007_SCAFFOLD_1356_47264_67263 1 50 -0.67 0.70
2 ENSXETG00000000007_SCAFFOLD_1356_47264_67263 70 119 -0.47 0.64
3 ENSXETG00000000007_SCAFFOLD_1356_47264_67263 139 188 0.36 0.74
> write.table(aa[1:3,])
"ID" "BEGIN" "END" "Blue" "Red"
"1" "ENSXETG00000000007_SCAFFOLD_1356_47264_67263" 1 50 -0.67 0.7
"2" "ENSXETG00000000007_SCAFFOLD_1356_47264_67263" 70 119 -0.47 0.64
"3" "ENSXETG00000000007_SCAFFOLD_1356_47264_67263" 139 188 0.36 0.74
# in the write.table() format above, it's subtly different. Note the quoting and spacing.
# You can address all these in the various arguments to write.table(), for example:
> write.table(aa[1:3,], quote = FALSE, sep = "\t",row.names = FALSE,col.names = TRUE)
ID BEGIN END Blue Red
ENSXETG00000000007_SCAFFOLD_1356_47264_67263 1 50 -0.67 0.7
ENSXETG00000000007_SCAFFOLD_1356_47264_67263 70 119 -0.47 0.64
ENSXETG00000000007_SCAFFOLD_1356_47264_67263 139 188 0.36 0.74
# note that I omitted a file specification to dump the output to the screen. To send
# the same output to a file, you just have to specify it:
> write.table(aa[1:3,], file="/path/to/the/file.txt", quote = FALSE, sep = "\t",row.names = FALSE,col.names = TRUE)
# play around with the arguments, reversing or modifying the argument values.
# the entire documentation for write.table is available from the commandline with
# '?write.table'
-----------------------------------------------------------------
Basic Statistics
----------------
Now, let's get some basic descriptive stats out of the original table.
1st, load the lib that has the functions we'll need
-----------------------------------------------------------------
> library(pastecs) # load the lib that has the correct functions
> attach(aa) # make the Red & Blue columns usable as variables.
# following line combines an R data structure by column.
# type 'help(cbind)' or '?cbind' for more info on cbind, rbind
> redblue <- cbind(Red,Blue)
# the following calculates a table of descriptive stats for both the Red & Blue variables
> stat.desc(redblue)
Red Blue
nbr.val 3.852380e+05 3.852380e+05
nbr.null 2.486000e+03 2.188000e+03
nbr.na 0.000000e+00 0.000000e+00
min -1.504000e+01 -5.130000e+00
max 2.256000e+01 5.340000e+00
range 3.760000e+01 1.047000e+01
sum -6.929548e+04 3.060450e+03
median 4.000000e-02 0.000000e+00
mean -1.798771e-01 7.944310e-03
SE.mean 3.826047e-03 1.115451e-03
CI.mean.0.95 7.498938e-03 2.186250e-03
var 5.639358e+00 4.793247e-01
std.dev 2.374733e+00 6.923328e-01
coef.var -1.320198e+01 8.714826e+01
-----------------------------------------------------------------
You can also feed column numbers to 'stat.desc' to generate similar results:
-----------------------------------------------------------------
stat.desc(some_table[2:13])
stress_ko_n1 stress_ko_n2 stress_ko_n3 stress_wt_n1 stress_wt_n2
nbr.val 3.555700e+04 3.555700e+04 3.555700e+04 3.555700e+04 3.555700e+04
nbr.null 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
min 1.013390e-04 1.041508e-04 1.003806e-04 1.013399e-04 1.014944e-04
max 2.729166e+04 2.704107e+04 2.697153e+04 2.664553e+04 2.650478e+04
range 2.729166e+04 2.704107e+04 2.697153e+04 2.664553e+04 2.650478e+04
sum 3.036592e+07 2.995104e+07 2.977157e+07 3.054361e+07 3.055979e+07
median 2.408813e+02 2.540591e+02 2.377578e+02 2.396799e+02 2.382967e+02
mean 8.540068e+02 8.423388e+02 8.372915e+02 8.590042e+02 8.594593e+02
SE.mean 8.750202e+00 8.608451e+00 8.599943e+00 8.810000e+00 8.711472e+00
CI.mean.0.95 1.715066e+01 1.687283e+01 1.685615e+01 1.726787e+01 1.707475e+01
var 2.722458e+06 2.634967e+06 2.629761e+06 2.759796e+06 2.698411e+06
std.dev 1.649987e+03 1.623258e+03 1.621654e+03 1.661263e+03 1.642684e+03
coef.var 1.932054e+00 1.927085e+00 1.936785e+00 1.933941e+00 1.911300e+00
<etc>
-----------------------------------------------------------------
There is also a basic function called 'summary', which will give summary stats on
all components of a dataframe.
-----------------------------------------------------------------
> summary(aa)
ID BEGIN
ENSXETG00000004313_SCAFFOLD_317_890727_910726 : 289 Min. : 1
ENSXETG00000010135_SCAFFOLD_646_44207_64206 : 289 1st Qu.: 4875
ENSXETG00000017293_SCAFFOLD_132_1100669_1120668: 289 Median : 9722
ENSXETG00000020596_SCAFFOLD_474_677230_697229 : 289 Mean : 9781
ENSXETG00000021861_SCAFFOLD_101_1701655_1721654: 289 3rd Qu.:14639
ENSXETG00000023499_SCAFFOLD_486_295358_315357 : 289 Max. :19950
(Other) :383504
END Blue Red RedNeg
Min. : 50 Min. :-5.130000 Min. :-15.0400 Min. :-5.64000
1st Qu.: 4924 1st Qu.:-0.470000 1st Qu.: -1.7600 1st Qu.:-0.39000
Median : 9771 Median : 0.000000 Median : 0.0400 Median :-0.01000
Mean : 9830 Mean : 0.007944 Mean : -0.1799 Mean : 0.04497
3rd Qu.:14688 3rd Qu.: 0.480000 3rd Qu.: 1.5600 3rd Qu.: 0.44000
Max. :19999 Max. : 5.340000 Max. : 22.5600 Max. : 3.76000
-----------------------------------------------------------------
The commercial Revolution R is free to Academics
------------------------------------------------
Recently a company called Revolution Analytics has started to commercialize R,
smoothing off the rough edges, providing a GUI, improving the memory management,
speeding up some core routines. If you're interested in it, you can
http://www.revolutionanalytics.com/downloads/[download it from here].