Commit dde5048

New R notebooks to replace the Day 1 and 2 slides
1 parent a0a9a63 commit dde5048

42 files changed

+10338
-5585
lines changed
Binary file not shown.
Binary file not shown.

Session1.1-intro.Rmd

+643
Large diffs are not rendered by default.

Session1.1-intro.nb.html

+1,176
Large diffs are not rendered by default.

Session1.2-data-structures.Rmd

+404
Large diffs are not rendered by default.

Session1.2-data-structures.nb.html

+1,111
Large diffs are not rendered by default.

Session1.3-walkthrough.Rmd

+244
@@ -0,0 +1,244 @@
1+
---
2+
title: "Introduction to Solving Biological Problems Using R - Day 2"
3+
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
4+
Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
5+
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
6+
output: html_notebook
7+
---
8+
9+
# 3. R for data analysis
10+
11+
## 3 steps to Basic Data Analysis
12+
13+
- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:
14+
15+
1. Reading in data
16+
+ `read.table()`
17+
+ `read.csv(), read.delim()`
18+
2. Analysis
19+
+ Manipulating & reshaping the data
20+
+ Any maths you like
21+
+ Plotting the outcome
22+
3. Writing out results
23+
+ `write.table()`
24+
+ `write.csv()`
25+
26+
## A simple walkthrough
27+
28+
- 50 neuroblastoma patients were tested for NMYC gene copy number by interphase nuclei FISH:
29+
+ Amplification of NMYC correlates with worse prognosis
30+
+ We have count data
31+
+ Numbers of cells per patient assayed
32+
+ For each we have NMYC copy number relative to base ploidy
33+
- We need to determine which patients have amplifications
34+
+ (i.e. > 33% of cells show NMYC amplification)
35+
36+
## The Working Directory (wd)
37+
38+
39+
- Like many programs, R has the concept of a working directory
40+
- It is the place where R will look for files to execute and where it will
41+
save files, by default
42+
- For this course we need to set the working directory to the location
43+
of the course scripts
44+
- In RStudio use the mouse and browse to the directory where you saved the Course Materials
45+
46+
- ***Session → Set Working Directory → Choose Directory...***
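
The menu action above has a script equivalent, `setwd()`. The following is a minimal sketch only: `tempdir()` is a stand-in for the course folder, since the real path depends on where you saved the materials.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is only a stand-in; replace it with the path to your course materials
setwd(tempdir())
getwd()

# Return to the original working directory
setwd(old_wd)
```

Setting the directory in a script rather than through the menu makes the analysis easier to re-run later.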
47+
48+
## 0. Locate the data
49+
50+
Before we even start the analysis, we need to be sure of where the data are located on our hard drive
51+
52+
- Functions that import data need a file location as a character vector
53+
- The default location is the ***working directory***
54+
```{r}
55+
getwd()
56+
```
57+
58+
- If the file you want to read is in your working directory, you can just use the file name
59+
```{r}
60+
list.files()
61+
```
62+
63+
- Otherwise you need the *path* to the file
64+
+ you can get this using **`file.choose()`**
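
A path can also be built in code instead of chosen interactively. In this sketch the folder name `Course_Materials` is hypothetical and used for illustration only; `file.path()`, `file.exists()` and `normalizePath()` are base R functions.

```r
# "Course_Materials" is a hypothetical folder name, not one defined by the course
data_path <- file.path("Course_Materials", "countData.txt")
data_path

# file.exists() returns TRUE only if R can find a file at that path
file.exists(data_path)

# normalizePath() shows the absolute path that a relative path resolves to
normalizePath(".")
```

Checking `file.exists()` before importing gives a clearer signal than the error raised by a failed read.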
65+
66+
## 1. Read in the data
67+
68+
- The data are stored in a tab-delimited text file: each row is a record, each column is a field, and columns are separated by tabs
69+
- We need to read in the results and assign them to an object (`rawData`)
70+
71+
```{r}
72+
rawData <- read.delim("countData.txt")
73+
```
74+
75+
In the latest RStudio, there is also the option to import data directly from the File menu: ***File*** → ***Import Dataset*** → ***From CSV***
76+
77+
- If the data is comma-separated, then use either the argument `sep=","` or the function `read.csv()`:
78+
```{r}
79+
read.csv("countData.csv")
80+
```
81+
- For the full list of arguments:
82+
```{r}
83+
?read.table
84+
```
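
To illustrate that `read.csv()` is `read.table()` with different defaults, the sketch below reads the same made-up text both ways. The `text =` argument supplies the content inline, so no file is needed; the column values are invented for illustration.

```r
# Made-up comma-separated text, standing in for a file on disk
csv_text <- "Patient,Nuclei,NB_Amp\n1,50,5\n2,60,30\n"

# read.csv() assumes a header row and a comma separator
from_csv <- read.csv(text = csv_text)

# read.table() needs both to be stated explicitly
from_table <- read.table(text = csv_text, header = TRUE, sep = ",")

# Compare the two results for this input
identical(from_csv, from_table)
```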
85+
86+
## 1b. Check the data
87+
- *Always* check the object to make sure the contents and dimensions are as you expect
88+
- R will sometimes create the object without error, but the contents may be unusable for analysis
89+
+ If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column
90+
91+
```{r}
92+
# View the first 10 rows to ensure import is OK
93+
rawData[1:10,]
94+
```
95+
96+
97+
- or use the `View()` function to get a display of the data in RStudio:
98+
```{r}
99+
View(rawData)
100+
```
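
The single-column symptom described above is easy to reproduce. This sketch uses made-up tab-delimited text (not the course data) and reads it once with the wrong function and once with the right one.

```r
# Made-up tab-delimited text, standing in for a file on disk
tab_text <- "Patient\tNuclei\tNB_Amp\n1\t50\t5\n2\t60\t30\n"

# Wrong separator: read.csv() looks for commas, finds none, and returns one column
wrong <- read.csv(text = tab_text)
dim(wrong)

# Right separator: read.delim() splits on tabs and recovers all three columns
right <- read.delim(text = tab_text)
dim(right)
```

Neither call raises an error, which is why checking `dim()` straight after import is worthwhile.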
101+
102+
## 1c. Understanding the object
103+
104+
- Once we have read the data successfully, we can start to interact with it
105+
- The object we have created is a *data frame*:
106+
```{r}
107+
class(rawData)
108+
```
109+
110+
- We can query the dimensions:
111+
112+
```{r}
113+
ncol(rawData)
114+
nrow(rawData)
115+
dim(rawData)
116+
```
117+
118+
- Or the structure of an object:
119+
+ TIP: In the RStudio *Environment* pane, click the blue arrow to the left of an object's name to see the object's structure
120+
```{r}
121+
str(rawData)
122+
```
123+
124+
## 1c. Understanding the object
125+
126+
- The names of the columns are assigned automatically from the header row of the file:
127+
128+
```{r}
129+
colnames(rawData)
130+
```
131+
132+
- We can use any of these names to access a particular column:
133+
+ and create a vector
134+
+ TOP TIP: type the name of the object followed by `$` and hit TAB: you can select the column from the drop-down list!
135+
```{r}
136+
rawData$Nuclei
137+
```
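
Besides `$`, a column can be extracted by name with `[[ ]]` or with `[ , ]`. The sketch below uses a small made-up data frame; only the column names mirror those in this walkthrough.

```r
# A small made-up data frame; the values are invented for illustration
demo <- data.frame(Nuclei = c(50, 60, 40), NB_Amp = c(5, 30, 2))

# Three ways to extract one column as a vector
by_dollar  <- demo$Nuclei
by_bracket <- demo[["Nuclei"]]
by_index   <- demo[, "Nuclei"]

# Check that all three give the same vector
identical(by_dollar, by_bracket)
identical(by_dollar, by_index)
```

The bracket forms are useful when the column name is itself stored in a variable.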
138+
139+
## Word of caution
140+
141+
142+
![](images/tolstoy.jpg)
143+
144+
145+
146+
![](images/hadley.jpg)
147+
148+
> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
149+
150+
## Word of caution
151+
152+
You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider whether your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on common pitfalls and how to avoid them:
153+
154+
- http://www.datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html
155+
- http://kbroman.org/dataorg/
156+
157+
## Handling missing values
158+
159+
- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection
160+
- `NA` is a special value that can be present in objects of any type (logical, character, numeric, etc.)
161+
- `NA` is not the same as `NULL`:
162+
- `NULL` is an empty R object.
163+
- `NA` is one missing value within an R object (like a data frame or a vector)
164+
- Often R functions will handle `NA`s gracefully:
165+
166+
```{r}
167+
x <- c(1, NA, 3)
168+
length(x)
169+
```
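
The difference between `NA` and `NULL` shows up in how they are counted. A minimal sketch using only base R:

```r
x <- c(1, NA, 3)

# NA occupies a position in the vector, so length() counts it
length(x)
is.na(x)

# NULL is an empty object: it has length zero
length(NULL)
is.null(NULL)

# Combining NULL into a vector adds nothing, whereas NA adds a missing element
length(c(1, NULL, 3))
```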
170+
171+
## Handling missing values
172+
173+
- However, sometimes we have to tell the functions what to do with them.
174+
- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them:
175+
176+
177+
```{r}
178+
mean(x, na.rm = TRUE)
179+
180+
mean(na.omit(x))
181+
```
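
To see why `na.rm` matters, compare the result with and without it. This sketch reuses the same three-element vector; `is.na()` and `sum()` are base R.

```r
x <- c(1, NA, 3)

# Without na.rm, a single missing value makes the result NA
mean(x)

# Count the missing values before deciding how to handle them
sum(is.na(x))

# With na.rm = TRUE, the mean is taken over the remaining values only
mean(x, na.rm = TRUE)
```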
182+
183+
## 2. Analysis (reshaping data and maths)
184+
185+
- Our analysis involves identifying patients with > 33% NB amplification
186+
+ we can use the **`which()`** function to select indices from a logical vector that are `TRUE`
187+
188+
```{r}
189+
# Proportion of cells showing NB amplification, per patient:
190+
prop <- rawData$NB_Amp / rawData$Nuclei
191+
192+
```
193+
194+
```{r}
195+
prop > 0.33
196+
```
197+
198+
```{r}
199+
amp <- which(prop > 0.33)
200+
amp
201+
```
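
Because the data contain `NA` values, it helps to know how `which()` treats them. The sketch below uses made-up proportions (not the course data) to compare `which()` with direct logical indexing.

```r
# Made-up proportions, including one missing value
prop_demo <- c(0.10, 0.50, NA, 0.40)

# The comparison propagates the NA
is_amp <- prop_demo > 0.33
is_amp

# which() keeps only the positions that are TRUE, dropping both FALSE and NA
which(is_amp)

# Indexing with the logical vector directly keeps an NA entry instead
prop_demo[is_amp]
prop_demo[which(is_amp)]
```

Using `which()` before subsetting therefore avoids rows of `NA` appearing in the selected results.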
202+
203+
204+
## 2. Analysis (reshaping data and maths)
205+
206+
- We can plot a simple chart of the % NB amplification
207+
+ Note that two samples are amplified
208+
+ Plotting will be covered in detail shortly
209+
210+
```{r}
211+
plot(prop, ylim=c(0,1))
212+
# Add a horizontal line:
213+
abline(h=0.33)
214+
```
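
One optional refinement is to colour the points that exceed the threshold. This is a sketch with made-up proportions for five hypothetical patients; the colours, axis labels and line style are illustrative choices rather than part of the course code.

```r
# Made-up proportions for five hypothetical patients
prop_demo <- c(0.05, 0.10, 0.62, 0.20, 0.45)
is_amp <- prop_demo > 0.33

# Colour points above the threshold red and the rest black
plot(prop_demo, ylim = c(0, 1), pch = 19,
     col = ifelse(is_amp, "red", "black"),
     xlab = "Patient index", ylab = "Proportion of cells amplified")

# Dashed horizontal line at the 33% threshold
abline(h = 0.33, lty = 2)
```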
215+
216+
## 3. Outputting the results
217+
218+
- We write out a data frame of results (patients > 33% NB amplification) as a 'comma separated values' text file (CSV):
219+
```{r}
220+
write.csv(rawData[amp,], file="selectedSamples.csv")
221+
```
222+
- The output file is directly readable by Excel
223+
- It's often helpful to double-check where the data have been saved. `getwd()` prints the working directory and `list.files()` lists its contents:
224+
225+
```{r}
226+
getwd() # print working directory
227+
list.files() # list files in working directory
228+
229+
```
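
By default `write.csv()` also writes the row names as an unnamed first column. The sketch below shows how `row.names = FALSE` omits it, and reads the file back as a check. The data frame is made up, and `tempfile()` provides a throwaway path so that nothing is written to the course folder.

```r
# A small made-up data frame standing in for the selected samples
selected <- data.frame(Patient = c(3, 7), Nuclei = c(60, 45), NB_Amp = c(30, 20))

# tempfile() gives a throwaway path; use a real file name in your own analysis
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE omits the extra first column of row names
write.csv(selected, file = out_file, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(out_file)
dim(check)
colnames(check)
```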
230+
231+
## Exercise:
232+
233+
- Patients are *near normal* if:
234+
`(NB_Amp / Nuclei < 0.33 & NB_Del == 0)`
235+
- Modify the condition in our previous code to find these patients
236+
- Write out a results file of the samples that match these criteria, and open it in a spreadsheet program
237+
238+
239+
```{r}
240+
### Your Answer Here ###
241+
242+
243+
```
244+

Session1.3-walkthrough.nb.html

+594
Large diffs are not rendered by default.
