| 1 | +--- |
| 2 | +title: "Introduction to Solving Biological Problems Using R - Day 2" |
| 3 | +author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, |
| 4 | + Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts |
| 5 | +date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' |
| 6 | +output: html_notebook |
| 7 | +--- |
| 8 | + |
| 9 | +# 3. R for data analysis |
| 10 | + |
| 11 | +##3 steps to Basic Data Analysis |
| 12 | + |
| 13 | +- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline: |
| 14 | + |
| 15 | +1. Reading in data |
| 16 | + + `read.table()` |
| 17 | + + `read.csv(), read.delim()` |
| 18 | +2. Analysis |
| 19 | + + Manipulating & reshaping the data |
| 20 | + + Any maths you like |
| 21 | + + Plotting the outcome |
| 22 | +3. Writing out results |
| 23 | + + `write.table()` |
| 24 | + + `write.csv()` |
| 25 | + |
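The three steps can be sketched end to end on a small made-up table. This is only an illustration: the `Patient` column, the values and the temporary file names are assumptions, not the course data.

```r
# Hypothetical toy data written to a temporary tab-delimited file
infile <- tempfile(fileext = ".txt")
toy <- data.frame(Patient = 1:4,
                  Nuclei  = c(60, 50, 40, 80),
                  NB_Amp  = c(3, 30, 0, 8))
write.table(toy, file = infile, sep = "\t", row.names = FALSE, quote = FALSE)

# 1. Read in the data
dat <- read.delim(infile)

# 2. Analysis: proportion of amplified cells per patient, then filter
dat$Prop <- dat$NB_Amp / dat$Nuclei
selected <- dat[dat$Prop > 0.33, ]

# 3. Write out the results
outfile <- tempfile(fileext = ".csv")
write.csv(selected, file = outfile, row.names = FALSE)
selected
```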
| 26 | +##A simple walkthrough |
| 27 | + |
| 28 | +- 50 neuroblastoma patients were tested for NMYC gene copy number by interphase nuclei FISH: |
| 29 | + + Amplification of NMYC correlates with worse prognosis |
| 30 | + + We have count data |
| 31 | + + Numbers of cells per patient assayed |
| 32 | + + For each we have NMYC copy number relative to base ploidy |
| 33 | +- We need to determine which patients have amplifications |
| 34 | + (i.e. > 33% of cells show NMYC amplification) |
| 35 | + |
| 36 | +##The Working Directory (wd) |
| 37 | + |
| 38 | + |
| 39 | +- Like many programs, R has a concept of a working directory |
| 40 | +- It is the place where R will look for files to execute and where it will |
| 41 | +save files, by default |
| 42 | +- For this course we need to set the working directory to the location |
| 43 | +of the course scripts |
| 44 | +- In RStudio use the mouse and browse to the directory where you saved the Course Materials |
| 45 | + |
| 46 | +- ***Session → Set Working Directory → Choose Directory...*** |
| 47 | + |
| 48 | +## 0. Locate the data |
| 49 | + |
| 50 | +Before we even start the analysis, we need to be sure of where the data are located on our hard drive |
| 51 | + |
| 52 | +- Functions that import data need a file location as a character vector |
| 53 | +- The default location is the ***working directory*** |
| 54 | +```{r} |
| 55 | +getwd() |
| 56 | +``` |
| 57 | + |
| 58 | +- If the file you want to read is in your working directory, you can just use the file name |
| 59 | +```{r} |
| 60 | +list.files() |
| 61 | +``` |
| 62 | + |
| 63 | +- Otherwise you need the *path* to the file |
| 64 | + + you can get this using **`file.choose()`** |
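A path can also be built and checked in code before trying to read the file. In this minimal sketch the `data` folder is a hypothetical location; only the file name `countData.txt` comes from the course.

```r
# Build a path from its parts; the "data" folder is an assumed example location
path <- file.path("data", "countData.txt")
path

# Check whether a file exists before trying to read it
file.exists("countData.txt")  # TRUE only if the file is in the working directory
file.exists(path)             # TRUE only if the hypothetical data folder holds it
```

`file.choose()` is interactive, so it is best run from the console rather than inside a notebook chunk.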
| 65 | + |
| 66 | +##1. Read in the data |
| 67 | + |
| 68 | +- The data is a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text |
| 69 | +- We need to read in the data and assign the result to an object (`rawData`) |
| 70 | + |
| 71 | +```{r} |
| 72 | +rawData <- read.delim("countData.txt") |
| 73 | +``` |
| 74 | + |
| 75 | +In the latest RStudio, there is the option to import data directly from the File menu. ***File*** -> ***Import Dataset*** -> ***From Csv*** |
| 76 | + |
| 77 | +- If the data is comma-separated, then use either the argument `sep=","` or the function `read.csv()`: |
| 78 | +```{r} |
| 79 | +read.csv("countData.csv") |
| 80 | +``` |
| 81 | +- For the full list of arguments: |
| 82 | +```{r} |
| 83 | +?read.table |
| 84 | +``` |
| 85 | + |
| 86 | +##1b. Check the data |
| 87 | +- *Always* check the object to make sure the contents and dimensions are as you expect |
| 88 | +- R will sometimes create the object without error, but the contents may be unusable for analysis |
| 89 | + + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column |
| 90 | + |
| 91 | +```{r} |
| 92 | +# View the first 10 rows to ensure import is OK |
| 93 | +rawData[1:10,] |
| 94 | +``` |
| 95 | + |
| 96 | + |
| 97 | +- or use the `View()` function to get a display of the data in RStudio: |
| 98 | +```{r} |
| 99 | +View(rawData) |
| 100 | +``` |
| 101 | + |
| 102 | +##1c. Understanding the object |
| 103 | + |
| 104 | +- Once we have read the data successfully, we can start to interact with it |
| 105 | +- The object we have created is a *data frame*: |
| 106 | +```{r} |
| 107 | +class(rawData) |
| 108 | +``` |
| 109 | + |
| 110 | +- We can query the dimensions: |
| 111 | + |
| 112 | +```{r} |
| 113 | +ncol(rawData) |
| 114 | +nrow(rawData) |
| 115 | +dim(rawData) |
| 116 | +``` |
| 117 | + |
| 118 | +- Or the structure of an object: |
| 119 | + + TIP: In the RStudio *Environment* pane, click the blue arrow to the left of an object's name to see its structure |
| 120 | +```{r} |
| 121 | +str(rawData) |
| 122 | +``` |
| 123 | + |
| 124 | +##1c. Understanding the object |
| 125 | + |
| 126 | +- The names of the columns are automatically assigned: |
| 127 | + |
| 128 | +```{r} |
| 129 | +colnames(rawData) |
| 130 | +``` |
| 131 | + |
| 132 | +- We can use any of these names to access a particular column: |
| 133 | + + and create a vector |
| 134 | + + TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list! |
| 135 | +```{r} |
| 136 | +rawData$Nuclei |
| 137 | +``` |
| 138 | + |
| 139 | +##Word of caution |
| 140 | + |
| 141 | + |
| 142 | + |
| 143 | + |
| 144 | + |
| 145 | + |
| 146 | + |
| 147 | + |
| 148 | +> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others) |
| 149 | + |
| 150 | +##Word of caution |
| 151 | + |
| 152 | +You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on common pitfalls and how to avoid them: |
| 153 | + |
| 154 | +- http://www.datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html |
| 155 | +- http://kbroman.org/dataorg/ |
| 156 | + |
| 157 | +##Handling missing values |
| 158 | + |
| 159 | +- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection |
| 160 | +- `NA` is a special value that can be present in objects of any type (logical, character, numeric etc.) |
| 161 | +- `NA` is not the same as `NULL`: |
| 162 | + - `NULL` is an empty R object. |
| 163 | + - `NA` is one missing value within an R object (like a data frame or a vector) |
| 164 | +- Often R functions will handle `NA`s gracefully: |
| 165 | + |
| 166 | +```{r} |
| 167 | +x <- c(1, NA, 3) |
| 168 | +length(x) |
| 169 | +``` |
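The difference between `NA` and `NULL` can be checked directly. A minimal sketch using the same vector `x`:

```r
x <- c(1, NA, 3)

is.na(x)       # FALSE TRUE FALSE: flags the missing element
sum(is.na(x))  # 1: number of missing values
length(NULL)   # 0: NULL is an empty object, not a missing value
is.null(x)     # FALSE: x exists, it just contains an NA
mean(x)        # NA: not every function ignores missing values by default
```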
| 170 | + |
| 171 | +##Handling missing values |
| 172 | + |
| 173 | +- However, sometimes we have to tell the functions what to do with them. |
| 174 | +- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them: |
| 175 | + |
| 176 | + |
| 177 | +```{r} |
| 178 | +mean(x, na.rm = TRUE) |
| 179 | + |
| 180 | +mean(na.omit(x)) |
| 181 | +``` |
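The same idea extends to data frames. The sketch below counts missing values per column and keeps only the complete rows; the column names are borrowed from the course data, but the values are invented.

```r
# Hypothetical data frame with missing values in both columns
df <- data.frame(Nuclei = c(60, NA, 40),
                 NB_Amp = c(3, 30, NA))

colSums(is.na(df))                    # missing values per column
complete <- df[complete.cases(df), ]  # keep rows with no NA in any column
complete
```

Dropping incomplete rows is only one option; whether it is appropriate depends on the analysis.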
| 182 | + |
| 183 | +##2. Analysis (reshaping data and maths) |
| 184 | + |
| 185 | +- Our analysis involves identifying patients with > 33% NB amplification |
| 186 | + + we can use the **`which()`** function to select indices from a logical vector that are `TRUE` |
| 187 | + |
| 188 | +```{r} |
| 189 | +# Proportion of nuclei showing NB amplification in each patient: |
| 190 | +prop <- rawData$NB_Amp / rawData$Nuclei |
| 191 | + |
| 192 | +``` |
| 193 | + |
| 194 | +```{r} |
| 195 | +prop > 0.33 |
| 196 | +``` |
| 197 | + |
| 198 | +```{r} |
| 199 | +amp <- which(prop > 0.33) |
| 200 | +amp |
| 201 | +``` |
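One reason to prefer `which()` over indexing with the logical vector itself is how missing values are treated. A small sketch with hypothetical proportions (not the course data):

```r
prop_toy <- c(0.05, 0.60, NA, 0.40)  # hypothetical proportions, one missing

flag <- prop_toy > 0.33  # FALSE TRUE NA TRUE
idx  <- which(flag)      # 2 4: positions that are TRUE, NA is dropped

prop_toy[flag]  # 0.60 NA 0.40: logical indexing keeps the NA
prop_toy[idx]   # 0.60 0.40
```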
| 202 | + |
| 203 | + |
| 204 | +##2. Analysis (reshaping data and maths) |
| 205 | + |
| 206 | +- We can plot a simple chart of the % NB amplification |
| 207 | + + Note that two samples are amplified |
| 208 | + + Plotting will be covered in detail shortly |
| 209 | + |
| 210 | +```{r} |
| 211 | +plot(prop, ylim=c(0,1)) |
| 212 | +# Add a horizontal line: |
| 213 | +abline(h=0.33) |
| 214 | +``` |
| 215 | + |
| 216 | +##3. Outputting the results |
| 217 | + |
| 218 | +- We write out a data frame of results (patients > 33% NB amplification) as a 'comma separated values' text file (CSV): |
| 219 | +```{r} |
| 220 | +write.csv(rawData[amp,], file="selectedSamples.csv") |
| 221 | +``` |
| 222 | +- The output file is directly readable by Excel |
| 223 | +- It's often helpful to double check where the data has been saved. Use the *get working directory* function: |
| 224 | + |
| 225 | +```{r} |
| 226 | +getwd() # print working directory |
| 227 | +list.files() # list files in working directory |
| 228 | + |
| 229 | +``` |
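By default `write.csv()` also writes the row names as an unnamed first column, which appears as a column called `X` when the file is read back. The sketch below shows the difference on invented values written to temporary files; whether to keep the row names is a matter of preference.

```r
# Hypothetical results table
res <- data.frame(Nuclei = c(50, 80), NB_Amp = c(30, 40))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(res, file = f1)                     # default: row names are written
write.csv(res, file = f2, row.names = FALSE)  # row names are left out

colnames(read.csv(f1))  # "X" "Nuclei" "NB_Amp"
colnames(read.csv(f2))  # "Nuclei" "NB_Amp"
```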
| 230 | + |
| 231 | +##Exercise: |
| 232 | + |
| 233 | +- Patients are *near normal* if: |
| 234 | +`(NB_Amp / Nuclei < 0.33 & NB_Del == 0)` |
| 235 | +- Modify the condition in our previous code to find these patients |
| 236 | +- Write out a results file of the samples that match these criteria, and open it in a spreadsheet program |
| 237 | + |
| 238 | + |
| 239 | +```{r} |
| 240 | +### Your Answer Here ### |
| 241 | + |
| 242 | + |
| 243 | +``` |
| 244 | + |