-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathrun_analysis.R
172 lines (137 loc) · 8.7 KB
/
run_analysis.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
#===================================================================================================================
# This is file run_analysis.R. It contains commands to tidy and analyze the UCI HAR Dataset. In outline,
# here are the cleaning and analysis steps in this document. Steps 1) through 4) correspond directly to the project
# instructions. Step 0) is preprocessing where I read in all relevant data.
#
# 0) First, I read in all the relevant files. I assume the pwd (current working directory) contains a directory
# called "UCI HAR Dataset" that contains the files and directories from the input. Given this, I read in
# the following files:
# * activity_labels.txt: this file contains the translation between numbers in the data and activity
# labels. I read it into a table called activity_labels, with columns "code" and "label".
# * features.txt: this file contains the column names for the main dataset, read into data frame features
# * test/subject_test.txt: this file contains which subject each observation in the X_test and y_test
# files belongs to. Read into data frame subject_test
# * test/X_test.txt: this file contains the testing subset of the main data, features extracted from the
# observations from the sensors (one observation per row). Read into data frame X_test.
# * test/y_test.txt: this file contains the "answers" if a model trains on the data from X_test.txt,
# which are categorizations of the observations into activity codes. Read into data frame y_test.
# * train/subject_test.txt: again, the file contains the subjects for the observations in X_train
# and classifications in y_train. Read into data frame subject_test.
# * train/X_train.txt: this file contains the training subset of the main data, which is features extracted
# from the observations from the sensors (one observation per row). Read into data frame X_train.
# * train/y_train.txt: this file contains the "answers" to the training data Xtrain.txt, in other words,
# it contains the classification into activity codes for each observation. Read into data frame y_train.
#
# 1) Next, I work on the data frames X_train and X_test. First, I label the columns according to the features
# column names. The result is two appropriately labeled data frames, one for the training, and one for the testing
# input data. Then, I put together the various pieces of the dataset. I use cbind to put together the subjects,
# features, and classifications for the training and the test data. I use rbind to put together the resulting
# completed data subsets. The result of the rbind is a merged training and test set.
#
# 2) Next, I select only those columns whose name indicates a mean or std function. This extracts
# the mean, std, and meanFreq columns for all relevant measurements.
#
# 3) Next, I translate activity codes in the data into human-readable labels according to the activity_labels
# data frame.
#
# 4) Finally, I create the output data frame, which contains the average value of each variable for each
# activity and subject. I output this data frame to a file using write.table.
#
# The Codebook and README files for this project that should accompany this script can be found at:
# https://github.com/datacathy/GCD_course_project. Thanks!
#===================================================================================================================
#----------------------------------------------------------------------------------------------
# STEP 0): read in the relevant pieces of data.
#----------------------------------------------------------------------------------------------
print("Reading data...")
# read activity_labels.txt into a data frame and give it sensible column names
activity_labels <- read.table("./UCI\ HAR\ Dataset/activity_labels.txt")
names(activity_labels) <- c("code", "activity")
# read features.txt into a data frame and give it a sensible column name
features <- read.table("./UCI\ HAR\ Dataset/features.txt")
names(features) <- c("column_index", "feature_name")
# read the three training data files: subject_train.txt, X_train.txt, and y_train.txt and give
# the resulting data frames appropriate column names. Note that the column names for X_train
# are given by the features$feature_name column (in order).
subject_train <- read.table("./UCI\ HAR\ Dataset/train/subject_train.txt")
names(subject_train) <- c("subject")
X_train <- read.table("./UCI\ HAR\ Dataset/train/X_train.txt")
names(X_train) <- features$feature_name
y_train <- read.table("./UCI\ HAR\ Dataset/train/y_train.txt")
names(y_train) <- c("activity_code")
# now read the three test data files: subject_test.txt, X_test.txt, and y_test.txt and give
# the resulting data frames appropriate column names. Note that the column names for X_test
# are given by the features$feature_name column (in order).
subject_test <- read.table("./UCI\ HAR\ Dataset/test/subject_test.txt")
names(subject_test) <- c("subject")
X_test <- read.table("./UCI\ HAR\ Dataset/test/X_test.txt")
names(X_test) <- features$feature_name
y_test <- read.table("./UCI\ HAR\ Dataset/test/y_test.txt")
names(y_test) <- c("activity_code")
# Done reading in data!
#----------------------------------------------------------------------------------------------
# STEP 1): Merge training and test data
#----------------------------------------------------------------------------------------------
print("Merging training and test data...")
# First, we have to put the 3 data frames containing the training data together. Note that the
# order of rows in subject_train, X_train, and y_train is the same, so we cbind them to
# produce complete observations
training <- cbind(cbind(subject_train, y_train), X_train)
# similarly for the test data
testing <- cbind(cbind(subject_test, y_test), X_test)
# now that we have the completed training and testing data frames, we can put all the
# observations into one data frame using rbind.
fulldata <- rbind(training, testing)
# Done merging training and test data
#----------------------------------------------------------------------------------------------
# STEP 2): Extract mean and std data for each measurement
#----------------------------------------------------------------------------------------------
print("Extracting mean and std columns...")
# Now create a logical vector detailing which columns we want to keep, namely those that
# include a mean or std function. Note that this includes meanFreq columns.
which_cols_to_keep <- sapply(names(fulldata), function(x) { length(grep("mean()|std()", x)) > 0 })
# keep "activity_code", and "subject" columns as well
which_cols_to_keep['activity_code'] <- TRUE
which_cols_to_keep['subject'] <- TRUE
# now keep only those columns where which_cols_to_keep is true
keptdata <- fulldata[which_cols_to_keep]
# Done extracting mean and std functions
#----------------------------------------------------------------------------------------------
# STEP 3): Appropriately label activities
#----------------------------------------------------------------------------------------------
print("Labeling activities in human-readable form...")
# merge keptdata with the activity_labels data frame, resulting column activity contains
# human readable labels for each activity code in the observations
keptdata <- merge(activity_labels, keptdata, by.x="code", by.y="activity_code")
# drop extraneous "code" column
keptdata$code <- NULL
# Done labeling activities
#----------------------------------------------------------------------------------------------
# STEP 4): Output a table of averages of each variable for each activity & subject
#----------------------------------------------------------------------------------------------
print("Preparing and writing output to file GCDC-project-result.txt...")
# create a wide table of means for each activity and subject. For a discussion of whether to
# create this wide table or a narrower one, see README.md and Codebook.md.
library(dplyr)
result <- keptdata %>% group_by(activity, subject) %>% summarize_each(funs(mean))
# output result table using table.write with option row.names=FALSE. TO read the table back
# into R, use table.read with option header=TRUE.
write.table(result, "GCDC-project-result.txt", row.names=FALSE)
# Done outputing result.
# clean up
rm(list = c("activity_labels",
"features",
"subject_train",
"X_train",
"y_train",
"subject_test",
"X_test",
"y_test",
"fulldata",
"keptdata",
"result",
"testing",
"training",
"which_cols_to_keep"))
print("...done.")
# Done script run_analysis.R. Thanks!