For an explanation of the features involved, see `features_info.txt` in the given data set.

- `data`: Data frame, 10299 observations, 564 variables
  - `activityNumber`: The outcome (y). It is a numeric representation of the activity being performed in each observation.
  - `activity`: Labels naming the activity, as described in the given data.
  - `subject`: ID number of the subject.
  - The remaining 561 columns are numeric values of the features (X), named as described in the given data.
- `meansDevs`: Data frame, 10299 observations, 68 variables
  - `activity`: Labels naming the activity.
  - `subject`: ID number of the subject.
  - The remaining 66 columns correspond to the numeric values of features containing `-mean()` or `-std()` in their names. Their names have been cleaned up to be self-explanatory.
- `dataSummary`: Data frame, 180 observations, 68 variables
  - `activity`: Labels naming the activity.
  - `subject`: ID number of the subject.
  - The remaining 66 columns correspond to the means of their counterparts in `meansDevs`, for each subject, grouped by activity.
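
As a rough illustration of how `dataSummary` relates to `meansDevs`, the sketch below averages each feature per subject and activity on a tiny made-up data frame. It uses base R's `aggregate` to stay dependency-free, which is not necessarily how the script computes the summary. The toy values and the two feature columns are hypothetical.

```r
# Toy stand-in for meansDevs: 2 activities x 2 subjects, 2 observations each.
# Column names and values are hypothetical, not taken from the real data.
meansDevs <- data.frame(
  activity = factor(rep(c("WALKING", "SITTING"), each = 4)),
  subject = rep(c(1, 1, 2, 2), times = 2),
  timeBodyAccMeanX = c(0.2, 0.4, 0.1, 0.3, 0.0, 0.2, 0.5, 0.7),
  timeBodyAccStdX = c(-0.9, -0.7, -0.5, -0.3, -1.0, -0.8, -0.6, -0.4)
)

# Average every feature column within each (activity, subject) pair.
# The "." on the left-hand side stands for all columns not named on the right.
dataSummary <- aggregate(. ~ activity + subject, data = meansDevs, FUN = mean)

# One row per subject/activity pair, same columns as meansDevs.
print(dim(dataSummary))
```

The toy result has one row per subject/activity pair (2 x 2 = 4). In the real data, 30 subjects and 6 activities give the 180 rows listed for `dataSummary`.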

- The training case subjects, features, and outcomes are read, then aggregated with `cbind` into `trainSet`. The same is done for test cases, aggregated into `testSet`.
- `testSet` and `trainSet` are aggregated with `rbind` to form `data`.
- Feature names are read from `features.txt` into `features` and used to set column names for `data`.
- Activity labels are read from `activity_labels.txt` into `activities`, and this data frame is merged with `data` on the basis of `activityNumber`. This adds an `activity` column providing a description of activities instead of a numeric representation.
- `activity`, `subject`, and all columns with names containing `-mean()` or `-std()` are extracted from `data`. This is done using `grep` on a regexp. The resulting data frame is stored as `meansDevs`.
- Column names in `meansDevs` are cleaned to be more readable and meaningful, using `gsub` on regexps to replace parts of column names with descriptive equivalents.
- `meansDevs` is `melt`ed with `activity` and `subject` as IDs. It is then `dcast`ed to give means of each feature. The result is stored in `dataSummary`.
- `dataSummary` is written to `data_summary.txt` in the `UCI HAR Dataset` folder.
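
The combine, merge, extract, and rename steps above can be sketched on small in-memory data. This is a minimal sketch, not the script itself: the toy inputs stand in for the files, `makeSet` is a hypothetical helper, and the `gsub` patterns are illustrative guesses, since the source does not list the actual replacements.

```r
# Hypothetical in-memory stand-ins for the files read by the script.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X")
activities <- data.frame(activityNumber = 1:2,
                         activity = c("WALKING", "SITTING"))

# Hypothetical helper: cbind subjects, outcomes, and features into one set.
makeSet <- function(subject, y, x) {
  cbind(data.frame(subject = subject, activityNumber = y), as.data.frame(x))
}
trainSet <- makeSet(c(1, 1), c(1, 2),
                    matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 2))
testSet <- makeSet(c(2, 2), c(2, 1),
                   matrix(c(0.7, 0.8, 0.9, 1.0, 1.1, 1.2), nrow = 2))

# Stack test and training rows, then name the feature columns.
data <- rbind(testSet, trainSet)
names(data)[-(1:2)] <- features

# Merge in the descriptive labels, keyed on activityNumber.
data <- merge(activities, data, by = "activityNumber")

# Keep activity, subject, and columns whose names contain -mean() or -std().
# The parentheses are escaped so that they match literally.
keep <- grep("-(mean|std)\\(\\)", names(data))
meansDevs <- data[, c("activity", "subject", names(data)[keep])]

# Illustrative clean-up rules (assumed, not the script's actual patterns).
names(meansDevs) <- gsub("^t", "time", names(meansDevs))
names(meansDevs) <- gsub("-mean\\(\\)", "Mean", names(meansDevs))
names(meansDevs) <- gsub("-std\\(\\)", "Std", names(meansDevs))
names(meansDevs) <- gsub("-", "", names(meansDevs))

print(names(meansDevs))
```

Chaining several small `gsub` calls keeps each renaming rule easy to read and to adjust on its own.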

- For `meansDevs` and `dataSummary`, the cryptic feature names are modified into more readable and meaningful camelCase names.
- The activity labels are not of the character class but of the factor class, for easier grouping operations in the future. A side effect of this is that underscores have been left in (e.g. `WALKING_UPSTAIRS` instead of `WALKING UPSTAIRS`) because using spaces would make factor levels confusing, as follows:

      > head(dataSummary$activity)
      [1] LAYING LAYING LAYING LAYING LAYING LAYING
      Levels: LAYING SITTING STANDING WALKING WALKING DOWNSTAIRS WALKING UPSTAIRS

- Because the raw data involved in this project is large and may cause slowdown, intermediate objects that are no longer needed are removed with `rm()`.
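
The factor-level point can be reproduced on a small example. The sketch below uses two hypothetical label vectors (not objects from the script) to compare how levels print with underscores and with spaces, and then drops an object that is no longer needed.

```r
# Hypothetical label vectors, one with underscores and one with spaces.
withUnderscores <- factor(c("WALKING_UPSTAIRS", "LAYING", "WALKING_DOWNSTAIRS"))
withSpaces <- factor(c("WALKING UPSTAIRS", "LAYING", "WALKING DOWNSTAIRS"))

# print() separates levels with spaces, so multi-word levels run together
# in the "Levels:" line of the second factor.
print(withUnderscores)
print(withSpaces)

# Factors make per-activity grouping straightforward, e.g. counting rows.
counts <- table(withUnderscores)
print(counts)

# Remove an intermediate object once it is no longer needed.
rm(withSpaces)
```

Both factors have three levels. Only the printed `Levels:` line becomes ambiguous when the labels contain spaces.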