Add script to clean and combine data, and add data

- Update survey data dictionary with left out questions
- Update survey data dictionary with variable/column names for questions
- Add script `clean-data.R` to clean and combine the two survey datasets into one for ease of analysis
- Create the combined survey dataset after running `clean-data.R`
- Create README.md file to explain cleaned data and the script to produce it
- Update root README.md file to briefly explain data
- Change `data/` directory to `raw-data/`
freeCodeCamp · May 11, 2016 · 7f185c3
1 parent 97ba361

Showing 8 changed files with 16,975 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -5,9 +5,18 @@ We announced on [March 29th,
 
 Survey development was led by [Quincy Larson](https://twitter.com/ossia) with Free Code Camp and [Saron Yitbarek](https://twitter.com/saronyitbarek) with Code Newbie. For more about why we made this survey: ["How we crafted a survey for thousands of people who are learning to code"](https://medium.freecodecamp.com/we-just-launched-the-biggest-ever-survey-of-people-learning-to-code-cac81dadf1ea#.8g9ts8gm5).
 
+## Table of Contents
+
+- [About the Data](#about-the-data)
+- [How to Contribute](#how-to-contribute)
+- [Analysis of other relevant recent data](#analysis-of-other-relevant-recent-data)
+- [License](#license)
+
 ## About the Data
 
-The survey results are located in the [`data/`](data/) directory, in .csv format.
+The raw survey results are located in the [`raw-data/`](raw-data/) directory, in `.csv` format.
+
+We have cleaned and combined the data for convenience of downstream analyses and visualizations. The cleaned data is located in the [`clean-data/`](clean-data/) directory.
 
 ## How to Contribute
 

diff --git a/clean-data/2016-FCC-New-Coders-Survey-Part-1.csv b/clean-data/2016-FCC-New-Coders-Survey-Part-1.csv
diff --git a/clean-data/README.md b/clean-data/README.md
@@ -0,0 +1,65 @@
+# Cleaning and Combining Free Code Camp Survey Data
+
+## Introduction
+
+The survey data was broken up into two parts and needs to be combined into one
+for ease of future downstream analyses. Additionally, these two data sets need
+to be cleaned up a bit because of the nature of survey data.
+
+## Notable Data Transformations
+
+### Obvious Outliers
+
+In some of the numeric free-text answers, values were filtered out if they
+were beyond a reasonable threshold. For example, an answer saying you've coded
+for 100,000 months would be removed.
+
+### Numeric Ranges
+
+Some answers were given as ranges. For example, "9-10" months of
+programming might have been the answer to a question. The average of this range
+was taken when possible.
+
+### Years to Months
+
+Some answers to a question asking about months were given in years. These were
+converted to months if possible.
+
+### Normalization of Answers
+
+Some of the free-text answers were very similar to each other, with the
+exception of a space or two. These will register as different answers if you
+aren't looking for them. Answers like "Cybersecurity" and "Cyber Security" are
+the same and were changed to a consistent form. Some variants may have been
+missed.
+
+
+## Prerequisites to Rerun Data Manipulations
+
+- [R][RProj] (>= 3.2.3)
+- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN]
+- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN]
+
+[RProj]: https://www.r-project.org/
+[dplyrGH]: https://github.com/hadley/dplyr
+[RcppGH]: https://github.com/RcppCore/Rcpp
+[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html
+[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html
+
+
+## Reproduce Cleaning and Combining of Data
+
+Running the following commands will create a new file,
+`2016-New-Coders-Survey.csv`, in this directory (`clean-data/`).
+
+```shell
+git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git
+cd 2016-new-coder-survey/clean-data
+Rscript clean-data.R
+```
+
+
+## Cleaning Pipeline
+
+1. Rename column names
+2. Clean free-text fields for the appropriate questions
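
The transformations described under "Notable Data Transformations" in `clean-data/README.md` can be sketched in base R as follows. This is a minimal illustration, not the code in `clean-data.R`: the helper names `parse_months` and `normalize_key`, the 600-month cutoff, and the regular expressions are all assumptions made for the example.

```r
# Hypothetical helper (not from clean-data.R): turn one free-text
# "months of programming" answer into a single number of months.
parse_months <- function(answer, max_months = 600) {
  if (is.na(answer)) return(NA_real_)

  # Lowercase, trim, and drop thousands separators ("100,000" -> "100000").
  x <- gsub(",", "", tolower(trimws(answer)))

  # Years to months: remember whether the answer was given in years.
  is_years <- grepl("year|yr", x)

  # Pull out every number in the answer.
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))[[1]])

  # Only a single value or a two-ended range can be interpreted.
  if (length(nums) == 0 || length(nums) > 2) return(NA_real_)

  # Numeric ranges: "9-10" becomes the average of its endpoints.
  value <- mean(nums)
  if (is_years) value <- value * 12

  # Obvious outliers: drop values beyond the (assumed) threshold.
  if (value > max_months) return(NA_real_)
  value
}

# Hypothetical helper: build a comparison key that ignores case, spaces,
# and punctuation, so near-duplicate free-text answers can be grouped.
normalize_key <- function(x) {
  gsub("[^a-z0-9]", "", tolower(x))
}

answers <- c("9-10", "2 years", "100,000", "18")
months <- vapply(answers, parse_months, numeric(1), USE.NAMES = FALSE)
print(months)  # 9.5 24.0 NA 18.0

print(normalize_key("Cyber Security") == normalize_key("Cybersecurity"))  # TRUE
```

Answers that share a key would still need to be mapped to one chosen spelling; the key only identifies which answers belong together.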