| 1 | +# Cleaning and Combining Free Code Camp Survey Data |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +The survey data was split into two parts that need to be combined into one |
| 6 | +for ease of downstream analysis. Both data sets also need some cleaning, |
| 7 | +because free-text survey answers are often inconsistent. |
| 8 | + |
| 9 | +## Notable Data Transformations |
| 10 | + |
| 11 | +### Obvious Outliers |
| 12 | + |
| 13 | +In some of the numeric free-text answers, values were removed if they |
| 14 | +were beyond a reasonable threshold. For example, an answer claiming you've |
| 15 | +coded for 100,000 months would be removed. |
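A minimal sketch of this kind of filter in base R, not necessarily how `clean-data.R` does it. The helper name `remove_outliers`, the 744-month cap, and the choice to blank values to `NA` rather than drop rows are all assumptions made for illustration.

```r
# Hypothetical helper: replace implausible numeric answers with NA.
# The default cap of 744 months (62 years) is an illustrative assumption,
# not a threshold taken from the survey cleaning script.
remove_outliers <- function(x, max_value = 744) {
  implausible <- !is.na(x) & (x < 0 | x > max_value)
  x[implausible] <- NA
  x
}

months_programming <- c(6, 24, 100000, NA, 36)
remove_outliers(months_programming)
# 6 24 NA NA 36
```

Blanking the value instead of dropping the row keeps the respondent's other answers available for analysis.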
| 16 | + |
| 17 | +### Numeric Ranges |
| 18 | + |
| 19 | +Some answers were given as ranges. For example, a range of "9-10" months of |
| 20 | +programming might have been given as an answer. The average of the range |
| 21 | +was used when possible. |
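Averaging a range could look roughly like the following base R sketch. The helper name `average_range` and the exact pattern it accepts are assumptions; the actual script may recognize other formats.

```r
# Hypothetical helper: turn answers such as "9-10" into the midpoint of the
# range. Plain numbers pass through unchanged; anything else becomes NA.
average_range <- function(x) {
  x <- trimws(x)
  is_range <- grepl("^[0-9.]+[[:space:]]*-[[:space:]]*[0-9.]+$", x)
  out <- suppressWarnings(as.numeric(x))  # ranges and free text become NA here
  parts <- strsplit(x[is_range], "[[:space:]]*-[[:space:]]*")
  out[is_range] <- vapply(
    parts,
    function(p) mean(suppressWarnings(as.numeric(p))),
    numeric(1)
  )
  out
}

average_range(c("9-10", "12", "a while"))
# 9.5 12.0 NA
```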
| 22 | + |
| 23 | +### Years to Months |
| 24 | + |
| 25 | +Some answers to a question asking about months were given in years. These were |
| 26 | +converted to months if possible. |
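As a rough illustration, a years-to-months conversion might be written as below. The helper name `years_to_months` and the unit spellings it recognizes ("year", "years", "yr", "yrs") are assumptions, not taken from the cleaning script.

```r
# Hypothetical helper: convert answers such as "2 years" to months.
# Bare numbers are assumed to already be months; anything else becomes NA.
years_to_months <- function(x) {
  x <- tolower(trimws(x))
  unit <- "[[:space:]]*(years?|yrs?)$"
  is_plain <- grepl("^[0-9.]+$", x)
  is_years <- grepl(paste0("^[0-9.]+", unit), x)
  out <- rep(NA_real_, length(x))
  out[is_plain] <- suppressWarnings(as.numeric(x[is_plain]))
  out[is_years] <- suppressWarnings(as.numeric(sub(unit, "", x[is_years]))) * 12
  out
}

years_to_months(c("2 years", "18", "1.5 yrs", "unsure"))
# 24 18 18 NA
```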
| 27 | + |
| 28 | +### Normalization of Answers |
| 29 | + |
| 30 | +Some of the free-text answers were nearly identical to each other, differing |
| 31 | +only by a space or two. These register as different answers unless you |
| 32 | +look for them. Answers like "Cybersecurity" and "Cyber Security" mean the |
| 33 | +same thing and were changed to one consistent form. Some variants may have |
| 34 | +been missed. |
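One way to catch such near-duplicates is to compare answers on a key that ignores case and whitespace. The sketch below is an assumption about how this could be done, not the script's actual method; `normalize_answers` is a hypothetical name, and keeping the first spelling seen is an arbitrary choice.

```r
# Hypothetical helper: answers that match after lowercasing and removing all
# whitespace are mapped to the first spelling that appears in the vector.
normalize_answers <- function(x) {
  key <- tolower(gsub("[[:space:]]+", "", x))
  x[match(key, key)]  # match() returns the index of the first equal key
}

normalize_answers(c("Cybersecurity", "Cyber Security", "cyber security", "Data Science"))
# "Cybersecurity" "Cybersecurity" "Cybersecurity" "Data Science"
```

A key like this only catches spacing and capitalization differences, so misspellings or abbreviations would still need manual review.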
| 35 | + |
| 36 | + |
| 37 | +## Prerequisites to Rerun Data Manipulations |
| 38 | + |
| 39 | +- [R][RProj] (>= 3.2.3) |
| 40 | +- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN] |
| 41 | +- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN] |
| 42 | + |
| 43 | +[RProj]: https://www.r-project.org/ |
| 44 | +[dplyrGH]: https://github.com/hadley/dplyr |
| 45 | +[RcppGH]: https://github.com/RcppCore/Rcpp |
| 46 | +[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html |
| 47 | +[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html |
| 48 | + |
| 49 | + |
| 50 | +## Reproduce Cleaning and Combining of Data |
| 51 | + |
| 52 | +Running the following script will create a new file, |
| 53 | +`2016-New-Coders-Survey.csv`, in this directory (`clean-data/`). |
| 54 | + |
| 55 | +```shell |
| 56 | +git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git |
| 57 | +cd 2016-new-coder-survey/clean-data |
| 58 | +Rscript clean-data.R |
| 59 | +``` |
| 60 | + |
| 61 | + |
| 62 | +## Cleaning Pipeline |
| 63 | + |
| 64 | +1. Rename column names |
| 65 | +2. Clean free-text fields for the appropriate questions |