This repository has been archived by the owner on Jun 23, 2020. It is now read-only.
[WIP] Add script to clean and combine data, and add data #29
Merged: QuincyLarson merged 48 commits into freeCodeCamp:master from erictleung:clean-and-combine-data on May 18, 2016
2bcd819 Add script to clean and combine data, and add data (erictleung)
1c898ab Move around functions and add more edits (erictleung)
7971f09 Move cleaning of code events to own function (erictleung)
f5e834a Create function to search and add col + formatting (erictleung)
bf64e62 Create temp helper function to look at columns (erictleung)
e6070a9 Move reading data function to main processes (erictleung)
39072f7 Create draft full dataset (erictleung)
86fc7b0 Rename cleaning function and update joining key (erictleung)
3244de2 Add feedback to user on script actions (erictleung)
d969590 Separate other job interests cleaning to function (erictleung)
fe71c1b Fix inconsistent indenting in helper function (erictleung)
958d1a8 Move cleaning other podcasts to separate function (erictleung)
33aa339 Reorganize sub-cleaning functions to own category (erictleung)
6e95ec2 Update helper function with flexible use (erictleung)
8b9d6fb Create new columns for significant other podcasts (erictleung)
1010c2c Separate a function for cleaning hours learned (erictleung)
1afeb10 Add feedback in cleaning code events & exp earning (erictleung)
88546f6 Separate function for cleaning months programming (erictleung)
1f4d79a Separate function cleaning post bootcamp salary (erictleung)
bd9b112 Separate function for cleaning money for learning (erictleung)
e77c24b Add description to entire script (erictleung)
2b03580 Floor values and remove outliers in money to learning (erictleung)
177b6e3 Create function for cleaning age (erictleung)
3b80f03 Initialize functions for columns needing cleaning (erictleung)
10f3161 Create new boolean column for PodcastOther (erictleung)
2bb7c1c Fix feedback message for cleaning hours learning (erictleung)
465e574 Update draft of complete data (erictleung)
278cb58 Remove boolean Podcast Other column (erictleung)
aef6ad8 Finish cleaning income and remove extras (erictleung)
590e4ec Remove "Other" from new podcast cols (erictleung)
3851e9d Finish cleaning commute times (erictleung)
7b849cc Update code events cleaning to make new cols (erictleung)
1f827f3 Clean other resources (erictleung)
486ea58 Update code events threshold to 1.5% frequency (erictleung)
f7597ce Update detail on cutoff for other podcasts is 1.5% (erictleung)
de85dc1 Add Bootcamp Name into joining key (erictleung)
afb8ce7 Add back in podcast and events from 2nd dataset (erictleung)
379c932 Make ages less than 10 to NA (erictleung)
1f1fe3d Convert resources to boolean (erictleung)
bf1ff93 Finish cleaning data with consistency check (erictleung)
e3e5a19 Remove "Other" from new Podcast columns (erictleung)
a63c2a4 Clean student debt owed (erictleung)
d3ae597 Add CodeEvent column to columns removed (erictleung)
c4a7f8d Write final polish of data (erictleung)
5e88628 Fix small spelling mistakes (erictleung)
f668cb4 Update final dataset (erictleung)
660fd6c Remove first dataset (erictleung)
8a1dfcd Update script date (erictleung)
15,621 changes: 15,621 additions & 0 deletions
clean-data/2016-FCC-New-Coders-Survey-Data.csv
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
# Cleaning and Combining Free Code Camp Survey Data | ||
|
||
## Introduction | ||
|
||
The survey data was split into two parts, which need to be combined into one | ||
for ease of downstream analyses. Additionally, both data sets need | ||
to be cleaned up because free-text survey answers are inconsistent. | ||
|
||
## Notable Data Transformations | ||
|
||
### Obvious Outliers | ||
|
||
In some of the numeric free-text answers, values were filtered out if they | ||
were beyond a reasonable threshold. For example, an answer saying you've coded | ||
for 100,000 months would be removed. | ||
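
The outlier filter described above can be sketched as follows. This is an illustrative Python sketch, not the PR's R script, which is not shown on this page. The cutoff `MAX_MONTHS_PROGRAMMING` and the helper `drop_outlier` are hypothetical names, and the threshold value is an assumption, since the README does not state the one actually used.

```python
from typing import Optional

# Hypothetical cutoff (100 years in months); the real threshold is not given in the README.
MAX_MONTHS_PROGRAMMING = 1200


def drop_outlier(value: Optional[float],
                 upper: float = MAX_MONTHS_PROGRAMMING) -> Optional[float]:
    """Return the value if it is plausible, otherwise None (treated as missing)."""
    if value is None or value < 0 or value > upper:
        return None
    return value


# 100,000 months is beyond the cutoff, so it becomes a missing value.
months = [6, 24, 100000, None]
cleaned = [drop_outlier(m) for m in months]
```

Replacing implausible values with a missing marker, rather than dropping the whole row, keeps the respondent's other answers available for analysis.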
|
||
### Numeric Ranges | ||
|
||
Some answers were given as ranges. For example, a range of "9-10" months of | ||
programming might have been the answer to a question. The average of this range was | ||
taken when possible. | ||
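
As a rough illustration of the range handling described above, the Python sketch below averages the two endpoints of a "low-high" answer. It is not taken from the PR's R script; the pattern `RANGE_PATTERN` and the function `average_range` are hypothetical, and the sketch assumes ranges are written as two non-negative numbers separated by a hyphen.

```python
import re
from typing import Optional

# Assumed format: two non-negative numbers separated by a hyphen, e.g. "9-10".
RANGE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def average_range(answer: str) -> Optional[float]:
    """Return the midpoint of a 'low-high' answer, or None if it is not a range."""
    match = RANGE_PATTERN.match(answer)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2
```

Under these assumptions, `average_range("9-10")` gives `9.5`, and an answer that is not a range, such as `"about a year"`, gives `None` so it can be handled by another rule.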
|
||
### Years to Months | ||
|
||
Some answers to a question asking about months were given in years. These were | ||
converted to months if possible. | ||
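
A minimal sketch of the years-to-months conversion described above, written in Python for illustration only; the PR's R script may detect and convert these answers differently. `YEARS_PATTERN` and `years_to_months` are hypothetical names, and the sketch assumes the answer is a number followed by "year(s)" or "yr(s)".

```python
import re
from typing import Optional

# Assumed format: a number followed by "year", "years", "yr", or "yrs" (any case).
YEARS_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:years?|yrs?)\s*$", re.IGNORECASE)


def years_to_months(answer: str) -> Optional[float]:
    """Convert an answer given in years to months, or return None if no year unit is found."""
    match = YEARS_PATTERN.match(answer)
    if match is None:
        return None
    return float(match.group(1)) * 12
```

With these assumptions, `years_to_months("2 years")` gives `24.0`, while a bare number such as `"6"` gives `None` and is left to be read as months.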
|
||
### Normalization of Answers | ||
|
||
Some of the free-text answers were very similar to each other, differing only by | ||
a space or two. These register as different answers unless you | ||
look for them. Answers like "Cybersecurity" and "Cyber Security" are | ||
the same and were changed to a consistent form. Some may have been | ||
missed. | ||
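
One way to catch answers that differ only in spacing or case is to compare them on a simplified key. The Python sketch below is an assumption about how such matching could work, not the method used in the PR's R script; `normalization_key` and `normalize_answers` are hypothetical helpers, and choosing the first spelling seen as the canonical one is an arbitrary design choice.

```python
import re
from typing import Dict, List


def normalization_key(answer: str) -> str:
    """Drop all whitespace and lowercase, so near-identical answers share a key."""
    return re.sub(r"\s+", "", answer).lower()


def normalize_answers(answers: List[str]) -> List[str]:
    """Map every answer to the first spelling seen for its key."""
    canonical: Dict[str, str] = {}
    result = []
    for answer in answers:
        key = normalization_key(answer)
        # Remember the first (stripped) spelling for this key and reuse it afterwards.
        canonical.setdefault(key, answer.strip())
        result.append(canonical[key])
    return result


answers = ["Cybersecurity", "Cyber Security", "cyber security ", "Data Science"]
normalized = normalize_answers(answers)
```

This only merges answers that differ by whitespace or letter case; misspellings and synonyms would still need manual review, which is consistent with the note that some variants may have been missed.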
|
||
|
||
## Prerequisites to Rerun Data Manipulations | ||
|
||
- [R][RProj] (>= 3.2.3) | ||
- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN] | ||
- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN] | ||
|
||
[RProj]: https://www.r-project.org/ | ||
[dplyrGH]: https://github.com/hadley/dplyr | ||
[RcppGH]: https://github.com/RcppCore/Rcpp | ||
[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html | ||
[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html | ||
|
||
|
||
## Reproduce Cleaning and Combining of Data | ||
|
||
Running the following script will create a new | ||
`2016-New-Coders-Survey.csv` file in the `clean-data/` directory. | ||
|
||
```shell | ||
git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git | ||
cd 2016-new-coder-survey/clean-data | ||
Rscript clean-data.R | ||
``` | ||
|
||
|
||
## Cleaning Pipeline | ||
|
||
1. Rename column names | ||
2. Clean free text fields for appropriate question |