This repository has been archived by the owner on Jun 23, 2020. It is now read-only.

[WIP] Add script to clean and combine data, and add data #29

Merged
merged 48 commits into from
May 18, 2016
48 commits
2bcd819
Add script to clean and combine data, and add data
erictleung May 1, 2016
1c898ab
Move around functions and add more edits
erictleung May 16, 2016
7971f09
Move cleaning of code events to own function
erictleung May 16, 2016
f5e834a
Create function to search and add col + formatting
erictleung May 16, 2016
bf64e62
Create temp helper function to look at columns
erictleung May 16, 2016
e6070a9
Move reading data function to main processes
erictleung May 16, 2016
39072f7
Create draft full dataset
erictleung May 16, 2016
86fc7b0
Rename cleaning function and update joining key
erictleung May 17, 2016
3244de2
Add feedback to user on script actions
erictleung May 17, 2016
d969590
Separate other job interests cleaning to function
erictleung May 17, 2016
fe71c1b
Fix inconsistent indenting in helper function
erictleung May 17, 2016
958d1a8
Move cleaning other podcasts to separate function
erictleung May 17, 2016
33aa339
Reorganize sub-cleaning functions to own category
erictleung May 17, 2016
6e95ec2
Update helper function with flexible use
erictleung May 17, 2016
8b9d6fb
Create new columns for significant other podcasts
erictleung May 17, 2016
1010c2c
Separate a function for cleaning hours learned
erictleung May 17, 2016
1afeb10
Add feedback in cleaning code events & exp earning
erictleung May 17, 2016
88546f6
Separate function for cleaning months programming
erictleung May 17, 2016
1f4d79a
Separate function cleaning post bootcamp salary
erictleung May 17, 2016
bd9b112
Separate function for cleaning money for learning
erictleung May 17, 2016
e77c24b
Add description to entire script
erictleung May 17, 2016
2b03580
Floor values and remove outliers in money to learning
erictleung May 17, 2016
177b6e3
Create function for cleaning age
erictleung May 17, 2016
3b80f03
Initialize functions for columns needing cleaning
erictleung May 17, 2016
10f3161
Create new boolean column for PodcastOther
erictleung May 17, 2016
2bb7c1c
Fix feedback message for cleaning hours learning
erictleung May 17, 2016
465e574
Update draft of complete data
erictleung May 17, 2016
278cb58
Remove boolean Podcast Other column
erictleung May 17, 2016
aef6ad8
Finish cleaning income and remove extras
erictleung May 17, 2016
590e4ec
Remove "Other" from new podcast cols
erictleung May 18, 2016
3851e9d
Finish cleaning commute times
erictleung May 18, 2016
7b849cc
Update code events cleaning to make new cols
erictleung May 18, 2016
1f827f3
Clean other resources
erictleung May 18, 2016
486ea58
Update code events threshold to 1.5% frequency
erictleung May 18, 2016
f7597ce
Update detail on cutoff for other podcasts is 1.5%
erictleung May 18, 2016
de85dc1
Add Bootcamp Name into joining key
erictleung May 18, 2016
afb8ce7
Add back in podcast and events from 2nd dataset
erictleung May 18, 2016
379c932
Make ages less than 10 to NA
erictleung May 18, 2016
1f1fe3d
Convert resources to boolean
erictleung May 18, 2016
bf1ff93
Finish cleaning data with consistency check
erictleung May 18, 2016
e3e5a19
Remove "Other" from new Podcast columns
erictleung May 18, 2016
a63c2a4
Clean student debt owed
erictleung May 18, 2016
d3ae597
Add CodeEvent column to columns removed
erictleung May 18, 2016
c4a7f8d
Write final polish of data
erictleung May 18, 2016
5e88628
Fix small spelling mistakes
erictleung May 18, 2016
f668cb4
Update final dataset
erictleung May 18, 2016
660fd6c
Remove first dataset
erictleung May 18, 2016
8a1dfcd
Update script date
erictleung May 18, 2016
11 changes: 10 additions & 1 deletion README.md
@@ -5,9 +5,18 @@ We announced on [March 29th,

Survey development was led by [Quincy Larson](https://twitter.com/ossia) with Free Code Camp and [Saron Yitbarek](https://twitter.com/saronyitbarek) with Code Newbie. For more about why we made this survey: ["How we crafted a survey for thousands of people who are learning to code"](https://medium.freecodecamp.com/we-just-launched-the-biggest-ever-survey-of-people-learning-to-code-cac81dadf1ea#.8g9ts8gm5).

## Table of Contents

- [About the Data](#about-the-data)
- [How to Contribute](#how-to-contribute)
- [Analysis of other relevant recent data](#analysis-of-other-relevant-recent-data)
- [License](#license)

## About the Data

The raw survey results are located in the [`raw-data/`](raw-data/) directory, in `.csv` format.

We have cleaned and combined the data to make downstream analyses and visualizations more convenient. The cleaned data is located in the [`clean-data/`](clean-data/) directory.

## How to Contribute

15,621 changes: 15,621 additions & 0 deletions clean-data/2016-FCC-New-Coders-Survey-Data.csv

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions clean-data/README.md
@@ -0,0 +1,65 @@
# Clean and Combine Free Code Camp Survey Data

## Introduction

The survey data was split into two parts, which need to be combined into one
for ease of downstream analyses. Both data sets also need some cleaning, since
free-text survey answers are inconsistent by nature.

## Notable Data Transformations

### Obvious Outliers

In some of the numeric free-text answers, values were filtered out if they
were beyond a reasonable threshold. For example, an answer saying you've coded
for 100,000 months would be removed.
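
As a rough illustration of this kind of filter, here is a minimal sketch. It is written in Python for readability and is not the repository's R script. The cutoff `MAX_MONTHS_PROGRAMMING` and the helper `drop_outlier` are hypothetical: the actual thresholds used are not stated here, and treating a filtered value as missing (`None`) is also an assumption.

```python
from typing import Optional

# Hypothetical cutoff; the real threshold used by the cleaning script is not stated.
MAX_MONTHS_PROGRAMMING = 744  # 62 years, an assumed upper bound


def drop_outlier(value: Optional[float], upper: float = MAX_MONTHS_PROGRAMMING) -> Optional[float]:
    """Return the value unchanged, or None if it is negative or above `upper`."""
    if value is None:
        return None
    if value < 0 or value > upper:
        return None
    return value


months = [6, 24, 100000, None, 18]
cleaned = [drop_outlier(m) for m in months]
print(cleaned)  # [6, 24, None, None, 18]
```

Marking an outlier as missing rather than deleting the whole row keeps the respondent's other answers available for analysis.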

### Numeric Ranges

Some answers were given as ranges. For example, a respondent might have
answered "9-10" when asked about months of programming. The average of the
range was taken when possible.
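
The averaging step could look roughly like the sketch below, again in Python rather than the R used by the script. The pattern `RANGE_PATTERN` and the function `range_to_average` are hypothetical names, and the set of accepted formats (a hyphen or the word "to" between two numbers) is an assumption.

```python
import re
from typing import Optional

# Two non-negative numbers separated by a hyphen or "to", e.g. "9-10" or "9 to 10".
RANGE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)\s*$")


def range_to_average(answer: str) -> Optional[float]:
    """Return the midpoint of a numeric range, or None if the answer is not a range."""
    match = RANGE_PATTERN.match(answer)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2


print(range_to_average("9-10"))          # 9.5
print(range_to_average("about a year"))  # None
```

Answers that do not match the pattern return `None`, so the caller can decide whether to leave them for another cleaning rule or treat them as missing.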

### Years to Months

Some answers to a question asking about months were given in years. These were
converted to months if possible.
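
A conversion along these lines might be sketched as follows. This is Python for illustration only, not the script's R code. `YEARS_PATTERN` and `years_to_months` are hypothetical, and the recognized spellings ("year", "years", "yr", "yrs") are assumed.

```python
import re
from typing import Optional

# A number followed by a year unit, e.g. "2 years", "1.5 yrs", "1 Year".
YEARS_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:years?|yrs?)\s*$", re.IGNORECASE)


def years_to_months(answer: str) -> Optional[float]:
    """Convert an answer given in years to months, or return None if it is not in years."""
    match = YEARS_PATTERN.match(answer)
    if match is None:
        return None
    return float(match.group(1)) * 12


print(years_to_months("2 years"))   # 24.0
print(years_to_months("6 months"))  # None
```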

### Normalization of Answers

Some of the free-text answers were nearly identical, differing only by a space
or two. These register as different answers unless you look for them. Answers
like "Cybersecurity" and "Cyber Security" mean the same thing and were changed
to a single consistent form. Some variants may have been missed.
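
One way to catch such near-duplicates is to compare answers on a key that ignores case and whitespace. The sketch below is in Python for illustration and is not taken from the R script. The `CANONICAL` lookup and both function names are hypothetical.

```python
import re

# Hypothetical lookup from a whitespace-free, lowercase key to the preferred spelling.
CANONICAL = {"cybersecurity": "Cybersecurity"}


def normalization_key(answer: str) -> str:
    """Lowercase the answer and remove all whitespace so near-duplicates collide."""
    return re.sub(r"\s+", "", answer).lower()


def normalize_answer(answer: str) -> str:
    """Return the canonical spelling if known, else the answer with tidy spacing."""
    key = normalization_key(answer)
    return CANONICAL.get(key, " ".join(answer.split()))


print(normalize_answer("Cyber Security"))  # Cybersecurity
print(normalize_answer("Cybersecurity"))   # Cybersecurity
```

Answers without a known canonical form keep their wording and only have stray whitespace collapsed, which avoids merging answers that merely look similar.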


## Prerequisites to Rerun Data Manipulations

- [R][RProj] (>= 3.2.3)
- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN]
- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN]

[RProj]: https://www.r-project.org/
[dplyrGH]: https://github.com/hadley/dplyr
[RcppGH]: https://github.com/RcppCore/Rcpp
[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html
[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html


## Reproduce Cleaning and Combining of Data

Running the following commands will create a new
`2016-New-Coders-Survey.csv` file in this directory, `clean-data/`.

```shell
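# Clone the repository, then run the cleaning script from its clean-data/ directory
git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git
cd 2016-new-coder-survey/clean-data
Rscript clean-data.R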
```


## Cleaning Pipeline

1. Rename columns
2. Clean free-text fields for the appropriate questions
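
The renaming step in the list above could be sketched as below. This is illustrative Python, not the R script, and the column names in `RENAMES` are invented for the example. The real survey columns and their new names are not listed here.

```python
from typing import Dict, List

# Hypothetical mapping from raw survey question text to a short column name.
RENAMES = {"How old are you?": "Age"}


def rename_columns(rows: List[Dict[str, str]], renames: Dict[str, str]) -> List[Dict[str, str]]:
    """Return new rows with keys renamed; columns not in `renames` are kept as-is."""
    return [
        {renames.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


raw_rows = [{"How old are you?": "27", "Gender": "female"}]
renamed = rename_columns(raw_rows, RENAMES)
print(renamed)  # [{'Age': '27', 'Gender': 'female'}]
```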