[WIP] Add script to clean and combine data, and add data #29

erictleung · 2016-05-05T04:07:49Z

cc/ @QuincyLarson @evaristoc Feel free to comment on aspects of the changes I'll be making. I figured it would be easier and faster to get feedback by using GitHub's feature to comment on PR changes.

Closes #26

Checklist

Commit Message

Update survey data dictionary with left out questions
Update survey data dictionary with variable/column names for questions
Add script clean-data.R to clean and combine the two survey datasets into
one for ease of analysis
Create the combined survey dataset after running clean-data.R
Create README.md file to explain cleaned data and the script to produce it

QuincyLarson · 2016-05-05T05:19:09Z

@erictleung Rather than giving people a script, I say we just give them the cleaned data

So if you can run your script and verify it worked, then we should remove the old csv files and replace them with your unified (and cleaned) csv file

You can commit the R script if you want for archival purposes, but I think 99.9% of the people going to the repo will just want a polished final CSV - they won't care as much about the details of our implementation

SamAI-Software · 2016-05-05T10:52:01Z

Both variants would be good to have in case of any bugs

clean-data/clean-data.R

+    cleanPart1 <- cleanPart1 %>% filter(!numericIdx) %>%
+        bind_rows(numericData)
+
+    # Make all expected earnings less than 100 multiplied by 1000


clean-data/clean-data.R

+    ##  Adapted from: http://stackoverflow.com/a/26766945
+    undecidedWords <- c("not sure", "don't know", "not certain",
+                        "unsure", "dont know", "undecided",
+                        "no preference", "not", "any", "no idea")


evaristoc · 2016-05-07T09:11:12Z

@SamAI-Software about your question above:

Firstly, we all should agree on one thing. Should we just cutoff (into blank) all weird numbers or should we guess what was the real intention and then try to normalize it? And it's not only about expected salary, but about all questions with open answers with numbers.

Yes, even if that will be arbitrary. The most rigorous option is missing or "outliering".

Similarly I have been in communication with @erictleung about:

Both variants would be good to have in case of any bugs

My proposal has been to supply different levels of files:

Raw datasets
Totally Clean dataset
Annex datasets

The Totally Clean is ours with the whole parsing + our arbitrary interpretations of the meaning of the values. Annex datasets could be intermediate ones containing unchanged values for some of the variables that asked for the most of the arbitrary changes, for example all open questions like "Other". See an example at:
https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_CodeEventOther

This kind of files will preserve part of the "information" we will have to get rid of when cleaning the data. A person more interested in that additional information could revisit those Annex datasets and build a new dataset if desired.

The key is to provide metadata dictionaries describing the changes.

I have been commenting to @erictleung about the need to maintain consistency:

in naming
in variable types per set of questions
the nesting
the identification of missing values
etc.

The lesser the number of inconsistencies found in the Totally Clean dataset, the better. Also, an important aspect is to provide a robust metadata file as much as we can.

clean-data/clean-data.R

+
+    ## Normalize variations of "None"
+    nones <- c("non", "none", "haven't", "havent", "not", "nothing",
+               "didn't", "n/a", "\bna\b", "never", "nil", "nope")


evaristoc · 2016-05-07T09:30:47Z

@QuincyLarson I am not sure if you agree with keeping several files? As owner of the project, you might have the final decision. I understand that for you the best is to keep the last file ONLY, but be aware that our decisions when cleaning, even if well guided or well intended, will be always arbitrary ones, and they will risk information that someone could find interesting.

Whatever the case I will always insist in proper metadata dict and change file.

SamAI-Software · 2016-05-07T10:15:31Z

My proposal has been to supply different levels of files:

Raw datasets

Totally Clean dataset

Annex datasets

@evaristoc sounds great!

clean-data/clean-data.R

+        mutate(PodcastNone = "1")
+    cleanPart1 <- cleanPart1 %>% filter(!nonesPodIdx) %>%
+        bind_rows(nonesPodData)
+


evaristoc · 2016-05-07T12:15:09Z

Quick before having to go:
@SamAI-Software about maintaining a max. of 100 hours (weekly): I agree.
Also, about all questions with the "Other" open answer option:

the idea is to vectorise/factorise the categories; the "Other" options SHOULD disappear or contain Unknowns in the worst case.

I have commented this to @erictleung. Multi-answered question and or an Other option should be vectorised instead of giving a Other in categorical format and the related questions in Boolean.

@erictleung this is the current challenge we are facing with those questions. Also, as @QuincyLarson suggested it would be better if we give a completely digested file to users. Totally agree:

Final user shouldn't be bothered in trying to normalize / parse any values: only in directly making data representations/visualizations; otherwise asking someone to work on some particular values would be painstaking.

Our goal should be to parse and vectorize the values, even if we have to make arbitrary decisions. Those arbitrary decisions would be always documented though.

I personally agree that the SIMPLEST and probably the BEST change file we can suggest is in fact YOUR code, @erictleung, and likely this thread.

clean-data/README.md

+
+- [R](https://www.r-project.org/) (>= 3.2.3)
+- [dplyr](https://github.com/hadley/dplyr) (>= 0.4.3)
+


evaristoc · 2016-05-09T11:55:03Z

@QuincyLarson @SamAI-Software

I wrote to @erictleung :

I gave it a second thought and realised that even trying a better categorisation of the variables

PodcastOther
CodeEventOther
ResourceOther

is not necessarily informative enough. For example: the name I am giving is mostly arbitrary: would that name help the user to identify and find the resource online if desired?

Some answers are not easy to solve. For example: There are cases where some people reported attending "meetups" without specifying what for meetup, while other ones specified a name of specific meetups, but it is a meetup at the end. So

how shall we define those categories in order to provide enough info without trying to go too far with detailed naming?

Considering the quality of the data and the difficulties to take clearcut decision about how to operationalize some of the responses, I think that we are better off by NOT trying to vectorize all that info of the aforementioned variables. Otherwise we could end up unnecessarily obfuscating the Totally Clean Dataset.

In order to support the users what we can do is offering Annexes in a similar form as the following:
https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_EventsOtherDrafted.csv

with a tentative, partial operationalization, without cross-comparison (there are categories that users tended to repeat between questions). The user could use those tentative, informal definitions while still invited to propose a personal one that could better work for her analysis.

clean-data/clean-data.R

+    ## Remove outlier months of programming
+    cleanPart1 <- cleanPart1 %>%
+        mutate(MonthsProgramming = remove_outlier(MonthsProgramming, 744))
+


SamAI-Software · 2016-05-12T03:05:40Z

@erictleung latest dataset (10h ago) looks good 👍

The only question is about consistency - why booleans are sometimes factors and sometimes integers?

Integer booleans: IsSoftwareDev, JobRelocate, BootcampYesNo, BootcampFinish, BootcampFullJobAfter, BootcampLoan, BootcampRecommend, CodeEventCoffee, CodeEventHackathons, CodeEventConferences, CodeEventNodeSchool, CodeEventRailsBridge, CodeEventStartUpWknd, CodeEventWomenCode, CodeEventGirlDev, CodeEventNone, PodcastCodeNewbie, PodcastChangeLog, PodcastSEDaily, PodcastJSJabber, PodcastNone

Factor booleans: ResourceEdX, ResourceCoursera, ResourceFCC, ResourceKhanAcademy, ResourcePluralSight, ResourceCodeacademy, ResourceUdacity, ResourceUdemy, ResourceCodeWars, ResourceOdinProj, ResourceDevTips,

And why ExpectedEarning, HoursLearning, MonthsProgramming are integer, while BootcampPostSalary, MoneyForLearning, BootcampMonthsAgo are numeric?

Other bugs I'll comment as usually on the code later on today, but the data is already looking pretty clean and shiny :)

erictleung · 2016-05-12T17:57:32Z

@SamAI-Software they are different because of how they are inherently read into R (I'm assuming you're using R to read them in).

I still need to do a pass over all of the variables and force a certain data type. I'll have to double check the integer and numeric values. I think it has to do with some values being 0.0 or something with a decimal point.

evaristoc · 2016-05-12T20:41:54Z

@SamAI-Software which were your conventions for HoursLearning at #40?

Not sure if added to the datasets, @erictleung?

summary(as.factor(part1$HoursLearning))

Give some good but also weird values (a few though):
one 2
0.2 1
0.5 1
.1 1
100000000000000 1
100 hours per week 1
10-15 1
12321231231232123123123123123123123 1
14 hours 1
15-20 1
2-20 1
.25 1
2.5 1
300000000000000000000 1
3-4 1
40-50 1
4-6 1
5-7 1
5-8 1
6-8 1
(Other) 11
NA 788

Just looking at the datasets I didn't find any changes...

evaristoc · 2016-05-12T20:42:40Z

@erictleung @SamAI-Software

we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

evaristoc · 2016-05-12T20:47:26Z

@erictleung :

Age: two records over 100 in part2

SamAI-Software · 2016-05-13T02:55:23Z

@evaristoc are you sure that you take data here?

str(as.factor(data.Learn$HoursLearning))
summary(as.factor(data.Learn$HoursLearning))

I have no problems, 73 levels from 0 to 100

which were your conventions for HoursLearning at #40?

## Remove the word "hour(s)"
## Remove hyphen and "to" for ranges of hours
## Remove hours greater than 100 hours
And of course round decimal numbers

we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

Yes, you are absolutely right, we need many conventions for the second dataset, feel free to find all weird answers and suggest your solutions, because today I'll be focused on #41 and on first dataset, there are still bugs to be fixed.

- Finished cleaning income function - Removed changing ExpectedEarning to integer - Remove unnecessary cleaning

- Check for inconsistencies between job role interests - Remove unnecessary columns

erictleung force-pushed the clean-and-combine-data branch 2 times, most recently from ea7e9a0 to 5c8add3 Compare May 6, 2016 03:40

SamAI-Software reviewed May 6, 2016
View reviewed changes

erictleung mentioned this pull request May 6, 2016

Normalizing data #33

Closed

SamAI-Software reviewed May 7, 2016
View reviewed changes

clean-data/clean-data.R

## Normalize variations of "None"

nones <- c("non", "none", "haven't", "havent", "not", "nothing",

"didn't", "n/a", "\bna\b", "never", "nil", "nope")

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

SamAI-Software reviewed May 7, 2016
View reviewed changes

clean-data/clean-data.R

mutate(PodcastNone = "1")

cleanPart1 <- cleanPart1 %>% filter(!nonesPodIdx) %>%

bind_rows(nonesPodData)

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

SamAI-Software reviewed May 8, 2016
View reviewed changes

erictleung force-pushed the clean-and-combine-data branch 2 times, most recently from 8359985 to fa56e84 Compare May 10, 2016 15:02

SamAI-Software reviewed May 11, 2016
View reviewed changes

erictleung force-pushed the clean-and-combine-data branch 2 times, most recently from 01ac7a1 to 7f185c3 Compare May 11, 2016 16:15

evaristoc mentioned this pull request May 12, 2016

Survey Datasets Transformation and Re-Configuration #26

Closed

erictleung added 29 commits May 17, 2016 12:26

Separate function for cleaning money for learning

bd9b112

Add description to entire script

e77c24b

Floor values and remove outliers in money to learning

2b03580

Create function for cleaning age

177b6e3

Initialize functions for columns needing cleaning

3b80f03

Create new boolean column for PodcastOther

10f3161

Fix feedback message for cleaning hours learning

2bb7c1c

Update draft of complete data

465e574

Remove boolean Podcast Other column

278cb58

Finish cleaning income and remove extras

aef6ad8

- Finished cleaning income function - Removed changing ExpectedEarning to integer - Remove unnecessary cleaning

Remove "Other" from new podcast cols

590e4ec

Finish cleaning commute times

3851e9d

Update code events cleaning to make new cols

7b849cc

Clean other resources

1f827f3

Update code events threshold to 1.5% frequency

486ea58

Update detail on cutoff for other podcasts is 1.5%

f7597ce

Add Bootcamp Name into joining key

de85dc1

Add back in podcast and events from 2nd dataset

afb8ce7

Make ages less than 10 to NA

379c932

Convert resources to boolean

1f1fe3d

Finish cleaning data with consistency check

bf1ff93

- Check for inconsistencies between job role interests - Remove unnecessary columns

Remove "Other" from new Podcast columns

e3e5a19

Clean student debt owed

a63c2a4

Add CodeEvent column to columns removed

d3ae597

Write final polish of data

c4a7f8d

Fix small spelling mistakes

5e88628

Update final dataset

f668cb4

Remove first dataset

660fd6c

Update script date

8a1dfcd

QuincyLarson merged commit 4c903b0 into freeCodeCamp:master May 18, 2016


		- [R](https://www.r-project.org/) (>= 3.2.3)
		- [dplyr](https://github.com/hadley/dplyr) (>= 0.4.3)

Uh oh!

[WIP] Add script to clean and combine data, and add data #29

[WIP] Add script to clean and combine data, and add data #29

Uh oh!

Conversation

erictleung commented May 5, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QuincyLarson commented May 5, 2016

Uh oh!

SamAI-Software commented May 5, 2016

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

evaristoc commented May 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

evaristoc commented May 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SamAI-Software commented May 7, 2016

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

evaristoc commented May 7, 2016

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

evaristoc commented May 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

SamAI-Software commented May 12, 2016

Uh oh!

erictleung commented May 12, 2016

Uh oh!

evaristoc commented May 12, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

evaristoc commented May 12, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

evaristoc commented May 12, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

erictleung commented May 5, 2016 •

edited

Loading

evaristoc commented May 7, 2016 •

edited

Loading

evaristoc commented May 7, 2016 •

edited

Loading

evaristoc commented May 9, 2016 •

edited

Loading

evaristoc commented May 12, 2016 •

edited

Loading

evaristoc commented May 12, 2016 •

edited

Loading

evaristoc commented May 12, 2016 •

edited

Loading