This repository has been archived by the owner on Jun 23, 2020. It is now read-only.

[WIP] Add script to clean and combine data, and add data #29

Merged
merged 48 commits into from
May 18, 2016

@erictleung
Member

erictleung commented May 5, 2016

cc/ @QuincyLarson @evaristoc Feel free to comment on aspects of the changes I'll be making. I figured it would be easier and faster to get feedback by using GitHub's feature to comment on PR changes.

Closes #26

Checklist

  • Rename column names
  • Group code that works on a single question into separate functions
  • Clean first dataset
    • Which role are you most interested in?
    • Expected earnings at first developer job
    • Coding Events marked as "Other"
    • Podcasts marked as "Other"
    • Number of hours spent learning
    • "Number of months programming"
    • Months ago finished bootcamp
    • Salary after bootcamp
    • Money used for learning
  • Clean second dataset (the questions below will require effort to clean, i.e. they are not just categorical or boolean)
    • "How old are you?"
    • University major, if applicable
    • Mortgage amount
    • Employment status for other
    • Which field do you work in? (Possibly)
    • Money made last year
    • Minutes of commute
  • Polish data for finishing e.g. remove inconsistent data
  • Combine data into one
  • Update root README with information on the data
  • Rename original data/ directory to raw-data/
  • Update README on the cleaned data

Commit Message

  • Update survey data dictionary with left out questions
  • Update survey data dictionary with variable/column names for questions
  • Add script clean-data.R to clean and combine the two survey datasets into
    one for ease of analysis
  • Create the combined survey dataset after running clean-data.R
  • Create README.md file to explain cleaned data and the script to produce it

@QuincyLarson
Contributor

@erictleung Rather than giving people a script, I say we just give them the cleaned data

So if you can run your script and verify it worked, then we should remove the old csv files and replace them with your unified (and cleaned) csv file

You can commit the R script if you want for archival purposes, but I think 99.9% of the people going to the repo will just want a polished final CSV - they won't care as much about the details of our implementation

@SamAI-Software
Member

Both variants would be good to have in case of any bugs

@erictleung force-pushed the clean-and-combine-data branch 2 times, most recently from ea7e9a0 to 5c8add3 on May 6, 2016 03:40

cleanPart1 <- cleanPart1 %>% filter(!numericIdx) %>%
bind_rows(numericData)

# Make all expected earnings less than 100 multiplied by 1000
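
The rule in the comment above can be sketched in base R as follows. This is only an illustration, not the code from clean-data.R: the sample vector is made up, and treating values below 100 as shorthand for thousands is the assumption stated in the comment.

```r
# Hypothetical sample values; the real data would come from the
# ExpectedEarning column of cleanPart1.
expected <- c(50, 65000, 99, 100, NA, 120000)

# Assumption: an answer below 100 is shorthand for thousands (50 -> 50000).
# NA < 100 evaluates to NA, so missing answers are excluded explicitly.
isShorthand <- !is.na(expected) & expected < 100
expected[isShorthand] <- expected[isShorthand] * 1000

print(expected)
```

A value of exactly 100 is left untouched here, because the comment says "less than 100"; whether such a value should instead be treated as an outlier is a separate decision.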

This comment was marked as off-topic.

@erictleung erictleung mentioned this pull request May 6, 2016
## Adapted from: http://stackoverflow.com/a/26766945
undecidedWords <- c("not sure", "don't know", "not certain",
"unsure", "dont know", "undecided",
"no preference", "not", "any", "no idea")

@evaristoc
Collaborator

evaristoc commented May 7, 2016

@SamAI-Software about your question above:

> Firstly, we all should agree on one thing. Should we just cut off (into blank) all weird numbers, or should we guess what the real intention was and then try to normalize it? And it's not only about expected salary, but about all questions with open answers with numbers.

Yes, even if that will be arbitrary. The most rigorous option is to treat such values as missing or as outliers.

Similarly, I have been in communication with @erictleung about:

> Both variants would be good to have in case of any bugs

My proposal has been to supply different levels of files:

  • Raw datasets
  • Totally Clean dataset
  • Annex datasets

The Totally Clean dataset is ours, with all the parsing plus our arbitrary interpretations of the meaning of the values. Annex datasets could be intermediate ones containing unchanged values for some of the variables that required the most arbitrary changes, for example all open questions like "Other". See an example at:
https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_CodeEventOther

Files of this kind preserve part of the "information" we will have to get rid of when cleaning the data. A person more interested in that additional information could revisit those Annex datasets and build a new dataset if desired.

The key is to provide metadata dictionaries describing the changes.

I have been commenting to @erictleung about the need to maintain consistency:

  • in naming
  • in variable types per set of questions
  • the nesting
  • the identification of missing values
  • etc.

The fewer inconsistencies found in the Totally Clean dataset, the better. It is also important to provide as robust a metadata file as we can.


## Normalize variations of "None"
nones <- c("non", "none", "haven't", "havent", "not", "nothing",
           "didn't", "n/a", "\\bna\\b", "never", "nil", "nope")
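
As a rough sketch of how a word list like this might be applied, the snippet below collapses it into one regular expression and flags matching answers. The sample answers and the names `nonePattern` and `isNone` are hypothetical. Note that in an R string literal a regex word boundary has to be written `"\\b"`; a single `"\b"` is the backspace character.

```r
nones <- c("non", "none", "haven't", "havent", "not", "nothing",
           "didn't", "n/a", "\\bna\\b", "never", "nil", "nope")

# Join the words into a single alternation pattern: "non|none|...|nope".
nonePattern <- paste(nones, collapse = "|")

# Hypothetical free-text answers to an "Other" question.
answers <- c("None", "I haven't listened to any", "NA",
             "Banana Podcast", "Ruby Rogues", NA)

# Case-insensitive match; missing answers are never flagged.
isNone <- !is.na(answers) & grepl(nonePattern, answers, ignore.case = TRUE)

print(data.frame(answers, isNone))
```

The word boundary keeps "Banana Podcast" from matching on "na". The other entries are plain substrings, so a short one like "not" would also match inside a word such as "another"; that trade-off is worth checking against the real answers.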

@evaristoc
Collaborator

evaristoc commented May 7, 2016

@QuincyLarson I am not sure whether you agree with keeping several files. As owner of the project, the final decision may be yours. I understand that you would prefer to keep ONLY the final file, but be aware that our cleaning decisions, even if well guided or well intended, will always be arbitrary, and they risk discarding information that someone could find interesting.

Whatever the case, I will always insist on a proper metadata dictionary and change file.

@SamAI-Software
Member

> My proposal has been to supply different levels of files:
>
>   • Raw datasets
>   • Totally Clean dataset
>   • Annex datasets

@evaristoc sounds great!

mutate(PodcastNone = "1")
cleanPart1 <- cleanPart1 %>% filter(!nonesPodIdx) %>%
bind_rows(nonesPodData)

@evaristoc
Collaborator

Quick before having to go:
@SamAI-Software about maintaining a max. of 100 hours (weekly): I agree.
Also, about all questions with the "Other" open answer option:

the idea is to vectorise/factorise the categories; the "Other" options SHOULD disappear or contain Unknowns in the worst case.

I have mentioned this to @erictleung. A multi-answer question with an "Other" option should be vectorised, instead of giving "Other" in categorical format and the related questions as Booleans.

@erictleung this is the current challenge we are facing with those questions. Also, as @QuincyLarson suggested it would be better if we give a completely digested file to users. Totally agree:

Final user shouldn't be bothered in trying to normalize / parse any values: only in directly making data representations/visualizations; otherwise asking someone to work on some particular values would be painstaking.

Our goal should be to parse and vectorize the values, even if we have to make arbitrary decisions. Those arbitrary decisions would always be documented, though.

I personally agree that the SIMPLEST and probably the BEST change file we can suggest is in fact YOUR code, @erictleung, and likely this thread.


- [R](https://www.r-project.org/) (>= 3.2.3)
- [dplyr](https://github.com/hadley/dplyr) (>= 0.4.3)

@evaristoc
Collaborator

evaristoc commented May 9, 2016

@QuincyLarson @SamAI-Software

I wrote to @erictleung :

I gave it a second thought and realised that even trying a better categorisation of the variables

  • PodcastOther
  • CodeEventOther
  • ResourceOther

is not necessarily informative enough. For example: the name I am giving is mostly arbitrary: would that name help the user to identify and find the resource online if desired?

Some answers are not easy to resolve. For example, some people reported attending "meetups" without specifying which meetup, while others named specific meetups, but in the end it is all a meetup. So

how shall we define those categories in order to provide enough info without trying to go too far with detailed naming?

Considering the quality of the data and the difficulty of making clear-cut decisions about how to operationalize some of the responses, I think we are better off NOT trying to vectorize all the information in the aforementioned variables. Otherwise we could end up unnecessarily obfuscating the Totally Clean Dataset.

To support users, we can offer Annexes in a form similar to the following:
https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_EventsOtherDrafted.csv

with a tentative, partial operationalization, without cross-comparison (there are categories that users tended to repeat between questions). The user could use those tentative, informal definitions while still invited to propose a personal one that could better work for her analysis.

@erictleung force-pushed the clean-and-combine-data branch 2 times, most recently from 8359985 to fa56e84 on May 10, 2016 15:02

## Remove outlier months of programming
cleanPart1 <- cleanPart1 %>%
mutate(MonthsProgramming = remove_outlier(MonthsProgramming, 744))
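
The definition of `remove_outlier` is not shown in this excerpt. A minimal sketch of what such a helper could look like, assuming it blanks out values strictly above the threshold (744 months is 62 years), is:

```r
# Hypothetical re-implementation; the real remove_outlier() in clean-data.R
# may differ (for example, it might use >= instead of >).
remove_outlier <- function(x, thres) {
  # Replace values above the threshold with NA, keeping existing NAs.
  x[!is.na(x) & x > thres] <- NA
  x
}

# Made-up sample of months spent programming.
months <- c(6, 744, 745, NA, 10000)
print(remove_outlier(months, 744))
```

Assigning into `x` rather than using `ifelse()` keeps the original vector type, which helps when integer and numeric columns have to stay consistent.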

@erictleung force-pushed the clean-and-combine-data branch 2 times, most recently from 01ac7a1 to 7f185c3 on May 11, 2016 16:15

@SamAI-Software
Member

@erictleung latest dataset (10h ago) looks good 👍

The only question is about consistency: why are booleans sometimes factors and sometimes integers?

Integer booleans: IsSoftwareDev, JobRelocate, BootcampYesNo, BootcampFinish, BootcampFullJobAfter, BootcampLoan, BootcampRecommend, CodeEventCoffee, CodeEventHackathons, CodeEventConferences, CodeEventNodeSchool, CodeEventRailsBridge, CodeEventStartUpWknd, CodeEventWomenCode, CodeEventGirlDev, CodeEventNone, PodcastCodeNewbie, PodcastChangeLog, PodcastSEDaily, PodcastJSJabber, PodcastNone

Factor booleans: ResourceEdX, ResourceCoursera, ResourceFCC, ResourceKhanAcademy, ResourcePluralSight, ResourceCodeacademy, ResourceUdacity, ResourceUdemy, ResourceCodeWars, ResourceOdinProj, ResourceDevTips

And why are ExpectedEarning, HoursLearning, and MonthsProgramming integer, while BootcampPostSalary, MoneyForLearning, and BootcampMonthsAgo are numeric?

I'll comment on the other bugs in the code later today, as usual, but the data is already looking pretty clean and shiny :)

@erictleung
Member Author

@SamAI-Software they are different because of how they are inherently read into R (I'm assuming you're using R to read them in).

I still need to do a pass over all of the variables and force a certain data type. I'll have to double check the integer and numeric values. I think it has to do with some values being 0.0 or something with a decimal point.
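
One way to do such a pass, sketched in base R on a made-up three-row data frame (the column names come from the lists above, the values are hypothetical), is to convert every boolean column through `as.character()` before `as.integer()`:

```r
# Hypothetical sample mixing the types described above.
survey <- data.frame(
  IsSoftwareDev    = c(1L, 0L, NA),               # read in as integer
  ResourceFCC      = factor(c("1", "0", "1")),    # read in as factor
  MoneyForLearning = c(0.0, 200, 1500.5)          # read in as numeric
)

boolCols <- c("IsSoftwareDev", "ResourceFCC")

# as.integer() on a factor returns the internal level codes (here 2, 1, 2),
# so go through as.character() to recover the 0/1 values themselves.
survey[boolCols] <- lapply(survey[boolCols],
                           function(col) as.integer(as.character(col)))

# Money amounts stay numeric so that decimal values are preserved.
survey$MoneyForLearning <- as.numeric(survey$MoneyForLearning)

str(survey)
```

Which columns end up integer and which numeric is still a project decision; the sketch only shows a mechanical way to enforce it once decided.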

@evaristoc
Collaborator

evaristoc commented May 12, 2016

@SamAI-Software which were your conventions for HoursLearning at #40?

Not sure if added to the datasets, @erictleung?

summary(as.factor(part1$HoursLearning))

It gives mostly good values, but also a few weird ones:
one 2
0.2 1
0.5 1
.1 1
100000000000000 1
100 hours per week 1
10-15 1
12321231231232123123123123123123123 1
14 hours 1
15-20 1
2-20 1
.25 1
2.5 1
300000000000000000000 1
3-4 1
40-50 1
4-6 1
5-7 1
5-8 1
6-8 1
(Other) 11
NA 788

Just looking at the datasets I didn't find any changes...

@evaristoc
Collaborator

evaristoc commented May 12, 2016

@erictleung @SamAI-Software

we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

@evaristoc
Collaborator

evaristoc commented May 12, 2016

@erictleung :

  • Age: two records over 100 in part2

@SamAI-Software
Member

@evaristoc are you sure you are taking the data from here?

str(as.factor(data.Learn$HoursLearning))
summary(as.factor(data.Learn$HoursLearning))

I have no problems, 73 levels from 0 to 100

> which were your conventions for HoursLearning at #40?

## Remove the word "hour(s)"
## Remove hyphen and "to" for ranges of hours
## Remove hours greater than 100 hours
And of course round decimal numbers
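
The four conventions above could be combined roughly as follows. This is a sketch, not the code from #40: `clean_hours` is a hypothetical name, and replacing a range such as "4-6" with its midpoint is an assumption, since the thread only says the hyphen and "to" are removed.

```r
# Hypothetical helper applying the HoursLearning conventions listed above.
clean_hours <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("hours?", "", x)             # remove the word "hour(s)"
  x <- gsub("\\s+to\\s+|-", " ", x)      # "4-6" / "6 to 8" -> "4 6" / "6 8"
  parts <- strsplit(trimws(x), "\\s+")
  hours <- vapply(parts, function(p) {
    nums <- suppressWarnings(as.numeric(p))
    # Anything that is not purely numeric (e.g. "one") becomes NA.
    # Assumption: a range is replaced by its midpoint.
    if (length(nums) == 0 || anyNA(nums)) NA_real_ else mean(nums)
  }, numeric(1))
  hours[!is.na(hours) & hours > 100] <- NA   # remove hours greater than 100
  round(hours)                               # round decimal numbers
}

# Sample answers, partly taken from the summary output earlier in the thread.
raw <- c("14 hours", "4-6", "6 to 8", ".25", "100000000000000",
         "one", "100 hours per week", NA)
print(clean_hours(raw))
```

In this sketch "100 hours per week" becomes NA, because "per week" is left over after the word "hours" is removed; handling such answers would need an extra rule. Also note that R's `round()` rounds halves to the even digit, so a midpoint of 2.5 becomes 2.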

> we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

Yes, you are absolutely right, we need many conventions for the second dataset. Feel free to find all the weird answers and suggest your solutions; today I'll be focused on #41 and on the first dataset, where there are still bugs to fix.

erictleung added 29 commits May 17, 2016 12:26
- Finished cleaning income function
- Removed changing ExpectedEarning to integer
- Remove unnecessary cleaning
- Check for inconsistencies between job role interests
- Remove unnecessary columns
@QuincyLarson QuincyLarson merged commit 4c903b0 into freeCodeCamp:master May 18, 2016

4 participants