[WIP] Add script to clean and combine data, and add data #29
Conversation
@erictleung Rather than giving people a script, I say we just give them the cleaned data. So if you can run your script and verify it worked, then we should remove the old CSV files and replace them with your unified (and cleaned) CSV file. You can commit the R script if you want for archival purposes, but I think 99.9% of the people going to the repo will just want a polished final CSV - they won't care as much about the details of our implementation.
Both variants would be good to have in case of any bugs.
Force-pushed from ea7e9a0 to 5c8add3
cleanPart1 <- cleanPart1 %>% filter(!numericIdx) %>%
  bind_rows(numericData)

# Multiply all expected earnings less than 100 by 1000
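The comment above refers to respondents who presumably reported their expected salary in thousands (e.g. "50" meaning $50,000). A minimal sketch of that scaling step, assuming a dplyr pipeline and the `ExpectedEarning` column name used elsewhere in this thread; the toy data is illustrative only:

```r
library(dplyr)

# Toy data; the real values come from the survey CSV
cleanPart1 <- data.frame(ExpectedEarning = c(50, 45000, 80, 120000))

# Values under 100 are assumed to be reported in thousands of dollars
cleanPart1 <- cleanPart1 %>%
  mutate(ExpectedEarning = ifelse(!is.na(ExpectedEarning) & ExpectedEarning < 100,
                                  ExpectedEarning * 1000,
                                  ExpectedEarning))

cleanPart1$ExpectedEarning  # 50000 45000 80000 120000
```

Keeping `NA` values untouched matters here: a bare `ExpectedEarning < 100` comparison would propagate `NA` into the condition.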
## Adapted from: http://stackoverflow.com/a/26766945
undecidedWords <- c("not sure", "don't know", "not certain",
                    "unsure", "dont know", "undecided",
                    "no preference", "not", "any", "no idea")
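One way a word list like this can be applied (a sketch, not necessarily the script's exact approach) is to collapse it into a single case-insensitive alternation pattern and blank out matching free-text answers:

```r
undecidedWords <- c("not sure", "don't know", "not certain",
                    "unsure", "dont know", "undecided",
                    "no preference", "not", "any", "no idea")

# Join the phrases into one regex alternation
undecidedPattern <- paste(undecidedWords, collapse = "|")

answers <- c("Not sure yet", "Front-End Developer", "no idea")
cleaned <- ifelse(grepl(undecidedPattern, answers, ignore.case = TRUE),
                  NA, answers)
cleaned  # NA "Front-End Developer" NA
```

One caveat: bare entries like "not" and "any" also match inside longer words (e.g. "company"), so word boundaries may be needed, as the `\bna\b` entry in the "None" list further down the diff already does.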
@SamAI-Software about your question above:
Yes, even if that will be arbitrary. The most rigorous option is marking such values as missing or as outliers. I have likewise been in communication with @erictleung about this:
My proposal has been to supply different levels of files:
The Totally Clean dataset is ours, with the full parsing plus our arbitrary interpretations of the meaning of the values. Annex datasets could be intermediate ones containing the unchanged values for the variables that required the most arbitrary changes, for example all open questions like "Other". See an example at: Files of this kind preserve part of the "information" we will have to get rid of when cleaning the data. A person more interested in that additional information could revisit those Annex datasets and build a new dataset if desired. The key is to provide metadata dictionaries describing the changes. I have also been commenting to @erictleung about the need to maintain consistency:
The fewer the inconsistencies found in the Totally Clean dataset, the better. Another important aspect is to provide as robust a metadata file as we can.
## Normalize variations of "None"
nones <- c("non", "none", "haven't", "havent", "not", "nothing",
           "didn't", "n/a", "\bna\b", "never", "nil", "nope")
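Note the `\bna\b` entry: without word boundaries, the bare string "na" would match inside answers like "Java". A small sketch of why the boundary matters (the backslash is escaped as `\\b` in R source):

```r
nones <- c("non", "none", "haven't", "havent", "not", "nothing",
           "didn't", "n/a", "\\bna\\b", "never", "nil", "nope")
nonesPattern <- paste(nones, collapse = "|")

res <- grepl(nonesPattern, c("na", "Java and Ruby"), ignore.case = TRUE)
res  # TRUE FALSE -- "na" inside "Java" is not a whole word
```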
@QuincyLarson I am not sure whether you agree with keeping several files? As owner of the project, you might have the final decision. I understand that for you the best option is to keep only the last file, but be aware that our decisions when cleaning, even if well guided and well intended, will always be arbitrary ones, and they risk losing information that someone could find interesting. Whatever the case, I will always insist on a proper metadata dictionary and change file.
@evaristoc sounds great!
mutate(PodcastNone = "1")
cleanPart1 <- cleanPart1 %>% filter(!nonesPodIdx) %>%
  bind_rows(nonesPodData)
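The hunk above follows a pattern used throughout the script: pull out the rows matching an index, rewrite them, then bind them back on. A self-contained sketch with toy data (the `Podcasts` column name and the regex-based index are assumptions for illustration):

```r
library(dplyr)

# Toy frame; the real script derives nonesPodIdx from the raw answers
cleanPart1 <- data.frame(Podcasts = c("none", "CodeNewbie"),
                         PodcastNone = NA_character_,
                         stringsAsFactors = FALSE)
nonesPodIdx <- grepl("none", cleanPart1$Podcasts, ignore.case = TRUE)

# Split, modify, recombine
nonesPodData <- cleanPart1 %>% filter(nonesPodIdx) %>%
  mutate(PodcastNone = "1")
cleanPart1 <- cleanPart1 %>% filter(!nonesPodIdx) %>%
  bind_rows(nonesPodData)
```

A single `mutate(ifelse(...))` over the whole frame would avoid the row reordering that `bind_rows` introduces; the split form simply mirrors the script's style.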
Quick note before having to go:
I have commented this to @erictleung. A multi-answer question, or one with an "Other" option, should be vectorised instead of giving "Other" in categorical format and the related questions in Boolean. @erictleung, this is the current challenge we are facing with those questions. Also, as @QuincyLarson suggested, it would be better if we give users a completely digested file. Totally agree:
Our goal should be to parse and vectorize the values, even if we have to make arbitrary decisions. Those arbitrary decisions would always be documented, though. I personally agree that the SIMPLEST and probably the BEST change file we can suggest is in fact YOUR code, @erictleung, and likely this thread.
- [R](https://www.r-project.org/) (>= 3.2.3)
- [dplyr](https://github.com/hadley/dplyr) (>= 0.4.3)
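A quick way to check those stated minimum versions before running the script (a sketch; `install.packages` assumes a configured CRAN mirror):

```r
# Install the dependency if missing, then verify minimum versions
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

stopifnot(getRversion() >= "3.2.3",
          packageVersion("dplyr") >= "0.4.3")
```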
I wrote to @erictleung: I gave it a second thought and realised that even trying a better categorisation of the variables is not necessarily informative enough. For example, the names I am giving are mostly arbitrary: would such a name help the user identify and find the resource online if desired? Some answers are not easy to resolve. For example, there are cases where some people reported attending "meetups" without specifying what kind of meetup, while others specified the name of a specific meetup, but it is a meetup in the end.
Considering the quality of the data and the difficulty of taking clear-cut decisions about how to operationalize some of the responses, I think we are better off NOT trying to vectorize all the information in the aforementioned variables. Otherwise we could end up unnecessarily obfuscating the Totally Clean dataset. In order to support users, what we can do is offer Annexes in a form similar to the following: a tentative, partial operationalization, without cross-comparison (there are categories that users tended to repeat between questions). Users could rely on those tentative, informal definitions while still being invited to propose personal ones that work better for their analysis.
Force-pushed from 8359985 to fa56e84
## Remove outlier months of programming
cleanPart1 <- cleanPart1 %>%
  mutate(MonthsProgramming = remove_outlier(MonthsProgramming, 744))
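`remove_outlier()` is defined elsewhere in the script; a plausible minimal implementation (an assumption, not the script's exact code) replaces values above a cutoff with `NA` rather than dropping the row. 744 months is 62 years of programming, so anything larger is treated as implausible:

```r
# Hypothetical helper: NA-out values above a cutoff, keep the row
remove_outlier <- function(x, cutoff) {
  ifelse(!is.na(x) & x > cutoff, NA, x)
}

remove_outlier(c(12, 36, 9999), 744)  # 12 36 NA
```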
Force-pushed from 01ac7a1 to 7f185c3
@erictleung the latest dataset (10h ago) looks good 👍 The only question is about consistency - why are booleans sometimes factors and sometimes integers?

Integer booleans: IsSoftwareDev, JobRelocate, BootcampYesNo, BootcampFinish, BootcampFullJobAfter, BootcampLoan, BootcampRecommend, CodeEventCoffee, CodeEventHackathons, CodeEventConferences, CodeEventNodeSchool, CodeEventRailsBridge, CodeEventStartUpWknd, CodeEventWomenCode, CodeEventGirlDev, CodeEventNone, PodcastCodeNewbie, PodcastChangeLog, PodcastSEDaily, PodcastJSJabber, PodcastNone

Factor booleans: ResourceEdX, ResourceCoursera, ResourceFCC, ResourceKhanAcademy, ResourcePluralSight, ResourceCodeacademy, ResourceUdacity, ResourceUdemy, ResourceCodeWars, ResourceOdinProj, ResourceDevTips,

And why are ExpectedEarning, HoursLearning, and MonthsProgramming integer, while BootcampPostSalary, MoneyForLearning, and BootcampMonthsAgo are numeric? Other bugs I'll comment on the code as usual later today, but the data is already looking pretty clean and shiny :)
@SamAI-Software they are different because of how they are inherently read into R (I'm assuming you're using R to read them in). I still need to do a pass over all of the variables and force consistent data types. I'll have to double-check the integer and numeric values; I think it has to do with some values being 0.0 or otherwise having a decimal point.
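The factor/integer mismatch @SamAI-Software describes typically comes from `read.csv`'s type inference. One fix (a sketch with toy columns named after the thread) is to coerce boolean columns explicitly; note that a factor must go through `as.character` first, or you get its internal level codes instead of its values:

```r
# A column read as factor from "0"/"1" strings vs. one read as integer
df <- data.frame(ResourceFCC = factor(c("0", "1", "1")),
                 IsSoftwareDev = c(0L, 1L, 1L))

# Wrong: as.integer(factor) would return level codes (1, 2, 2)
# Right: convert to character first, then to integer
df$ResourceFCC <- as.integer(as.character(df$ResourceFCC))

sapply(df, class)  # both columns are now "integer"
```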
@SamAI-Software which were your conventions for
Not sure if they were added to the datasets, @erictleung?
summary(as.factor(part1$HoursLearning))
gives some good but also a few weird values. Just looking at the datasets, I didn't find any changes...
we need some convention for
@evaristoc are you sure that you took the data from here?
I have no problems - 73 levels, from 0 to 100.
Yes, you are absolutely right, we need many conventions for the second dataset. Feel free to find all the weird answers and suggest solutions, because today I'll be focused on #41 and on the first dataset - there are still bugs to be fixed.
- Finished cleaning income function
- Removed changing ExpectedEarning to integer
- Remove unnecessary cleaning
- Check for inconsistencies between job role interests
- Remove unnecessary columns
cc/ @QuincyLarson @evaristoc Feel free to comment on aspects of the changes I'll be making. I figured it would be easier and faster to get feedback by using GitHub's feature to comment on PR changes.
Closes #26
Checklist
- Add README with information on the data
- Move data/ directory to raw-data/
- Update README on the cleaned data

Commit Message
- Add clean-data.R to clean and combine the two survey datasets into one for ease of analysis

clean-data.R