Update cleaning of dataset #72
@erictleung I am unfamiliar with R so I don't feel qualified to QA this, but all of these changes you described sound sane :)
The issues might be because some people already did their analyses, so changing variable names will break their code. Mine, too.
Right, I understand that I've changed those variable names and it will break some people's code. @evaristoc and I discussed the reason for renaming the variable names with I guess it is not too urgent that those variable names be changed. I can revert them back and just make a note of it in the The most important part of the change is the normalization part to address issue #33.
This part seems to be fine. If you revert the old variable names and add a note into
@SamAI-Software awesome, I'll try to get to it later tonight.
896a533 to 3069b4a
@SamAI-Software updated my PR! I reverted the major change of adding Feel free to pull down my PR and QA check the dataset. Let me know if there's anything else of concern 😃
- Change commute times >300 minutes to NA
- Change minimum mortgage to $1000 and maximum mortgage to $1000000
- Move data dictionary into `clean-data/` directory
- Change minimum student debt to $1000 and maximum debt to $500000
- Add changelog to clean data README
- Remove missing data encoding information in README
- Add example exploration in clean data README
- Add figure of distribution of ages in dataset
- Clean data for children to make sure number of children and yes/no answer to having children is consistent
- Fix spelling mistake in `IsReceiveDisabilitiesBenefits` (original: `IsReceiveDiabilitiesBenefits`)
- Use `ungroup()` command in `time_diff_check` because of `dplyr` version changes
- Separate polishing steps for podcasts, resources, and so on to make it easier to see what is being polished
- Update survey data dictionary description with details on the two datasets and parts of the survey
- Update survey data
- Update version numbers for R packages
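For anyone QA-ing this without reading R, the three range rules in the commit message above can be sketched in plain Python. This is only an illustration and not the code in this PR: the column names (`CommuteTime`, `HomeMortgageOwe`, `StudentDebtOwe`) are hypothetical, and it assumes that mortgage and debt values outside the stated minimum and maximum are treated as missing, which the commit message states explicitly only for commute times.

```python
from typing import Optional

# Hypothetical column names; the bounds come from the commit message.
# Each entry maps a column to (minimum, maximum); None means "no bound".
LIMITS = {
    "CommuteTime": (None, 300),             # minutes; values >300 become missing
    "HomeMortgageOwe": (1_000, 1_000_000),  # dollars
    "StudentDebtOwe": (1_000, 500_000),     # dollars
}


def keep_if_in_range(value: Optional[float],
                     low: Optional[float],
                     high: Optional[float]) -> Optional[float]:
    """Return value if it lies within [low, high], otherwise None (missing)."""
    if value is None:
        return None
    if low is not None and value < low:
        return None
    if high is not None and value > high:
        return None
    return value


def clean_row(row: dict) -> dict:
    """Return a copy of row with out-of-range values replaced by None."""
    cleaned = dict(row)
    for column, (low, high) in LIMITS.items():
        if column in cleaned:
            cleaned[column] = keep_if_in_range(cleaned[column], low, high)
    return cleaned


if __name__ == "__main__":
    example = {"CommuteTime": 480, "HomeMortgageOwe": 250_000, "StudentDebtOwe": 50}
    print(clean_row(example))
```

Values exactly on a bound (a 300-minute commute, a $1000 mortgage) are kept in this sketch; whether the R script treats the bounds as inclusive is worth checking against the cleaned dataset.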
3069b4a to 5890a46
LGTM
- Change naming of some columns that were originally extracted from "Other" columns in the dataset to reflect that the columns were derived, rather than originally being there
cc/ @evaristoc @SamAI-Software @QuincyLarson
Close #33