Find better data examples #46

Open
bensoltoff opened this issue Aug 14, 2017 · 14 comments

@bensoltoff
Member

Needs more social scientific data examples to practice skills, not just datasets of convenience

@bensoltoff
Member Author

Police shootings dataset

@bensoltoff bensoltoff added the hard label Jun 6, 2018
This was referenced Jun 6, 2018
@bensoltoff bensoltoff self-assigned this Oct 22, 2019
@bensoltoff
Member Author

Need to replace

  • diamonds
  • mtcars
  • mpg
  • Auto

Questionable datasets

  • titanic
  • flights
  • gapminder

@bensoltoff
Member Author

Draw examples from Kieran Healy's socviz package.

  • opiates
  • gss_sm
  • gss_lon

@bensoltoff
Member Author

Deadest names - see #115

@deblnia
Contributor

deblnia commented Aug 14, 2020

Still working on this, specifically with Movies and Snapchat data. (Repo is very minimal now.) Will also look into socviz, deadest names and police shootings.

Two other options (would love to hear what you think): palmerpenguins instead of diamonds, and recent-ish O'Hare/Midway data pulled with anyflights instead of flights.

EDIT: Also flagging Damon Jones' scrape of UCPD stops as a potential alternative to the WaPo Police shooting dataset.

@bensoltoff
Member Author

Palmer penguins is supposed to be a good drop-in replacement for iris. Not sure if it contains sufficient variables to replace diamonds. We'd need to check how diamonds is used on the website to verify the penguins dataset contains appropriate variables.

Chicago flights data would be nice to replace nycflights13, though I think I only use it for one set of exercises for relational joins.

@bensoltoff
Member Author

Flagging @YinsuH on this. She's working as an RA for me this summer through SISRM.

@YinsuH

YinsuH commented Aug 15, 2020

I have looked at the penguins dataset and the lecture notes. I would say the penguins data is viable for most of the operations we need. For instance, it could be used for practicing the pipe and writing functions. However, one potentially significant problem is that it contains only 344 observations, while diamonds has more than 20k. In the exercises we use characteristics like color and cut, each of which has five or more levels, whereas the qualitative variables species and island in penguins have only three possible values each. This suggests a lack of variability in the penguins data, which might cause problems in modeling and make the visualizations less diverse than the figures produced with diamonds.
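
The comparison above can be checked directly. The following is a minimal sketch, not code from the course site: `summarise_categories()` is a hypothetical helper name, and the calls on diamonds and penguins assume the ggplot2 and palmerpenguins packages are installed.

```r
# Hypothetical helper (not part of the course site): report the row count and
# the number of distinct non-missing values in each categorical column.
summarise_categories <- function(df) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  n_levels <- vapply(
    df[is_cat],
    function(x) length(unique(x[!is.na(x)])),
    integer(1)
  )
  list(n_rows = nrow(df), n_levels = n_levels)
}

# iris ships with base R, so this call runs without extra packages.
summarise_categories(iris)

# The two datasets under discussion, if their packages are installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(summarise_categories(ggplot2::diamonds))
}
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  print(summarise_categories(palmerpenguins::penguins))
}
```

Running the same helper on any other candidate dataset gives a quick read on whether it has enough categorical variety for the existing examples.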

@deblnia
Contributor

deblnia commented Aug 15, 2020

I don't think we use diamonds in any modelling pages (feel free to correct me if I'm wrong, just searched diamonds on the website), so I don't think the sample size should be disqualifying. The lack of levels for categorical variables is definitely a valid concern though.

@YinsuH

YinsuH commented Aug 16, 2020

This is the page I looked at, which has a few uses of scatterplots. I was not sure if my concern was significant, so I decided to bring it up anyway :)

@bensoltoff
Member Author

@YinsuH I think we'd be okay with the number of observations. But I agree with your concerns about having an appropriate number of categorical variables for some of the examples. I am thinking especially of computer programming as problem solving. Could you take a stab at rewriting the examples in the notes folder that currently use diamonds, substituting the penguins dataset?

I think the easiest workflow will be to fork the course-site repo, then edit the .Rmarkdown files directly. Note that if you try to build the entire site, you will need to knit all the R Markdown files in the repo, which will take some time (and probably require that you have additional packages installed). If it's easier, write a fresh .R script for each page that uses diamonds and just rework the code. I/we can update the written narrative later once we know the examples work.

@YinsuH

YinsuH commented Aug 19, 2020

@bensoltoff I have created a pull request for the course site. However, this is my first attempt at updating the website and some of the work might still have problems. I will continue checking it over the next few days. I have also written up a few questions in the pull request post. Please have a look.

@bensoltoff
Member Author

Household Pulse Survey - assess impact of COVID-19 on households

@bensoltoff
Member Author

bensoltoff commented Jan 6, 2022

Need to replace

  • diamonds
  • mtcars
    • Need a fully numeric data frame to drop into iteration exercises. Or need to rewrite that exercise
  • mpg
  • Auto
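
As a rough way to screen candidates for the iteration exercises mentioned under mtcars, a replacement can be tested for being fully numeric before it is dropped in. This is a sketch under assumptions: `is_fully_numeric()` is a hypothetical helper, and the loop only illustrates the kind of column-wise iteration such a data frame supports, not the actual exercise.

```r
# Hypothetical helper: TRUE only when every column is numeric.
is_fully_numeric <- function(df) {
  all(vapply(df, is.numeric, logical(1)))
}

is_fully_numeric(mtcars)  # every column of mtcars is numeric
is_fully_numeric(iris)    # Species is a factor, so this is FALSE

# A fully numeric data frame can be dropped straight into a column-wise loop.
col_means <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  col_means[[i]] <- mean(mtcars[[i]])
}
names(col_means) <- names(mtcars)
```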

Questionable datasets

  • titanic
  • flights
  • gapminder

bensoltoff added a commit that referenced this issue Jan 6, 2022