Find better data examples #46

bensoltoff · 2017-08-14T14:30:07Z

Needs more social scientific data examples to practice skills, not just datasets of convenience

bensoltoff · 2018-06-06T17:27:17Z

Police shootings dataset

bensoltoff · 2019-10-22T15:34:45Z

Need to replace

diamonds
mtcars
mpg
Auto

Questionable datasets

titanic
flights
gapminder

bensoltoff · 2019-10-22T15:40:57Z

Draw examples from Kieran Healy's socviz package.

opiates
gss_sm
gss_lon

bensoltoff · 2020-03-02T16:28:27Z

Deadest names - see #115

deblnia · 2020-08-14T16:16:16Z

Still working on this, specifically with Movies and Snapchat data. (Repo is very minimal now.) Will also look into socviz, deadest names and police shootings.

Two other options (would love to hear what you think): Palmer penguins instead of diamonds, and recent-ish O'Hare/Midway data using anyflights instead of flights.

EDIT: Also flagging Damon Jones' scrape of UCPD stops as a potential alternative to the WaPo Police shooting dataset.

bensoltoff · 2020-08-14T17:55:47Z

Palmer penguins is supposed to be a good drop-in replacement for iris. Not sure if it contains sufficient variables to replace diamonds. We'd need to check how diamonds is used on the website to verify the penguins dataset contains appropriate variables.

Chicago flights data would be nice to replace nycflights13, though I think I only use it for one set of exercises for relational joins.

bensoltoff · 2020-08-14T17:56:53Z

Flagging @YinsuH on this. She's working as an RA for me this summer through SISRM.

YinsuH · 2020-08-15T04:20:25Z

I have looked at the penguins dataset and the lecture notes. I would say the penguins data is viable for most of the operations we need. For instance, it could be used for practicing the pipe and writing functions.

However, one problem that might be significant is that penguins contains only 344 observations, while diamonds has nearly 54,000. In the exercises we also use characteristics like color and cut, which have seven and five levels respectively, whereas the categorical variables species and island in penguins have only three levels each. This suggests less variability in the penguins data, which might cause problems in modeling and make the visualizations less diverse than the figures produced with diamonds.
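The comparison above (number of observations and number of levels per categorical variable) can be checked directly. Below is a minimal sketch, not code from the course site: `categorical_summary()` is a hypothetical helper name, and `iris` is used for the unconditional call only because it ships with base R. The calls on diamonds and penguins assume the ggplot2 and palmerpenguins packages are installed and are skipped otherwise.

```r
# Hypothetical helper: count rows and the distinct non-missing values of
# each categorical (factor or character) column in a data frame.
categorical_summary <- function(df) {
  is_cat <- vapply(
    df,
    function(x) is.factor(x) || is.character(x),
    logical(1)
  )
  list(
    n_obs = nrow(df),
    n_levels = vapply(
      df[is_cat],
      function(x) length(unique(x[!is.na(x)])),
      integer(1)
    )
  )
}

# iris ships with base R, so this call runs without extra packages.
iris_summary <- categorical_summary(iris)
print(iris_summary)

# The two datasets under discussion live in add-on packages, so only
# summarize them when those packages are available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(categorical_summary(ggplot2::diamonds))
}
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  print(categorical_summary(palmerpenguins::penguins))
}
```

Running the same helper on any candidate replacement dataset gives a quick read on whether it has enough categorical levels for the existing examples.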

deblnia · 2020-08-15T16:58:07Z

I don't think we use diamonds in any modelling pages (feel free to correct me if I'm wrong, just searched diamonds on the website), so I don't think the sample size should be disqualifying. The lack of levels for categorical variables is definitely a valid concern though.

YinsuH · 2020-08-16T01:15:27Z

This is the page I looked at, which has a few uses of scatterplots. I was not sure if my concern was significant, so I decided to bring it up anyway :)

bensoltoff · 2020-08-16T17:27:44Z

@YinsuH I think we'd be okay with the number of observations. But I agree with your concern about having enough categorical levels for some of the examples. I am thinking especially of computer programming as problem solving. Could you take a stab at rewriting the examples in the notes folder that currently use diamonds, substituting the penguins dataset?

I think the easiest workflow will be to fork the course-site repo, then edit the .Rmarkdown files directly. Note that if you try to build the entire site, you will need to knit all the R Markdown files in the repo, which will take some time (and will probably require that you have additional packages installed). If it's easier, just write a fresh .R script for each page that uses diamonds and rework the code there. I/we can update the written narrative later once we know the examples work.

YinsuH · 2020-08-19T16:25:17Z

@bensoltoff I have created a pull request for the course site. This is my first attempt at updating the website, so some of the work might still have problems. I will keep checking it over the next few days. I have also written up a few questions in the pull request. Please have a look.

bensoltoff · 2021-04-14T13:27:38Z

Household Pulse Survey - assess impact of COVID-19 on households

bensoltoff · 2022-01-06T17:43:31Z

Need to replace

diamonds
mtcars
- Need a fully numeric data frame to drop into the iteration exercises, or need to rewrite those exercises
mpg
Auto

Questionable datasets

titanic
flights
gapminder
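For the mtcars note above, a candidate dataset can be screened for being fully numeric before it is dropped into the iteration exercises. This is a minimal sketch under assumptions: `is_fully_numeric()` and `numeric_only()` are hypothetical helper names, the loop only illustrates the kind of column-wise iteration such an exercise might use, and `iris` appears solely because it ships with base R, not as a proposed replacement.

```r
# Hypothetical helper: TRUE when every column of df is numeric.
is_fully_numeric <- function(df) {
  all(vapply(df, is.numeric, logical(1)))
}

is_fully_numeric(mtcars)  # every mtcars column is numeric
is_fully_numeric(iris)    # Species is a factor, so this check fails

# Hypothetical helper: keep only the numeric columns of a mixed data frame.
numeric_only <- function(df) {
  df[vapply(df, is.numeric, logical(1))]
}

iris_numeric <- numeric_only(iris)

# Illustrative iteration: compute one mean per column with a for loop.
col_means <- vector("double", ncol(iris_numeric))
for (i in seq_along(iris_numeric)) {
  col_means[[i]] <- mean(iris_numeric[[i]])
}
names(col_means) <- names(iris_numeric)
print(col_means)
```

Dropping the non-numeric columns is one option when a social science dataset is mostly numeric; the alternative raised in the note is to rewrite the exercise so it handles mixed column types.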

bensoltoff added the hard label Jun 6, 2018

This was referenced Jun 6, 2018

Dataset for classification #41

Closed

Classify handwritten digits #32

Closed

bensoltoff self-assigned this Oct 22, 2019

bensoltoff mentioned this issue Jan 4, 2022

Eradicate all dumb datasets #304

Closed

bensoltoff added a commit that referenced this issue Jan 6, 2022

Remove uses of the mpg dataset. See #46

4cb9200

bensoltoff added a commit that referenced this issue Jan 6, 2022

Remove one mtcars example. See #46

d004c99
