Skip to content

Latest commit

 

History

History
147 lines (122 loc) · 6.71 KB

readme.md

File metadata and controls

147 lines (122 loc) · 6.71 KB

TidyTuesday Datasets

This week we're exploring our own data! The data this week comes from our {ttmeta} package, which automatically updates with information about the TidyTuesday datasets. The data shared here was compiled on 2024-06-18. If you would like, you can use the package to retrieve updated information about the last few weeks.

What are the most common variable names? Of those variables, which are used to mean the same thing, and which are used to mean different things?

The Data

# Option 1: tidytuesdayR package 
## install.packages("tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2024-07-02')
## OR
tuesdata <- tidytuesdayR::tt_load(2024, week = 27)

tt_datasets <- tuesdata$tt_datasets
tt_summary <- tuesdata$tt_summary
tt_urls <- tuesdata$tt_urls
tt_variables <- tuesdata$tt_variables

# Option 2: Read directly from GitHub

tt_datasets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_datasets.csv')
tt_summary <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_summary.csv')
tt_urls <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_urls.csv')
tt_variables <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_variables.csv')

How to Participate

  • Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
  • Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
  • Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.

Data Dictionary

tt_summary.csv

variable class description
year integer The year in which the dataset was released.
week integer The week number for this dataset within this year.
date Date The date of the Tuesday of this week.
title character The overall title of this week's TidyTuesday. This is different from the individual dataset titles (although often similar).
source_title character The title of this week's source. If there are multiple sources, there is still a single title merging them.
article_title character The title of this week's article. If there are multiple articles, there is still a single title merging them.

tt_urls.csv

variable class description
year integer The year in which the dataset was released.
week integer The week number for this dataset within this year.
type factor Whether this url appeared as an "article" or as a data "source".
url character The full url.
scheme factor Whether this url uses "http" or "https".
domain character The main part of the url. For example, in www.google.com, "google" would be the domain for this column.
subdomain character The part of the url before the period (when present). For example, in www.google.com, "www" would be the subdomain for this column.
tld character The top-level domain, the part of the url after the domain. For example, in www.google.com, "com" would be the tld for this column.
path character The part of the url after the slash but before any "?" or "#" (when present). For example, in www.google.com/maps/place/Googleplex, "maps/place/Googleplex" would be the path for this column.
query character The parts of the url after "?" (when present). For example, for example.com/source?a=1&b=2, this column would contain "a=1&b=2"
fragment character The part of the url after "#" (when present). for example, for cascadiarconf.com/agenda/#craggy, this column would contain "craggy".

tt_datasets.csv

variable class description
year integer The year in which the dataset was released.
week integer The week number for this dataset within this year.
dataset_name character The name of this dataset. Some weeks have multiple datasets.
variables integer The number of columns in this dataset.
observations integer The number of rows in this dataset.

tt_variables.csv

variable class description
year integer The year in which the dataset was released.
week integer The week number for this dataset within this year.
dataset_name character The name of this dataset. Some weeks have multiple datasets.
variable character The name of this variable.
class character The class of this variable.
n_unique integer The number of unique values of this variable within this dataset.
min character The "lowest" value of this variable (lowest number, first value alphabetically, etc).
max character The "highest" value of this variable (highest number, last value alphabetically, etc).
description character A short description of this variable.

Cleaning Script

library(tidyverse)
library(here)
library(fs)
# install.packages("pak")
# pak::pak("r4ds/ttmeta")
library(ttmeta)

tt_summary <- ttmeta::tt_summary_tbl |> 
  dplyr::select(-dplyr::ends_with("urls"))

tt_urls <- ttmeta::tt_urls_tbl |> 
  dplyr::mutate(
    query = purrr::map_chr(
      query,
      \(x) {
        if (!length(x)) {
          return(NA_character_)
        }
        paste0(names(x), "=", x, collapse = "&")
      }
    )
  )

tt_datasets <- ttmeta::tt_datasets_metadata |> 
  dplyr::filter(!is.na(dataset_name)) |> 
  dplyr::select(-variable_details)

tt_variables <- ttmeta::tt_datasets_metadata |> 
  dplyr::filter(!is.na(dataset_name)) |> 
  dplyr::select(-variables, -observations) |> 
  tidyr::unnest(variable_details) |> 
  dplyr::mutate(
    min = purrr::map_chr(
      min,
      \(x) {
        if (!length(x)) {
          return(NA_character_)
        }
        as.character(x)
      }
    ),
    max = purrr::map_chr(
      max,
      \(x) {
        if (!length(x)) {
          return(NA_character_)
        }
        as.character(x)
      }
    )
  )