This week we're exploring our own data!
The data this week comes from our {ttmeta} package, which automatically updates with information about the TidyTuesday datasets.
The data shared here was compiled on 2024-06-18.
If you would like, you can use the package to retrieve updated information about the last few weeks.
What are the most common variable names?
Of those variables, which are used to mean the same thing, and which are used to mean different things?
# Option 1: tidytuesdayR package
## install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2024-07-02')
## OR
tuesdata <- tidytuesdayR::tt_load(2024, week = 27)
tt_datasets <- tuesdata$tt_datasets
tt_summary <- tuesdata$tt_summary
tt_urls <- tuesdata$tt_urls
tt_variables <- tuesdata$tt_variables
# Option 2: Read directly from GitHub
tt_datasets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_datasets.csv')
tt_summary <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_summary.csv')
tt_urls <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_urls.csv')
tt_variables <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_variables.csv')
- Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.
- Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
- Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
variable |
class |
description |
year |
integer |
The year in which the dataset was released. |
week |
integer |
The week number for this dataset within this year. |
date |
Date |
The date of the Tuesday of this week. |
title |
character |
The overall title of this week's TidyTuesday. This is different from the individual dataset titles (although often similar). |
source_title |
character |
The title of this week's source. If there are multiple sources, there is still a single title merging them. |
article_title |
character |
The title of this week's article. If there are multiple articles, there is still a single title merging them. |
variable |
class |
description |
year |
integer |
The year in which the dataset was released. |
week |
integer |
The week number for this dataset within this year. |
type |
factor |
Whether this url appeared as an "article" or as a data "source". |
url |
character |
The full url. |
scheme |
factor |
Whether this url uses "http" or "https". |
domain |
character |
The main part of the url. For example, in www.google.com, "google" would be the domain for this column. |
subdomain |
character |
The part of the url before the period (when present). For example, in www.google.com, "www" would be the subdomain for this column. |
tld |
character |
The top-level domain, the part of the url after the domain. For example, in www.google.com, "com" would be the tld for this column. |
path |
character |
The part of the url after the slash but before any "?" or "#" (when present). For example, in www.google.com/maps/place/Googleplex, "maps/place/Googleplex" would be the path for this column. |
query |
character |
The parts of the url after "?" (when present). For example, for example.com/source?a=1&b=2, this column would contain "a=1&b=2" |
fragment |
character |
The part of the url after "#" (when present). for example, for cascadiarconf.com/agenda/#craggy, this column would contain "craggy". |
variable |
class |
description |
year |
integer |
The year in which the dataset was released. |
week |
integer |
The week number for this dataset within this year. |
dataset_name |
character |
The name of this dataset. Some weeks have multiple datasets. |
variables |
integer |
The number of columns in this dataset. |
observations |
integer |
The number of rows in this dataset. |
variable |
class |
description |
year |
integer |
The year in which the dataset was released. |
week |
integer |
The week number for this dataset within this year. |
dataset_name |
character |
The name of this dataset. Some weeks have multiple datasets. |
variable |
character |
The name of this variable. |
class |
character |
The class of this variable. |
n_unique |
integer |
The number of unique values of this variable within this dataset. |
min |
character |
The "lowest" value of this variable (lowest number, first value alphabetically, etc). |
max |
character |
The "highest" value of this variable (highest number, last value alphabetically, etc). |
description |
character |
A short description of this variable. |
library(tidyverse)
library(here)
library(fs)
# install.packages("pak")
# pak::pak("r4ds/ttmeta")
library(ttmeta)
tt_summary <- ttmeta::tt_summary_tbl |>
dplyr::select(-dplyr::ends_with("urls"))
tt_urls <- ttmeta::tt_urls_tbl |>
dplyr::mutate(
query = purrr::map_chr(
query,
\(x) {
if (!length(x)) {
return(NA_character_)
}
paste0(names(x), "=", x, collapse = "&")
}
)
)
tt_datasets <- ttmeta::tt_datasets_metadata |>
dplyr::filter(!is.na(dataset_name)) |>
dplyr::select(-variable_details)
tt_variables <- ttmeta::tt_datasets_metadata |>
dplyr::filter(!is.na(dataset_name)) |>
dplyr::select(-variables, -observations) |>
tidyr::unnest(variable_details) |>
dplyr::mutate(
min = purrr::map_chr(
min,
\(x) {
if (!length(x)) {
return(NA_character_)
}
as.character(x)
}
),
max = purrr::map_chr(
max,
\(x) {
if (!length(x)) {
return(NA_character_)
}
as.character(x)
}
)
)