- "markdown": "# Data Visualization Basics {#sec-basic-data-vis}\n\nThis section is intended as a very light overview of how you might create charts in R and Python. @sec-data-vis will be much more in depth.\n\n## {{< fa bullseye >}} Objectives \n\n- Use ggplot2/plotnine to create a chart\n- Begin to identify issues with data formatting\n\n## Package Installation \n\nYou will need the `plotnine` (Python) and `ggplot2` (R) packages for this section. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nTo install plotnine, pick one of the following methods (you can read more about them and decide which is appropriate for you in @sec-py-pkg-install).\n\n::: panel-tabset\n### System Terminal\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npip3 install plotnine matplotlib\n```\n:::\n\n\n### R Terminal\n\nThis package installation method requires that you have a virtual environment set up (that is, if you are on Windows, don't try to install packages this way).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreticulate::py_install(c(\"plotnine\", \"matplotlib\"))\n```\n:::\n\n\n### Python Terminal\n\nIn a Python chunk (or the Python terminal), you can run the following command. This depends on something called \"IPython magic\" commands, so if it doesn't work for you, try the System Terminal method instead.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n%pip install plotnine matplotlib\n```\n:::\n\n\nOnce you have run this command, please comment it out so that you don't reinstall the same packages every time.\n\n:::\n\n## First Steps\n\nNow that you can read data into R and Python and define new variables, you can create plots! \nData visualization is a skill that takes a lifetime to learn, but for now, let's start out easy: let's talk about how to make (basic) plots in R (with `ggplot2`) and in Python (with `plotnine`, which is a ggplot2 clone).\n\n### Graphing HBCU Enrollment\nLet's work with enrollment data for Historically Black Colleges and Universities (HBCUs). 
\n\n::: callout-demo\n#### Loading Libraries\n\n::: panel-tabset\n#### R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhbcu_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')\n\nlibrary(ggplot2)\n```\n:::\n\n\n#### Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport pandas as pd\nfrom plotnine import *\n\nhbcu_all = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')\n```\n:::\n\n\n:::\n:::\n\n### Making a Line Chart\n\nggplot2 and plotnine work with data frames. \n\nIf you pass a data frame in as the data argument, you can refer to columns in that data with \"bare\" column names (you don't have to reference the full data object using `df$name` or `df.name`; you can instead use `name` or `\"name\"`).\n\n::: callout-demo\n\n::: panel-tabset\n#### R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\nggplot(hbcu_all, aes(x = Year, y = `4-year`)) + geom_line() +\n ggtitle(\"4-year HBCU College Enrollment\")\n```\n\n::: {.cell-output-display}\n![](02-basic-data-vis_files/figure-html/unnamed-chunk-6-1.png){width=2100}\n:::\n:::\n\n\n#### Python\n\n::: {.cell}\n\n```{.python .cell-code}\n\nggplot(hbcu_all, aes(x = \"Year\", y = \"4-year\")) + geom_line() + \\\n ggtitle(\"4-year HBCU College Enrollment\")\n## <Figure Size: (640 x 480)>\n```\n\n::: {.cell-output-display}\n![](02-basic-data-vis_files/figure-html/unnamed-chunk-7-1.png){width=614}\n:::\n:::\n\n\n:::\n:::\n\n### Data Formatting\n\nIf your data is in the right format, ggplot2 is very easy to use; if your data isn't formatted neatly, it can be a real pain. \nIf you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) you want to get your data into \"long form\". 
\nWe'll learn more about how to do this type of data transformation when we talk about [reshaping data](05-data-reshape.qmd).\n\n::: callout-demo\n\nYou don't need to know exactly how this works, but it is helpful to see the difference between the two datasets:\n\n::: panel-tabset\n#### R\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\nhbcu_long <- pivot_longer(hbcu_all, -Year, names_to = \"type\", values_to = \"value\")\n```\n:::\n\n\n#### Python\n\n::: {.cell}\n\n```{.python .cell-code}\nhbcu_long = pd.melt(hbcu_all, id_vars = ['Year'], value_vars = hbcu_all.columns[1:])\n```\n:::\n\n\n#### Original Data\n\n::: {.cell}\n::: {.cell-output-display}\n| Year| Total enrollment| Males| Females| 4-year| 2-year| Total - Public| 4-year - Public| 2-year - Public| Total - Private| 4-year - Private| 2-year - Private|\n|----:|----------------:|------:|-------:|------:|------:|--------------:|---------------:|---------------:|---------------:|----------------:|----------------:|\n| 1976| 222613| 104669| 117944| 206676| 15937| 156836| 143528| 13308| 65777| 63148| 2629|\n| 1980| 233557| 106387| 127170| 218009| 15548| 168217| 155085| 13132| 65340| 62924| 2416|\n| 1982| 228371| 104897| 123474| 212017| 16354| 165871| 151472| 14399| 62500| 60545| 1955|\n| 1984| 227519| 102823| 124696| 212844| 14675| 164116| 151289| 12827| 63403| 61555| 1848|\n| 1986| 223275| 97523| 125752| 207231| 16044| 162048| 147631| 14417| 61227| 59600| 1627|\n| 1988| 239755| 100561| 139194| 223250| 16505| 173672| 158606| 15066| 66083| 64644| 1439|\n:::\n:::\n\n\n#### Long Data\n\n::: {.cell}\n::: {.cell-output-display}\n| Year|type | value|\n|----:|:----------------|------:|\n| 1976|Total enrollment | 222613|\n| 1976|Males | 104669|\n| 1976|Females | 117944|\n| 1976|4-year | 206676|\n| 1976|2-year | 15937|\n| 1976|Total - Public | 156836|\n:::\n:::\n\n\nIn the long form of the data, we have a row for each data point (year x measurement type), not for each year.\n:::\n:::\n\n### Making a (Better) Line Chart\n\nIf 
we had wanted to show all of the available data before, we would have needed to add a separate line for each column, coloring each one manually, and then we would have wanted to create a legend manually (which is a pain). \nConverting the data to long form means we can use ggplot2/plotnine to do all of this for us with only a single `geom_line` statement.\nHaving the data in the right form to plot is very important if you want to get the plot you're imagining with relatively little effort.\n\n\n::: callout-demo\n::: panel-tabset\n\n#### R\n\n::: {.cell}\n\n```{.r .cell-code}\n\nggplot(hbcu_long, aes(x = Year, y = value, color = type)) + geom_line() +\n ggtitle(\"HBCU College Enrollment\")\n```\n\n::: {.cell-output-display}\n![](02-basic-data-vis_files/figure-html/unnamed-chunk-12-1.png){width=2100}\n:::\n:::\n\n\n#### Python\n\n::: {.cell}\n\n```{.python .cell-code}\n\nggplot(hbcu_long, aes(x = \"Year\", y = \"value\", color = \"variable\")) + geom_line() + \\\n ggtitle(\"HBCU College Enrollment\") + \\\n theme(subplots_adjust={'right':0.75}) # This moves the key so it takes up 25% of the area\n## <Figure Size: (640 x 480)>\n```\n\n::: {.cell-output-display}\n![](02-basic-data-vis_files/figure-html/unnamed-chunk-13-1.png){width=614}\n:::\n:::\n\n\n:::\n:::\n\n## References {#sec-graphics-intro-refs}\n",