diff --git a/docs/1-getting-started.html b/docs/1-getting-started.html
index bbee4b206..73cbcf126 100644
--- a/docs/1-getting-started.html
+++ b/docs/1-getting-started.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 1 Getting Started with Data in R | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 1 Getting Started with Data in R | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 1 Getting Started with Data in R | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="index.html">
-<link rel="next" href="2-viz.html">
+<link rel="prev" href="about-the-authors.html"/>
+<link rel="next" href="2-viz.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -576,10 +589,10 @@ <h1><span class="header-section-number">Chapter 1</span> Getting Started with Da
 <li>How do I code in R?</li>
 <li>What are R packages?</li>
 </ol>
-<p>We’ll introduce these concepts in the upcoming Sections <a href="1-getting-started.html#r-rstudio">1.1</a>-<a href="1-getting-started.html#packages">1.3</a>. If you are already somewhat familiar with these concepts, feel free to skip to Section <a href="1-getting-started.html#nycflights13">1.4</a> where we’ll introduce our first data set: all domestic flights departing a New York City airport in 2013. This is a dataset we will explore in depth for the rest of this book.</p>
+<p>We’ll introduce these concepts in the upcoming Sections <a href="1-getting-started.html#r-rstudio">1.1</a>-<a href="1-getting-started.html#packages">1.3</a>. If you are already somewhat familiar with these concepts, feel free to skip to Section <a href="1-getting-started.html#nycflights13">1.4</a> where we’ll introduce our first dataset: all domestic flights departing one of the three main New York City (NYC) airports in 2013. This is a dataset we will explore in depth for much of the rest of this book.</p>
 <div id="r-rstudio" class="section level2">
 <h2><span class="header-section-number">1.1</span> What are R and RStudio?</h2>
-<p>For much of this book, we will assume that you are using R via RStudio. First time users often confuse the two. At its simplest R is like a car’s engine while RStudio is like a car’s dashboard.</p>
+<p>Throughout this book, we will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard, as illustrated in Figure <a href="1-getting-started.html#fig:R-vs-RStudio-1">1.1</a>.</p>
 <!--
 R: Engine            |  RStudio: Dashboard 
 :-------------------------:|:-------------------------:
@@ -591,20 +604,21 @@ <h2><span class="header-section-number">1.1</span> What are R and RStudio?</h2>
 FIGURE 1.1: Analogy of difference between R and RStudio.
 </p>
 </div>
-<p>More precisely, R is a programming language that runs computations while RStudio is an <em>integrated development environment (IDE)</em> that provides an interface by adding many convenient features and tools. So just as the way of having access to a speedometer, rear-view mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well.</p>
-<div id="installing-r-and-rstudio" class="section level3">
+<p>More precisely, R is a programming language that runs computations, while RStudio is an <em>integrated development environment (IDE)</em> that provides an interface by adding many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier.</p>
+<div id="installing" class="section level3">
 <h3><span class="header-section-number">1.1.1</span> Installing R and RStudio</h3>
 <blockquote>
-<p><strong>Note about RStudio Server</strong>: If your instructor has provided you with a link and access to RStudio Server, then you can skip this section. We do recommend after a few months of working on RStudio Server that you return to these instructions to install this software on your own computer though.</p>
+<p><strong>Note about RStudio Server or RStudio Cloud</strong>: If your instructor has provided you with a link and access to RStudio Server or RStudio Cloud, then you can skip this section. We do recommend, though, that after a few months of working on RStudio Server/Cloud you return to these instructions to install this software on your own computer.</p>
 </blockquote>
-<p>You will first need to download and install both R and RStudio (Desktop version) on your computer. It is important that you install R first and then install RStudio second.</p>
+<p>You will first need to download and install both R and RStudio (Desktop version) on your computer. It is important that you install R first and then install RStudio.</p>
 <ol style="list-style-type: decimal">
-<li><strong>You must do this first:</strong> <a href="https://cran.r-project.org/">Download and install R</a>.
+<li><strong>You must do this first:</strong> Download and install R by going to <a href="https://cloud.r-project.org/" class="uri">https://cloud.r-project.org/</a>. 
 <ul>
 <li>If you are a Windows user: Click on “Download R for Windows”, then click on “base”, then click on the Download link.</li>
-<li>If you are macOS user: Click on “Download R for (Mac) OS X”, then under “Latest release:” click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of August 10, 2019 was R-3.6.1.</li>
+<li>If you are a macOS user: Click on “Download R for (Mac) OS X”, then under “Latest release:” click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of November 25, 2019 was R-3.6.1.</li>
+<li>If you are a Linux user: Click on “Download R for Linux” and choose your distribution for more information on installing R for your setup.</li>
 </ul></li>
-<li><strong>You must do this second:</strong> <a href="https://www.rstudio.com/products/rstudio/download/">Download and install RStudio</a>.
+<li><strong>You must do this second:</strong> Download and install RStudio at <a href="https://www.rstudio.com/products/rstudio/download/" class="uri">https://www.rstudio.com/products/rstudio/download/</a>.
 <ul>
 <li>Scroll down to “Installers for Supported Platforms” near the bottom of the page.</li>
 <li>Click on the download link corresponding to your computer’s operating system. </li>
@@ -613,7 +627,7 @@ <h3><span class="header-section-number">1.1.1</span> Installing R and RStudio</h
 </div>
 <div id="using-r-via-rstudio" class="section level3">
 <h3><span class="header-section-number">1.1.2</span> Using R via RStudio</h3>
-<p>Recall our car analogy from earlier. Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new <em>programs</em> (also called <em>applications</em>) you can open. We’ll always work in RStudio and not R. Figure <a href="1-getting-started.html#fig:R-vs-RStudio-2">1.2</a> shows what icon you should be clicking on your computer.</p>
+<p>Recall our car analogy from earlier. Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new <em>programs</em> (also called <em>applications</em>) you can open. We’ll always work in RStudio and not in the R application. Figure <a href="1-getting-started.html#fig:R-vs-RStudio-2">1.2</a> shows what icon you should be clicking on your computer.</p>
 <!--
 R: Do not open this          |  RStudio: Open this
 :-------------------------:|:-------------------------:
@@ -625,41 +639,45 @@ <h3><span class="header-section-number">1.1.2</span> Using R via RStudio</h3>
 FIGURE 1.2: Icons of R versus RStudio on your computer.
 </p>
 </div>
-<p>After you open RStudio, you should see the following in Figure <a href="1-getting-started.html#fig:RStudio-interface">1.3</a>.</p>
+<p>After you open RStudio, you should see something similar to Figure <a href="1-getting-started.html#fig:RStudio-interface">1.3</a>. (Note that slight differences might exist if RStudio’s default layout has changed since 2019.)</p>
 <div class="figure" style="text-align: center"><span id="fig:RStudio-interface"></span>
-<img src="images/rstudio_screenshots/rstudio.png" alt="RStudio interface to R." width="100%" />
+<img src="images/rstudio_screenshots/rstudio.png" alt="RStudio interface to R." width="93%" />
 <p class="caption">
 FIGURE 1.3: RStudio interface to R.
 </p>
 </div>
-<p>Note the three <em>panes</em> which are three panels dividing the screen: The <em>console pane</em>, the <em>files pane</em>, and the <em>environment pane</em>. Over the course of this chapter, you’ll come to learn what purpose each of these panes serve.</p>
+<p>Note the three <em>panes</em>, which are three panels dividing the screen: the <em>console pane</em>, the <em>files pane</em>, and the <em>environment pane</em>. Over the course of this chapter, you’ll come to learn what purpose each of these panes serves.</p>
 </div>
 </div>
 <div id="code" class="section level2">
 <h2><span class="header-section-number">1.2</span> How do I code in R?</h2>
-<p>Now that you’re set up with R and RStudio, you are probably asking yourself “OK. Now how do I use R?” The first thing to note is that unlike other statistical software programs like Excel, STATA, or SAS that provide <a href="https://en.wikipedia.org/wiki/Point_and_click">point-and-click</a> interfaces, R is an <a href="https://en.wikipedia.org/wiki/Interpreted_language">interpreted language</a>. This means you have to type in commands written in <em>R code</em>. In other words, you have to code/program in R. Note that we’ll use the terms “coding” and “programming” interchangeably in this book.</p>
-<p>While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively.</p>
+<p>Now that you’re set up with R and RStudio, you are probably asking yourself, “OK. Now how do I use R?” The first thing to note is that unlike other statistical software programs like Excel, SPSS, or Minitab that provide <a href="https://en.wikipedia.org/wiki/Point_and_click">point-and-click</a> interfaces, R is an <a href="https://en.wikipedia.org/wiki/Interpreted_language">interpreted language</a>. This means you have to type in commands written in <em>R code</em>. In other words, you have to code/program in R. Note that we’ll use the terms “coding” and “programming” interchangeably in this book.</p>
+<p>While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that new R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively.</p>
 <div id="programming-concepts" class="section level3">
 <h3><span class="header-section-number">1.2.1</span> Basic programming concepts and terminology</h3>
-<p>We now introduce some basic programming concepts and terminology. Instead of asking you to learn all these concepts and terminology right now, we’ll guide you so that you’ll “learn by doing.” Note that in this book we will always use a different font to distinguish regular text from <code>computer_code</code>. The best way to master these topics is, in our opinions, “learning by doing” and lots of repetition.</p>
+<p>We now introduce some basic programming concepts and terminology. Instead of asking you to memorize all these concepts and terminology right now, we’ll guide you so that you’ll “learn by doing.” To help you learn, we will always use a different font to distinguish regular text from <code>computer_code</code>. The best way to master these topics is, in our opinion, through <a href="https://jamesclear.com/deliberate-practice-theory">deliberate practice</a> with R and lots of repetition.</p>
 <ul>
 <li>Basics: 
 <ul>
-<li><em>Console</em>: Where you enter in commands. </li>
-<li><em>Running code</em>: The act of telling R to perform an act by giving it commands in the console.</li>
-<li><em>Objects</em>: Where values are saved in R. We’ll show you how to <em>assign</em> values to objects and how to display the contents of objects. </li>
-<li><em>Data types</em>: Integers, doubles/numerics, logicals, and characters. </li>
-</ul></li>
-<li><em>Vectors</em>: A series of values. These are created using the <code>c()</code> function, where <code>c()</code> stands for “combine” or “concatenate.” For example: <code>c(6, 11, 13, 31, 90, 92)</code>. </li>
-<li><em>Factors</em>: <em>Categorical data</em> are represented in R as factors. </li>
-<li><em>Data frames</em>: Data frames are like rectangular spreadsheets: they are representations of datasets in R where the rows correspond to <em>observations</em> and the columns correspond to <em>variables</em> that describe the observations.  We’ll cover data frames later in Section <a href="1-getting-started.html#nycflights13">1.4</a>.</li>
+<li><em>Console pane</em>: where you enter in commands. </li>
+<li><em>Running code</em>: the act of telling R to perform a task by giving it commands in the console.</li>
+<li><em>Objects</em>: where values are saved in R. We’ll show you how to <em>assign</em> values to objects and how to display the contents of objects. </li>
+<li><em>Data types</em>: integers, doubles/numerics, logicals, and characters.  Integers are values like -1, 0, 2, 4092. Doubles or numerics are a larger set of values containing the integers as well as fractions and decimal values like -24.932 and 0.8. Logicals are either <code>TRUE</code> or <code>FALSE</code>, while characters are text such as “cabbage”, “Hamilton”, “The Wire is the greatest TV show ever”, and “This ramen is delicious.” Note that characters are often denoted with quotation marks around them.</li>
+</ul></li>
+<li><em>Vectors</em>: a series of values. These are created using the <code>c()</code> function, where <code>c()</code> stands for “combine” or “concatenate.” For example, <code>c(6, 11, 13, 31, 90, 92)</code> creates a six-element series of positive integer values.</li>
+<li><em>Factors</em>: <em>categorical data</em> are commonly represented in R as factors.  Categorical data can also be represented as <em>strings</em>. We’ll study this difference as we progress through the book.</li>
+<li><em>Data frames</em>: rectangular spreadsheets. They are representations of datasets in R where the rows correspond to <em>observations</em> and the columns correspond to <em>variables</em> that describe the observations.  We’ll cover data frames later in Section <a href="1-getting-started.html#nycflights13">1.4</a>.</li>
 <li><em>Conditionals</em>: 
 <ul>
-<li>Testing for equality in R using <code>==</code> (and not <code>=</code> which is typically used for assignment). Ex: <code>2 + 1 == 3</code> compares <code>2 + 1</code> to <code>3</code> and is correct R code, while <code>2 + 1 = 3</code> will return an error.</li>
-<li>Boolean algebra: <code>TRUE/FALSE</code> statements and mathematical operators such as <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal), and <code>!=</code> (not equal to). </li>
-<li>Logical operators: <code>&amp;</code> representing “and” as well as <code>|</code> representing “or.” Ex: <code>(2 + 1 == 3) &amp; (2 + 1 == 4)</code> returns <code>FALSE</code> since both clauses are not <code>TRUE</code> (only the first clause is <code>TRUE</code>). On the other hand, <code>(2 + 1 == 3) | (2 + 1 == 4)</code> returns <code>TRUE</code> since at least one of the two clauses is <code>TRUE</code>. </li>
+<li>Testing for equality in R using <code>==</code> (and not <code>=</code>, which is typically used for assignment). For example, <code>2 + 1 == 3</code> compares <code>2 + 1</code> to <code>3</code> and is correct R code, while <code>2 + 1 = 3</code> will return an error.</li>
+<li>Boolean algebra: <code>TRUE/FALSE</code> statements and mathematical operators such as <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal to), and <code>!=</code> (not equal to).  For example, <code>4 + 2 &gt;= 3</code> will return <code>TRUE</code>, but <code>3 + 5 &lt;= 1</code> will return <code>FALSE</code>.</li>
+<li>Logical operators: <code>&amp;</code> representing “and” as well as <code>|</code> representing “or.” For example, <code>(2 + 1 == 3) &amp; (2 + 1 == 4)</code> returns <code>FALSE</code> since not both clauses are <code>TRUE</code> (only the first clause is <code>TRUE</code>). On the other hand, <code>(2 + 1 == 3) | (2 + 1 == 4)</code> returns <code>TRUE</code> since at least one of the two clauses is <code>TRUE</code>. </li>
+</ul></li>
+<li><em>Functions</em>, also called <em>commands</em>: Functions perform tasks in R. They take in inputs called <em>arguments</em> and return outputs. You can either manually specify a function’s arguments or use the function’s <em>default values</em>. 
+<ul>
+<li>For example, the function <code>seq()</code> in R generates a sequence of numbers. If you just run <code>seq()</code>, it will return the value 1. That doesn’t seem very useful! It happens because the default arguments are set as <code>seq(from = 1, to = 1)</code>. Thus, if you don’t pass in different values for <code>from</code> and <code>to</code> to change this behavior, R just assumes all you want is the number 1. You can change the argument values by updating the values after the <code>=</code> sign. If we try out <code>seq(from = 2, to = 5)</code>, we get the result <code>2 3 4 5</code> that we might expect.</li>
+<li>We’ll work with functions a lot throughout this book and you’ll get lots of practice in understanding their behaviors. To help you recognize when a function is mentioned in the book, we’ll also include the <code>()</code> after its name, as we did with <code>seq()</code> above.</li>
 </ul></li>
-<li><em>Functions</em>, also called <em>commands</em>: Functions perform tasks in R. They take in inputs called <em>arguments</em> and return outputs. You can either manually specify a function’s arguments or use the function’s <em>default values</em>. </li>
 </ul>
 <p>This list is by no means an exhaustive list of all the programming concepts and terminology needed to become a savvy R user; such a list would be so large it wouldn’t be very useful, especially for novices. Rather, we feel this is a minimally viable list of programming concepts and terminology you need to know before getting started. We feel that you can learn the rest as you go. Remember that your mastery of all of these concepts and terminology will build as you practice more and more.</p>
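+<p>The concepts listed above can be tried directly in the console pane. What follows is a minimal sketch rather than a complete tour: the object name <code>values</code> is our own hypothetical choice, and each comment describes what the line after it should return.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Assign a vector of six values to an object (the name "values" is hypothetical)
+values &lt;- c(6, 11, 13, 31, 90, 92)
+
+# Display the contents of the object
+values
+
+# Test for equality with ==; this returns TRUE
+2 + 1 == 3
+
+# "and" requires every clause to be TRUE; this returns FALSE
+(2 + 1 == 3) &amp; (2 + 1 == 4)
+
+# "or" requires at least one clause to be TRUE; this returns TRUE
+(2 + 1 == 3) | (2 + 1 == 4)
+
+# Call a function with its default arguments; this returns 1
+seq()
+
+# Call the same function with our own argument values; this returns 2 3 4 5
+seq(from = 2, to = 5)
+</code></pre>
+<p>Typing one line at a time and comparing what R prints with the comment above it is one way to get the repetition we recommend.</p>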
 </div>
@@ -668,33 +686,33 @@ <h3><span class="header-section-number">1.2.2</span> Errors, warnings, and messa
 <p>One thing that intimidates new R and RStudio users is how it reports <em>errors</em>, <em>warnings</em>, and <em>messages</em>. R reports errors, warnings, and messages in a glaring red font, which makes it seem like it is scolding you. However, seeing red text in the console is not always bad.</p>
 <p>R will show red text in the console pane in three different situations:</p>
 <ul>
-<li><strong>Errors</strong>:  When the red text is a legitimate error, it will be prefaced with “Error in…” and try to explain what went wrong. Generally when there’s an error, the code will not run. For example, we’ll see in Subsection <a href="1-getting-started.html#package-use">1.3.3</a> if you see <code>Error in ggplot(...) : could not find function &quot;ggplot&quot;</code>, it means that the <code>ggplot()</code> function is not accessible because the package that contains the function (<code>ggplot2</code>) was not loaded with <code>library(ggplot2)</code>. Thus you cannot use the <code>ggplot()</code> function without the <code>ggplot2</code> package being loaded first.</li>
-<li><strong>Warnings</strong>:  When the red text is a warning, it will be prefaced with “Warning:” and R will try to explain why there’s a warning. Generally your code will still work, but with some caveats. For example, you will see in Chapter <a href="2-viz.html#viz">2</a> if you create a scatterplot based on a dataset where one of the values is missing, you will see this warning: <code>Warning: Removed 1 rows containing missing values (geom_point)</code>. R will still produce the scatterplot with all the remaining values, but it is warning you that one of the points isn’t there.</li>
-<li><strong>Messages</strong>:  When the red text doesn’t start with either “Error” or “Warning”, it’s <em>just a friendly message</em>. You’ll see these messages when you load <em>R packages</em> in the upcoming Subsection <a href="1-getting-started.html#package-loading">1.3.2</a> or when you read data saved in spreadsheet files with the <code>read_csv()</code> function as you’ll see in Chapter <a href="4-tidy.html#tidy">4</a>. These are helpful diagnostic messages and they don’t stop your code from working. Additionally, you’ll see these messages when you install packages too using <code>install.packages()</code>.</li>
+<li><strong>Errors</strong>:  When the red text is a legitimate error, it will be prefaced with “Error in…” and will try to explain what went wrong. Generally when there’s an error, the code will not run. For example, as we’ll see in Subsection <a href="1-getting-started.html#package-use">1.3.3</a>, if you see <code>Error in ggplot(...) : could not find function &quot;ggplot&quot;</code>, it means that the <code>ggplot()</code> function is not accessible because the package that contains the function (<code>ggplot2</code>) was not loaded with <code>library(ggplot2)</code>. Thus you cannot use the <code>ggplot()</code> function without the <code>ggplot2</code> package being loaded first.</li>
+<li><strong>Warnings</strong>:  When the red text is a warning, it will be prefaced with “Warning:” and R will try to explain why there’s a warning. Generally your code will still work, but with some caveats. For example, as you will see in Chapter <a href="2-viz.html#viz">2</a>, if you create a scatterplot based on a dataset where two of the rows of data have missing entries that would be needed to create points in the scatterplot, you will see this warning: <code>Warning: Removed 2 rows containing missing values (geom_point)</code>. R will still produce the scatterplot with all the remaining non-missing values, but it is warning you that two of the points aren’t there.</li>
+<li><strong>Messages</strong>:  When the red text doesn’t start with either “Error” or “Warning”, it’s <em>just a friendly message</em>. You’ll see these messages when you load <em>R packages</em> in the upcoming Subsection <a href="1-getting-started.html#package-loading">1.3.2</a> or when you read data saved in spreadsheet files with the <code>read_csv()</code> function as you’ll see in Chapter <a href="4-tidy.html#tidy">4</a>. These are helpful diagnostic messages and they don’t stop your code from working. You’ll also see these messages when you install packages using <code>install.packages()</code>, as discussed in Subsection <a href="1-getting-started.html#package-installation">1.3.1</a>.</li>
 </ul>
 <p>Remember, when you see red text in the console, <em>don’t panic</em>. It doesn’t necessarily mean anything is wrong. Rather:</p>
 <ul>
 <li>If the text starts with “Error”, figure out what’s causing it. <span style="color:red">Think of errors as a red traffic light: something is wrong!</span></li>
 <li>If the text starts with “Warning”, figure out if it’s something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you’re fine. If that’s surprising, look at your data and see what’s missing. <span style="color:gold">Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention.</span></li>
-<li>Otherwise the text is just a message. Read it, wave back at R, and thank it for talking to you. <span style="color:green">Think of messages as a green traffic light: everything is working fine.</span></li>
+<li>Otherwise, the text is just a message. Read it, wave back at R, and thank it for talking to you. <span style="color:green">Think of messages as a green traffic light: everything is working fine, so keep on going!</span></li>
 </ul>
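+<p>As a minimal sketch, assuming you want to trigger each kind of red text yourself, the base R functions <code>message()</code>, <code>warning()</code>, and <code>stop()</code> produce a message, a warning, and an error respectively. The wording inside the quotation marks is made up purely for illustration:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># A message: purely informational, nothing is wrong
+message(&quot;Just a friendly message.&quot;)
+# A warning: the code still runs, but R flags something to check
+warning(&quot;Something to double-check.&quot;)
+# An error: R halts at this line
+stop(&quot;Something went wrong.&quot;)</code></pre>
+<p>If you run these lines one at a time in the console, the second should produce text beginning with “Warning” and the third text beginning with “Error”, while the first simply prints its text.</p>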
 </div>
 <div id="tips-code" class="section level3">
 <h3><span class="header-section-number">1.2.3</span> Tips on learning to code</h3>
-<p>Learning to code/program is very much like learning a foreign language. It can be very daunting and frustrating at first. Such frustrations are very common and it is very normal to feel discouraged as you learn. However just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn.</p>
+<p>Learning to code/program is quite similar to learning a foreign language. It can be daunting and frustrating at first. Such frustrations are common and it is normal to feel discouraged as you learn. However, just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn and improve.</p>
 <p>Here are a few useful tips to keep in mind as you learn to program:</p>
 <ul>
-<li><strong>Remember that computers are not actually that smart</strong>: You may think your computer or smartphone are “smart,” but really people spent a lot of time and energy designing them to appear “smart.” In reality, you have to tell a computer everything it needs to do. Furthermore, the instructions you give your computer can’t have any mistakes in them nor can they be ambiguous in any way.</li>
-<li><strong>Take the “copy, paste, and tweak” approach</strong>: Especially when you learn your first programming language or you need to understand particularly complicated code, it is often much easier to take existing code that you know works and modify it to suit your ends. This is opposed to trying to type out the code from scratch. We call this the <em>“copy, paste, and tweak”</em> approach. So early on, we suggest not trying to write code from memory, but rather take existing examples we have provided you, then copy, paste, and tweak them to suit your goals. After you start feeling more confident, you can slowly move away from this approach. Think of the “copy, paste, and tweak” approach as training wheels for a child learning to ride a bike. After getting comfortable, they won’t need them anymore.</li>
-<li><strong>The best way to learn to code is by doing</strong>: Rather than learning to code for its own sake, we feel that learning to code goes much smoother when you have a goal in mind or when you are working on a particular project, like analyzing data that you are interested in.</li>
-<li><strong>Practice is key</strong>: Just as the only method to improve your foreign language skills is through lots of practice, the only method to improving your coding skills is through lots of practice. Don’t worry however, we’ll give you plenty of opportunities to do so!</li>
+<li><strong>Remember that computers are not actually that smart</strong>: You may think your computer or smartphone is “smart,” but really people spent a lot of time and energy designing them to appear “smart.” In reality, you have to tell a computer everything it needs to do. Furthermore, the instructions you give your computer can’t have any mistakes in them, nor can they be ambiguous in any way.</li>
+<li><strong>Take the “copy, paste, and tweak” approach</strong>: Especially when you learn your first programming language or you need to understand particularly complicated code, it is often much easier to take existing code that you know works and modify it to suit your ends. This is as opposed to trying to type out the code from scratch. We call this the <em>“copy, paste, and tweak”</em> approach. So early on, we suggest not trying to write code from memory, but rather take existing examples we have provided you, then copy, paste, and tweak them to suit your goals. After you start feeling more confident, you can slowly move away from this approach and write code from scratch. Think of the “copy, paste, and tweak” approach as training wheels for a child learning to ride a bike. After getting comfortable, they won’t need them anymore.</li>
+<li><strong>The best way to learn to code is by doing</strong>: Rather than learning to code for its own sake, we find that learning to code goes much smoother when you have a goal in mind or when you are working on a particular project, like analyzing data that you are interested in and that is important to you.</li>
+<li><strong>Practice is key</strong>: Just as the only method to improve your foreign language skills is through lots of practice and speaking, the only method to improve your coding skills is through lots of practice. Don’t worry, however: we’ll give you plenty of opportunities to do so!</li>
 </ul>
 </div>
 </div>
 <div id="packages" class="section level2">
 <h2><span class="header-section-number">1.3</span> What are R packages?</h2>
-<p>Another point of confusion with many new R users is the idea of an R package. R packages  extend the functionality of R by providing additional functions, data, and documentation. They are written by a world-wide community of R users and can be downloaded for free from the internet.</p>
-<p>For example, among the many packages we will use in this book are the <code>ggplot2</code> package for data visualization in Chapter <a href="2-viz.html#viz">2</a>, the <code>dplyr</code> package <span class="citation">(Wickham, François, et al. <a href="#ref-R-dplyr">2019</a>)</span> for data wrangling in Chapter <a href="3-wrangling.html#wrangling">3</a>, the <code>moderndive</code> package <span class="citation">(Ismay <a href="#ref-R-moderndive">2019</a>)</span> that accompanies this book, and the <code>infer</code> package <span class="citation">(Bray et al. <a href="#ref-R-infer">2019</a>)</span> for “tidy” and transparent statistical inference in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a>, <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>, and <a href="10-inference-for-regression.html#inference-for-regression">10</a>.</p>
+<p>Another point of confusion with many new R users is the idea of an R package. R packages  extend the functionality of R by providing additional functions, data, and documentation. They are written by a worldwide community of R users and can be downloaded for free from the internet.</p>
+<p>For example, among the many packages we will use in this book are the <code>ggplot2</code> package <span class="citation">(Wickham, Chang, et al. <a href="#ref-R-ggplot2">2019</a>)</span> for data visualization in Chapter <a href="2-viz.html#viz">2</a>, the <code>dplyr</code> package <span class="citation">(Wickham, François, et al. <a href="#ref-R-dplyr">2019</a>)</span> for data wrangling in Chapter <a href="3-wrangling.html#wrangling">3</a>, the <code>moderndive</code> package <span class="citation">(Kim and Ismay <a href="#ref-R-moderndive">2019</a>)</span> that accompanies this book, and the <code>infer</code> package <span class="citation">(Bray et al. <a href="#ref-R-infer">2019</a>)</span> for “tidy” and transparent statistical inference in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a>, <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>, and <a href="10-inference-for-regression.html#inference-for-regression">10</a>.</p>
 <p>A good analogy for R packages  is they are like apps you can download onto a mobile phone:</p>
 <!--
 R: A new phone           |  R Packages: Apps you can download
@@ -702,27 +720,27 @@ <h2><span class="header-section-number">1.3</span> What are R packages?</h2>
 ![](images/shutterstock/shutterstock_693573352_cropped.jpg){ height=2.5in } |  ![](images/shutterstock/shutterstock_220533046.jpg){ height=2.5in }
 -->
 <div class="figure" style="text-align: center"><span id="fig:R-vs-R-packages"></span>
-<img src="images/shutterstock/R_vs_R_packages.png" alt="Analogy of R versus R packages." width="90%" />
+<img src="images/shutterstock/R_vs_R_packages.png" alt="Analogy of R versus R packages." width="70%" />
 <p class="caption">
 FIGURE 1.4: Analogy of R versus R packages.
 </p>
 </div>
 <p>So R is like a new mobile phone: while it has a certain number of features when you use it for the first time, it doesn’t have everything. R packages are like the apps you can download onto your phone from Apple’s App Store or Android’s Google Play.</p>
-<p>Let’s continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a photo you have just taken with friends and family on Instagram. You need to:</p>
+<p>Let’s continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a photo you have just taken with friends on Instagram. You need to:</p>
 <ol style="list-style-type: decimal">
 <li><em>Install the app</em>: Since your phone is new and does not include the Instagram app, you need to download the app from either the App Store or Google Play. You do this once and you’re set for the time being. You might need to do this again in the future when there is an update to the app.</li>
-<li><em>Open the app</em>: After you’ve installed Instagram, you need to open the app.</li>
+<li><em>Open the app</em>: After you’ve installed Instagram, you need to open it.</li>
 </ol>
 <p>Once Instagram is open on your phone, you can then proceed to share your photo with your friends and family. The process is very similar for using an R package. You need to:</p>
 <ol style="list-style-type: decimal">
 <li><em>Install the package</em>: This is like installing an app on your phone. Most packages are not installed by default when you install R and RStudio. Thus if you want to use a package for the first time, you need to install it first. Once you’ve installed a package, you likely won’t install it again unless you want to update it to a newer version.</li>
 <li><em>“Load” the package</em>: “Loading” a package is like opening an app on your phone. Packages are not “loaded” by default when you start RStudio on your computer; you need to “load” each package you want to use every time you start RStudio.</li>
 </ol>
-<p>Let’s now show you how to perform these two steps for the <code>ggplot2</code> package for data visualization.</p>
+<p>Let’s perform these two steps for the <code>ggplot2</code> package for data visualization.</p>
 <div id="package-installation" class="section level3">
 <h3><span class="header-section-number">1.3.1</span> Package installation</h3>
 <blockquote>
-<p><strong>Note about RStudio Server</strong>: If your instructor has provided you with a link and access to RStudio Server, you probably will not need to install packages, as they have likely been pre-installed for you by your instructor. That being said, it is still a good idea to know this process for later on when you are not using RStudio Server, but rather RStudio Desktop on your own computer.</p>
+<p><strong>Note about RStudio Server or RStudio Cloud</strong>: If your instructor has provided you with a link and access to RStudio Server or RStudio Cloud, you might not need to install packages, as they might be preinstalled for you by your instructor. That being said, it is still a good idea to know this process for later on when you are not using RStudio Server or Cloud, but rather RStudio Desktop on your own computer.</p>
 </blockquote>
 <p>There are two ways to install an R package: an easy way and a more advanced way.  Let’s install the <code>ggplot2</code> package the easy way first as shown in Figure <a href="1-getting-started.html#fig:easy-way-install">1.5</a>. In the Files pane of RStudio:</p>
 <ol style="list-style-type: lower-alpha">
@@ -732,31 +750,35 @@ <h3><span class="header-section-number">1.3.1</span> Package installation</h3>
 <li>Click “Install.”</li>
 </ol>
 <div class="figure" style="text-align: center"><span id="fig:easy-way-install"></span>
-<img src="images/rstudio_screenshots/install_packages_easy_way.png" alt="Installing packages in R the easy way." width="70%" />
+<img src="images/rstudio_screenshots/install_packages_easy_way.png" alt="Installing packages in R the easy way." width="55%" height="55%" />
 <p class="caption">
 FIGURE 1.5: Installing packages in R the easy way.
 </p>
 </div>
 <p>An alternative but slightly less convenient way to install a package is by typing <code>install.packages(&quot;ggplot2&quot;)</code> in the console pane of RStudio and pressing Return/Enter on your keyboard. Note you must include the quotation marks around the name of the package.</p>
 <p>Much like an app on your phone, you only have to install a package once. However, if you want to update a previously installed package to a newer version, you need to reinstall it by repeating the earlier steps.</p>
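+<p>If you are unsure whether a package is already installed, one possible approach is sketched below. It assumes you are comfortable reading a small <code>if</code> statement, and it uses the base R function <code>requireNamespace()</code>, which returns <code>TRUE</code> when the named package can be found and loaded on your computer:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Install ggplot2 only if it is not already available
+if (!requireNamespace(&quot;ggplot2&quot;, quietly = TRUE)) {
+  install.packages(&quot;ggplot2&quot;)
+}</code></pre>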
+
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC1.1)</strong> Repeat the earlier installing steps, but for the <code>dplyr</code>, <code>nycflights13</code>, and <code>knitr</code> packages. This will install the earlier mentioned <code>dplyr</code> package for data wrangling, the <code>nycflights13</code> package containing data on all domestic flights leaving a NYC airport in 2013, and the <code>knitr</code> package for writing reports in R. We’ll use these packages in the next section.</p>
+<p><strong>(LC1.1)</strong> Repeat the earlier installation steps, but for the <code>dplyr</code>, <code>nycflights13</code>, and <code>knitr</code> packages. This will install the earlier mentioned <code>dplyr</code> package for data wrangling, the <code>nycflights13</code> package containing data on all domestic flights leaving a NYC airport in 2013, and the <code>knitr</code> package for generating easy-to-read tables in R. We’ll use these packages in the next section.</p>
 <div class="learncheck">
 
 </div>
-<p>Note that if you’d like to match up exactly with what the output looks like throughout the book, you may want to use the exact versions of the packages that we used. You can find a full listing of these packages and their versions in Appendix <a href="E-appendixE.html#appendixE">E</a>. This likely won’t be relevant for novices, but we included it for reproducibility reasons.</p>
+<p></p>
+<p>Note that if you’d like your output on your computer to match up exactly with the output presented throughout the book, you may want to use the exact versions of the packages that we used. You can find a full listing of these packages and their versions in Appendix <a href="E-appendixE.html#appendixE">E</a>. This likely won’t be relevant for novices, but we included it for reproducibility reasons.</p>
 </div>
 <div id="package-loading" class="section level3">
 <h3><span class="header-section-number">1.3.2</span> Package loading</h3>
-<p>Recall that after you’ve installed a package, you need to “load it.” In other words, you need to “open it.” We do this by using the <code>library()</code> command.  For example, to load the <code>ggplot2</code> package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy &amp; paste the following code into the console pane and then hit the Enter key.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ggplot2)</code></pre>
-<p>If after running the earlier code, a blinking cursor returns next to the <code>&gt;</code> “prompt” sign, it means you were successful and the <code>ggplot2</code> package is now loaded and ready to use. If however, you get a red “error message” that reads… </p>
+<p>Recall that after you’ve installed a package, you need to “load it.” In other words, you need to “open it.” We do this by using the <code>library()</code> command. </p>
+<p>For example, to load the <code>ggplot2</code> package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy-and-paste the following code into the console pane and then hit the Enter key.</p>
+<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" data-line-number="1"><span class="kw">library</span>(ggplot2)</a></code></pre></div>
+<p>If after running the earlier code, a blinking cursor returns next to the <code>&gt;</code> “prompt” sign, it means you were successful and the <code>ggplot2</code> package is now loaded and ready to use. If, however, you get a red “error message” that reads <code>...</code> </p>
 <pre><code>Error in library(ggplot2) : there is no package called ‘ggplot2’</code></pre>
-<p>… it means that you didn’t successfully install it. This is an example of an “error message” we discussed in Subsection <a href="1-getting-started.html#messages">1.2.2</a>. If you get this error message, go back to Subsection <a href="1-getting-started.html#package-installation">1.3.1</a> on R package installation and make sure to install it.</p>
+<p><code>...</code> it means that you didn’t successfully install it. This is an example of an “error message” we discussed in Subsection <a href="1-getting-started.html#messages">1.2.2</a>. If you get this error message, go back to Subsection <a href="1-getting-started.html#package-installation">1.3.1</a> on R package installation and make sure to install the <code>ggplot2</code> package before proceeding.</p>
+
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -766,38 +788,39 @@ <h3><span class="header-section-number">1.3.2</span> Package loading</h3>
 <div class="learncheck">
 
 </div>
+<p></p>
 </div>
 <div id="package-use" class="section level3">
 <h3><span class="header-section-number">1.3.3</span> Package use</h3>
 <p>One very common mistake new R users make when wanting to use particular packages is they forget to “load” them first by using the <code>library()</code> command we just saw. Remember: <em>you have to load each package you want to use every time you start RStudio.</em> If you don’t first “load” a package, but attempt to use one of its features, you’ll see an error message similar to:</p>
 <pre><code>Error: could not find function</code></pre>
-<p>This is a different error message than the one you just saw on a package not having been installed yet. R is telling you that you are trying to use a function in a package that has not yet been “loaded.” R doesn’t know where to find the function you are using. Almost all new users forget to do this when starting out, and it is a little annoying to get used to doing it. However, you’ll remember with practice.</p>
+<p>This is a different error message than the one you just saw about a package not having been installed yet. R is telling you that you are trying to use a function in a package that has not yet been “loaded.” R doesn’t know where to find the function you are using. Almost all new users forget to do this when starting out, and it is a little annoying to get used to doing it. However, you’ll remember with practice and after some time it will become second nature for you.</p>
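+<p>One way to check whether a package is currently loaded, sketched here under the assumption that <code>ggplot2</code> is already installed, is the base R function <code>search()</code>, which lists everything attached in the current R session:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># FALSE in a fresh session, before library(ggplot2) has been run
+&quot;package:ggplot2&quot; %in% search()
+# Load the package, then check again
+library(ggplot2)
+&quot;package:ggplot2&quot; %in% search()</code></pre>
+<p>The second check should return <code>TRUE</code>, since <code>library(ggplot2)</code> attaches the package for the rest of the session.</p>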
 </div>
 </div>
 <div id="nycflights13" class="section level2">
 <h2><span class="header-section-number">1.4</span> Explore your first datasets</h2>
 <p>Let’s put everything we’ve learned so far into practice and start exploring some real data! Data comes to us in a variety of formats, from pictures to text to numbers. Throughout this book, we’ll focus on datasets that are saved in “spreadsheet”-type format. This is probably the most common way data are collected and saved in many fields. Remember from Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a> that these “spreadsheet”-type datasets are called <em>data frames</em> in R.  We’ll focus on working with data saved as data frames throughout this book.</p>
 <p>Let’s first load all the packages needed for this chapter, assuming you’ve already installed them. Read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages if you haven’t already.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(nycflights13)
-<span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(knitr)</code></pre>
+<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" data-line-number="1"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb4-2" data-line-number="2"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb4-3" data-line-number="3"><span class="kw">library</span>(knitr)</a></code></pre></div>
 <p>At the beginning of all subsequent chapters in this book, we’ll always have a list of packages that you should have installed and loaded in order to work with that chapter’s R code.</p>
 <div id="nycflights13-package" class="section level3">
 <h3><span class="header-section-number">1.4.1</span> <code>nycflights13</code> package</h3>
-<p>Many of us have flown on airplanes or know someone who has. Air travel has become an ever-present aspect in many people’s lives. If you look at the Departures flight information board at an airport, you will frequently see that some flights are delayed for a variety of reasons. Are there ways that we can understand the reasons that cause flight delays?</p>
-<p>We’d all like to arrive at our destinations on time whenever possible. (Unless you secretly love hanging out at airports. If you are one of these people, pretend for a moment that you are very much anticipating being at your final destination.) Throughout this book, we’re going to analyze data related to all 2013 domestic flights departing from one of New York City’s three airports: Newark Liberty International (EWR), John F. Kennedy International (JFK), and La Guardia (LGA). We’ll access this data using the <code>nycflights13</code>  R package which contained five datasets saved in five data frames:</p>
+<p>Many of us have flown on airplanes or know someone who has. Air travel has become an ever-present aspect of many people’s lives. If you look at the Departures flight information board at an airport, you will frequently see that some flights are delayed for a variety of reasons. Are there ways that we can understand the reasons that cause flight delays?</p>
+<p>We’d all like to arrive at our destinations on time whenever possible. (Unless you secretly love hanging out at airports. If you are one of these people, pretend for a moment that you are very much anticipating being at your final destination.) Throughout this book, we’re going to analyze data related to all domestic flights departing from one of New York City’s three main airports in 2013: Newark Liberty International (EWR), John F. Kennedy International (JFK), and LaGuardia Airport (LGA). We’ll access this data using the <code>nycflights13</code>  R package, which contains five datasets saved in five data frames:</p>
 <ul>
 <li><code>flights</code>: Information on all 336,776 flights.</li>
-<li><code>airlines</code>: A table matching airline names and their two letter IATA airline codes (also known as carrier codes) for 16 airline companies. Ex: DL is the two letter code for Delta Air Lines.</li>
+<li><code>airlines</code>: A table matching airline names and their two-letter International Air Transport Association (IATA) airline codes (also known as carrier codes) for 16 airline companies. For example, “DL” is the two-letter code for Delta.</li>
 <li><code>planes</code>: Information about each of the 3,322 physical aircraft used.</li>
-<li><code>weather</code>: Hourly meteorological data for each of the three NYC airports. This data frame has 26,115 rows, roughly corresponding to the 365 <span class="math inline">\(\times\)</span> 24 <span class="math inline">\(\times\)</span> 3 = 26,280 possible hourly measurements one can observe at three locations over the course of a year.</li>
-<li><code>airports</code>: Airport names, codes, and locations for the 1,458 domestic destination airports.</li>
+<li><code>weather</code>: Hourly meteorological data for each of the three NYC airports. This data frame has 26,115 rows, roughly corresponding to the <span class="math inline">\(365 \times 24 \times 3 = 26,280\)</span> possible hourly measurements one can observe at three locations over the course of a year.</li>
+<li><code>airports</code>: Names, codes, and locations of the 1,458 domestic destinations.</li>
 </ul>
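+<p>As a quick sanity check on these counts, here is a short sketch that assumes the <code>nycflights13</code> package is installed. The row counts you see may differ slightly depending on the version of the package you have:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Load the package containing the five data frames
+library(nycflights13)
+# Number of rows and columns in the flights data frame
+dim(flights)
+# Number of rows in the airlines and weather data frames
+nrow(airlines)
+nrow(weather)
+# Possible hourly measurements at three airports over a 365-day year
+365 * 24 * 3</code></pre>
+<p>The last line is plain arithmetic and evaluates to 26280, the number of possible hourly measurements quoted for <code>weather</code>.</p>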
 </div>
 <div id="flights-data-frame" class="section level3">
 <h3><span class="header-section-number">1.4.2</span> <code>flights</code> data frame</h3>
-<p>We’ll begin by exploring the <code>flights</code> data frame and get an idea of its structure. Run the following code in your console, either by typing it or by cutting &amp; pasting it. It displays the contents of the <code>flights</code> data frame in your console. Note that depending on the size of your monitor, the output may vary slightly.</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights</code></pre>
+<p>We’ll begin by exploring the <code>flights</code> data frame to get an idea of its structure. Run the following code in your console, either by typing it or by copying and pasting it. It displays the contents of the <code>flights</code> data frame in your console. Note that depending on the size of your monitor, the output may vary slightly.</p>
+<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" data-line-number="1">flights</a></code></pre></div>
 <pre><code># A tibble: 336,776 x 19
     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;
@@ -816,28 +839,29 @@ <h3><span class="header-section-number">1.4.2</span> <code>flights</code> data f
 #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</code></pre>
 <p>Let’s unpack this output:</p>
 <ul>
-<li><code>A tibble: 336,776 x 19</code>: A <code>tibble</code> is a specific kind of data frame and is short for “tidy table” (we’ll discuss what it means for a data frame to be “tidy” later on in Section <a href="4-tidy.html#tidy-data-ex">4.2</a>). This particular data frame has
+<li><code>A tibble: 336,776 x 19</code>: A <code>tibble</code> is a specific kind of data frame in R. This particular data frame has
 <ul>
 <li><code>336,776</code> rows corresponding to different <em>observations</em>. Here, each observation is a flight.</li>
 <li><code>19</code> columns corresponding to 19 <em>variables</em> describing each observation.</li>
 </ul></li>
-<li><code>year</code>, <code>month</code>, <code>day</code>, <code>dep_time</code>, <code>sched_dep_time</code>, <code>dep_delay</code>, and <code>arr_time</code> are the different columns, in other words, the different variables of this data set.</li>
-<li>We then have a preview of the first 10 rows of observations corresponding to the first 10 flights. R is only showing the first 10 rows, because if it showed all <code>336,776</code> rows it would overwhelm your screen.</li>
+<li><code>year</code>, <code>month</code>, <code>day</code>, <code>dep_time</code>, <code>sched_dep_time</code>, <code>dep_delay</code>, and <code>arr_time</code> are the different columns, in other words, the different variables of this dataset.</li>
+<li>We then have a preview of the first 10 rows of observations corresponding to the first 10 flights. R is only showing the first 10 rows, because if it showed all <code>336,776</code> rows, it would overwhelm your screen.</li>
 <li><code>... with 336,766 more rows, and 11 more variables:</code> indicates that 336,766 more rows of data and 11 more variables could not fit on this screen.</li>
 </ul>
-<p>Unfortunately, this output does not allow us to explore the data very well. Let’s look at some different ways to explore data frames.</p>
+<p>Unfortunately, this output does not allow us to explore the data very well, but it does give a nice preview. Let’s look at some different ways to explore data frames.</p>
 </div>
 <div id="exploredataframes" class="section level3">
 <h3><span class="header-section-number">1.4.3</span> Exploring data frames</h3>
 <p>There are many ways to get a feel for the data contained in a data frame such as <code>flights</code>. We present three functions that take as their “argument” (their input) the data frame in question. We also include a fourth method for exploring one particular column of a data frame:</p>
 <ol style="list-style-type: decimal">
-<li>Using the <code>View()</code> function, which brings up RStudio’s built-in spreadsheet viewer.</li>
+<li>Using the <code>View()</code> function, which brings up RStudio’s built-in data viewer.</li>
 <li>Using the <code>glimpse()</code> function, which is included in the <code>dplyr</code> package.</li>
 <li>Using the <code>kable()</code> function, which is included in the <code>knitr</code> package.</li>
-<li>Using the <code>$</code> “extraction operator”, which is used to view a single variable/column in a data frame.</li>
+<li>Using the <code>$</code> “extraction operator,” which is used to view a single variable/column in a data frame.</li>
 </ol>
 <p><strong>1. <code>View()</code></strong>:</p>
-<p>Run <code>View(flights)</code>  in your console in RStudio, either by typing it or cutting &amp; pasting it into the console pane, and explore this data frame in the resulting pop-up viewer. You should get into the habit of always viewing any data frames you encounter. Note the uppercase <code>V</code> in <code>View</code>. R is case-sensitive, so you’ll get an error message if you run <code>view(flights)</code> instead of <code>View(flights)</code>.</p>
+<p>Run <code>View(flights)</code> in your console in RStudio, either by typing it or copying and pasting it into the console pane. Explore this data frame in the resulting pop-up viewer. You should get into the habit of viewing any data frames you encounter. Note the uppercase <code>V</code> in <code>View()</code>. R is case-sensitive, so you’ll get an error message if you run <code>view(flights)</code> instead of <code>View(flights)</code>.</p>
+
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -853,11 +877,12 @@ <h3><span class="header-section-number">1.4.3</span> Exploring data frames</h3>
 <div class="learncheck">
 
 </div>
+
 <p>By running <code>View(flights)</code>, we can explore the different <em>variables</em> listed in the columns. Observe that there are many different types of variables. Some of the variables like <code>distance</code>, <code>day</code>, and <code>arr_delay</code> are what we will call <em>quantitative</em> variables.  These variables are numerical in nature. Other variables here are  <em>categorical</em>.</p>
-<p>Note that if you look in the leftmost column of the <code>View(flights)</code> output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row is representing. In other words, this will allow you to identify what object is being described in a given row. This is often called the <em>observational unit</em>. The observational unit in this example is an individual flight departing from New York City in 2013. You can identify the observational unit by determining what “thing” is being measured or described by each of the variables. We’ll talk more about observational units in Section <a href="1-getting-started.html#identification-vs-measurement-variables">1.4.4</a> on <em>identification</em> and <em>measurement</em> variables.</p>
+<p>Note that if you look in the leftmost column of the <code>View(flights)</code> output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a single row, say row 5, you can get an idea of what each row represents. The values of the columns in that row tell you what object the row describes. This object is often called the <em>observational unit</em>. The observational unit in this example is an individual flight departing from New York City in 2013. You can identify the observational unit by determining what “thing” is being measured or described by each of the variables. We’ll talk more about observational units in Subsection <a href="1-getting-started.html#identification-vs-measurement-variables">1.4.4</a> on <em>identification</em> and <em>measurement</em> variables.</p>
 <p><strong>2. <code>glimpse()</code></strong>:</p>
-<p>The second way to explore a data frame is using the <code>glimpse()</code> function  included in the  <code>dplyr</code> package. Thus, you can only use the <code>glimpse()</code> function after you’ve loaded the <code>dplyr</code> package by running <code>library(dplyr)</code>. This function provides us with an alternative perspective for exploring a data frame than the <code>View()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(flights)</code></pre>
+<p>The second way we’ll cover to explore a data frame is using the <code>glimpse()</code> function included in the <code>dplyr</code> package. Thus, you can only use the <code>glimpse()</code> function after you’ve loaded the <code>dplyr</code> package by running <code>library(dplyr)</code>. This function gives us a different perspective on a data frame than the <code>View()</code> function does:</p>
+<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" data-line-number="1"><span class="kw">glimpse</span>(flights)</a></code></pre></div>
 <pre><code>Observations: 336,776
 Variables: 19
 $ year           &lt;int&gt; 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
@@ -879,7 +904,9 @@ <h3><span class="header-section-number">1.4.3</span> Exploring data frames</h3>
 $ hour           &lt;dbl&gt; 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
 $ minute         &lt;dbl&gt; 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
 $ time_hour      &lt;dttm&gt; 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …</code></pre>
-<p>Observe that <code>glimpse()</code> will give you the first few entries of each variable in a row after the variable name. In addition, the <em>data type</em> (see Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a>) of the variable is given immediately after each variable’s name inside <code>&lt; &gt;</code>. Here, <code>int</code> and <code>dbl</code> refer to “integer” and “double”, which are computer coding terminology for quantitative/numerical variables. In contrast, <code>chr</code> refers to “character”, which is computer terminology for text data. Text data, such as the <code>carrier</code> or <code>origin</code> of a flight, are categorical variables. The <code>time_hour</code> variable is another data type: <code>dttm</code>. These types of variables represent date and time combinations. However, we won’t work with dates and times in this book, we leave this topic for a more advanced data science book.</p>
+<p>Observe that <code>glimpse()</code> will give you the first few entries of each variable in a row after the variable name. In addition, the <em>data type</em> (see Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a>) of the variable is given immediately after each variable’s name inside <code>&lt; &gt;</code>. Here, <code>int</code> and <code>dbl</code> refer to “integer” and “double,” which are computer coding terminology for quantitative/numerical variables. In R, a “double” takes up twice as much storage space on a computer as an integer.</p>
+<p>In contrast, <code>chr</code> refers to “character,” which is computer terminology for text data. In most cases, text data, such as the <code>carrier</code> or <code>origin</code> of a flight, are categorical variables. The <code>time_hour</code> variable is another data type: <code>dttm</code>. These types of variables represent date and time combinations. However, we won’t work with dates and times in this book; we leave this topic for other data science books like <a href="https://ubc-dsci.github.io/introduction-to-datascience/"><em>Introduction to Data Science</em> by Tiffany-Anne Timbers, Melissa Lee, and Trevor Campbell</a> or <a href="https://r4ds.had.co.nz/dates-and-times.html"><em>R for Data Science</em></a> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span>.</p>
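+<p>As a minimal sketch of how you could confirm these data types yourself, assuming the <code>nycflights13</code> package is installed, base R’s <code>typeof()</code> and <code>class()</code> functions report the type of a single column:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(nycflights13)
+
+# Storage type of individual columns of flights
+typeof(flights$year)     # an &lt;int&gt; column in the glimpse() output
+typeof(flights$hour)     # a &lt;dbl&gt; column
+typeof(flights$carrier)  # a &lt;chr&gt; column
+
+# Date-time columns such as time_hour carry a class on top of their storage type
+class(flights$time_hour)</code></pre>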
+
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -889,39 +916,43 @@ <h3><span class="header-section-number">1.4.3</span> Exploring data frames</h3>
 <div class="learncheck">
 
 </div>
+
+<!--
+\newpage
+-->
 <p><strong>3. <code>kable()</code></strong>:</p>
 <p>The final way to explore the entirety of a data frame is using the <code>kable()</code>  function from the  <code>knitr</code> package. Let’s explore the different carrier codes for all the airlines in our dataset two ways. Run both of these lines of code in the console:</p>
-<pre class="sourceCode r"><code class="sourceCode r">airlines
-<span class="kw">kable</span>(airlines)</code></pre>
-<p>At first glance, it may not appear that there is much difference in the outputs. However when using tools for producing reproducible reports such as <a href="http://rmarkdown.rstudio.com/lesson-1.html">R Markdown</a>, the latter code produces output that is much more legible and reader-friendly.</p>
+<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" data-line-number="1">airlines</a>
+<a class="sourceLine" id="cb9-2" data-line-number="2"><span class="kw">kable</span>(airlines)</a></code></pre></div>
+<p>At first glance, it may not appear that there is much difference in the outputs. However, when using tools for producing reproducible reports such as <a href="http://rmarkdown.rstudio.com/lesson-1.html">R Markdown</a>, the latter code produces output that is much more legible and reader-friendly. You’ll see us use this reader-friendly style in many places in the book when we want to print a data frame as a nice table.</p>
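+<p>As a hedged sketch of one common variation, you could print only the first few airlines as a table by combining <code>kable()</code> with base R’s <code>head()</code> function. The choice of three rows here is arbitrary:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(knitr)
+library(nycflights13)
+
+# Format only the first three rows of airlines as a table
+kable(head(airlines, n = 3))</code></pre>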
 <p><strong>4. <code>$</code> operator</strong></p>
 <p>Lastly, the <code>$</code> operator  allows us to extract and then explore a single variable within a data frame. For example, run the following in your console:</p>
-<pre class="sourceCode r"><code class="sourceCode r">airlines<span class="op">$</span>name</code></pre>
+<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" data-line-number="1">airlines<span class="op">$</span>name</a></code></pre></div>
 <p>We used the <code>$</code> operator to extract only the <code>name</code> variable and return it as a vector of length 16. We’ll only be occasionally exploring data frames using the <code>$</code> operator, instead favoring the <code>View()</code> and <code>glimpse()</code> functions.</p>
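+<p>To check the length and type of the extracted vector yourself, one option is base R’s <code>length()</code> and <code>class()</code> functions. This is a small sketch that assumes the <code>nycflights13</code> package is installed:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(nycflights13)
+
+# Number of elements in the extracted vector (one per airline)
+length(airlines$name)
+
+# The extracted column is a vector of text values, not a data frame
+class(airlines$name)</code></pre>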
 </div>
 <div id="identification-vs-measurement-variables" class="section level3">
-<h3><span class="header-section-number">1.4.4</span> Identification &amp; measurement variables</h3>
-<p>There is a subtle difference between the kinds of variables that you will encounter in data frames: <em>identification variables</em> and <em>measurement variables</em>. For example, let’s explore the <code>airports</code> data frame by showing the output of <code>glimpse(airports)</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(airports)</code></pre>
+<h3><span class="header-section-number">1.4.4</span> Identification and measurement variables</h3>
+<p>There is a subtle difference between the kinds of variables that you will encounter in data frames. There are <em>identification variables</em> and <em>measurement variables</em>. For example, let’s explore the <code>airports</code> data frame by showing the output of <code>glimpse(airports)</code>:</p>
+<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" data-line-number="1"><span class="kw">glimpse</span>(airports)</a></code></pre></div>
 <pre><code>Observations: 1,458
 Variables: 8
 $ faa   &lt;chr&gt; &quot;04G&quot;, &quot;06A&quot;, &quot;06C&quot;, &quot;06N&quot;, &quot;09J&quot;, &quot;0A9&quot;, &quot;0G6&quot;, &quot;0G7&quot;, &quot;0P2&quot;, …
 $ name  &lt;chr&gt; &quot;Lansdowne Airport&quot;, &quot;Moton Field Municipal Airport&quot;, &quot;Schaumbu…
 $ lat   &lt;dbl&gt; 41.1, 32.5, 42.0, 41.4, 31.1, 36.4, 41.5, 42.9, 39.8, 48.1, 39.…
 $ lon   &lt;dbl&gt; -80.6, -85.7, -88.1, -74.4, -81.4, -82.2, -84.5, -76.8, -76.6, …
-$ alt   &lt;int&gt; 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1…
+$ alt   &lt;dbl&gt; 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1…
 $ tz    &lt;dbl&gt; -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5,…
 $ dst   &lt;chr&gt; &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;U&quot;, &quot;A&quot;, &quot;A&quot;, &quot;U&quot;, &quot;A&quot;…
 $ tzone &lt;chr&gt; &quot;America/New_York&quot;, &quot;America/Chicago&quot;, &quot;America/Chicago&quot;, &quot;Amer…</code></pre>
 <p>The variables <code>faa</code> and <code>name</code> are what we will call <em>identification variables</em>, variables that uniquely identify each observational unit. In this case, the identification variables uniquely identify airports. Such variables are mainly used in practice to uniquely identify each row in a data frame. <code>faa</code> gives the unique code provided by the FAA for that airport, while the <code>name</code> variable gives the longer official name of the airport. The remaining variables (<code>lat</code>, <code>lon</code>, <code>alt</code>, <code>tz</code>, <code>dst</code>, <code>tzone</code>) are often called <em>measurement</em> or <em>characteristic</em> variables: variables that describe properties of each observational unit. For example, <code>lat</code> and <code>lon</code> describe the latitude and longitude of each airport.</p>
-<p>Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the left-most columns of your data frame.</p>
+<p>Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the leftmost columns of your data frame.</p>
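+<p>One way to check whether a variable uniquely identifies each row, sketched here under the assumption that the <code>dplyr</code> and <code>nycflights13</code> packages are installed, is to compare the number of distinct values of that variable with the number of rows. If the two match, no value is repeated:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(nycflights13)
+
+# TRUE if every airport has its own faa code
+n_distinct(airports$faa) == nrow(airports)
+
+# A single variable may not be enough: many flights share a flight number
+n_distinct(flights$flight) == nrow(flights)</code></pre>
+<p>When a comparison like the second one returns <code>FALSE</code>, that variable on its own cannot serve as an identification variable, and a combination of variables would be needed.</p>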
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
 <p><strong>(LC1.5)</strong> What properties of each airport do the variables <code>lat</code>, <code>lon</code>, <code>alt</code>, <code>tz</code>, <code>dst</code>, and <code>tzone</code> describe in the <code>airports</code> data frame? Take your best guess.</p>
-<p><strong>(LC1.6)</strong> Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy data frame that matches these conditions.</p>
+<p><strong>(LC1.6)</strong> Provide the names of variables in a data frame with at least three variables where one of them is an identification variable and the other two are not. Further, create your own tidy data frame that matches these conditions.</p>
 <div class="learncheck">
 
 </div>
@@ -929,14 +960,14 @@ <h3><span class="header-section-number">1.4.4</span> Identification &amp; measur
 <div id="help-files" class="section level3">
 <h3><span class="header-section-number">1.4.5</span> Help files</h3>
 <p>Another nice feature of R is its help files, which provide documentation for various functions and datasets. You can bring up help files by adding a <code>?</code>  before the name of a function or data frame and then running this in the console. You will then be presented with a page showing the corresponding documentation if it exists. For example, let’s look at the help file for the <code>flights</code> data frame.</p>
-<pre class="sourceCode r"><code class="sourceCode r">?flights</code></pre>
-<p>The help file should pop-up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away.</p>
+<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" data-line-number="1">?flights</a></code></pre></div>
+<p>The help file should pop up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away.</p>
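+<p>The <code>?</code> shortcut is shorthand for base R’s <code>help()</code> function. As a small sketch, the following call requests the same help file while naming the package explicitly, assuming the <code>nycflights13</code> package is installed:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Longer form of ?flights that names the package explicitly
+help("flights", package = "nycflights13")</code></pre>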
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC1.7)</strong> Look at the help file for the <code>airports</code> data frame. Revise your earlier guesses about what the variables <code>lat</code>, <code>lon</code>, <code>alt</code>, <code>tz</code>, <code>dst</code>, and <code>tzone</code> each describe. How good were your guesses?</p>
+<p><strong>(LC1.7)</strong> Look at the help file for the <code>airports</code> data frame. Revise your earlier guesses about what the variables <code>lat</code>, <code>lon</code>, <code>alt</code>, <code>tz</code>, <code>dst</code>, and <code>tzone</code> each describe.</p>
 <div class="learncheck">
 
 </div>
@@ -944,27 +975,30 @@ <h3><span class="header-section-number">1.4.5</span> Help files</h3>
 </div>
 <div id="conclusion" class="section level2">
 <h2><span class="header-section-number">1.5</span> Conclusion</h2>
-<p>We’ve given you what we feel is a minimally viable set of tools to explore data in R. Does this chapter contain everything you need to know? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn’t be useful! As we said earlier, the best way to further add to your toolbox is to learn by doing.</p>
+<p>We’ve given you what we feel is a minimally viable set of tools to explore data in R. Does this chapter contain everything you need to know? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn’t be useful! As we said earlier, the best way to add to your toolbox is to get into RStudio and run and write code as much as possible.</p>
 <div id="additional-resources" class="section level3">
 <h3><span class="header-section-number">1.5.1</span> Additional resources</h3>
-<p>If you are completely new to the world of coding, R, and RStudio and feel you could benefit from a more detailed introduction, we suggest you check out ModernDive co-author Chester Ismay’s short book <a href="https://rbasics.netlify.com/">“Getting used to R, RStudio, and R Markdown”</a> <span class="citation">(Ismay <a href="#ref-usedtor2016">2016</a>)</span>, which includes screencast recordings that you can follow along and pause as you learn. Furthermore, this book contains an introduction to R Markdown, a tool used for reproducible research in R.</p>
-<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-21"></span>
-<img src="images/copyright/getting-used-to-R.png" alt="Preview of Getting used to R, RStudio, and R Markdown book." width="\textwidth" />
+<p>If you are new to the world of coding, R, and RStudio and feel you could benefit from a more detailed introduction, we suggest you check out the short book <a href="https://rbasics.netlify.com/"><em>Getting Used to R, RStudio, and R Markdown</em></a> <span class="citation">(Ismay and Kennedy <a href="#ref-usedtor2016">2016</a>)</span>. It includes screencast recordings that you can follow along with and pause as you learn. This book also contains an introduction to R Markdown, a tool used for reproducible research in R.</p>
+
+<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-25"></span>
+<img src="images/copyright/getting-used-to-R.png" alt="Preview of Getting Used to R, RStudio, and R Markdown." width="\textwidth" />
 <p class="caption">
-FIGURE 1.6: Preview of Getting used to R, RStudio, and R Markdown book.
+FIGURE 1.6: Preview of <em>Getting Used to R, RStudio, and R Markdown</em>.
 </p>
 </div>
 </div>
 <div id="whats-to-come" class="section level3">
 <h3><span class="header-section-number">1.5.2</span> What’s to come?</h3>
-<p>As we stated earlier, however, the best way to learn R is to learn by doing. We’re now going to start the “Data Science with tidyverse” portion of this book in Chapter <a href="2-viz.html#viz">2</a> with what we feel is the most important tool in a data scientist’s toolbox: data visualization. We’ll continue to explore the data included in the <code>nycflights13</code> package using the <code>ggplot2</code> package for data visualization. You’ll see that data visualization is a powerful tool to add to your toolbox for data exploring that provides additional insight to what the <code>View()</code> and <code>glimpse()</code> functions can provide.</p>
-<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-22"></span>
-<img src="images/flowcharts/flowchart/flowchart.004.png" alt="ModernDive flowchart - On to Part I!" width="110%" />
+<p>We’re now going to start the “Data Science with <code>tidyverse</code>” portion of this book in Chapter <a href="2-viz.html#viz">2</a>, as shown in Figure <a href="1-getting-started.html#fig:moderndive-flowchart">1.7</a>, with what we feel is the most important tool in a data scientist’s toolbox: data visualization. We’ll continue to explore the data included in the <code>nycflights13</code> package using the <code>ggplot2</code> package for data visualization. You’ll see that data visualization is a powerful tool for data exploration that provides insight beyond what the <code>View()</code> and <code>glimpse()</code> functions can offer.</p>
+
+<div class="figure" style="text-align: center"><span id="fig:moderndive-flowchart"></span>
+<img src="images/flowcharts/flowchart/flowchart.004.png" alt="ModernDive flowchart - on to Part I!" width="100%" height="100%" />
 <p class="caption">
-FIGURE 1.7: ModernDive flowchart - On to Part I!
+FIGURE 1.7: <em>ModernDive</em> flowchart - on to Part I!
 </p>
 </div>
 
+
 </div>
 </div>
 </div>
@@ -974,13 +1008,19 @@ <h3><span class="header-section-number">1.5.2</span> What’s to come?</h3>
 <h3>References</h3>
 <div id="refs" class="references">
 <div id="ref-R-infer">
-<p>Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Ben Baumer, and Mine Cetinkaya-Rundel. 2019. <em>Infer: Tidy Statistical Inference</em>. <a href="https://github.com/tidymodels/infer">https://github.com/tidymodels/infer</a>.</p>
+<p>Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Ben Baumer, and Mine Cetinkaya-Rundel. 2019. <em>Infer: Tidy Statistical Inference</em>.</p>
+</div>
+<div id="ref-rds2016">
+<p>Grolemund, Garrett, and Hadley Wickham. 2017. <em>R for Data Science</em>. First. Sebastopol, CA: O’Reilly Media. <a href="https://r4ds.had.co.nz/">https://r4ds.had.co.nz/</a>.</p>
 </div>
 <div id="ref-usedtor2016">
-<p>Ismay, Chester. 2016. <em>Getting Used to R, RStudio, and R Markdown</em>. <a href="http://ismayc.github.io/rbasics-book">http://ismayc.github.io/rbasics-book</a>.</p>
+<p>Ismay, Chester, and Patrick C. Kennedy. 2016. <em>Getting Used to R, RStudio, and R Markdown</em>. <a href="https://rbasics.netlify.com">https://rbasics.netlify.com</a>.</p>
 </div>
 <div id="ref-R-moderndive">
-<p>Ismay, Chester. 2019. <em>Moderndive: Tidyverse-Friendly Introductory Linear Regression</em>. <a href="https://CRAN.R-project.org/package=moderndive">https://CRAN.R-project.org/package=moderndive</a>.</p>
+<p>Kim, Albert Y., and Chester Ismay. 2019. <em>Moderndive: Tidyverse-Friendly Introductory Linear Regression</em>. <a href="https://CRAN.R-project.org/package=moderndive">https://CRAN.R-project.org/package=moderndive</a>.</p>
+</div>
+<div id="ref-R-ggplot2">
+<p>Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. <em>Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics</em>. <a href="https://CRAN.R-project.org/package=ggplot2">https://CRAN.R-project.org/package=ggplot2</a>.</p>
 </div>
 <div id="ref-R-dplyr">
 <p>Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. <em>Dplyr: A Grammar of Data Manipulation</em>. <a href="https://CRAN.R-project.org/package=dplyr">https://CRAN.R-project.org/package=dplyr</a>.</p>
@@ -991,17 +1031,19 @@ <h3>References</h3>
           </div>
         </div>
       </div>
-<a href="index.html" class="navigation navigation-prev " aria-label="Previous page"><i class="fa fa-angle-left"></i></a>
+<a href="about-the-authors.html" class="navigation navigation-prev " aria-label="Previous page"><i class="fa fa-angle-left"></i></a>
 <a href="2-viz.html" class="navigation navigation-next " aria-label="Next page"><i class="fa fa-angle-right"></i></a>
     </div>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1009,12 +1051,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1029,6 +1070,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1045,8 +1090,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/10-inference-for-regression.html b/docs/10-inference-for-regression.html
index 54a60aba2..1ec5f3e11 100644
--- a/docs/10-inference-for-regression.html
+++ b/docs/10-inference-for-regression.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 10 Inference for Regression | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 10 Inference for Regression | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 10 Inference for Regression | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="9-hypothesis-testing.html">
-<link rel="next" href="11-thinking-with-data.html">
+<link rel="prev" href="9-hypothesis-testing.html"/>
+<link rel="next" href="11-thinking-with-data.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -582,54 +595,57 @@ <h3>Needed packages</h3>
 <li>As well as the more advanced <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages</li>
 </ul>
 <p>If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(moderndive)
-<span class="kw">library</span>(infer)</code></pre>
+<div class="sourceCode" id="cb417"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb417-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb417-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb417-3" data-line-number="3"><span class="kw">library</span>(infer)</a></code></pre></div>
 </div>
 <div id="regression-refresher" class="section level2">
 <h2><span class="header-section-number">10.1</span> Regression refresher</h2>
-<p>Before jumping into inference for regression, let’s remind ourselves of the University of Texas teaching evaluations analysis in Section <a href="5-regression.html#model1">5.1</a>.</p>
-<div id="teaching-evals-analysis" class="section level3">
-<h3><span class="header-section-number">10.1.1</span> Teaching evals analysis</h3>
+<p>Before jumping into inference for regression, let’s remind ourselves of the University of Texas at Austin teaching evaluations analysis in Section <a href="5-regression.html#model1">5.1</a>.</p>
+<div id="teaching-evaluations-analysis" class="section level3">
+<h3><span class="header-section-number">10.1.1</span> Teaching evaluations analysis</h3>
 <p>Recall using simple linear regression  we modeled the relationship between</p>
 <ol style="list-style-type: decimal">
-<li>A numerical outcome variable <span class="math inline">\(y\)</span>, the instructor’s teaching score and</li>
-<li>A single numerical explanatory variable <span class="math inline">\(x\)</span>, the instructor’s “beauty” score.</li>
+<li>A numerical outcome variable <span class="math inline">\(y\)</span> (the instructor’s teaching score) and</li>
+<li>A single numerical explanatory variable <span class="math inline">\(x\)</span> (the instructor’s “beauty” score).</li>
 </ol>
-<p>We first created an <code>evals_ch6</code> data frame that selected a subset of variables from the <code>evals</code> data frame included in the <code>moderndive</code> package. This <code>evals_ch6</code> data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching <code>score</code> and the “beauty” rating <code>bty_avg</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(ID, score, bty_avg, age)
-<span class="kw">glimpse</span>(evals_ch6)</code></pre>
+<p>We first created an <code>evals_ch5</code> data frame that selected a subset of variables from the <code>evals</code> data frame included in the <code>moderndive</code> package. This <code>evals_ch5</code> data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching <code>score</code> and the “beauty” rating <code>bty_avg</code>:</p>
+<div class="sourceCode" id="cb418"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb418-1" data-line-number="1">evals_ch5 &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb418-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(ID, score, bty_avg, age)</a>
+<a class="sourceLine" id="cb418-3" data-line-number="3"><span class="kw">glimpse</span>(evals_ch5)</a></code></pre></div>
 <pre><code>Observations: 463
 Variables: 4
 $ ID      &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
 $ score   &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
 $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3…
 $ age     &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…</code></pre>
-<p>In Section <a href="5-regression.html#model1EDA">5.1.1</a>, we performed an exploratory data analysis of the relationship between these two variables. We saw there that there was a weakly positive correlation of 0.187 between the two variables. This was evidenced in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a> of the scatterplot along with the “best-fitting” regression line that summarizes the linear relationship between the two variables. Recall in Subsection <a href="5-regression.html#leastsquares">5.3.2</a> that we defined a “best-fitting” line as the line that minimizes the <em>sum of squared residuals</em>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(evals_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Relationship between teaching and beauty scores&quot;</span>) <span class="op">+</span><span class="st">  </span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</code></pre>
+<p>In Subsection <a href="5-regression.html#model1EDA">5.1.1</a>, we performed an exploratory data analysis of the relationship between the two variables <code>score</code> and <code>bty_avg</code>. We saw there that a weakly positive correlation of 0.187 existed between the two variables.</p>
+<p>This was evidenced in the scatterplot in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>, shown along with the “best-fitting” regression line that summarizes the linear relationship between <code>score</code> and <code>bty_avg</code>. Recall from Subsection <a href="5-regression.html#leastsquares">5.3.2</a> that we defined a “best-fitting” line as the line that minimizes the <em>sum of squared residuals</em>.</p>
+<div class="sourceCode" id="cb420"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb420-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch5, </a>
+<a class="sourceLine" id="cb420-2" data-line-number="2">       <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb420-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb420-4" data-line-number="4"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, </a>
+<a class="sourceLine" id="cb420-5" data-line-number="5">       <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,</a>
+<a class="sourceLine" id="cb420-6" data-line-number="6">       <span class="dt">title =</span> <span class="st">&quot;Relationship between teaching and beauty scores&quot;</span>) <span class="op">+</span><span class="st">  </span></a>
+<a class="sourceLine" id="cb420-7" data-line-number="7"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:regline"></span>
-<img src="moderndive_files/figure-html/regline-1.png" alt="Relationship with regression line." width="\textwidth" />
+<img src="ModernDive_files/figure-html/regline-1.png" alt="Relationship with regression line." width="\textwidth" />
 <p class="caption">
 FIGURE 10.1: Relationship with regression line.
 </p>
 </div>
-<p>Looking at this plot again, you might be asking “Does that line really have all that positive of a slope?” It does increase from left to right as the <code>bty_avg</code> variable increases, but by how much? To get to this information, recall that we followed a two-step procedure:</p>
+<p>Looking at this plot again, you might be asking, “Does that line really have all that positive of a slope?” It does increase from left to right as the <code>bty_avg</code> variable increases, but by how much? To get to this information, recall that we followed a two-step procedure:</p>
 <ol style="list-style-type: decimal">
 <li>We first “fit” the linear regression model using the <code>lm()</code> function with the formula <code>score ~ bty_avg</code>. We saved this model in <code>score_model</code>.</li>
-<li>We get the regression table by applying the <code>get_regression_table()</code>  from the <code>moderndive</code> package to <code>score_model</code>.</li>
+<li>We get the regression table by applying the <code>get_regression_table()</code> function from the <code>moderndive</code> package to <code>score_model</code>.</li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch6)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model)</code></pre>
+<div class="sourceCode" id="cb421"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb421-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb421-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb421-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb421-4" data-line-number="4"><span class="kw">get_regression_table</span>(score_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:regtable-11">TABLE 10.1: </span>Previously seen linear regression table.
+<span id="tab:regtable-11">TABLE 10.1: </span>Previously seen linear regression table
 </caption>
 <thead>
 <tr>
@@ -717,15 +733,14 @@ <h3><span class="header-section-number">10.1.1</span> Teaching evals analysis</h
 <blockquote>
 <p>For every increase of one unit in “beauty” rating, there is an associated increase, on average, of 0.067 units of evaluation score.</p>
 </blockquote>
-<p>Thus, the slope value quantifies the relationship between the y variable of <code>score</code> and the x variable <code>bty_avg</code>. We also discussed the intercept value of <span class="math inline">\(b_0\)</span> = 3.88 and its lack of practical interpretation, since the range of possible “beauty” scores does not include 0.</p>
+<p>Thus, the slope value quantifies the relationship between the <span class="math inline">\(y\)</span> variable <code>score</code> and the <span class="math inline">\(x\)</span> variable <code>bty_avg</code>. We also discussed the intercept value of <span class="math inline">\(b_0\)</span> = 3.88 and its lack of practical interpretation, since the range of possible “beauty” scores does not include 0.</p>
 </div>
 <div id="sampling-scenario-2" class="section level3">
 <h3><span class="header-section-number">10.1.2</span> Sampling scenario</h3>
-<p>Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Section <a href="7-sampling.html#terminology-and-notation">7.3.1</a>.</p>
-<p>First, let’s view the instructors for these 463 courses as a <em>representative sample</em> from a greater <em>study population</em>. In our case, let’s assume that the study population is <em>all</em> instructors at UT Austin and that the sample of instructors who taught these 463 is a representative sample. Unfortunately, we can only <em>assume</em> these two facts without more knowledge of the <em>sampling methodology</em> used by the researchers.</p>
+<p>Let’s now revisit this study in terms of the terminology and notation related to sampling we studied in Subsection <a href="7-sampling.html#terminology-and-notation">7.3.1</a>.</p>
+<p>First, let’s view the instructors for these 463 courses as a <em>representative sample</em> from a greater <em>study population</em>. In our case, let’s assume that the study population is <em>all</em> instructors at UT Austin and that the sample of instructors who taught these 463 courses is a representative sample. Unfortunately, we can only <em>assume</em> these two facts without more knowledge of the <em>sampling methodology</em> used by the researchers.</p>
 <p>Since we are viewing these <span class="math inline">\(n\)</span> = 463 courses as a sample, we can view our fitted slope <span class="math inline">\(b_1\)</span> = 0.067 as a <em>point estimate</em> of the <em>population slope</em> <span class="math inline">\(\beta_1\)</span>. In other words, <span class="math inline">\(\beta_1\)</span> quantifies the relationship between teaching <code>score</code> and “beauty” average <code>bty_avg</code> for <em>all</em> instructors at UT Austin. Similarly, we can view our fitted intercept <span class="math inline">\(b_0\)</span> = 3.88 as a <em>point estimate</em> of the <em>population intercept</em> <span class="math inline">\(\beta_0\)</span> for <em>all</em> instructors at UT Austin.</p>
-<p>Putting these two ideas together, we can view the equation of the fitted line <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(b_0 + b_1 \cdot x\)</span> = <span class="math inline">\(3.880 + 0.067 \cdot \text{bty}\_\text{avg}\)</span> as an estimate of some true and unknown <em>population line</em> <span class="math inline">\(y = \beta_0 + \beta_1 \cdot x\)</span>.</p>
-<p>Thus we can draw parallels between our teaching evals analysis and all the sampling scenarios we’ve seen previously in Table <a href="7-sampling.html#tab:table-ch8">7.5</a>. In this chapter, we’ll focus on the final two scenarios: regression slopes and regression intercepts.</p>
+<p>Putting these two ideas together, we can view the equation of the fitted line <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(b_0 + b_1 \cdot x\)</span> = <span class="math inline">\(3.880 + 0.067 \cdot \text{bty}\_\text{avg}\)</span> as an estimate of some true and unknown <em>population line</em> <span class="math inline">\(y = \beta_0 + \beta_1 \cdot x\)</span>. Thus we can draw parallels between our teaching evaluations analysis and all the sampling scenarios we’ve seen previously. In this chapter, we’ll focus on the final scenario of regression slopes as shown in Table <a href="10-inference-for-regression.html#tab:summarytable-ch11">10.2</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:summarytable-ch11">TABLE 10.2: </span>Scenarios of sampling for inference
@@ -745,7 +760,7 @@ <h3><span class="header-section-number">10.1.2</span> Sampling scenario</h3>
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -835,35 +850,17 @@ <h3><span class="header-section-number">10.1.2</span> Sampling scenario</h3>
 <span class="math inline">\(b_1\)</span> or <span class="math inline">\(\widehat{\beta}_1\)</span>
 </td>
 </tr>
-<tr>
-<td style="text-align:right;width: 0.5in; ">
-6
-</td>
-<td style="text-align:left;width: 0.7in; ">
-Population regression intercept
-</td>
-<td style="text-align:left;width: 1in; ">
-<span class="math inline">\(\beta_0\)</span>
-</td>
-<td style="text-align:left;width: 1.1in; ">
-Fitted regression intercept
-</td>
-<td style="text-align:left;width: 1in; ">
-<span class="math inline">\(b_0\)</span> or <span class="math inline">\(\widehat{\beta}_0\)</span>
-</td>
-</tr>
 </tbody>
 </table>
-<p>Since we are now viewing our fitted slope <span class="math inline">\(b_1\)</span> and fitted intercept <span class="math inline">\(b_0\)</span> as <em>point estimates</em> based on a <em>sample</em>, these estimates will be subject to <em>sampling variability</em>, as we’ve seen numerous times throughout this book. In other words, if we collected new sample of data on a different set of <span class="math inline">\(n\)</span> = 463 courses and their instructors, the new fitted slope <span class="math inline">\(b_1\)</span> will likely differ from 0.067. The same goes for the new fitted intercept <span class="math inline">\(b_0\)</span>.</p>
-<p>But by how much will they differ? In other words, by how much will these estimates <em>vary</em>? This information is contained in the remaining columns of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. Our knowledge of sampling from Chapter <a href="7-sampling.html#sampling">7</a>, confidence intervals from Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and hypothesis tests from Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> will help us interpret these remaining columns.</p>
+<p>Since we are now viewing our fitted slope <span class="math inline">\(b_1\)</span> and fitted intercept <span class="math inline">\(b_0\)</span> as <em>point estimates</em> based on a <em>sample</em>, these estimates will again be subject to <em>sampling variability</em>. In other words, if we collected a new sample of data on a different set of <span class="math inline">\(n\)</span> = 463 courses and their instructors, the new fitted slope <span class="math inline">\(b_1\)</span> would likely differ from 0.067. The same goes for the new fitted intercept <span class="math inline">\(b_0\)</span>. But by how much will these estimates <em>vary</em>? This information is in the remaining columns of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. Our knowledge of sampling from Chapter <a href="7-sampling.html#sampling">7</a>, confidence intervals from Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and hypothesis tests from Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> will help us interpret these remaining columns.</p>
 </div>
 </div>
 <div id="regression-interp" class="section level2">
 <h2><span class="header-section-number">10.2</span> Interpreting regression tables</h2>
-<p>In Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a> and in our regression refresher earlier, we focused only on the two leftmost columns the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>: <code>term</code> and <code>estimate</code>. Let’s now shift our attention to the remaining columns: <code>std_error</code>, <code>statistic</code>, <code>p_value</code>, <code>lower_ci</code> and <code>upper_ci</code>.</p>
+<p>We’ve so far focused only on the two leftmost columns of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>: <code>term</code> and <code>estimate</code>. Let’s now shift our attention to the remaining columns, shown again in Table <a href="10-inference-for-regression.html#tab:score-model-part-deux">10.3</a>: <code>std_error</code>, <code>statistic</code>, <code>p_value</code>, <code>lower_ci</code>, and <code>upper_ci</code>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:score-model-part-deux">TABLE 10.3: </span>Previously seen regression table.
+<span id="tab:score-model-part-deux">TABLE 10.3: </span>Previously seen regression table
 </caption>
 <thead>
 <tr>
@@ -950,7 +947,7 @@ <h3><span class="header-section-number">10.2.1</span> Standard error</h3>
 <p>Say we hypothetically collected 1000 such samples of pairs of teaching and beauty scores, computed the 1000 resulting values of the fitted slope <span class="math inline">\(b_1\)</span>, and visualized them in a histogram. This would be a visualization of the <em>sampling distribution</em> of <span class="math inline">\(b_1\)</span>, which we defined in Subsection <a href="7-sampling.html#sampling-definitions">7.3.2</a>. Further recall that the standard deviation of the <em>sampling distribution</em> of <span class="math inline">\(b_1\)</span> has a special name: the <em>standard error</em>.</p>
 <p>Recall that we constructed three sampling distributions for the sample proportion <span class="math inline">\(\widehat{p}\)</span> using shovels of size 25, 50, and 100 in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a>. We observed that as the sample size increased, the standard error decreased as evidenced by the narrowing sampling distribution.</p>
 <p>The <em>standard error</em> of <span class="math inline">\(b_1\)</span> similarly quantifies how much variation in the fitted slope <span class="math inline">\(b_1\)</span> one would expect between different samples. So in our case, we can expect about 0.016 units of variation in the <code>bty_avg</code> slope variable. Recall that the <code>estimate</code> and <code>std_error</code> values play a key role in <em>inferring</em> the value of the unknown population slope <span class="math inline">\(\beta_1\)</span> relating to <em>all</em> instructors.</p>
-<p>In Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll perform a simulation using the <code>infer</code> package to construct the bootstrap distribution for <span class="math inline">\(b_1\)</span> in this case. Recall from Subsection <a href="8-confidence-intervals.html#bootstrap-vs-sampling">8.7.1</a> that the bootstrap distribution is an <em>approximation</em> to the sampling distribution in that they have a similar shape. Since they have a similar shape, they have similar <em>standard errors</em>. However, unlike the sampling distribution, the bootstrap distribution is constructed from a <em>single</em> sample, which is a practice more aligned with what’s done in real-life.</p>
+<p>In Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll perform a simulation using the <code>infer</code> package to construct the bootstrap distribution for <span class="math inline">\(b_1\)</span> in this case. Recall from Subsection <a href="8-confidence-intervals.html#bootstrap-vs-sampling">8.7.1</a> that the bootstrap distribution is an <em>approximation</em> to the sampling distribution in that they have a similar shape. Since they have a similar shape, they have similar <em>standard errors</em>. However, unlike the sampling distribution, the bootstrap distribution is constructed from a <em>single</em> sample, which is a practice more aligned with what’s done in real life.</p>
 </div>
 <div id="regression-test-statistic" class="section level3">
 <h3><span class="header-section-number">10.2.2</span> Test statistic</h3>
@@ -958,45 +955,44 @@ <h3><span class="header-section-number">10.2.2</span> Test statistic</h3>
 <p><span class="math display">\[
 \begin{aligned}
 H_0 &amp;: \beta_1 = 0\\
-\text{vs } H_A&amp;: \beta_1 \neq 0
+\text{vs } H_A&amp;: \beta_1 \neq 0.
 \end{aligned}
 \]</span></p>
 <p>Recall our terminology, notation, and definitions related to hypothesis tests we introduced in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>.</p>
 <blockquote>
-<p>A <em>hypothesis test</em> consists of a test between two competing hypotheses: 1) a <em>null hypothesis</em> <span class="math inline">\(H_0\)</span> versus 2) an <em>alternative hypothesis</em> <span class="math inline">\(H_A\)</span>.</p>
+<p>A <em>hypothesis test</em> consists of a test between two competing hypotheses: (1) a <em>null hypothesis</em> <span class="math inline">\(H_0\)</span> versus (2) an <em>alternative hypothesis</em> <span class="math inline">\(H_A\)</span>.</p>
 <p>A <em>test statistic</em> is a point estimate/sample statistic formula used for hypothesis testing.</p>
 </blockquote>
 <p>Here, our <em>null hypothesis</em> <span class="math inline">\(H_0\)</span> assumes that the population slope <span class="math inline">\(\beta_1\)</span> is 0. If the population slope <span class="math inline">\(\beta_1\)</span> is truly 0, then this is saying that there is <em>no true relationship</em> between teaching and “beauty” scores for <em>all</em> the instructors in our population. In other words, <span class="math inline">\(x\)</span> = “beauty” score would have no associated effect on <span class="math inline">\(y\)</span> = teaching score.
-The <em>alternative hypothesis</em> <span class="math inline">\(H_A\)</span>, on the other hand, assumes that population slope <span class="math inline">\(\beta_1\)</span> is not 0, meaning it could be either positive or negative, suggesting either a positive or negative relationship between teaching and “beauty” scores. Recall we called such alternative hypotheses <em>two-sided</em>. By convention, all hypothesis testing for regression assumes two-sided alternatives.</p>
+The <em>alternative hypothesis</em> <span class="math inline">\(H_A\)</span>, on the other hand, assumes that the population slope <span class="math inline">\(\beta_1\)</span> is not 0, meaning it could be either positive or negative. This suggests either a positive or negative relationship between teaching and “beauty” scores. Recall we called such alternative hypotheses <em>two-sided</em>. By convention, all hypothesis testing for regression assumes two-sided alternatives.</p>
 <p>Recall our “hypothesized universe” of no gender discrimination we <em>assumed</em> in our <code>promotions</code> activity in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a>. Similarly here when conducting this hypothesis test, we’ll assume a “hypothesized universe” where there is no relationship between teaching and “beauty” scores. In other words, we’ll assume the null hypothesis <span class="math inline">\(H_0: \beta_1 = 0\)</span> is true.</p>
-<p>The <code>statistic</code> column in the regression table is a tricky one however. It corresponds to a standardized <em>t-test statistic</em>, much like the <em>two-sample <span class="math inline">\(t\)</span> statistic</em> we saw in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a> where we used a theory-based method for conducting hypothesis tests. In both these cases, the <em>null distribution</em> can be mathematically proven to be a <em><span class="math inline">\(t\)</span>-distribution</em>. Since such test statistics are tricky for individuals new to statistical inference to study, we’ll skip this and jump into interpreting the p-value. If you’re curious however, we’ve included a discussion of this standardized <em>t-test statistic</em> in Subsection <a href="10-inference-for-regression.html#theory-regression">10.5.1</a>.</p>
+<p>The <code>statistic</code> column in the regression table is a tricky one, however. It corresponds to a standardized <em>t-test statistic</em>, much like the <em>two-sample <span class="math inline">\(t\)</span> statistic</em> we saw in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a> where we used a theory-based method for conducting hypothesis tests. In both these cases, the <em>null distribution</em> can be mathematically proven to be a <em><span class="math inline">\(t\)</span>-distribution</em>. Since such test statistics are tricky for individuals new to statistical inference to study, we’ll skip this and jump into interpreting the <span class="math inline">\(p\)</span>-value. If you’re curious, we have included a discussion of this standardized <em>t-test statistic</em> in Subsection <a href="10-inference-for-regression.html#theory-regression">10.5.1</a>.</p>
 </div>
 <div id="p-value" class="section level3">
 <h3><span class="header-section-number">10.2.3</span> p-value</h3>
-<p>The fifth column of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> <code>p-value</code> corresponds to the <em>p-value</em> of the hypothesis test <span class="math inline">\(H_0: \beta_1 = 0\)</span> versus <span class="math inline">\(H_A: \beta_1 \neq 0\)</span>.</p>
-<p>Again recalling our terminology, notation, and definitions related to hypothesis tests we introduced in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>, let’s focus on the definition of the p-value:</p>
+<p>The fifth column of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> (<code>p_value</code>) corresponds to the <em>p-value</em> of the hypothesis test <span class="math inline">\(H_0: \beta_1 = 0\)</span> versus <span class="math inline">\(H_A: \beta_1 \neq 0\)</span>.</p>
+<p>Again recalling our terminology, notation, and definitions related to hypothesis tests we introduced in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>, let’s focus on the definition of the <span class="math inline">\(p\)</span>-value:</p>
 <blockquote>
-<p>A <em>p-value</em> is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em></p>
+<p>A <em>p-value</em> is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
 </blockquote>
-<p>Recall that you can intuitively think of the p-value as quantifying how “extreme” the observed fitted slope of <span class="math inline">\(b_1\)</span> = 0.067 is in a “hypothesized universe” where is there is no relationship between teaching and “beauty” scores.</p>
-<p>Following the hypothesis testing procedure we outlined in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>, since the p-value in this case is 0, for any choice of significance level <span class="math inline">\(\alpha\)</span> we would reject <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span>. Using non-statistical language, this is saying: we reject the hypothesis that there is no relationship between teaching and “beauty” scores in favor of the hypothesis that that is. In other words, the evidence suggests there is a significant relationship, one that is positive.</p>
-<p>More precisely however, the p-value corresponds to how extreme the observed test statistic of 4.09 is when compared to the appropriate <em>null distribution</em>. In Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll perform a simulation using the <code>infer</code> package to construct the null distribution in this case.</p>
+<p>Recall that you can intuitively think of the <span class="math inline">\(p\)</span>-value as quantifying how “extreme” the observed fitted slope of <span class="math inline">\(b_1\)</span> = 0.067 is in a “hypothesized universe” where there is no relationship between teaching and “beauty” scores.</p>
+<p>Following the hypothesis testing procedure we outlined in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>, since the <span class="math inline">\(p\)</span>-value in this case is reported as 0 (it is small enough to round to 0), for any commonly used choice of significance level <span class="math inline">\(\alpha\)</span> we would reject <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span>. Using non-statistical language, this is saying: we reject the hypothesis that there is no relationship between teaching and “beauty” scores in favor of the hypothesis that there is. That is to say, the evidence suggests there is a significant relationship, one that is positive.</p>
+<p>More precisely, however, the <span class="math inline">\(p\)</span>-value corresponds to how extreme the observed test statistic of 4.09 is when compared to the appropriate <em>null distribution</em>. In Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll perform a simulation using the <code>infer</code> package to construct the null distribution in this case.</p>
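+<p>As a rough check on this statement, the following R sketch converts the observed test statistic into a two-sided <span class="math inline">\(p\)</span>-value using a <span class="math inline">\(t\)</span>-distribution. The degrees of freedom <span class="math inline">\(n - 2 = 461\)</span> and the variable names <code>t_stat</code>, <code>deg_free</code>, and <code>p_value</code> are assumptions made for this illustration; the theory behind them is covered in Subsection <a href="10-inference-for-regression.html#theory-regression">10.5.1</a>.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Observed test statistic from the regression table
+t_stat &lt;- 4.09
+# Assumed degrees of freedom for simple linear regression: n - 2
+deg_free &lt;- 463 - 2
+# Two-sided p-value: probability in both tails of the t-distribution
+p_value &lt;- 2 * pt(abs(t_stat), df = deg_free, lower.tail = FALSE)
+p_value</code></pre>
+<p>The result is well below 0.001, which is consistent with the regression table displaying a rounded value of 0.</p>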
 <p>An extra caveat here is that the results of this hypothesis test are only valid if certain “conditions for inference for regression” are met, which we’ll introduce shortly in Section <a href="10-inference-for-regression.html#regression-conditions">10.3</a>.</p>
 </div>
 <div id="confidence-interval" class="section level3">
 <h3><span class="header-section-number">10.2.4</span> Confidence interval</h3>
-<p>The two rightmost columns of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> <code>lower_ci</code> and <code>upper_ci</code> correspond to the endpoints of the 95% <em>confidence interval</em> for the population slope <span class="math inline">\(\beta_1\)</span>. Recall our analogy of “nets are to fish” what “confidence intervals are to population parameters” from Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>. The resulting 95% confidence interval for <span class="math inline">\(\beta_1\)</span> of (0.035, 0.099) is a range of plausible values for the population slope <span class="math inline">\(\beta_1\)</span> of the linear relationship between teaching and “beauty” scores.</p>
-<p>As we introduced in Section <a href="8-confidence-intervals.html#shorthand">8.5.2</a> on the precise and shorthand interpretation of confidence intervals, the statistically precise interpretation of this confidence interval is: “if we repeated this sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population slope <span class="math inline">\(\beta_1\)</span>.” However, we’ll summarize this using our shorthand interpretation that “we’re 95% ‘confident’ that the true population slope <span class="math inline">\(\beta_1\)</span> lies between 0.035 and 0.099.”</p>
-<p>Notice in this case that the resulting 95% confidence interval for <span class="math inline">\(\beta_1\)</span> of (0.035, 0.099) does not contain a very particular value: <span class="math inline">\(\beta_1\)</span> equals 0. Recall we mentioned that if the population regression slope <span class="math inline">\(\beta_1\)</span> is 0, this is equivalent to saying there is <em>no</em> relationship between teaching and “beauty” scores. Since <span class="math inline">\(\beta_1\)</span> = 0 is not in our plausible range of values for <span class="math inline">\(\beta_1\)</span>, we are inclined to believe that there in fact <em>is</em> a relationship between teaching and “beauty” scores.</p>
-<p>So in this case, the conclusion about the population slope <span class="math inline">\(\beta_1\)</span> from the 95% confidence interval matches the conclusion from the hypothesis test: evidence suggests that there is a meaningful relationship between teaching and “beauty” scores!</p>
-<p>Recall from Subsection <a href="8-confidence-intervals.html#ci-width">8.5.3</a> however, that the <em>confidence level</em> is one of many factors that determine confidence interval widths. So for example, say we used a higher confidence level of 99% instead of 95%. The resulting confidence intervals for <span class="math inline">\(\beta_1\)</span> would be wider and thus might now include 0. The lesson to remember here is that any confidence interval based conclusion depends highly on the confidence level used.</p>
+<p>The two rightmost columns of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> (<code>lower_ci</code> and <code>upper_ci</code>) correspond to the endpoints of the 95% <em>confidence interval</em> for the population slope <span class="math inline">\(\beta_1\)</span>. Recall our analogy of “nets are to fish” what “confidence intervals are to population parameters” from Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>. The resulting 95% confidence interval for <span class="math inline">\(\beta_1\)</span> of (0.035, 0.099) can be thought of as a range of plausible values for the population slope <span class="math inline">\(\beta_1\)</span> of the linear relationship between teaching and “beauty” scores.</p>
+<p>As we introduced in Subsection <a href="8-confidence-intervals.html#shorthand">8.5.2</a> on the precise and shorthand interpretation of confidence intervals, the statistically precise interpretation of this confidence interval is: “if we repeated this sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population slope <span class="math inline">\(\beta_1\)</span>.” However, we’ll summarize this using our shorthand interpretation that “we’re 95% ‘confident’ that the true population slope <span class="math inline">\(\beta_1\)</span> lies between 0.035 and 0.099.”</p>
+<p>Notice in this case that the resulting 95% confidence interval for <span class="math inline">\(\beta_1\)</span> of <span class="math inline">\((0.035, \, 0.099)\)</span> does not contain a very particular value: <span class="math inline">\(\beta_1\)</span> equals 0. Recall we mentioned that if the population regression slope <span class="math inline">\(\beta_1\)</span> is 0, this is equivalent to saying there is <em>no</em> relationship between teaching and “beauty” scores. Since <span class="math inline">\(\beta_1\)</span> = 0 is not in our plausible range of values for <span class="math inline">\(\beta_1\)</span>, we are inclined to believe that there, in fact, <em>is</em> a relationship between teaching and “beauty” scores and a positive one at that. So in this case, the conclusion about the population slope <span class="math inline">\(\beta_1\)</span> from the 95% confidence interval matches the conclusion from the hypothesis test: evidence suggests that there is a meaningful relationship between teaching and “beauty” scores.</p>
+<p>Recall from Subsection <a href="8-confidence-intervals.html#ci-width">8.5.3</a>, however, that the <em>confidence level</em> is one of many factors that determine confidence interval widths. So for example, say we used a higher confidence level of 99% instead of 95%. The resulting confidence interval for <span class="math inline">\(\beta_1\)</span> would be wider and thus might now include 0. The lesson to remember here is that any confidence-interval-based conclusion depends highly on the confidence level used.</p>
 <p>What are the calculations that went into computing the two endpoints of the 95% confidence interval for <span class="math inline">\(\beta_1\)</span>?</p>
-<p>Recall our sampling bowl example from Section <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> <code>lower_ci</code> and <code>upper_ci</code>. Since the sampling and bootstrap distributions of the sample proportion <span class="math inline">\(\widehat{p}\)</span> were roughly normal, we could use the rule of thumb for bell-shaped distributions from Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a> to create a 95% confidence interval for <span class="math inline">\(p\)</span> with the following equation:</p>
+<p>Recall our sampling bowl example from Subsection <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> discussing <code>lower_ci</code> and <code>upper_ci</code>. Since the sampling and bootstrap distributions of the sample proportion <span class="math inline">\(\widehat{p}\)</span> were roughly normal, we could use the rule of thumb for bell-shaped distributions from Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a> to create a 95% confidence interval for <span class="math inline">\(p\)</span> with the following equation:</p>
 <p><span class="math display">\[\widehat{p} \pm \text{MoE}_{\widehat{p}} = \widehat{p} \pm 1.96 \cdot \text{SE}_{\widehat{p}} = \widehat{p} \pm 1.96 \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]</span></p>
-<p>We can generalize this to other point estimates that have roughly normally shaped sampling and bootstrap distributions:</p>
-<p><span class="math display">\[\text{point estimate} \pm \text{MoE} = \text{point estimate} \pm 1.96 \cdot \text{SE}\]</span></p>
+<p>We can generalize this to other point estimates that have roughly normally shaped sampling and/or bootstrap distributions:</p>
+<p><span class="math display">\[\text{point estimate} \pm \text{MoE} = \text{point estimate} \pm 1.96 \cdot \text{SE}.\]</span></p>
 <p>We’ll show in Section <a href="10-inference-for-regression.html#infer-regression">10.4</a> that the sampling/bootstrap distribution for the fitted slope <span class="math inline">\(b_1\)</span> is in fact bell-shaped as well. Thus we can construct a 95% confidence interval for <span class="math inline">\(\beta_1\)</span> with the following equation:</p>
-<p><span class="math display">\[b_1 \pm \text{MoE}_{b_1} = b_1 \pm 1.96 \cdot \text{SE}_{b_1}\]</span></p>
+<p><span class="math display">\[b_1 \pm \text{MoE}_{b_1} = b_1 \pm 1.96 \cdot \text{SE}_{b_1}.\]</span></p>
 <p>What is the value of the standard error <span class="math inline">\(\text{SE}_{b_1}\)</span>? It is in fact in the third column of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>: 0.016. Thus</p>
 <p><span class="math display">\[
 \begin{aligned}
@@ -1004,20 +1000,20 @@ <h3><span class="header-section-number">10.2.4</span> Confidence interval</h3>
 &amp;= (0.036, 0.098)
 \end{aligned}
 \]</span></p>
-<p>This closely matches the (0.035, 0.099) confidence interval in the last two columns of Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>.</p>
-<p>Much like hypothesis tests however, the results of this confidence interval also only valid if the “conditions for inference for regression” discussed in Section <a href="10-inference-for-regression.html#regression-conditions">10.3</a> are met.</p>
+<p>This closely matches the <span class="math inline">\((0.035, 0.099)\)</span> confidence interval in the last two columns of Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>.</p>
+<p>Much like hypothesis tests, however, the results of this confidence interval are also only valid if the “conditions for inference for regression” discussed in Section <a href="10-inference-for-regression.html#regression-conditions">10.3</a> are met.</p>
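+<p>The arithmetic behind this standard-error-based interval can be reproduced in a few lines of R. This is only a sketch using the rounded values from the regression table; the variable names <code>b1</code>, <code>se_b1</code>, and <code>moe</code> are chosen here for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Rounded values read from the regression table
+b1 &lt;- 0.067     # fitted slope
+se_b1 &lt;- 0.016  # standard error of the fitted slope
+# Margin of error from the rule of thumb for bell-shaped distributions
+moe &lt;- 1.96 * se_b1
+# Endpoints of the 95% confidence interval
+ci &lt;- c(lower = b1 - moe, upper = b1 + moe)
+round(ci, 3)</code></pre>
+<p>Rounded to three decimal places, the endpoints are 0.036 and 0.098. Because the inputs are themselves rounded, small discrepancies from the values reported by R are to be expected.</p>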
 </div>
 <div id="regression-table-computation" class="section level3">
 <h3><span class="header-section-number">10.2.5</span> How does R compute the table?</h3>
-<p>Since we didn’t perform the simulation to get the values of the standard error, test statistic, p-value, and endpoints of the 95% confidence interval in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>, you might be wondering how were these values computed. What did R do behind the scenes? Does R run simulations like we did using the <code>infer</code> package in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> on confidence intervals and hypothesis testing?</p>
-<p>The answer is no! Much like the theory-based method for constructing confidence intervals you saw in Section <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> and the theory-based hypothesis test you saw in Section <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>, there exist mathematical formulas that allow you to construct confidence intervals and conduct hypothesis tests for inference for regression. These formulas were derived in a time when computers didn’t exist, so it would’ve been impossible to run the extensive computer simulations we have in this book. We present these formulas in Subsection <a href="10-inference-for-regression.html#theory-regression">10.5.1</a> on “theory-based inference for regression.”</p>
-<p>In the upcoming Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll go over a simulation-based approach to constructing confidence intervals and conducting hypothesis tests using the <code>infer</code> package. In particular, we’ll convince you that the bootstrap distribution of the fitted slope <span class="math inline">\(b_1\)</span> is indeed bell-shaped.</p>
+<p>Since we didn’t perform the simulation to get the values of the standard error, test statistic, <span class="math inline">\(p\)</span>-value, and endpoints of the 95% confidence interval in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>, you might be wondering how these values were computed. What did R do behind the scenes? Does R run simulations like we did using the <code>infer</code> package in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> on confidence intervals and hypothesis testing?</p>
+<p>The answer is no! Much like the theory-based method for constructing confidence intervals you saw in Subsection <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> and the theory-based hypothesis test you saw in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>, there exist mathematical formulas that allow you to construct confidence intervals and conduct hypothesis tests for inference for regression. These formulas were derived in a time when computers didn’t exist, so it would’ve been impossible to run the extensive computer simulations we have in this book. We present these formulas in Subsection <a href="10-inference-for-regression.html#theory-regression">10.5.1</a> on “theory-based inference for regression.”</p>
+<p>In Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll go over a simulation-based approach to constructing confidence intervals and conducting hypothesis tests using the <code>infer</code> package. In particular, we’ll convince you that the bootstrap distribution of the fitted slope <span class="math inline">\(b_1\)</span> is indeed bell-shaped.</p>
 </div>
 </div>
 <div id="regression-conditions" class="section level2">
 <h2><span class="header-section-number">10.3</span> Conditions for inference for regression</h2>
-<p>Recall in Section <a href="8-confidence-intervals.html#se-method">8.3.2</a> we stated that we could only use the standard-error based method for constructing confidence intervals if the bootstrap distribution was bell shaped. Similarly, there are certain conditions that need to be met in order for the results of our hypothesis tests and confidence intervals we described in Section <a href="10-inference-for-regression.html#regression-interp">10.2</a> to have valid meaning. These conditions must be met for the assumed underlying mathematical and probability theory to hold true.</p>
-<p>For inference for regression, there are four conditions that need to be met. Note the first four letters of these conditions as highlighted in bold in what follows: <strong>LINE</strong>. This can serve as a nice reminder of what to check for whenever you perform linear regression. </p>
+<p>Recall in Subsection <a href="8-confidence-intervals.html#se-method">8.3.2</a> we stated that we could only use the standard-error-based method for constructing confidence intervals if the bootstrap distribution was bell shaped. Similarly, there are certain conditions that need to be met in order for the results of our hypothesis tests and confidence intervals we described in Section <a href="10-inference-for-regression.html#regression-interp">10.2</a> to have valid meaning. These conditions must be met for the assumed underlying mathematical and probability theory to hold true.</p>
+<p>For inference for regression, there are four conditions that need to be met. Note that the first letter of each condition is highlighted in bold in what follows, and together they spell <strong>LINE</strong>. This can serve as a nice reminder of what to check for whenever you perform linear regression. </p>
 <ol style="list-style-type: decimal">
 <li><strong>L</strong>inearity of relationship between variables</li>
 <li><strong>I</strong>ndependence of the residuals</li>
@@ -1025,22 +1021,22 @@ <h2><span class="header-section-number">10.3</span> Conditions for inference for
 <li><strong>E</strong>quality of variance of the residuals</li>
 </ol>
 <p>Conditions <strong>L</strong>, <strong>N</strong>, and <strong>E</strong> can be verified through what is known as a <em>residual analysis</em>. Condition <strong>I</strong> can only be verified through an understanding of how the data was collected.</p>
-<p>In this section, we’ll go over a refresher on residuals, verify whether each of the 4 <strong>LINE</strong> conditions hold true, and then discuss the implications.</p>
+<p>In this section, we’ll go over a refresher on residuals, verify whether each of the four <strong>LINE</strong> conditions holds true, and then discuss the implications.</p>
 <div id="residuals-refresher" class="section level3">
 <h3><span class="header-section-number">10.3.1</span> Residuals refresher</h3>
-<p>Recall our definition of a residual from Section <a href="5-regression.html#model1points">5.1.3</a>: it is the <em>observed value</em> minus the <em>fitted value</em> <span class="math inline">\(y - \widehat{y}\)</span>. Recall that residuals can be thought of as the error or the “lack-of-fit” between the observed value <span class="math inline">\(y\)</span> and the fitted value <span class="math inline">\(\widehat{y}\)</span> on the regression line in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>. In Figure <a href="10-inference-for-regression.html#fig:residual-example">10.2</a>, we illustrate one particular residual out of 463 using an arrow, as well its corresponding observed and fitted values using a circle and a square.</p>
+<p>Recall our definition of a residual from Subsection <a href="5-regression.html#model1points">5.1.3</a>: it is the <em>observed value</em> minus the <em>fitted value</em> denoted by <span class="math inline">\(y - \widehat{y}\)</span>. Recall that residuals can be thought of as the error or the “lack-of-fit” between the observed value <span class="math inline">\(y\)</span> and the fitted value <span class="math inline">\(\widehat{y}\)</span> on the regression line in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>. In Figure <a href="10-inference-for-regression.html#fig:residual-example">10.2</a>, we illustrate one particular residual out of 463 using an arrow, as well as its corresponding observed and fitted values using a circle and a square, respectively.</p>
 <div class="figure" style="text-align: center"><span id="fig:residual-example"></span>
-<img src="moderndive_files/figure-html/residual-example-1.png" alt="Example of observed value, fitted value, and residual." width="\textwidth" />
+<img src="ModernDive_files/figure-html/residual-example-1.png" alt="Example of observed value, fitted value, and residual." width="\textwidth" />
 <p class="caption">
 FIGURE 10.2: Example of observed value, fitted value, and residual.
 </p>
 </div>
-<p>Furthermore, we can automate the calculation of all <span class="math inline">\(n\)</span> = 463 residuals by applying the <code>get_regression_points()</code> function to our saved regression model in <code>score_model</code>. Observe how the resulting values of <code>residual</code> are roughly equal to <code>score - score_hat</code> (there is a slight difference due to rounding error).</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch6)
-<span class="co"># Get regression points:</span>
-regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model)
-regression_points</code></pre>
+<p>Furthermore, we can automate the calculation of all <span class="math inline">\(n\)</span> = 463 residuals by applying the <code>get_regression_points()</code> function to our saved regression model in <code>score_model</code>. Observe how the resulting values of <code>residual</code> are roughly equal to <code>score - score_hat</code> (there is potentially a slight difference due to rounding error).</p>
+<div class="sourceCode" id="cb422"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb422-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb422-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb422-3" data-line-number="3"><span class="co"># Get regression points:</span></a>
+<a class="sourceLine" id="cb422-4" data-line-number="4">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model)</a>
+<a class="sourceLine" id="cb422-5" data-line-number="5">regression_points</a></code></pre></div>
 <pre><code># A tibble: 463 x 5
       ID score bty_avg score_hat residual
    &lt;int&gt; &lt;dbl&gt;   &lt;dbl&gt;     &lt;dbl&gt;    &lt;dbl&gt;
@@ -1055,25 +1051,25 @@ <h3><span class="header-section-number">10.3.1</span> Residuals refresher</h3>
  9     9 3.4   3.333       4.102 -0.702  
 10    10 4.5   3.16700     4.091  0.40900
 # … with 453 more rows</code></pre>
-<p>A <em>residual analysis</em> is used to verify conditions <strong>L</strong>, <strong>N</strong>, and <strong>E</strong> and can be performed using appropriate data visualizations. While there are more sophisticated statistical approaches that can also be done, we’ll focus on the much simpler approach of look at plots.</p>
+<p>A <em>residual analysis</em> is used to verify conditions <strong>L</strong>, <strong>N</strong>, and <strong>E</strong> and can be performed using appropriate data visualizations. While there are more sophisticated statistical approaches that can also be done, we’ll focus on the much simpler approach of looking at plots.</p>
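+<p>Before turning to plots, the definition of a residual can be checked directly in base R. The sketch below assumes the <code>moderndive</code> package is installed and fits the model on the original <code>evals</code> data frame, on the assumption that this gives the same fit as <code>evals_ch5</code>; the name <code>manual_residuals</code> is introduced only for this illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(moderndive)  # provides the evals data frame
+# Fit the same simple linear regression of score on bty_avg
+score_model &lt;- lm(score ~ bty_avg, data = evals)
+# Residual = observed value minus fitted value
+manual_residuals &lt;- evals$score - fitted(score_model)
+# Compare with the residuals stored in the fitted model (no rounding here)
+all.equal(unname(manual_residuals), unname(residuals(score_model)))</code></pre>
+<p>Working with the unrounded values from <code>fitted()</code> and <code>residuals()</code> avoids the small rounding differences that can appear in the displayed table.</p>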
 </div>
 <div id="linearity-of-relationship" class="section level3">
 <h3><span class="header-section-number">10.3.2</span> Linearity of relationship</h3>
-<p>The first condition is that the relationship between the outcome variable <span class="math inline">\(y\)</span> and the explanatory variable <span class="math inline">\(x\)</span> must be <strong>L</strong>inear. Recall the scatterplot in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a> where we had the explanatory variable <span class="math inline">\(x\)</span> “beauty” score and the outcome variable <span class="math inline">\(y\)</span> teaching score. Would you say that the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is linear? It’s hard to say because of the scatter of the points about the line. In the authors’ opinions, we feel this relationship is “linear enough”.</p>
-<p>Let’s present an example where the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is clearly not linear in Figure <a href="10-inference-for-regression.html#fig:non-linear">10.3</a>. In this case, the points clearly do not form a line, but rather a U-shaped polynomial line. In this case, any results from an inference for regression would not be valid.</p>
+<p>The first condition is that the relationship between the outcome variable <span class="math inline">\(y\)</span> and the explanatory variable <span class="math inline">\(x\)</span> must be <strong>L</strong>inear. Recall the scatterplot in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a> where we had the explanatory variable <span class="math inline">\(x\)</span> as “beauty” score and the outcome variable <span class="math inline">\(y\)</span> as teaching score. Would you say that the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is linear? It’s hard to say because of the scatter of the points about the line. In the authors’ opinions, we feel this relationship is “linear enough.”</p>
+<p>Let’s present an example where the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is clearly not linear in Figure <a href="10-inference-for-regression.html#fig:non-linear">10.3</a>. In this case, the points clearly do not form a line, but rather a U-shaped polynomial curve. Any results from an inference for regression would therefore not be valid.</p>
 <div class="figure" style="text-align: center"><span id="fig:non-linear"></span>
-<img src="moderndive_files/figure-html/non-linear-1.png" alt="Example of clearly non-linear relationship." width="\textwidth" />
+<img src="ModernDive_files/figure-html/non-linear-1.png" alt="Example of a clearly non-linear relationship." width="\textwidth" />
 <p class="caption">
-FIGURE 10.3: Example of clearly non-linear relationship.
+FIGURE 10.3: Example of a clearly non-linear relationship.
 </p>
 </div>
 </div>
 <div id="independence-of-residuals" class="section level3">
 <h3><span class="header-section-number">10.3.3</span> Independence of residuals</h3>
 <p>The second condition is that the residuals must be <strong>I</strong>ndependent. In other words, the different observations in our data must be independent of one another.</p>
-<p>For our UT Austin data, while there is data on 463 courses, these 463 courses were actually taught by 94 unique instructors. In other words, the same professor is often included more than once in our data. The original <code>evals</code> data frame that we used to construct the <code>evals_ch6</code> data frame has a variable <code>prof_ID</code>, which is an anonymized identification variable for the professor:</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(ID, prof_ID, score, bty_avg)</code></pre>
+<p>For our UT Austin data, while there is data on 463 courses, these 463 courses were actually taught by 94 unique instructors. In other words, the same professor is often included more than once in our data. The original <code>evals</code> data frame that we used to construct the <code>evals_ch5</code> data frame has a variable <code>prof_ID</code>, which is an anonymized identification variable for the professor:</p>
+<div class="sourceCode" id="cb424"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb424-1" data-line-number="1">evals <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb424-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(ID, prof_ID, score, bty_avg)</a></code></pre></div>
 <pre><code># A tibble: 463 x 4
       ID prof_ID score bty_avg
    &lt;int&gt;   &lt;int&gt; &lt;dbl&gt;   &lt;dbl&gt;
@@ -1089,50 +1085,50 @@ <h3><span class="header-section-number">10.3.3</span> Independence of residuals<
 10    10       4 4.5   3.16700
 # … with 453 more rows</code></pre>
 <p>For example, the professor with <code>prof_ID</code> equal to 1 taught the first 4 courses in the data, the professor with <code>prof_ID</code> equal to 2 taught the next 3, and so on. Given that the same professor taught these first four courses, it is reasonable to expect that these four teaching scores are related to each other. If a professor gets a high <code>score</code> in one class, chances are fairly good they’ll get a high <code>score</code> in another. This dataset thus provides different information than if we had 463 unique instructors teaching the 463 courses.</p>
-<p>In this case we say there exists <em>dependence</em> between observations. The first four courses taught by professor 1 are dependent, the next 3 courses taught by professor 2 are related, and so on. Any proper analysis of this data needs to take into account that we have <em>repeated measures</em> for the same profs.</p>
+<p>In this case, we say there exists <em>dependence</em> between observations. The first four courses taught by professor 1 are dependent, the next three courses taught by professor 2 are dependent, and so on. Any proper analysis of this data needs to take into account that we have <em>repeated measures</em> for the same professors.</p>
 <p>So in this case, the independence condition is not met. What does this mean for our analysis? We’ll address this in Subsection <a href="10-inference-for-regression.html#what-is-the-conclusion">10.3.6</a> coming up, after we check the remaining two conditions.</p>
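+<p>One way to see these repeated measures directly is to count how many courses each professor taught. The following is a minimal sketch, assuming the <code>dplyr</code> and <code>moderndive</code> packages are installed; the names <code>courses_per_prof</code> and <code>n_courses</code> are ours and are used only for illustration:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+# Count the number of courses (rows) for each anonymized professor ID
+courses_per_prof &lt;- evals %&gt;% 
+  count(prof_ID, name = &quot;n_courses&quot;)
+
+# Number of unique professors; the text above reports 94
+nrow(courses_per_prof)
+
+# Professors who appear more than once contribute dependent observations
+courses_per_prof %&gt;% 
+  filter(n_courses &gt; 1)</code></pre>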
 </div>
 <div id="normality-of-residuals" class="section level3">
 <h3><span class="header-section-number">10.3.4</span> Normality of residuals</h3>
-<p>The third condition is that the residuals should follow a <strong>N</strong>ormal distribution. Furthermore, the center of this distribution should be 0. In other words, sometimes the regression model will make positive errors: <span class="math inline">\(y - \widehat{y} &gt; 0\)</span>. Other times, the regression model will make equally negative errors: <span class="math inline">\(y - \widehat{y} &lt; 0\)</span>. However, <em>on average</em> the errors should equal 0.</p>
+<p>The third condition is that the residuals should follow a <strong>N</strong>ormal distribution. Furthermore, the center of this distribution should be 0. In other words, sometimes the regression model will make positive errors: <span class="math inline">\(y - \widehat{y} &gt; 0\)</span>. Other times, the regression model will make equally negative errors: <span class="math inline">\(y - \widehat{y} &lt; 0\)</span>. However, <em>on average</em> the errors should equal 0 and their shape should be similar to that of a bell.</p>
 <p>The simplest way to check the normality of the residuals is to look at a histogram, which we visualize in Figure <a href="10-inference-for-regression.html#fig:model1residualshist">10.4</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(regression_points, <span class="kw">aes</span>(<span class="dt">x =</span> residual)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.25</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Residual&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb426"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb426-1" data-line-number="1"><span class="kw">ggplot</span>(regression_points, <span class="kw">aes</span>(<span class="dt">x =</span> residual)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb426-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.25</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb426-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Residual&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:model1residualshist"></span>
-<img src="moderndive_files/figure-html/model1residualshist-1.png" alt="Histogram of residuals." width="\textwidth" />
+<img src="ModernDive_files/figure-html/model1residualshist-1.png" alt="Histogram of residuals." width="\textwidth" />
 <p class="caption">
 FIGURE 10.4: Histogram of residuals.
 </p>
 </div>
-<p>This histogram shows that we have more positive residuals than negative. Since the residual <span class="math inline">\(y-\widehat{y}\)</span> is positive when <span class="math inline">\(y &gt; \widehat{y}\)</span>, it seems our regression model’s fitted teaching scores <span class="math inline">\(\widehat{y}\)</span> tend to <em>underestimate</em> the true teaching scores <span class="math inline">\(y\)</span>. Furthermore, this histogram has a slight <em>left-skew</em> in that there is a tail on the left. Another way to say the residuals exhibit a <em>negative skew</em>.</p>
+<p>This histogram shows that we have more positive residuals than negative. Since the residual <span class="math inline">\(y-\widehat{y}\)</span> is positive when <span class="math inline">\(y &gt; \widehat{y}\)</span>, it seems our regression model’s fitted teaching scores <span class="math inline">\(\widehat{y}\)</span> tend to <em>underestimate</em> the true teaching scores <span class="math inline">\(y\)</span>. Furthermore, this histogram has a slight <em>left-skew</em> in that there is a tail on the left. This is another way to say the residuals exhibit a <em>negative skew</em>.</p>
 <p>Is this a problem? Again, there is a certain amount of subjectivity in the response. In the authors’ opinion, while there is a slight skew to the residuals, we feel it isn’t drastic. On the other hand, others might disagree with our assessment.</p>
 <p>Let’s present examples where the residuals clearly do and don’t follow a normal distribution in Figure <a href="10-inference-for-regression.html#fig:normal-residuals">10.5</a>. In this case of the model yielding the clearly non-normal residuals on the right, any results from an inference for regression would not be valid.</p>
 <div class="figure" style="text-align: center"><span id="fig:normal-residuals"></span>
-<img src="moderndive_files/figure-html/normal-residuals-1.png" alt="Example of clearly normal and clearly non-normal residuals." width="\textwidth" />
+<img src="ModernDive_files/figure-html/normal-residuals-1.png" alt="Example of clearly normal and clearly not normal residuals." width="\textwidth" />
 <p class="caption">
-FIGURE 10.5: Example of clearly normal and clearly non-normal residuals.
+FIGURE 10.5: Example of clearly normal and clearly not normal residuals.
 </p>
 </div>
 </div>
 <div id="equality-of-variance" class="section level3">
 <h3><span class="header-section-number">10.3.5</span> Equality of variance</h3>
-<p>The fourth and final condition is that the residuals should exhibit <strong>E</strong>qual variance for across all values of the explanatory variable <span class="math inline">\(x\)</span>. In other words, the value and spread of the residuals should not depend on the value of the explanatory variable <span class="math inline">\(x\)</span>.</p>
-<p>Recall the scatterplot in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>: we had the explanatory variable <span class="math inline">\(x\)</span> “beauty” score on the x-axis and the outcome variable<span class="math inline">\(y\)</span> teaching score on the y-axis. Instead, let’s create a scatterplot that has the same values on the x-axis, but now with the residual <span class="math inline">\(y-\widehat{y}\)</span> on the y-axis as seen in Figure <a href="10-inference-for-regression.html#fig:numxplot6">10.6</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(regression_points, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> residual)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Residual&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_hline</span>(<span class="dt">yintercept =</span> <span class="dv">0</span>, <span class="dt">col =</span> <span class="st">&quot;blue&quot;</span>, <span class="dt">size =</span> <span class="dv">1</span>)</code></pre>
+<p>The fourth and final condition is that the residuals should exhibit <strong>E</strong>qual variance across all values of the explanatory variable <span class="math inline">\(x\)</span>. In other words, the value and spread of the residuals should not depend on the value of the explanatory variable <span class="math inline">\(x\)</span>.</p>
+<p>Recall the scatterplot in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>: we had the explanatory variable <span class="math inline">\(x\)</span> of “beauty” score on the x-axis and the outcome variable <span class="math inline">\(y\)</span> of teaching score on the y-axis. Instead, let’s create a scatterplot that has the same values on the x-axis, but now with the residual <span class="math inline">\(y-\widehat{y}\)</span> on the y-axis as seen in Figure <a href="10-inference-for-regression.html#fig:numxplot6">10.6</a>.</p>
+<div class="sourceCode" id="cb427"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb427-1" data-line-number="1"><span class="kw">ggplot</span>(regression_points, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> residual)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb427-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb427-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Residual&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb427-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_hline</span>(<span class="dt">yintercept =</span> <span class="dv">0</span>, <span class="dt">col =</span> <span class="st">&quot;blue&quot;</span>, <span class="dt">size =</span> <span class="dv">1</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxplot6"></span>
-<img src="moderndive_files/figure-html/numxplot6-1.png" alt="Plot of residuals over beauty score." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxplot6-1.png" alt="Plot of residuals over beauty score." width="\textwidth" />
 <p class="caption">
 FIGURE 10.6: Plot of residuals over beauty score.
 </p>
 </div>
-<p>You can think of this plot as a modified version of the plot with the regression line in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>, but with the regression line flattened out to <span class="math inline">\(y=0\)</span>. Looking at this plot, would you say that the spread of the residuals around the blue line is constant across all values of the explanatory variable <span class="math inline">\(x\)</span> “beauty” score? This question is rather qualitative and subjective in nature, thus different people may respond with different answers. For example, some people might say that there is slightly more variation in the residuals for smaller values of <span class="math inline">\(x\)</span> than with for higher ones. However, it can be argued that there isn’t a <em>drastic</em> non-constancy.</p>
+<p>You can think of Figure <a href="10-inference-for-regression.html#fig:numxplot6">10.6</a> as a modified version of the plot with the regression line in Figure <a href="10-inference-for-regression.html#fig:regline">10.1</a>, but with the regression line flattened out to <span class="math inline">\(y=0\)</span>. Looking at this plot, would you say that the spread of the residuals around the line at <span class="math inline">\(y=0\)</span> is constant across all values of the explanatory variable <span class="math inline">\(x\)</span> of “beauty” score? This question is rather qualitative and subjective in nature, thus different people may respond with different answers. For example, some people might say that there is slightly more variation in the residuals for smaller values of <span class="math inline">\(x\)</span> than for higher ones. However, it can be argued that there isn’t a <em>drastic</em> non-constancy.</p>
 <p>In Figure <a href="10-inference-for-regression.html#fig:equal-variance-residuals">10.7</a> let’s present an example where the residuals clearly do not have equal variance across all values of the explanatory variable <span class="math inline">\(x\)</span>.</p>
 <div class="figure" style="text-align: center"><span id="fig:equal-variance-residuals"></span>
-<img src="moderndive_files/figure-html/equal-variance-residuals-1.png" alt="Example of clearly non-equal variance." width="\textwidth" />
+<img src="ModernDive_files/figure-html/equal-variance-residuals-1.png" alt="Example of clearly non-equal variance." width="\textwidth" />
 <p class="caption">
 FIGURE 10.7: Example of clearly non-equal variance.
 </p>
@@ -1149,19 +1145,19 @@ <h3><span class="header-section-number">10.3.6</span> What’s the conclusion?</
 <li><strong>E</strong>quality of variance: Yes</li>
 </ol>
 <p>So what does this mean for the results of our confidence intervals and hypothesis tests in Section <a href="10-inference-for-regression.html#regression-interp">10.2</a>?</p>
-<p>First, the <strong>I</strong>ndependence condition. The fact that there exist dependencies between different rows in <code>evals_ch6</code> must be addressed. In more advanced statistics courses, you’ll learn how to incorporate such dependencies into your regression models. One such technique is called <em>hierarchical/multilevel modeling</em>.</p>
+<p>First, the <strong>I</strong>ndependence condition. The fact that there exist dependencies between different rows in <code>evals_ch5</code> must be addressed. In more advanced statistics courses, you’ll learn how to incorporate such dependencies into your regression models. One such technique is called <em>hierarchical/multilevel modeling</em>.</p>
 <p>Second, when conditions <strong>L</strong>, <strong>N</strong>, <strong>E</strong> are not met, it often means there is a shortcoming in our model. For example, it may be the case that using only a single explanatory variable is insufficient, as we did with “beauty” score. We may need to incorporate more explanatory variables in a multiple regression model as we did in Chapter <a href="6-multiple-regression.html#multiple-regression">6</a>.</p>
-<p>In our case, the best we can do is view the results suggested by our confidence intervals and hypothesis tests as preliminary. That while a preliminary analysis suggests that there is a significant relationship between teaching and “beauty” scores, further investigation is warranted. In particular, by improving the preliminary <code>score ~ bty_avg</code> model so that the 4 conditions are met. When the 4 conditions are roughly met, then we can put more faith into our confidence intervals and p-values.</p>
-<p>The conditions for inference in regression problems are a key part of regression analysis that are of vital importance to the processes of constructing confidence intervals and conducting hypothesis tests. However, it is often the case with regression analysis in the real-world that not all the conditions are completely met. Furthermore, as you saw there is a level of subjectivity in the residual analyses to verify the <strong>L</strong>, <strong>N</strong>, and <strong>E</strong> conditions. So what can you do? We as authors advocate for transparency in communicating all results. This lets the stakeholders of any analysis know about a model’s shortcomings or whether the model is “good enough.”</p>
+<p>In our case, the best we can do is view the results suggested by our confidence intervals and hypothesis tests as preliminary. While a preliminary analysis suggests that there is a significant relationship between teaching and “beauty” scores, further investigation is warranted; in particular, by improving the preliminary <code>score ~ bty_avg</code> model so that the four conditions are met. When the four conditions are roughly met, then we can put more faith into our confidence intervals and <span class="math inline">\(p\)</span>-values.</p>
+<p>The conditions for inference in regression problems are a key part of regression analysis and are of vital importance to the processes of constructing confidence intervals and conducting hypothesis tests. However, it is often the case with regression analysis in the real world that not all the conditions are completely met. Furthermore, as you saw, there is a level of subjectivity in the residual analyses to verify the <strong>L</strong>, <strong>N</strong>, and <strong>E</strong> conditions. So what can you do? We as authors advocate for transparency in communicating all results. This lets the stakeholders of any analysis know about a model’s shortcomings or whether the model is “good enough.” So while this checking of assumptions has led to some fuzzy “it depends” results, we decided as authors to show you these scenarios to help prepare you for difficult statistical decisions you may need to make down the road.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC10.1)</strong> Continue with our regression using <code>age</code> as the explanatory variable and teaching <code>score</code> as the outcome variable.</p>
+<p><strong>(LC10.1)</strong> Continuing with our regression using <code>age</code> as the explanatory variable and teaching <code>score</code> as the outcome variable:</p>
 <ul>
 <li>Use the <code>get_regression_points()</code> function to get the observed values, fitted values, and residuals for all 463 instructors.</li>
-<li>Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern.</li>
+<li>Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern, but comment on what you find here.</li>
 </ul>
 <div class="learncheck">
 
@@ -1174,24 +1170,23 @@ <h2><span class="header-section-number">10.4</span> Simulation-based inference f
 <p>In this section, we’ll use the simulation-based methods you previously learned in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> to recreate the values in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. In particular, we’ll use the <code>infer</code> package workflow to</p>
 <ul>
 <li>Construct a 95% confidence interval for the population slope <span class="math inline">\(\beta_1\)</span> using bootstrap resampling with replacement. We did this previously in Sections <a href="8-confidence-intervals.html#bootstrap-process">8.4</a> with the <code>pennies</code> data and <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> with the <code>mythbusters_yawn</code> data.</li>
-<li>Conduct a hypothesis test of <span class="math inline">\(H_0: \beta_1 = 0\)</span> vs <span class="math inline">\(H_A: \beta_1 \neq 1\)</span> using a permutation test. We did this previously in Sections <a href="9-hypothesis-testing.html#ht-infer">9.3</a> with the <code>promotions</code> data and <a href="9-hypothesis-testing.html#ht-case-study">9.5</a> with the <code>movies_sample</code> IMDb data.</li>
+<li>Conduct a hypothesis test of <span class="math inline">\(H_0: \beta_1 = 0\)</span> versus <span class="math inline">\(H_A: \beta_1 \neq 0\)</span> using a permutation test. We did this previously in Sections <a href="9-hypothesis-testing.html#ht-infer">9.3</a> with the <code>promotions</code> data and <a href="9-hypothesis-testing.html#ht-case-study">9.5</a> with the <code>movies_sample</code> IMDb data.</li>
 </ul>
 <div id="confidence-interval-for-slope" class="section level3">
 <h3><span class="header-section-number">10.4.1</span> Confidence interval for slope</h3>
 <p>We’ll construct a 95% confidence interval for <span class="math inline">\(\beta_1\)</span> using the <code>infer</code> workflow outlined in Subsection <a href="8-confidence-intervals.html#infer-workflow">8.4.2</a>. Specifically, we’ll first construct the bootstrap distribution for the fitted slope <span class="math inline">\(b_1\)</span> using our single sample of 463 courses:</p>
 <ol style="list-style-type: decimal">
-<li><code>specify()</code> the variables of interest in <code>evals_ch6</code> with the formula: <code>score ~ bty_avg</code>.</li>
+<li><code>specify()</code> the variables of interest in <code>evals_ch5</code> with the formula: <code>score ~ bty_avg</code>.</li>
 <li><code>generate()</code> replicates by using <code>bootstrap</code> resampling with replacement from the original sample of 463 courses. We generate <code>reps = 1000</code> replicates using <code>type = &quot;bootstrap&quot;</code>.</li>
 <li><code>calculate()</code> the summary statistic of interest: the fitted <code>slope</code> <span class="math inline">\(b_1\)</span>.</li>
 </ol>
-<p>Then using this bootstrap distribution we’ll construct the 95% confidence interval using the percentile method and (if appropriate) the standard error method as well. It is important to note in this case that the bootstrapping with replacement is done <em>row-by-row</em>. Thus, the original pairs of <code>score</code> and <code>bty_avg</code> values are always kept together, but different pairs of <code>score</code> and <code>bty_avg</code> values may be resampled multiple times</p>
-<p>The resulting confidence interval will denote a range of plausible values for the unknown population slope <span class="math inline">\(\beta_1\)</span> quantifying the relationship between teaching and “beauty” scores for <em>all</em> professors at UT Austin.</p>
+<p>Using this bootstrap distribution, we’ll construct the 95% confidence interval using the percentile method and (if appropriate) the standard error method as well. It is important to note in this case that the bootstrapping with replacement is done <em>row-by-row</em>. Thus, the original pairs of <code>score</code> and <code>bty_avg</code> values are always kept together, but different pairs of <code>score</code> and <code>bty_avg</code> values may be resampled multiple times. The resulting confidence interval will denote a range of plausible values for the unknown population slope <span class="math inline">\(\beta_1\)</span> quantifying the relationship between teaching and “beauty” scores for <em>all</em> professors at UT Austin.</p>
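+<p>To make the row-by-row resampling concrete, a single bootstrap resample and its fitted slope could be sketched as follows. This is only an illustration, assuming the <code>dplyr</code> and <code>moderndive</code> packages are installed; it uses the <code>evals</code> data frame directly, and <code>one_resample</code> is a hypothetical name:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+# Draw 463 rows with replacement; each row keeps its score and bty_avg together
+one_resample &lt;- evals %&gt;% 
+  rep_sample_n(size = nrow(evals), replace = TRUE)
+
+# Fitted slope b1 for this one resample
+coef(lm(score ~ bty_avg, data = one_resample))[&quot;bty_avg&quot;]</code></pre>
+<p>Repeating this many times and collecting the slopes is what the <code>infer</code> pipeline automates.</p>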
 <p>Let’s first construct the bootstrap distribution for the fitted slope <span class="math inline">\(b_1\)</span>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distn_slope &lt;-<span class="st"> </span>evals_ch6 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>bty_avg) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;slope&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distn_slope</code></pre>
+<div class="sourceCode" id="cb428"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb428-1" data-line-number="1">bootstrap_distn_slope &lt;-<span class="st"> </span>evals_ch5 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb428-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>bty_avg) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb428-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb428-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;slope&quot;</span>)</a>
+<a class="sourceLine" id="cb428-5" data-line-number="5">bootstrap_distn_slope</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate      stat
        &lt;int&gt;     &lt;dbl&gt;
@@ -1206,21 +1201,21 @@ <h3><span class="header-section-number">10.4.1</span> Confidence interval for sl
  9         9 0.0796269
 10        10 0.0761299
 # … with 990 more rows</code></pre>
-<p>Observe how we have 1000 values of the bootstrapped slope <span class="math inline">\(b_1\)</span> in the <code>stat</code> column. Let’s visualize these resulting 1000 bootstrapped values in Figure <a href="10-inference-for-regression.html#fig:bootstrap-distribution-slope">10.8</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distn_slope)</code></pre>
+<p>Observe how we have 1000 values of the bootstrapped slope <span class="math inline">\(b_1\)</span> in the <code>stat</code> column. Let’s visualize the 1000 bootstrapped values in Figure <a href="10-inference-for-regression.html#fig:bootstrap-distribution-slope">10.8</a>.</p>
+<div class="sourceCode" id="cb430"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb430-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distn_slope)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-slope"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-slope-1.png" alt="Bootstrap distribution of slope." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-slope-1.png" alt="Bootstrap distribution of slope." width="\textwidth" />
 <p class="caption">
 FIGURE 10.8: Bootstrap distribution of slope.
 </p>
 </div>
-<p>Observe how the bootstrap distribution is roughly bell-shaped. Recall from Section <a href="8-confidence-intervals.html#bootstrap-vs-sampling">8.7.1</a> that shape of the bootstrap distribution of <span class="math inline">\(b_1\)</span> closely approximates the shape of the sampling distribution of <span class="math inline">\(b_1\)</span>.</p>
+<p>Observe how the bootstrap distribution is roughly bell-shaped. Recall from Subsection <a href="8-confidence-intervals.html#bootstrap-vs-sampling">8.7.1</a> that the shape of the bootstrap distribution of <span class="math inline">\(b_1\)</span> closely approximates the shape of the sampling distribution of <span class="math inline">\(b_1\)</span>.</p>
 <div id="percentile-method-1" class="section level4 unnumbered">
 <h4>Percentile-method</h4>
-<p>First, let’s compute the 95% confidence interval for <span class="math inline">\(\beta_1\)</span> using the percentile method, in other words by identifying the 2.5<sup>th</sup> and 97.5<sup>th</sup> percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped.</p>
-<pre class="sourceCode r"><code class="sourceCode r">percentile_ci &lt;-<span class="st"> </span>bootstrap_distn_slope <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>, <span class="dt">level =</span> <span class="fl">0.95</span>)
-percentile_ci</code></pre>
+<p>First, let’s compute the 95% confidence interval for <span class="math inline">\(\beta_1\)</span> using the percentile method. We’ll do so by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped.</p>
+<div class="sourceCode" id="cb431"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb431-1" data-line-number="1">percentile_ci &lt;-<span class="st"> </span>bootstrap_distn_slope <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb431-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>, <span class="dt">level =</span> <span class="fl">0.95</span>)</a>
+<a class="sourceLine" id="cb431-3" data-line-number="3">percentile_ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
      `2.5%`   `97.5%`
       &lt;dbl&gt;     &lt;dbl&gt;
@@ -1230,37 +1225,37 @@ <h4>Percentile-method</h4>
 <div id="standard-error-method" class="section level4 unnumbered">
 <h4>Standard error method</h4>
 <p>Since the bootstrap distribution in Figure <a href="10-inference-for-regression.html#fig:bootstrap-distribution-slope">10.8</a> appears to be roughly bell-shaped, we can also construct a 95% confidence interval for <span class="math inline">\(\beta_1\)</span> using the standard error method.</p>
-<p>In order to do this, we need to first compute fitted slope <span class="math inline">\(b_1\)</span>, which will act as the center of our standard error-based confidence interval. While we saw in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> that this was <span class="math inline">\(b_1\)</span> = 0.067, we can also use the <code>infer</code> pipeline with the <code>generate()</code> step removed:</p>
-<pre class="sourceCode r"><code class="sourceCode r">observed_slope &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(score <span class="op">~</span><span class="st"> </span>bty_avg) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;slope&quot;</span>)
-observed_slope</code></pre>
+<p>In order to do this, we need to first compute the fitted slope <span class="math inline">\(b_1\)</span>, which will act as the center of our standard error-based confidence interval. While we saw in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> that this was <span class="math inline">\(b_1\)</span> = 0.067, we can also use the <code>infer</code> pipeline with the <code>generate()</code> step removed to calculate it:</p>
+<div class="sourceCode" id="cb433"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb433-1" data-line-number="1">observed_slope &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb433-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(score <span class="op">~</span><span class="st"> </span>bty_avg) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb433-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;slope&quot;</span>)</a>
+<a class="sourceLine" id="cb433-4" data-line-number="4">observed_slope</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
        stat
       &lt;dbl&gt;
 1 0.0666370</code></pre>
 <p>We then use the <code>get_ci()</code> function with <code>level = 0.95</code> to compute the 95% confidence interval for <span class="math inline">\(\beta_1\)</span>. Note that setting the <code>point_estimate</code> argument to the <code>observed_slope</code> of 0.067 sets the center of the confidence interval.</p>
-<pre class="sourceCode r"><code class="sourceCode r">se_ci &lt;-<span class="st"> </span>bootstrap_distn_slope <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_ci</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, <span class="dt">point_estimate =</span> observed_slope)
-se_ci</code></pre>
+<div class="sourceCode" id="cb435"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb435-1" data-line-number="1">se_ci &lt;-<span class="st"> </span>bootstrap_distn_slope <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb435-2" data-line-number="2"><span class="st">  </span><span class="kw">get_ci</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, <span class="dt">point_estimate =</span> observed_slope)</a>
+<a class="sourceLine" id="cb435-3" data-line-number="3">se_ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
       lower     upper
       &lt;dbl&gt;     &lt;dbl&gt;
 1 0.0333767 0.0998974</code></pre>
-<p>The resulting standard error-based 95% confidence interval for <span class="math inline">\(\beta_1\)</span> of (0.033, 0.1) is however slightly different than the confidence interval in the regression Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> of (0.035, 0.099).</p>
+<p>The resulting standard error-based 95% confidence interval for <span class="math inline">\(\beta_1\)</span> of <span class="math inline">\((0.033, 0.1)\)</span> is slightly different from the theory-based confidence interval of <span class="math inline">\((0.035, 0.099)\)</span> in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>.</p>
 </div>
 <div id="comparing-all-three" class="section level4 unnumbered">
 <h4>Comparing all three</h4>
 <p>Let’s compare all three confidence intervals in Figure <a href="10-inference-for-regression.html#fig:bootstrap-distribution-slope-CI">10.9</a>, where the percentile-based confidence interval is marked with solid lines, the standard error based confidence interval is marked with dashed lines, and the theory-based confidence interval (0.035, 0.099) from the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a> is marked with dotted lines.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distn_slope) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci, <span class="dt">fill =</span> <span class="ot">NULL</span>, 
-                            <span class="dt">linetype =</span> <span class="st">&quot;solid&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;black&quot;</span>) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> se_ci, <span class="dt">fill =</span> <span class="ot">NULL</span>, 
-                            <span class="dt">linetype =</span> <span class="st">&quot;dashed&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;black&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> <span class="kw">c</span>(<span class="fl">0.035</span>, <span class="fl">0.099</span>), <span class="dt">fill =</span> <span class="ot">NULL</span>, 
-                            <span class="dt">linetype =</span> <span class="st">&quot;dotted&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;black&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb437"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb437-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distn_slope) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb437-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci, <span class="dt">fill =</span> <span class="ot">NULL</span>, </a>
+<a class="sourceLine" id="cb437-3" data-line-number="3">                            <span class="dt">linetype =</span> <span class="st">&quot;solid&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;grey90&quot;</span>) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb437-4" data-line-number="4"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> se_ci, <span class="dt">fill =</span> <span class="ot">NULL</span>, </a>
+<a class="sourceLine" id="cb437-5" data-line-number="5">                            <span class="dt">linetype =</span> <span class="st">&quot;dashed&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;grey60&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb437-6" data-line-number="6"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> <span class="kw">c</span>(<span class="fl">0.035</span>, <span class="fl">0.099</span>), <span class="dt">fill =</span> <span class="ot">NULL</span>, </a>
+<a class="sourceLine" id="cb437-7" data-line-number="7">                            <span class="dt">linetype =</span> <span class="st">&quot;dotted&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;black&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-slope-CI"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-slope-CI-1.png" alt="Comparing three confidence intervals for the slope." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-slope-CI-1.png" alt="Comparing three confidence intervals for the slope." width="\textwidth" />
 <p class="caption">
 FIGURE 10.9: Comparing three confidence intervals for the slope.
 </p>
@@ -1270,47 +1265,44 @@ <h4>Comparing all three</h4>
 </div>
 <div id="hypothesis-test-for-slope" class="section level3">
 <h3><span class="header-section-number">10.4.2</span> Hypothesis test for slope</h3>
-<p>Let’s now conduct a hypothesis test of <span class="math inline">\(H_0: \beta_1 = 0\)</span> vs <span class="math inline">\(H_A: \beta_1 \neq 1\)</span>. We will use the <code>infer</code> package, which follows the hypothesis testing paradigm in the “There is Only One Test” diagram in Figure <a href="9-hypothesis-testing.html#fig:htdowney">9.14</a>.</p>
+<p>Let’s now conduct a hypothesis test of <span class="math inline">\(H_0: \beta_1 = 0\)</span> vs. <span class="math inline">\(H_A: \beta_1 \neq 0\)</span>. We will use the <code>infer</code> package, which follows the hypothesis testing paradigm in the “There is only one test” diagram in Figure <a href="9-hypothesis-testing.html#fig:htdowney">9.14</a>.</p>
 <p>Let’s first think about what it means for <span class="math inline">\(\beta_1\)</span> to be zero as assumed in the null hypothesis <span class="math inline">\(H_0\)</span>. Recall we said if <span class="math inline">\(\beta_1 = 0\)</span>, then this is saying there is no relationship between the teaching and “beauty” scores. Thus assuming this particular null hypothesis <span class="math inline">\(H_0\)</span> means that in our “hypothesized universe” there is no relationship between <code>score</code> and <code>bty_avg</code>. We can therefore shuffle/permute the <code>bty_avg</code> variable to no consequence.</p>
-<p>We construct the null distribution of the fitted slope <span class="math inline">\(b_1\)</span> by following the steps. Recall from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> on terminology, notation, and definitions related to hypothesis testing where we defined the <em>null distribution</em>: the sampling distribution of our test statistic <span class="math inline">\(b_1\)</span> assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true.</p>
+<p>We construct the null distribution of the fitted slope <span class="math inline">\(b_1\)</span> by performing the steps that follow. Recall that in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>, on terminology, notation, and definitions related to hypothesis testing, we defined the <em>null distribution</em> as the sampling distribution of the test statistic, here <span class="math inline">\(b_1\)</span>, assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true.</p>
 <ol style="list-style-type: decimal">
-<li><code>specify()</code> the variables of interest in <code>evals_ch6</code> with the formula: <code>score ~ bty_avg</code>.</li>
+<li><code>specify()</code> the variables of interest in <code>evals_ch5</code> with the formula: <code>score ~ bty_avg</code>.</li>
 <li><code>hypothesize()</code> the null hypothesis of <code>independence</code>. Recall from Section <a href="9-hypothesis-testing.html#ht-infer">9.3</a> that this is an additional step that needs to be added for hypothesis testing.</li>
-<li><code>generate()</code> replicates by permuting/shuffling the explanatory variable <code>bty_avg</code> from the original sample of 463 courses. We generate <code>reps = 1000</code> replicates using <code>type = &quot;permute&quot;</code>.</li>
+<li><code>generate()</code> replicates by permuting/shuffling values from the original sample of 463 courses. We generate <code>reps = 1000</code> replicates using <code>type = &quot;permute&quot;</code> here.</li>
 <li><code>calculate()</code> the test statistic of interest: the fitted <code>slope</code> <span class="math inline">\(b_1\)</span>.</li>
 </ol>
 <p>In this case, we <code>permute</code> the values of <code>bty_avg</code> across the values of <code>score</code> 1000 times. We can do this shuffling/permuting since we assumed a “hypothesized universe” of no relationship between these two variables. Then we <code>calculate</code> the <code>&quot;slope&quot;</code> coefficient for each of these 1000 <code>generate</code>d samples.</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_slope &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(score <span class="op">~</span><span class="st"> </span>bty_avg) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;slope&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb438"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb438-1" data-line-number="1">null_distn_slope &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb438-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(score <span class="op">~</span><span class="st"> </span>bty_avg) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb438-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb438-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb438-5" data-line-number="5"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;slope&quot;</span>)</a></code></pre></div>
 <p>Observe the resulting null distribution for the fitted slope <span class="math inline">\(b_1\)</span> in Figure <a href="10-inference-for-regression.html#fig:null-distribution-slope">10.10</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distn_slope)</code></pre>
 <div class="figure" style="text-align: center"><span id="fig:null-distribution-slope"></span>
-<img src="moderndive_files/figure-html/null-distribution-slope-1.png" alt="Null distribution." width="\textwidth" />
+<img src="ModernDive_files/figure-html/null-distribution-slope-1.png" alt="Null distribution of slopes." width="\textwidth" />
 <p class="caption">
-FIGURE 10.10: Null distribution.
+FIGURE 10.10: Null distribution of slopes.
 </p>
 </div>
-<p>Notice how it is centered at <span class="math inline">\(b_1\)</span> = 0. This is because in our hypothesized universe, there is no relationship between <code>score</code> and <code>bty_avg</code>. In other words <span class="math inline">\(\beta_1\)</span> = 0. Thus the most typical fitted slope <span class="math inline">\(b_1\)</span> we observe across our simulations is 0. Observe furthermore how there is variation around this central value of 0.</p>
-<p>Let’s visualize the p-value in the null distribution by comparing it to the observed test statistic of <span class="math inline">\(b_1\)</span> = 0.067 in Figure <a href="10-inference-for-regression.html#fig:p-value-slope">10.11</a>. We’ll do this by adding a <code>shade_p_value()</code> layer to the previous <code>visualize()</code> code.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distn_slope) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> observed_slope, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>Notice how it is centered at <span class="math inline">\(b_1\)</span> = 0. This is because in our hypothesized universe, there is no relationship between <code>score</code> and <code>bty_avg</code> and so <span class="math inline">\(\beta_1 = 0\)</span>. Thus, the most typical fitted slope <span class="math inline">\(b_1\)</span> we observe across our simulations is 0. Observe, furthermore, how there is variation around this central value of 0.</p>
+<p>Let’s visualize the <span class="math inline">\(p\)</span>-value in the null distribution by comparing it to the observed test statistic of <span class="math inline">\(b_1\)</span> = 0.067 in Figure <a href="10-inference-for-regression.html#fig:p-value-slope">10.11</a>. We’ll do this by adding a <code>shade_p_value()</code> layer to the previous <code>visualize()</code> code.</p>
 <div class="figure" style="text-align: center"><span id="fig:p-value-slope"></span>
-<img src="moderndive_files/figure-html/p-value-slope-1.png" alt="Null distribution and p-value." width="\textwidth" />
+<img src="ModernDive_files/figure-html/p-value-slope-1.png" alt="Null distribution and p-value." width="\textwidth" />
 <p class="caption">
-FIGURE 10.11: Null distribution and p-value.
+FIGURE 10.11: Null distribution and <span class="math inline">\(p\)</span>-value.
 </p>
 </div>
-<p>Since the observed fitted slope 0.067 falls far to the right of this null distribution and thus the shaded region doesn’t overlap it, we’ll have a <span class="math inline">\(p\)</span>-value of 0. For completeness’s sake, however, let’s compute the numerical value of the p-value anyways using the <code>get_p_value()</code> function. It takes the same inputs as the <code>shade_p_value()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_slope <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> observed_slope, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>Since the observed fitted slope of 0.067 falls far to the right of this null distribution, and thus the shaded region doesn’t overlap it, we’ll have a <span class="math inline">\(p\)</span>-value of 0. For completeness, however, let’s compute the numerical value of the <span class="math inline">\(p\)</span>-value anyway using the <code>get_p_value()</code> function. Recall that it takes the same inputs as the <code>shade_p_value()</code> function:</p>
+<div class="sourceCode" id="cb439"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb439-1" data-line-number="1">null_distn_slope <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb439-2" data-line-number="2"><span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> observed_slope, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
 1       0</code></pre>
-<p>This matches the p-value of 0 in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. We therefore reject the null hypothesis <span class="math inline">\(H_0: \beta_1 = 0\)</span> in favor of the alternative hypothesis <span class="math inline">\(H_A: \beta_1 \neq 1\)</span>. We thus have evidence that suggests there is a significant relationship between teaching and “beauty” scores for <em>all</em> instructors at UT Austin.</p>
+<p>This matches the <span class="math inline">\(p\)</span>-value of 0 in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. We therefore reject the null hypothesis <span class="math inline">\(H_0: \beta_1 = 0\)</span> in favor of the alternative hypothesis <span class="math inline">\(H_A: \beta_1 \neq 0\)</span>. We thus have evidence that suggests there is a significant relationship between teaching and “beauty” scores for <em>all</em> instructors at UT Austin.</p>
 <p>When the conditions for inference for regression are met and the null distribution has a bell shape, we are likely to see similar results between the simulation-based results we just demonstrated and the theory-based results shown in the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>.</p>
 <div class="learncheck">
 <p>
@@ -1339,14 +1331,13 @@ <h2><span class="header-section-number">10.5</span> Conclusion</h2>
 -->
 <div id="theory-regression" class="section level3">
 <h3><span class="header-section-number">10.5.1</span> Theory-based inference for regression</h3>
-<p>Recall in Section <a href="10-inference-for-regression.html#regression-table-computation">10.2.5</a> when we interpreted the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>, we mentioned that R does not compute its values using simulation-based methods for constructing confidence intervals and conducting hypothesis tests as we did in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> using the <code>infer</code> package.</p>
-<p>Rather, R uses a theory-based approach using mathematical formulas, much like the theory-based confidence intervals you saw in Subsection <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> and the theory-based hypothesis tests you saw in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>. These formulas were derived in a time when computers didn’t exist, so it would’ve been impossible to run the extensive computer simulations.</p>
+<p>Recall in Subsection <a href="10-inference-for-regression.html#regression-table-computation">10.2.5</a> when we interpreted the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>, we mentioned that R does not compute its values using simulation-based methods for constructing confidence intervals and conducting hypothesis tests as we did in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> using the <code>infer</code> package. Rather, R uses a theory-based approach using mathematical formulas, much like the theory-based confidence intervals you saw in Subsection <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> and the theory-based hypothesis tests you saw in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>. These formulas were derived in a time when computers didn’t exist, so it would’ve been incredibly labor intensive to run extensive simulations.</p>
 <p>In particular, there is a formula for the <em>standard error</em> of the fitted slope <span class="math inline">\(b_1\)</span>:</p>
 <p><span class="math display">\[\text{SE}_{b_1} = \dfrac{\dfrac{s_y}{s_x} \cdot \sqrt{1-r^2}}{\sqrt{n-2}}\]</span></p>
 <!-- Really like this breakdown: https://stats.stackexchange.com/questions/342632/how-to-understand-se-of-regression-slope-equation -->
-<p>As with many formulas in statistics, there’s a lot going on here, so let’s first break down what each symbol represents. First <span class="math inline">\(s_x\)</span> and <span class="math inline">\(s_y\)</span> are the <em>sample standard deviations</em> of the explanatory variable <code>bty_avg</code> and the response variable <code>score</code> respectively. Second, <span class="math inline">\(r\)</span> is the sample <em>correlation coefficient</em> between <code>score</code> and <code>bty_avg</code>. This was computed as 0.187 in Chapter <a href="5-regression.html#regression">5</a>. Lastly, <span class="math inline">\(n\)</span> is the number of pairs of points in the <code>evals_ch6</code> data frame, here 463.</p>
-<p>To put this formula into words, the standard error of <span class="math inline">\(b_1\)</span> depends on the relationship between the variability of the response variable and the variability of the explanatory variable as measured in the <span class="math inline">\(s_y / s_x\)</span> term. Next it looks into the relationship of how the two variables relate to each other in the <span class="math inline">\(\sqrt{1-r^2}\)</span> term.</p>
-<p>However, the most important observation to make in the previous formula is that there is a <span class="math inline">\(n - 2\)</span> in the denominator. In other words, as the sample size <span class="math inline">\(n\)</span> increases, the standard error <span class="math inline">\(\text{SE}_{b_1}\)</span> decreases. Just as we demonstrated in Section <a href="7-sampling.html#moral-of-the-story">7.3.3</a> when we used shovels with <span class="math inline">\(n\)</span> = 25, 50, and 100, the amount of sampling variation of the fitted slope <span class="math inline">\(b_1\)</span> will depend on the sample size <span class="math inline">\(n\)</span>. In particular, as the sample size increases, both the sampling and bootstrap distributions narrows. In other words, the standard error <span class="math inline">\(\text{SE}_{b_1}\)</span> decreases. Hence our estimates <span class="math inline">\(b_1\)</span> of the true population slope <span class="math inline">\(\beta_1\)</span> get more and more <em>precise</em>.</p>
+<p>As with many formulas in statistics, there’s a lot going on here, so let’s first break down what each symbol represents. First, <span class="math inline">\(s_x\)</span> and <span class="math inline">\(s_y\)</span> are the <em>sample standard deviations</em> of the explanatory variable <code>bty_avg</code> and the response variable <code>score</code>, respectively. Second, <span class="math inline">\(r\)</span> is the sample <em>correlation coefficient</em> between <code>score</code> and <code>bty_avg</code>. This was computed as 0.187 in Chapter <a href="5-regression.html#regression">5</a>. Lastly, <span class="math inline">\(n\)</span> is the number of pairs of points in the <code>evals_ch5</code> data frame, here 463.</p>
+<p>To put this formula into words, the standard error of <span class="math inline">\(b_1\)</span> depends on the relationship between the variability of the response variable and the variability of the explanatory variable as measured in the <span class="math inline">\(s_y / s_x\)</span> term. Next, it looks into how the two variables relate to each other in the <span class="math inline">\(\sqrt{1-r^2}\)</span> term.</p>
+<p>However, the most important observation to make in the previous formula is that there is an <span class="math inline">\(n - 2\)</span> in the denominator. In other words, as the sample size <span class="math inline">\(n\)</span> increases, the standard error <span class="math inline">\(\text{SE}_{b_1}\)</span> decreases. Just as we demonstrated in Subsection <a href="7-sampling.html#moral-of-the-story">7.3.3</a> when we used shovels with <span class="math inline">\(n\)</span> = 25, 50, and 100 slots, the amount of sampling variation of the fitted slope <span class="math inline">\(b_1\)</span> will depend on the sample size <span class="math inline">\(n\)</span>. In particular, as the sample size increases, both the sampling and bootstrap distributions narrow and the standard error <span class="math inline">\(\text{SE}_{b_1}\)</span> decreases. Hence, our estimates <span class="math inline">\(b_1\)</span> of the true population slope <span class="math inline">\(\beta_1\)</span> get more and more <em>precise</em>.</p>
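+<p>As a minimal sketch of this formula in action, and assuming the <code>evals</code> data frame from the <code>moderndive</code> package along with the <code>dplyr</code> package, the following code computes each ingredient and then combines them. The column names <code>s_x</code>, <code>s_y</code>, <code>r</code>, <code>n</code>, and <code>se_b1</code> are our own illustrative choices. The resulting <code>se_b1</code> should agree, up to rounding, with the <code>std_error</code> value for <code>bty_avg</code> in the regression table.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+# Compute each ingredient of the standard error formula, then combine them
+evals %&gt;%
+  summarize(
+    s_x = sd(bty_avg),        # sample standard deviation of the explanatory variable
+    s_y = sd(score),          # sample standard deviation of the response variable
+    r = cor(score, bty_avg),  # sample correlation coefficient
+    n = n()                   # number of pairs of points
+  ) %&gt;%
+  mutate(se_b1 = (s_y / s_x) * sqrt(1 - r^2) / sqrt(n - 2))</code></pre>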
 <p>R then uses this formula for the standard error of <span class="math inline">\(b_1\)</span> in the third column of the regression table and subsequently to construct 95% confidence intervals. But what about the hypothesis test? Much like with our theory-based hypothesis test in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>, R uses the following <em><span class="math inline">\(t\)</span>-statistic</em> as the test statistic for hypothesis testing:</p>
 <p><span class="math display">\[
 t = \dfrac{ b_1 - \beta_1}{ \text{SE}_{b_1}}
@@ -1355,15 +1346,13 @@ <h3><span class="header-section-number">10.5.1</span> Theory-based inference for
 <p><span class="math display">\[
 t = \dfrac{ b_1 - 0}{ \text{SE}_{b_1}} = \dfrac{ b_1 }{ \text{SE}_{b_1}}
 \]</span></p>
-<p>What are the values of <span class="math inline">\(b_1\)</span> and <span class="math inline">\(\text{SE}_{b_1}\)</span>? They are in the <code>estimate</code> and <code>std_error</code> column of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. Thus the value of 4.09 in the table is computed as 0.067/0.016 = 4.188. Note there is a slight difference due to rounding error.</p>
-<p>Lastly, to compute the p-value, we need to compare to observed test statistic of 4.09 to the appropriate null distribution. Recall from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>, that a null distribution is the sampling distribution of the test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>. Much like in our theory-based hypothesis test in Section <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>, it can be mathematically proven that this distribution is a <span class="math inline">\(t\)</span>-distribution with degrees of freedom equal to <span class="math inline">\(df\)</span> = n - 2 = 463 - 2 = 461.</p>
-<p>Don’t worry if you’re feeling a little overwhelmed at this point. There is a lot of background theory to understand before you can fully make sense of the equations for theory-based methods. That being said, theory-based methods and simulation-based methods for constructing confidence intervals and conducting hypothesis tests often yield consistent results.</p>
-<p>In our opinion, two large benefits of simulation-based methods over theory-based is that 1) they are easier for people new to statistical inference to understand and 2) they also work in situations where theory-based methods and mathematical formulas don’t exist.</p>
+<p>What are the values of <span class="math inline">\(b_1\)</span> and <span class="math inline">\(\text{SE}_{b_1}\)</span>? They are in the <code>estimate</code> and <code>std_error</code> columns of the regression table in Table <a href="10-inference-for-regression.html#tab:regtable-11">10.1</a>. The value of 4.09 in the table is the ratio of these two values. Note that dividing the rounded values shown in the table gives 0.067/0.016 = 4.188 instead; the difference is due to rounding of the displayed estimate and standard error.</p>
+<p>Lastly, to compute the <span class="math inline">\(p\)</span>-value, we need to compare the observed test statistic of 4.09 to the appropriate null distribution. Recall from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> that a null distribution is the sampling distribution of the test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>. Much like in our theory-based hypothesis test in Subsection <a href="9-hypothesis-testing.html#theory-hypo">9.6.1</a>, it can be mathematically proven that this distribution is a <span class="math inline">\(t\)</span>-distribution with degrees of freedom equal to <span class="math inline">\(df = n - 2 = 463 - 2 = 461\)</span>.</p>
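+<p>As a rough sketch of this last step, and assuming only base R, the two-sided <span class="math inline">\(p\)</span>-value can be approximated from the rounded test statistic of 4.09 using the <code>pt()</code> function. The object names <code>t_stat</code>, <code>df</code>, and <code>p_value</code> are our own illustrative choices:</p>
+<pre class="sourceCode r"><code class="sourceCode r">t_stat &lt;- 4.09   # observed test statistic from the regression table (rounded)
+df &lt;- 463 - 2    # degrees of freedom: n - 2
+
+# Two-sided p-value: twice the area to the right of |t_stat| under the t-distribution
+p_value &lt;- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
+p_value</code></pre>
+<p>The result is a very small positive number, which is consistent with the value of 0 displayed after rounding in the regression table.</p>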
+<p>Don’t worry if you’re feeling a little overwhelmed at this point. There is a lot of background theory to understand before you can fully make sense of the equations for theory-based methods. That being said, theory-based methods and simulation-based methods for constructing confidence intervals and conducting hypothesis tests often yield consistent results. As mentioned before, in our opinion, two large benefits of simulation-based methods over theory-based are that (1) they are easier for people new to statistical inference to understand, and (2) they also work in situations where theory-based methods and mathematical formulas don’t exist.</p>
 </div>
 <div id="summary-of-statistical-inference" class="section level3">
 <h3><span class="header-section-number">10.5.2</span> Summary of statistical inference</h3>
-<p>We’ve now completed the last two sampling scenarios first introduced in the “Scenarios of sampling for inference” table in Subsection <a href="7-sampling.html#sampling-conclusion-table">7.5.1</a>, which we re-display in Table <a href="10-inference-for-regression.html#tab:table-ch11">10.4</a>.</p>
-<p>Armed with the regression modeling techniques you learned in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>, your understanding of sampling for inference in Chapter <a href="7-sampling.html#sampling">7</a>, and the tools for statistical inference like confidence intervals and hypothesis tests in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>, you’re now equipped to study the significance of relationships between variables in a wide array of data!</p>
+<p>We’ve finished the last two scenarios from the “Scenarios of sampling for inference” table in Subsection <a href="7-sampling.html#sampling-conclusion-table">7.5.1</a>, which we re-display in Table <a href="10-inference-for-regression.html#tab:table-ch11">10.4</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:table-ch11">TABLE 10.4: </span>Scenarios of sampling for inference
@@ -1383,7 +1372,7 @@ <h3><span class="header-section-number">10.5.2</span> Summary of statistical inf
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -1392,16 +1381,16 @@ <h3><span class="header-section-number">10.5.2</span> Summary of statistical inf
 <td style="text-align:right;width: 0.5in; ">
 1
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Population proportion
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(p\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Sample proportion
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\widehat{p}\)</span>
 </td>
 </tr>
@@ -1409,16 +1398,16 @@ <h3><span class="header-section-number">10.5.2</span> Summary of statistical inf
 <td style="text-align:right;width: 0.5in; ">
 2
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Population mean
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\mu\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Sample mean
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\overline{x}\)</span> or <span class="math inline">\(\widehat{\mu}\)</span>
 </td>
 </tr>
@@ -1426,16 +1415,16 @@ <h3><span class="header-section-number">10.5.2</span> Summary of statistical inf
 <td style="text-align:right;width: 0.5in; ">
 3
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Difference in population proportions
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(p_1 - p_2\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Difference in sample proportions
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\widehat{p}_1 - \widehat{p}_2\)</span>
 </td>
 </tr>
@@ -1443,16 +1432,16 @@ <h3><span class="header-section-number">10.5.2</span> Summary of statistical inf
 <td style="text-align:right;width: 0.5in; ">
 4
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Difference in population means
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\mu_1 - \mu_2\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Difference in sample means
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\overline{x}_1 - \overline{x}_2\)</span>
 </td>
 </tr>
@@ -1460,38 +1449,22 @@ <h3><span class="header-section-number">10.5.2</span> Summary of statistical inf
 <td style="text-align:right;width: 0.5in; ">
 5
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Population regression slope
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\beta_1\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Fitted regression slope
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(b_1\)</span> or <span class="math inline">\(\widehat{\beta}_1\)</span>
 </td>
 </tr>
-<tr>
-<td style="text-align:right;width: 0.5in; ">
-6
-</td>
-<td style="text-align:left;width: 0.7in; ">
-Population regression intercept
-</td>
-<td style="text-align:left;width: 1in; ">
-<span class="math inline">\(\beta_0\)</span>
-</td>
-<td style="text-align:left;width: 1.1in; ">
-Fitted regression intercept
-</td>
-<td style="text-align:left;width: 1in; ">
-<span class="math inline">\(b_0\)</span> or <span class="math inline">\(\widehat{\beta}_0\)</span>
-</td>
-</tr>
 </tbody>
 </table>
+<p>Armed with the regression modeling techniques you learned in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>, your understanding of sampling for inference in Chapter <a href="7-sampling.html#sampling">7</a>, and the tools for statistical inference like confidence intervals and hypothesis tests in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>, you’re now equipped to study the significance of relationships between variables in a wide array of data! Many of the ideas presented here can be extended into multiple regression and other more advanced modeling techniques.</p>
 </div>
 <div id="additional-resources-8" class="section level3">
 <h3><span class="header-section-number">10.5.3</span> Additional resources</h3>
@@ -1499,7 +1472,7 @@ <h3><span class="header-section-number">10.5.3</span> Additional resources</h3>
 </div>
 <div id="whats-to-come-9" class="section level3">
 <h3><span class="header-section-number">10.5.4</span> What’s to come</h3>
-<p>You’ve now concluded the last major part of the book on “Statistical Inference via <code>infer</code>.” The closing Chapter <a href="11-thinking-with-data.html#thinking-with-data">11</a> concludes this book with various case studies involving real data, such as house prices in Seattle, WA. You’ll see how the principles in this book can help you become a great storyteller with data!</p>
+<p>You’ve now completed the last major part of the book on “Statistical Inference with <code>infer</code>.” The closing Chapter <a href="11-thinking-with-data.html#thinking-with-data">11</a> wraps up this book with various short case studies involving real data, such as house prices in the city of Seattle, Washington, in the US. You’ll see how the principles in this book can help you become a great storyteller with data!</p>
 
 </div>
 </div>
@@ -1518,11 +1491,13 @@ <h3><span class="header-section-number">10.5.4</span> What’s to come</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1530,12 +1505,11 @@ <h3><span class="header-section-number">10.5.4</span> What’s to come</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1550,6 +1524,10 @@ <h3><span class="header-section-number">10.5.4</span> What’s to come</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1566,8 +1544,9 @@ <h3><span class="header-section-number">10.5.4</span> What’s to come</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/11-thinking-with-data.html b/docs/11-thinking-with-data.html
index 57ebaccd3..57e5132da 100644
--- a/docs/11-thinking-with-data.html
+++ b/docs/11-thinking-with-data.html
@@ -4,35 +4,35 @@
 
   <meta charset="utf-8" />
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
-  <title>Chapter 11 Tell the Story with Data | Statistical Inference via Data Science</title>
+  <title>Chapter 11 Tell Your Story with Data | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
-  <meta property="og:title" content="Chapter 11 Tell the Story with Data | Statistical Inference via Data Science" />
+  <meta property="og:title" content="Chapter 11 Tell Your Story with Data | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
-  <meta name="twitter:title" content="Chapter 11 Tell the Story with Data | Statistical Inference via Data Science" />
+  <meta name="twitter:title" content="Chapter 11 Tell Your Story with Data | Statistical Inference via Data Science" />
   <meta name="twitter:site" content="@ModernDive" />
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="10-inference-for-regression.html">
-<link rel="next" href="A-appendixA.html">
+<link rel="prev" href="10-inference-for-regression.html"/>
+<link rel="next" href="A-appendixA.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -569,83 +582,87 @@ <h1>
 <img src='https://moderndive.com/wide_format.png' alt="ModernDive">
 </html>
 <div id="thinking-with-data" class="section level1">
-<h1><span class="header-section-number">Chapter 11</span> Tell the Story with Data</h1>
-<p>Recall in the Preface and at the end of chapters throughout this book, we displayed the “ModernDive flowchart” mapping your journey through this book.</p>
+<h1><span class="header-section-number">Chapter 11</span> Tell Your Story with Data</h1>
+<p>Recall that in the Preface and at the end of chapters throughout this book, we displayed the “<em>ModernDive</em> flowchart” that maps your journey through the book.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:moderndive-figure-conclusion"></span>
-<img src="images/flowcharts/flowchart/flowchart.002.png" alt="ModernDive Flowchart." width="\textwidth" />
+<img src="images/flowcharts/flowchart/flowchart.002.png" alt="ModernDive flowchart." width="100%" height="100%" />
 <p class="caption">
-FIGURE 11.1: ModernDive Flowchart.
+FIGURE 11.1: <em>ModernDive</em> flowchart.
 </p>
 </div>
-<p>Let’s go over a refresher of what you’ve covered so far. You first got started with data in Chapter <a href="1-getting-started.html#getting-started">1</a> where you learned about the difference between R and RStudio, started coding in R, installed and loaded your first R packages, and explored your first dataset: all domestic departure <code>flights</code> from a New York City airport in 2013. Then you covered the following three portions of this book:</p>
+<div id="review" class="section level2">
+<h2><span class="header-section-number">11.1</span> Review</h2>
+<p>Let’s go over a refresher of what you’ve covered so far. You first got started with data in Chapter <a href="1-getting-started.html#getting-started">1</a> where you learned about the difference between R and RStudio, started coding in R, installed and loaded your first R packages, and explored your first dataset: all domestic departure <code>flights</code> from a major New York City airport in 2013. Then you covered the following three parts of this book (Parts 2 and 4 are combined into a single portion):</p>
 <ol style="list-style-type: decimal">
-<li>Data science with <code>tidyverse</code>. You assembled your data science toolbox using <code>tidyverse</code> packages. In particular you
+<li>Data science with <code>tidyverse</code>. You assembled your data science toolbox using <code>tidyverse</code> packages. In particular, you
 <ul>
 <li>Ch.<a href="2-viz.html#viz">2</a>: Visualized data using the <code>ggplot2</code> package.</li>
 <li>Ch.<a href="3-wrangling.html#wrangling">3</a>: Wrangled data using the <code>dplyr</code> package.</li>
 <li>Ch.<a href="4-tidy.html#tidy">4</a>: Learned about the concept of “tidy” data as a standardized data frame input and output format for all packages in the <code>tidyverse</code>. Furthermore, you learned how to import spreadsheet files into R using the <code>readr</code> package.</li>
 </ul></li>
-<li>Data modeling with <code>moderndive</code>. Using these data science tools and helper functions from the <code>moderndive</code> package, you fit your first data models. In particular:
+<li>Data modeling with <code>moderndive</code>. Using these data science tools and helper functions from the <code>moderndive</code> package, you fit your first data models. In particular, you
 <ul>
-<li>Ch.<a href="5-regression.html#regression">5</a>: Basic regression models with only one explanatory variable.</li>
-<li>Ch.<a href="6-multiple-regression.html#multiple-regression">6</a>: Multiple regression models with more than one explanatory variable.</li>
+<li>Ch.<a href="5-regression.html#regression">5</a>: Discovered basic regression models with only one explanatory variable.</li>
+<li>Ch.<a href="6-multiple-regression.html#multiple-regression">6</a>: Examined multiple regression models with more than one explanatory variable.</li>
 </ul></li>
-<li>Statistical inference with <code>infer</code>. Once again using your newly acquired data science tools, you unpacked statistical inference using the <code>infer</code> package. In particular you:
+<li>Statistical inference with <code>infer</code>. Once again using your newly acquired data science tools, you unpacked statistical inference using the <code>infer</code> package. In particular, you
 <ul>
-<li>Ch.<a href="7-sampling.html#sampling">7</a>: Learned about the role that sampling variability plays in statistical inference and the role that sample size plays in sampling variability.</li>
-<li>Ch.<a href="8-confidence-intervals.html#confidence-intervals">8</a>: Constructed confidence intervals.</li>
-<li>Ch.<a href="9-hypothesis-testing.html#hypothesis-testing">9</a>: Conducted hypothesis tests.</li>
+<li>Ch.<a href="7-sampling.html#sampling">7</a>: Learned about the role that sampling variability plays in statistical inference and the role that sample size plays in this sampling variability.</li>
+<li>Ch.<a href="8-confidence-intervals.html#confidence-intervals">8</a>: Constructed confidence intervals using bootstrapping.</li>
+<li>Ch.<a href="9-hypothesis-testing.html#hypothesis-testing">9</a>: Conducted hypothesis tests using permutation.</li>
 </ul></li>
-<li>Data modeling with <code>moderndive</code> (revisited): Armed with your understanding of statistical inference, you revisited and reviewed the models you constructed in Ch.<a href="5-regression.html#regression">5</a> &amp; Ch.<a href="6-multiple-regression.html#multiple-regression">6</a>. In particular you:
+<li>Data modeling with <code>moderndive</code> (revisited): Armed with your understanding of statistical inference, you revisited and reviewed the models you constructed in Ch.<a href="5-regression.html#regression">5</a> and Ch.<a href="6-multiple-regression.html#multiple-regression">6</a>. In particular, you
 <ul>
 <li>Ch.<a href="10-inference-for-regression.html#inference-for-regression">10</a>: Interpreted confidence intervals and hypothesis tests in a regression setting.</li>
 </ul></li>
 </ol>
-<p>All this was our way of guiding you through your first experiences of <a href="https://arxiv.org/pdf/1410.3127.pdf">“thinking with data,”</a> an expression originally coined by Google’s Diane Lambert . The philosophy underlying this expression guided the path we set for you in the flowchart in Figure <a href="11-thinking-with-data.html#fig:moderndive-figure-conclusion">11.1</a>. This philosophy is well summarized in the introduction to <a href="https://peerj.com/collections/50-practicaldatascistats/">“Practical Data Science for Stats”</a>: a collection of pre-prints focusing on the practical side of data science workflows and statistical analysis curated by <a href="https://twitter.com/jennybryan">Jennifer Bryan</a>  and <a href="https://twitter.com/hadleywickham">Hadley Wickham</a>. They quote:</p>
+<p>We’ve guided you through your first experiences of <a href="https://arxiv.org/pdf/1410.3127.pdf">“thinking with data,”</a> an expression coined by Dr. Diane Lambert. The philosophy underlying this expression guided your path in the flowchart in Figure <a href="11-thinking-with-data.html#fig:moderndive-figure-conclusion">11.1</a>.</p>
+<p>This philosophy is also well summarized in <a href="https://peerj.com/collections/50-practicaldatascistats/">“Practical Data Science for Stats”</a>, a collection of preprints focusing on the practical side of data science workflows and statistical analysis, curated by <a href="https://twitter.com/jennybryan">Dr. Jennifer Bryan</a> and <a href="https://twitter.com/hadleywickham">Dr. Hadley Wickham</a>. They write:</p>
 <blockquote>
 <p>There are many aspects of day-to-day analytical work that are almost absent from the conventional statistics literature and curriculum. And yet these activities account for a considerable share of the time and effort of data analysts and applied statisticians. The goal of this collection is to increase the visibility and adoption of modern data analytical workflows. We aim to facilitate the transfer of tools and frameworks between industry and academia, between software engineering and statistics and computer science, and across different domains.</p>
 </blockquote>
-<p>In other words, to be equipped to “think with data” in the 21st century, analysts need practice going through the <a href="http://r4ds.had.co.nz/explore-intro.html">“Data/Science Pipeline”</a> we saw in the Preface (re-displayed in Figure <a href="11-thinking-with-data.html#fig:pipeline-figure-conclusion">11.2</a>). It is our opinion that for too long, statistics education only focused on parts of this pipeline, instead of going through it in its <em>entirety</em> .</p>
+<p>In other words, to be equipped to “think with data” in the 21st century, analysts need practice going through the <a href="http://r4ds.had.co.nz/explore-intro.html">“data/science pipeline”</a> we saw in the Preface (re-displayed in Figure <a href="11-thinking-with-data.html#fig:pipeline-figure-conclusion">11.2</a>). It is our opinion that, for too long, statistics education has only focused on parts of this pipeline, instead of going through it in its <em>entirety</em>.</p>
 <div class="figure" style="text-align: center"><span id="fig:pipeline-figure-conclusion"></span>
-<img src="images/r4ds/data_science_pipeline.png" alt="Data/Science Pipeline." width="\textwidth" />
+<img src="images/r4ds/data_science_pipeline.png" alt="Data/science pipeline." width="70%" height="70%" />
 <p class="caption">
-FIGURE 11.2: Data/Science Pipeline.
+FIGURE 11.2: Data/science pipeline.
 </p>
 </div>
-<p>To conclude this book, we’ll present you with some additional case studies of working with data. In Section <a href="11-thinking-with-data.html#seattle-house-prices">11.1</a> we’ll take you through a full-pass of the “Data/Science Pipeline” in order to analyze the sale price of houses in Seattle, WA, USA.</p>
-<p>In Section <a href="11-thinking-with-data.html#data-journalism">11.2</a>, we’ll present you with some examples of effective data storytelling drawn from the data journalism website <a href="https://fivethirtyeight.com/">FiveThirtyEight.com</a>. We present these case studies to you because we believe that you should not only be able to “think with data,” but also be able to “tell the story with data.” Let’s explore how this might be done!</p>
+<p>To conclude this book, we’ll present you with some additional case studies of working with data. In Section <a href="11-thinking-with-data.html#seattle-house-prices">11.2</a>, we’ll take you through a full pass of the “data/science pipeline” in order to analyze the sale price of houses in Seattle, WA, USA. In Section <a href="11-thinking-with-data.html#data-journalism">11.3</a>, we’ll present you with some examples of effective data storytelling drawn from the data journalism website <a href="https://fivethirtyeight.com/">FiveThirtyEight.com</a>. We present these case studies to you because we believe that you should not only be able to “think with data,” but also be able to “tell your story with data.” Let’s explore how to do this!</p>
 <div id="needed-packages-9" class="section level3 unnumbered">
 <h3>Needed packages</h3>
 <p>Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(moderndive)
-<span class="kw">library</span>(skimr)
-<span class="kw">library</span>(fivethirtyeight)</code></pre>
+<div class="sourceCode" id="cb441"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb441-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb441-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb441-3" data-line-number="3"><span class="kw">library</span>(skimr)</a>
+<a class="sourceLine" id="cb441-4" data-line-number="4"><span class="kw">library</span>(fivethirtyeight)</a></code></pre></div>
+</div>
 </div>
 <div id="seattle-house-prices" class="section level2">
-<h2><span class="header-section-number">11.1</span> Case study: Seattle house prices</h2>
+<h2><span class="header-section-number">11.2</span> Case study: Seattle house prices</h2>
 <p><a href="https://www.kaggle.com/">Kaggle.com</a> is a machine learning and predictive modeling competition website that hosts datasets uploaded by companies, governmental organizations, and other individuals. One of their datasets is the <a href="https://www.kaggle.com/harlfoxem/housesalesprediction">“House Sales in King County, USA”</a>. It consists of sale prices of homes sold between May 2014 and May 2015 in King County, Washington, USA, which includes the greater Seattle metropolitan area. This dataset is in the <code>house_prices</code> data frame included in the <code>moderndive</code> package.</p>
 <p>The dataset consists of 21,613 houses and 21 variables describing these houses (for a full list and description of these variables, see the help file by running <code>?house_prices</code> in the console). In this case study, we’ll create a multiple regression model where:</p>
-<ol style="list-style-type: decimal">
+<ul>
 <li>The outcome variable <span class="math inline">\(y\)</span> is the sale <code>price</code> of houses.</li>
 <li>Two explanatory variables:
 <ol style="list-style-type: decimal">
 <li>A numerical explanatory variable <span class="math inline">\(x_1\)</span>: house size <code>sqft_living</code> as measured in square feet of living space. Note that 1 square foot is about 0.09 square meters.</li>
-<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>: house <code>condition</code>, a categorical variable with 5 levels where <code>1</code> indicates “poor” and <code>5</code> indicates “excellent.”</li>
+<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>: house <code>condition</code>, a categorical variable with five levels where <code>1</code> indicates “poor” and <code>5</code> indicates “excellent.”</li>
 </ol></li>
-</ol>
+</ul>
 <div id="house-prices-EDA-I" class="section level3">
-<h3><span class="header-section-number">11.1.1</span> Exploratory data analysis: Part I</h3>
+<h3><span class="header-section-number">11.2.1</span> Exploratory data analysis: Part I</h3>
 <p>As we’ve said numerous times throughout this book, a crucial first step when presented with data is to perform an exploratory data analysis (EDA). Exploratory data analysis can give you a sense of your data, help identify issues with your data, bring to light any outliers, and help inform model construction.</p>
-<p>Recall the three common steps in an exploratory data analysis we introduced in Section <a href="5-regression.html#model1EDA">5.1.1</a>:</p>
+<p>Recall the three common steps in an exploratory data analysis we introduced in Subsection <a href="5-regression.html#model1EDA">5.1.1</a>:</p>
 <ol style="list-style-type: decimal">
 <li>Looking at the raw data values.</li>
 <li>Computing summary statistics.</li>
 <li>Creating data visualizations.</li>
 </ol>
 <p>First, let’s look at the raw data using <code>View()</code> to bring up RStudio’s spreadsheet viewer and the <code>glimpse()</code> function from the <code>dplyr</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">View</span>(house_prices)
-<span class="kw">glimpse</span>(house_prices)</code></pre>
+<div class="sourceCode" id="cb442"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb442-1" data-line-number="1"><span class="kw">View</span>(house_prices)</a>
+<a class="sourceLine" id="cb442-2" data-line-number="2"><span class="kw">glimpse</span>(house_prices)</a></code></pre></div>
 <pre><code>Observations: 21,613
 Variables: 21
 $ id            &lt;chr&gt; &quot;7129300520&quot;, &quot;6414100192&quot;, &quot;5631500400&quot;, &quot;2487200875&quot;,…
@@ -669,78 +686,78 @@ <h3><span class="header-section-number">11.1.1</span> Exploratory data analysis:
 $ long          &lt;dbl&gt; -122, -122, -122, -122, -122, -122, -122, -122, -122, -…
 $ sqft_living15 &lt;int&gt; 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2…
 $ sqft_lot15    &lt;int&gt; 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,…</code></pre>
-<p>Here are some questions you can ask yourself at this stage of an EDA: Which variables are numerical and which are categorical? For the categorical variables, what are their levels? Besides the variables we’ll be using in our regression model, what other variables do you think would be useful to use in a model for house price?</p>
-<p>Observe, for example, that while the <code>condition</code> variable has values <code>1</code> through <code>5</code>, these are saved in R as <code>fct</code> factors. This is R’s way of saving categorical variables. So you should think of these as the “labels” <code>1</code> through <code>5</code> and not the numerical values <code>1</code> through <code>5</code>.</p>
+<p>Here are some questions you can ask yourself at this stage of an EDA: Which variables are numerical? Which are categorical? For the categorical variables, what are their levels? Besides the variables we’ll be using in our regression model, what other variables do you think would be useful to use in a model for house price?</p>
+<p>Observe, for example, that while the <code>condition</code> variable has values <code>1</code> through <code>5</code>, these are saved in R as <code>fct</code> standing for “factors.” This is one of R’s ways of saving categorical variables. So you should think of these as the “labels” <code>1</code> through <code>5</code> and not the numerical values <code>1</code> through <code>5</code>.</p>
 <p>Let’s now perform the second step in an EDA: computing summary statistics. Recall from Section <a href="3-wrangling.html#summarize">3.3</a> that <em>summary statistics</em> are single numerical values that summarize a large number of values. Examples of summary statistics include the mean, the median, the standard deviation, and various percentiles.</p>
-<p>We could do this using the <code>summarize()</code> function the <code>dplyr</code> package along with R’s built-in <em>summary functions</em>, like <code>mean()</code> and <code>median()</code>. However, recall in Section <a href="3-wrangling.html#mutate">3.5</a>, we saw the following code that computes a variety of summary statistics of the variable <code>gain</code>, which is the amount of time that a flight makes up mid-air:</p>
-<pre class="sourceCode r"><code class="sourceCode r">gain_summary &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(
-    <span class="dt">min =</span> <span class="kw">min</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">q1 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.25</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">median =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.5</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">q3 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.75</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">max =</span> <span class="kw">max</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">mean =</span> <span class="kw">mean</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">sd =</span> <span class="kw">sd</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">missing =</span> <span class="kw">sum</span>(<span class="kw">is.na</span>(gain))
-  )</code></pre>
+<p>We could do this using the <code>summarize()</code> function in the <code>dplyr</code> package along with R’s built-in <em>summary functions</em>, like <code>mean()</code> and <code>median()</code>. However, recall in Section <a href="3-wrangling.html#mutate">3.5</a>, we saw the following code that computes a variety of summary statistics of the variable <code>gain</code>, which is the amount of time that a flight makes up mid-air:</p>
+<div class="sourceCode" id="cb444"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb444-1" data-line-number="1">gain_summary &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb444-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(</a>
+<a class="sourceLine" id="cb444-3" data-line-number="3">    <span class="dt">min =</span> <span class="kw">min</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-4" data-line-number="4">    <span class="dt">q1 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.25</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-5" data-line-number="5">    <span class="dt">median =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.5</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-6" data-line-number="6">    <span class="dt">q3 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.75</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-7" data-line-number="7">    <span class="dt">max =</span> <span class="kw">max</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-8" data-line-number="8">    <span class="dt">mean =</span> <span class="kw">mean</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-9" data-line-number="9">    <span class="dt">sd =</span> <span class="kw">sd</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb444-10" data-line-number="10">    <span class="dt">missing =</span> <span class="kw">sum</span>(<span class="kw">is.na</span>(gain))</a>
+<a class="sourceLine" id="cb444-11" data-line-number="11">  )</a></code></pre></div>
 <p>To repeat this for all three <code>price</code>, <code>sqft_living</code>, and <code>condition</code> variables would be tedious to code up. So instead, let’s use the convenient <code>skim()</code> function from the <code>skimr</code> package we first used in Subsection <a href="6-multiple-regression.html#model4EDA">6.1.1</a>, being sure to only <code>select()</code> the variables of interest for our model:</p>
-<pre class="sourceCode r"><code class="sourceCode r">house_prices <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(price, sqft_living, condition) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">skim</span>()</code></pre>
+<div class="sourceCode" id="cb445"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb445-1" data-line-number="1">house_prices <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb445-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(price, sqft_living, condition) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb445-3" data-line-number="3"><span class="st">  </span><span class="kw">skim</span>()</a></code></pre></div>
 <pre><code>Skim summary statistics
  n obs: 21613 
  n variables: 3 
 
-── Variable type:factor ─────────────────────────────────────────────────────────────
+── Variable type:factor 
   variable missing complete     n n_unique                         top_counts ordered
  condition       0    21613 21613        5 3: 14031, 4: 5679, 5: 1701, 2: 172   FALSE
 
-── Variable type:integer ─────────────────────────────────────────────────
+── Variable type:integer 
     variable missing complete     n   mean     sd  p0  p25  p50  p75  p100
  sqft_living       0    21613 21613 2079.9 918.44 290 1427 1910 2550 13540 
 
-── Variable type:numeric ─────────────────────────────────────────────────────────────
+── Variable type:numeric 
  variable missing complete     n      mean       sd    p0    p25    p50    p75    p100
     price       0    21613 21613 540088.14 367127.2 75000 321950 450000 645000 7700000</code></pre>
-<p>Observe that the mean <code>price</code> of $540,088 is larger than the median of $450,000. This is because a small number of very expensive houses are inflating the average. In other words, there are “outlier” house prices in our dataset. (This fact will become very apparent when we create our visualizations next.)</p>
+<p>Observe that the mean <code>price</code> of $540,088 is larger than the median of $450,000. This is because a small number of very expensive houses are inflating the average. In other words, there are “outlier” house prices in our dataset. (This fact will become even more apparent when we create our visualizations next.)</p>
 <p>However, the median is not as sensitive to such outlier house prices. This is why news reports about the real estate market generally cite median house prices and not mean/average house prices. We say here that the median is more <em>robust to outliers</em> than the mean. Similarly, while the standard deviation and the interquartile range (IQR) are both measures of spread and variability, the IQR is more <em>robust to outliers</em>.</p>
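+<p>To see this robustness in a small setting, here is a minimal sketch that uses made-up prices rather than the <code>house_prices</code> data. The vector names are hypothetical, and the single extreme value simply mimics the maximum <code>price</code> of $7,700,000 from the <code>skim()</code> output.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy prices in USD (not taken from house_prices)
+prices &lt;- c(300000, 350000, 400000, 450000, 500000)
+
+# Same prices plus one very expensive house
+prices_with_outlier &lt;- c(prices, 7700000)
+
+# Measures of center: compare how far each one moves
+c(mean = mean(prices), median = median(prices))
+c(mean = mean(prices_with_outlier), median = median(prices_with_outlier))
+
+# Measures of spread: compare how far each one moves
+c(sd = sd(prices), IQR = IQR(prices))
+c(sd = sd(prices_with_outlier), IQR = IQR(prices_with_outlier))</code></pre>
+<p>Running this, you should see the mean and standard deviation change substantially once the expensive house is added, while the median and IQR change far less.</p>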
-<p>Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s first create <em>univariate</em> visualizations, in other produce plots focusing on single variables at a time. Since <code>price</code> and <code>sqft_living</code> are numerical variables, we can visualize their distributions using a <code>geom_histogram()</code> as seen in Section <a href="2-viz.html#histograms">2.5</a> on histograms. On the other hand, since <code>condition</code> is categorical, we can visualize its distribution using a <code>geom_bar()</code>. Recall from Section <a href="2-viz.html#geombar">2.8</a> on barplots that since <code>condition</code> is not “pre-counted”, we use a <code>geom_bar()</code> and not a <code>geom_col()</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Histogram of house price:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> price)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;price (USD)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House price&quot;</span>)
-
-<span class="co"># Histogram of sqft_living:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> sqft_living)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;living space (square feet)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House size&quot;</span>)
-
-<span class="co"># Barplot of condition:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> condition)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;condition&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House condition&quot;</span>)</code></pre>
+<p>Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s first create <em>univariate</em> visualizations. These are plots focusing on a single variable at a time. Since <code>price</code> and <code>sqft_living</code> are numerical variables, we can visualize their distributions using a <code>geom_histogram()</code> as seen in Section <a href="2-viz.html#histograms">2.5</a> on histograms. On the other hand, since <code>condition</code> is categorical, we can visualize its distribution using a <code>geom_bar()</code>. Recall from Section <a href="2-viz.html#geombar">2.8</a> on barplots that since <code>condition</code> is not “pre-counted”, we use a <code>geom_bar()</code> and not a <code>geom_col()</code>.</p>
+<div class="sourceCode" id="cb447"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb447-1" data-line-number="1"><span class="co"># Histogram of house price:</span></a>
+<a class="sourceLine" id="cb447-2" data-line-number="2"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> price)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb447-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb447-4" data-line-number="4"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;price (USD)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House price&quot;</span>)</a>
+<a class="sourceLine" id="cb447-5" data-line-number="5"></a>
+<a class="sourceLine" id="cb447-6" data-line-number="6"><span class="co"># Histogram of sqft_living:</span></a>
+<a class="sourceLine" id="cb447-7" data-line-number="7"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> sqft_living)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb447-8" data-line-number="8"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb447-9" data-line-number="9"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;living space (square feet)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House size&quot;</span>)</a>
+<a class="sourceLine" id="cb447-10" data-line-number="10"></a>
+<a class="sourceLine" id="cb447-11" data-line-number="11"><span class="co"># Barplot of condition:</span></a>
+<a class="sourceLine" id="cb447-12" data-line-number="12"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> condition)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb447-13" data-line-number="13"><span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb447-14" data-line-number="14"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;condition&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House condition&quot;</span>)</a></code></pre></div>
 <p>In Figure <a href="11-thinking-with-data.html#fig:house-prices-viz">11.3</a>, we display all three of these visualizations at once.</p>
 <div class="figure" style="text-align: center"><span id="fig:house-prices-viz"></span>
-<img src="moderndive_files/figure-html/house-prices-viz-1.png" alt="Exploratory visualizations of Seattle house prices data." width="\textwidth" />
+<img src="ModernDive_files/figure-html/house-prices-viz-1.png" alt="Exploratory visualizations of Seattle house prices data." width="\textwidth" />
 <p class="caption">
 FIGURE 11.3: Exploratory visualizations of Seattle house prices data.
 </p>
 </div>
-<p>First, observe in the bottom plot that most houses are of condition “3”, with a few more of condition “4” and “5”, and almost none that are “1” or “2”.</p>
-<p>Next, observe in the histogram for <code>price</code> in the top-left plot that a majority of houses are less than two million dollars. Observe also that the x-axis stretches out to 8 million dollars, even though there does not appear to be any houses close to that price. This is because there are a <em>very small number</em> of houses with prices closer to 8 million. These are the outlier house prices we mentioned earlier. We say that the variable <code>price</code> is <em>right skewed</em> as exhibited by the long right tail.</p>
-<p>Notice, observe in the histogram of <code>sqft_living</code> in the middle plot as well that most houses appear to have less than 5000 square feet of living space. For comparison an American football field is about 57,600 square feet whereas a standard soccer /association football field is about 64,000 square feet. Observe also that this variable is also right skewed, although not as drastically as the <code>price</code> variable.</p>
-<p>For both the <code>price</code> and <code>sqft_living</code> variables, the right-skew makes distinguishing houses at the lower end of the x-axis hard. This is because the scale of the x-axis is compressed by the small number of very expensive and very large houses.</p>
-<p>So what can we do about this skew? Let’s apply a <em>log10-transformation</em> to these variables. If you are unfamiliar with such transformations, we highly recommend you read Appendix <a href="A-appendixA.html#appendix-log10-transformations">A.3</a> on log-transformations. Briefly however, log-transformations allow us to alter the scale a variable to focus on <em>multiplicative</em> changes instead of <em>additive</em> changes. In other words, <em>relative</em> changes instead of <em>absolute</em> changes. Such multiplicative/relative changes are also called changes in <em>orders of magnitude</em>.</p>
-<p>Let’s create new log10-transformed versions of the right-skewed variable <code>price</code> and <code>sqft_living</code> using the <code>mutate()</code> function from Section <a href="3-wrangling.html#mutate">3.5</a>, but we’ll give the latter the name <code>log10_size</code>, which is shorter and easier to understand than the name <code>log10_sqft_living</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">house_prices &lt;-<span class="st"> </span>house_prices <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">mutate</span>(
-    <span class="dt">log10_price =</span> <span class="kw">log10</span>(price),
-    <span class="dt">log10_size =</span> <span class="kw">log10</span>(sqft_living)
-    )</code></pre>
+<p>First, observe in the bottom plot that most houses are of condition “3”, with a few more of conditions “4” and “5”, and almost none that are “1” or “2”.</p>
+<p>Next, observe in the histogram for <code>price</code> in the top-left plot that a majority of houses are priced under two million dollars. Observe also that the x-axis stretches out to 8 million dollars, even though there do not appear to be any houses close to that price. This is because there are a <em>very small number</em> of houses with prices closer to 8 million. These are the outlier house prices we mentioned earlier. We say that the variable <code>price</code> is <em>right-skewed</em> as exhibited by the long right tail.</p>
+<p>Further, observe in the histogram of <code>sqft_living</code> in the middle plot that most houses appear to have less than 5000 square feet of living space. For comparison, a football field in the US is about 57,600 square feet, whereas a standard soccer/association football field is about 64,000 square feet. Observe that this variable is also right-skewed, although not as drastically as the <code>price</code> variable.</p>
+<p>For both the <code>price</code> and <code>sqft_living</code> variables, the right-skew makes distinguishing houses at the lower end of the x-axis hard. This is because the scale of the x-axis is compressed by the small number of quite expensive and immensely-sized houses.</p>
+<p>So what can we do about this skew? Let’s apply a <em>log10 transformation</em> to these variables. If you are unfamiliar with such transformations, we highly recommend you read Appendix <a href="A-appendixA.html#appendix-log10-transformations">A.3</a> on logarithmic (log) transformations. In summary, log transformations allow us to alter the scale of a variable to focus on <em>multiplicative</em> changes instead of <em>additive</em> changes. In other words, they shift the view to be on <em>relative</em> changes instead of <em>absolute</em> changes. Such multiplicative/relative changes are also called changes in <em>orders of magnitude</em>.</p>
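+<p>As a rough illustration of “multiplicative instead of additive,” consider the following sketch. The vector <code>x</code> is a hypothetical example, not part of the dataset. Each value is 10 times the previous one, so the raw gaps keep growing while the gaps on the log10 scale stay constant.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical values, each 10 times the previous one
+x &lt;- c(100, 1000, 10000, 100000)
+
+# Additive (absolute) gaps between neighbors grow quickly
+diff(x)
+
+# On the log10 scale the values are 2, 3, 4, 5 ...
+log10(x)
+
+# ... so every tenfold (relative) change becomes a gap of 1
+diff(log10(x))</code></pre>
+<p>In other words, a difference of 1 on the log10 scale corresponds to one order of magnitude on the original scale.</p>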
+<p>Let’s create new log10-transformed versions of the right-skewed variables <code>price</code> and <code>sqft_living</code> using the <code>mutate()</code> function from Section <a href="3-wrangling.html#mutate">3.5</a>, but we’ll give the latter the name <code>log10_size</code>, which is shorter and easier to understand than the name <code>log10_sqft_living</code>.</p>
+<div class="sourceCode" id="cb448"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb448-1" data-line-number="1">house_prices &lt;-<span class="st"> </span>house_prices <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb448-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(</a>
+<a class="sourceLine" id="cb448-3" data-line-number="3">    <span class="dt">log10_price =</span> <span class="kw">log10</span>(price),</a>
+<a class="sourceLine" id="cb448-4" data-line-number="4">    <span class="dt">log10_size =</span> <span class="kw">log10</span>(sqft_living)</a>
+<a class="sourceLine" id="cb448-5" data-line-number="5">    )</a></code></pre></div>
 <p>Let’s display the before and after effects of this transformation on these variables for only the first 10 rows of <code>house_prices</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">house_prices <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(price, log10_price, sqft_living, log10_size)</code></pre>
+<div class="sourceCode" id="cb449"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb449-1" data-line-number="1">house_prices <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb449-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(price, log10_price, sqft_living, log10_size)</a></code></pre></div>
 <pre><code># A tibble: 21,613 x 4
      price log10_price sqft_living log10_size
      &lt;dbl&gt;       &lt;dbl&gt;       &lt;int&gt;      &lt;dbl&gt;
@@ -755,108 +772,112 @@ <h3><span class="header-section-number">11.1.1</span> Exploratory data analysis:
  9  229500     5.36078        1780    3.25042
 10  323000     5.50920        1890    3.27646
 # … with 21,603 more rows</code></pre>
-<p>Observe in particular the houses in the sixth and third row. The house in the sixth row has <code>price</code> $1,225,000, which is just above one million dollars. Since <span class="math inline">\(10^6\)</span> is one million, its <code>log10_price</code> is 6.09. Contrast this with all other houses with <code>log10_price</code> less than six, since they all have <code>price</code> less than $1,000,000. The house in the third row is the only house with <code>sqft_living</code> less than 1000. Since <span class="math inline">\(1000 = 10^3\)</span>, it’s the lone house with <code>log10_size</code> less than 3.</p>
+<p>Observe in particular the houses in the sixth and third rows. The house in the sixth row has <code>price</code> $1,225,000, which is just above one million dollars. Since <span class="math inline">\(10^6\)</span> is one million, its <code>log10_price</code> is around 6.09.</p>
+<p>Contrast this with the other houses shown, which all have <code>log10_price</code> less than six since they all have <code>price</code> less than $1,000,000. Among these 10 rows, the house in the third row is the only one with <code>sqft_living</code> less than 1000. Since <span class="math inline">\(1000 = 10^3\)</span>, it’s the lone house shown with <code>log10_size</code> less than 3.</p>
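+<p>If you’d like to check these values yourself, the following sketch recomputes them directly. The object names <code>price_row6</code> and <code>log10_price_row6</code> are hypothetical and used only for illustration. The price is the one quoted above for the house in the sixth row.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Price of the house in the sixth row, as quoted in the text
+price_row6 &lt;- 1225000
+
+# Its log10 value is a little above 6, since 10^6 is one million
+log10_price_row6 &lt;- log10(price_row6)
+log10_price_row6
+
+# Raising 10 to this power undoes the transformation,
+# recovering the original price up to floating-point rounding
+10^log10_price_row6
+
+# The reference points used in the text
+log10(1000000)  # one million corresponds to 6
+log10(1000)     # one thousand corresponds to 3</code></pre>
+<p>Being able to undo the transformation with <code>10^</code> is what lets you translate a value on the log10 scale back into dollars or square feet.</p>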
 <p>Let’s now visualize the before and after effects of this transformation for <code>price</code> in Figure <a href="11-thinking-with-data.html#fig:log10-price-viz">11.4</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Before log10-transformation:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> price)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;price (USD)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House price: Before&quot;</span>)
-
-<span class="co"># After log10-transformation:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> log10_price)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;log10 price (USD)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House price: After&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb451"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb451-1" data-line-number="1"><span class="co"># Before log10 transformation:</span></a>
+<a class="sourceLine" id="cb451-2" data-line-number="2"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> price)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb451-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb451-4" data-line-number="4"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;price (USD)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House price: Before&quot;</span>)</a>
+<a class="sourceLine" id="cb451-5" data-line-number="5"></a>
+<a class="sourceLine" id="cb451-6" data-line-number="6"><span class="co"># After log10 transformation:</span></a>
+<a class="sourceLine" id="cb451-7" data-line-number="7"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> log10_price)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb451-8" data-line-number="8"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb451-9" data-line-number="9"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;log10 price (USD)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House price: After&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:log10-price-viz"></span>
-<img src="moderndive_files/figure-html/log10-price-viz-1.png" alt="House price before and after log10-transformation." width="\textwidth" />
+<img src="ModernDive_files/figure-html/log10-price-viz-1.png" alt="House price before and after log10 transformation." width="\textwidth" />
 <p class="caption">
-FIGURE 11.4: House price before and after log10-transformation.
+FIGURE 11.4: House price before and after log10 transformation.
 </p>
 </div>
-<p>Observe that after the transformation, the distribution is much less skewed, and in this case, more symmetric and more bell-shaped. Now you can now more easily distinguish the lower priced houses.</p>
-<p>Let’s do the same for house size, where the variable <code>sqft_living</code> and was log10-transformed to <code>log10_size</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Before log10-transformation:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> sqft_living)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;living space (square feet)&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;House size: Before&quot;</span>)
-
-<span class="co"># After log10-transformation:</span>
-<span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> log10_size)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;log10 living space (square feet)&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;House size: After&quot;</span>)</code></pre>
+<p>Observe that after the transformation, the distribution is much less skewed, and in this case, more symmetric and more bell-shaped. Now you can more easily distinguish the lower priced houses.</p>
+<p>Let’s do the same for house size, where the variable <code>sqft_living</code> was log10 transformed to <code>log10_size</code>.</p>
+<div class="sourceCode" id="cb452"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb452-1" data-line-number="1"><span class="co"># Before log10 transformation:</span></a>
+<a class="sourceLine" id="cb452-2" data-line-number="2"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> sqft_living)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb452-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb452-4" data-line-number="4"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;living space (square feet)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House size: Before&quot;</span>)</a>
+<a class="sourceLine" id="cb452-5" data-line-number="5"></a>
+<a class="sourceLine" id="cb452-6" data-line-number="6"><span class="co"># After log10 transformation:</span></a>
+<a class="sourceLine" id="cb452-7" data-line-number="7"><span class="kw">ggplot</span>(house_prices, <span class="kw">aes</span>(<span class="dt">x =</span> log10_size)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb452-8" data-line-number="8"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb452-9" data-line-number="9"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;log10 living space (square feet)&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;House size: After&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:log10-size-viz"></span>
-<img src="moderndive_files/figure-html/log10-size-viz-1.png" alt="House size before and after log10-transformation." width="\textwidth" />
+<img src="ModernDive_files/figure-html/log10-size-viz-1.png" alt="House size before and after log10 transformation." width="\textwidth" />
 <p class="caption">
-FIGURE 11.5: House size before and after log10-transformation.
+FIGURE 11.5: House size before and after log10 transformation.
 </p>
 </div>
-<p>Observe in Figure <a href="11-thinking-with-data.html#fig:log10-size-viz">11.5</a> that the log10-transformation has a similar effect of un-skewing the variable. We emphasize that while in these two cases the resulting distributions are more symmetric and bell-shaped, this is not always necessarily the case.</p>
-<p>Given the now un-skewed nature of <code>log10_price</code> and <code>log10_size</code>, we are going to revise our multiple regression model to use our new variables:</p>
+<p>Observe in Figure <a href="11-thinking-with-data.html#fig:log10-size-viz">11.5</a> that the log10 transformation has a similar effect of unskewing the variable. We emphasize that while in these two cases the resulting distributions are more symmetric and bell-shaped, this is not always the case.</p>
+<p>Given the now symmetric nature of <code>log10_price</code> and <code>log10_size</code>, we are going to revise our multiple regression model to use our new variables:</p>
 <ol style="list-style-type: decimal">
 <li>The outcome variable <span class="math inline">\(y\)</span> is the sale <code>log10_price</code> of houses.</li>
 <li>Two explanatory variables:
 <ol style="list-style-type: decimal">
-<li>A numerical explanatory variable <span class="math inline">\(x_1\)</span>: house size <code>log10_size</code> as measured in log10 square feet of living space.</li>
-<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>: house <code>condition</code>, a categorical variable with 5 levels where <code>1</code> indicates “poor” and <code>5</code> indicates “excellent.”</li>
+<li>A numerical explanatory variable <span class="math inline">\(x_1\)</span>: house size <code>log10_size</code> as measured in log base 10 square feet of living space.</li>
+<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>: house <code>condition</code>, a categorical variable with five levels where <code>1</code> indicates “poor” and <code>5</code> indicates “excellent.”</li>
 </ol></li>
 </ol>
 </div>
 <div id="house-prices-EDA-II" class="section level3">
-<h3><span class="header-section-number">11.1.2</span> Exploratory data analysis: Part II</h3>
+<h3><span class="header-section-number">11.2.2</span> Exploratory data analysis: Part II</h3>
 <p>Let’s now continue our EDA by creating <em>multivariate</em> visualizations. Unlike the <em>univariate</em> histograms and barplot in the earlier Figures <a href="11-thinking-with-data.html#fig:house-prices-viz">11.3</a>, <a href="11-thinking-with-data.html#fig:log10-price-viz">11.4</a>, and <a href="11-thinking-with-data.html#fig:log10-size-viz">11.5</a>, <em>multivariate</em> visualizations show relationships between more than one variable. This is an important step of an EDA to perform since the goal of modeling is to explore relationships between variables.</p>
-<p>Since our model involves a numerical outcome variable, a numerical explanatory variable, and a categorical explanatory variable, we are in a similar regression modeling situation as in Section <a href="6-multiple-regression.html#model4">6.1</a> where we studied UT Austin teaching scores dataset. Recall in that case the numerical outcome variable was teaching <code>score</code>, the numerical explanatory variable was instructor <code>age</code>, and the categorical explanatory variable was (binary) <code>gender</code>.</p>
-<p>We thus have two choices of models we can fit. Either 1) an <em>interaction model</em> where the regression line for each <code>condition</code> level will have both a different slope and a different intercept or 2) a <em>parallel slopes model</em> where the regression line for each <code>condition</code> level will have the same slope but different intercepts.</p>
-<p>Recall from Subsection <a href="6-multiple-regression.html#model4table">6.1.3</a> on the parallel slopes model that the <code>ggplot2</code> package does not have a convenient way to plot a parallel slopes model. We therefore use the special purpose <code>gg_parallel_slopes()</code> function included in the <code>moderndive</code> package. We plot both resulting models in Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a>, with the interaction model in the left-hand plot.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Plot interaction model</span>
-<span class="kw">ggplot</span>(house_prices, 
-       <span class="kw">aes</span>(<span class="dt">x =</span> log10_size, <span class="dt">y =</span> log10_price, <span class="dt">col =</span> condition)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.05</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;log10 price&quot;</span>, <span class="dt">x =</span> <span class="st">&quot;log10 size&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;House prices in Seattle&quot;</span>)
-
-<span class="co"># Plot parallel slopes model</span>
-<span class="kw">gg_parallel_slopes</span>(<span class="dt">y =</span> <span class="st">&quot;log10_price&quot;</span>, <span class="dt">num_x =</span> <span class="st">&quot;log10_size&quot;</span>, 
-                   <span class="dt">cat_x =</span> <span class="st">&quot;condition&quot;</span>, <span class="dt">data =</span> house_prices, 
-                   <span class="dt">alpha =</span> <span class="fl">0.05</span>)</code></pre>
+<p>Since our model involves a numerical outcome variable, a numerical explanatory variable, and a categorical explanatory variable, we are in a similar regression modeling situation as in Section <a href="6-multiple-regression.html#model4">6.1</a> where we studied the UT Austin teaching scores dataset. Recall in that case the numerical outcome variable was teaching <code>score</code>, the numerical explanatory variable was instructor <code>age</code>, and the categorical explanatory variable was (binary) <code>gender</code>.</p>
+<p>We thus have two choices of models we can fit: either (1) an <em>interaction model</em> where the regression line for each <code>condition</code> level will have both a different slope and a different intercept or (2) a <em>parallel slopes model</em> where the regression line for each <code>condition</code> level will have the same slope but different intercepts.</p>
+<p>Recall from Subsection <a href="6-multiple-regression.html#model4table">6.1.3</a> that the <code>geom_parallel_slopes()</code> function is a special purpose function that Evgeni Chasnovski created and included in the <code>moderndive</code> package, since the <code>geom_smooth()</code> method in the <code>ggplot2</code> package does not have a convenient way to plot parallel slopes models. We plot both resulting models in Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a>, with the interaction model on the left.</p>
+<div class="sourceCode" id="cb453"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb453-1" data-line-number="1"><span class="co"># Plot interaction model</span></a>
+<a class="sourceLine" id="cb453-2" data-line-number="2"><span class="kw">ggplot</span>(house_prices, </a>
+<a class="sourceLine" id="cb453-3" data-line-number="3">       <span class="kw">aes</span>(<span class="dt">x =</span> log10_size, <span class="dt">y =</span> log10_price, <span class="dt">col =</span> condition)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb453-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.05</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb453-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb453-6" data-line-number="6"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;log10 price&quot;</span>, </a>
+<a class="sourceLine" id="cb453-7" data-line-number="7">       <span class="dt">x =</span> <span class="st">&quot;log10 size&quot;</span>, </a>
+<a class="sourceLine" id="cb453-8" data-line-number="8">       <span class="dt">title =</span> <span class="st">&quot;House prices in Seattle&quot;</span>)</a>
+<a class="sourceLine" id="cb453-9" data-line-number="9"><span class="co"># Plot parallel slopes model</span></a>
+<a class="sourceLine" id="cb453-10" data-line-number="10"><span class="kw">ggplot</span>(house_prices, </a>
+<a class="sourceLine" id="cb453-11" data-line-number="11">       <span class="kw">aes</span>(<span class="dt">x =</span> log10_size, <span class="dt">y =</span> log10_price, <span class="dt">col =</span> condition)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb453-12" data-line-number="12"><span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.05</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb453-13" data-line-number="13"><span class="st">  </span><span class="kw">geom_parallel_slopes</span>(<span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb453-14" data-line-number="14"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;log10 price&quot;</span>, </a>
+<a class="sourceLine" id="cb453-15" data-line-number="15">       <span class="dt">x =</span> <span class="st">&quot;log10 size&quot;</span>, </a>
+<a class="sourceLine" id="cb453-16" data-line-number="16">       <span class="dt">title =</span> <span class="st">&quot;House prices in Seattle&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:house-price-parallel-slopes"></span>
-<img src="moderndive_files/figure-html/house-price-parallel-slopes-1.png" alt="Interaction and parallel slopes models." width="\textwidth" />
+<img src="ModernDive_files/figure-html/house-price-parallel-slopes-1.png" alt="Interaction and parallel slopes models." width="\textwidth" />
 <p class="caption">
 FIGURE 11.6: Interaction and parallel slopes models.
 </p>
 </div>
-<p>In both cases, we see there is a positive relationship between house price and size, meaning as houses are larger, they tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes as evidenced by the fact that the purple line is highest, followed by condition 4 and 3. As for condition 1 and 2, this pattern isn’t as clear. Recall from the univariate barplot of <code>condition</code> in Figure <a href="11-thinking-with-data.html#fig:house-prices-viz">11.3</a>, there are very few houses of condition 1 or 2.</p>
-<p>Let’s also show a faceted version of just the interaction model in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-2">11.7</a>. It is now much more apparent that there are very few houses of condition 1 or 2.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(house_prices, 
-       <span class="kw">aes</span>(<span class="dt">x =</span> log10_size, <span class="dt">y =</span> log10_price, <span class="dt">col =</span> condition)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.4</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;log10 price&quot;</span>, <span class="dt">x =</span> <span class="st">&quot;log10 size&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;House prices in Seattle&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span>condition)</code></pre>
+<p>In both cases, we see there is a positive relationship between house price and size, meaning that larger houses tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes as evidenced by the fact that the line for condition 5 is highest, followed by conditions 4 and 3. As for conditions 1 and 2, this pattern isn’t as clear. Recall from the univariate barplot of <code>condition</code> in Figure <a href="11-thinking-with-data.html#fig:house-prices-viz">11.3</a> that there are only a few houses of condition 1 or 2.</p>
+<p>Let’s also show a faceted version of just the interaction model in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-2">11.7</a>. It is now much more apparent just how few houses are of condition 1 or 2.</p>
+<div class="sourceCode" id="cb454"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb454-1" data-line-number="1"><span class="kw">ggplot</span>(house_prices, </a>
+<a class="sourceLine" id="cb454-2" data-line-number="2">       <span class="kw">aes</span>(<span class="dt">x =</span> log10_size, <span class="dt">y =</span> log10_price, <span class="dt">col =</span> condition)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb454-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.4</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb454-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb454-5" data-line-number="5"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;log10 price&quot;</span>, </a>
+<a class="sourceLine" id="cb454-6" data-line-number="6">       <span class="dt">x =</span> <span class="st">&quot;log10 size&quot;</span>, </a>
+<a class="sourceLine" id="cb454-7" data-line-number="7">       <span class="dt">title =</span> <span class="st">&quot;House prices in Seattle&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb454-8" data-line-number="8"><span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>condition)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:house-price-interaction-2"></span>
-<img src="moderndive_files/figure-html/house-price-interaction-2-1.png" alt="Facetted plot of interaction model." width="\textwidth" />
+<img src="ModernDive_files/figure-html/house-price-interaction-2-1.png" alt="Faceted plot of interaction model." width="\textwidth" />
 <p class="caption">
-FIGURE 11.7: Facetted plot of interaction model.
+FIGURE 11.7: Faceted plot of interaction model.
 </p>
 </div>
-<p>Which exploratory visualization of the interaction model is better, the one in the left-hand plot of Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a> or the faceted version in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-2">11.7</a>? There is no universal right answer. You need to make a choice depending on what you want to convey, and own that choice.</p>
+<p>Which exploratory visualization of the interaction model is better, the one in the left-hand plot of Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a> or the faceted version in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-2">11.7</a>? There is no universal right answer. You need to make a choice depending on what you want to convey, and own that choice. Including and discussing both is also an option when needed.</p>
 </div>
 <div id="house-prices-regression" class="section level3">
-<h3><span class="header-section-number">11.1.3</span> Regression modeling</h3>
+<h3><span class="header-section-number">11.2.3</span> Regression modeling</h3>
 <p>Which of the two models in Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a> is “better”? The interaction model in the left-hand plot or the parallel slopes model in the right-hand plot?</p>
 <p>We had a similar discussion in Subsection <a href="6-multiple-regression.html#model-selection">6.3.1</a> on <em>model selection</em>. Recall that we stated that we should only favor more complex models if the additional complexity is <em>warranted</em>. In this case, the more complex model is the interaction model since it considers five intercepts and five slopes total. This is in contrast to the parallel slopes model which considers five intercepts but only one common slope.</p>
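+<p>As a rough check on this count, the following sketch uses base R’s <code>model.matrix()</code> to count how many coefficients each model formula would estimate. It assumes a small made-up data frame: the names <code>toy</code>, <code>x</code>, and <code>f</code> are hypothetical stand-ins for a numerical variable and a five-level categorical variable, and the values are not taken from <code>house_prices</code>.</p>

```r
# Hypothetical toy data: one numerical variable and a five-level factor.
# These values are made up for illustration; they are not from house_prices.
toy <- data.frame(
  x = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.4, 8.9, 9.5, 10.1),
  f = factor(rep(1:5, each = 2))
)

# Interaction model: baseline intercept and slope, plus four intercept
# offsets and four slope offsets.
n_interaction <- ncol(model.matrix(~ x * f, data = toy))

# Parallel slopes model: baseline intercept, one common slope, and four
# intercept offsets.
n_parallel <- ncol(model.matrix(~ x + f, data = toy))

c(interaction = n_interaction, parallel_slopes = n_parallel)
```

+<p>Under these assumptions, the interaction formula has 10 coefficients (enough to recover five intercepts and five slopes), while the parallel slopes formula has 6 (five intercepts and one common slope).</p>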
-<p>Is the additional complexity of the interaction model warranted? Looking at the left-hand plot Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a>, we’re of the opinion that it is, as evidenced by the slight x-like pattern to some of the lines. Therefore, we’ll focus the rest of this analysis only on the interaction model. This visual approach is somewhat subjective however, so feel free to disagree!</p>
-<p>What are the 5 different slopes and 5 different intercepts for the interaction model? We can obtain these values from the regression table. Recall our two-step process for getting the regression table:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-price_interaction &lt;-<span class="st"> </span><span class="kw">lm</span>(log10_price <span class="op">~</span><span class="st"> </span>log10_size <span class="op">*</span><span class="st"> </span>condition, 
-                        <span class="dt">data =</span> house_prices)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(price_interaction)</code></pre>
+<p>Is the additional complexity of the interaction model warranted? Looking at the left-hand plot in Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a>, we’re of the opinion that it is, as evidenced by the slight x-like pattern to some of the lines. Therefore, we’ll focus the rest of this analysis only on the interaction model. This visual approach is somewhat subjective, however, so feel free to disagree! What are the five different slopes and five different intercepts for the interaction model? We can obtain these values from the regression table. Recall our two-step process for getting the regression table:</p>
+<div class="sourceCode" id="cb455"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb455-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb455-2" data-line-number="2">price_interaction &lt;-<span class="st"> </span><span class="kw">lm</span>(log10_price <span class="op">~</span><span class="st"> </span>log10_size <span class="op">*</span><span class="st"> </span>condition, </a>
+<a class="sourceLine" id="cb455-3" data-line-number="3">                        <span class="dt">data =</span> house_prices)</a>
+<a class="sourceLine" id="cb455-4" data-line-number="4"></a>
+<a class="sourceLine" id="cb455-5" data-line-number="5"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb455-6" data-line-number="6"><span class="kw">get_regression_table</span>(price_interaction)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:seattle-interaction">TABLE 11.1: </span>Regression table for interaction model.
+<span id="tab:seattle-interaction">TABLE 11.1: </span>Regression table for interaction model
 </caption>
 <thead>
 <tr>
@@ -1116,44 +1137,77 @@ <h3><span class="header-section-number">11.1.3</span> Regression modeling</h3>
 </tr>
 </tbody>
 </table>
-<p>Recall we saw in Section <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a> how to interpret a regression table when there exist both numerical and categorical explanatory variables. Let’s now do the same for all 10 values in the <code>estimate</code> column of Table <a href="11-thinking-with-data.html#tab:seattle-interaction">11.1</a>.</p>
+<p>Recall we saw in Subsection <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a> how to interpret a regression table when there are both numerical and categorical explanatory variables. Let’s now do the same for all 10 values in the <code>estimate</code> column of Table <a href="11-thinking-with-data.html#tab:seattle-interaction">11.1</a>.</p>
 <p>In this case, the “baseline for comparison” group for the categorical variable <code>condition</code> is the condition 1 houses, since “1” comes first alphanumerically. Thus, the <code>intercept</code> and <code>log10_size</code> values are the intercept and slope for <code>log10_size</code> for this baseline group. Next, the <code>condition2</code> through <code>condition5</code> terms are the <em>offsets</em> in intercepts relative to the condition 1 intercept. Finally, the <code>log10_size:condition2</code> through <code>log10_size:condition5</code> terms are the <em>offsets</em> in slopes for <code>log10_size</code> relative to the condition 1 slope for <code>log10_size</code>.</p>
 <p>Let’s simplify this by writing out the equation of each of the five regression lines using these 10 <code>estimate</code> values. We’ll write out each line in the following format:</p>
 <p><span class="math display">\[
 \widehat{\log10(\text{price})} = \hat{\beta}_0 + \hat{\beta}_{\text{size}} \cdot \log10(\text{size})
 \]</span></p>
 <ol style="list-style-type: decimal">
-<li><p>Condition 1: <span class="math inline">\(\widehat{\log10(\text{price})} = 3.33 + 0.69 \cdot \log10(\text{size})\)</span></p></li>
-<li><p>Condition 2: <span class="math inline">\(\widehat{\log10(\text{price})} = (3.33 + 0.047) + (0.69 - 0.024) \cdot \log10(\text{size}) = 3.38 + 0.666 \cdot \log10(\text{size})\)</span></p></li>
-<li><p>Condition 3: <span class="math inline">\(\widehat{\log10(\text{price})} = (3.33 - 0.367) + (0.69 + 0.133) \cdot \log10(\text{size}) = 2.96 + 0.823 \cdot \log10(\text{size})\)</span></p></li>
-<li><p>Condition 4: <span class="math inline">\(\widehat{\log10(\text{price})} = (3.33 - 0.398) + (0.69 + 0.146) \cdot \log10(\text{size}) = 2.93 + 0.836 \cdot \log10(\text{size})\)</span></p></li>
-<li><p>Condition 5: <span class="math inline">\(\widehat{\log10(\text{price})} = (3.33 - 0.883) + (0.69 + 0.31) \cdot \log10(\text{size}) = 2.45 + 1 \cdot \log10(\text{size})\)</span></p></li>
+<li>Condition 1:</li>
 </ol>
-<p>These correspond to the regression lines in the left-hand plot of Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a> and the faceted plot in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-2">11.7</a>. For homes of all 5 condition types, as the size of the house increases, the price increases. This is what most would expect. However, the rate of increase of price with size is fastest for the homes with condition 3, 4, and 5 of 0.823, 0.836, and 1 respectively. These are the three largest slopes out of the five.</p>
+<p><span class="math display">\[\widehat{\log10(\text{price})} = 3.33 + 0.69 \cdot \log10(\text{size})\]</span></p>
+<ol start="2" style="list-style-type: decimal">
+<li>Condition 2:</li>
+</ol>
+<p><span class="math display">\[
+\begin{aligned} 
+\widehat{\log10(\text{price})} &amp;= (3.33 + 0.047) + (0.69 - 0.024) \cdot \log10(\text{size}) \\ 
+                               &amp;= 3.377 + 0.666 \cdot \log10(\text{size})
+\end{aligned}
+\]</span></p>
+<ol start="3" style="list-style-type: decimal">
+<li>Condition 3:</li>
+</ol>
+<p><span class="math display">\[
+\begin{aligned} 
+\widehat{\log10(\text{price})} &amp;= (3.33 - 0.367) + (0.69 + 0.133) \cdot \log10(\text{size}) \\
+                               &amp;= 2.963 + 0.823 \cdot \log10(\text{size})
+\end{aligned}
+\]</span></p>
+<ol start="4" style="list-style-type: decimal">
+<li>Condition 4:</li>
+</ol>
+<p><span class="math display">\[
+\begin{aligned}
+\widehat{\log10(\text{price})} &amp;= (3.33 - 0.398) + (0.69 + 0.146) \cdot \log10(\text{size}) \\
+                               &amp;= 2.932 + 0.836 \cdot \log10(\text{size})
+\end{aligned}
+\]</span></p>
+<ol start="5" style="list-style-type: decimal">
+<li>Condition 5:</li>
+</ol>
+<p><span class="math display">\[
+\begin{aligned}
+\widehat{\log10(\text{price})} &amp;= (3.33 - 0.883) + (0.69 + 0.31) \cdot \log10(\text{size}) \\
+                               &amp;= 2.447 + 1 \cdot \log10(\text{size})
+\end{aligned}
+\]</span></p>
+<p>These correspond to the regression lines in the left-hand plot of Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a> and the faceted plot in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-2">11.7</a>. For homes of all five condition types, as the size of the house increases, the price increases. This is what most would expect. However, the rate of increase of price with size is fastest for the homes with conditions 3, 4, and 5, whose slopes are 0.823, 0.836, and 1, respectively. These are the three largest slopes out of the five.</p>
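+<p>The arithmetic behind the five equations can also be sketched in a few lines of R. This is only an illustration using the rounded estimates quoted in the equations above. The object names (<code>baseline_intercept</code>, <code>intercept_offsets</code>, <code>condition_lines</code>, and so on) are hypothetical and are not part of the <code>moderndive</code> package.</p>

```r
# Rounded estimates as used in the equations above (Table 11.1).
baseline_intercept <- 3.33
baseline_slope <- 0.69

# Offsets relative to condition 1; condition 1 itself has offset 0.
intercept_offsets <- c(0, 0.047, -0.367, -0.398, -0.883)
slope_offsets <- c(0, -0.024, 0.133, 0.146, 0.31)

# Hypothetical helper table holding the five regression lines.
condition_lines <- data.frame(
  condition = 1:5,
  intercept = baseline_intercept + intercept_offsets,
  slope = baseline_slope + slope_offsets
)
condition_lines
```

+<p>Each row adds that condition’s offsets to the condition 1 baseline, so the <code>intercept</code> and <code>slope</code> columns should match the simplified equations for conditions 1 through 5.</p>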
 </div>
 <div id="house-prices-making-predictions" class="section level3">
-<h3><span class="header-section-number">11.1.4</span> Making predictions</h3>
+<h3><span class="header-section-number">11.2.4</span> Making predictions</h3>
 <p>Say you’re a realtor and someone calls you asking you how much their home will sell for. They tell you that it’s in condition = 5 and is sized 1900 square feet. What do you tell them? Let’s use the interaction model we fit to make predictions!</p>
 <p>We first make this prediction visually in Figure <a href="11-thinking-with-data.html#fig:house-price-interaction-3">11.8</a>. The predicted <code>log10_price</code> of this house is marked with a black dot. This is where the following two lines intersect:</p>
 <ul>
-<li>The purple regression line for the condition = 5 homes and</li>
-<li>The vertical dashed black line at <code>log10_size</code> equals 3.28, since our predictor variable is the log10-transformed square feet of living space of <span class="math inline">\(\log10(1900) = 3.28\)</span> .</li>
+<li>The regression line for the condition = 5 homes and</li>
+<li>The vertical dashed black line where <code>log10_size</code> equals 3.28, since our predictor variable is the log10 transformed square feet of living space and <span class="math inline">\(\log10(1900) = 3.28\)</span>.</li>
 </ul>
 <div class="figure" style="text-align: center"><span id="fig:house-price-interaction-3"></span>
-<img src="moderndive_files/figure-html/house-price-interaction-3-1.png" alt="Interaction model with prediction." width="\textwidth" />
+<img src="ModernDive_files/figure-html/house-price-interaction-3-1.png" alt="Interaction model with prediction." width="\textwidth" />
 <p class="caption">
 FIGURE 11.8: Interaction model with prediction.
 </p>
 </div>
 <p>Eyeballing it, the predicted <code>log10_price</code> seems to be around 5.75. Let’s now obtain the exact numerical value for the prediction using the equation of the regression line for the condition = 5 houses, being sure to <code>log10()</code> the square footage first.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="fl">2.45</span> <span class="op">+</span><span class="st"> </span><span class="dv">1</span> <span class="op">*</span><span class="st"> </span><span class="kw">log10</span>(<span class="dv">1900</span>)</code></pre>
+<div class="sourceCode" id="cb456"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb456-1" data-line-number="1"><span class="fl">2.45</span> <span class="op">+</span><span class="st"> </span><span class="dv">1</span> <span class="op">*</span><span class="st"> </span><span class="kw">log10</span>(<span class="dv">1900</span>)</a></code></pre></div>
 <pre><code>[1] 5.73</code></pre>
-<p>This value is very close to our earlier visually made prediction of 5.75. But wait! Is our prediction for the price of this house $5.75? No! Remember that we are using <code>log10_price</code> as our outcome variable! So if we want a prediction in dollar units of <code>price</code>, we need to un-log this by taking a power of 10 as described in Appendix <a href="A-appendixA.html#appendix-log10-transformations">A.3</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">10</span><span class="op">^</span>(<span class="fl">2.45</span> <span class="op">+</span><span class="st"> </span><span class="dv">1</span> <span class="op">*</span><span class="st"> </span><span class="kw">log10</span>(<span class="dv">1900</span>))</code></pre>
+<p>This value is very close to our earlier visually made prediction of 5.75. But wait! Is our prediction for the price of this house $5.75? No! Remember that we are using <code>log10_price</code> as our outcome variable! So, if we want a prediction in dollar units of <code>price</code>, we need to unlog this by taking a power of 10 as described in Appendix <a href="A-appendixA.html#appendix-log10-transformations">A.3</a>.</p>
+<div class="sourceCode" id="cb458"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb458-1" data-line-number="1"><span class="dv">10</span><span class="op">^</span>(<span class="fl">2.45</span> <span class="op">+</span><span class="st"> </span><span class="dv">1</span> <span class="op">*</span><span class="st"> </span><span class="kw">log10</span>(<span class="dv">1900</span>))</a></code></pre></div>
 <pre><code>[1] 535493</code></pre>
-<p>So we our predicted price for this home of condition 5 and size 1900 square feet is $535,493.</p>
+<p>So our predicted price for this home of condition 5 and of size 1900 square feet is $535,493.</p>
 <!--
-TODO: 
+TODO: Inference for regression for Seattle house prices
 
 ### Inference for regression {#house-prices-inference-for-regression}
 
@@ -1170,28 +1224,28 @@ <h3><span class="header-section-number">11.1.4</span> Making predictions</h3>
 <!--
 **` paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Check that the LINE conditions are met for inference to be made in this Seattle house prices example.
 -->
-<p><strong>(LC11.1)</strong> Repeat the regression modeling in Subsection <a href="11-thinking-with-data.html#house-prices-regression">11.1.3</a> and the prediction making you just did on the house of condition 5 and size 1900 square feet in Subsection <a href="11-thinking-with-data.html#house-prices-making-predictions">11.1.4</a>, but using the parallel slopes model you visualized in Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a>. Hint: it’s $524,807!</p>
+<p><strong>(LC11.1)</strong> Repeat the regression modeling in Subsection <a href="11-thinking-with-data.html#house-prices-regression">11.2.3</a> and the prediction making you just did on the house of condition 5 and size 1900 square feet in Subsection <a href="11-thinking-with-data.html#house-prices-making-predictions">11.2.4</a>, but using the parallel slopes model you visualized in Figure <a href="11-thinking-with-data.html#fig:house-price-parallel-slopes">11.6</a>. Show that it’s $524,807!</p>
 <div class="learncheck">
 
 </div>
 </div>
 </div>
 <div id="data-journalism" class="section level2">
-<h2><span class="header-section-number">11.2</span> Case study: Effective data storytelling</h2>
-<p>As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, you’ve seen the value of statistical inference as a process to come to conclusions about a population by using sampling. Lastly, you’ve explored how to fit linear regression model and the importance of checking the conditions required so that all confidence intervals and hypothesis tests have valid interpretation. All throughout, you’ve learned many computational techniques and focused on writing R code that’s reproducible.</p>
+<h2><span class="header-section-number">11.3</span> Case study: Effective data storytelling</h2>
+<p>As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, you’ve seen the value of statistical inference as a process to come to conclusions about a population by using sampling. Lastly, you’ve explored how to fit linear regression models and the importance of checking the conditions required so that all confidence intervals and hypothesis tests have valid interpretation. All throughout, you’ve learned many computational techniques and focused on writing R code that’s reproducible.</p>
 <p>We now present another set of case studies, but this time on the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather engulf them in understanding the importance that data plays in our lives through storytelling.</p>
 <div id="bechdel-test-for-hollywood-gender-representation" class="section level3">
-<h3><span class="header-section-number">11.2.1</span> Bechdel test for Hollywood gender representation</h3>
-<p>We recommend you read and analyze Walt Hickey’s FiveThirtyEight.com article <a href="http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/">“The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.”</a> In it, Walt Hickey did a study across several decades of how many movies pass the <a href="https://bechdeltest.com/">Bechdel test</a>, an informal test of gender representation in a movie created by Alison Bechdel.</p>
-<p>As you read over the article, think carefully about how Walt is using data, graphics, and analyses to tell the reader a story. In the spirit of reproducibility, FiveThirtyEight has also shared the <a href="https://github.com/fivethirtyeight/data/tree/master/bechdel">data and R code</a> that they used for this article. You can also find the data used in many more of their articles on their <a href="https://github.com/fivethirtyeight/data">GitHub</a> page.</p>
-<p>ModernDive co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the <code>fivethirtyeight</code> package which provides access to these datasets. For a complete list of all 107 datasets included in the <code>fivethirtyeight</code> package, check out the package webpage at <a href="https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html" class="uri">https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html</a>.</p>
+<h3><span class="header-section-number">11.3.1</span> Bechdel test for Hollywood gender representation</h3>
+<p>We recommend you read and analyze Walt Hickey’s FiveThirtyEight.com article, <a href="http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/">“The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.”</a> In it, Walt completed a multidecade study of how many movies pass the <a href="https://bechdeltest.com/">Bechdel test</a>, an informal test of gender representation in a movie that was created by Alison Bechdel.</p>
+<p>As you read over the article, think carefully about how Walt Hickey is using data, graphics, and analyses to tell the reader a story. In the spirit of reproducibility, FiveThirtyEight have also shared the <a href="https://github.com/fivethirtyeight/data/tree/master/bechdel">data and R code</a> that they used for this article. You can also find the data used in many more of their articles on their <a href="https://github.com/fivethirtyeight/data">GitHub</a> page.</p>
+<p><em>ModernDive</em> co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the <code>fivethirtyeight</code> package, which provides easier access to these datasets in R. For a complete list of all 127 datasets included in the <code>fivethirtyeight</code> package, check out the package webpage at <a href="https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html" class="uri">https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html</a>.</p>
 <p>Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using <code>dplyr</code>, <code>ggplot2</code>, and other packages in the <code>tidyverse</code> are available <a href="https://fivethirtyeight-r.netlify.com/articles/">here</a>. For example, a vignette showing how to reproduce one of the plots at the end of the article on the Bechdel test is available <a href="https://fivethirtyeight-r.netlify.com/articles/bechdel.html">here</a>.</p>
 </div>
 <div id="us-births-in-1999" class="section level3">
-<h3><span class="header-section-number">11.2.2</span> US Births in 1999</h3>
-<p>Here is another example involving the <code>US_births_1994_2003</code> data frame included in the <code>fivethirtyeight</code> package. This data provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame including a link to the original article on FiveThirtyEight.com, check out the help file by running <code>?US_births_1994_2003</code> in the console.</p>
+<h3><span class="header-section-number">11.3.2</span> US Births in 1999</h3>
+<p>The <code>US_births_1994_2003</code> data frame included in the <code>fivethirtyeight</code> package provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame including a link to the original article on FiveThirtyEight.com, check out the help file by running <code>?US_births_1994_2003</code> in the console.</p>
 <p>It’s always a good idea to preview your data, either by using RStudio’s spreadsheet <code>View()</code> function or using <code>glimpse()</code> from the <code>dplyr</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(US_births_<span class="dv">1994</span>_<span class="dv">2003</span>)</code></pre>
+<div class="sourceCode" id="cb460"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb460-1" data-line-number="1"><span class="kw">glimpse</span>(US_births_<span class="dv">1994</span>_<span class="dv">2003</span>)</a></code></pre></div>
 <pre><code>Observations: 3,652
 Variables: 6
 $ year          &lt;int&gt; 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1…
@@ -1201,22 +1255,24 @@ <h3><span class="header-section-number">11.2.2</span> US Births in 1999</h3>
 $ day_of_week   &lt;ord&gt; Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu…
 $ births        &lt;int&gt; 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79…</code></pre>
 <p>We’ll focus on the number of <code>births</code> for each <code>date</code>, but only for births that occurred in 1999. Recall from Section <a href="3-wrangling.html#filter">3.2</a> we can do this using the <code>filter()</code> function from the <code>dplyr</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r">US_births_<span class="dv">1999</span> &lt;-<span class="st"> </span>US_births_<span class="dv">1994</span>_<span class="dv">2003</span> <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">filter</span>(year <span class="op">==</span><span class="st"> </span><span class="dv">1999</span>)</code></pre>
-<p>As discussed in Section <a href="2-viz.html#linegraphs">2.4</a>, since <code>date</code> is a notion of time and thus has sequential ordering to it, a linegraph would be a more appropriate visualization to use than a scatterplot. In other words, we should use a <code>geom_line()</code> instead of <code>geom_point()</code>. Recall that such plots are called <em>time series</em> plots.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(US_births_<span class="dv">1999</span>, <span class="kw">aes</span>(<span class="dt">x =</span> date, <span class="dt">y =</span> births)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_line</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Data&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Number of births&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;US Births in 1999&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb462"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb462-1" data-line-number="1">US_births_<span class="dv">1999</span> &lt;-<span class="st"> </span>US_births_<span class="dv">1994</span>_<span class="dv">2003</span> <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb462-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(year <span class="op">==</span><span class="st"> </span><span class="dv">1999</span>)</a></code></pre></div>
+<p>As discussed in Section <a href="2-viz.html#linegraphs">2.4</a>, since <code>date</code> is a notion of time and thus has sequential ordering to it, a linegraph would be a more appropriate visualization to use than a scatterplot. In other words, we should use a <code>geom_line()</code> instead of <code>geom_point()</code>. Recall that such plots are called <em>time series</em> plots.</p>
+<div class="sourceCode" id="cb463"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb463-1" data-line-number="1"><span class="kw">ggplot</span>(US_births_<span class="dv">1999</span>, <span class="kw">aes</span>(<span class="dt">x =</span> date, <span class="dt">y =</span> births)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb463-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_line</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb463-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Date&quot;</span>, </a>
+<a class="sourceLine" id="cb463-4" data-line-number="4">       <span class="dt">y =</span> <span class="st">&quot;Number of births&quot;</span>, </a>
+<a class="sourceLine" id="cb463-5" data-line-number="5">       <span class="dt">title =</span> <span class="st">&quot;US Births in 1999&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:us-births"></span>
-<img src="moderndive_files/figure-html/us-births-1.png" alt="Number of births in US in 1999." width="\textwidth" />
+<img src="ModernDive_files/figure-html/us-births-1.png" alt="Number of births in the US in 1999." width="\textwidth" />
 <p class="caption">
-FIGURE 11.9: Number of births in US in 1999.
+FIGURE 11.9: Number of births in the US in 1999.
 </p>
 </div>
-<p>We see a big dip occurring just before January 1st, 2000, mostly likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike?</p>
+<p>We see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike?</p>
 <p>Let’s sort the rows of <code>US_births_1999</code> in descending order of the number of births. Recall from Section <a href="3-wrangling.html#arrange">3.6</a> that we can use the <code>arrange()</code> function from the <code>dplyr</code> package to do this, making sure to sort <code>births</code> in <code>desc</code>ending order:</p>
-<pre class="sourceCode r"><code class="sourceCode r">US_births_<span class="dv">1999</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(births))</code></pre>
+<div class="sourceCode" id="cb464"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb464-1" data-line-number="1">US_births_<span class="dv">1999</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb464-2" data-line-number="2"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(births))</a></code></pre></div>
 <pre><code># A tibble: 365 x 6
     year month date_of_month date       day_of_week births
    &lt;int&gt; &lt;int&gt;         &lt;int&gt; &lt;date&gt;     &lt;ord&gt;        &lt;int&gt;
@@ -1241,17 +1297,122 @@ <h3><span class="header-section-number">11.2.2</span> US Births in 1999</h3>
 <div class="learncheck">
 
 </div>
-<p>Time to think with data and further tell the story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? We leave these questions to you as the reader to explore and examine. Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with!</p>
+<p>Time to think with data and further tell your story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? What assumptions did you make in this analysis? We leave these questions to you as the reader to explore and examine.</p>
+<p>Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with!</p>
+<p>Please check out additional problem sets and labs at <a href="https://moderndive.com/labs" class="uri">https://moderndive.com/labs</a> as well.</p>
 </div>
-<div id="script-of-r-code" class="section level3">
-<h3><span class="header-section-number">11.2.3</span> Script of R code</h3>
-<p>An R script file of all R code used in this chapter is available <a href="scripts/11-tell-the-story-with-data.R">here</a>.</p>
+<div id="scripts-of-r-code" class="section level3">
+<h3><span class="header-section-number">11.3.3</span> Scripts of R code</h3>
+<p>An R script file of all R code used in this chapter is available <a href="scripts/11-tell-your-story-with-data.R">here</a>.</p>
+<p>R scripts (saved as *.R files) for all relevant chapters of the book are listed in the following table.</p>
+<table>
+<thead>
+<tr>
+<th style="text-align:right;">
+chapter
+</th>
+<th style="text-align:left;">
+link
+</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:right;">
+1
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/01-getting-started.R" class="uri">https://moderndive.com/scripts/01-getting-started.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+2
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/02-visualization.R" class="uri">https://moderndive.com/scripts/02-visualization.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+3
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/03-wrangling.R" class="uri">https://moderndive.com/scripts/03-wrangling.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+4
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/04-tidy.R" class="uri">https://moderndive.com/scripts/04-tidy.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+5
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/05-regression.R" class="uri">https://moderndive.com/scripts/05-regression.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+6
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/06-multiple-regression.R" class="uri">https://moderndive.com/scripts/06-multiple-regression.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+7
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/07-sampling.R" class="uri">https://moderndive.com/scripts/07-sampling.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+8
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/08-confidence-intervals.R" class="uri">https://moderndive.com/scripts/08-confidence-intervals.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+9
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/09-hypothesis-testing.R" class="uri">https://moderndive.com/scripts/09-hypothesis-testing.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+10
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/10-inference-for-regression.R" class="uri">https://moderndive.com/scripts/10-inference-for-regression.R</a>
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+11
+</td>
+<td style="text-align:left;">
+<a href="https://moderndive.com/scripts/11-tell-your-story-with-data.R" class="uri">https://moderndive.com/scripts/11-tell-your-story-with-data.R</a>
+</td>
+</tr>
+</tbody>
+</table>
 </div>
 </div>
 <div id="concluding-remarks" class="section level2 unnumbered">
 <h2>Concluding remarks</h2>
-<p>Now that you’ve made it to this point in the book, we suspect that you know a thing or two about how to work with data in R! You’ve also gained a lot of knowledge about how to use simulation techniques for statistical inference and how these techniques help build intuition about traditional theory-based inferential methods like the <span class="math inline">\(t\)</span>-test.</p>
-<p>The hope is that you’ve come to appreciate the power of data in all respects, such as data wrangling, tidying datasets, and data visualization, data modeling, and statistical inference. In our opinion, however, data visualization may be the most important tool for a data scientist to have in their toolbox. If you can create truly beautiful graphics that display information in ways that the reader can clearly understand, you have great power to tell your tale with data. Let’s hope that these skills help you tell great stories with data into the future. Thanks for coming along this journey as we dove into modern data analysis using R and the <code>tidyverse</code>!</p>
+<p>Now that you’ve made it to this point in the book, we suspect that you know a thing or two about how to work with data in R! You’ve also gained a lot of knowledge about how to use simulation-based techniques for statistical inference and how these techniques help build intuition about traditional theory-based inferential methods like the <span class="math inline">\(t\)</span>-test.</p>
+<p>The hope is that you’ve come to appreciate the power of data in all respects, such as data wrangling, tidying datasets, data visualization, data modeling, and statistical inference. In our opinion, while each of these is important, data visualization may be the most important tool for a citizen or professional data scientist to have in their toolbox. If you can create truly beautiful graphics that display information in ways that the reader can clearly understand, you have great power to tell your tale with data. Let’s hope that these skills help you tell great stories with data into the future. Thanks for coming along this journey as we dove into modern data analysis using R and the <code>tidyverse</code>!</p>
 
 
 </div>
@@ -1270,11 +1431,13 @@ <h2>Concluding remarks</h2>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1282,12 +1445,11 @@ <h2>Concluding remarks</h2>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1295,13 +1457,17 @@ <h2>Concluding remarks</h2>
 "size": 2
 },
 "edit": {
-"link": "https://github.com/moderndive/moderndive_book/edit/master/11-tell-the-story-with-data.Rmd",
+"link": "https://github.com/moderndive/moderndive_book/edit/master/11-tell-your-story-with-data.Rmd",
 "text": "Edit"
 },
 "history": {
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1318,8 +1484,9 @@ <h2>Concluding remarks</h2>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/2-viz.html b/docs/2-viz.html
index d5500c668..2b7191abe 100644
--- a/docs/2-viz.html
+++ b/docs/2-viz.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 2 Data Visualization | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 2 Data Visualization | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 2 Data Visualization | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="1-getting-started.html">
-<link rel="next" href="3-wrangling.html">
+<link rel="prev" href="1-getting-started.html"/>
+<link rel="next" href="3-wrangling.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,39 +583,39 @@ <h1>
 </html>
 <div id="viz" class="section level1">
 <h1><span class="header-section-number">Chapter 2</span> Data Visualization</h1>
-<p>We begin the development of your data science toolbox with data visualization. By visualizing data, we gain valuable insights that we couldn’t initially obtain from just looking at the raw data values. We’ll use the <code>ggplot2</code> package as it provides an easy way to customize your plots. <code>ggplot2</code> is rooted in the data visualization theory known as <em>The Grammar of Graphics</em> <span class="citation">(Wilkinson <a href="#ref-wilkinson2005">2005</a>)</span>, developed by Leland Wilkinson. </p>
-<p>At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the presence of <em>outliers</em>, <em>distributions</em> of individual variables, and <em>relationships</em> between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does however require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms your audience.</p>
-<p>As we will see, plots  also help us to identify patterns and outliers in our data. We’ll see that a common extension of these ideas is to compare the <em>distribution</em>  of one quantitative variable (i.e., what the spread of a variable looks like or how the variable is <em>distributed</em> in terms of its values) as we go across the levels of a different categorical variable.</p>
+<p>We begin the development of your data science toolbox with data visualization. By visualizing data, we gain valuable insights we couldn’t initially obtain from just looking at the raw data values. We’ll use the <code>ggplot2</code> package, as it provides an easy way to customize your plots. <code>ggplot2</code> is rooted in the data visualization theory known as <em>the grammar of graphics</em> <span class="citation">(Wilkinson <a href="#ref-wilkinson2005">2005</a>)</span>, developed by Leland Wilkinson. </p>
+<p>At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the presence of <em>outliers</em>, <em>distributions</em> of individual variables, and <em>relationships</em> between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does, however, require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms your audience.</p>
+<p>As we will see, plots also help us to identify patterns and outliers in our data. We’ll see that a common extension of these ideas is to compare the <em>distribution</em> of one numerical variable, that is, the center and spread of its values, as we go across the levels of a different categorical variable.</p>
 <div id="needed-packages" class="section level3 unnumbered">
 <h3>Needed packages</h3>
 <p>Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(nycflights13)
-<span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(dplyr)</code></pre>
+<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" data-line-number="1"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb14-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb14-3" data-line-number="3"><span class="kw">library</span>(dplyr)</a></code></pre></div>
 </div>
 <div id="grammarofgraphics" class="section level2">
-<h2><span class="header-section-number">2.1</span> The Grammar of Graphics</h2>
-<p>We begin with a discussion of a theoretical framework for data visualization known as “The Grammar of Graphics.” This framework serves as the foundation for the  <code>ggplot2</code> package which we’ll use extensively in this chapter.  Think of how we construct sentences in English to form sentences by combining different elements, like nouns, verbs, particles, subjects, objects, etc. We can’t just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, “The Grammar of Graphics” defines a set of rules for constructing <em>statistical graphics</em> by combining different types of <em>layers</em>. This grammar was created by Leland Wilkinson <span class="citation">(Wilkinson <a href="#ref-wilkinson2005">2005</a>)</span> and has been implemented in a variety of data visualization software platforms like R, but also <a href="https://plot.ly/">Plotly</a> and <a href="https://www.tableau.com/">Tableau</a>.</p>
+<h2><span class="header-section-number">2.1</span> The grammar of graphics</h2>
+<p>We start with a discussion of a theoretical framework for data visualization known as “the grammar of graphics.” This framework serves as the foundation for the  <code>ggplot2</code> package which we’ll use extensively in this chapter.  Think of how we construct and form sentences in English by combining different elements, like nouns, verbs, articles, subjects, objects, etc. We can’t just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, “the grammar of graphics” defines a set of rules for constructing <em>statistical graphics</em> by combining different types of <em>layers</em>. This grammar was created by Leland Wilkinson <span class="citation">(Wilkinson <a href="#ref-wilkinson2005">2005</a>)</span> and has been implemented in a variety of data visualization software platforms like R, but also <a href="https://plot.ly/">Plotly</a> and <a href="https://www.tableau.com/">Tableau</a>.</p>
 <div id="components-of-the-grammar" class="section level3">
-<h3><span class="header-section-number">2.1.1</span> Components of the Grammar</h3>
+<h3><span class="header-section-number">2.1.1</span> Components of the grammar</h3>
 <p>In short, the grammar tells us that:</p>
 <blockquote>
 <p><strong>A statistical graphic is a <code>mapping</code> of <code>data</code> variables to <code>aes</code>thetic attributes of <code>geom</code>etric objects.</strong></p>
 </blockquote>
 <p>Specifically, we can break a graphic into the following three essential components:</p>
 <ol style="list-style-type: decimal">
-<li><code>data</code>: the data set containing the variables of interest.</li>
+<li><code>data</code>: the dataset containing the variables of interest.</li>
 <li><code>geom</code>: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.</li>
-<li><code>aes</code>: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are <em>mapped</em> to variables in the data set.</li>
+<li><code>aes</code>: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are <em>mapped</em> to variables in the dataset.</li>
 </ol>
 <p>You might be wondering why we wrote the terms <code>data</code>, <code>geom</code>, and <code>aes</code> in a computer code type font. We’ll see very shortly that we’ll specify the elements of the grammar in R using these terms. However, let’s first break down the grammar with an example.</p>
 </div>
 <div id="gapminder" class="section level3">
 <h3><span class="header-section-number">2.1.2</span> Gapminder data</h3>
-<p>In February 2006, a statistician named Hans Rosling gave a TED talk titled <a href="https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen">“The best stats you’ve ever seen”</a> where he presented global economic, health, and development data from the website <a href="http://www.gapminder.org/tools/#_locale_id=en;&amp;chart-type=bubbles">gapminder.org</a>. For example, for data on 142 countries in 2007, let’s consider only 6 countries in Table <a href="2-viz.html#tab:gapminder-2007">2.1</a>.</p>
+<p>In February 2006, a Swedish physician and data advocate named Hans Rosling gave a TED talk titled <a href="https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen">“The best stats you’ve ever seen”</a> where he presented global economic, health, and development data from the website <a href="http://www.gapminder.org/tools/#_locale_id=en;&amp;chart-type=bubbles">gapminder.org</a>. For example, for data on 142 countries in 2007, let’s consider only a few countries in Table <a href="2-viz.html#tab:gapminder-2007">2.1</a> as a peek into the data.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:gapminder-2007">TABLE 2.1: </span>Gapminder 2007 Data: First 6 of 142 countries
+<span id="tab:gapminder-2007">TABLE 2.1: </span>Gapminder 2007 Data: First 3 of 142 countries
 </caption>
 <thead>
 <tr>
@@ -675,57 +688,6 @@ <h3><span class="header-section-number">2.1.2</span> Gapminder data</h3>
 6223
 </td>
 </tr>
-<tr>
-<td style="text-align:left;">
-Angola
-</td>
-<td style="text-align:left;">
-Africa
-</td>
-<td style="text-align:right;">
-42.7
-</td>
-<td style="text-align:right;">
-12420476
-</td>
-<td style="text-align:right;">
-4797
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-Argentina
-</td>
-<td style="text-align:left;">
-Americas
-</td>
-<td style="text-align:right;">
-75.3
-</td>
-<td style="text-align:right;">
-40301927
-</td>
-<td style="text-align:right;">
-12779
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-Australia
-</td>
-<td style="text-align:left;">
-Oceania
-</td>
-<td style="text-align:right;">
-81.2
-</td>
-<td style="text-align:right;">
-20434176
-</td>
-<td style="text-align:right;">
-34435
-</td>
-</tr>
 </tbody>
 </table>
 <p>Each row in this table corresponds to a country in 2007. For each row, we have 5 columns:</p>
@@ -736,12 +698,12 @@ <h3><span class="header-section-number">2.1.2</span> Gapminder data</h3>
 <li><strong>Population</strong>: Number of people living in the country.</li>
 <li><strong>GDP per Capita</strong>: Gross domestic product (in US dollars).</li>
 </ol>
-<p>Now consider Figure <a href="2-viz.html#fig:gapminder">2.1</a>, which plots this data for all 142 countries in the data.</p>
+<p>Now consider Figure <a href="2-viz.html#fig:gapminder">2.1</a>, which plots this data for all 142 countries.</p>
 <!--
 Note that R will deal with large numbers using scientific notation.  So in the legend for "Population", 1.25e+09 is 1.25 $\times$ 10^9^ = 1,250,000,000 = 1.25 billion. 
 -->
 <div class="figure" style="text-align: center"><span id="fig:gapminder"></span>
-<img src="moderndive_files/figure-html/gapminder-1.png" alt="Life expectancy over GDP per capita in 2007." width="\textwidth" />
+<img src="ModernDive_files/figure-html/gapminder-1.png" alt="Life expectancy over GDP per capita in 2007." width="\textwidth" />
 <p class="caption">
 FIGURE 2.1: Life expectancy over GDP per capita in 2007.
 </p>
@@ -754,10 +716,10 @@ <h3><span class="header-section-number">2.1.2</span> Gapminder data</h3>
 <li>The <code>data</code> variable <strong>Continent</strong> gets mapped to the <code>color</code> <code>aes</code>thetic of the points.</li>
 </ol>
 <p>We’ll see shortly that <code>data</code> corresponds to the particular data frame where our data is saved and that “data variables” correspond to particular columns in the data frame. Furthermore, the type of <code>geom</code>etric object  considered in this plot are points. That being said, while in this example we are considering points, graphics are not limited to just points. We can also use lines, bars, and other geometric objects.</p>
-<p>Let’s summarize the three essential components of the Grammar in Table <a href="2-viz.html#tab:summary-table-gapminder">2.2</a>.</p>
+<p>Let’s summarize the three essential components of the grammar in Table <a href="2-viz.html#tab:summary-table-gapminder">2.2</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:summary-table-gapminder">TABLE 2.2: </span>Summary of Grammar of Graphics for this plot
+<span id="tab:summary-table-gapminder">TABLE 2.2: </span>Summary of the grammar of graphics for this plot
 </caption>
 <thead>
 <tr>
@@ -822,7 +784,7 @@ <h3><span class="header-section-number">2.1.2</span> Gapminder data</h3>
 </div>
 <div id="other-components" class="section level3">
 <h3><span class="header-section-number">2.1.3</span> Other components</h3>
-<p>There are other components of the Grammar of Graphics we can control as well. As you start to delve deeper into the Grammar of Graphics, you’ll start to encounter these topics more frequently. In this book, we’ll keep things simple and only work with these two additional components:</p>
+<p>There are other components of the grammar of graphics we can control as well. As you start to delve deeper into the grammar of graphics, you’ll start to encounter these topics more frequently. In this book, we’ll keep things simple and only work with these two additional components:</p>
 <ul>
 <li><code>facet</code>ing breaks up a plot into several plots split by the values of another variable (Section <a href="2-viz.html#facets">2.6</a>) </li>
 <li><code>position</code> adjustments for barplots (Section <a href="2-viz.html#geombar">2.8</a>) 
@@ -834,22 +796,22 @@ <h3><span class="header-section-number">2.1.3</span> Other components</h3>
 - `stat`istical transformations: this includes smoothing, binning values into a histogram, or no transformation at all (known as the `"identity"` transformation).
 --></li>
 </ul>
-<p>Other more complex components like <code>scales</code> and <code>coord</code>inate systems are left for a more advanced text such as <a href="http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings">R for Data Science</a> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span>. Generally speaking, the Grammar of Graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them.</p>
+<p>Other more complex components like <code>scales</code> and <code>coord</code>inate systems are left for a more advanced text such as <a href="http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings"><em>R for Data Science</em></a> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span>. Generally speaking, the grammar of graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them.</p>
 </div>
 <div id="ggplot2-package" class="section level3">
 <h3><span class="header-section-number">2.1.4</span> ggplot2 package</h3>
-<p>In this book, we will use the <code>ggplot2</code> package for data visualization, which is an implementation of the Grammar of Graphics for R <span class="citation">(Wickham, Chang, et al. <a href="#ref-R-ggplot2">2019</a>)</span>. As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the Grammar of Graphics are specified in the <code>ggplot()</code>  function included in the <code>ggplot2</code> package. The <code>ggplot()</code> function expects the following arguments (i.e. inputs) at a minimum:</p>
+<p>In this book, we will use the <code>ggplot2</code> package for data visualization, which is an implementation of the <code>g</code>rammar of <code>g</code>raphics for R <span class="citation">(Wickham, Chang, et al. <a href="#ref-R-ggplot2">2019</a>)</span>. As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the grammar of graphics are specified in the <code>ggplot()</code>  function included in the <code>ggplot2</code> package. For the purposes of this book, we’ll always provide the <code>ggplot()</code> function with the following arguments (i.e., inputs) at a minimum:</p>
 <ul>
 <li>The data frame where the variables exist: the <code>data</code> argument.</li>
 <li>The mapping of the variables to aesthetic attributes: the <code>mapping</code> argument which specifies the <code>aes</code>thetic attributes involved.</li>
 </ul>
 <p>After we’ve specified these components, we then add <em>layers</em> to the plot using the <code>+</code> sign. The most essential layer to add to a plot is the layer that specifies which type of <code>geom</code>etric object we want the plot to involve: points, lines, bars, and others. Other layers we can add to a plot include the plot title, axes labels, visual themes for the plots, and facets (which we’ll see in Section <a href="2-viz.html#facets">2.6</a>).</p>
-<p>Let’s now put the theory of the Grammar of Graphics into practice.</p>
+<p>Let’s now put the theory of the grammar of graphics into practice.</p>
 </div>
 </div>
 <div id="FiveNG" class="section level2">
-<h2><span class="header-section-number">2.2</span> Five Named Graphs - The 5NG</h2>
-<p>In order to keep things simple in this book, we will only focus on five different types of graphics in this book, each with a commonly given name. We term these “five named graphs” the <strong>5NG</strong>: </p>
+<h2><span class="header-section-number">2.2</span> Five named graphs - the 5NG</h2>
+<p>In order to keep things simple in this book, we will only focus on five different types of graphics, each with a commonly given name. We term these the “five named graphs” or, in abbreviated form, the <strong>5NG</strong>: </p>
 <ol style="list-style-type: decimal">
 <li>scatterplots</li>
 <li>linegraphs</li>
@@ -857,20 +819,19 @@ <h2><span class="header-section-number">2.2</span> Five Named Graphs - The 5NG</
 <li>histograms</li>
 <li>barplots</li>
 </ol>
-<p>We’ll also present some variations of these plots, but with this basic repertoire of five graphics in your toolbox, you can visualize a wide array of different variable types. Note that certain plots are only appropriate for categorical variables while others are only appropriate for quantitative variables.</p>
+<p>We’ll also present some variations of these plots, but with this basic repertoire of five graphics in your toolbox, you can visualize a wide array of different variable types. Note that certain plots are only appropriate for categorical variables, while others are only appropriate for numerical variables.</p>
 </div>
 <div id="scatterplots" class="section level2">
 <h2><span class="header-section-number">2.3</span> 5NG#1: Scatterplots</h2>
-<p>The simplest of the 5NG are <em>scatterplots</em>,  also called <em>bivariate plots</em>. They allow you to visualize the <em>relationship</em> between two numerical variables. While you may already be familiar with scatterplots, let’s view them through the lens of the Grammar of Graphics we presented in Section <a href="2-viz.html#grammarofgraphics">2.1</a>. Specifically, we will visualize the relationship between the following two numerical variables in the <code>flights</code> data frame included in the  <code>nycflights13</code> package:</p>
+<p>The simplest of the 5NG are <em>scatterplots</em>,  also called <em>bivariate plots</em>. They allow you to visualize the <em>relationship</em> between two numerical variables. While you may already be familiar with scatterplots, let’s view them through the lens of the grammar of graphics we presented in Section <a href="2-viz.html#grammarofgraphics">2.1</a>. Specifically, we will visualize the relationship between the following two numerical variables in the <code>flights</code> data frame included in the  <code>nycflights13</code> package:</p>
 <ol style="list-style-type: decimal">
 <li><code>dep_delay</code>: departure delay on the horizontal “x” axis and</li>
 <li><code>arr_delay</code>: arrival delay on the vertical “y” axis</li>
 </ol>
-<p>for Alaska Airlines flights leaving NYC in 2013. This requires paring down the data from all 336,776 flights that left NYC in 2013, to only the 714 <em>Alaska Airlines</em> flights that left NYC in 2013. We do this so our scatterplot will involve a manageable 714 points, and not an overwhelmingly large number like 336,776. To achieve this, we’ll take the <code>flights</code> data frame, filter the rows so that only the 714 rows corresponding to Alaska Airlines flights are kept, and save this in a new data frame called <code>alaska_flights</code> using the <code>&lt;-</code> <em>assignment</em> operator :</p>
-<pre class="sourceCode r"><code class="sourceCode r">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</code></pre>
-<p>For now we suggest you don’t worry if you don’t fully understand this code. We’ll see later in Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling that this code uses the <code>dplyr</code> package for data wrangling to achieve our goal: it takes the <code>flights</code> data frame and <code>filter</code> it to only return the rows where <code>carrier</code> is equal to <code>&quot;AS&quot;</code>, Alaska Airlines’ carrier code. Recall from Section <a href="1-getting-started.html#code">1.2</a> that testing for equality is specified with  <code>==</code> and not <code>=</code>.</p>
-<p>For now however, convince yourself that this code achieves what it is supposed to by exploring the resulting data frame by running <code>View(alaska_flights)</code>. You’ll see that it has 714 rows, consisting of only 714 Alaska Airlines flights.</p>
+<p>for Alaska Airlines flights leaving NYC in 2013. This requires paring down the data from all 336,776 flights that left NYC in 2013, to only the 714 <em>Alaska Airlines</em> flights that left NYC in 2013. We do this so our scatterplot will involve a manageable 714 points, and not an overwhelmingly large number like 336,776. To achieve this, we’ll take the <code>flights</code> data frame, filter the rows so that only the 714 rows corresponding to Alaska Airlines flights are kept, and save this in a new data frame called <code>alaska_flights</code> using the <code>&lt;-</code> <em>assignment</em> operator: </p>
+<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" data-line-number="1">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb15-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</a></code></pre></div>
+<p>For now, we suggest you don’t worry if you don’t fully understand this code. We’ll see later in Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling that this code uses the <code>dplyr</code> package to achieve our goal: it takes the <code>flights</code> data frame and <code>filter</code>s it to only return the rows where <code>carrier</code> is equal to <code>&quot;AS&quot;</code>, Alaska Airlines’ carrier code. Recall from Section <a href="1-getting-started.html#code">1.2</a> that testing for equality is specified with <code>==</code> and not <code>=</code>. Convince yourself that this code achieves what it is supposed to by running <code>View(alaska_flights)</code> and exploring the resulting data frame. You’ll see that it has 714 rows, all of which are Alaska Airlines flights.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -881,31 +842,30 @@ <h2><span class="header-section-number">2.3</span> 5NG#1: Scatterplots</h2>
 
 </div>
 <div id="geompoint" class="section level3">
-<h3><span class="header-section-number">2.3.1</span> Scatterplots via geom_point</h3>
-<p>Let’s now go over the code that will create the desired scatterplot, while keeping in the Grammar of Graphics we introduced in Section <a href="2-viz.html#grammarofgraphics">2.1</a>. Let’s take a look at the code and break it down piece-by-piece.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">geom_point</span>()</code></pre>
-<p>Within the <code>ggplot()</code>  function, we specify two of the components of the Grammar of Graphics as arguments (i.e. inputs):</p>
+<h3><span class="header-section-number">2.3.1</span> Scatterplots via <code>geom_point</code></h3>
+<p>Let’s now go over the code that will create the desired scatterplot, while keeping in mind the grammar of graphics framework we introduced in Section <a href="2-viz.html#grammarofgraphics">2.1</a>. Let’s take a look at the code and break it down piece-by-piece.</p>
+<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb16-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>()</a></code></pre></div>
+<p>Within the <code>ggplot()</code>  function, we specify two of the components of the grammar of graphics as arguments (i.e., inputs):</p>
 <ol style="list-style-type: decimal">
-<li>The <code>data</code> to be the <code>alaska_flights</code> data frame by setting <code>data = alaska_flights</code>.</li>
-<li>The <code>aes</code>thetic  <code>mapping</code> by setting <code>mapping = aes(x = dep_delay, y = arr_delay)</code>. Specifically, the variable <code>dep_delay</code> maps to the <code>x</code> position aesthetic while the variable <code>arr_delay</code> maps to the <code>y</code> position aesthetic.</li>
+<li>The <code>data</code> as the <code>alaska_flights</code> data frame via <code>data = alaska_flights</code>.</li>
+<li>The <code>aes</code>thetic  <code>mapping</code> by setting <code>mapping = aes(x = dep_delay, y = arr_delay)</code>. Specifically, the variable <code>dep_delay</code> maps to the <code>x</code> position aesthetic, while the variable <code>arr_delay</code> maps to the <code>y</code> position.</li>
 </ol>
-<p>We then add a layer to the <code>ggplot()</code> function call using the <code>+</code> sign. The added layer in question specifies the third component of the grammar: the <code>geom</code>etric object. In this case the geometric object is set to be points by specifying <code>geom_point()</code>.</p>
-<p>After running these two lines of code in your console, you’ll notice two outputs: the graphic shown in Figure <a href="2-viz.html#fig:noalpha">2.2</a> and a warning message.</p>
+<p>We then add a layer to the <code>ggplot()</code> function call using the <code>+</code> sign. The added layer in question specifies the third component of the grammar: the <code>geom</code>etric object. In this case, the geometric object is set to be points by specifying <code>geom_point()</code>. After running these two lines of code in your console, you’ll notice two outputs: a warning message and the graphic shown in Figure <a href="2-viz.html#fig:noalpha">2.2</a>.</p>
 <pre><code>Warning: Removed 5 rows containing missing values (geom_point).</code></pre>
 <div class="figure" style="text-align: center"><span id="fig:noalpha"></span>
-<img src="moderndive_files/figure-html/noalpha-1.png" alt="Arrival delays vs departure delays for Alaska Airlines flights from NYC in 2013." width="\textwidth" />
+<img src="ModernDive_files/figure-html/noalpha-1.png" alt="Arrival delays versus departure delays for Alaska Airlines flights from NYC in 2013." width="\textwidth" />
 <p class="caption">
-FIGURE 2.2: Arrival delays vs departure delays for Alaska Airlines flights from NYC in 2013.
+FIGURE 2.2: Arrival delays versus departure delays for Alaska Airlines flights from NYC in 2013.
 </p>
 </div>
 <p>Let’s first unpack the graphic in Figure <a href="2-viz.html#fig:noalpha">2.2</a>. Observe that a <em>positive relationship</em> exists between <code>dep_delay</code> and <code>arr_delay</code>: as departure delays increase, arrival delays tend to also increase. Observe also the large mass of points clustered near (0, 0), the point indicating flights that neither departed nor arrived late.</p>
-<p>Let’s turn our attention to the warning message. R is alerting us to the fact that 5 rows were ignored due to them being missing. For these 5 rows, either the value for <code>dep_delay</code> or <code>arr_delay</code> or both were missing (recorded in R as <code>NA</code>), and thus these rows were ignored in our plot.</p>
+<p>Let’s turn our attention to the warning message. R is alerting us to the fact that five rows were ignored because they contain missing values. For these five rows, either the value for <code>dep_delay</code> or <code>arr_delay</code> or both were missing (recorded in R as <code>NA</code>), and thus these rows were ignored in our plot.</p>
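+<p>One way to check this warning for yourself is to count the rows where either delay is recorded as <code>NA</code>. The following is a minimal sketch using <code>is.na()</code> inside <code>filter()</code>, assuming the <code>nycflights13</code> and <code>dplyr</code> packages are installed; given the warning message, we would expect the count to match the five rows that were removed.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(nycflights13)
+library(dplyr)
+
+alaska_flights &lt;- flights %&gt;% 
+  filter(carrier == &quot;AS&quot;)
+
+# Count the rows where the departure delay, the arrival delay, or both are missing
+alaska_flights %&gt;% 
+  filter(is.na(dep_delay) | is.na(arr_delay)) %&gt;% 
+  nrow()</code></pre>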
 <p>Before we continue, let’s make a few more observations about this code that created the scatterplot. Note that the <code>+</code> sign comes at the end of lines, and not at the beginning. You’ll get an error in R if you put it at the beginning of a line.  When adding layers to a plot, you are encouraged to start a new line after the <code>+</code> (by pressing the Return/Enter button on your keyboard) so that the code for each layer is on a new line. As we add more and more layers to plots, you’ll see this will greatly improve the legibility of your code.</p>
 <p>To stress the importance of adding the layer specifying the <code>geom</code>etric object, consider Figure <a href="2-viz.html#fig:nolayers">2.3</a> where no layers are added. Because the <code>geom</code>etric object was not specified, we have a blank plot which is not very useful!</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay))</code></pre>
+<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay))</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:nolayers"></span>
-<img src="moderndive_files/figure-html/nolayers-1.png" alt="A plot with no layers." width="\textwidth" />
+<img src="ModernDive_files/figure-html/nolayers-1.png" alt="A plot with no layers." width="\textwidth" />
 <p class="caption">
 FIGURE 2.3: A plot with no layers.
 </p>
@@ -916,8 +876,8 @@ <h3><span class="header-section-number">2.3.1</span> Scatterplots via geom_point
 </p>
 </div>
 <p><strong>(LC2.2)</strong> What are some practical reasons why <code>dep_delay</code> and <code>arr_delay</code> have a positive relationship?</p>
-<p><strong>(LC2.3)</strong> What variables in the <code>weather</code> data frame would you expect to have a negative correlation (i.e. a negative relationship) with <code>dep_delay</code>? Why? Remember that we are focusing on numerical variables here. Hint: Explore the <code>weather</code> dataset by using the <code>View()</code> function.</p>
-<p><strong>(LC2.4)</strong> Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?</p>
+<p><strong>(LC2.3)</strong> What variables in the <code>weather</code> data frame would you expect to have a negative correlation (i.e., a negative relationship) with <code>dep_delay</code>? Why? Remember that we are focusing on numerical variables here. Hint: Explore the <code>weather</code> dataset by using the <code>View()</code> function.</p>
+<p><strong>(LC2.4)</strong> Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaska Air flights?</p>
 <p><strong>(LC2.5)</strong> What are some other features of the plot that stand out to you?</p>
 <p><strong>(LC2.6)</strong> Create a new scatterplot using different variables in the <code>alaska_flights</code> data frame by modifying the example given.</p>
 <div class="learncheck">
@@ -925,7 +885,7 @@ <h3><span class="header-section-number">2.3.1</span> Scatterplots via geom_point
 </div>
 </div>
 <div id="overplotting" class="section level3">
-<h3><span class="header-section-number">2.3.2</span> Over-plotting</h3>
+<h3><span class="header-section-number">2.3.2</span> Overplotting</h3>
 <p>The large mass of points near (0, 0) in Figure <a href="2-viz.html#fig:noalpha">2.2</a> can cause some confusion since it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called  <em>overplotting</em>. As one may guess, this corresponds to points being plotted on top of each other over and over again. When overplotting occurs, it is difficult to know the number of points being plotted. There are two methods to address the issue of overplotting. Either by</p>
 <ol style="list-style-type: decimal">
 <li>Adjusting the transparency of the points or</li>
@@ -933,44 +893,44 @@ <h3><span class="header-section-number">2.3.2</span> Over-plotting</h3>
 </ol>
 <p><strong>Method 1: Changing the transparency</strong></p>
 <p>The first way of addressing overplotting is to change the transparency/opacity of the points by setting the <code>alpha</code> argument in <code>geom_point()</code>. We can change the <code>alpha</code> argument to be any value between <code>0</code> and <code>1</code>, where <code>0</code> sets the points to be 100% transparent and <code>1</code> sets the points to be 100% opaque. By default, <code>alpha</code> is set to <code>1</code>. In other words, if we don’t explicitly set an <code>alpha</code> value, R will use <code>alpha = 1</code>.</p>
-<p>Note how the following code is identical to the code in Section <a href="2-viz.html#scatterplots">2.3</a> that created the scatterplot with overplotting, but with <code>alpha = 0.2</code> added to the <code>geom_point()</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.2</span>)</code></pre>
+<p>Note how the following code is identical to the code in Section <a href="2-viz.html#scatterplots">2.3</a> that created the scatterplot with overplotting, but with <code>alpha = 0.2</code> added to the <code>geom_point()</code> function:</p>
+<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb19-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.2</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:alpha"></span>
-<img src="moderndive_files/figure-html/alpha-1.png" alt="Arrival vs departure delays scatterplot with alpha = 0.2." width="\textwidth" />
+<img src="ModernDive_files/figure-html/alpha-1.png" alt="Arrival versus departure delays scatterplot with alpha = 0.2." width="\textwidth" />
 <p class="caption">
-FIGURE 2.4: Arrival vs departure delays scatterplot with alpha = 0.2.
+FIGURE 2.4: Arrival versus departure delays scatterplot with alpha = 0.2.
 </p>
 </div>
 <p>The key feature to note in Figure <a href="2-viz.html#fig:alpha">2.4</a> is that the transparency of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no <code>aes()</code> surrounding <code>alpha = 0.2</code>. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of <code>alpha</code>. In fact, if you change the second line to read <code>geom_point(aes(alpha = 0.2))</code>, <code>ggplot2</code> will treat <code>0.2</code> as data to be mapped rather than as a fixed setting, which is not what we want here.</p>
 <p><strong>Method 2: Jittering the points</strong></p>
-<p>The second way of addressing overplotting is by <em>jittering</em> all the points, in other words give each point a small “nudge” in a random direction. You can think of “jittering” as shaking the points around a bit on the plot. Let’s illustrate using a simple example first. Say we have a data frame with 4 identical rows of x &amp; y values: (0,0), (0,0), (0,0), and (0,0). In Figure <a href="2-viz.html#fig:jitter-example-plot-1">2.5</a>, we present both the regular scatterplot of these 4 points (on the left) and its jittered counterpart (on the right).</p>
+<p>The second way of addressing overplotting is by <em>jittering</em> all the points. This means giving each point a small “nudge” in a random direction. You can think of “jittering” as shaking the points around a bit on the plot. Let’s illustrate using a simple example first. Say we have a data frame with four identical rows of x and y values: (0,0), (0,0), (0,0), and (0,0). In Figure <a href="2-viz.html#fig:jitter-example-plot-1">2.5</a>, we present both the regular scatterplot of these four points (on the left) and its jittered counterpart (on the right).</p>
 <div class="figure" style="text-align: center"><span id="fig:jitter-example-plot-1"></span>
-<img src="moderndive_files/figure-html/jitter-example-plot-1-1.png" alt="Regular and jittered scatterplot." width="\textwidth" />
+<img src="ModernDive_files/figure-html/jitter-example-plot-1-1.png" alt="Regular and jittered scatterplot." width="\textwidth" />
 <p class="caption">
 FIGURE 2.5: Regular and jittered scatterplot.
 </p>
 </div>
-<p>In the left-hand regular scatterplot, observe that the 4 points are superimposed on top of each other. While we know there are 4 values being plotted, this fact might not be apparent to others. In the right-hand jittered scatterplot, observe that since each point is given a random “nudge”, it is now plainly evident that this plot involves four points.</p>
-<p>Keep in mind however that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged. </p>
+<p>In the left-hand regular scatterplot, observe that the four points are superimposed on top of each other. While we know there are four values being plotted, this fact might not be apparent to others. In the right-hand jittered scatterplot, it is now plainly evident that this plot involves four points since each point is given a random “nudge.”</p>
+<p>Keep in mind, however, that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged. </p>
 <p>To create a jittered scatterplot, instead of using <code>geom_point()</code>, we use <code>geom_jitter()</code>. Observe how the following code is very similar to the code that created the scatterplot with overplotting in Subsection <a href="2-viz.html#geompoint">2.3.1</a>, but with <code>geom_point()</code>  replaced with <code>geom_jitter()</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">geom_jitter</span>(<span class="dt">width =</span> <span class="dv">30</span>, <span class="dt">height =</span> <span class="dv">30</span>)</code></pre>
+<div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb20-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb20-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_jitter</span>(<span class="dt">width =</span> <span class="dv">30</span>, <span class="dt">height =</span> <span class="dv">30</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:jitter"></span>
-<img src="moderndive_files/figure-html/jitter-1.png" alt="Arrival vs departure delays jittered scatterplot." width="\textwidth" />
+<img src="ModernDive_files/figure-html/jitter-1.png" alt="Arrival versus departure delays jittered scatterplot." width="\textwidth" />
 <p class="caption">
-FIGURE 2.6: Arrival vs departure delays jittered scatterplot.
+FIGURE 2.6: Arrival versus departure delays jittered scatterplot.
 </p>
 </div>
-<p>In order to specify how much jitter to add, we adjusted the <code>width</code> and <code>height</code> arguments to <code>geom_jitter()</code>. This corresponds to how hard you’d like to shake the plot in horizontal x-axis units and vertical y-axis units respectively. In this case, both axes are in minutes. How much jitter should we add using the <code>width</code> and <code>height</code> arguments? On the one hand, it is important to add just enough jitter to break any overlap in points, but on the other hand, not so much that we completely alter the original pattern in points.</p>
-<p>As can be seen in the resulting Figure <a href="2-viz.html#fig:jitter">2.6</a>, in this case jittering doesn’t really provide much new insight. In this particular case, it can be argued that changing the transparency of the points by setting <code>alpha</code> proved more effective. When would it be better to use a jittered scatterplot? When would it be better to alter the points’ transparency? There is no single right answer that applies to all situations. You need to make a subjective choice and own that choice. At the very least when confronted with overplotting however, we suggest you make both types of plots and see which one better emphasizes the point you are trying to make.</p>
+<p>In order to specify how much jitter to add, we adjusted the <code>width</code> and <code>height</code> arguments to <code>geom_jitter()</code>. This corresponds to how hard you’d like to shake the plot in horizontal x-axis units and vertical y-axis units, respectively. In this case, both axes are in minutes. How much jitter should we add using the <code>width</code> and <code>height</code> arguments? On the one hand, it is important to add just enough jitter to break any overlap in points, but on the other hand, not so much that we completely alter the original pattern in points.</p>
+<p>As can be seen in the resulting Figure <a href="2-viz.html#fig:jitter">2.6</a>, in this case jittering doesn’t really provide much new insight. In this particular case, it can be argued that changing the transparency of the points by setting <code>alpha</code> proved more effective. When would it be better to use a jittered scatterplot? When would it be better to alter the points’ transparency? There is no single right answer that applies to all situations. You need to make a subjective choice and own that choice. At the very least when confronted with overplotting, however, we suggest you make both types of plots and see which one better emphasizes the point you are trying to make.</p>
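+<p>To get a feel for the <code>width</code> and <code>height</code> arguments, one option is to return to the toy example of four identical points and experiment with the amount of jitter. The following is a minimal sketch, assuming the <code>ggplot2</code> package is installed; the data frame name <code>four_points</code> and the jitter amounts of <code>0.1</code> are illustrative choices of ours rather than values used elsewhere in this chapter. Because the nudges are random, the jittered points will land in slightly different places each time the code is run.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
+
+# Toy data: four identical (0, 0) points (hypothetical name)
+four_points &lt;- data.frame(x = rep(0, 4), y = rep(0, 4))
+
+# Regular scatterplot: the four points overlap completely
+ggplot(data = four_points, mapping = aes(x = x, y = y)) + 
+  geom_point()
+
+# Jittered scatterplot: each point is nudged by a small random amount
+ggplot(data = four_points, mapping = aes(x = x, y = y)) + 
+  geom_jitter(width = 0.1, height = 0.1)
+
+# Jittering only affects the plot; the saved values are unchanged
+four_points</code></pre>
+<p>Try increasing and decreasing the two jitter amounts to see when the four points become distinguishable and when the jitter starts to misrepresent where the points actually lie.</p>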
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
 <p><strong>(LC2.7)</strong> Why is setting the <code>alpha</code> argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?</p>
-<p><strong>(LC2.8)</strong> After viewing Figure <a href="2-viz.html#fig:alpha">2.4</a>, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the <code>alpha = 0.2</code> set in Figure <a href="2-viz.html#fig:noalpha">2.2</a>?</p>
+<p><strong>(LC2.8)</strong> After viewing Figure <a href="2-viz.html#fig:alpha">2.4</a>, give an approximate range of arrival delays and departure delays that occur most frequently. How has that region changed compared to when you observed the same plot without <code>alpha = 0.2</code> set in Figure <a href="2-viz.html#fig:noalpha">2.2</a>?</p>
 <div class="learncheck">
 
 </div>
@@ -983,40 +943,42 @@ <h3><span class="header-section-number">2.3.3</span> Summary</h3>
 </div>
 <div id="linegraphs" class="section level2">
 <h2><span class="header-section-number">2.4</span> 5NG#2: Linegraphs</h2>
-<p>The next of the five named graphs are linegraphs. Linegraphs  show the relationship between two numerical variables when the variable on the x-axis, also called the <em>explanatory</em> variable, is of a sequential nature. In other words there is an inherent ordering to the variable. The most common example of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called <em>time series</em> plots. Let’s illustrate linegraphs using another data set in the <code>nycflights13</code>  package: the <code>weather</code> data frame.</p>
-<p>Let’s explore the <code>weather</code> data frame by running <code>View(weather)</code> and <code>glimpse(weather)</code> and furthermore let’s read the associated help file by running <code>?weather</code> to bring up the help file.</p>
-<p>Observe that there is a variable called <code>temp</code> of hourly temperature recordings in Fahrenheit at weather stations near all three airports in New York City: Newark (<code>origin</code> code <code>EWR</code>), John F. Kennedy International, and La Guardia (<code>LGA</code>). However, instead of considering hourly temperatures for all days in 2013 for all three airports, for simplicity let’s only consider hourly temperatures at Newark airport for the first 15 days in January.</p>
-<p>Recall in Section <a href="2-viz.html#scatterplots">2.3</a> we used the <code>filter()</code> function to only choose the subset of rows of <code>flights</code> corresponding to Alaska Airlines flights. We similarly use <code>filter()</code>  here, but by using the <code>&amp;</code> operator we only choose the subset of rows of <code>weather</code> where the <code>origin</code> is <code>&quot;EWR&quot;</code>, the <code>month</code> is January, and the <code>day</code> is between <code>1</code> and <code>15</code>. Recall we performed a similar task in Section <a href="2-viz.html#scatterplots">2.3</a> when creating the <code>alaska_flights</code> data frame of only Alaska Airlines flights, a topic we’ll explore more in Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling.</p>
-<pre class="sourceCode r"><code class="sourceCode r">early_january_weather &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;EWR&quot;</span> <span class="op">&amp;</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span><span class="dv">1</span> <span class="op">&amp;</span><span class="st"> </span>day <span class="op">&lt;=</span><span class="st"> </span><span class="dv">15</span>)</code></pre>
+<p>The next of the five named graphs is the linegraph. Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the <em>explanatory</em> variable, is of a sequential nature. In other words, there is an inherent ordering to the variable.</p>
+<p>The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called <em>time series</em> plots. Let’s illustrate linegraphs using another dataset in the <code>nycflights13</code>  package: the <code>weather</code> data frame.</p>
+<p>Let’s explore the <code>weather</code> data frame by running <code>View(weather)</code> and <code>glimpse(weather)</code>. Furthermore, let’s read the associated help file by running <code>?weather</code>.</p>
+<p>Observe that there is a variable called <code>temp</code> of hourly temperature recordings in Fahrenheit at weather stations near all three major airports in New York City: Newark (<code>origin</code> code <code>EWR</code>), John F. Kennedy International (<code>JFK</code>), and LaGuardia (<code>LGA</code>). However, instead of considering hourly temperatures for all days in 2013 for all three airports, for simplicity let’s only consider hourly temperatures at Newark airport for the first 15 days in January.</p>
+<p>Recall that in Section <a href="2-viz.html#scatterplots">2.3</a>, we used the <code>filter()</code> function to only choose the subset of rows of <code>flights</code> corresponding to Alaska Airlines flights when creating the <code>alaska_flights</code> data frame. We similarly use <code>filter()</code> here, but by using the <code>&amp;</code> operator we only choose the subset of rows of <code>weather</code> where the <code>origin</code> is <code>&quot;EWR&quot;</code>, the <code>month</code> is January, <strong>and</strong> the <code>day</code> is between <code>1</code> and <code>15</code>. We’ll explore this kind of filtering more in Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling.</p>
+<div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb21-1" data-line-number="1">early_january_weather &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb21-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;EWR&quot;</span> <span class="op">&amp;</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span><span class="dv">1</span> <span class="op">&amp;</span><span class="st"> </span>day <span class="op">&lt;=</span><span class="st"> </span><span class="dv">15</span>)</a></code></pre></div>
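+<p>In <code>dplyr</code>, separating conditions with commas inside <code>filter()</code> also requires all of them to hold, just as <code>&amp;</code> does. The following sketch, which assumes the <code>nycflights13</code> and <code>dplyr</code> packages are installed, builds the same subset both ways and compares the results; the name <code>early_january_weather_commas</code> is a hypothetical one used only for this comparison. We would expect the two data frames to have the same rows.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(nycflights13)
+library(dplyr)
+
+# Conditions combined with the &amp; operator, as in the code above
+early_january_weather &lt;- weather %&gt;% 
+  filter(origin == &quot;EWR&quot; &amp; month == 1 &amp; day &lt;= 15)
+
+# Same conditions separated by commas (hypothetical name, for comparison only)
+early_january_weather_commas &lt;- weather %&gt;% 
+  filter(origin == &quot;EWR&quot;, month == 1, day &lt;= 15)
+
+# Compare the two results
+all.equal(early_january_weather, early_january_weather_commas)</code></pre>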
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
 <p><strong>(LC2.9)</strong> Take a look at both the <code>weather</code> and <code>early_january_weather</code> data frames by running <code>View(weather)</code> and <code>View(early_january_weather)</code>. In what respect do these data frames differ?</p>
-<p><strong>(LC2.10)</strong> <code>View()</code> the <code>flights</code> data frame again. Why does the <code>time_hour</code> variable uniquely identify the hour of the measurement whereas the <code>hour</code> variable does not?</p>
+<p><strong>(LC2.10)</strong> <code>View()</code> the <code>flights</code> data frame again. Why does the <code>time_hour</code> variable uniquely identify the hour of the measurement, whereas the <code>hour</code> variable does not?</p>
 <div class="learncheck">
 
 </div>
 <div id="geomline" class="section level3">
-<h3><span class="header-section-number">2.4.1</span> Linegraphs via geom_line</h3>
+<h3><span class="header-section-number">2.4.1</span> Linegraphs via <code>geom_line</code></h3>
 <p>Let’s create a time series plot of the hourly temperatures saved in the <code>early_january_weather</code> data frame by using <code>geom_line()</code> to create a linegraph, instead of using <code>geom_point()</code> like we used previously to create scatterplots:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> early_january_weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> time_hour, <span class="dt">y =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_line</span>()</code></pre>
+<div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb22-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> early_january_weather, </a>
+<a class="sourceLine" id="cb22-2" data-line-number="2">       <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> time_hour, <span class="dt">y =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb22-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_line</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:hourlytemp"></span>
-<img src="moderndive_files/figure-html/hourlytemp-1.png" alt="Hourly temperature in Newark for January 1-15, 2013." width="\textwidth" />
+<img src="ModernDive_files/figure-html/hourlytemp-1.png" alt="Hourly temperature in Newark for January 1-15, 2013." width="\textwidth" />
 <p class="caption">
 FIGURE 2.7: Hourly temperature in Newark for January 1-15, 2013.
 </p>
 </div>
-<p>Much as with the <code>ggplot()</code> code that created the scatterplot of departure and arrival delays for Alaska Airlines flights in Figure <a href="2-viz.html#fig:noalpha">2.2</a>, let’s break down this code piece-by-piece in terms of the Grammar of Graphics:</p>
-<p>Within the <code>ggplot()</code> function call, we specify two of the components of the Grammar of Graphics as arguments:</p>
+<p>Much as with the <code>ggplot()</code> code that created the scatterplot of departure and arrival delays for Alaska Airlines flights in Figure <a href="2-viz.html#fig:noalpha">2.2</a>, let’s break down this code piece-by-piece in terms of the grammar of graphics:</p>
+<p>Within the <code>ggplot()</code> function call, we specify two of the components of the grammar of graphics as arguments:</p>
 <ol style="list-style-type: decimal">
 <li>The <code>data</code> to be the <code>early_january_weather</code> data frame by setting <code>data = early_january_weather</code>.</li>
-<li>The <code>aes</code>thetic <code>mapping</code> by setting <code>mapping = aes(x = time_hour, y = temp)</code>. Specifically, the variable <code>time_hour</code> maps to the <code>x</code> position aesthetic while the variable <code>temp</code> maps to the <code>y</code> position aesthetic.</li>
+<li>The <code>aes</code>thetic <code>mapping</code> by setting <code>mapping = aes(x = time_hour, y = temp)</code>. Specifically, the variable <code>time_hour</code> maps to the <code>x</code> position aesthetic, while the variable <code>temp</code> maps to the <code>y</code> position aesthetic.</li>
 </ol>
-<p>We add a layer to the <code>ggplot()</code> function call using the <code>+</code> sign. The layer in question specifies the third component of the grammar: the <code>geom</code>etric object in question. In this case the geometric object is a <code>line</code>, set by specifying <code>geom_line()</code>.</p>
+<p>We add a layer to the <code>ggplot()</code> function call using the <code>+</code> sign. This layer specifies the third component of the grammar: the <code>geom</code>etric object. In this case, the geometric object is a <code>line</code> set by specifying <code>geom_line()</code>.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1031,7 +993,7 @@ <h3><span class="header-section-number">2.4.1</span> Linegraphs via geom_line</h
 </div>
 <div id="summary-1" class="section level3">
 <h3><span class="header-section-number">2.4.2</span> Summary</h3>
-<p>Linegraphs, just like scatterplots, display the relationship between two numerical variables. However it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e. the explanatory variable) has an inherent ordering, such as some notion of time.</p>
+<p>Linegraphs, just like scatterplots, display the relationship between two numerical variables. However, it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e., the explanatory variable) has an inherent ordering, such as some notion of time.</p>
 </div>
 </div>
 <div id="histograms" class="section level2">
@@ -1045,7 +1007,7 @@ <h2><span class="header-section-number">2.5</span> 5NG#3: Histograms</h2>
 </ol>
 <p>One way to visualize the <em>distribution</em> of this single variable <code>temp</code> is to plot its values on a horizontal line as we do in Figure <a href="2-viz.html#fig:temp-on-line">2.8</a>:</p>
 <div class="figure" style="text-align: center"><span id="fig:temp-on-line"></span>
-<img src="moderndive_files/figure-html/temp-on-line-1.png" alt="Plot of hourly temperature recordings from NYC in 2013." width="\textwidth" />
+<img src="ModernDive_files/figure-html/temp-on-line-1.png" alt="Plot of hourly temperature recordings from NYC in 2013." width="\textwidth" />
 <p class="caption">
 FIGURE 2.8: Plot of hourly temperature recordings from NYC in 2013.
 </p>
@@ -1059,12 +1021,12 @@ <h2><span class="header-section-number">2.5</span> 5NG#3: Histograms</h2>
 </ol>
 <p>Let’s drill down on an example of a histogram, shown in Figure <a href="2-viz.html#fig:histogramexample">2.9</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:histogramexample"></span>
-<img src="moderndive_files/figure-html/histogramexample-1.png" alt="Example histogram." width="\textwidth" />
+<img src="ModernDive_files/figure-html/histogramexample-1.png" alt="Example histogram." width="\textwidth" />
 <p class="caption">
 FIGURE 2.9: Example histogram.
 </p>
 </div>
-<p>Let’s focus only on temperatures between 30°F (-1°C) and 60°F (15°C) for now. Observe that there are three bins of equal width between 30 and 60°F. Thus we have three bins of width 10°F each: one bin for the 30-40°F range, another bin for the 40-50°F range, and another bin for the 50-60°F range. Since:</p>
+<p>Let’s focus only on temperatures between 30°F (-1°C) and 60°F (15°C) for now. Observe that there are three bins of equal width between 30°F and 60°F. Thus we have three bins of width 10°F each: one bin for the 30-40°F range, another bin for the 40-50°F range, and another bin for the 50-60°F range. Note that:</p>
 <ol style="list-style-type: decimal">
 <li>The bin for the 30-40°F range has a height of around 5000. In other words, around 5000 of the hourly temperature recordings are between 30°F and 40°F.</li>
 <li>The bin for the 40-50°F range has a height of around 4300. In other words, around 4300 of the hourly temperature recordings are between 40°F and 50°F.</li>
@@ -1072,50 +1034,50 @@ <h2><span class="header-section-number">2.5</span> 5NG#3: Histograms</h2>
 </ol>
 <p>All nine bins spanning 10°F to 100°F on the x-axis have this interpretation.</p>
 <div id="geomhistogram" class="section level3">
-<h3><span class="header-section-number">2.5.1</span> Histograms via geom_histogram</h3>
-<p>Let’s now present the <code>ggplot()</code> code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in <code>aes()</code>: the single numerical variable <code>temp</code>. The y-aesthetic of a histogram, the count of the observations in each bin, gets computed for you automatically. Furthermore, the geometric object layer is now a <code>geom_histogram()</code>. . After running the following code, you’ll see the histogram in Figure <a href="2-viz.html#fig:weather-histogram">2.10</a> as well as warning messages. We’ll discuss the warning messages first.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>()</code></pre>
+<h3><span class="header-section-number">2.5.1</span> Histograms via <code>geom_histogram</code></h3>
+<p>Let’s now present the <code>ggplot()</code> code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in <code>aes()</code>: the single numerical variable <code>temp</code>. The y-aesthetic of a histogram, the count of the observations in each bin, gets computed for you automatically. Furthermore, the geometric object layer is now a <code>geom_histogram()</code>.  After running the following code, you’ll see the histogram in Figure <a href="2-viz.html#fig:weather-histogram">2.10</a> as well as warning messages. We’ll discuss the warning messages first.</p>
+<div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb23-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb23-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>()</a></code></pre></div>
 <pre><code>`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.</code></pre>
 <pre><code>Warning: Removed 1 rows containing non-finite values (stat_bin).</code></pre>
 <div class="figure" style="text-align: center"><span id="fig:weather-histogram"></span>
-<img src="moderndive_files/figure-html/weather-histogram-1.png" alt="Histogram of hourly temperatures at three NYC airports." width="\textwidth" />
+<img src="ModernDive_files/figure-html/weather-histogram-1.png" alt="Histogram of hourly temperatures at three NYC airports." width="\textwidth" />
 <p class="caption">
 FIGURE 2.10: Histogram of hourly temperatures at three NYC airports.
 </p>
 </div>
-<p>The first message is telling us that the histogram was constructed using <code>bins = 30</code>, in other words 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We’ll see in the next section how to change the number of bins away from this default value.</p>
+<p>The first message is telling us that the histogram was constructed using <code>bins = 30</code> for 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We’ll see in the next section how to change the number of bins to a value other than the default.</p>
 <p>The second message is telling us something similar to the warning message we received when we ran the code to create a scatterplot of departure and arrival delays for Alaska Airlines flights in Figure <a href="2-viz.html#fig:noalpha">2.2</a>: that because one row has a missing <code>NA</code> value for <code>temp</code>, it was omitted from the histogram. R is just giving us a friendly heads up that this was the case.</p>
-<p>Now let’s unpack the resulting histogram in Figure <a href="2-viz.html#fig:weather-histogram">2.10</a>. Observe that values less than 25°F as well as values above 80°F are rather rare. However, because of the large number of bins, it’s hard to get a sense for which range of temperatures is spanned by each bin; everything is one giant amorphous blob. So let’s add white vertical borders demarcating the bins by adding a <code>color = &quot;white&quot;</code> argument to <code>geom_histogram()</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</code></pre>
+<p>Now let’s unpack the resulting histogram in Figure <a href="2-viz.html#fig:weather-histogram">2.10</a>. Observe that values less than 25°F as well as values above 80°F are rather rare. However, because of the large number of bins, it’s hard to get a sense for which range of temperatures is spanned by each bin; everything is one giant amorphous blob. So let’s add white vertical borders demarcating the bins by adding a <code>color = &quot;white&quot;</code> argument to <code>geom_histogram()</code> and ignore the warning about setting the number of bins to a better value:</p>
+<div class="sourceCode" id="cb26"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb26-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb26-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:weather-histogram-2"></span>
-<img src="moderndive_files/figure-html/weather-histogram-2-1.png" alt="Histogram of hourly temperatures at three NYC airports with white borders." width="\textwidth" />
+<img src="ModernDive_files/figure-html/weather-histogram-2-1.png" alt="Histogram of hourly temperatures at three NYC airports with white borders." width="\textwidth" />
 <p class="caption">
 FIGURE 2.11: Histogram of hourly temperatures at three NYC airports with white borders.
 </p>
 </div>
 <p>We now have an easier time associating ranges of temperatures to each of the bins in Figure <a href="2-viz.html#fig:weather-histogram-2">2.11</a>. We can also vary the color of the bars by setting the  <code>fill</code> argument. For example, you can set the bin colors to be “blue steel” by setting <code>fill = &quot;steelblue&quot;</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">fill =</span> <span class="st">&quot;steelblue&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb27-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb27-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">fill =</span> <span class="st">&quot;steelblue&quot;</span>)</a></code></pre></div>
 <p>If you’re curious, run <code>colors()</code> to see all 657 possible choices of colors in R!</p>
 </div>
 <div id="adjustbins" class="section level3">
 <h3><span class="header-section-number">2.5.2</span> Adjusting the bins</h3>
-<p>Observe in Figure <a href="2-viz.html#fig:weather-histogram-2">2.11</a> that in the 50-75°F range there appear to be roughly 8 bins. Thus each bin has width 25 divided by 8, or roughly 3.12°F, which is not a very easily interpretable range to work with. Let’s improve this by adjusting the number of bins in our histogram in one of two ways:</p>
+<p>Observe in Figure <a href="2-viz.html#fig:weather-histogram-2">2.11</a> that in the 50-75°F range there appear to be roughly 8 bins. Thus each bin has width 25 divided by 8, or 3.125°F, which is not a very easily interpretable range to work with. Let’s improve this by adjusting the number of bins in our histogram in one of two ways:</p>
 <ol style="list-style-type: decimal">
 <li>By adjusting the number of bins via the  <code>bins</code> argument to <code>geom_histogram()</code>.</li>
 <li>By adjusting the width of the bins via the  <code>binwidth</code> argument to <code>geom_histogram()</code>.</li>
 </ol>
 <p>Using the first method, we have the power to specify how many bins we would like to cut the x-axis up into. As mentioned in the previous section, the default number of bins is 30. We can override this default to, say, 40 bins as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">bins =</span> <span class="dv">40</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb28"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb28-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb28-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">bins =</span> <span class="dv">40</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</a></code></pre></div>
 <p>Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the <code>binwidth</code> argument in the <code>geom_histogram()</code> layer. For example, let’s set the width of each bin to be 10°F.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb29-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb29-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</a></code></pre></div>
 <p>We compare both resulting histograms side-by-side in Figure <a href="2-viz.html#fig:hist-bins">2.12</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:hist-bins"></span>
-<img src="moderndive_files/figure-html/hist-bins-1.png" alt="Setting histogram bins in two ways." width="\textwidth" />
+<img src="ModernDive_files/figure-html/hist-bins-1.png" alt="Setting histogram bins in two ways." width="\textwidth" />
 <p class="caption">
 FIGURE 2.12: Setting histogram bins in two ways.
 </p>
@@ -1126,7 +1088,7 @@ <h3><span class="header-section-number">2.5.2</span> Adjusting the bins</h3>
 </p>
 </div>
 <p><strong>(LC2.14)</strong> What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?</p>
-<p><strong>(LC2.15)</strong> Would you classify the distribution of temperatures as symmetric or skewed?</p>
+<p><strong>(LC2.15)</strong> Would you classify the distribution of temperatures as symmetric or skewed in one direction or another?</p>
 <p><strong>(LC2.16)</strong> What would you guess is the “center” value in this distribution? Why did you make that choice?</p>
 <p><strong>(LC2.17)</strong> Is this data spread out greatly from the center or is it close? Why?</p>
 <div class="learncheck">
@@ -1140,23 +1102,23 @@ <h3><span class="header-section-number">2.5.3</span> Summary</h3>
 </div>
 <div id="facets" class="section level2">
 <h2><span class="header-section-number">2.6</span> Facets</h2>
-<p>Before continuing the next of the 5NG, let’s briefly introduce a new concept called <em>faceting</em>. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.</p>
+<p>Before continuing with the next of the 5NG, let’s briefly introduce a new concept called <em>faceting</em>. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.</p>
 <p>For example, suppose we were interested in looking at how the histogram of hourly temperature recordings at the three NYC airports we saw in Figure <a href="2-viz.html#fig:histogramexample">2.9</a> differed in each month. We could “split” this histogram by the 12 possible months in a given year. In other words, we would plot histograms of <code>temp</code> for each <code>month</code> separately. We do this by adding a <code>facet_wrap(~ month)</code> layer. Note the <code>~</code> is a “tilde” and can generally be found on the key next to the “1” key on US keyboards. The tilde is required and you’ll receive the error <code>Error in as.quoted(facets) : object 'month' not found</code> if you don’t include it here.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>month)</code></pre>
+<div class="sourceCode" id="cb30"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb30-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb30-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb30-3" data-line-number="3"><span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>month)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:facethistogram"></span>
-<img src="moderndive_files/figure-html/facethistogram-1.png" alt="Faceted histogram of hourly temperatures by month." width="\textwidth" />
+<img src="ModernDive_files/figure-html/facethistogram-1.png" alt="Faceted histogram of hourly temperatures by month." width="\textwidth" />
 <p class="caption">
 FIGURE 2.13: Faceted histogram of hourly temperatures by month.
 </p>
 </div>
-<p>We can also specify the number of rows and columns in the grid by using the <code>nrow</code> and <code>ncol</code> arguments inside of  <code>facet_wrap()</code>. For example, say we would like our faceted histogram to have 4 rows instead of 3. We simply add a <code>nrow = 4</code> argument to <code>facet_wrap(~ month)</code></p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>month, <span class="dt">nrow =</span> <span class="dv">4</span>)</code></pre>
+<p>We can also specify the number of rows and columns in the grid by using the <code>nrow</code> and <code>ncol</code> arguments inside of <code>facet_wrap()</code>. For example, say we would like our faceted histogram to have 4 rows instead of 3. We simply add an <code>nrow = 4</code> argument to <code>facet_wrap(~ month)</code>:</p>
+<div class="sourceCode" id="cb31"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb31-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb31-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb31-3" data-line-number="3"><span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>month, <span class="dt">nrow =</span> <span class="dv">4</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:facethistogram2"></span>
-<img src="moderndive_files/figure-html/facethistogram2-1.png" alt="Faceted histogram with 4 instead of 3 rows." width="\textwidth" />
+<img src="ModernDive_files/figure-html/facethistogram2-1.png" alt="Faceted histogram with 4 instead of 3 rows." width="\textwidth" />
 <p class="caption">
 FIGURE 2.14: Faceted histogram with 4 instead of 3 rows.
 </p>
@@ -1169,53 +1131,53 @@ <h2><span class="header-section-number">2.6</span> Facets</h2>
 </div>
 <p><strong>(LC2.18)</strong> What other things do you notice about this faceted plot? How does a faceted plot help us see relationships between two variables?</p>
 <p><strong>(LC2.19)</strong> What do the numbers 1-12 correspond to in the plot? What about 25, 50, 75, 100?</p>
-<p><strong>(LC2.20)</strong> For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.</p>
-<p><strong>(LC2.21)</strong> Does the <code>temp</code> variable in the <code>weather</code> data set have a lot of variability? Why do you say that?</p>
+<p><strong>(LC2.20)</strong> For which types of datasets would faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.</p>
+<p><strong>(LC2.21)</strong> Does the <code>temp</code> variable in the <code>weather</code> dataset have a lot of variability? Why do you say that?</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="boxplots" class="section level2">
 <h2><span class="header-section-number">2.7</span> 5NG#4: Boxplots</h2>
-<p>While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal are <em>side-by-side boxplots</em>. A boxplot  is constructed from the information provided in the <em>five-number summary</em> of a numerical variable (see Appendix <a href="A-appendixA.html#appendix-stat-terms">A.1</a>).</p>
-<p>To keep things simple for now, let’s only consider the 2141 hourly temperature recordings for the month of November, each represented as a point in Figure <a href="2-viz.html#fig:nov1">2.15</a>.</p>
+<p>While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal is a <em>side-by-side boxplot</em>. A boxplot  is constructed from the information provided in the <em>five-number summary</em> of a numerical variable (see Appendix <a href="A-appendixA.html#appendix-stat-terms">A.1</a>).</p>
+<p>To keep things simple for now, let’s only consider the 2141 hourly temperature recordings for the month of November, each represented as a jittered point in Figure <a href="2-viz.html#fig:nov1">2.15</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:nov1"></span>
-<img src="moderndive_files/figure-html/nov1-1.png" alt="November temperatures represented as points." width="\textwidth" />
+<img src="ModernDive_files/figure-html/nov1-1.png" alt="November temperatures represented as jittered points." width="\textwidth" />
 <p class="caption">
-FIGURE 2.15: November temperatures represented as points.
+FIGURE 2.15: November temperatures represented as jittered points.
 </p>
 </div>
-<p>These 2141 observations have the following five-number summary:</p>
+<p>These 2141 observations have the following <em>five-number summary</em>:</p>
 <ol style="list-style-type: decimal">
 <li>Minimum: 21°F</li>
-<li>First quartile AKA 25<sup>th</sup> percentile: 36°F</li>
-<li>Median AKA second quartile AKA 50<sup>th</sup> percentile: 45°F</li>
-<li>Third quartile AKA 75<sup>th</sup> percentile: 52°F</li>
+<li>First quartile (25th percentile): 36°F</li>
+<li>Median (second quartile, 50th percentile): 45°F</li>
+<li>Third quartile (75th percentile): 52°F</li>
 <li>Maximum: 71°F</li>
 </ol>
-<p>In the left-most plot of Figure <a href="2-viz.html#fig:nov2">2.16</a>, let’s mark these 5 values with dashed horizontal lines on top of the 2141 points. In the middle plot of Figure <a href="2-viz.html#fig:nov2">2.16</a> let’s add the <em>boxplot</em>. In the right-most plot of Figure <a href="2-viz.html#fig:nov2">2.16</a>, let’s remove the points and the dashed horizontal lines for clarity’s sake.</p>
+<p>In the leftmost plot of Figure <a href="2-viz.html#fig:nov2">2.16</a>, let’s mark these 5 values with dashed horizontal lines on top of the 2141 points. In the middle plot of Figure <a href="2-viz.html#fig:nov2">2.16</a> let’s add the <em>boxplot</em>. In the rightmost plot of Figure <a href="2-viz.html#fig:nov2">2.16</a>, let’s remove the points and the dashed horizontal lines for clarity’s sake.</p>
 <div class="figure" style="text-align: center"><span id="fig:nov2"></span>
-<img src="moderndive_files/figure-html/nov2-1.png" alt="Building up a boxplot of November temperatures." width="\textwidth" />
+<img src="ModernDive_files/figure-html/nov2-1.png" alt="Building up a boxplot of November temperatures." width="\textwidth" />
 <p class="caption">
 FIGURE 2.16: Building up a boxplot of November temperatures.
 </p>
 </div>
 <p>What the boxplot does is visually summarize the 2141 points by cutting the 2141 temperature recordings into <em>quartiles</em> at the dashed lines, where each quartile contains roughly 2141 <span class="math inline">\(\div\)</span> 4 <span class="math inline">\(\approx\)</span> 535 observations. Thus:</p>
 <ol style="list-style-type: decimal">
-<li>25% of points fall below the bottom edge of the box, which is the first quartile of 36°F. In other words 25% of observations were colder than 36°F.</li>
-<li>25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 45°F. In other words 25% of observations were between 36 and 45°F and 50% of observations were colder than 45°F.</li>
-<li>25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 52°F. In other words 25% of observations were between 45 and 52°F and 75% of observations were colder than 52°F.</li>
-<li>25% of points fall above the top edge of the box. In other words 25% of observations were warmer than 52°F.</li>
-<li>The middle 50% of points lie within the <em>interquartile range (IQR)</em>  between the first and third quartile of 52 - 36 = 16°F. The interquartile range is a measure of a numerical variable’s <em>spread</em>.</li>
+<li>25% of points fall below the bottom edge of the box, which is the first quartile of 36°F. In other words, 25% of observations were below 36°F.</li>
+<li>25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 45°F. Thus, 25% of observations were between 36°F and 45°F and 50% of observations were below 45°F.</li>
+<li>25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 52°F. It follows that 25% of observations were between 45°F and 52°F and 75% of observations were below 52°F.</li>
+<li>25% of points fall above the top edge of the box. In other words, 25% of observations were above 52°F.</li>
+<li>The middle 50% of points lie within the <em>interquartile range (IQR)</em> between the first and third quartiles. Thus, the IQR for this example is 52 - 36 = 16°F. The interquartile range is a measure of a numerical variable’s <em>spread</em>.</li>
 </ol>
-<p>Furthermore, in the right-most plot of Figure <a href="2-viz.html#fig:nov2">2.16</a>, we see the <em>whiskers</em>  of the boxplot. The whiskers stick out from either end of the box all the way to the minimum and maximum observed temperatures of 21°F and 71°F respectively. However, the whiskers don’t always extend to the smallest and largest observed values as they do here. They in fact extend no more than 1.5 <span class="math inline">\(\times\)</span> the interquartile range from either end of the box. In this case of the November temperatures, no more than 1.5 <span class="math inline">\(\times\)</span> 16°F = 24°F from either end of the box. Any observed values outside this range get marked with points called <em>outliers</em>, which we’ll see in the next section.</p>
+<p>Furthermore, in the rightmost plot of Figure <a href="2-viz.html#fig:nov2">2.16</a>, we see the <em>whiskers</em> of the boxplot. The whiskers stick out from either end of the box all the way to the minimum and maximum observed temperatures of 21°F and 71°F, respectively. However, the whiskers don’t always extend to the smallest and largest observed values as they do here. They in fact extend no more than 1.5 <span class="math inline">\(\times\)</span> the interquartile range from either end of the box. In the case of the November temperatures, this is no more than 1.5 <span class="math inline">\(\times\)</span> 16°F = 24°F from either end of the box, that is, no lower than 36 - 24 = 12°F and no higher than 52 + 24 = 76°F. Since the minimum of 21°F and the maximum of 71°F both fall within these limits, the whiskers reach all the way to them. Any observed values outside this range get marked with points called <em>outliers</em>, which we’ll see in the next section.</p>
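+<p>As a rough sketch of how the numbers in this section could be reproduced, the following base R code pulls out the November temperatures and computes the five-number summary, the IQR, and the furthest values the whiskers could reach. It assumes the <code>weather</code> data frame from the <code>nycflights13</code> package; the object names <code>nov_temp</code>, <code>iqr</code>, <code>lower_limit</code>, and <code>upper_limit</code> are hypothetical and chosen only for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(nycflights13)
+
+# Hourly temperatures for November only (hypothetical object name)
+nov_temp &lt;- weather$temp[weather$month == 11]
+
+# Minimum, first quartile, median, third quartile, and maximum
+quantile(nov_temp, na.rm = TRUE)
+
+# Interquartile range: third quartile minus first quartile
+iqr &lt;- IQR(nov_temp, na.rm = TRUE)
+
+# Furthest values the whiskers could extend to on either side of the box
+lower_limit &lt;- quantile(nov_temp, 0.25, na.rm = TRUE) - 1.5 * iqr
+upper_limit &lt;- quantile(nov_temp, 0.75, na.rm = TRUE) + 1.5 * iqr
+c(lower_limit, upper_limit)</code></pre>
+<p>The values returned may differ slightly from the rounded ones quoted in the text. Any observation below <code>lower_limit</code> or above <code>upper_limit</code> would be drawn as an outlier point rather than covered by a whisker.</p>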
 <div id="geomboxplot" class="section level3">
-<h3><span class="header-section-number">2.7.1</span> Boxplots via geom_boxplot</h3>
+<h3><span class="header-section-number">2.7.1</span> Boxplots via <code>geom_boxplot</code></h3>
 <p>Let’s now create a side-by-side boxplot  of hourly temperatures split by the 12 months as we did previously with the faceted histograms. We do this by mapping the <code>month</code> variable to the x-position aesthetic, the <code>temp</code> variable to the y-position aesthetic, and by adding a <code>geom_boxplot()</code> layer:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> month, <span class="dt">y =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_boxplot</span>()</code></pre>
+<div class="sourceCode" id="cb32"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb32-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> month, <span class="dt">y =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb32-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:badbox"></span>
-<img src="moderndive_files/figure-html/badbox-1.png" alt="Invalid boxplot specification." width="\textwidth" />
+<img src="ModernDive_files/figure-html/badbox-1.png" alt="Invalid boxplot specification." width="\textwidth" />
 <p class="caption">
 FIGURE 2.17: Invalid boxplot specification.
 </p>
@@ -1223,24 +1185,24 @@ <h3><span class="header-section-number">2.7.1</span> Boxplots via geom_boxplot</
 <pre><code>Warning messages:
 1: Continuous x aesthetic -- did you forget aes(group=...)? 
 2: Removed 1 rows containing non-finite values (stat_boxplot). </code></pre>
-<p>Observe in Figure <a href="2-viz.html#fig:badbox">2.17</a> that this plot does not provide information about temperature separated by month. The first warning message clues us in as to why. It is telling us that we have a “continuous”, or numerical variable, on the x-position aesthetic. Boxplots however require a categorical variable to be mapped to the x-position aesthetic. The second warning message is identical to the warning message when plotting a histogram of hourly temperatures: that one of the values was recorded as <code>NA</code> missing.</p>
-<p>We can convert the numerical variable <code>month</code> into a categorical variable by using the <code>factor()</code>  function. So after applying <code>factor(month)</code>, month goes from having numerical values 1, 2, …, and 12 to having labels “1”, “2”, …, and “12.”</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> <span class="kw">factor</span>(month), <span class="dt">y =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_boxplot</span>()</code></pre>
+<p>Observe in Figure <a href="2-viz.html#fig:badbox">2.17</a> that this plot does not provide information about temperature separated by month. The first warning message clues us in as to why. It is telling us that we have a “continuous”, or numerical variable, on the x-position aesthetic. Boxplots, however, require a categorical variable to be mapped to the x-position aesthetic. The second warning message is identical to the warning message when plotting a histogram of hourly temperatures: that one of the values was recorded as <code>NA</code> missing.</p>
+<p>We can convert the numerical variable <code>month</code> into a <code>factor</code> categorical variable by using the <code>factor()</code>  function. So after applying <code>factor(month)</code>, month goes from having just the numerical values 1, 2, …, and 12 to having an associated ordering. With this ordering, <code>ggplot()</code> now knows how to work with this variable to produce the needed plot.</p>
+<div class="sourceCode" id="cb34"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb34-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> <span class="kw">factor</span>(month), <span class="dt">y =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb34-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:monthtempbox"></span>
-<img src="moderndive_files/figure-html/monthtempbox-1.png" alt="Side-by-side boxplot of temperature split by month." width="\textwidth" />
+<img src="ModernDive_files/figure-html/monthtempbox-1.png" alt="Side-by-side boxplot of temperature split by month." width="\textwidth" />
 <p class="caption">
 FIGURE 2.18: Side-by-side boxplot of temperature split by month.
 </p>
 </div>
-<p>The resulting Figure <a href="2-viz.html#fig:monthtempbox">2.18</a> shows 12 separate “box and whiskers” plots similar to the right-most plot of Figure <a href="2-viz.html#fig:nov2">2.16</a> focusing only on November:</p>
+<p>The resulting Figure <a href="2-viz.html#fig:monthtempbox">2.18</a> shows 12 separate “box and whiskers” plots similar to the rightmost plot of Figure <a href="2-viz.html#fig:nov2">2.16</a> of only November temperatures. Thus the different boxplots are shown “side-by-side.”</p>
 <ul>
-<li>The “box” portions of the visualization represent the 1<sup>st</sup> quartile, the median AKA the 2<sup>nd</sup> quartile, and the 3<sup>rd</sup> quartile.</li>
-<li>The height of each box, i.e. the value of the 3<sup>rd</sup> quartile minus the value of the 1<sup>st</sup> quartile, is the interquartile range (IQR). It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability.</li>
-<li>The “whisker” portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25<sup>th</sup> percentile and greater than the 75<sup>th</sup> percentiles respectively. They’re set to extend out no more than <span class="math inline">\(1.5 \times IQR\)</span> units away from either end of the boxes. We say “no more than” because the ends of the whiskers have to correspond to observed temperatures. The length of these whiskers show how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.</li>
-<li>The dots representing values falling outside the whiskers are called  <em>outliers</em>. These can be thought of as anomalous values.</li>
+<li>The “box” portions of the visualization represent the 1st quartile, the median (the 2nd quartile), and the 3rd quartile.</li>
+<li>The height of each box (the value of the 3rd quartile minus the value of the 1st quartile) is the interquartile range (IQR). It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability.</li>
+<li>The “whisker” portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentile, respectively. They’re set to extend out no more than <span class="math inline">\(1.5 \times IQR\)</span> units away from either end of the boxes. We say “no more than” because the ends of the whiskers have to correspond to observed temperatures. The lengths of these whiskers show how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.</li>
+<li>The dots representing values falling outside the whiskers are called  <em>outliers</em>. These can be thought of as anomalous (“out-of-the-ordinary”) values.</li>
 </ul>
-<p>It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, they are defined by the length of the whiskers, which are no more than <span class="math inline">\(1.5 \times IQR\)</span> units long. Looking at this plot we can see, as expected, that summer months (6 through 8) have higher median temperatures as evidenced by the higher solid lines in the middle of the boxes. We can easily compare temperatures across months by drawing imaginary horizontal lines across the plot. Furthermore, the height of the 12 boxes as quantified by the interquartile ranges are informative too; they tell us about variability, or spread, of temperatures recorded in a given month.</p>
+<p>It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, they are defined by the length of the whiskers, which are no more than <span class="math inline">\(1.5 \times IQR\)</span> units long for each boxplot. Looking at this side-by-side plot we can see, as expected, that summer months (6 through 8) have higher median temperatures as evidenced by the higher solid lines in the middle of the boxes. We can easily compare temperatures across months by drawing imaginary horizontal lines across the plot. Furthermore, the heights of the 12 boxes as quantified by the interquartile ranges are informative too; they tell us about variability, or spread, of temperatures recorded in a given month.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1248,7 +1210,7 @@ <h3><span class="header-section-number">2.7.1</span> Boxplots via geom_boxplot</
 </div>
 <p><strong>(LC2.22)</strong> What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.</p>
 <p><strong>(LC2.23)</strong> Which months have the highest variability in temperature? What reasons can you give for this?</p>
-<p><strong>(LC2.24)</strong> We looked at the distribution of the numerical variable <code>temp</code> split by the numerical variable <code>month</code> that we converted to a categorical variable using the <code>factor()</code> function. Why would a boxplot of <code>temp</code> split by the numerical variable <code>pressure</code> similarly converted to a categorical variable using the <code>factor()</code> not be informative?</p>
+<p><strong>(LC2.24)</strong> We looked at the distribution of the numerical variable <code>temp</code> split by the numerical variable <code>month</code> that we converted using the <code>factor()</code> function in order to make a side-by-side boxplot. Why would a boxplot of <code>temp</code> split by the numerical variable <code>pressure</code>, similarly converted to a categorical variable using the <code>factor()</code> function, not be informative?</p>
 <p><strong>(LC2.25)</strong> Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?</p>
 <div class="learncheck">
 
@@ -1256,20 +1218,21 @@ <h3><span class="header-section-number">2.7.1</span> Boxplots via geom_boxplot</
 </div>
 <div id="summary-3" class="section level3">
 <h3><span class="header-section-number">2.7.2</span> Summary</h3>
-<p>Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by looking at the solid line in the center of the boxes. To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points.</p>
+<p>Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes.</p>
+<p>To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points.</p>
 </div>
 </div>
 <div id="geombar" class="section level2">
 <h2><span class="header-section-number">2.8</span> 5NG#5: Barplots</h2>
-<p>Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another common task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories of a categorical variable, also known as the  <em>levels</em> of the categorical variable. Often the best way to visualize these different counts, also known as  <em>frequencies</em>, is with barplots (also called barcharts).</p>
+<p>Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another commonly desired task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories within a categorical variable, also known as the  <em>levels</em> of the categorical variable. Often the best way to visualize these different counts, also known as  <em>frequencies</em>, is with barplots (also called barcharts).</p>
 <p>One complication, however, is how your data is represented. Is the categorical variable of interest “pre-counted” or not? For example, run the following code that manually creates two data frames representing a collection of fruit: 3 apples and 2 oranges.</p>
-<pre class="sourceCode r"><code class="sourceCode r">fruits &lt;-<span class="st"> </span><span class="kw">tibble</span>(
-  <span class="dt">fruit =</span> <span class="kw">c</span>(<span class="st">&quot;apple&quot;</span>, <span class="st">&quot;apple&quot;</span>, <span class="st">&quot;orange&quot;</span>, <span class="st">&quot;apple&quot;</span>, <span class="st">&quot;orange&quot;</span>)
-)
-fruits_counted &lt;-<span class="st"> </span><span class="kw">tibble</span>(
-  <span class="dt">fruit =</span> <span class="kw">c</span>(<span class="st">&quot;apple&quot;</span>, <span class="st">&quot;orange&quot;</span>),
-  <span class="dt">number =</span> <span class="kw">c</span>(<span class="dv">3</span>, <span class="dv">2</span>)
-)</code></pre>
+<div class="sourceCode" id="cb35"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb35-1" data-line-number="1">fruits &lt;-<span class="st"> </span><span class="kw">tibble</span>(</a>
+<a class="sourceLine" id="cb35-2" data-line-number="2">  <span class="dt">fruit =</span> <span class="kw">c</span>(<span class="st">&quot;apple&quot;</span>, <span class="st">&quot;apple&quot;</span>, <span class="st">&quot;orange&quot;</span>, <span class="st">&quot;apple&quot;</span>, <span class="st">&quot;orange&quot;</span>)</a>
+<a class="sourceLine" id="cb35-3" data-line-number="3">)</a>
+<a class="sourceLine" id="cb35-4" data-line-number="4">fruits_counted &lt;-<span class="st"> </span><span class="kw">tibble</span>(</a>
+<a class="sourceLine" id="cb35-5" data-line-number="5">  <span class="dt">fruit =</span> <span class="kw">c</span>(<span class="st">&quot;apple&quot;</span>, <span class="st">&quot;orange&quot;</span>),</a>
+<a class="sourceLine" id="cb35-6" data-line-number="6">  <span class="dt">number =</span> <span class="kw">c</span>(<span class="dv">3</span>, <span class="dv">2</span>)</a>
+<a class="sourceLine" id="cb35-7" data-line-number="7">)</a></code></pre></div>
 <p>We see both the <code>fruits</code> and <code>fruits_counted</code> data frames represent the same collection of fruit. Whereas <code>fruits</code> just lists the fruit individually…</p>
 <pre><code># A tibble: 5 x 1
   fruit 
@@ -1287,45 +1250,44 @@ <h2><span class="header-section-number">2.8</span> 5NG#5: Barplots</h2>
 2 orange      2</code></pre>
 <p>Depending on how your categorical data is represented, you’ll need to add a different <code>geom</code>etric layer type to your <code>ggplot()</code> to create a barplot, as we now explore.</p>
 <div id="barplots-via-geom_bar-or-geom_col" class="section level3">
-<h3><span class="header-section-number">2.8.1</span> Barplots via geom_bar or geom_col</h3>
-<p>Let’s generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the <code>fruits</code> data frame where all 5 fruit are listed individually in 5 rows, we map the <code>fruit</code> variable to the x-position aesthetic and add a  <code>geom_bar()</code> layer:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> fruits, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> fruit)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
+<h3><span class="header-section-number">2.8.1</span> Barplots via <code>geom_bar</code> or <code>geom_col</code></h3>
+<p>Let’s generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the <code>fruits</code> data frame where all 5 fruits are listed individually in 5 rows, we map the <code>fruit</code> variable to the x-position aesthetic and add a  <code>geom_bar()</code> layer:</p>
+<div class="sourceCode" id="cb38"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb38-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> fruits, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> fruit)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb38-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:geombar"></span>
-<img src="moderndive_files/figure-html/geombar-1.png" alt="Barplot when counts are not pre-counted." width="\textwidth" />
+<img src="ModernDive_files/figure-html/geombar-1.png" alt="Barplot when counts are not pre-counted." width="\textwidth" />
 <p class="caption">
 FIGURE 2.19: Barplot when counts are not pre-counted.
 </p>
 </div>
-<p>However, using the <code>fruits_counted</code> data frame where the fruit have been “pre-counted”, we once again map the <code>fruit</code> variable to the x-position aesthetic, but here we also map the <code>count</code> variable to the y-position aesthetic, and add a  <code>geom_col()</code> layer instead.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> fruits_counted, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> fruit, <span class="dt">y =</span> number)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_col</span>()</code></pre>
+<p>However, using the <code>fruits_counted</code> data frame where the fruits have been “pre-counted”, we once again map the <code>fruit</code> variable to the x-position aesthetic, but here we also map the <code>number</code> variable to the y-position aesthetic, and add a  <code>geom_col()</code> layer instead.</p>
+<div class="sourceCode" id="cb39"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb39-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> fruits_counted, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> fruit, <span class="dt">y =</span> number)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb39-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_col</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:geomcol"></span>
-<img src="moderndive_files/figure-html/geomcol-1.png" alt="Barplot when counts are pre-counted." width="\textwidth" />
+<img src="ModernDive_files/figure-html/geomcol-1.png" alt="Barplot when counts are pre-counted." width="\textwidth" />
 <p class="caption">
 FIGURE 2.20: Barplot when counts are pre-counted.
 </p>
 </div>
-<p>Compare the barplots in Figures <a href="2-viz.html#fig:geombar">2.19</a> and <a href="2-viz.html#fig:geomcol">2.20</a>. They are identical because they reflect counts of the same five fruit. However depending on how our categorical data is represented, either “pre-counted” or not, we must add a different <code>geom</code> layer. When the categorical variable whose distribution you want to visualize</p>
+<p>Compare the barplots in Figures <a href="2-viz.html#fig:geombar">2.19</a> and <a href="2-viz.html#fig:geomcol">2.20</a>. They are identical because they reflect counts of the same five fruits. However, depending on how our categorical data is represented, either “pre-counted” or not, we must add a different <code>geom</code> layer. When the categorical variable whose distribution you want to visualize</p>
 <ul>
 <li>Is <em>not</em> pre-counted in your data frame, we use <code>geom_bar()</code>.</li>
 <li>Is pre-counted in your data frame, we use <code>geom_col()</code> with the y-position aesthetic mapped to the variable that has the counts.</li>
 </ul>
-<p>Let’s now go back to the <code>flights</code> data frame in the <code>nycflights13</code> package and visualize the distribution of the categorical variable <code>carrier</code>. In other words, let’s visualize the number of domestic flights out New York City each airline company flew in 2013. Recall from Section <a href="1-getting-started.html#exploredataframes">1.4.3</a> when you first explored the <code>flights</code> data frame you saw that each row corresponds to a flight. In other words the <code>flights</code> data frame is more like the <code>fruits</code> data frame than the <code>fruits_counted</code> data frame because the flights have not been pre-counted by <code>carrier</code>. Thus we should use <code>geom_bar()</code> instead of <code>geom_col()</code> to create a barplot. Much like a <code>geom_histogram()</code>, there is only one variable in the <code>aes()</code> aesthetic mapping: the variable <code>carrier</code> gets mapped to the <code>x</code>-position.</p>
+<p>Let’s now go back to the <code>flights</code> data frame in the <code>nycflights13</code> package and visualize the distribution of the categorical variable <code>carrier</code>. In other words, let’s visualize the number of domestic flights out of New York City each airline company flew in 2013. Recall from Subsection <a href="1-getting-started.html#exploredataframes">1.4.3</a> when you first explored the <code>flights</code> data frame, you saw that each row corresponds to a flight. In other words, the <code>flights</code> data frame is more like the <code>fruits</code> data frame than the <code>fruits_counted</code> data frame because the flights have not been pre-counted by <code>carrier</code>. Thus we should use <code>geom_bar()</code> instead of <code>geom_col()</code> to create a barplot. Much like a <code>geom_histogram()</code>, there is only one variable in the <code>aes()</code> aesthetic mapping: the variable <code>carrier</code> gets mapped to the <code>x</code>-position. One difference, though, is that histograms have bars that touch, whereas barplots have white space between the bars going from left to right.</p>
 
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
+<div class="sourceCode" id="cb40"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb40-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb40-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:flightsbar"></span>
-<img src="moderndive_files/figure-html/flightsbar-1.png" alt="Number of flights departing NYC in 2013 by airline using geom_bar()." width="\textwidth" />
+<img src="ModernDive_files/figure-html/flightsbar-1.png" alt="Number of flights departing NYC in 2013 by airline using geom_bar()." width="\textwidth" />
 <p class="caption">
 FIGURE 2.21: Number of flights departing NYC in 2013 by airline using geom_bar().
 </p>
 </div>
-<p>Observe in Figure <a href="2-viz.html#fig:flightsbar">2.21</a> that United Air Lines (UA), JetBlue Airways (B6), and ExpressJet Airlines (EV) had the most flights depart New York City in 2013. If you don’t know which airlines correspond to which carrier codes, then run <code>View(airlines)</code> to see a directory of airlines. For example: AA is American Airlines; B6 is JetBlue Airways; DL is Delta Airlines; EV is ExpressJet Airlines; MQ is Envoy Air; while UA is United Airlines.</p>
-<p>Alternatively, say you had a data frame <code>flights_counted</code> where the number of flights for each <code>carrier</code> was pre-counted like in Table <a href="2-viz.html#tab:flights-counted">2.3</a>.</p>
+<p>Observe in Figure <a href="2-viz.html#fig:flightsbar">2.21</a> that United Airlines (UA), JetBlue Airways (B6), and ExpressJet Airlines (EV) had the most flights depart NYC in 2013. If you don’t know which airlines correspond to which carrier codes, then run <code>View(airlines)</code> to see a directory of airlines. For example, B6 is JetBlue Airways. Alternatively, say you had a data frame where the number of flights for each <code>carrier</code> was pre-counted as in Table <a href="2-viz.html#tab:flights-counted">2.3</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:flights-counted">TABLE 2.3: </span>Number of flights pre-counted for each carrier.
+<span id="tab:flights-counted">TABLE 2.3: </span>Number of flights pre-counted for each carrier
 </caption>
 <thead>
 <tr>
@@ -1468,33 +1430,31 @@ <h3><span class="header-section-number">2.8.1</span> Barplots via geom_bar or ge
 </tr>
 </tbody>
 </table>
-<p>In order to create a barplot visualizing the distribution of the categorical variable <code>carrier</code> in this case, we would use <code>geom_col()</code> instead with <code>x</code> mapped to <code>carrier</code> and <code>y</code> mapped to <code>number</code> as shown in what follows. The resulting barplot would be identical to Figure <a href="2-viz.html#fig:flightsbar">2.21</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights_table, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">y =</span> number)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_col</span>()</code></pre>
+<p>In order to create a barplot visualizing the distribution of the categorical variable <code>carrier</code> in this case, we would now use <code>geom_col()</code> instead of <code>geom_bar()</code>, with an additional <code>y = number</code> in the aesthetic mapping on top of the <code>x = carrier</code>. The resulting barplot would be identical to Figure <a href="2-viz.html#fig:flightsbar">2.21</a>.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC2.26)</strong> Why are histograms inappropriate for visualizing categorical variables?</p>
+<p><strong>(LC2.26)</strong> Why are histograms inappropriate for categorical variables?</p>
 <p><strong>(LC2.27)</strong> What is the difference between histograms and barplots?</p>
 <p><strong>(LC2.28)</strong> How many Envoy Air flights departed NYC in 2013?</p>
-<p><strong>(LC2.29)</strong> What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?</p>
+<p><strong>(LC2.29)</strong> What was the 7th highest airline for departed flights from NYC in 2013? How could we better present the table to get this answer quickly?</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="must-avoid-pie-charts" class="section level3">
 <h3><span class="header-section-number">2.8.2</span> Must avoid pie charts!</h3>
-<p>One of the most common plots used to visualize the distribution of categorical data is the  pie chart. While they may seem harmless enough, they actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book “Creating More Effective Graphs” <span class="citation">(Robbins <a href="#ref-robbins2013">2013</a>)</span>, we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another.</p>
+<p>One of the most common plots used to visualize the distribution of categorical data is the  pie chart. While they may seem harmless enough, pie charts actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book, <em>Creating More Effective Graphs</em> <span class="citation">(Robbins <a href="#ref-robbins2013">2013</a>)</span>, we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another.</p>
 <p>Let’s examine the same data used in our previous barplot of the number of flights departing NYC by airline in Figure <a href="2-viz.html#fig:flightsbar">2.21</a>, but this time we will use a pie chart in Figure <a href="2-viz.html#fig:carrierpie">2.22</a>. Try to answer the following questions:</p>
 <ul>
-<li>How much larger is the portion of the pie for ExpressJet Airlines (<code>EV</code>) compared to US Airways (<code>US</code>),</li>
-<li>What is the third largest carrier in terms of departing flights, and</li>
+<li>How much larger is the portion of the pie for ExpressJet Airlines (<code>EV</code>) compared to US Airways (<code>US</code>)?</li>
+<li>What is the third largest carrier in terms of departing flights?</li>
 <li>How many carriers have fewer flights than United Airlines (<code>UA</code>)?</li>
 </ul>
 <div class="figure" style="text-align: center"><span id="fig:carrierpie"></span>
-<img src="moderndive_files/figure-html/carrierpie-1.png" alt="The dreaded pie chart." width="75%" />
+<img src="ModernDive_files/figure-html/carrierpie-1.png" alt="The dreaded pie chart." width="\textwidth" />
 <p class="caption">
 FIGURE 2.22: The dreaded pie chart.
 </p>
@@ -1513,48 +1473,58 @@ <h3><span class="header-section-number">2.8.2</span> Must avoid pie charts!</h3>
 </div>
 <div id="two-categ-barplot" class="section level3">
 <h3><span class="header-section-number">2.8.3</span> Two categorical variables</h3>
-<p>Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the <em>joint</em> distribution of two categorical variables at the same time. Let’s examine the <em>joint</em> distribution of outgoing domestic flights from NYC by <code>carrier</code> as well as <code>origin</code>. In other words, the number of flights for each <code>carrier</code> and <code>origin</code> combination. For example, the number of WestJet flights from <code>JFK</code>, the number of WestJet flights from <code>LGA</code>, the number of WestJet flights from <code>EWR</code>, the number of American Airlines flights from <code>JFK</code>, and so on. Recall the <code>ggplot()</code> code that created the barplot of <code>carrier</code> frequency in Figure <a href="2-viz.html#fig:flightsbar">2.21</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
-<p>We can now map the additional variable <code>origin</code> by adding a <code>fill = origin</code> inside the <code>aes()</code> aesthetic mapping; the <code>fill</code> aesthetic of any bar corresponds to the color used to fill the bars.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">fill =</span> origin)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
+<p>Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the <em>joint</em> distribution of two categorical variables at the same time. Let’s examine the <em>joint</em> distribution of outgoing domestic flights from NYC by <code>carrier</code> as well as <code>origin</code>. In other words, the number of flights for each <code>carrier</code> and <code>origin</code> combination.</p>
+<p>For example, the number of WestJet flights from <code>JFK</code>, the number of WestJet flights from <code>LGA</code>, the number of WestJet flights from <code>EWR</code>, the number of American Airlines flights from <code>JFK</code>, and so on. Recall the <code>ggplot()</code> code that created the barplot of <code>carrier</code> frequency in Figure <a href="2-viz.html#fig:flightsbar">2.21</a>:</p>
+<div class="sourceCode" id="cb41"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb41-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb41-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
+<p>We can now map the additional variable <code>origin</code> by adding a <code>fill = origin</code> inside the <code>aes()</code> aesthetic mapping.</p>
+<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb42-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">fill =</span> origin)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb42-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:flights-stacked-bar"></span>
-<img src="moderndive_files/figure-html/flights-stacked-bar-1.png" alt="Stacked barplot comparing the number of flights by carrier and origin." width="\textwidth" />
+<img src="ModernDive_files/figure-html/flights-stacked-bar-1.png" alt="Stacked barplot of the number of flights by carrier and origin." width="\textwidth" />
 <p class="caption">
-FIGURE 2.23: Stacked barplot comparing the number of flights by carrier and origin.
+FIGURE 2.23: Stacked barplot of the number of flights by carrier and origin.
 </p>
 </div>
 <p>Figure <a href="2-viz.html#fig:flights-stacked-bar">2.23</a> is an example of a  <em>stacked barplot</em>. While simple to make, in certain aspects it is not ideal. For example, it is difficult to compare the heights of the different colors between the bars, corresponding to comparing the number of flights from each <code>origin</code> airport between the carriers.</p>
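+<p>The counts behind the stacked barplot can also be tabulated directly, which makes the <code>carrier</code> and <code>origin</code> combinations easier to compare than the stacked colors do. The following is a minimal sketch, assuming the <code>dplyr</code> and <code>nycflights13</code> packages are installed; it uses the pipe operator <code>%&gt;%</code> and the <code>count()</code> function, neither of which has been formally introduced at this point in the book, and the name <code>carrier_origin_counts</code> is only illustrative.</p>

```r
library(dplyr)
library(nycflights13)

# Tally the flights for every carrier-origin combination.
# count() stores each tally in a column named n.
carrier_origin_counts <- flights %>%
  count(carrier, origin)

carrier_origin_counts
```

+<p>Each row of the result corresponds to one colored segment of one bar in the stacked barplot.</p>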
-<p>Before we continue, let’s address some common points of confusion among new R users. First, note that <code>fill</code> is another aesthetic mapping much like <code>x</code>-position; thus we were careful to include it within the parentheses of the <code>aes()</code> mapping. The following code, where the <code>fill</code> aesthetic is specified outside the <code>aes()</code> mapping will yield an error. This is a fairly common error that new <code>ggplot</code> users make:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier), <span class="dt">fill =</span> origin) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
-<p>Second, the <code>fill</code> aesthetic corresponds to the color used to fill the bars, while the <code>color</code> aesthetic corresponds to the color of the outline of the bars. This is identical to how we added color to our histogram in Subsection <a href="2-viz.html#geomhistogram">2.5.1</a>: we set the outline of the bars to white by setting <code>color = &quot;white&quot;</code> and the colors of the bars to be blue steel by setting <code>fill = &quot;steelblue&quot;</code>. Observe in Figure <a href="2-viz.html#fig:flights-stacked-bar-color">2.24</a> that mapping <code>origin</code> to <code>color</code> and not <code>fill</code> yields grey bars with different colored outlines.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">color =</span> origin)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
+<p>Before we continue, let’s address some common points of confusion among new R users. First, the <code>fill</code> aesthetic corresponds to the color used to fill the bars, while the <code>color</code> aesthetic corresponds to the color of the outline of the bars. This is identical to how we added color to our histogram in Subsection <a href="2-viz.html#geomhistogram">2.5.1</a>: we set the outline of the bars to white by setting <code>color = &quot;white&quot;</code> and the colors of the bars to blue steel by setting <code>fill = &quot;steelblue&quot;</code>. Observe in Figure <a href="2-viz.html#fig:flights-stacked-bar-color">2.24</a> that mapping <code>origin</code> to <code>color</code> and not <code>fill</code> yields grey bars with different colored outlines.</p>
+<div class="sourceCode" id="cb43"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb43-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">color =</span> origin)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb43-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:flights-stacked-bar-color"></span>
-<img src="moderndive_files/figure-html/flights-stacked-bar-color-1.png" alt="Stacked barplot with color aesthetic used instead of fill." width="\textwidth" />
+<img src="ModernDive_files/figure-html/flights-stacked-bar-color-1.png" alt="Stacked barplot with color aesthetic used instead of fill." width="\textwidth" />
 <p class="caption">
 FIGURE 2.24: Stacked barplot with color aesthetic used instead of fill.
 </p>
 </div>
-<p>An alternative to stacked barplots are  <em>side-by-side barplots</em>, also known as <em>dodged barplots</em>, as seen in Figure <a href="2-viz.html#fig:flights-dodged-bar-color">2.25</a>. The code to create a side-by-side barplot is identical to the code to create a stacked barplot, but with a  <code>position = &quot;dodge&quot;</code> argument added to <code>geom_bar()</code>. In other words, we are overriding the default barplot type, which is a <em>stacked</em> barplot, and specifying it to be a side-by-side barplot.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">fill =</span> origin)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>(<span class="dt">position =</span> <span class="st">&quot;dodge&quot;</span>)</code></pre>
+<p>Second, note that <code>fill</code> is another aesthetic mapping much like <code>x</code>-position; thus we were careful to include it within the parentheses of the <code>aes()</code> mapping. The following code, where the <code>fill</code> aesthetic is specified outside the <code>aes()</code> mapping, will yield an error. This is a fairly common mistake among new <code>ggplot</code> users:</p>
+<div class="sourceCode" id="cb44"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb44-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier), <span class="dt">fill =</span> origin) <span class="op">+</span></a>
+<a class="sourceLine" id="cb44-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
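+<p>To contrast <em>mapping</em> an aesthetic to a variable with <em>setting</em> it to a fixed value, consider the following sketch. It assumes the <code>ggplot2</code> and <code>nycflights13</code> packages are loaded; the object names <code>p_mapped</code> and <code>p_set</code> are only illustrative. A variable such as <code>origin</code> belongs inside <code>aes()</code>, while a constant such as <code>&quot;steelblue&quot;</code> is passed directly to the <code>geom</code>, as we did for the histogram in Subsection <a href="2-viz.html#geomhistogram">2.5.1</a>.</p>

```r
library(ggplot2)
library(nycflights13)

# Mapping: the fill color varies with the origin variable,
# so fill goes inside aes().
p_mapped <- ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar()

# Setting: every bar gets the same fixed colors,
# so color and fill are arguments to geom_bar(), outside aes().
p_set <- ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar(color = "white", fill = "steelblue")

p_mapped
p_set
```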
+<p>An alternative to the stacked barplot is the <em>side-by-side barplot</em>, also known as the <em>dodged barplot</em>, as seen in Figure <a href="2-viz.html#fig:flights-dodged-bar-color">2.25</a>. The code to create a side-by-side barplot is identical to the code to create a stacked barplot, but with a <code>position = &quot;dodge&quot;</code> argument added to <code>geom_bar()</code>. In other words, we are overriding the default barplot type, which is a <em>stacked</em> barplot, and specifying it to be a side-by-side barplot instead.</p>
+<div class="sourceCode" id="cb45"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb45-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">fill =</span> origin)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb45-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>(<span class="dt">position =</span> <span class="st">&quot;dodge&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:flights-dodged-bar-color"></span>
-<img src="moderndive_files/figure-html/flights-dodged-bar-color-1.png" alt="Side-by-side barplot comparing number of flights by carrier and origin." width="\textwidth" />
+<img src="ModernDive_files/figure-html/flights-dodged-bar-color-1.png" alt="Side-by-side barplot comparing number of flights by carrier and origin." width="\textwidth" />
 <p class="caption">
 FIGURE 2.25: Side-by-side barplot comparing number of flights by carrier and origin.
 </p>
 </div>
-<p>Lastly, another type of barplot is a  <em>faceted barplot</em>. Recall in Section <a href="2-viz.html#facets">2.6</a> we visualized the distribution of hourly temperatures at the 3 NYC airports <em>split</em> by month using facets. We apply the same principle to our barplot visualizing the frequency of <code>carrier</code> split by <code>origin</code>: instead of mapping <code>origin</code></p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>origin, <span class="dt">ncol =</span> <span class="dv">1</span>)</code></pre>
+<p>Note that the bars for <code>AS</code>, <code>F9</code>, <code>FL</code>, <code>HA</code>, and <code>YV</code> have a different width than the others. We can make one tweak to the <code>position</code> argument to give them the same width as the other bars by using the more robust <code>position_dodge()</code> function.</p>
+<div class="sourceCode" id="cb46"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb46-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">fill =</span> origin)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb46-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>(<span class="dt">position =</span> <span class="kw">position_dodge</span>(<span class="dt">preserve =</span> <span class="st">&quot;single&quot;</span>))</a></code></pre></div>
+<div class="figure" style="text-align: center"><span id="fig:flights-dodged-bar-color-tweak"></span>
+<img src="ModernDive_files/figure-html/flights-dodged-bar-color-tweak-1.png" alt="Side-by-side barplot comparing number of flights by carrier and origin (with formatting tweak)." width="\textwidth" />
+<p class="caption">
+FIGURE 2.26: Side-by-side barplot comparing number of flights by carrier and origin (with formatting tweak).
+</p>
+</div>
+<p>Lastly, another type of barplot is a <em>faceted barplot</em>. Recall that in Section <a href="2-viz.html#facets">2.6</a> we visualized the distribution of hourly temperatures at the 3 NYC airports <em>split</em> by month using facets. We apply the same principle to our barplot visualizing the frequency of <code>carrier</code> split by <code>origin</code>: instead of mapping <code>origin</code> to <code>fill</code>, we use it as the variable that creates small multiples of the plot across the levels of <code>origin</code>.</p>
+<div class="sourceCode" id="cb47"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb47-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb47-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb47-3" data-line-number="3"><span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>origin, <span class="dt">ncol =</span> <span class="dv">1</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:facet-bar-vert"></span>
-<img src="moderndive_files/figure-html/facet-bar-vert-1.png" alt="Faceted barplot comparing the number of flights by carrier and origin." width="\textwidth" />
+<img src="ModernDive_files/figure-html/facet-bar-vert-1.png" alt="Faceted barplot comparing the number of flights by carrier and origin." width="\textwidth" />
 <p class="caption">
-FIGURE 2.26: Faceted barplot comparing the number of flights by carrier and origin.
+FIGURE 2.27: Faceted barplot comparing the number of flights by carrier and origin.
 </p>
 </div>
 <div class="learncheck">
@@ -1564,8 +1534,8 @@ <h3><span class="header-section-number">2.8.3</span> Two categorical variables</
 </div>
 <p><strong>(LC2.32)</strong> What kinds of questions are not easily answered by looking at Figure <a href="2-viz.html#fig:flights-stacked-bar">2.23</a>?</p>
 <p><strong>(LC2.33)</strong> What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?</p>
-<p><strong>(LC2.34)</strong> Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case?</p>
-<p><strong>(LC2.35)</strong> What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general?</p>
+<p><strong>(LC2.34)</strong> Why might the side-by-side barplot be preferable to a stacked barplot in this case?</p>
+<p><strong>(LC2.35)</strong> What are the disadvantages of using a dodged barplot, in general?</p>
 <p><strong>(LC2.36)</strong> Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?</p>
 <p><strong>(LC2.37)</strong> What information about the different carriers at different airports is more easily seen in the faceted barplot?</p>
 <div class="learncheck">
@@ -1574,14 +1544,14 @@ <h3><span class="header-section-number">2.8.3</span> Two categorical variables</
 </div>
 <div id="summary-4" class="section level3">
 <h3><span class="header-section-number">2.8.4</span> Summary</h3>
-<p>Barplots are a very common way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories (also called <em>levels</em>) occur. They are easy to understand and make it easy to make comparisons across levels. Furthermore, when trying to visualize the relationship of two categorical variables, you have many options: stacked barplots, side-by-side barplots, and faceted barplots. Depending on what aspect of the relationship you are trying to emphasize, you will need to make a choice between these three types of barplots and own that choice.</p>
+<p>Barplots are a common way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories (also called <em>levels</em>) occur. They are easy to understand and make comparisons across levels straightforward. Furthermore, when trying to visualize the relationship between two categorical variables, you have several options: stacked barplots, side-by-side barplots, and faceted barplots. Depending on what aspect of the relationship you are trying to emphasize, you will need to choose among these three types of barplots and own that choice.</p>
 </div>
 </div>
 <div id="conclusion-1" class="section level2">
 <h2><span class="header-section-number">2.9</span> Conclusion</h2>
 <div id="summary-table" class="section level3">
 <h3><span class="header-section-number">2.9.1</span> Summary table</h3>
-<p>Let’s recap all five of the Five Named Graphs (5NG)  in Table <a href="2-viz.html#tab:viz-summary-table">2.4</a> summarizing their differences. Using these 5NG, you’ll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each <code>geom</code>etric object’s <code>aes</code>thetic attribute options, further unlocking the awesome power of the <code>ggplot2</code> package.</p>
+<p>Let’s recap the five named graphs (5NG) in Table <a href="2-viz.html#tab:viz-summary-table">2.4</a>, which summarizes their differences. Using these 5NG, you’ll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each <code>geom</code>etric object’s <code>aes</code>thetic attribute options, further unlocking the awesome power of the <code>ggplot2</code> package.</p>
 <table>
 <caption>
 <span id="tab:viz-summary-table">TABLE 2.4: </span>Summary of Five Named Graphs
@@ -1635,7 +1605,7 @@ <h3><span class="header-section-number">2.9.1</span> Summary table</h3>
 <code>geom_line()</code>
 </td>
 <td style="text-align:left;">
-Used when there is a sequential order to x-variable e.g. time
+Used when there is a sequential order to x-variable, e.g., time
 </td>
 </tr>
 <tr>
@@ -1693,25 +1663,25 @@ <h3><span class="header-section-number">2.9.1</span> Summary table</h3>
 </div>
 <div id="function-argument-specification" class="section level3">
 <h3><span class="header-section-number">2.9.2</span> Function argument specification</h3>
-<p>Let’s go over some important points about specifying the arguments (i.e. inputs) to functions. Run the following two segments of code:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Segment 1:</span>
-<span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()
-
-<span class="co"># Segment 2:</span>
-<span class="kw">ggplot</span>(flights, <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
-<p>You’ll notice that that both code segments create the same barplot, even though in the second segment we omitted the <code>data =</code> and <code>mapping =</code> code argument names. This is because the <code>ggplot()</code> function by default assumes that the <code>data</code> argument comes first and the <code>mapping</code> argument comes second.  So as long as you specify the data frame in question first and the <code>aes()</code> mapping second, you can omit the explicit statement of the argument names <code>data =</code> and <code>mapping =</code>.</p>
-<p>Going forward for the rest of this book, all <code>ggplot()</code> code will be like the second segment: with the <code>data =</code> and <code>mapping =</code> explicit naming of the argument omitted with the default ordering of arguments respected. We’ll do this for brevity’s sake and it’s common to see this style when reviewing the R code of other R users.</p>
+<p>Let’s go over some important points about specifying the arguments (i.e., inputs) to functions. Run the following two segments of code:</p>
+<div class="sourceCode" id="cb48"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb48-1" data-line-number="1"><span class="co"># Segment 1:</span></a>
+<a class="sourceLine" id="cb48-2" data-line-number="2"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb48-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_bar</span>()</a>
+<a class="sourceLine" id="cb48-4" data-line-number="4"></a>
+<a class="sourceLine" id="cb48-5" data-line-number="5"><span class="co"># Segment 2:</span></a>
+<a class="sourceLine" id="cb48-6" data-line-number="6"><span class="kw">ggplot</span>(flights, <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb48-7" data-line-number="7"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
+<p>You’ll notice that both code segments create the same barplot, even though in the second segment we omitted the <code>data =</code> and <code>mapping =</code> code argument names. This is because the <code>ggplot()</code> function by default assumes that the <code>data</code> argument comes first and the <code>mapping</code> argument comes second.  As long as you specify the data frame in question first and the <code>aes()</code> mapping second, you can omit the explicit statement of the argument names <code>data =</code> and <code>mapping =</code>.</p>
+<p>Going forward for the rest of this book, all <code>ggplot()</code> code will be like the second segment: we’ll omit the explicit <code>data =</code> and <code>mapping =</code> argument names and rely on the default ordering of arguments. We’ll do this for brevity’s sake; it’s common to see this style when reviewing other R users’ code.</p>
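+<p>The role of argument names can be illustrated with a third variant, sketched below under the assumption that the <code>ggplot2</code> and <code>nycflights13</code> packages are loaded; the object names <code>p_named</code>, <code>p_positional</code>, and <code>p_reordered</code> are only illustrative. When arguments are named explicitly, their order no longer matters; only when the names are omitted does R match the arguments by position.</p>

```r
library(ggplot2)
library(nycflights13)

# Named arguments in the default order.
p_named <- ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()

# Unnamed arguments: matched by position (data first, mapping second).
p_positional <- ggplot(flights, aes(x = carrier)) +
  geom_bar()

# Named arguments in reversed order: still matched correctly by name.
p_reordered <- ggplot(mapping = aes(x = carrier), data = flights) +
  geom_bar()
```

+<p>All three objects describe the same barplot, since each one ends up with the same <code>data</code> and the same <code>mapping</code>.</p>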
 </div>
 <div id="additional-resources-1" class="section level3">
 <h3><span class="header-section-number">2.9.3</span> Additional resources</h3>
 <p>An R script file of all R code used in this chapter is available <a href="scripts/02-visualization.R">here</a>.</p>
-<p>If you want to further unlock the power of the <code>ggplot2</code> package for data visualization, we suggest that you check out RStudio’s “Data Visualization with ggplot2” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter. In particular it presents many more than the 5 <code>geom</code>etric objects we covered in this chapter while providing quick and easy to read visual descriptions. For all the <code>geom</code>etric objects, it also lists all the possible aesthetic attributes one can tweak. You can access this cheatsheet  by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Visualization with ggplot2.” You can see a preview in the figure below.</p>
+<p>If you want to further unlock the power of the <code>ggplot2</code> package for data visualization, we suggest that you check out RStudio’s “Data Visualization with ggplot2” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter. In particular, it presents many more than the 5 <code>geom</code>etric objects we covered in this chapter while providing quick and easy-to-read visual descriptions. For all the <code>geom</code>etric objects, it also lists all the possible aesthetic attributes one can tweak. In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Visualization with ggplot2.” You can see a preview in the figure below.</p>
 <div class="figure" style="text-align: center"><span id="fig:ggplot-cheatsheet"></span>
 <img src="images/cheatsheets/ggplot_cheatsheet-1.png" alt="Data Visualization with ggplot2 cheatsheet." width="\textwidth" />
 <p class="caption">
-FIGURE 2.27: Data Visualization with ggplot2 cheatsheet.
+FIGURE 2.28: Data Visualization with ggplot2 cheatsheet.
 </p>
 </div>
 <!--
@@ -1728,18 +1698,18 @@ <h3><span class="header-section-number">2.9.3</span> Additional resources</h3>
 <div id="whats-to-come-3" class="section level3">
 <h3><span class="header-section-number">2.9.4</span> What’s to come</h3>
 <p>Recall in Figure <a href="2-viz.html#fig:noalpha">2.2</a> in Section <a href="2-viz.html#scatterplots">2.3</a> we visualized the relationship between departure delay and arrival delay for Alaska Airlines flights. This necessitated paring down the <code>flights</code> data frame to a new data frame <code>alaska_flights</code> consisting of only <code>carrier == AS</code> flights first:</p>
-<pre class="sourceCode r"><code class="sourceCode r">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)
-
-<span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">geom_point</span>()</code></pre>
+<div class="sourceCode" id="cb49"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb49-1" data-line-number="1">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb49-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</a>
+<a class="sourceLine" id="cb49-3" data-line-number="3"></a>
+<a class="sourceLine" id="cb49-4" data-line-number="4"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_delay, <span class="dt">y =</span> arr_delay)) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb49-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_point</span>()</a></code></pre></div>
 <p>Furthermore recall in Figure <a href="2-viz.html#fig:hourlytemp">2.7</a> in Section <a href="2-viz.html#linegraphs">2.4</a> we visualized hourly temperature recordings at Newark airport only for the first 15 days of January 2013. This necessitated paring down the <code>weather</code> data frame to a new data frame <code>early_january_weather</code> consisting of hourly temperature recordings only for <code>origin == &quot;EWR&quot;</code>, <code>month == 1</code>, and day less than or equal to <code>15</code> first:</p>
-<pre class="sourceCode r"><code class="sourceCode r">early_january_weather &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;EWR&quot;</span> <span class="op">&amp;</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span><span class="dv">1</span> <span class="op">&amp;</span><span class="st"> </span>day <span class="op">&lt;=</span><span class="st"> </span><span class="dv">15</span>)
-
-<span class="kw">ggplot</span>(<span class="dt">data =</span> early_january_weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> time_hour, <span class="dt">y =</span> temp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_line</span>()</code></pre>
-<p>These two code segments were a preview of Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling using the <code>dplyr</code> package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, the two code segments used the <code>filter()</code> function to create new data frames (<code>alaska_flights</code> and <code>early_january_weather</code>) by choosing only a subset of rows of existing data frames (<code>flights</code> and <code>weather</code>). In the next chapter, we’ll formally introduce the <code>filter()</code> and other data wrangling functions as well as the <em>pipe operator</em> <code>%&gt;%</code> which allows you to combine multiple data wrangling actions into a single sequential <em>chain</em> of actions. On to Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling!</p>
+<div class="sourceCode" id="cb50"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb50-1" data-line-number="1">early_january_weather &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb50-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;EWR&quot;</span> <span class="op">&amp;</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span><span class="dv">1</span> <span class="op">&amp;</span><span class="st"> </span>day <span class="op">&lt;=</span><span class="st"> </span><span class="dv">15</span>)</a>
+<a class="sourceLine" id="cb50-3" data-line-number="3"></a>
+<a class="sourceLine" id="cb50-4" data-line-number="4"><span class="kw">ggplot</span>(<span class="dt">data =</span> early_january_weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> time_hour, <span class="dt">y =</span> temp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb50-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_line</span>()</a></code></pre></div>
+<p>These two code segments were a preview of Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling using the <code>dplyr</code> package. Data wrangling is the process of transforming and modifying existing data to make it more suitable for analysis. For example, these two code segments used the <code>filter()</code> function to create new data frames (<code>alaska_flights</code> and <code>early_january_weather</code>) by choosing only a subset of the rows of existing data frames (<code>flights</code> and <code>weather</code>). In the next chapter, we’ll formally introduce <code>filter()</code> and other data wrangling functions, as well as the <em>pipe operator</em> <code>%&gt;%</code>, which allows you to combine multiple data wrangling actions into a single sequential <em>chain</em> of actions. On to Chapter <a href="3-wrangling.html#wrangling">3</a> on data wrangling!</p>
 
 </div>
 </div>
@@ -1747,16 +1717,16 @@ <h3><span class="header-section-number">2.9.4</span> What’s to come</h3>
 <h3>References</h3>
 <div id="refs" class="references">
 <div id="ref-rds2016">
-<p>Grolemund, Garrett, and Hadley Wickham. 2016. <em>R for Data Science</em>. <a href="http://r4ds.had.co.nz/">http://r4ds.had.co.nz/</a>.</p>
+<p>Grolemund, Garrett, and Hadley Wickham. 2017. <em>R for Data Science</em>. First. Sebastopol, CA: O’Reilly Media. <a href="https://r4ds.had.co.nz/">https://r4ds.had.co.nz/</a>.</p>
 </div>
 <div id="ref-robbins2013">
-<p>Robbins, Naomi. 2013. <em>Creating More Effective Graphs</em>. Chart House.</p>
+<p>Robbins, Naomi. 2013. <em>Creating More Effective Graphs</em>. First. New York, NY: Chart House.</p>
 </div>
 <div id="ref-R-ggplot2">
 <p>Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. <em>Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics</em>. <a href="https://CRAN.R-project.org/package=ggplot2">https://CRAN.R-project.org/package=ggplot2</a>.</p>
 </div>
 <div id="ref-wilkinson2005">
-<p>Wilkinson, Leland. 2005. <em>The Grammar of Graphics (Statistics and Computing)</em>. Secaucus, NJ, USA: Springer-Verlag New York, Inc.</p>
+<p>Wilkinson, Leland. 2005. <em>The Grammar of Graphics (Statistics and Computing)</em>. First. Secaucus, NJ: Springer-Verlag.</p>
 </div>
 </div>
             </section>
@@ -1770,11 +1740,13 @@ <h3>References</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1782,12 +1754,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1802,6 +1773,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1818,8 +1793,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/3-wrangling.html b/docs/3-wrangling.html
index f8b1845cf..79f3c0375 100644
--- a/docs/3-wrangling.html
+++ b/docs/3-wrangling.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 3 Data Wrangling | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 3 Data Wrangling | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 3 Data Wrangling | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="2-viz.html">
-<link rel="next" href="4-tidy.html">
+<link rel="prev" href="2-viz.html"/>
+<link rel="next" href="4-tidy.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,7 +583,7 @@ <h1>
 </html>
 <div id="wrangling" class="section level1">
 <h1><span class="header-section-number">Chapter 3</span> Data Wrangling</h1>
-<p>So far in our journey, we’ve seen how to look at data saved in data frames using the <code>glimpse()</code> and <code>View()</code> functions in Chapter <a href="1-getting-started.html#getting-started">1</a> and how to create data visualizations using the <code>ggplot2</code> package in Chapter <a href="2-viz.html#viz">2</a>. In particular we studied what we term the “five named graphs” (5NG):</p>
+<p>So far in our journey, we’ve seen how to look at data saved in data frames using the <code>glimpse()</code> and <code>View()</code> functions in Chapter <a href="1-getting-started.html#getting-started">1</a>, and how to create data visualizations using the <code>ggplot2</code> package in Chapter <a href="2-viz.html#viz">2</a>. In particular, we studied what we term the “five named graphs” (5NG):</p>
 <ol style="list-style-type: decimal">
 <li>scatterplots via <code>geom_point()</code></li>
 <li>linegraphs via <code>geom_line()</code></li>
@@ -578,31 +591,31 @@ <h1><span class="header-section-number">Chapter 3</span> Data Wrangling</h1>
 <li>histograms via <code>geom_histogram()</code></li>
 <li>barplots via <code>geom_bar()</code> or <code>geom_col()</code></li>
 </ol>
-<p>We created these visualizations using the “Grammar of Graphics”, which maps variables in a data frame to the aesthetic attributes of one the 5 <code>geom</code>etric objects. We can also control other aesthetic attributes of the geometric objects such as the size and color as seen in the Gapminder data example in Figure <a href="2-viz.html#fig:gapminder">2.1</a>.</p>
-<p>Recall however that for two of our visualizations, we first needed to transform/modify existing data frames a little. For example, recall the scatterplot in Figure <a href="2-viz.html#fig:noalpha">2.2</a> of departure and arrival delay <em>only</em> for Alaska Airlines flights. In order to create this visualization, we first needed to pare down the <code>flights</code> data frame to a smaller data frame <code>alaska_flights</code> consisting of only <code>carrier == &quot;AS&quot;</code> flights. We did this using the <code>filter()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</code></pre>
-<p>In this chapter, we’ll introduce a series of functions from the <code>dplyr</code> package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs. Such functions include:</p>
+<p>We created these visualizations using the grammar of graphics, which maps variables in a data frame to the aesthetic attributes of one of the 5 <code>geom</code>etric objects. We can also control other aesthetic attributes of the geometric objects such as the size and color as seen in the Gapminder data example in Figure <a href="2-viz.html#fig:gapminder">2.1</a>.</p>
+<p>Recall, however, that for two of our visualizations, we first needed to transform/modify existing data frames a little. For example, recall the scatterplot in Figure <a href="2-viz.html#fig:noalpha">2.2</a> of departure and arrival delays <em>only</em> for Alaska Airlines flights. In order to create this visualization, we first needed to pare down the <code>flights</code> data frame to an <code>alaska_flights</code> data frame consisting of only <code>carrier == &quot;AS&quot;</code> flights, so <code>alaska_flights</code> has fewer rows than <code>flights</code>. We did this using the <code>filter()</code> function:</p>
+<div class="sourceCode" id="cb51"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb51-1" data-line-number="1">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb51-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</a></code></pre></div>
+<p>In this chapter, we’ll extend this example and introduce a series of functions from the <code>dplyr</code> package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs. Such functions include:</p>
 <ol style="list-style-type: decimal">
 <li><code>filter()</code> a data frame’s existing rows to only pick out a subset of them. For example, the <code>alaska_flights</code> data frame.</li>
-<li><code>summarize()</code> one of its columns/variables with a <em>summary statistic</em>. Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section <a href="2-viz.html#boxplots">2.7</a> on boxplots.</li>
-<li><code>group_by()</code> its rows. In other words, assign different rows to be part of the same <em>group</em>. Then we can combine <code>group_by()</code> with <code>summarize()</code> to report summary statistics for each group <em>separately</em>. For example, say you don’t want a single overall average departure delay <code>dep_delay</code> for all three <code>origin</code> airports combined, but rather three separate average departure delays, one for each of the three <code>origin</code> airports.</li>
+<li><code>summarize()</code> one or more of its columns/variables with a <em>summary statistic</em>. Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section <a href="2-viz.html#boxplots">2.7</a> on boxplots.</li>
+<li><code>group_by()</code> its rows. In other words, assign different rows to be part of the same <em>group</em>. We can then combine <code>group_by()</code> with <code>summarize()</code> to report summary statistics for each group <em>separately</em>. For example, say you don’t want a single overall average departure delay <code>dep_delay</code> for all three <code>origin</code> airports combined, but rather three separate average departure delays, one computed for each of the three <code>origin</code> airports.</li>
 <li><code>mutate()</code> its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius.</li>
 <li><code>arrange()</code> its rows. For example, sort the rows of <code>weather</code> in ascending or descending order of <code>temp</code>.</li>
 <li><code>join()</code> it with another data frame by matching along a “key” variable. In other words, merge these two data frames together.</li>
 </ol>
 <p>Notice how we used <code>computer_code</code> font to describe the actions we want to take on our data frames. This is because the <code>dplyr</code> package for data wrangling has intuitively verb-named functions that are easy to remember.</p>
-<p>There is a further benefit to learning to use the <code>dplyr</code> package for data wrangling: its similarity to the database querying language <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> (pronounced “sequel”). The SQL language is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn <code>dplyr</code> you can learn SQL easily. We’ll talk more about their similarities in Subsection <a href="3-wrangling.html#normal-forms">3.7.4</a>.</p>
+<p>There is a further benefit to learning to use the <code>dplyr</code> package for data wrangling: its similarity to the database querying language <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> (pronounced “sequel” or spelled out as “S”, “Q”, “L”). SQL (which stands for “Structured Query Language”) is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn <code>dplyr</code>, you can learn SQL easily. We’ll talk more about their similarities in Subsection <a href="3-wrangling.html#normal-forms">3.7.4</a>.</p>
 <div id="needed-packages-1" class="section level3 unnumbered">
 <h3>Needed packages</h3>
 <p>Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(nycflights13)</code></pre>
+<div class="sourceCode" id="cb52"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb52-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb52-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb52-3" data-line-number="3"><span class="kw">library</span>(nycflights13)</a></code></pre></div>
 </div>
 <div id="piping" class="section level2">
 <h2><span class="header-section-number">3.1</span> The pipe operator: <code>%&gt;%</code></h2>
-<p>Before we start data wrangling, let’s first introduce a very nifty tool that gets loaded along with the <code>dplyr</code> package: the  pipe operator <code>%&gt;%</code>. The pipe operator allows us to combine multiple operations on a computer into a single sequential <em>chain</em> of actions.</p>
+<p>Before we start data wrangling, let’s first introduce a nifty tool that gets loaded with the <code>dplyr</code> package: the  pipe operator <code>%&gt;%</code>. The pipe operator allows us to combine multiple operations in R into a single sequential <em>chain</em> of actions.</p>
 <p>Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame <code>x</code> using hypothetical functions <code>f()</code>, <code>g()</code>, and <code>h()</code>:</p>
 <ol style="list-style-type: decimal">
 <li>Take <code>x</code> <em>then</em></li>
@@ -611,12 +624,12 @@ <h2><span class="header-section-number">3.1</span> The pipe operator: <code>%&gt
 <li>Use the output of <code>g(f(x))</code> as an input to a function <code>h()</code></li>
 </ol>
 <p>One way to achieve this sequence of operations is by using nesting parentheses as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">h</span>(<span class="kw">g</span>(<span class="kw">f</span>(x)))</code></pre>
-<p>This code isn’t so hard to read since we are applying only three functions: <code>f()</code>, then <code>g()</code>, then <code>h()</code>. However, you can imagine that this will get progressively harder to read as the number of functions applied in your sequence increases. This is where the pipe operator <code>%&gt;%</code> comes in handy. <code>%&gt;%</code> takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read <code>%&gt;%</code> as “then” or “and then.” For example, you can obtain the same output as the hypothetical sequence of functions as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">x <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">f</span>() <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">g</span>() <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">h</span>()</code></pre>
+<div class="sourceCode" id="cb53"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb53-1" data-line-number="1"><span class="kw">h</span>(<span class="kw">g</span>(<span class="kw">f</span>(x)))</a></code></pre></div>
+<p>This code isn’t so hard to read since we are applying only three functions: <code>f()</code>, then <code>g()</code>, then <code>h()</code>, and each function has a short name. Further, each of these functions takes only one argument. However, you can imagine that this will get progressively harder to read as the number of functions applied in your sequence increases and the number of arguments in each function increases as well. This is where the pipe operator <code>%&gt;%</code> comes in handy. <code>%&gt;%</code> takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read <code>%&gt;%</code> as “then” or “and then.” For example, you can obtain the same output as the hypothetical sequence of functions as follows:</p>
+<div class="sourceCode" id="cb54"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb54-1" data-line-number="1">x <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb54-2" data-line-number="2"><span class="st">  </span><span class="kw">f</span>() <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb54-3" data-line-number="3"><span class="st">  </span><span class="kw">g</span>() <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb54-4" data-line-number="4"><span class="st">  </span><span class="kw">h</span>()</a></code></pre></div>
 <p>You would read this sequence as:</p>
 <ol style="list-style-type: decimal">
 <li>Take <code>x</code> <em>then</em></li>
@@ -627,13 +640,13 @@ <h2><span class="header-section-number">3.1</span> The pipe operator: <code>%&gt
 <p>So while both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical <code>x</code>, <code>f()</code>, <code>g()</code>, and <code>h()</code>? Throughout this chapter on data wrangling:</p>
 <ol style="list-style-type: decimal">
 <li>The starting value <code>x</code> will be a data frame. For example, the  <code>flights</code> data frame we explored in Section <a href="1-getting-started.html#nycflights13">1.4</a>.</li>
-<li>The sequence of functions, here <code>f()</code>, <code>g()</code>, and <code>h()</code>, will mostly be a sequence of any number of the six data wrangling verb-named functions we listed in the introduction to this chapter. For example, the <code>filter(carrier == &quot;AS&quot;)</code> function we previewed earlier.</li>
+<li>The sequence of functions, here <code>f()</code>, <code>g()</code>, and <code>h()</code>, will mostly be a sequence of any number of the six data wrangling verb-named functions we listed in the introduction to this chapter. For example, the <code>filter(carrier == &quot;AS&quot;)</code> function call (with its argument specified) that we previewed earlier.</li>
 <li>The result will be the transformed/modified data frame that you want. In our example, we’ll save the result in a new data frame by using the <code>&lt;-</code> assignment operator with the name <code>alaska_flights</code> via <code>alaska_flights &lt;-</code>.</li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb55"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb55-1" data-line-number="1">alaska_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb55-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">==</span><span class="st"> &quot;AS&quot;</span>)</a></code></pre></div>
 <p>Much like when adding layers to a <code>ggplot()</code> using the <code>+</code> sign, you form a single <em>chain</em> of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator <code>%&gt;%</code>. Furthermore, much like how the <code>+</code> sign has to come at the end of lines when constructing plots, the pipe operator <code>%&gt;%</code> has to come at the end of lines as well.</p>
-<p>Keep in mind, there are many more advanced data wrangling functions than just the six listed in the introduction to this chapter; you’ll see some examples of these near in Section <a href="3-wrangling.html#other-verbs">3.8</a>. However, just with these six verb-named functions you’ll be able to perform a broad array of data wrangling tasks for the rest of this book.</p>
+<p>Keep in mind, there are many more advanced data wrangling functions than just the six listed in the introduction to this chapter; you’ll see some examples of these in Section <a href="3-wrangling.html#other-verbs">3.8</a>. However, with just these six verb-named functions, you’ll be able to perform a broad array of data wrangling tasks for the rest of this book.</p>
 </div>
 <div id="filter" class="section level2">
 <h2><span class="header-section-number">3.2</span> <code>filter</code> rows</h2>
@@ -645,9 +658,9 @@ <h2><span class="header-section-number">3.2</span> <code>filter</code> rows</h2>
 </div>
 <p>The <code>filter()</code> function here works much like the “Filter” option in Microsoft Excel; it allows you to specify criteria about the values of a variable in your dataset and then keeps only the rows that match those criteria.</p>
 <p>We begin by focusing only on flights from New York City to Portland, Oregon. The <code>dest</code> destination code (or airport code) for Portland, Oregon is <code>&quot;PDX&quot;</code>. Run the following and look at the results in RStudio’s spreadsheet viewer to ensure that only flights heading to Portland are chosen here:</p>
-<pre class="sourceCode r"><code class="sourceCode r">portland_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(dest <span class="op">==</span><span class="st"> &quot;PDX&quot;</span>)
-<span class="kw">View</span>(portland_flights)</code></pre>
+<div class="sourceCode" id="cb56"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb56-1" data-line-number="1">portland_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb56-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(dest <span class="op">==</span><span class="st"> &quot;PDX&quot;</span>)</a>
+<a class="sourceLine" id="cb56-3" data-line-number="3"><span class="kw">View</span>(portland_flights)</a></code></pre></div>
 <p>Note the order of the code. First, take the <code>flights</code> data frame, <em>then</em> <code>filter()</code> it so that only those rows where <code>dest</code> equals <code>&quot;PDX&quot;</code> are included. We test for equality using the double equal sign <code>==</code> and not a single equal sign <code>=</code>. In other words, <code>filter(dest = &quot;PDX&quot;)</code> will yield an error. Using <code>==</code> to test for equality is a convention across many programming languages. If you are new to coding, you’ll probably forget to use the double equal sign <code>==</code> a few times before you get the hang of it.</p>
 <p>You can use other operators  beyond just the <code>==</code> operator that tests for equality:</p>
 <ul>
@@ -655,40 +668,38 @@ <h2><span class="header-section-number">3.2</span> <code>filter</code> rows</h2>
 <li><code>&lt;</code> corresponds to “less than”</li>
 <li><code>&gt;=</code> corresponds to “greater than or equal to”</li>
 <li><code>&lt;=</code> corresponds to “less than or equal to”</li>
-<li><code>!=</code> corresponds to “not equal to”. The <code>!</code> is used in many programming languages to indicate “not”.</li>
+<li><code>!=</code> corresponds to “not equal to.” The <code>!</code> is used in many programming languages to indicate “not.”</li>
 </ul>
-<p>Furthermore, you can combine multiple criteria together using operators that make comparisons:</p>
+<p>Furthermore, you can combine multiple criteria using logical operators:</p>
 <ul>
 <li><code>|</code> corresponds to “or”</li>
 <li><code>&amp;</code> corresponds to “and”</li>
 </ul>
 <p>To see many of these in action, let’s filter <code>flights</code> for all rows that departed from JFK <em>and</em> were heading to Burlington, Vermont (<code>&quot;BTV&quot;</code>) or Seattle, Washington (<code>&quot;SEA&quot;</code>) <em>and</em> departed in the months of October, November, or December. Run the following:</p>
-<pre class="sourceCode r"><code class="sourceCode r">btv_sea_flights_fall &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;JFK&quot;</span> <span class="op">&amp;</span><span class="st"> </span>(dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>) <span class="op">&amp;</span><span class="st"> </span>month <span class="op">&gt;=</span><span class="st"> </span><span class="dv">10</span>)
-<span class="kw">View</span>(btv_sea_flights_fall)</code></pre>
+<div class="sourceCode" id="cb57"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb57-1" data-line-number="1">btv_sea_flights_fall &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb57-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;JFK&quot;</span> <span class="op">&amp;</span><span class="st"> </span>(dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>) <span class="op">&amp;</span><span class="st"> </span>month <span class="op">&gt;=</span><span class="st"> </span><span class="dv">10</span>)</a>
+<a class="sourceLine" id="cb57-3" data-line-number="3"><span class="kw">View</span>(btv_sea_flights_fall)</a></code></pre></div>
 <p>Note that even though colloquially speaking one might say “all flights heading to Burlington, Vermont <em>and</em> Seattle, Washington,” in terms of computer operations, we really mean “all flights heading to Burlington, Vermont <em>or</em> heading to Seattle, Washington.” For a given row in the data, <code>dest</code> can be <code>&quot;BTV&quot;</code>, or <code>&quot;SEA&quot;</code>, or something else, but not both <code>&quot;BTV&quot;</code> and <code>&quot;SEA&quot;</code> at the same time. Furthermore, note the careful use of parentheses around <code>dest == &quot;BTV&quot; | dest == &quot;SEA&quot;</code>. Without them, R would evaluate <code>&amp;</code> before <code>|</code> and group the conditions differently.</p>
-<p>We can often skip the use of <code>&amp;</code> and just separate our conditions with a comma. In other words the previous code will return the identical output <code>btv_sea_flights_fall</code> as the following code:</p>
-<pre class="sourceCode r"><code class="sourceCode r">btv_sea_flights_fall &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;JFK&quot;</span>, (dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>), month <span class="op">&gt;=</span><span class="st"> </span><span class="dv">10</span>)
-<span class="kw">View</span>(btv_sea_flights_fall)</code></pre>
+<p>We can often skip the use of <code>&amp;</code> and just separate our conditions with a comma. The previous code will return the identical output <code>btv_sea_flights_fall</code> as the following code:</p>
+<div class="sourceCode" id="cb58"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb58-1" data-line-number="1">btv_sea_flights_fall &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb58-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(origin <span class="op">==</span><span class="st"> &quot;JFK&quot;</span>, (dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>), month <span class="op">&gt;=</span><span class="st"> </span><span class="dv">10</span>)</a>
+<a class="sourceLine" id="cb58-3" data-line-number="3"><span class="kw">View</span>(btv_sea_flights_fall)</a></code></pre></div>
 <p>Let’s present another example that uses the <code>!</code> “not” operator to pick rows that <em>don’t</em> match a criterion. As mentioned earlier, the <code>!</code> can be read as “not.” Here we keep only the rows corresponding to flights that didn’t go to Burlington, VT or Seattle, WA.</p>
-<pre class="sourceCode r"><code class="sourceCode r">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>(dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>))
-<span class="kw">View</span>(not_BTV_SEA)</code></pre>
+<div class="sourceCode" id="cb59"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb59-1" data-line-number="1">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb59-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>(dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>))</a>
+<a class="sourceLine" id="cb59-3" data-line-number="3"><span class="kw">View</span>(not_BTV_SEA)</a></code></pre></div>
 <p>Again, note the careful use of parentheses around the <code>(dest == &quot;BTV&quot; | dest == &quot;SEA&quot;)</code>. If we didn’t use parentheses as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb60"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb60-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">filter</span>(<span class="op">!</span>dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>)</a></code></pre></div>
 <p>We would be returning all flights not headed to <code>&quot;BTV&quot;</code> <em>or</em> those headed to <code>&quot;SEA&quot;</code>, which is an entirely different resulting data frame.</p>
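+<p>As a minimal sketch of why the parentheses matter, consider a small made-up vector named <code>toy_dest</code> (an illustrative assumption, not part of <code>flights</code>). In R, <code>!</code> is applied before <code>|</code>, so without parentheses only the first comparison is negated:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy vector of destination codes
+toy_dest &lt;- c(&quot;BTV&quot;, &quot;SEA&quot;, &quot;JFK&quot;)
+
+# With parentheses: TRUE only for values that are neither BTV nor SEA
+with_parens &lt;- !(toy_dest == &quot;BTV&quot; | toy_dest == &quot;SEA&quot;)
+with_parens
+
+# Without parentheses: read as (!(toy_dest == &quot;BTV&quot;)) | (toy_dest == &quot;SEA&quot;)
+without_parens &lt;- !toy_dest == &quot;BTV&quot; | toy_dest == &quot;SEA&quot;
+without_parens</code></pre>
+<p>The first result should be <code>FALSE FALSE TRUE</code>, while the second should be <code>FALSE TRUE TRUE</code>: the <code>&quot;SEA&quot;</code> entry is kept in the second case.</p>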
-<p>Now say we have a larger number of airports we want to filter for, say <code>&quot;SEA&quot;</code>, <code>&quot;SFO&quot;</code>, <code>&quot;PDX&quot;</code>, <code>&quot;BTV&quot;</code>, and <code>&quot;BDL&quot;</code>. We could continue to use the <code>|</code> <em>or</em>  operator as so:</p>
-<pre class="sourceCode r"><code class="sourceCode r">many_airports &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SFO&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;PDX&quot;</span> <span class="op">|</span><span class="st"> </span>
-<span class="st">         </span>dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;BDL&quot;</span>)
-<span class="kw">View</span>(many_airports)</code></pre>
+<p>Now say we have a larger number of airports we want to filter for, say <code>&quot;SEA&quot;</code>, <code>&quot;SFO&quot;</code>, <code>&quot;PDX&quot;</code>, <code>&quot;BTV&quot;</code>, and <code>&quot;BDL&quot;</code>. We could continue to use the <code>|</code> (<em>or</em>)  operator:</p>
+<div class="sourceCode" id="cb61"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb61-1" data-line-number="1">many_airports &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb61-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SFO&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;PDX&quot;</span> <span class="op">|</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb61-3" data-line-number="3"><span class="st">         </span>dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;BDL&quot;</span>)</a></code></pre></div>
 <p>but as we progressively include more airports, this will get unwieldy to write. A slightly shorter approach uses the <code>%in%</code>  operator along with the <code>c()</code> function. Recall from Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a> that the <code>c()</code> function “combines” or “concatenates” values into a single <em>vector</em> of values. </p>
-<pre class="sourceCode r"><code class="sourceCode r">many_airports &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(dest <span class="op">%in%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;SEA&quot;</span>, <span class="st">&quot;SFO&quot;</span>, <span class="st">&quot;PDX&quot;</span>, <span class="st">&quot;BTV&quot;</span>, <span class="st">&quot;BDL&quot;</span>))
-<span class="kw">View</span>(many_airports)</code></pre>
-<p>What this code is doing is filtering <code>flights</code> for all flights where <code>dest</code> is in the vector of airports <code>c(&quot;BTV&quot;, &quot;SEA&quot;, &quot;PDX&quot;, &quot;SFO&quot;, &quot;BDL&quot;)</code>.Both outputs of <code>many_airports</code> are the same, but as you can see the latter takes much less energy to code.</p>
+<div class="sourceCode" id="cb62"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb62-1" data-line-number="1">many_airports &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb62-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(dest <span class="op">%in%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;SEA&quot;</span>, <span class="st">&quot;SFO&quot;</span>, <span class="st">&quot;PDX&quot;</span>, <span class="st">&quot;BTV&quot;</span>, <span class="st">&quot;BDL&quot;</span>))</a>
+<a class="sourceLine" id="cb62-3" data-line-number="3"><span class="kw">View</span>(many_airports)</a></code></pre></div>
+<p>What this code is doing is filtering <code>flights</code> for all flights where <code>dest</code> is in the vector of airports <code>c(&quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, &quot;BDL&quot;)</code>. Both outputs of <code>many_airports</code> are the same, but as you can see the latter takes much less energy to code. The <code>%in%</code> operator is useful for checking whether the values of one vector/variable appear in another.</p>
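+<p>To see what <code>%in%</code> returns on its own, here is a small sketch using made-up vectors (the names <code>toy_dest</code> and <code>airports</code> are illustrative assumptions). It produces one <code>TRUE</code>/<code>FALSE</code> value per element of the left-hand vector, which is what <code>filter()</code> uses to decide which rows to keep:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy vectors
+toy_dest &lt;- c(&quot;SEA&quot;, &quot;JFK&quot;, &quot;BDL&quot;, &quot;LAX&quot;)
+airports &lt;- c(&quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, &quot;BDL&quot;)
+
+# One logical value per element of toy_dest
+in_result &lt;- toy_dest %in% airports
+in_result
+
+# The same test written out with the | operator
+or_result &lt;- toy_dest == &quot;SEA&quot; | toy_dest == &quot;SFO&quot; | toy_dest == &quot;PDX&quot; |
+  toy_dest == &quot;BTV&quot; | toy_dest == &quot;BDL&quot;
+identical(in_result, or_result)</code></pre>
+<p>Here <code>in_result</code> should be <code>TRUE FALSE TRUE FALSE</code>, and the final line should confirm that both approaches agree.</p>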
 <p>As a final note, we recommend that <code>filter()</code> should often be among the first verbs you consider applying to your data. This cleans your dataset to only those rows you care about, or put differently, it narrows down the scope of your data frame to just the observations you care about.</p>
 <div class="learncheck">
 <p>
@@ -710,54 +721,54 @@ <h2><span class="header-section-number">3.3</span> <code>summarize</code> variab
 FIGURE 3.2: Diagram illustrating a summary function in R.
 </p>
 </div>
-<p>More precisely, we’ll use the <code>mean()</code> and <code>sd()</code> summary functions within the <code>summarize()</code>  function from the <code>dplyr</code> package. Note you can also use the UK spelling of <code>summarise()</code>. As shown in Figure <a href="3-wrangling.html#fig:sum1">3.3</a>, the <code>summarize()</code> function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics.</p>
+<p>More precisely, we’ll use the <code>mean()</code> and <code>sd()</code> summary functions within the <code>summarize()</code>  function from the <code>dplyr</code> package. Note you can also use the British English spelling of <code>summarise()</code>. As shown in Figure <a href="3-wrangling.html#fig:sum1">3.3</a>, the <code>summarize()</code> function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics.</p>
 <div class="figure" style="text-align: center"><span id="fig:sum1"></span>
-<img src="images/cheatsheets/summarize1.png" alt="Diagram of summarize() rows." width="\textwidth" />
+<img src="images/cheatsheets/summarize1.png" alt="Diagram of summarize() rows." width="80%" height="80%" />
 <p class="caption">
 FIGURE 3.3: Diagram of summarize() rows.
 </p>
 </div>
 <p>We’ll save the results in a new data frame called <code>summary_temp</code> that will have two columns/variables: the <code>mean</code> and the <code>std_dev</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp), <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp))
-summary_temp</code></pre>
+<div class="sourceCode" id="cb63"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb63-1" data-line-number="1">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb63-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp), <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp))</a>
+<a class="sourceLine" id="cb63-3" data-line-number="3">summary_temp</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
    mean std_dev
   &lt;dbl&gt;   &lt;dbl&gt;
 1    NA      NA</code></pre>
-<p>Why are the values returned <code>NA</code>? As we saw in Section <a href="2-viz.html#geompoint">2.3.1</a> when creating the scatterplot of departure and arrival delays for <code>alaska_flights</code>, <code>NA</code> is how R encodes <em>missing values</em>  where <code>NA</code> indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, <code>NA</code> is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data.</p>
+<p>Why are the values returned <code>NA</code>? As we saw in Subsection <a href="2-viz.html#geompoint">2.3.1</a> when creating the scatterplot of departure and arrival delays for <code>alaska_flights</code>, <code>NA</code> is how R encodes <em>missing values</em>  where <code>NA</code> indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, <code>NA</code> is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data.</p>
 <p>Going back to our <code>summary_temp</code> output, by default any time you try to calculate a summary statistic of a variable that has one or more <code>NA</code> missing values in R, <code>NA</code> is returned. To work around this fact, you can set the <code>na.rm</code> argument to <code>TRUE</code>, where <code>rm</code> is short for “remove”; this will ignore any <code>NA</code> missing values and only return the summary value for all non-missing values.</p>
 <p>The code that follows computes the mean and standard deviation of all non-missing values of <code>temp</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), 
-            <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))
-summary_temp</code></pre>
+<div class="sourceCode" id="cb65"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb65-1" data-line-number="1">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb65-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), </a>
+<a class="sourceLine" id="cb65-3" data-line-number="3">            <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a>
+<a class="sourceLine" id="cb65-4" data-line-number="4">summary_temp</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
    mean std_dev
   &lt;dbl&gt;   &lt;dbl&gt;
 1  55.3    17.8</code></pre>
-<p>Notice how the <code>na.rm=TRUE</code>  are used as arguments to the <code>mean()</code>  and <code>sd()</code>  summary functions individually, and not to the <code>summarize()</code> function.</p>
-<p>However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming Learning Checks we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the <code>na.rm</code> argument to any summary statistic function in R is set to <code>FALSE</code> by default. In other words, do not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis.</p>
-<p>What are other summary functions can we use inside the <code>summarize()</code> verb to compute summary statistics? As seen in the diagram in Figure <a href="3-wrangling.html#fig:summary-function">3.2</a>, you can use any function in R that takes many values and returns just one. Here are just a few:</p>
+<p>Notice how <code>na.rm = TRUE</code> is used as an argument to the <code>mean()</code> and <code>sd()</code> summary functions individually, and not to the <code>summarize()</code> function.</p>
+<p>However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming <em>Learning checks</em> questions, we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the <code>na.rm</code> argument to any summary statistic function in R is set to <code>FALSE</code> by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis.</p>
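+<p>The default behavior can be checked on a tiny made-up vector (<code>toy_temp</code> is an illustrative assumption, not the <code>weather</code> data). This sketch also counts the missing values first, which is one simple way to stay mindful of how much data is being ignored:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy vector with one missing value
+toy_temp &lt;- c(50, NA, 70)
+
+# na.rm is FALSE by default, so the result is NA
+default_mean &lt;- mean(toy_temp)
+default_mean
+
+# Ignoring the NA gives the mean of 50 and 70
+clean_mean &lt;- mean(toy_temp, na.rm = TRUE)
+clean_mean
+
+# Count the missing values before deciding to ignore them
+n_missing &lt;- sum(is.na(toy_temp))
+n_missing</code></pre>
+<p>The three results should be <code>NA</code>, <code>60</code>, and <code>1</code>, respectively.</p>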
+<p>What are other summary functions we can use inside the <code>summarize()</code> verb to compute summary statistics? As seen in the diagram in Figure <a href="3-wrangling.html#fig:summary-function">3.2</a>, you can use any function in R that takes many values and returns just one. Here are just a few:</p>
 <ul>
-<li><code>mean()</code>: the mean AKA the average</li>
+<li><code>mean()</code>: the average</li>
 <li><code>sd()</code>: the standard deviation, which is a measure of spread</li>
-<li><code>min()</code> and <code>max()</code>: the minimum and maximum values respectively</li>
-<li><code>IQR()</code>: Interquartile range</li>
-<li><code>sum()</code>: the sum</li>
-<li><code>n()</code>: a count of the number of rows/observations in each group. This particular summary function will make more sense when <code>group_by()</code> is covered in Section <a href="3-wrangling.html#groupby">3.4</a>.</li>
+<li><code>min()</code> and <code>max()</code>: the minimum and maximum values, respectively</li>
+<li><code>IQR()</code>: interquartile range</li>
+<li><code>sum()</code>: the total amount when adding multiple numbers</li>
+<li><code>n()</code>: a count of the number of rows in each group. This particular summary function will make more sense when <code>group_by()</code> is covered in Section <a href="3-wrangling.html#groupby">3.4</a>.</li>
 </ul>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC3.2)</strong> Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?</p>
-<p><strong>(LC3.3)</strong> Modify the <code>summarize</code> function to create <code>summary_temp</code> to also use the <code>n()</code> summary function: <code>summarize(count = n())</code>. What does the returned value correspond to?</p>
-<p><strong>(LC3.4)</strong> Why doesn’t the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run <code>summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE))</code> first.</p>
-<pre class="sourceCode r"><code class="sourceCode r">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st">   </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</code></pre>
+<p><strong>(LC3.2)</strong> Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?</p>
+<p><strong>(LC3.3)</strong> Modify the <code>summarize()</code> call that creates <code>summary_temp</code> so that it also uses the <code>n()</code> summary function: <code>summarize(count = n())</code>. What does the returned value correspond to?</p>
+<p><strong>(LC3.4)</strong> Why doesn’t the following code work? Run the code line-by-line instead of all at once, and then look at the data. In other words, run <code>summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE))</code> first.</p>
+<div class="sourceCode" id="cb67"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb67-1" data-line-number="1">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st">   </span></a>
+<a class="sourceLine" id="cb67-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb67-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
 <div class="learncheck">
 
 </div>
@@ -773,14 +784,14 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
 </p>
 </div>
 <p>Say instead of a single mean temperature for the whole year, you would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean temperature split by month. We can do this by “grouping” temperature observations by the values of another variable, in this case by the 12 values of the variable <code>month</code>. Run the following code:</p>
-<pre class="sourceCode r"><code class="sourceCode r">summary_monthly_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), 
-            <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))
-summary_monthly_temp</code></pre>
+<div class="sourceCode" id="cb68"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb68-1" data-line-number="1">summary_monthly_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb68-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb68-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), </a>
+<a class="sourceLine" id="cb68-4" data-line-number="4">            <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a>
+<a class="sourceLine" id="cb68-5" data-line-number="5">summary_monthly_temp</a></code></pre></div>
 <pre><code># A tibble: 12 x 3
    month  mean std_dev
-   &lt;dbl&gt; &lt;dbl&gt;   &lt;dbl&gt;
+   &lt;int&gt; &lt;dbl&gt;   &lt;dbl&gt;
  1     1  35.6   10.2 
  2     2  34.3    6.98
  3     3  39.9    6.25
@@ -794,8 +805,9 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
 11    11  45.0   10.4 
 12    12  38.4    9.98</code></pre>
 <p>This code is identical to the previous code that created <code>summary_temp</code>, but with an extra <code>group_by(month)</code> added before the <code>summarize()</code>. Grouping the <code>weather</code> dataset by <code>month</code> and then applying the <code>summarize()</code> function yields a data frame that displays the mean and standard deviation of temperature split by the 12 months of the year.</p>
-<p>It is important to note that the  <code>group_by()</code> function doesn’t change data frames by itself. Rather it changes the <em>meta-data</em>, or data about the data, specifically the grouping structure. It is only after we apply the <code>summarize()</code> function that the data frame changes. For example, let’s consider the  <code>diamonds</code> data frame included in the <code>ggplot2</code> package. Run this code:</p>
-<pre class="sourceCode r"><code class="sourceCode r">diamonds</code></pre>
+<p>It is important to note that the <code>group_by()</code> function doesn’t change data frames by itself. Rather, it changes the <em>meta-data</em>, or data about the data, specifically the grouping structure. It is only after we apply the <code>summarize()</code> function that the data frame changes.</p>
+<p>For example, let’s consider the <code>diamonds</code> data frame included in the <code>ggplot2</code> package. Run this code:</p>
+<div class="sourceCode" id="cb70"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb70-1" data-line-number="1">diamonds</a></code></pre></div>
 <pre><code># A tibble: 53,940 x 10
    carat cut       color clarity depth table price     x     y     z
    &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
@@ -811,8 +823,8 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
 # … with 53,930 more rows</code></pre>
 <p>Observe that the first line of the output reads <code># A tibble: 53,940 x 10</code>. This is an example of meta-data, in this case the number of observations/rows and variables/columns in <code>diamonds</code>. The actual data itself are the subsequent table of values. Now let’s pipe the <code>diamonds</code> data frame into <code>group_by(cut)</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">diamonds <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(cut)</code></pre>
+<div class="sourceCode" id="cb72"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb72-1" data-line-number="1">diamonds <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb72-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(cut)</a></code></pre></div>
 <pre><code># A tibble: 53,940 x 10
 # Groups:   cut [5]
    carat cut       color clarity depth table price     x     y     z
@@ -828,11 +840,11 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
 # … with 53,930 more rows</code></pre>
-<p>Observe that now there is additional meta-data: <code># Groups: cut [5]</code> indicating that the grouping structure meta-data has been set based on the 5 possible levels of the categorical variable <code>cut</code>: <code>&quot;Fair&quot;</code>, <code>&quot;Good&quot;</code>, <code>&quot;Very Good&quot;</code>, <code>&quot;Premium&quot;</code>, <code>&quot;Ideal&quot;</code>. On the other hand, observe that the data has not changed: it is still a table of 53,940 <span class="math inline">\(\times\)</span> 10 values.</p>
+<p>Observe that now there is additional meta-data: <code># Groups: cut [5]</code> indicating that the grouping structure meta-data has been set based on the 5 possible levels of the categorical variable <code>cut</code>: <code>&quot;Fair&quot;</code>, <code>&quot;Good&quot;</code>, <code>&quot;Very Good&quot;</code>, <code>&quot;Premium&quot;</code>, and <code>&quot;Ideal&quot;</code>. On the other hand, observe that the data has not changed: it is still a table of 53,940 <span class="math inline">\(\times\)</span> 10 values.</p>
 <p>Only by combining a <code>group_by()</code> with another data wrangling operation, in this case <code>summarize()</code>, will the data actually be transformed.</p>
-<pre class="sourceCode r"><code class="sourceCode r">diamonds <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(cut) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">avg_price =</span> <span class="kw">mean</span>(price))</code></pre>
+<div class="sourceCode" id="cb74"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb74-1" data-line-number="1">diamonds <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb74-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(cut) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb74-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">avg_price =</span> <span class="kw">mean</span>(price))</a></code></pre></div>
 <pre><code># A tibble: 5 x 2
   cut       avg_price
   &lt;ord&gt;         &lt;dbl&gt;
@@ -842,9 +854,9 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
 4 Premium       4584.
 5 Ideal         3458.</code></pre>
 <p>If you would like to remove this grouping structure meta-data, we can pipe the resulting data frame into the  <code>ungroup()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">diamonds <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(cut) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">ungroup</span>()</code></pre>
+<div class="sourceCode" id="cb76"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb76-1" data-line-number="1">diamonds <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb76-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(cut) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb76-3" data-line-number="3"><span class="st">  </span><span class="kw">ungroup</span>()</a></code></pre></div>
 <pre><code># A tibble: 53,940 x 10
    carat cut       color clarity depth table price     x     y     z
    &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
@@ -860,11 +872,11 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
 # … with 53,930 more rows</code></pre>
 <p>Observe how the <code># Groups: cut [5]</code> meta-data is no longer present.</p>
-<p>Let’s now revisit the <code>n()</code>  counting summary function we briefly introduced in the previously. Recall that the <code>n()</code> function counts rows. This is opposed to the <code>sum()</code> summary function that returns the sum of a numerical variable. For example, suppose we’d like to count how many flights departed each of the three airports in New York City:</p>
-<pre class="sourceCode r"><code class="sourceCode r">by_origin &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(origin) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())
-by_origin</code></pre>
+<p>Let’s now revisit the <code>n()</code> counting summary function we briefly introduced previously. Recall that the <code>n()</code> function counts rows. This is in contrast to the <code>sum()</code> summary function, which returns the sum of a numerical variable. For example, suppose we’d like to count how many flights departed each of the three airports in New York City:</p>
+<div class="sourceCode" id="cb78"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb78-1" data-line-number="1">by_origin &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb78-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(origin) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb78-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a>
+<a class="sourceLine" id="cb78-4" data-line-number="4">by_origin</a></code></pre></div>
 <pre><code># A tibble: 3 x 2
   origin  count
   &lt;chr&gt;   &lt;int&gt;
@@ -875,10 +887,10 @@ <h2><span class="header-section-number">3.4</span> <code>group_by</code> rows</h
 <div id="grouping-by-more-than-one-variable" class="section level3">
 <h3><span class="header-section-number">3.4.1</span> Grouping by more than one variable</h3>
 <p>You are not limited to grouping by one variable. Say you want to know the number of flights leaving each of the three New York City airports <em>for each month</em>. We can also group by a second variable <code>month</code> using <code>group_by(origin, month)</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">by_origin_monthly &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(origin, month) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())
-by_origin_monthly</code></pre>
+<div class="sourceCode" id="cb80"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb80-1" data-line-number="1">by_origin_monthly &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb80-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(origin, month) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb80-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a>
+<a class="sourceLine" id="cb80-4" data-line-number="4">by_origin_monthly</a></code></pre></div>
 <pre><code># A tibble: 36 x 3
 # Groups:   origin [3]
    origin month count
@@ -896,11 +908,11 @@ <h3><span class="header-section-number">3.4.1</span> Grouping by more than one v
 # … with 26 more rows</code></pre>
 <p>Observe that there are 36 rows to <code>by_origin_monthly</code> because there are 12 months for 3 airports (<code>EWR</code>, <code>JFK</code>, and <code>LGA</code>).</p>
 <p>Why do we <code>group_by(origin, month)</code> and not <code>group_by(origin)</code> and then <code>group_by(month)</code>? Let’s investigate:</p>
-<pre class="sourceCode r"><code class="sourceCode r">by_origin_monthly_incorrect &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(origin) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())
-by_origin_monthly_incorrect</code></pre>
+<div class="sourceCode" id="cb82"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb82-1" data-line-number="1">by_origin_monthly_incorrect &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb82-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(origin) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb82-3" data-line-number="3"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb82-4" data-line-number="4"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a>
+<a class="sourceLine" id="cb82-5" data-line-number="5">by_origin_monthly_incorrect</a></code></pre></div>
 <pre><code># A tibble: 12 x 2
    month count
    &lt;int&gt; &lt;int&gt;
@@ -922,7 +934,7 @@ <h3><span class="header-section-number">3.4.1</span> Grouping by more than one v
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC3.5)</strong> Recall from Chapter <a href="2-viz.html#viz">2</a> when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the <code>summary_monthly_temp</code> data frame tell us about temperatures in New York City throughout the year?</p>
+<p><strong>(LC3.5)</strong> Recall from Chapter <a href="2-viz.html#viz">2</a> when we looked at temperatures by months in NYC. What does the standard deviation column in the <code>summary_monthly_temp</code> data frame tell us about temperatures in NYC throughout the year?</p>
 <p><strong>(LC3.6)</strong> What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?</p>
 <p><strong>(LC3.7)</strong> Recreate <code>by_monthly_origin</code>, but instead of grouping via <code>group_by(origin, month)</code>, group variables in a different order <code>group_by(month, origin)</code>. What differs in the resulting dataset?</p>
 <p><strong>(LC3.8)</strong> How could we identify how many flights left each of the three airports for each <code>carrier</code>?</p>
@@ -935,28 +947,28 @@ <h3><span class="header-section-number">3.4.1</span> Grouping by more than one v
 <div id="mutate" class="section level2">
 <h2><span class="header-section-number">3.5</span> <code>mutate</code> existing variables</h2>
 <div class="figure" style="text-align: center"><span id="fig:select"></span>
-<img src="images/cheatsheets/mutate.png" alt="Diagram of mutate() columns." width="\textwidth" />
+<img src="images/cheatsheets/mutate.png" alt="Diagram of mutate() columns." width="80%" height="80%" />
 <p class="caption">
 FIGURE 3.5: Diagram of mutate() columns.
 </p>
 </div>
-<p>Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius °C instead of degrees Fahrenheit °F. The formula to convert temperatures from °F to °C is</p>
+<p>Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F). The formula to convert temperatures from °F to °C is</p>
 <p><span class="math display">\[
 \text{temp in C} = \frac{\text{temp in F} - 32}{1.8}
 \]</span></p>
 <p>We can apply this formula to the <code>temp</code> variable using the <code>mutate()</code>  function from the <code>dplyr</code> package, which takes existing variables and mutates them to create new ones.</p>
-<pre class="sourceCode r"><code class="sourceCode r">weather &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">temp_in_C =</span> (temp <span class="op">-</span><span class="st"> </span><span class="dv">32</span>) <span class="op">/</span><span class="st"> </span><span class="fl">1.8</span>)</code></pre>
-<p>In this code we <code>mutate()</code> the <code>weather</code> data frame by creating a new variable <code>temp_in_C = (temp-32) / 1.8</code> and then <em>overwrite</em> the original <code>weather</code> data frame. Why did we overwrite the data frame <code>weather</code>, instead of assigning the result to a new data frame like <code>weather_new</code>? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable <code>temp</code>, but instead create a new variable called <code>temp_in_C</code>? Because if we did this, we would have erased the original information contained in <code>temp</code> of temperatures in Fahrenheit that may still be valuable to us.</p>
+<div class="sourceCode" id="cb84"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb84-1" data-line-number="1">weather &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb84-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">temp_in_C =</span> (temp <span class="op">-</span><span class="st"> </span><span class="dv">32</span>) <span class="op">/</span><span class="st"> </span><span class="fl">1.8</span>)</a></code></pre></div>
+<p>In this code, we <code>mutate()</code> the <code>weather</code> data frame by creating a new variable <code>temp_in_C = (temp - 32) / 1.8</code> and then <em>overwrite</em> the original <code>weather</code> data frame. Why did we overwrite the data frame <code>weather</code>, instead of assigning the result to a new data frame like <code>weather_new</code>? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable <code>temp</code>, but instead create a new variable called <code>temp_in_C</code>? Because if we had overwritten <code>temp</code>, we would have erased the original temperatures in Fahrenheit it contains, which may still be valuable to us.</p>
 <p>Let’s now compute monthly average temperatures in both °F and °C using the <code>group_by()</code> and <code>summarize()</code> code we saw in Section <a href="3-wrangling.html#groupby">3.4</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">summary_monthly_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_temp_in_F =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), 
-            <span class="dt">mean_temp_in_C =</span> <span class="kw">mean</span>(temp_in_C, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))
-summary_monthly_temp</code></pre>
+<div class="sourceCode" id="cb85"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb85-1" data-line-number="1">summary_monthly_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb85-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb85-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_temp_in_F =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), </a>
+<a class="sourceLine" id="cb85-4" data-line-number="4">            <span class="dt">mean_temp_in_C =</span> <span class="kw">mean</span>(temp_in_C, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a>
+<a class="sourceLine" id="cb85-5" data-line-number="5">summary_monthly_temp</a></code></pre></div>
 <pre><code># A tibble: 12 x 3
    month mean_temp_in_F mean_temp_in_C
-   &lt;dbl&gt;          &lt;dbl&gt;          &lt;dbl&gt;
+   &lt;int&gt;          &lt;dbl&gt;          &lt;dbl&gt;
  1     1           35.6           2.02
  2     2           34.3           1.26
  3     3           39.9           4.38
@@ -969,13 +981,13 @@ <h2><span class="header-section-number">3.5</span> <code>mutate</code> existing
 10    10           60.1          15.6 
 11    11           45.0           7.22
 12    12           38.4           3.58</code></pre>
-<p>Let’s consider another example. Passengers are often frustrated when their flight departs late, but aren’t as annoyed if, in the end, pilots can make up some time during the flight. This is known in the airline industry as “gain” and we will create this variable using the <code>mutate()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">gain =</span> dep_delay <span class="op">-</span><span class="st"> </span>arr_delay)</code></pre>
+<p>Let’s consider another example. Passengers are often frustrated when their flight departs late, but aren’t as annoyed if, in the end, pilots can make up some time during the flight. This is known in the airline industry as <em>gain</em>, and we will create this variable using the <code>mutate()</code> function:</p>
+<div class="sourceCode" id="cb87"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb87-1" data-line-number="1">flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb87-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">gain =</span> dep_delay <span class="op">-</span><span class="st"> </span>arr_delay)</a></code></pre></div>
 <p>Let’s take a look at only the <code>dep_delay</code>, <code>arr_delay</code>, and the resulting <code>gain</code> variables for the first 5 rows in our updated <code>flights</code> data frame in Table <a href="3-wrangling.html#tab:first-five-flights">3.1</a>.</p>
 <table class="table" style="margin-left: auto; margin-right: auto;">
 <caption>
-<span id="tab:first-five-flights">TABLE 3.1: </span>First five rows of departure/arrival delay and gain variables.
+<span id="tab:first-five-flights">TABLE 3.1: </span>First five rows of departure/arrival delay and gain variables
 </caption>
 <thead>
 <tr>
@@ -1048,42 +1060,42 @@ <h2><span class="header-section-number">3.5</span> <code>mutate</code> existing
 </tr>
 </tbody>
 </table>
-<p>The flight in the first row departed 2 minutes late but arrived 11 minutes late, so its “gained time in the air” is a loss of 9 minutes, hence its <code>gain</code> is 2 - 11 = -9. On the other hand, the flight in the fourth row departed a minute early (<code>dep_delay</code> of -1) but arrived 18 minutes early (<code>arr_delay</code> of -18), so its “gained time in the air” is -1 - (-18) = -1 + 18 = 17 minutes, hence its <code>gain</code> is +17.</p>
+<p>The flight in the first row departed 2 minutes late but arrived 11 minutes late, so its “gained time in the air” is a loss of 9 minutes, hence its <code>gain</code> is 2 - 11 = -9. On the other hand, the flight in the fourth row departed a minute early (<code>dep_delay</code> of -1) but arrived 18 minutes early (<code>arr_delay</code> of -18), so its “gained time in the air” is <span class="math inline">\(-1 - (-18) = -1 + 18 = 17\)</span> minutes, hence its <code>gain</code> is +17.</p>
 <p>Let’s look at some summary statistics of the <code>gain</code> variable by considering multiple summary functions at once in the same <code>summarize()</code> code:</p>
-<pre class="sourceCode r"><code class="sourceCode r">gain_summary &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(
-    <span class="dt">min =</span> <span class="kw">min</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">q1 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.25</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">median =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.5</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">q3 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.75</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">max =</span> <span class="kw">max</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">mean =</span> <span class="kw">mean</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">sd =</span> <span class="kw">sd</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
-    <span class="dt">missing =</span> <span class="kw">sum</span>(<span class="kw">is.na</span>(gain))
-  )
-gain_summary</code></pre>
+<div class="sourceCode" id="cb88"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb88-1" data-line-number="1">gain_summary &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb88-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(</a>
+<a class="sourceLine" id="cb88-3" data-line-number="3">    <span class="dt">min =</span> <span class="kw">min</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-4" data-line-number="4">    <span class="dt">q1 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.25</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-5" data-line-number="5">    <span class="dt">median =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.5</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-6" data-line-number="6">    <span class="dt">q3 =</span> <span class="kw">quantile</span>(gain, <span class="fl">0.75</span>, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-7" data-line-number="7">    <span class="dt">max =</span> <span class="kw">max</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-8" data-line-number="8">    <span class="dt">mean =</span> <span class="kw">mean</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-9" data-line-number="9">    <span class="dt">sd =</span> <span class="kw">sd</span>(gain, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb88-10" data-line-number="10">    <span class="dt">missing =</span> <span class="kw">sum</span>(<span class="kw">is.na</span>(gain))</a>
+<a class="sourceLine" id="cb88-11" data-line-number="11">  )</a>
+<a class="sourceLine" id="cb88-12" data-line-number="12">gain_summary</a></code></pre></div>
 <pre><code># A tibble: 1 x 8
     min    q1 median    q3   max  mean    sd missing
   &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;   &lt;int&gt;
 1  -196    -3      7    17   109  5.66  18.0    9430</code></pre>
-<p>We see for example that the average gain is +5 minutes while the largest is +109 minutes! However, this code would take some time to type out in practice. We’ll see later on in Subsection <a href="5-regression.html#model1EDA">5.1.1</a> that there is a much more succinct way to compute a variety of common summary statistics: using the <code>skim()</code> function from the <code>skimr</code> package.</p>
+<p>We see, for example, that the average gain is about +5.7 minutes, while the largest is +109 minutes! However, this code would take some time to type out in practice. We’ll see later in Subsection <a href="5-regression.html#model1EDA">5.1.1</a> that there is a much more succinct way to compute a variety of common summary statistics: using the <code>skim()</code> function from the <code>skimr</code> package.</p>
 <p>Recall from Section <a href="2-viz.html#histograms">2.5</a> that since <code>gain</code> is a numerical variable, we can visualize its distribution using a histogram.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> gain)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">bins =</span> <span class="dv">20</span>)</code></pre>
+<div class="sourceCode" id="cb90"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb90-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> gain)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb90-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">bins =</span> <span class="dv">20</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:gain-hist"></span>
-<img src="moderndive_files/figure-html/gain-hist-1.png" alt="Histogram of gain variable." width="\textwidth" />
+<img src="ModernDive_files/figure-html/gain-hist-1.png" alt="Histogram of gain variable." width="\textwidth" />
 <p class="caption">
 FIGURE 3.6: Histogram of gain variable.
 </p>
 </div>
 <p>The resulting histogram in Figure <a href="3-wrangling.html#fig:gain-hist">3.6</a> provides a different perspective on the <code>gain</code> variable than the summary statistics we computed earlier. For example, note that most values of <code>gain</code> are right around 0.</p>
-<p>To close out our discussion on the <code>mutate()</code> function to create new variables, note that we can create multiple new variables at once in the same <code>mutate()</code> code. Furthermore, within the same <code>mutate()</code> code we can refer to new variables we just created. As an example, consider the <code>mutate()</code> code Hadley Wickham  and Garrett Grolemund  show in Chapter 5 of “R for Data Science” <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(
-    <span class="dt">gain =</span> dep_delay <span class="op">-</span><span class="st"> </span>arr_delay,
-    <span class="dt">hours =</span> air_time <span class="op">/</span><span class="st"> </span><span class="dv">60</span>,
-    <span class="dt">gain_per_hour =</span> gain <span class="op">/</span><span class="st"> </span>hours
-  )</code></pre>
+<p>To close out our discussion of using the <code>mutate()</code> function to create new variables, note that we can create multiple new variables at once in the same <code>mutate()</code> code. Furthermore, within the same <code>mutate()</code> code, we can refer to new variables we just created. As an example, consider the <code>mutate()</code> code Hadley Wickham and Garrett Grolemund show in Chapter 5 of <em>R for Data Science</em> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span>:</p>
+<div class="sourceCode" id="cb91"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb91-1" data-line-number="1">flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb91-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(</a>
+<a class="sourceLine" id="cb91-3" data-line-number="3">    <span class="dt">gain =</span> dep_delay <span class="op">-</span><span class="st"> </span>arr_delay,</a>
+<a class="sourceLine" id="cb91-4" data-line-number="4">    <span class="dt">hours =</span> air_time <span class="op">/</span><span class="st"> </span><span class="dv">60</span>,</a>
+<a class="sourceLine" id="cb91-5" data-line-number="5">    <span class="dt">gain_per_hour =</span> gain <span class="op">/</span><span class="st"> </span>hours</a>
+<a class="sourceLine" id="cb91-6" data-line-number="6">  )</a></code></pre></div>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1098,12 +1110,12 @@ <h2><span class="header-section-number">3.5</span> <code>mutate</code> existing
 </div>
 <div id="arrange" class="section level2">
 <h2><span class="header-section-number">3.6</span> <code>arrange</code> and sort rows</h2>
-<p>One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in alphanumeric order of one of the variables. The <code>dplyr</code> package’s <code>arrange()</code> function  allows us to sort/reorder a data frame’s rows according to the values of the specified variable.</p>
+<p>One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in the alphanumeric order of one of the variables. The <code>dplyr</code> package’s <code>arrange()</code> function  allows us to sort/reorder a data frame’s rows according to the values of the specified variable.</p>
 <p>Suppose we are interested in determining the most frequent destination airports for all domestic flights departing from New York City in 2013:</p>
-<pre class="sourceCode r"><code class="sourceCode r">freq_dest &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(dest) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_flights =</span> <span class="kw">n</span>())
-freq_dest</code></pre>
+<div class="sourceCode" id="cb92"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb92-1" data-line-number="1">freq_dest &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb92-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(dest) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb92-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_flights =</span> <span class="kw">n</span>())</a>
+<a class="sourceLine" id="cb92-4" data-line-number="4">freq_dest</a></code></pre></div>
 <pre><code># A tibble: 105 x 2
    dest  num_flights
    &lt;chr&gt;       &lt;int&gt;
@@ -1118,9 +1130,9 @@ <h2><span class="header-section-number">3.6</span> <code>arrange</code> and sort
  9 BGR           375
 10 BHM           297
 # … with 95 more rows</code></pre>
-<p>Observe that by default the rows of the resulting <code>freq_dest</code> data frame are sorted in alphabetical order of <code>dest</code> destination. Say instead we would like to see the same data, but sorted from the most to the least number of flights <code>num_flights</code> instead:</p>
-<pre class="sourceCode r"><code class="sourceCode r">freq_dest <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">arrange</span>(num_flights)</code></pre>
+<p>Observe that by default the rows of the resulting <code>freq_dest</code> data frame are sorted in alphabetical order of destination (<code>dest</code>). Say we would instead like to see the same data sorted from the most to the least number of flights (<code>num_flights</code>):</p>
+<div class="sourceCode" id="cb94"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb94-1" data-line-number="1">freq_dest <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb94-2" data-line-number="2"><span class="st">  </span><span class="kw">arrange</span>(num_flights)</a></code></pre></div>
 <pre><code># A tibble: 105 x 2
    dest  num_flights
    &lt;chr&gt;       &lt;int&gt;
@@ -1135,9 +1147,9 @@ <h2><span class="header-section-number">3.6</span> <code>arrange</code> and sort
  9 JAC            25
 10 BZN            36
 # … with 95 more rows</code></pre>
-<p>This is however the opposite of what we want. The rows are sorted with the least frequent destination airports displayed first. This is because <code>arrange()</code> always returns rows sorted in ascending order by default. To switch the ordering to be in “descending” order instead, we use the <code>desc()</code>  function as so:</p>
-<pre class="sourceCode r"><code class="sourceCode r">freq_dest <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(num_flights))</code></pre>
+<p>This is, however, the opposite of what we want. The rows are sorted with the least frequent destination airports displayed first. This is because <code>arrange()</code> returns rows sorted in ascending order by default. To switch the ordering to “descending” order instead, we use the <code>desc()</code>  function like so:</p>
+<div class="sourceCode" id="cb96"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb96-1" data-line-number="1">freq_dest <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb96-2" data-line-number="2"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(num_flights))</a></code></pre></div>
 <pre><code># A tibble: 105 x 2
    dest  num_flights
    &lt;chr&gt;       &lt;int&gt;
@@ -1155,53 +1167,55 @@ <h2><span class="header-section-number">3.6</span> <code>arrange</code> and sort
 </div>
 <div id="joins" class="section level2">
 <h2><span class="header-section-number">3.7</span> <code>join</code> data frames</h2>
-<p>Another common data transformation task is “joining” or “merging” two different datasets. For example, in the <code>flights</code> data frame the variable <code>carrier</code> lists the carrier code for the different flights. While the corresponding airline names for <code>&quot;UA&quot;</code> and <code>&quot;AA&quot;</code> might be somewhat easy to guess (United and American Airlines), what airlines have codes <code>&quot;VX&quot;</code>, <code>&quot;HA&quot;</code>, and <code>&quot;B6&quot;</code>? This information is provided in a separate data frame <code>airlines</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">View</span>(airlines)</code></pre>
-<p>We see that in <code>airports</code>, <code>carrier</code> is the carrier code while <code>name</code> is the full name of the airline company. Using this table, we can see that <code>&quot;VX&quot;</code>, <code>&quot;HA&quot;</code>, and <code>&quot;B6&quot;</code> correspond to Virgin America, Hawaiian Airlines, and JetBlue respectively. However, wouldn’t it be nice to have all this information in a single data frame instead of two separate data frames? We can do this by “joining” i.e. “merging” the <code>flights</code> and <code>airlines</code> data frames.</p>
-<p>Note that the values in the variable <code>carrier</code> in the <code>flights</code> data frame match the values in the variable <code>carrier</code> in the <code>airlines</code> data frame. In this case, we can use the variable <code>carrier</code> as a  <em>key variable</em> to match the rows of the two data frames. Key variables are almost always identification variables that uniquely identify the observational units as we saw in Subsection <a href="1-getting-started.html#identification-vs-measurement-variables">1.4.4</a>. This ensures that rows in both data frames are appropriately matched during the join. Hadley and Garrett <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span> created the following diagram to help us understand how the different data frames in the <code>nycflights13</code> package are linked by various key variables:</p>
+<p>Another common data transformation task is “joining” or “merging” two different datasets. For example, in the <code>flights</code> data frame, the variable <code>carrier</code> lists the carrier code for the different flights. While the corresponding airline names for <code>&quot;UA&quot;</code> and <code>&quot;AA&quot;</code> might be somewhat easy to guess (United and American Airlines), what airlines have codes <code>&quot;VX&quot;</code>, <code>&quot;HA&quot;</code>, and <code>&quot;B6&quot;</code>? This information is provided in a separate data frame <code>airlines</code>.</p>
+<div class="sourceCode" id="cb98"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb98-1" data-line-number="1"><span class="kw">View</span>(airlines)</a></code></pre></div>
+<p>We see that in <code>airlines</code>, <code>carrier</code> is the carrier code, while <code>name</code> is the full name of the airline company. Using this table, we can see that <code>&quot;VX&quot;</code>, <code>&quot;HA&quot;</code>, and <code>&quot;B6&quot;</code> correspond to Virgin America, Hawaiian Airlines, and JetBlue, respectively. However, wouldn’t it be nice to have all this information in a single data frame instead of two separate data frames? We can do this by “joining” the <code>flights</code> and <code>airlines</code> data frames.</p>
+<p>Note that the values in the variable <code>carrier</code> in the <code>flights</code> data frame match the values in the variable <code>carrier</code> in the <code>airlines</code> data frame. In this case, we can use the variable <code>carrier</code> as a  <em>key variable</em> to match the rows of the two data frames. Key variables are almost always <em>identification variables</em> that uniquely identify the observational units as we saw in Subsection <a href="1-getting-started.html#identification-vs-measurement-variables">1.4.4</a>. This ensures that rows in both data frames are appropriately matched during the join. Hadley and Garrett <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span> created the diagram shown in Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a> to help us understand how the different data frames in the <code>nycflights13</code> package are linked by various key variables:</p>
+
 <div class="figure" style="text-align: center"><span id="fig:reldiagram"></span>
-<img src="images/r4ds/relational-nycflights.png" alt="Data relationships in nycflights13 from R for Data Science." width="\textwidth" />
+<img src="images/r4ds/relational-nycflights.png" alt="Data relationships in nycflights13 from R for Data Science." width="\textwidth" height="120%" />
 <p class="caption">
-FIGURE 3.7: Data relationships in nycflights13 from R for Data Science.
+FIGURE 3.7: Data relationships in nycflights13 from <em>R for Data Science</em>.
 </p>
 </div>
 <div id="matching-key-variable-names" class="section level3">
 <h3><span class="header-section-number">3.7.1</span> Matching “key” variable names</h3>
 <p>In both the <code>flights</code> and <code>airlines</code> data frames, the key variable we want to join/merge/match the rows by has the same name: <code>carrier</code>. Let’s use the <code>inner_join()</code>  function to join the two data frames, where the rows will be matched by the variable <code>carrier</code>, and then compare the resulting data frames:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_joined &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">inner_join</span>(airlines, <span class="dt">by =</span> <span class="st">&quot;carrier&quot;</span>)
-<span class="kw">View</span>(flights)
-<span class="kw">View</span>(flights_joined)</code></pre>
+<div class="sourceCode" id="cb99"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb99-1" data-line-number="1">flights_joined &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb99-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(airlines, <span class="dt">by =</span> <span class="st">&quot;carrier&quot;</span>)</a>
+<a class="sourceLine" id="cb99-3" data-line-number="3"><span class="kw">View</span>(flights)</a>
+<a class="sourceLine" id="cb99-4" data-line-number="4"><span class="kw">View</span>(flights_joined)</a></code></pre></div>
 <p>Observe that the <code>flights</code> and <code>flights_joined</code> data frames are identical except that <code>flights_joined</code> has an additional variable <code>name</code>. The values of <code>name</code> correspond to the airline companies’ names as indicated in the <code>airlines</code> data frame.</p>
-<p>A visual representation of the <code>inner_join()</code> is shown in Figure <a href="3-wrangling.html#fig:ijdiagram">3.8</a> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span>. There are other types of joins available (such as <code>left_join()</code>, <code>right_join()</code>, <code>outer_join()</code>, and <code>anti_join()</code>), but the <code>inner_join()</code> will solve nearly all of the problems you’ll encounter in this book.</p>
+<p>A visual representation of the <code>inner_join()</code> is shown in Figure <a href="3-wrangling.html#fig:ijdiagram">3.8</a> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span>. There are other types of joins available (such as <code>left_join()</code>, <code>right_join()</code>, <code>full_join()</code>, and <code>anti_join()</code>), but the <code>inner_join()</code> will solve nearly all of the problems you’ll encounter in this book.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:ijdiagram"></span>
-<img src="images/r4ds/join-inner.png" alt="Diagram of inner join from R for Data Science." width="\textwidth" />
+<img src="images/r4ds/join-inner.png" alt="Diagram of inner join from R for Data Science." width="\textwidth" height="120%" />
 <p class="caption">
-FIGURE 3.8: Diagram of inner join from R for Data Science.
+FIGURE 3.8: Diagram of inner join from <em>R for Data Science</em>.
 </p>
 </div>
 </div>
 <div id="diff-key" class="section level3">
 <h3><span class="header-section-number">3.7.2</span> Different “key” variable names</h3>
-<p>Say instead you are interested in the destinations of all domestic flights departing NYC in 2013 and you ask yourself questions like: “What cities are these airports in?” or “Is <code>&quot;ORD&quot;</code> Orlando?” or &quot;Where is <code>&quot;FLL&quot;?</code></p>
+<p>Say instead you are interested in the destinations of all domestic flights departing NYC in 2013, and you ask yourself questions like “What cities are these airports in?” or “Is <code>&quot;ORD&quot;</code> Orlando?” or “Where is <code>&quot;FLL&quot;</code>?”</p>
 <p>The <code>airports</code> data frame contains the airport codes for each airport:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">View</span>(airports)</code></pre>
-<p>However, if you look at both the <code>airports</code> and <code>flights</code> data frames, you’ll find that the airport codes are in variables that have different names. In <code>airports</code> the airport code is in <code>faa</code> whereas in <code>flights</code> the airport codes are in <code>origin</code> and <code>dest</code>. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a>.</p>
-<p>In order to join these two data frames by airport code, our <code>inner_join()</code> operation will use the <code>by = c(&quot;dest&quot; = &quot;faa&quot;)</code>  argument, which allows us to join two data frames where the key variable has a different name:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_with_airport_names &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">inner_join</span>(airports, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">&quot;dest&quot;</span> =<span class="st"> &quot;faa&quot;</span>))
-<span class="kw">View</span>(flights_with_airport_names)</code></pre>
+<div class="sourceCode" id="cb100"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb100-1" data-line-number="1"><span class="kw">View</span>(airports)</a></code></pre></div>
+<p>However, if you look at both the <code>airports</code> and <code>flights</code> data frames, you’ll find that the airport codes are in variables that have different names. In <code>airports</code> the airport code is in <code>faa</code>, whereas in <code>flights</code> the airport codes are in <code>origin</code> and <code>dest</code>. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a>.</p>
+<p>In order to join these two data frames by airport code, our <code>inner_join()</code> operation will use the <code>by = c(&quot;dest&quot; = &quot;faa&quot;)</code>  argument, whose modified syntax lets us join two data frames where the key variable has a different name in each:</p>
+<div class="sourceCode" id="cb101"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb101-1" data-line-number="1">flights_with_airport_names &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb101-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(airports, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">&quot;dest&quot;</span> =<span class="st"> &quot;faa&quot;</span>))</a>
+<a class="sourceLine" id="cb101-3" data-line-number="3"><span class="kw">View</span>(flights_with_airport_names)</a></code></pre></div>
 <p>Let’s construct the chain of pipe operators <code>%&gt;%</code> that computes the number of flights from NYC to each destination, but also includes information about each destination airport:</p>
-<pre class="sourceCode r"><code class="sourceCode r">named_dests &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">group_by</span>(dest) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_flights =</span> <span class="kw">n</span>()) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(num_flights)) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">inner_join</span>(airports, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">&quot;dest&quot;</span> =<span class="st"> &quot;faa&quot;</span>)) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">airport_name =</span> name)
-named_dests</code></pre>
+<div class="sourceCode" id="cb102"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb102-1" data-line-number="1">named_dests &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb102-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(dest) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb102-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_flights =</span> <span class="kw">n</span>()) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb102-4" data-line-number="4"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(num_flights)) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb102-5" data-line-number="5"><span class="st">  </span><span class="kw">inner_join</span>(airports, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">&quot;dest&quot;</span> =<span class="st"> &quot;faa&quot;</span>)) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb102-6" data-line-number="6"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">airport_name =</span> name)</a>
+<a class="sourceLine" id="cb102-7" data-line-number="7">named_dests</a></code></pre></div>
 <pre><code># A tibble: 101 x 9
    dest  num_flights airport_name          lat    lon   alt    tz dst   tzone   
-   &lt;chr&gt;       &lt;int&gt; &lt;chr&gt;               &lt;dbl&gt;  &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;   
+   &lt;chr&gt;       &lt;int&gt; &lt;chr&gt;               &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;   
  1 ORD         17283 Chicago Ohare Intl   42.0  -87.9   668    -6 A     America…
  2 ATL         17215 Hartsfield Jackson…  33.6  -84.4  1026    -5 A     America…
  3 LAX         16174 Los Angeles Intl     33.9 -118.    126    -8 A     America…
@@ -1213,15 +1227,15 @@ <h3><span class="header-section-number">3.7.2</span> Different “key” variabl
  9 MIA         11728 Miami Intl           25.8  -80.3     8    -5 A     America…
 10 DCA          9705 Ronald Reagan Wash…  38.9  -77.0    15    -5 A     America…
 # … with 91 more rows</code></pre>
-<p>In case you didn’t know, <code>&quot;ORD&quot;</code> is the airport code of Chicago O’Hare airport and <code>&quot;FLL&quot;</code> is the main airport in Fort Lauderdale, Florida, which we can be seen in the <code>airport_name</code> variable.</p>
+<p>In case you didn’t know, <code>&quot;ORD&quot;</code> is the airport code of Chicago O’Hare airport and <code>&quot;FLL&quot;</code> is the main airport in Fort Lauderdale, Florida, which can be seen in the <code>airport_name</code> variable.</p>
 </div>
 <div id="multiple-key-variables" class="section level3">
 <h3><span class="header-section-number">3.7.3</span> Multiple “key” variables</h3>
-<p>Say instead we want to join two data frames by <em>multiple key variables</em>. For example, in Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a> we see that in order to join the <code>flights</code> and <code>weather</code> data frames, we need more than one key variable: <code>year</code>, <code>month</code>, <code>day</code>, <code>hour</code>, and <code>origin</code>. This is because the combination of these 5 variables act to uniquely identify each observational unit in the <code>weather</code> data frame: hourly weather recordings at each of the 3 NYC airports.</p>
-<p>We achieve this by specifying a <em>vector</em> of key variables to join by using the <code>c()</code> function. Recall from Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a> that <code>c()</code> is short for “combine” or “concatenate”. </p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_weather_joined &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">inner_join</span>(weather, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">&quot;year&quot;</span>, <span class="st">&quot;month&quot;</span>, <span class="st">&quot;day&quot;</span>, <span class="st">&quot;hour&quot;</span>, <span class="st">&quot;origin&quot;</span>))
-<span class="kw">View</span>(flights_weather_joined)</code></pre>
+<p>Say instead we want to join two data frames by <em>multiple key variables</em>. For example, in Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a>, we see that in order to join the <code>flights</code> and <code>weather</code> data frames, we need more than one key variable: <code>year</code>, <code>month</code>, <code>day</code>, <code>hour</code>, and <code>origin</code>. This is because the combination of these 5 variables acts to uniquely identify each observational unit in the <code>weather</code> data frame: hourly weather recordings at each of the 3 NYC airports.</p>
+<p>We achieve this by specifying a <em>vector</em> of key variables to join by using the <code>c()</code> function. Recall from Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a> that <code>c()</code> is short for “combine” or “concatenate.” </p>
+<div class="sourceCode" id="cb104"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb104-1" data-line-number="1">flights_weather_joined &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb104-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(weather, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">&quot;year&quot;</span>, <span class="st">&quot;month&quot;</span>, <span class="st">&quot;day&quot;</span>, <span class="st">&quot;hour&quot;</span>, <span class="st">&quot;origin&quot;</span>))</a>
+<a class="sourceLine" id="cb104-3" data-line-number="3"><span class="kw">View</span>(flights_weather_joined)</a></code></pre></div>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1235,14 +1249,14 @@ <h3><span class="header-section-number">3.7.3</span> Multiple “key” variable
 </div>
 <div id="normal-forms" class="section level3">
 <h3><span class="header-section-number">3.7.4</span> Normal forms</h3>
-<p>The data frames included in the <code>nycflights13</code> package are in a form that minimizes redundancy of data. For example, the <code>flights</code> data frame only saves the <code>carrier</code> code of the airline company; it does not include the actual name of the airline. For example the first row of <code>flights</code> has <code>carrier</code> equal to <code>UA</code>, but does it does not include the airline name “United Air Lines Inc.”</p>
+<p>The data frames included in the <code>nycflights13</code> package are in a form that minimizes redundancy of data. For example, the <code>flights</code> data frame only saves the <code>carrier</code> code of the airline company; it does not include the actual name of the airline. For instance, the first row of <code>flights</code> has <code>carrier</code> equal to <code>UA</code>, but it does not include the airline name “United Air Lines Inc.”</p>
 <p>The names of the airline companies are included in the <code>name</code> variable of the <code>airlines</code> data frame. In order to have the airline company name included in <code>flights</code>, we could join these two data frames as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">joined_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">inner_join</span>(airlines, <span class="dt">by =</span> <span class="st">&quot;carrier&quot;</span>)
-<span class="kw">View</span>(joined_flights)</code></pre>
-<p>We are capable of performing this join because each of the data frames have <em>keys</em> in common to relate one to another: the <code>carrier</code> variable in both the <code>flights</code> and <code>airlines</code> data frames. The <em>key</em> variable(s) that we base our joins on are often <em>identification variables</em> we mentioned previously.</p>
+<div class="sourceCode" id="cb105"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb105-1" data-line-number="1">joined_flights &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb105-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(airlines, <span class="dt">by =</span> <span class="st">&quot;carrier&quot;</span>)</a>
+<a class="sourceLine" id="cb105-3" data-line-number="3"><span class="kw">View</span>(joined_flights)</a></code></pre></div>
+<p>We are able to perform this join because each of the data frames has <em>keys</em> in common to relate one to another: the <code>carrier</code> variable in both the <code>flights</code> and <code>airlines</code> data frames. The <em>key</em> variable(s) that we base our joins on are often <em>identification variables</em>, as we mentioned previously.</p>
 <p>This is an important property of what’s known as <em>normal forms</em> of data. The process of decomposing data frames into less redundant tables without losing information is called <em>normalization</em>. More information is available on <a href="https://en.wikipedia.org/wiki/Database_normalization">Wikipedia</a>.</p>
-<p>Both <code>dplyr</code> and the <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> database querying language (pronounced “sequel”) we mentioned in the introduction of this chapter use such <em>normal forms</em>. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily.</p>
+<p>Both <code>dplyr</code> and <a href="https://en.wikipedia.org/wiki/SQL">SQL</a>, which we mentioned in the introduction of this chapter, use such <em>normal forms</em>. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1271,60 +1285,50 @@ <h3><span class="header-section-number">3.8.1</span> <code>select</code> variabl
 </p>
 </div>
 <p>We’ve seen that the <code>flights</code> data frame in the <code>nycflights13</code> package contains 19 different variables. You can identify the names of these 19 variables by running the <code>glimpse()</code> function from the <code>dplyr</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(flights)</code></pre>
+<div class="sourceCode" id="cb106"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb106-1" data-line-number="1"><span class="kw">glimpse</span>(flights)</a></code></pre></div>
 <p>However, say you only need two of these 19 variables, say <code>carrier</code> and <code>flight</code>. You can <code>select()</code>  these two variables:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(carrier, flight)</code></pre>
-<p>This function makes it easier to explore large datasets since it allows us to limit the scope to only those variables we care most about. For example, if we <code>select()</code> only a smaller number of variables, it will make viewing the dataset in RStudio’s spreadsheet viewer more digestible.</p>
+<div class="sourceCode" id="cb107"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb107-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb107-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(carrier, flight)</a></code></pre></div>
+<p>This function makes it easier to explore large datasets since it allows us to limit the scope to only those variables we care most about. For example, if we <code>select()</code> only a smaller number of variables as is shown in Figure <a href="3-wrangling.html#fig:selectfig">3.9</a>, it will make viewing the dataset in RStudio’s spreadsheet viewer more digestible.</p>
 <p>Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable <code>year</code> in the <code>flights</code> data frame. This variable isn’t quite a “variable” because it is always <code>2013</code> and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect <code>year</code> by using the <code>-</code> sign:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_no_year &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(<span class="op">-</span>year)</code></pre>
+<div class="sourceCode" id="cb108"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb108-1" data-line-number="1">flights_no_year &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(<span class="op">-</span>year)</a></code></pre></div>
 <p>Another way of selecting columns/variables is by specifying a range of columns:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flight_arr_times &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(month<span class="op">:</span>day, arr_time<span class="op">:</span>sched_arr_time)
-flight_arr_times</code></pre>
+<div class="sourceCode" id="cb109"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb109-1" data-line-number="1">flight_arr_times &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(month<span class="op">:</span>day, arr_time<span class="op">:</span>sched_arr_time)</a>
+<a class="sourceLine" id="cb109-2" data-line-number="2">flight_arr_times</a></code></pre></div>
 <p>This will <code>select()</code> all columns between <code>month</code> and <code>day</code>, as well as between <code>arr_time</code> and <code>sched_arr_time</code>, and drop the rest.</p>
 <p>The <code>select()</code> function can also be used to reorder columns when used with the <code>everything()</code> helper function. For example, suppose we want the <code>hour</code>, <code>minute</code>, and <code>time_hour</code> variables to appear immediately after the <code>year</code>, <code>month</code>, and <code>day</code> variables, while not discarding the rest of the variables. In the following code, <code>everything()</code> will pick up all remaining variables:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_reorder &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(year, month, day, hour, minute, time_hour, <span class="kw">everything</span>())
-<span class="kw">glimpse</span>(flights_reorder)</code></pre>
-<p>Lastly, the helper functions <code>starts_with()</code>, <code>ends_with()</code>, and <code>contains()</code> can be used to select variables/columns that match those conditions. For example:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_begin_a &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(<span class="kw">starts_with</span>(<span class="st">&quot;a&quot;</span>))
-flights_begin_a</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">flights_delays &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(<span class="kw">ends_with</span>(<span class="st">&quot;delay&quot;</span>))
-flights_delays</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">flights_time &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(<span class="kw">contains</span>(<span class="st">&quot;time&quot;</span>))
-flights_time</code></pre>
+<div class="sourceCode" id="cb110"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb110-1" data-line-number="1">flights_reorder &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb110-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(year, month, day, hour, minute, time_hour, <span class="kw">everything</span>())</a>
+<a class="sourceLine" id="cb110-3" data-line-number="3"><span class="kw">glimpse</span>(flights_reorder)</a></code></pre></div>
+<p>Lastly, the helper functions <code>starts_with()</code>, <code>ends_with()</code>, and <code>contains()</code> can be used to select variables/columns that match those conditions. As examples,</p>
+<div class="sourceCode" id="cb111"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb111-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(<span class="kw">starts_with</span>(<span class="st">&quot;a&quot;</span>))</a>
+<a class="sourceLine" id="cb111-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(<span class="kw">ends_with</span>(<span class="st">&quot;delay&quot;</span>))</a>
+<a class="sourceLine" id="cb111-3" data-line-number="3">flights <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(<span class="kw">contains</span>(<span class="st">&quot;time&quot;</span>))</a></code></pre></div>
 </div>
 <div id="rename" class="section level3">
 <h3><span class="header-section-number">3.8.2</span> <code>rename</code> variables</h3>
-<p>Another useful function is  <code>rename()</code>, which as you may have guessed renames variables. Suppose we want <code>dep_time</code> and <code>arr_time</code> to be <code>departure_time</code> and <code>arrival_time</code> instead in the <code>flights_time</code> data frame:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_time_new &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(dep_time, arr_time) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">departure_time =</span> dep_time,
-         <span class="dt">arrival_time =</span> arr_time)
-<span class="kw">glimpse</span>(flights_time_new)</code></pre>
-<p>Note that in this case we used a single <code>=</code> sign within the <code>rename()</code>. For example <code>departure_time = dep_time</code> renames the <code>dep_time</code> variable to have the new name <code>departure_time</code>. This is because we are not testing for equality like we would using <code>==</code>. Instead we want to assign a new variable <code>departure_time</code> to have the same values as <code>dep_time</code> and then delete the variable <code>dep_time</code>. It’s easy to forget if the new name comes before or after the equals sign. We usually remember this as “New Before, Old After” or NBOA.</p>
+<p>Another useful function is <code>rename()</code>, which as you may have guessed changes the names of variables. Suppose we want to focus only on <code>dep_time</code> and <code>arr_time</code> and rename them to <code>departure_time</code> and <code>arrival_time</code>, saving the result in a new data frame <code>flights_time_new</code>:</p>
+<div class="sourceCode" id="cb112"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb112-1" data-line-number="1">flights_time_new &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb112-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(dep_time, arr_time) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb112-3" data-line-number="3"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">departure_time =</span> dep_time, <span class="dt">arrival_time =</span> arr_time)</a>
+<a class="sourceLine" id="cb112-4" data-line-number="4"><span class="kw">glimpse</span>(flights_time_new)</a></code></pre></div>
+<p>Note that in this case we used a single <code>=</code> sign within the <code>rename()</code>. For example, <code>departure_time = dep_time</code> renames the <code>dep_time</code> variable to have the new name <code>departure_time</code>. This is because we are not testing for equality like we would using <code>==</code>. Instead we want to assign a new variable <code>departure_time</code> to have the same values as <code>dep_time</code> and then delete the variable <code>dep_time</code>. Note that new <code>dplyr</code> users often forget that the new variable name comes before the equal sign. <!-- We usually remember this as "New Before, Old After" or NBOA. --></p>
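+<p>As a minimal sketch of this “new name = old name” ordering, consider a small made-up data frame. The name <code>toy_times</code> and its values are hypothetical and not part of <code>nycflights13</code>, and we assume the <code>dplyr</code> package is installed:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+# Hypothetical toy data frame, for illustration only
+toy_times &lt;- tibble(dep_time = c(517, 533), arr_time = c(830, 850))
+# The new name goes before the equal sign, the old name after it
+toy_renamed &lt;- toy_times %&gt;% 
+  rename(departure_time = dep_time, arrival_time = arr_time)
+names(toy_renamed)</code></pre>
+<p>The column names of <code>toy_renamed</code> should now be <code>departure_time</code> and <code>arrival_time</code>, while the values themselves are unchanged.</p>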
 </div>
 <div id="top_n-values-of-a-variable" class="section level3">
 <h3><span class="header-section-number">3.8.3</span> <code>top_n</code> values of a variable</h3>
-<p>We can also return the top <code>n</code> values of a variable using the <code>top_n()</code>  function. For example, we can return a data frame of the top 10 destination airports using the example from Section <a href="3-wrangling.html#diff-key">3.7.2</a>. Observe that we set the number of values to return to <code>n = 10</code> and <code>wt = num_flights</code> to indicate that we want the rows corresponding to the top 10 values of <code>num_flights</code>. See the help file for <code>top_n()</code> by running <code>?top_n</code> for more information.</p>
-<pre class="sourceCode r"><code class="sourceCode r">named_dests <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">top_n</span>(<span class="dt">n =</span> <span class="dv">10</span>, <span class="dt">wt =</span> num_flights)</code></pre>
+<p>We can also return the top <code>n</code> values of a variable using the <code>top_n()</code>  function. For example, we can return a data frame of the top 10 destination airports using the example from Subsection <a href="3-wrangling.html#diff-key">3.7.2</a>. Observe that we set the number of values to return to <code>n = 10</code> and <code>wt = num_flights</code> to indicate that we want the rows corresponding to the top 10 values of <code>num_flights</code>. See the help file for <code>top_n()</code> by running <code>?top_n</code> for more information.</p>
+<div class="sourceCode" id="cb113"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb113-1" data-line-number="1">named_dests <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">top_n</span>(<span class="dt">n =</span> <span class="dv">10</span>, <span class="dt">wt =</span> num_flights)</a></code></pre></div>
 <p>Let’s further <code>arrange()</code> these results in descending order of <code>num_flights</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">named_dests  <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">top_n</span>(<span class="dt">n =</span> <span class="dv">10</span>, <span class="dt">wt =</span> num_flights) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(num_flights))</code></pre>
+<div class="sourceCode" id="cb114"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb114-1" data-line-number="1">named_dests  <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb114-2" data-line-number="2"><span class="st">  </span><span class="kw">top_n</span>(<span class="dt">n =</span> <span class="dv">10</span>, <span class="dt">wt =</span> num_flights) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb114-3" data-line-number="3"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(num_flights))</a></code></pre></div>
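+<p>To see why the extra <code>arrange()</code> step is useful, here is a minimal sketch on a made-up data frame. The name <code>toy_dests</code> and its values are hypothetical, and we assume the <code>dplyr</code> package is installed. Note that <code>top_n()</code> only picks out the rows with the largest values; it does not sort them:</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+# Hypothetical toy data frame, for illustration only
+toy_dests &lt;- tibble(
+  dest = c("AAA", "BBB", "CCC", "DDD"),
+  num_flights = c(5, 15, 10, 20)
+)
+# Keep the 2 rows with the largest num_flights, in their original row order
+toy_top2 &lt;- toy_dests %&gt;% top_n(n = 2, wt = num_flights)
+# Sort those rows from largest to smallest num_flights
+toy_top2_sorted &lt;- toy_top2 %&gt;% arrange(desc(num_flights))
+toy_top2_sorted</code></pre>
+<p>Here <code>toy_top2</code> lists <code>BBB</code> before <code>DDD</code> since that is their original order, whereas <code>toy_top2_sorted</code> lists <code>DDD</code> first.</p>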
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
 <p><strong>(LC3.16)</strong> What are some ways to select all three of the <code>dest</code>, <code>air_time</code>, and <code>distance</code> variables from <code>flights</code>? Give the code showing how to do this in at least three different ways.</p>
-<p><strong>(LC3.17)</strong> How could one use <code>starts_with</code>, <code>ends_with</code>, and <code>contains</code> to select columns from the <code>flights</code> data frame? Provide three different examples in total: one for <code>starts_with</code>, one for <code>ends_with</code>, and one for <code>contains</code>.</p>
+<p><strong>(LC3.17)</strong> How could one use <code>starts_with()</code>, <code>ends_with()</code>, and <code>contains()</code> to select columns from the <code>flights</code> data frame? Provide three different examples in total: one for <code>starts_with()</code>, one for <code>ends_with()</code>, and one for <code>contains()</code>.</p>
 <p><strong>(LC3.18)</strong> Why might we want to use the <code>select()</code> function on a data frame?</p>
 <p><strong>(LC3.19)</strong> Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.</p>
 <div class="learncheck">
@@ -1339,7 +1343,7 @@ <h3><span class="header-section-number">3.9.1</span> Summary table</h3>
 <p>Let’s recap our data wrangling verbs in Table <a href="3-wrangling.html#tab:wrangle-summary-table">3.2</a>. Using these verbs and the pipe <code>%&gt;%</code> operator from Section <a href="3-wrangling.html#piping">3.1</a>, you’ll be able to write easily legible code to perform almost all the data wrangling and data transformation necessary for the rest of this book.</p>
 <table>
 <caption>
-<span id="tab:wrangle-summary-table">TABLE 3.2: </span>Summary of data wrangling verbs.
+<span id="tab:wrangle-summary-table">TABLE 3.2: </span>Summary of data wrangling verbs
 </caption>
 <thead>
 <tr>
@@ -1373,7 +1377,7 @@ <h3><span class="header-section-number">3.9.1</span> Summary table</h3>
 <code>group_by()</code>
 </td>
 <td style="text-align:left;">
-Add grouping structure to rows in data frame. Note this does not change values in data frame.
+Add grouping structure to rows in data frame. Note this does not change the values in the data frame, only its metadata.
 </td>
 </tr>
 <tr>
@@ -1408,10 +1412,18 @@ <h3><span class="header-section-number">3.9.1</span> Summary table</h3>
 </p>
 </div>
 <p><strong>(LC3.20)</strong> Let’s now put your newly acquired data wrangling skills to the test!</p>
-<p>An airline industry measure of a passenger airline’s capacity is the <a href="https://en.wikipedia.org/wiki/Available_seat_miles">available seat miles</a>, which is equal to the number of seats available multiplied by the number of miles or kilometers flown summed over all flights. So for example say an airline had 2 flights using a plane with 10 seats that flew 500 miles and 3 flights using a plane with 20 seats that flew 1000 miles, the available seat miles would be 2 <span class="math inline">\(\times\)</span> 10 <span class="math inline">\(\times\)</span> 500 <span class="math inline">\(+\)</span> 3 <span class="math inline">\(\times\)</span> 20 <span class="math inline">\(\times\)</span> 1000 = 70,000 seat miles.</p>
+<p>An airline industry measure of a passenger airline’s capacity is the <a href="https://en.wikipedia.org/wiki/Available_seat_miles">available seat miles</a>, which is equal to the number of seats available multiplied by the number of miles or kilometers flown, summed over all flights.</p>
+<p>For example, let’s consider the scenario in Figure <a href="3-wrangling.html#fig:available-seat-miles">3.10</a>. Since the airplane has 4 seats and it travels 200 miles, the available seat miles are <span class="math inline">\(4 \times 200 = 800\)</span>.</p>
+<div class="figure" style="text-align: center"><span id="fig:available-seat-miles"></span>
+<img src="images/flowcharts/flowchart/flowchart.012.png" alt="Example of available seat miles for one flight." width="\textwidth" height="40%" />
+<p class="caption">
+FIGURE 3.10: Example of available seat miles for one flight.
+</p>
+</div>
+<p>Extending this idea, say an airline had 2 flights using a plane with 10 seats that flew 500 miles and 3 flights using a plane with 20 seats that flew 1000 miles. The available seat miles would then be <span class="math inline">\(2 \times 10 \times 500 + 3 \times 20 \times 1000 = 70{,}000\)</span> seat miles.</p>
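+<p>The arithmetic in this two-plane example can be checked with a few lines of base R. This is only a sketch on made-up numbers: the data frame <code>toy_flights</code> and its column names are hypothetical and do not correspond to any dataset in <code>nycflights13</code>.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy data: one row per plane type in the example
+toy_flights &lt;- data.frame(
+  num_flights = c(2, 3),
+  seats = c(10, 20),
+  miles = c(500, 1000)
+)
+# Seats times miles for each plane type, times its number of flights, then summed
+total_asm &lt;- sum(toy_flights$num_flights * toy_flights$seats * toy_flights$miles)
+total_asm</code></pre>
+<p>The result should be 70000, matching the hand calculation.</p>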
 <p>Using the datasets included in the <code>nycflights13</code> package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:</p>
 <ol style="list-style-type: decimal">
-<li><strong>Crucial</strong>: Unless you are very confident in what you are doing, it is worthwhile to not starting to code right away. Rather first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level <em>pseudocode</em> that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse <em>what</em> you are trying to do (the algorithm) with <em>how</em> you are going to do it (writing <code>dplyr</code> code).</li>
+<li><strong>Crucial</strong>: Unless you are very confident in what you are doing, it is worthwhile to hold off on coding right away. Rather, first sketch out on paper all the necessary data wrangling steps, using not exact code but high-level <em>pseudocode</em> that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse <em>what</em> you are trying to do (the algorithm) with <em>how</em> you are going to do it (writing <code>dplyr</code> code).</li>
 <li>Take a close look at all the datasets using the <code>View()</code> function: <code>flights</code>, <code>weather</code>, <code>planes</code>, <code>airports</code>, and <code>airlines</code> to identify which variables are necessary to compute available seat miles.</li>
 <li>Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a> showing how the various datasets can be joined will also be useful.</li>
 <li>Consider the data wrangling verbs in Table <a href="3-wrangling.html#tab:wrangle-summary-table">3.2</a> as your toolbox!</li>
@@ -1423,20 +1435,20 @@ <h3><span class="header-section-number">3.9.1</span> Summary table</h3>
 <div id="additional-resources-2" class="section level3">
 <h3><span class="header-section-number">3.9.2</span> Additional resources</h3>
 <p>An R script file of all R code used in this chapter is available <a href="scripts/03-wrangling.R">here</a>.</p>
-<p>If you want to further unlock the power of the <code>dplyr</code> package for data wrangling, we suggest you that you check out RStudio’s “Data Transformation with dplyr” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter, in particular more-intermediate level and advanced data wrangling functions, while providing quick and easy to read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure <a href="3-wrangling.html#fig:filter">3.1</a> on <code>filter()</code>, originate from this cheatsheet.</p>
-<p>You can access this cheatsheet  by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Transformation with dplyr”. You can see a preview in the figure below.</p>
+<p>If you want to further unlock the power of the <code>dplyr</code> package for data wrangling, we suggest that you check out RStudio’s “Data Transformation with dplyr” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter, in particular more intermediate-level and advanced data wrangling functions, while providing quick and easy-to-read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure <a href="3-wrangling.html#fig:filter">3.1</a> on <code>filter()</code>, originate from this cheatsheet.</p>
+<p>In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Transformation with dplyr.” You can see a preview in the figure below.</p>
 <div class="figure" style="text-align: center"><span id="fig:dplyr-cheatsheet"></span>
 <img src="images/cheatsheets/dplyr_cheatsheet-1.png" alt="Data Transformation with dplyr cheatsheet." width="\textwidth" />
 <p class="caption">
-FIGURE 3.10: Data Transformation with dplyr cheatsheet.
+FIGURE 3.11: Data Transformation with dplyr cheatsheet.
 </p>
 </div>
-<p>On top of data wrangling verbs and examples we presented in this section, if you’d like to see more examples of using the <code>dplyr</code> package for data wrangling check out <a href="http://r4ds.had.co.nz/transform.html">Chapter 5</a> of Garrett Grolemund and Hadley Wickham’s book <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span>.</p>
+<p>If you’d like to see more examples of using the <code>dplyr</code> package for data wrangling beyond the verbs and examples we presented in this section, check out <a href="http://r4ds.had.co.nz/transform.html">Chapter 5</a> of <em>R for Data Science</em> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span>.</p>
 </div>
 <div id="whats-to-come-1" class="section level3">
 <h3><span class="header-section-number">3.9.3</span> What’s to come?</h3>
 <p>So far in this book, we’ve explored, visualized, and wrangled data saved in data frames. These data frames were saved in a spreadsheet-like format: in a rectangular shape with a certain number of rows corresponding to observations and a certain number of columns corresponding to variables describing these observations.</p>
-<p>We’ll see in the upcoming Chapter <a href="4-tidy.html#tidy">4</a> that there are actually two ways to represent data in spreadsheet-type rectangular format: 1) “wide” format and 2) “tall/narrow” format. The tall/narrow format is also known as <em>“tidy”</em> format in R user circles. While the distinction between “tidy” and non-“tidy” formatted data is very subtle, it has very large implications for our data science work. This is because almost all the packages used in this book, including the <code>ggplot2</code> package for data visualization and the <code>dplyr</code> package for data wrangling, all assume that all data frames are in “tidy” format.</p>
+<p>We’ll see in the upcoming Chapter <a href="4-tidy.html#tidy">4</a> that there are actually two ways to represent data in spreadsheet-type rectangular format: (1) “wide” format and (2) “tall/narrow” format. The tall/narrow format is also known as <em>“tidy”</em> format in R user circles. While the distinction between “tidy” and non-“tidy” formatted data is subtle, it has immense implications for our data science work. This is because almost all the packages used in this book, including the <code>ggplot2</code> package for data visualization and the <code>dplyr</code> package for data wrangling, assume that data frames are in “tidy” format.</p>
 <p>Furthermore, up until now we’ve only explored, visualized, and wrangled data saved within R packages. But what if you want to analyze data that you have saved in a Microsoft Excel, a Google Sheets, or a “Comma-Separated Values” (CSV) file? In Section <a href="4-tidy.html#csv">4.1</a>, we’ll show you how to import this data into R using the <code>readr</code> package.</p>
 
 </div>
@@ -1445,7 +1457,7 @@ <h3><span class="header-section-number">3.9.3</span> What’s to come?</h3>
 <h3>References</h3>
 <div id="refs" class="references">
 <div id="ref-rds2016">
-<p>Grolemund, Garrett, and Hadley Wickham. 2016. <em>R for Data Science</em>. <a href="http://r4ds.had.co.nz/">http://r4ds.had.co.nz/</a>.</p>
+<p>Grolemund, Garrett, and Hadley Wickham. 2017. <em>R for Data Science</em>. First. Sebastopol, CA: O’Reilly Media. <a href="https://r4ds.had.co.nz/">https://r4ds.had.co.nz/</a>.</p>
 </div>
 </div>
             </section>
@@ -1459,11 +1471,13 @@ <h3>References</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1471,12 +1485,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1491,6 +1504,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1507,8 +1524,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/4-tidy.html b/docs/4-tidy.html
index c79f2bc3b..6ee58f697 100644
--- a/docs/4-tidy.html
+++ b/docs/4-tidy.html
@@ -4,35 +4,35 @@
 
   <meta charset="utf-8" />
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
-  <title>Chapter 4 Data Importing &amp; “Tidy” Data | Statistical Inference via Data Science</title>
+  <title>Chapter 4 Data Importing and “Tidy” Data | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
-  <meta property="og:title" content="Chapter 4 Data Importing &amp; “Tidy” Data | Statistical Inference via Data Science" />
+  <meta property="og:title" content="Chapter 4 Data Importing and “Tidy” Data | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
-  <meta name="twitter:title" content="Chapter 4 Data Importing &amp; “Tidy” Data | Statistical Inference via Data Science" />
+  <meta name="twitter:title" content="Chapter 4 Data Importing and “Tidy” Data | Statistical Inference via Data Science" />
   <meta name="twitter:site" content="@ModernDive" />
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="3-wrangling.html">
-<link rel="next" href="5-regression.html">
+<link rel="prev" href="3-wrangling.html"/>
+<link rel="next" href="5-regression.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -569,38 +582,38 @@ <h1>
 <img src='https://moderndive.com/wide_format.png' alt="ModernDive">
 </html>
 <div id="tidy" class="section level1">
-<h1><span class="header-section-number">Chapter 4</span> Data Importing &amp; “Tidy” Data</h1>
-<p>In Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a>, we introduced the concept of a  data frame in R: a rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation. In Section <a href="1-getting-started.html#nycflights13">1.4</a>, we started exploring our first data frame: the <code>flights</code> data frame included in the <code>nycflights13</code> package. In Chapter <a href="2-viz.html#viz">2</a> we created visualizations based on the data included in <code>flights</code> and other data frames such as <code>weather</code>. In Chapter <a href="3-wrangling.html#wrangling">3</a>, we learned how to wrangle data, in other words take existing data frames and transform/modify them to suit our ends.</p>
-<p>In this final chapter of the “Data Science via the tidyverse” portion of the book, we extend some of these ideas by discussing a type of data formatting called “tidy” data. You will see that having data stored in “tidy” format is about more than what the everyday definition of the term “tidy” might suggest: having your data “neatly organized.” Instead, we define the term “tidy” as it’s used by data scientists who use R, outlining a set of rules by which data is saved.</p>
-<p>Knowledge of this type of data formatting was not necessary for our treatment of data visualization in Chapter <a href="2-viz.html#viz">2</a> and data wrangling in Chapter <a href="3-wrangling.html#wrangling">3</a>. This is because all the data used was already in “tidy” format. In this chapter, we’ll now see that this format is essential to using the tools we covered up until now. Furthermore, it will also be useful for all subsequent chapters in this book when we cover regression and statistical inference. First however, we’ll show you how to import spreadsheet data in R.</p>
+<h1><span class="header-section-number">Chapter 4</span> Data Importing and “Tidy” Data</h1>
+<p>In Subsection <a href="1-getting-started.html#programming-concepts">1.2.1</a>, we introduced the concept of a data frame in R: a rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation. In Section <a href="1-getting-started.html#nycflights13">1.4</a>, we started exploring our first data frame: the <code>flights</code> data frame included in the <code>nycflights13</code> package. In Chapter <a href="2-viz.html#viz">2</a>, we created visualizations based on the data included in <code>flights</code> and other data frames such as <code>weather</code>. In Chapter <a href="3-wrangling.html#wrangling">3</a>, we learned how to take existing data frames and transform/modify them to suit our ends.</p>
+<p>In this final chapter of the “Data Science with <code>tidyverse</code>” portion of the book, we extend some of these ideas by discussing a type of data formatting called “tidy” data. You will see that having data stored in “tidy” format is about more than just what the everyday definition of the term “tidy” might suggest: having your data “neatly organized.” Instead, we define the term “tidy” as it’s used by data scientists who use R, outlining a set of rules by which data is saved.</p>
+<p>Knowledge of this type of data formatting was not necessary for our treatment of data visualization in Chapter <a href="2-viz.html#viz">2</a> and data wrangling in Chapter <a href="3-wrangling.html#wrangling">3</a>. This is because all the data used were already in “tidy” format. In this chapter, we’ll now see that this format is essential to using the tools we covered up until now. Furthermore, it will also be useful for all subsequent chapters in this book when we cover regression and statistical inference. First, however, we’ll show you how to import spreadsheet data in R.</p>
 <div id="needed-packages-2" class="section level3 unnumbered">
 <h3>Needed packages</h3>
 <p>Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(readr)
-<span class="kw">library</span>(tidyr)
-<span class="kw">library</span>(nycflights13)
-<span class="kw">library</span>(fivethirtyeight)</code></pre>
+<div class="sourceCode" id="cb115"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb115-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb115-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb115-3" data-line-number="3"><span class="kw">library</span>(readr)</a>
+<a class="sourceLine" id="cb115-4" data-line-number="4"><span class="kw">library</span>(tidyr)</a>
+<a class="sourceLine" id="cb115-5" data-line-number="5"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb115-6" data-line-number="6"><span class="kw">library</span>(fivethirtyeight)</a></code></pre></div>
 </div>
 <div id="csv" class="section level2">
 <h2><span class="header-section-number">4.1</span> Importing data</h2>
-<p>Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online? How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats.</p>
+<p>Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online. How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats:</p>
 <p>First, a <em>Comma Separated Values</em> <code>.csv</code>  file. You can think of a <code>.csv</code> file as a bare-bones spreadsheet where:</p>
 <ul>
 <li>Each line in the file corresponds to one row of data/one observation.</li>
-<li>Values for each line are separated with commas. In other words, the values of different variables are separated by commas.</li>
+<li>Values for each line are separated with commas. In other words, the values of different variables are separated by commas in each row.</li>
 <li>The first line is often, but not always, a <em>header</em> row indicating the names of the columns/variables.</li>
 </ul>
-<p>Second, an Excel <code>.xlsx</code> spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to a bare-bones <code>.csv</code> file, an <code>.xlsx</code> Excel files contains a lot of meta-data, or in other words, data about data. Recall we saw a previous example of meta-data in Section <a href="3-wrangling.html#groupby">3.4</a> when adding “group structure” meta-data to a data frame by using the <code>group_by()</code> verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros.</p>
-<p>Third, a <a href="https://www.google.com/sheets/about/">Google Sheets</a> file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values <code>.csv</code> and Excel <code>.xlsx</code> formats. One way to import Google Sheets data is to go to the Google Sheets menu bar -&gt; File -&gt; Download as -&gt; Select “Microsoft Excel” or “Comma-separated values” and then load that data into R.</p>
-<p>We’ll cover two methods for importing <code>.csv</code> and <code>.xlsx</code> spreadsheet data in R: one using the console and the other using RStudio’s graphical user interface, abbreviated by “GUI.”</p>
+<p>Second, an Excel <code>.xlsx</code> spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to bare-bones <code>.csv</code> files, <code>.xlsx</code> Excel files contain a lot of meta-data (data about data). Recall we saw a previous example of meta-data in Section <a href="3-wrangling.html#groupby">3.4</a> when adding “group structure” meta-data to a data frame by using the <code>group_by()</code> verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros.</p>
+<p>Third, a <a href="https://www.google.com/sheets/about/">Google Sheets</a> file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values <code>.csv</code> and Excel <code>.xlsx</code> formats. One way to import Google Sheets data in R is to go to the Google Sheets menu bar -&gt; File -&gt; Download as -&gt; Select “Microsoft Excel” or “Comma-separated values” and then load that data into R. A more advanced way to import Google Sheets data in R is by using the <a href="https://cran.r-project.org/web/packages/googlesheets/vignettes/basic-usage.html"><code>googlesheets</code></a> package, a method we leave to a more advanced data science book.</p>
+<p>We’ll cover two methods for importing <code>.csv</code> and <code>.xlsx</code> spreadsheet data in R: one using the console and the other using RStudio’s graphical user interface, abbreviated as “GUI.”</p>
 <div id="using-the-console" class="section level3">
 <h3><span class="header-section-number">4.1.1</span> Using the console</h3>
 <p>First, let’s import a Comma Separated Values <code>.csv</code> file that exists on the internet. The <code>.csv</code> file <code>dem_score.csv</code> contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at <a href="https://moderndive.com/data/dem_score.csv" class="uri">https://moderndive.com/data/dem_score.csv</a>. Let’s use the <code>read_csv()</code> function from the <code>readr</code>  <span class="citation">(Wickham, Hester, and Francois <a href="#ref-R-readr">2018</a>)</span> package to read it off the web, import it into R, and save it in a data frame called <code>dem_score</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(readr)
-dem_score &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/dem_score.csv&quot;</span>)
-dem_score</code></pre>
+<div class="sourceCode" id="cb116"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb116-1" data-line-number="1"><span class="kw">library</span>(readr)</a>
+<a class="sourceLine" id="cb116-2" data-line-number="2">dem_score &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/dem_score.csv&quot;</span>)</a>
+<a class="sourceLine" id="cb116-3" data-line-number="3">dem_score</a></code></pre></div>
 <pre><code># A tibble: 96 x 10
    country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
    &lt;chr&gt;       &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;
@@ -615,60 +628,53 @@ <h3><span class="header-section-number">4.1.1</span> Using the console</h3>
  9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
 10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
 # … with 86 more rows</code></pre>
-<p>In this <code>dem_score</code> data frame, the minimum value of <code>-10</code> corresponds to a highly autocratic nation whereas a value of <code>10</code> corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number nor include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the <code>dem_score</code> data frame in a case study in the upcoming Section <a href="4-tidy.html#case-study-tidy">4.3</a>.</p>
-<p>Note that the <code>read_csv()</code> function included in the <code>readr</code> package is different than the <code>read.csv()</code> function that comes installed with R. While the difference in the names might seem trivial (an <code>_</code> instead of a <code>.</code>), the <code>read_csv()</code> function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the <code>read_csv()</code> function included in the <code>readr</code> saves data frames as <code>tibbles</code> by default. <code>tibble</code> is short for “tidy table”; we’ll discuss what it makes for data to be “tidy” shortly in the upcoming Section <a href="4-tidy.html#tidy-data-ex">4.2</a>.</p>
+<p>In this <code>dem_score</code> data frame, the minimum value of <code>-10</code> corresponds to a highly autocratic nation, whereas a value of <code>10</code> corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number or include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the <code>dem_score</code> data frame in a case study in the upcoming Section <a href="4-tidy.html#case-study-tidy">4.3</a>.</p>
+<p>Note that the <code>read_csv()</code> function included in the <code>readr</code> package is different from the <code>read.csv()</code> function that comes installed with R. While the difference in the names might seem trivial (an <code>_</code> instead of a <code>.</code>), the <code>read_csv()</code> function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the <code>read_csv()</code> function included in the <code>readr</code> package saves data frames as <code>tibbles</code> by default.</p>
 </div>
 <div id="using-rstudios-interface" class="section level3">
 <h3><span class="header-section-number">4.1.2</span> Using RStudio’s interface</h3>
 <p>Let’s read in the exact same data, but this time from an Excel file saved on your computer. Furthermore, we’ll do this using RStudio’s graphical interface instead of running <code>read_csv()</code> in the console. First, download the Excel file <code>dem_score.xlsx</code> by going to <a href="https://moderndive.com/data/dem_score.xlsx" download>https://moderndive.com/data/dem_score.xlsx</a>, then</p>
 <ol style="list-style-type: decimal">
 <li>Go to the Files pane of RStudio.</li>
-<li>Navigate to the directory (i.e. folder on your computer) where the downloaded <code>dem_score.xlsx</code> Excel file is saved. For example, this might be in your Downloads folder.</li>
+<li>Navigate to the directory (i.e., folder on your computer) where the downloaded <code>dem_score.xlsx</code> Excel file is saved. For example, this might be in your Downloads folder.</li>
 <li>Click on <code>dem_score.xlsx</code>.</li>
 <li>Click “Import Dataset…”</li>
 </ol>
-<p>At this point you should see a screen pop-up like in Figure <a href="4-tidy.html#fig:read-excel">4.1</a>. After clicking on the “Import”  button on the bottom right of Figure <a href="4-tidy.html#fig:read-excel">4.1</a>, RStudio will save this spreadsheet’s data in a data frame called <code>dem_score</code> and display its contents in the spreadsheet viewer.</p>
+<p>At this point, you should see a screen pop-up like in Figure <a href="4-tidy.html#fig:read-excel">4.1</a>. After clicking on the “Import”  button on the bottom right of Figure <a href="4-tidy.html#fig:read-excel">4.1</a>, RStudio will save this spreadsheet’s data in a data frame called <code>dem_score</code> and display its contents in the spreadsheet viewer.</p>
 <div class="figure" style="text-align: center"><span id="fig:read-excel"></span>
 <img src="images/rstudio_screenshots/read_excel.png" alt="Importing an Excel file to R." width="\textwidth" />
 <p class="caption">
 FIGURE 4.1: Importing an Excel file to R.
 </p>
 </div>
-<p>Furthermore, note the “Code Preview” block in the bottom right of Figure <a href="4-tidy.html#fig:read-excel">4.1</a>. You can copy and paste this code to reload your data again later automatically, instead of repeating this manual point-and-click process.</p>
+<p>Furthermore, note the “Code Preview” block in the bottom right of Figure <a href="4-tidy.html#fig:read-excel">4.1</a>. You can copy and paste this code to reload your data again later programmatically, instead of repeating this manual point-and-click process.</p>
 </div>
 </div>
 <div id="tidy-data-ex" class="section level2">
-<h2><span class="header-section-number">4.2</span> Tidy data</h2>
-<p>Let’s now switch gears and learn about the concept of “tidy” data format with a motivating example from the <code>fivethirtyeight</code> package. The <code>fivethirtyeight</code> package <span class="citation">(Kim, Ismay, and Chunn <a href="#ref-R-fivethirtyeight">2018</a>)</span> provides access to the datasets used in many articles published by data journalism website <a href="https://fivethirtyeight.com/">FiveThirtyEight.com</a>. For a complete list of all 107 data sets included in the <code>fivethirtyeight</code> package, check out the package webpage by going to <a href="https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html" class="uri">https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html</a>.</p>
-<p>Let’s focus our attention on the <code>drinks</code> data frame:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks</code></pre>
-<pre><code># A tibble: 193 x 5
-   country      beer_servings spirit_servings wine_servings total_litres_of_pur…
-   &lt;chr&gt;                &lt;int&gt;           &lt;int&gt;         &lt;int&gt;                &lt;dbl&gt;
- 1 Afghanistan              0               0             0                  0  
- 2 Albania                 89             132            54                  4.9
- 3 Algeria                 25               0            14                  0.7
- 4 Andorra                245             138           312                 12.4
- 5 Angola                 217              57            45                  5.9
- 6 Antigua &amp; B…           102             128            45                  4.9
- 7 Argentina              193              25           221                  8.3
- 8 Armenia                 21             179            11                  3.8
- 9 Australia              261              72           212                 10.4
-10 Austria                279              75           191                  9.7
-# … with 183 more rows</code></pre>
-<p>After reading the help file by running <code>?drinks</code>, you’ll see that <code>drinks</code> is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article <a href="https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/">“Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”</a></p>
+<h2><span class="header-section-number">4.2</span> “Tidy” data</h2>
+<p>Let’s now switch gears and learn about the concept of “tidy” data format with a motivating example from the <code>fivethirtyeight</code> package. The <code>fivethirtyeight</code> package <span class="citation">(Kim, Ismay, and Chunn <a href="#ref-R-fivethirtyeight">2019</a>)</span> provides access to the datasets used in many articles published by the data journalism website, <a href="https://fivethirtyeight.com/">FiveThirtyEight.com</a>. For a complete list of all 127 datasets included in the <code>fivethirtyeight</code> package, check out the package webpage by going to: <a href="https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html" class="uri">https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html</a>.</p>
+<p>Let’s focus our attention on the <code>drinks</code> data frame and look at its first 5 rows:</p>
+<pre><code># A tibble: 5 x 5
+  country    beer_servings spirit_servings wine_servings total_litres_of_pure_a…
+  &lt;chr&gt;              &lt;int&gt;           &lt;int&gt;         &lt;int&gt;                   &lt;dbl&gt;
+1 Afghanist…             0               0             0                     0  
+2 Albania               89             132            54                     4.9
+3 Algeria               25               0            14                     0.7
+4 Andorra              245             138           312                    12.4
+5 Angola               217              57            45                     5.9</code></pre>
+<p>After reading the help file by running <code>?drinks</code>, you’ll see that <code>drinks</code> is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article: <a href="https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/">“Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”</a></p>
 <p>Let’s apply some of the data wrangling verbs we learned in Chapter <a href="3-wrangling.html#wrangling">3</a> on the <code>drinks</code> data frame:</p>
 <ol style="list-style-type: decimal">
-<li><code>filter()</code> the <code>drinks</code> data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia <em>then</em></li>
+<li><code>filter()</code> the <code>drinks</code> data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, <em>then</em></li>
 <li><code>select()</code> all columns except <code>total_litres_of_pure_alcohol</code> by using the <code>-</code> sign, <em>then</em></li>
-<li><code>rename()</code> the variables <code>beer_servings</code>, <code>spirit_servings</code>, and <code>wine_servings</code> to <code>beer</code>, <code>spirit</code>, and <code>wine</code> respectively.</li>
+<li><code>rename()</code> the variables <code>beer_servings</code>, <code>spirit_servings</code>, and <code>wine_servings</code> to <code>beer</code>, <code>spirit</code>, and <code>wine</code>, respectively.</li>
 </ol>
 <p>and save the resulting data frame in <code>drinks_smaller</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks_smaller &lt;-<span class="st"> </span>drinks <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(country <span class="op">%in%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;USA&quot;</span>, <span class="st">&quot;China&quot;</span>, <span class="st">&quot;Italy&quot;</span>, <span class="st">&quot;Saudi Arabia&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(<span class="op">-</span>total_litres_of_pure_alcohol) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">beer =</span> beer_servings, <span class="dt">spirit =</span> spirit_servings, <span class="dt">wine =</span> wine_servings)
-drinks_smaller</code></pre>
+<div class="sourceCode" id="cb119"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb119-1" data-line-number="1">drinks_smaller &lt;-<span class="st"> </span>drinks <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb119-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(country <span class="op">%in%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;USA&quot;</span>, <span class="st">&quot;China&quot;</span>, <span class="st">&quot;Italy&quot;</span>, <span class="st">&quot;Saudi Arabia&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb119-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(<span class="op">-</span>total_litres_of_pure_alcohol) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb119-4" data-line-number="4"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">beer =</span> beer_servings, <span class="dt">spirit =</span> spirit_servings, <span class="dt">wine =</span> wine_servings)</a>
+<a class="sourceLine" id="cb119-5" data-line-number="5">drinks_smaller</a></code></pre></div>
 <pre><code># A tibble: 4 x 4
   country       beer spirit  wine
   &lt;chr&gt;        &lt;int&gt;  &lt;int&gt; &lt;int&gt;
@@ -676,21 +682,21 @@ <h2><span class="header-section-number">4.2</span> Tidy data</h2>
 2 Italy           85     42   237
 3 Saudi Arabia     0      5     0
 4 USA            249    158    84</code></pre>
-<p>Let’s now ask ourselves a question: “Using the <code>drinks_smaller</code> data frame, how would we create the side-by-side (i.e. dodged) barplot in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a>?” Recall we saw barplots displaying two categorical variables in Section <a href="2-viz.html#two-categ-barplot">2.8.3</a>.</p>
+<p>Let’s now ask ourselves a question: “Using the <code>drinks_smaller</code> data frame, how would we create the side-by-side barplot in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a>?” Recall we saw barplots displaying two categorical variables in Subsection <a href="2-viz.html#two-categ-barplot">2.8.3</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:drinks-smaller"></span>
-<img src="moderndive_files/figure-html/drinks-smaller-1.png" alt="Comparing alcohol consumption in 4 countries." width="\textwidth" />
+<img src="ModernDive_files/figure-html/drinks-smaller-1.png" alt="Comparing alcohol consumption in 4 countries." width="\textwidth" />
 <p class="caption">
 FIGURE 4.2: Comparing alcohol consumption in 4 countries.
 </p>
 </div>
-<p>Let’s break down the Grammar of Graphics we introduced in Section <a href="2-viz.html#grammarofgraphics">2.1</a>:</p>
+<p>Let’s break down the grammar of graphics we introduced in Section <a href="2-viz.html#grammarofgraphics">2.1</a>:</p>
 <ol style="list-style-type: decimal">
 <li>The categorical variable <code>country</code> with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the <code>x</code>-position of the bars.</li>
 <li>The numerical variable <code>servings</code> would have to be mapped to the <code>y</code>-position of the bars (the height of the bars).</li>
 <li>The categorical variable <code>type</code> with three levels (beer, spirit, wine) would have to be mapped to the <code>fill</code> color of the bars.</li>
 </ol>
-<p>Observe however that <code>drinks_smaller</code> has three separate variables <code>beer</code>, <code>spirit</code>, and <code>wine</code>. In order to use the <code>ggplot()</code> function to recreate the barplot in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a> however, we need a <em>single variable</em> <code>type</code> with three possible values: <code>beer</code>, <code>spirit</code>, and <code>wine</code>. We could then map this <code>type</code> variable to the <code>fill</code> aesthetic of our plot. In other words, to recreate the barplot in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a>, our data frame would have to look like this:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks_smaller_tidy</code></pre>
+<p>Observe, however, that <code>drinks_smaller</code> has three separate variables <code>beer</code>, <code>spirit</code>, and <code>wine</code>. To use the <code>ggplot()</code> function to recreate the barplot in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a>, we instead need a <em>single variable</em> <code>type</code> with three possible values: <code>beer</code>, <code>spirit</code>, and <code>wine</code>. We could then map this <code>type</code> variable to the <code>fill</code> aesthetic of our plot. In other words, to recreate the barplot in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a>, our data frame would have to look like this:</p>
+<div class="sourceCode" id="cb121"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb121-1" data-line-number="1">drinks_smaller_tidy</a></code></pre></div>
 <pre><code># A tibble: 12 x 3
    country      type   servings
    &lt;chr&gt;        &lt;chr&gt;     &lt;int&gt;
@@ -706,28 +712,18 @@ <h2><span class="header-section-number">4.2</span> Tidy data</h2>
 10 Italy        wine        237
 11 Saudi Arabia wine          0
 12 USA          wine         84</code></pre>
-<p>Let’s compare <code>drinks_smaller_tidy</code> to the <code>drinks_smaller</code> data frame from earlier:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks_smaller</code></pre>
-<pre><code># A tibble: 4 x 4
-  country       beer spirit  wine
-  &lt;chr&gt;        &lt;int&gt;  &lt;int&gt; &lt;int&gt;
-1 China           79    192     8
-2 Italy           85     42   237
-3 Saudi Arabia     0      5     0
-4 USA            249    158    84</code></pre>
-<p>Observe that while <code>drinks_smaller</code> and <code>drinks_smaller_tidy</code> are both rectangular in shape and contain the same 12 numerical values (3 alcohol types <span class="math inline">\(\times\)</span> 4 countries), they are formatted differently. <code>drinks_smaller</code> is formatted in what’s known as  <a href="https://en.wikipedia.org/wiki/Wide_and_narrow_data">“wide”</a> format, whereas <code>drinks_smaller_tidy</code> is formatted in what’s known as <a href="https://en.wikipedia.org/wiki/Wide_and_narrow_data#Narrow">“long/narrow”</a> format.</p>
-<p>In the context of doing data science in R, long/narrow format  is also known as “tidy” format. In order to use the <code>ggplot2</code> and <code>dplyr</code> packages for data visualization and data wrangling, your input data frames <em>must</em> be in “tidy” format. Thus, all non-“tidy” data must be converted to “tidy” format first.</p>
-<p>Before we show you how to convert non-“tidy” data frames like <code>drinks_smaller</code> to “tidy” data frames like <code>drinks_smaller_tidy</code>, let’s go over the explicit definition of “tidy” data.</p>
+<p>Observe that while <code>drinks_smaller</code> and <code>drinks_smaller_tidy</code> are both rectangular in shape and contain the same 12 numerical values (3 alcohol types by 4 countries), they are formatted differently. <code>drinks_smaller</code> is formatted in what’s known as  <a href="https://en.wikipedia.org/wiki/Wide_and_narrow_data">“wide”</a> format, whereas <code>drinks_smaller_tidy</code> is formatted in what’s known as <a href="https://en.wikipedia.org/wiki/Wide_and_narrow_data#Narrow">“long/narrow”</a> format.</p>
+<p>In the context of doing data science in R, long/narrow format  is also known as “tidy” format. In order to use the <code>ggplot2</code> and <code>dplyr</code> packages for data visualization and data wrangling, your input data frames <em>must</em> be in “tidy” format. Thus, all non-“tidy” data must be converted to “tidy” format first. Before we convert non-“tidy” data frames like <code>drinks_smaller</code> to “tidy” data frames like <code>drinks_smaller_tidy</code>, let’s define “tidy” data.</p>
 <div id="tidy-definition" class="section level3">
 <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” data</h3>
 <p>You have surely heard the word “tidy” in your life:</p>
 <ul>
 <li>“Tidy up your room!”</li>
-<li>“Please write your homework in a tidy way so that it is easier to grade and to provide feedback.”</li>
-<li>Marie Kondo’s best-selling book <a href="https://www.amazon.com/Life-Changing-Magic-Tidying-Decluttering-Organizing/dp/B00RC3ZGN4/"><em>The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing</em></a> and Netflix TV series <a href="https://www.netflix.com/title/80209379"><em>Tidying Up with Marie Kondo</em></a>.</li>
+<li>“Write your homework in a tidy way so it is easier to provide feedback.”</li>
+<li>Marie Kondo’s best-selling book, <a href="https://www.powells.com/book/-9781607747307"><em>The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing</em></a>, and Netflix TV series <a href="https://www.netflix.com/title/80209379"><em>Tidying Up with Marie Kondo</em></a>.</li>
 <li>“I am not by any stretch of the imagination a tidy person, and the piles of unread books on the coffee table and by my bed have a plaintive, pleading quality to me - ‘Read me, please!’” - Linda Grant</li>
 </ul>
-<p>What does it mean for your data to be “tidy”? While “tidy” has a clear English meaning of “organized”, “tidy” in the context of data science using R means that your data follows a standardized format. We will follow Hadley Wickham’s  definition of <em>tidy data</em>  <span class="citation">(Wickham <a href="#ref-tidy">2014</a>)</span>.</p>
+<p>What does it mean for your data to be “tidy”? While “tidy” has a clear English meaning of “organized,” the word “tidy” in data science using R means that your data follows a standardized format. We will follow Hadley Wickham’s  definition of <em>“tidy” data</em>  <span class="citation">(Wickham <a href="#ref-tidy">2014</a>)</span> shown also in Figure <a href="4-tidy.html#fig:tidyfig">4.3</a>:</p>
 <blockquote>
 <p>A <em>dataset</em> is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes.</p>
 <p>“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In <em>tidy data</em>:</p>
@@ -737,10 +733,11 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 <li>Each type of observational unit forms a table.</li>
 </ol>
 </blockquote>
+
 <div class="figure" style="text-align: center"><span id="fig:tidyfig"></span>
-<img src="images/r4ds/tidy-1.png" alt="Tidy data graphic from R for Data Science." width="\textwidth" />
+<img src="images/r4ds/tidy-1.png" alt="Tidy data graphic from R for Data Science." width="80%" height="80%" />
 <p class="caption">
-FIGURE 4.3: Tidy data graphic from R for Data Science.
+FIGURE 4.3: Tidy data graphic from <em>R for Data Science</em>.
 </p>
 </div>
 <p>For example, say you have the following table of stock prices in Table <a href="4-tidy.html#tab:non-tidy-stocks">4.1</a>:</p>
@@ -795,7 +792,7 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 </tr>
 </tbody>
 </table>
-<p>Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format each variable should be its own column, as shown in Table <a href="4-tidy.html#tab:tidy-stocks">4.2</a>. Notice that both tables present the same information, but in different formats.</p>
+<p>Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format, each variable should be its own column, as shown in Table <a href="4-tidy.html#tab:tidy-stocks">4.2</a>. Notice that both tables present the same information, but in different formats.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:tidy-stocks">TABLE 4.2: </span>Stock prices (tidy format)
@@ -806,10 +803,10 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 Date
 </th>
 <th style="text-align:left;">
-Stock name
+Stock Name
 </th>
 <th style="text-align:left;">
-Stock price
+Stock Price
 </th>
 </tr>
 </thead>
@@ -827,13 +824,13 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 </tr>
 <tr>
 <td style="text-align:left;">
-2009-01-02
+2009-01-01
 </td>
 <td style="text-align:left;">
-Boeing
+Amazon
 </td>
 <td style="text-align:left;">
-$172.61
+$174.90
 </td>
 </tr>
 <tr>
@@ -841,10 +838,10 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 2009-01-01
 </td>
 <td style="text-align:left;">
-Amazon
+Google
 </td>
 <td style="text-align:left;">
-$174.90
+$174.34
 </td>
 </tr>
 <tr>
@@ -852,21 +849,21 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 2009-01-02
 </td>
 <td style="text-align:left;">
-Amazon
+Boeing
 </td>
 <td style="text-align:left;">
-$171.42
+$172.61
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-2009-01-01
+2009-01-02
 </td>
 <td style="text-align:left;">
-Google
+Amazon
 </td>
 <td style="text-align:left;">
-$174.34
+$171.42
 </td>
 </tr>
 <tr>
@@ -882,10 +879,10 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 </tr>
 </tbody>
 </table>
-<p>Now we have the requisite three columns <code>Date</code>, <code>Stock Name</code>, and <code>Stock Price</code>. On the other hand, consider the data in Table <a href="4-tidy.html#tab:tidy-stocks-2">4.3</a>.</p>
+<p>Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table <a href="4-tidy.html#tab:tidy-stocks-2">4.3</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:tidy-stocks-2">TABLE 4.3: </span>Example of tidy data.
+<span id="tab:tidy-stocks-2">TABLE 4.3: </span>Example of tidy data
 </caption>
 <thead>
 <tr>
@@ -925,7 +922,7 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 </tr>
 </tbody>
 </table>
-<p>In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table <a href="4-tidy.html#tab:non-tidy-stocks">4.1</a>, the data <em>is</em> “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing stock price, and the weather that particular day.</p>
+<p>In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table <a href="4-tidy.html#tab:non-tidy-stocks">4.1</a>, the data <em>is</em> “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing price, and the Weather that particular day.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -939,9 +936,9 @@ <h3><span class="header-section-number">4.2.1</span> Definition of “tidy” da
 </div>
 <div id="converting-to-tidy-data" class="section level3">
 <h3><span class="header-section-number">4.2.2</span> Converting to “tidy” data</h3>
-<p>In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already in “tidy” format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide i.e. non-“tidy” format and you would like to use the <code>ggplot2</code> or <code>dplyr</code> packages, you will first have to convert it “tidy” format using the  <code>gather()</code> function in the <code>tidyr</code>  package <span class="citation">(Wickham and Henry <a href="#ref-R-tidyr">2019</a>)</span>.</p>
+<p>In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly see data frames that are already in “tidy” format as well. However, this is not the case for all datasets in the world. If your original data frame is in wide (non-“tidy”) format and you would like to use the <code>ggplot2</code> or <code>dplyr</code> packages, you will first have to convert it to “tidy” format. To do so, we recommend using the <code>pivot_longer()</code> function in the <code>tidyr</code> package <span class="citation">(Wickham and Henry <a href="#ref-R-tidyr">2019</a>)</span>.</p>
 <p>Going back to our <code>drinks_smaller</code> data frame from earlier:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks_smaller</code></pre>
+<div class="sourceCode" id="cb123"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb123-1" data-line-number="1">drinks_smaller</a></code></pre></div>
 <pre><code># A tibble: 4 x 4
   country       beer spirit  wine
   &lt;chr&gt;        &lt;int&gt;  &lt;int&gt; &lt;int&gt;
@@ -949,74 +946,82 @@ <h3><span class="header-section-number">4.2.2</span> Converting to “tidy” da
 2 Italy           85     42   237
 3 Saudi Arabia     0      5     0
 4 USA            249    158    84</code></pre>
-<p>We convert it to “tidy” format by using the <code>gather()</code> function from the <code>tidyr</code> package as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks_smaller_tidy &lt;-<span class="st"> </span>drinks_smaller <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> type, <span class="dt">value =</span> servings, <span class="op">-</span>country)
-drinks_smaller_tidy</code></pre>
+<p>We convert it to “tidy” format by using the <code>pivot_longer()</code> function from the <code>tidyr</code> package as follows:</p>
+<div class="sourceCode" id="cb125"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb125-1" data-line-number="1">drinks_smaller_tidy &lt;-<span class="st"> </span>drinks_smaller <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb125-2" data-line-number="2"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;type&quot;</span>, </a>
+<a class="sourceLine" id="cb125-3" data-line-number="3">               <span class="dt">values_to =</span> <span class="st">&quot;servings&quot;</span>, </a>
+<a class="sourceLine" id="cb125-4" data-line-number="4">               <span class="dt">cols =</span> <span class="op">-</span>country)</a>
+<a class="sourceLine" id="cb125-5" data-line-number="5">drinks_smaller_tidy</a></code></pre></div>
 <pre><code># A tibble: 12 x 3
    country      type   servings
    &lt;chr&gt;        &lt;chr&gt;     &lt;int&gt;
  1 China        beer         79
- 2 Italy        beer         85
- 3 Saudi Arabia beer          0
- 4 USA          beer        249
- 5 China        spirit      192
- 6 Italy        spirit       42
- 7 Saudi Arabia spirit        5
- 8 USA          spirit      158
- 9 China        wine          8
-10 Italy        wine        237
-11 Saudi Arabia wine          0
+ 2 China        spirit      192
+ 3 China        wine          8
+ 4 Italy        beer         85
+ 5 Italy        spirit       42
+ 6 Italy        wine        237
+ 7 Saudi Arabia beer          0
+ 8 Saudi Arabia spirit        5
+ 9 Saudi Arabia wine          0
+10 USA          beer        249
+11 USA          spirit      158
 12 USA          wine         84</code></pre>
-<p>We set the arguments to <code>gather()</code> as follows:</p>
+<p>We set the arguments to <code>pivot_longer()</code> as follows:</p>
 <ol style="list-style-type: decimal">
-<li><code>key</code> is the name of the variable in the new “tidy” data frame that will contain the <em>column names</em> of the original data. Observe how we set <code>key = type</code>. In the resulting <code>drinks_smaller_tidy</code>, the column <code>type</code> contains the three types of alcohol <code>beer</code>, <code>spirit</code>, and <code>wine</code>.</li>
-<li><code>value</code> is the name of the variable in the new “tidy” data frame that will contain the <em>rows and columns of values</em> of the original data. Observe how we set <code>value = servings</code>. In the resulting <code>drinks_smaller_tidy</code>, the column <code>value</code> contains the 4 <span class="math inline">\(\times\)</span> 3 = 12 numerical values.</li>
-<li>The third argument is the columns you either want to or don’t want to tidy. Observe how we set this to <code>-country</code> indicating that we don’t want to tidy the <code>country</code> variable in <code>drinks_smaller</code> and rather only <code>beer</code>, <code>spirit</code>, and <code>wine</code>.</li>
+<li><code>names_to</code> here corresponds to the name of the variable in the new “tidy”/long data frame that will contain the <em>column names</em> of the original data. Observe how we set <code>names_to = &quot;type&quot;</code>. In the resulting <code>drinks_smaller_tidy</code>, the column <code>type</code> contains the three types of alcohol <code>beer</code>, <code>spirit</code>, and <code>wine</code>. Since <code>type</code> is a variable name that doesn’t appear in <code>drinks_smaller</code>, we use quotation marks around it. You’ll receive an error if you just use <code>names_to = type</code> here.</li>
+<li><code>values_to</code> here is the name of the variable in the new “tidy” data frame that will contain the <em>values</em> of the original data. Observe how we set <code>values_to = &quot;servings&quot;</code> since each of the numeric values in each of the <code>beer</code>, <code>wine</code>, and <code>spirit</code> columns of the <code>drinks_smaller</code> data corresponds to a value of <code>servings</code>. In the resulting <code>drinks_smaller_tidy</code>, the column <code>servings</code> contains the 4 <span class="math inline">\(\times\)</span> 3 = 12 numerical values. Note again that <code>servings</code> doesn’t appear as a variable in <code>drinks_smaller</code> so it again needs quotation marks around it for the <code>values_to</code> argument.</li>
+<li>The third argument <code>cols</code> is the columns in the <code>drinks_smaller</code> data frame you either want to or don’t want to “tidy.” Observe how we set this to <code>-country</code> indicating that we don’t want to “tidy” the <code>country</code> variable in <code>drinks_smaller</code> and rather only <code>beer</code>, <code>spirit</code>, and <code>wine</code>. Since <code>country</code> is a column that appears in <code>drinks_smaller</code> we don’t put quotation marks around it.</li>
 </ol>
-<p>The third argument is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:</p>
-<pre class="sourceCode r"><code class="sourceCode r">drinks_smaller_tidy &lt;-<span class="st"> </span>drinks_smaller <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> type, <span class="dt">value =</span> servings, <span class="kw">c</span>(beer, spirit, wine))
-drinks_smaller_tidy</code></pre>
-<p>Note that the third argument now specifies which columns we want to tidy <code>c(beer, spirit, wine)</code>, instead of the columns we don’t want to tidy using <code>-country</code>. We use the <code>c()</code> function to create a vector of the columns in <code>drinks_smaller</code> that we’d like to tidy.</p>
-<p>With our <code>drinks_smaller_tidy</code> “tidy” formatted data frame, we can now produce the barplot you saw in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a> using <code>geom_col()</code>. Recall from Section <a href="2-viz.html#geombar">2.8</a> on barplots that we use <code>geom_col()</code> and not <code>geom_bar()</code>, since we would like to map the “pre-counted” <code>servings</code> variable to the <code>y</code>-aesthetic of the bars.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(drinks_smaller_tidy, 
-       <span class="kw">aes</span>(<span class="dt">x =</span> country, <span class="dt">y =</span> servings, <span class="dt">fill =</span> type)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_col</span>(<span class="dt">position =</span> <span class="st">&quot;dodge&quot;</span>)</code></pre>
+<p>The <code>cols</code> argument is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:</p>
+<div class="sourceCode" id="cb127"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb127-1" data-line-number="1">drinks_smaller <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb127-2" data-line-number="2"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;type&quot;</span>, </a>
+<a class="sourceLine" id="cb127-3" data-line-number="3">               <span class="dt">values_to =</span> <span class="st">&quot;servings&quot;</span>, </a>
+<a class="sourceLine" id="cb127-4" data-line-number="4">               <span class="dt">cols =</span> <span class="kw">c</span>(beer, spirit, wine))</a></code></pre></div>
+<p>Note that the third argument now specifies which columns we want to “tidy” with <code>c(beer, spirit, wine)</code>, instead of the columns we don’t want to “tidy” using <code>-country</code>. We use the <code>c()</code> function to create a vector of the columns in <code>drinks_smaller</code> that we’d like to “tidy.” Note that since these three columns appear one after another in the <code>drinks_smaller</code> data frame, we could also do the following for the <code>cols</code> argument:</p>
+<div class="sourceCode" id="cb128"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb128-1" data-line-number="1">drinks_smaller <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb128-2" data-line-number="2"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;type&quot;</span>, </a>
+<a class="sourceLine" id="cb128-3" data-line-number="3">               <span class="dt">values_to =</span> <span class="st">&quot;servings&quot;</span>, </a>
+<a class="sourceLine" id="cb128-4" data-line-number="4">               <span class="dt">cols =</span> beer<span class="op">:</span>wine)</a></code></pre></div>
+<p>With our <code>drinks_smaller_tidy</code> “tidy” formatted data frame, we can now produce the barplot you saw in Figure <a href="4-tidy.html#fig:drinks-smaller">4.2</a> using <code>geom_col()</code>. This is done in Figure <a href="4-tidy.html#fig:drinks-smaller-tidy-barplot">4.4</a>. Recall from Section <a href="2-viz.html#geombar">2.8</a> on barplots that we use <code>geom_col()</code> and not <code>geom_bar()</code>, since we would like to map the “pre-counted” <code>servings</code> variable to the <code>y</code>-aesthetic of the bars.</p>
+<div class="sourceCode" id="cb129"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb129-1" data-line-number="1"><span class="kw">ggplot</span>(drinks_smaller_tidy, <span class="kw">aes</span>(<span class="dt">x =</span> country, <span class="dt">y =</span> servings, <span class="dt">fill =</span> type)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb129-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_col</span>(<span class="dt">position =</span> <span class="st">&quot;dodge&quot;</span>)</a></code></pre></div>
+
 <div class="figure" style="text-align: center"><span id="fig:drinks-smaller-tidy-barplot"></span>
-<img src="moderndive_files/figure-html/drinks-smaller-tidy-barplot-1.png" alt="Comparing alcohol consumption in 4 countries." width="\textwidth" />
+<img src="ModernDive_files/figure-html/drinks-smaller-tidy-barplot-1.png" alt="Comparing alcohol consumption in 4 countries using geom_col()." width="\textwidth" />
 <p class="caption">
-FIGURE 4.4: Comparing alcohol consumption in 4 countries.
+FIGURE 4.4: Comparing alcohol consumption in 4 countries using geom_col().
 </p>
 </div>
-<p>Converting “wide” format data to “tidy” format often confuses new R users. The only way to learn to get comfortable with the <code>gather()</code> function is with practice, practice, and more practice. For example, run <code>?gather</code> and look at the examples in the bottom of the help file. We’ll show another example of using <code>gather()</code> to convert a “wide” formatted data frame to “tidy” format in Section <a href="4-tidy.html#case-study-tidy">4.3</a>. For other examples of converting a dataset into “tidy” format, check out the different functions available for data tidying and a case study using data from the World Health Organization in <a href="http://r4ds.had.co.nz/tidy-data.html">R for Data Science</a> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span>.</p>
+<p>Converting “wide” format data to “tidy” format often confuses new R users. The only way to get comfortable with the <code>pivot_longer()</code> function is with practice, practice, and more practice using different datasets. For example, run <code>?pivot_longer</code> and look at the examples at the bottom of the help file. We’ll show another example of using <code>pivot_longer()</code> to convert a “wide” formatted data frame to “tidy” format in Section <a href="4-tidy.html#case-study-tidy">4.3</a>.</p>
+<p>If, however, you want to convert a “tidy” data frame to “wide” format, you will need to use the <code>pivot_wider()</code> function instead. Run <code>?pivot_wider</code> and look at the examples at the bottom of the help file.</p>
+<p>You can also view examples of both <code>pivot_longer()</code> and <code>pivot_wider()</code> on the <a href="https://tidyr.tidyverse.org/dev/articles/pivot.html#pew">tidyverse.org</a> webpage. That webpage walks through the different functions available for data tidying and includes a case study using data from the World Health Organization. Furthermore, each week the R4DS Online Learning Community posts a dataset in the weekly <a href="https://github.com/rfordatascience/tidytuesday"><code>#</code>TidyTuesday event</a> that might serve as a nice place for you to find other data to explore and transform.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC4.3)</strong> Take a look the <code>airline_safety</code> data frame included in the <code>fivethirtyeight</code> data package. Run the following:</p>
-<pre class="sourceCode r"><code class="sourceCode r">airline_safety</code></pre>
-<p>After reading the help file by running <code>?airline_safety</code>, we see that <code>airline_safety</code> is a data frame containing information on different airlines companies’ safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver’s article <a href="https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/">“Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”</a>. Let’s ignore the <code>incl_reg_subsidiaries</code> and <code>avail_seat_km_per_week</code> variables for simplicity:</p>
-<pre class="sourceCode r"><code class="sourceCode r">airline_safety_smaller &lt;-<span class="st"> </span>airline_safety <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(<span class="op">-</span><span class="kw">c</span>(incl_reg_subsidiaries, avail_seat_km_per_week))
-airline_safety_smaller</code></pre>
-<pre><code># A tibble: 56 x 7
-   airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14
-   &lt;chr&gt;             &lt;int&gt;            &lt;int&gt;            &lt;int&gt;           &lt;int&gt;
- 1 Aer Li…               2                0                0               0
- 2 Aerofl…              76               14              128               6
- 3 Aeroli…               6                0                0               1
- 4 Aerome…               3                1               64               5
- 5 Air Ca…               2                0                0               2
- 6 Air Fr…              14                4               79               6
- 7 Air In…               2                1              329               4
- 8 Air Ne…               3                0                0               5
- 9 Alaska…               5                0                0               5
-10 Alital…               7                2               50               4
-# … with 46 more rows, and 2 more variables: fatal_accidents_00_14 &lt;int&gt;,
-#   fatalities_00_14 &lt;int&gt;</code></pre>
-<p>This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable <code>incident_type_years</code> indicating the incident type/year and a variable <code>count</code> of the counts?</p>
+<p><strong>(LC4.3)</strong> Take a look at the <code>airline_safety</code> data frame included in the <code>fivethirtyeight</code> data package. Run the following:</p>
+<div class="sourceCode" id="cb130"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb130-1" data-line-number="1">airline_safety</a></code></pre></div>
+<p>After reading the help file by running <code>?airline_safety</code>, we see that <code>airline_safety</code> is a data frame containing information on different airline companies’ safety records. This data was originally reported on the data journalism website, FiveThirtyEight.com, in Nate Silver’s article, <a href="https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/">“Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”</a>. For simplicity, let’s only consider the variable <code>airline</code> and those relating to fatalities:</p>
+<div class="sourceCode" id="cb131"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb131-1" data-line-number="1">airline_safety_smaller &lt;-<span class="st"> </span>airline_safety <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb131-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(airline, <span class="kw">starts_with</span>(<span class="st">&quot;fatalities&quot;</span>))</a>
+<a class="sourceLine" id="cb131-3" data-line-number="3">airline_safety_smaller</a></code></pre></div>
+<pre><code># A tibble: 56 x 3
+   airline               fatalities_85_99 fatalities_00_14
+   &lt;chr&gt;                            &lt;int&gt;            &lt;int&gt;
+ 1 Aer Lingus                           0                0
+ 2 Aeroflot                           128               88
+ 3 Aerolineas Argentinas                0                0
+ 4 Aeromexico                          64                0
+ 5 Air Canada                           0                0
+ 6 Air France                          79              337
+ 7 Air India                          329              158
+ 8 Air New Zealand                      0                7
+ 9 Alaska Airlines                      0               88
+10 Alitalia                            50                0
+# … with 46 more rows</code></pre>
+<p>This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable <code>fatalities_years</code> indicating the incident year and a variable <code>count</code> of the fatality counts?</p>
 <div class="learncheck">
 
 </div>
@@ -1024,102 +1029,104 @@ <h3><span class="header-section-number">4.2.2</span> Converting to “tidy” da
 <div id="nycflights13-package-1" class="section level3">
 <h3><span class="header-section-number">4.2.3</span> <code>nycflights13</code> package</h3>
 <p>Recall the <code>nycflights13</code> package we introduced in Section <a href="1-getting-started.html#nycflights13">1.4</a> with data about all domestic flights departing from New York City in 2013. Let’s revisit the <code>flights</code> data frame by running <code>View(flights)</code>. We saw that <code>flights</code> has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 22 columns corresponding to different characteristics/measurements of each flight. This satisfied the first two criteria of the definition of “tidy” data from Subsection <a href="4-tidy.html#tidy-definition">4.2.1</a>: that “Each variable forms a column” and “Each observation forms a row.” But what about the third property of “tidy” data that “Each type of observational unit forms a table”?</p>
-<p>Recall that we also saw in Section <a href="1-getting-started.html#exploredataframes">1.4.3</a> that the observational unit for the <code>flights</code> data frame is an individual flight. In other words, the rows of the <code>flights</code> data frame refer to characteristics/measurements of individual flights. Also included in the <code>nycflights13</code> package are other data frames with their rows representing different observational units <span class="citation">(Wickham <a href="#ref-R-nycflights13">2018</a>)</span>:</p>
+<p>Recall that we saw in Subsection <a href="1-getting-started.html#exploredataframes">1.4.3</a> that the observational unit for the <code>flights</code> data frame is an individual flight. In other words, the rows of the <code>flights</code> data frame refer to characteristics/measurements of individual flights. Also included in the <code>nycflights13</code> package are other data frames with their rows representing different observational units <span class="citation">(Wickham <a href="#ref-R-nycflights13">2019</a><a href="#ref-R-nycflights13">a</a>)</span>:</p>
 <ul>
 <li><code>airlines</code>: translation between two-letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company.</li>
-<li><code>planes</code>: aircraft information about each of 3,322 planes used. i.e. the observational unit is an aircraft.</li>
-<li><code>weather</code>: hourly meteorological data (about 8705 observations) for each of the three NYC airports. i.e. the observational unit is an hourly measurement of weather at one of the three airports.</li>
-<li><code>airports</code>: airport names and locations. i.e. the observational unit is an airport.</li>
+<li><code>planes</code>: aircraft information about each of 3,322 planes used, i.e., the observational unit is an aircraft.</li>
+<li><code>weather</code>: hourly meteorological data (about 8,705 observations) for each of the three NYC airports, i.e., the observational unit is an hourly measurement of weather at one of the three airports.</li>
+<li><code>airports</code>: airport names and locations. The observational unit is an airport.</li>
 </ul>
-<p>The organization of the information into these five data frames follow the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table i.e. data frame. You could think of this property as the old English expression: “birds of a feather flock together.”</p>
+<p>The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e., data frame. You could think of this property in terms of the old English expression: “birds of a feather flock together.”</p>
 </div>
 </div>
 <div id="case-study-tidy" class="section level2">
 <h2><span class="header-section-number">4.3</span> Case study: Democracy in Guatemala</h2>
-<p>In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (in other words is in “wide” format) to a data frame that is in “tidy” format (in other words is in “long/narrow” format). We’ll do this using the <code>gather()</code> function from the <code>tidyr</code> package again.</p>
+<p>In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (“wide” format) to a data frame that is in “tidy” format (“long/narrow” format). We’ll do this using the <code>pivot_longer()</code> function from the <code>tidyr</code> package again.</p>
 <p>Furthermore, we’ll make use of functions from the <code>ggplot2</code> and <code>dplyr</code> packages to produce a <em>time-series plot</em> showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala. Recall that we saw time-series plots in Section <a href="2-viz.html#linegraphs">2.4</a> on creating linegraphs using <code>geom_line()</code>.</p>
 <p>Let’s use the <code>dem_score</code> data frame we imported in Section <a href="4-tidy.html#csv">4.1</a>, but focus on only data corresponding to Guatemala.</p>
-<pre class="sourceCode r"><code class="sourceCode r">guat_dem &lt;-<span class="st"> </span>dem_score <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(country <span class="op">==</span><span class="st"> &quot;Guatemala&quot;</span>)
-guat_dem</code></pre>
+<div class="sourceCode" id="cb133"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb133-1" data-line-number="1">guat_dem &lt;-<span class="st"> </span>dem_score <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb133-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(country <span class="op">==</span><span class="st"> &quot;Guatemala&quot;</span>)</a>
+<a class="sourceLine" id="cb133-3" data-line-number="3">guat_dem</a></code></pre></div>
 <pre><code># A tibble: 1 x 10
   country   `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
   &lt;chr&gt;      &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;
 1 Guatemala      2     -6     -5      3      1     -3     -7      3      3</code></pre>
-<p>Let’s lay out the Grammar of Graphics we saw in Section <a href="2-viz.html#grammarofgraphics">2.1</a>.</p>
-<p>First we know we need to set <code>data = guat_dem</code> and use a <code>geom_line()</code> layer, but what is the aesthetic mapping of variables. We’d like to see how the democracy score has changed over the years, so we need to map:</p>
+<p>Let’s lay out the grammar of graphics we saw in Section <a href="2-viz.html#grammarofgraphics">2.1</a>.</p>
+<p>First we know we need to set <code>data = guat_dem</code> and use a <code>geom_line()</code> layer, but what is the aesthetic mapping of variables? We’d like to see how the democracy score has changed over the years, so we need to map:</p>
 <ul>
 <li><code>year</code> to the x-position aesthetic and</li>
 <li><code>democracy_score</code> to the y-position aesthetic</li>
 </ul>
-<p>Now we are stuck in a predicament, much like with our <code>drinks_smaller</code> example in Section <a href="4-tidy.html#tidy-data-ex">4.2</a>. We see that we have a variable named <code>country</code>, but its only value is <code>&quot;Guatemala&quot;</code>. We have other variables denoted by different year values. Unfortunately, the <code>guat_dem</code> data frame is not “tidy” and hence is not in the appropriate format to apply the Grammar of Graphics and thus we cannot use the <code>ggplot2</code> package just yet.</p>
-<p>We need to take the values of the columns corresponding to years in <code>guat_dem</code> and convert them into a new “key” variable called <code>year</code>. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “value” variable called <code>democracy_score</code>. Our resulting data frame will thus have three columns: <code>country</code>, <code>year</code>, and <code>democracy_score</code>. Recall that the <code>gather()</code> function in the <code>tidyr</code> package can complete this task for us:</p>
-<pre class="sourceCode r"><code class="sourceCode r">guat_dem_tidy &lt;-<span class="st"> </span>guat_dem <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> year, <span class="dt">value =</span> democracy_score, <span class="op">-</span>country) 
-guat_dem_tidy</code></pre>
+<p>Now we are stuck in a predicament, much like with our <code>drinks_smaller</code> example in Section <a href="4-tidy.html#tidy-data-ex">4.2</a>. We see that we have a variable named <code>country</code>, but its only value is <code>&quot;Guatemala&quot;</code>. We have other variables denoted by different year values. Unfortunately, the <code>guat_dem</code> data frame is not “tidy” and hence is not in the appropriate format to apply the grammar of graphics, and thus we cannot use the <code>ggplot2</code> package just yet.</p>
+<p>We need to take the values of the columns corresponding to years in <code>guat_dem</code> and convert them into a new “names” variable called <code>year</code>. Furthermore, we need to take the democracy score values in the interior of the data frame and turn them into a new “values” variable called <code>democracy_score</code>. Our resulting data frame will have three columns: <code>country</code>, <code>year</code>, and <code>democracy_score</code>. Recall that the <code>pivot_longer()</code> function in the <code>tidyr</code> package does this for us:</p>
+<div class="sourceCode" id="cb135"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb135-1" data-line-number="1">guat_dem_tidy &lt;-<span class="st"> </span>guat_dem <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb135-2" data-line-number="2"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;year&quot;</span>, </a>
+<a class="sourceLine" id="cb135-3" data-line-number="3">               <span class="dt">values_to =</span> <span class="st">&quot;democracy_score&quot;</span>, </a>
+<a class="sourceLine" id="cb135-4" data-line-number="4">               <span class="dt">cols =</span> <span class="op">-</span>country,</a>
+<a class="sourceLine" id="cb135-5" data-line-number="5">               <span class="dt">names_ptypes =</span> <span class="kw">list</span>(<span class="dt">year =</span> <span class="kw">integer</span>())) </a>
+<a class="sourceLine" id="cb135-6" data-line-number="6">guat_dem_tidy</a></code></pre></div>
 <pre><code># A tibble: 9 x 3
-  country   year  democracy_score
-  &lt;chr&gt;     &lt;chr&gt;           &lt;dbl&gt;
-1 Guatemala 1952                2
-2 Guatemala 1957               -6
-3 Guatemala 1962               -5
-4 Guatemala 1967                3
-5 Guatemala 1972                1
-6 Guatemala 1977               -3
-7 Guatemala 1982               -7
-8 Guatemala 1987                3
-9 Guatemala 1992                3</code></pre>
-<p>We set the arguments to <code>gather()</code> as follows:</p>
+  country    year democracy_score
+  &lt;chr&gt;     &lt;int&gt;           &lt;dbl&gt;
+1 Guatemala  1952               2
+2 Guatemala  1957              -6
+3 Guatemala  1962              -5
+4 Guatemala  1967               3
+5 Guatemala  1972               1
+6 Guatemala  1977              -3
+7 Guatemala  1982              -7
+8 Guatemala  1987               3
+9 Guatemala  1992               3</code></pre>
+<p>We set the arguments to <code>pivot_longer()</code> as follows:</p>
 <ol style="list-style-type: decimal">
-<li><code>key</code> is the name of the variable in the new “tidy” data frame that will contain the <em>column names</em> of the original data. Observe how we set <code>key = year</code>. In the resulting <code>guat_dem_tidy</code>, the column <code>year</code> contains the years where Guatemala’s democracy scores were measured.</li>
-<li><code>value</code> is the name of the variable in the new “tidy” data frame that will contain the <em>rows and columns of values</em> of the original data. Observe how we set <code>value = democracy_score</code>. In the resulting <code>guat_dem_tidy</code> the column <code>democracy_score</code> contains the 1 <span class="math inline">\(\times\)</span> 9 = 9 democracy scores.</li>
-<li>The third argument is the columns you either want to or don’t want to tidy. Observe how we set this to <code>-country</code> indicating that we don’t want to tidy the <code>country</code> variable in <code>guat_dem</code> and rather only variables <code>1952</code> through <code>1992</code>.</li>
+<li><code>names_to</code> is the name of the variable in the new “tidy” data frame that will contain the <em>column names</em> of the original data. Observe how we set <code>names_to = &quot;year&quot;</code>. In the resulting <code>guat_dem_tidy</code>, the column <code>year</code> contains the years where Guatemala’s democracy scores were measured.</li>
+<li><code>values_to</code> is the name of the variable in the new “tidy” data frame that will contain the <em>values</em> of the original data. Observe how we set <code>values_to = &quot;democracy_score&quot;</code>. In the resulting <code>guat_dem_tidy</code> the column <code>democracy_score</code> contains the 1 <span class="math inline">\(\times\)</span> 9 = 9 democracy scores as numeric values.</li>
+<li>The third argument, <code>cols</code>, is the columns you either want to or don’t want to “tidy.” Observe how we set this to <code>cols = -country</code>, indicating that we don’t want to “tidy” the <code>country</code> variable in <code>guat_dem</code>, but rather only the variables <code>1952</code> through <code>1992</code>.</li>
+<li>The last argument of <code>names_ptypes</code> tells R what type of variable <code>year</code> should be set to. Without specifying that it is an <code>integer</code> as we’ve done here, <code>pivot_longer()</code> will set it to be a character value by default.</li>
 </ol>
-<p>However, observe in the output for <code>guat_dem_tidy</code> that the <code>year</code> variable is of type <code>chr</code> or character. Before we can plot this variable on the x-axis, we need to convert it into a numerical variable using the <code>as.numeric()</code> function within the <code>mutate()</code> function, which we saw in Section <a href="3-wrangling.html#mutate">3.5</a> on mutating existing variables to create new ones.</p>
-<pre class="sourceCode r"><code class="sourceCode r">guat_dem_tidy &lt;-<span class="st"> </span>guat_dem_tidy <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">year =</span> <span class="kw">as.numeric</span>(year))</code></pre>
-<p>We can now create the time-series plot to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using a  <code>geom_line()</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(guat_dem_tidy, <span class="kw">aes</span>(<span class="dt">x =</span> year, <span class="dt">y =</span> democracy_score)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_line</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Year&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Democracy Score&quot;</span>)</code></pre>
+<p>We can now create the time-series plot in Figure <a href="4-tidy.html#fig:guat-dem-tidy">4.5</a> to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using <code>geom_line()</code>. Furthermore, we’ll use the <code>labs()</code> function in the <code>ggplot2</code> package to add informative labels to all the <code>aes()</code>thetic attributes of our plot, in this case the <code>x</code> and <code>y</code> positions.</p>
+<div class="sourceCode" id="cb137"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb137-1" data-line-number="1"><span class="kw">ggplot</span>(guat_dem_tidy, <span class="kw">aes</span>(<span class="dt">x =</span> year, <span class="dt">y =</span> democracy_score)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb137-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_line</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb137-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Year&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Democracy Score&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:guat-dem-tidy"></span>
-<img src="moderndive_files/figure-html/guat-dem-tidy-1.png" alt="Democracy scores in Guatemala 1952-1992." width="\textwidth" />
+<img src="ModernDive_files/figure-html/guat-dem-tidy-1.png" alt="Democracy scores in Guatemala 1952-1992." width="\textwidth" />
 <p class="caption">
 FIGURE 4.5: Democracy scores in Guatemala 1952-1992.
 </p>
 </div>
+<p>Note that if we had forgotten to include the <code>names_ptypes</code> argument specifying that <code>year</code> is an integer rather than a character, we would have run into a problem here since <code>geom_line()</code> wouldn’t have known how to sort the character values in <code>year</code> in the right order.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
 <p><strong>(LC4.4)</strong> Convert the <code>dem_score</code> data frame into
-a tidy data frame and assign the name of <code>dem_score_tidy</code> to the resulting long-formatted data frame.</p>
-<p><strong>(LC4.5)</strong> Read in the life expectancy data stored at <a href="https://moderndive.com/data/le_mess.csv" class="uri">https://moderndive.com/data/le_mess.csv</a> and convert it to a tidy data frame.</p>
+a “tidy” data frame and assign the name of <code>dem_score_tidy</code> to the resulting long-formatted data frame.</p>
+<p><strong>(LC4.5)</strong> Read in the life expectancy data stored at <a href="https://moderndive.com/data/le_mess.csv" class="uri">https://moderndive.com/data/le_mess.csv</a> and convert it to a “tidy” data frame.</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="tidyverse-package" class="section level2">
 <h2><span class="header-section-number">4.4</span> <code>tidyverse</code> package</h2>
-<p>Notice at the beginning of the chapter we loaded the following four packages, which are among the four of the most frequently used R packages for data science:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(readr)
-<span class="kw">library</span>(tidyr)</code></pre>
-<p>There is a much quicker way to load these packages than by individually loading them: by installing and loading the <code>tidyverse</code> package. The <code>tidyverse</code> package acts as an “umbrella” package whereby installing/loading it will install/load multiple packages at once for you.</p>
-<p>After installing the <code>tidyverse</code> package as you would a normal package via <code>install.packages(&quot;tidyverse&quot;)</code>, running:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)</code></pre>
+<p>Notice at the beginning of the chapter we loaded the following four packages, which are among the most frequently used R packages for data science:</p>
+<div class="sourceCode" id="cb138"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb138-1" data-line-number="1"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb138-2" data-line-number="2"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb138-3" data-line-number="3"><span class="kw">library</span>(readr)</a>
+<a class="sourceLine" id="cb138-4" data-line-number="4"><span class="kw">library</span>(tidyr)</a></code></pre></div>
+<p>Recall that <code>ggplot2</code> is for data visualization, <code>dplyr</code> is for data wrangling, <code>readr</code> is for importing spreadsheet data into R, and <code>tidyr</code> is for converting data to “tidy” format. There is a much quicker way to load these packages than by individually loading them: by installing and loading the <code>tidyverse</code> package. The <code>tidyverse</code> package acts as an “umbrella” package whereby installing/loading it will install/load multiple packages at once for you.</p>
+<p>After installing the <code>tidyverse</code> package as you would a normal package as seen in Section <a href="1-getting-started.html#packages">1.3</a>, running:</p>
+<div class="sourceCode" id="cb139"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb139-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a></code></pre></div>
 <p>would be the same as running:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(tidyr)
-<span class="kw">library</span>(readr)
-<span class="kw">library</span>(purrr)
-<span class="kw">library</span>(tibble)
-<span class="kw">library</span>(stringr)
-<span class="kw">library</span>(forcats)</code></pre>
-<p>You’ve seen the first 4 of these packages: <code>ggplot2</code> for data visualization, <code>dplyr</code> for data wrangling, <code>tidyr</code> for converting data to “tidy” format, and <code>readr</code> for importing spreadsheet data into R. The remaining packages (<code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code>) are left for a more advanced book; check out <a href="http://r4ds.had.co.nz/">R for Data Science</a> to learn about these packages.</p>
+<div class="sourceCode" id="cb140"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb140-1" data-line-number="1"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb140-2" data-line-number="2"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb140-3" data-line-number="3"><span class="kw">library</span>(readr)</a>
+<a class="sourceLine" id="cb140-4" data-line-number="4"><span class="kw">library</span>(tidyr)</a>
+<a class="sourceLine" id="cb140-5" data-line-number="5"><span class="kw">library</span>(purrr)</a>
+<a class="sourceLine" id="cb140-6" data-line-number="6"><span class="kw">library</span>(tibble)</a>
+<a class="sourceLine" id="cb140-7" data-line-number="7"><span class="kw">library</span>(stringr)</a>
+<a class="sourceLine" id="cb140-8" data-line-number="8"><span class="kw">library</span>(forcats)</a></code></pre></div>
+<p>The <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages are left for a more advanced book; check out <a href="http://r4ds.had.co.nz/"><em>R for Data Science</em></a> to learn about these packages.</p>
 <p>For the remainder of this book, we’ll start every chapter by running <code>library(tidyverse)</code>, instead of loading the various component packages individually. The <code>tidyverse</code> “umbrella” package gets its name from the fact that all the functions in all its packages are designed to have common inputs and outputs: data frames are in “tidy” format. This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible. For more information, check out the <a href="https://www.tidyverse.org/">tidyverse.org</a> webpage for the package.</p>
 </div>
 <div id="conclusion-3" class="section level2">
@@ -1127,8 +1134,7 @@ <h2><span class="header-section-number">4.5</span> Conclusion</h2>
 <div id="additional-resources-3" class="section level3">
 <h3><span class="header-section-number">4.5.1</span> Additional resources</h3>
 <p>An R script file of all R code used in this chapter is available <a href="scripts/04-tidy.R">here</a>.</p>
-<p>If you want to learn more about using the <code>readr</code>  and <code>tidyr</code>  package, we suggest you that you check out RStudio’s “Data Import Cheat Sheet.”</p>
-<p>You can access these cheatsheets by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Browse Cheatsheets” -&gt; Scroll down the page to the “Data Import Cheat Sheet”. The first page of this cheatsheet has information on using the <code>readr</code> package to import data while the second page has information on using the <code>tidyr</code> package to “tidy” data. You can see a preview of both cheatsheets in the figures below.</p>
+<p>If you want to learn more about using the <code>readr</code> and <code>tidyr</code> packages, we suggest that you check out RStudio’s “Data Import Cheat Sheet.” In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Browse Cheatsheets” -&gt; Scroll down the page to the “Data Import Cheat Sheet.” The first page of this cheatsheet has information on using the <code>readr</code> package to import data, while the second page has information on using the <code>tidyr</code> package to “tidy” data. You can see a preview of both pages in the figures below.</p>
 <div class="figure" style="text-align: center"><span id="fig:import-cheatsheet"></span>
 <img src="images/cheatsheets/data-import-1.png" alt="Data Import cheatsheet (first page): readr package." width="66%" />
 <p class="caption">
@@ -1144,15 +1150,17 @@ <h3><span class="header-section-number">4.5.1</span> Additional resources</h3>
 </div>
 <div id="whats-to-come-2" class="section level3">
 <h3><span class="header-section-number">4.5.2</span> What’s to come?</h3>
-<p>Congratulations! You’ve completed the “Data Science with tidyverse” portion of this book! We’ll now move to the “Data modeling with moderndive” portion of this book in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>, where you’ll leverage your data visualization and wrangling skills to model relationships between different variables in data frames.</p>
-<p>However, we’re going to leave the Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on “Inference for Regression” until after we’ve covered statistical inference in Chapters <a href="7-sampling.html#sampling">7</a>, <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>. Onwards and upwards!</p>
-<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-141"></span>
-<img src="images/flowcharts/flowchart/flowchart.005.png" alt="ModernDive flowchart - On to Part II!" width="\textwidth" />
+<p>Congratulations! You’ve completed the “Data Science with <code>tidyverse</code>” portion of this book. We’ll now move to the “Data modeling with moderndive” portion of this book in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>, where you’ll leverage your data visualization and wrangling skills to model relationships between different variables in data frames.</p>
+<p>However, we’re going to leave Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on “Inference for Regression” until after we’ve covered statistical inference in Chapters <a href="7-sampling.html#sampling">7</a>, <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>. Onwards and upwards into Data Modeling as shown in Figure <a href="4-tidy.html#fig:part2">4.8</a>!</p>
+
+<div class="figure" style="text-align: center"><span id="fig:part2"></span>
+<img src="images/flowcharts/flowchart/flowchart.005.png" alt="ModernDive flowchart - on to Part II!" width="\textwidth" />
 <p class="caption">
-FIGURE 4.8: ModernDive flowchart - On to Part II!
+FIGURE 4.8: <em>ModernDive</em> flowchart - on to Part II!
 </p>
 </div>
 
+
 </div>
 </div>
 </div>
@@ -1161,20 +1169,17 @@ <h3><span class="header-section-number">4.5.2</span> What’s to come?</h3>
 
 <h3>References</h3>
 <div id="refs" class="references">
-<div id="ref-rds2016">
-<p>Grolemund, Garrett, and Hadley Wickham. 2016. <em>R for Data Science</em>. <a href="http://r4ds.had.co.nz/">http://r4ds.had.co.nz/</a>.</p>
-</div>
 <div id="ref-R-fivethirtyeight">
-<p>Kim, Albert Y., Chester Ismay, and Jennifer Chunn. 2018. <em>Fivethirtyeight: Data and Code Behind the Stories and Interactives at ’Fivethirtyeight’</em>. <a href="https://CRAN.R-project.org/package=fivethirtyeight">https://CRAN.R-project.org/package=fivethirtyeight</a>.</p>
+<p>Kim, Albert Y., Chester Ismay, and Jennifer Chunn. 2019. <em>Fivethirtyeight: Data and Code Behind the Stories and Interactives at ’Fivethirtyeight’</em>. <a href="https://CRAN.R-project.org/package=fivethirtyeight">https://CRAN.R-project.org/package=fivethirtyeight</a>.</p>
 </div>
 <div id="ref-tidy">
 <p>Wickham, Hadley. 2014. “Tidy Data.” <em>Journal of Statistical Software</em> Volume 59 (Issue 10). <a href="https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf">https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf</a>.</p>
 </div>
 <div id="ref-R-nycflights13">
-<p>Wickham, Hadley. 2018. <em>Nycflights13: Flights That Departed Nyc in 2013</em>. <a href="https://CRAN.R-project.org/package=nycflights13">https://CRAN.R-project.org/package=nycflights13</a>.</p>
+<p>Wickham, Hadley. 2019a. <em>Nycflights13: Flights That Departed Nyc in 2013</em>. <a href="https://CRAN.R-project.org/package=nycflights13">https://CRAN.R-project.org/package=nycflights13</a>.</p>
 </div>
 <div id="ref-R-tidyr">
-<p>Wickham, Hadley, and Lionel Henry. 2019. <em>Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions</em>. <a href="https://CRAN.R-project.org/package=tidyr">https://CRAN.R-project.org/package=tidyr</a>.</p>
+<p>Wickham, Hadley, and Lionel Henry. 2019. <em>Tidyr: Tidy Messy Data</em>. <a href="https://CRAN.R-project.org/package=tidyr">https://CRAN.R-project.org/package=tidyr</a>.</p>
 </div>
 <div id="ref-R-readr">
 <p>Wickham, Hadley, Jim Hester, and Romain Francois. 2018. <em>Readr: Read Rectangular Text Data</em>. <a href="https://CRAN.R-project.org/package=readr">https://CRAN.R-project.org/package=readr</a>.</p>
@@ -1191,11 +1196,13 @@ <h3>References</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1203,12 +1210,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1223,6 +1229,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1239,8 +1249,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/5-regression.html b/docs/5-regression.html
index c09fcc885..924556f08 100644
--- a/docs/5-regression.html
+++ b/docs/5-regression.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 5 Basic Regression | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 5 Basic Regression | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 5 Basic Regression | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="4-tidy.html">
-<link rel="next" href="6-multiple-regression.html">
+<link rel="prev" href="4-tidy.html"/>
+<link rel="next" href="6-multiple-regression.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,7 +583,7 @@ <h1>
 </html>
 <div id="regression" class="section level1">
 <h1><span class="header-section-number">Chapter 5</span> Basic Regression</h1>
-<p>Now that we are equipped with data visualization skills from Chapter <a href="2-viz.html#viz">2</a>, data wrangling skills from Chapter <a href="3-wrangling.html#wrangling">3</a>, and an understanding of how to import data and the concept of “tidy” data format from Chapter <a href="4-tidy.html#tidy">4</a>, let’s now proceed with data modeling. The fundamental premise of data modeling is to make explicit the relationship between:</p>
+<p>Now that we are equipped with data visualization skills from Chapter <a href="2-viz.html#viz">2</a>, data wrangling skills from Chapter <a href="3-wrangling.html#wrangling">3</a>, and an understanding of how to import data and the concept of a “tidy” data format from Chapter <a href="4-tidy.html#tidy">4</a>, let’s now proceed with data modeling. The fundamental premise of data modeling is to make explicit the relationship between:</p>
 <ul>
 <li>an <em>outcome variable</em> <span class="math inline">\(y\)</span>, also called a <em>dependent variable</em> or response variable,  and</li>
 <li>an <em>explanatory/predictor variable</em> <span class="math inline">\(x\)</span>, also called an <em>independent variable</em> or  covariate.</li>
@@ -578,20 +591,20 @@ <h1><span class="header-section-number">Chapter 5</span> Basic Regression</h1>
 <p>Another way to state this is using mathematical terminology: we will model the outcome variable <span class="math inline">\(y\)</span> “as a function” of the explanatory/predictor variable <span class="math inline">\(x\)</span>. When we say “function” here, we aren’t referring to functions in R like the <code>ggplot()</code> function, but rather to a mathematical function. But why do we have two different labels, explanatory and predictor, for the variable <span class="math inline">\(x\)</span>? That’s because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes:</p>
 <ol style="list-style-type: decimal">
 <li><strong>Modeling for explanation</strong>: When you want to explicitly describe and quantify the relationship between the outcome variable <span class="math inline">\(y\)</span> and a set of explanatory variables <span class="math inline">\(x\)</span>, determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any <em>causal</em> relationships between the variables.</li>
-<li><strong>Modeling for prediction</strong>: When you want to predict an outcome variable <span class="math inline">\(y\)</span> based on the information contained in a set of predictor variables <span class="math inline">\(x\)</span>. Unlike modeling for explanation however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about <span class="math inline">\(y\)</span> using the information in <span class="math inline">\(x\)</span>.</li>
+<li><strong>Modeling for prediction</strong>: When you want to predict an outcome variable <span class="math inline">\(y\)</span> based on the information contained in a set of predictor variables <span class="math inline">\(x\)</span>. Unlike modeling for explanation, however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about <span class="math inline">\(y\)</span> using the information in <span class="math inline">\(x\)</span>.</li>
 </ol>
-<p>For example, say you are interested in an outcome variable <span class="math inline">\(y\)</span> of whether patients develop lung cancer and information <span class="math inline">\(x\)</span> on their risk factors, such as smoking habits, age, and socioeconomic status. If we are modeling for explanation, we would be interested in both describing and quantifying the effects of the different risk factors. One reason could be because you want to design an intervention to reduce lung cancer incidence in a population, such as targeting smokers of a specific age group with advertising for smoking cessation programs. If we are modeling for prediction however, we wouldn’t care so much about understanding how all the individual risk factors contribute to lung cancer, but rather only whether we can make good predictions of who will contract lung cancer.</p>
-<p>In this book, we’ll focus on modeling for explanation and hence refer to <span class="math inline">\(x\)</span> as <em>explanatory variables</em>. If you are interested in learning about modeling for prediction, we suggest you check out books and courses on the field of <em>machine learning</em>. Furthermore, while there exists many techniques for modeling, such as tree-based models and neural networks, in this book we’ll focus on one particular technique: <em>linear regression</em>.  Linear regression is one of the most commonly-used and easy-to-understand approaches to modeling.</p>
-<p>Linear regression involves a <em>numerical</em> outcome variable <span class="math inline">\(y\)</span> and explanatory variables <span class="math inline">\(x\)</span> that are either <em>numerical</em> or <em>categorical</em>. Furthermore, the relationship between <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span> is assumed to be linear, or in other words, a line. However, we’ll see that what constitutes a “line” will vary depending on the nature of your <span class="math inline">\(x\)</span> explanatory variables.</p>
+<p>For example, say you are interested in an outcome variable <span class="math inline">\(y\)</span> of whether patients develop lung cancer and information <span class="math inline">\(x\)</span> on their risk factors, such as smoking habits, age, and socioeconomic status. If we are modeling for explanation, we would be interested in both describing and quantifying the effects of the different risk factors. One reason could be that you want to design an intervention to reduce lung cancer incidence in a population, such as targeting smokers of a specific age group with advertising for smoking cessation programs. If we are modeling for prediction, however, we wouldn’t care so much about understanding how all the individual risk factors contribute to lung cancer, but rather only whether we can make good predictions of which people will contract lung cancer.</p>
+<p>In this book, we’ll focus on modeling for explanation and hence refer to <span class="math inline">\(x\)</span> as <em>explanatory variables</em>. If you are interested in learning about modeling for prediction, we suggest you check out books and courses on the field of <em>machine learning</em>, such as <a href="http://www-bcf.usc.edu/~gareth/ISL/"><em>An Introduction to Statistical Learning with Applications in R (ISLR)</em></a> <span class="citation">(James et al. <a href="#ref-islr2017">2017</a>)</span>. Furthermore, while there exist many techniques for modeling, such as tree-based models and neural networks, in this book we’ll focus on one particular technique: <em>linear regression</em>. Linear regression is one of the most commonly used and easy-to-understand approaches to modeling.</p>
+<p>Linear regression involves a <em>numerical</em> outcome variable <span class="math inline">\(y\)</span> and explanatory variables <span class="math inline">\(x\)</span> that are either <em>numerical</em> or <em>categorical</em>. Furthermore, the relationship between <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span> is assumed to be linear, or in other words, a line. However, we’ll see that what constitutes a “line” will vary depending on the nature of your explanatory variables <span class="math inline">\(x\)</span>.</p>
 <p>In Chapter <a href="5-regression.html#regression">5</a> on basic regression, we’ll only consider models with a single explanatory variable <span class="math inline">\(x\)</span>. In Section <a href="5-regression.html#model1">5.1</a>, the explanatory variable will be numerical. This scenario is known as <em>simple linear regression</em>. In Section <a href="5-regression.html#model2">5.2</a>, the explanatory variable will be categorical.</p>
-<p>In Chapter <a href="6-multiple-regression.html#multiple-regression">6</a> on multiple regression, we’ll extend the ideas behind basic regression and consider models with two explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. In Section <a href="6-multiple-regression.html#model3">6.2</a>, we’ll have one numerical and one categorical explanatory variable. In particular, we’ll consider two such models: <em>interaction</em> and <em>parallel slopes</em> models. In Section <a href="6-multiple-regression.html#model4">6.1</a>, we’ll have two numerical explanatory variables.</p>
-<p>In Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on inference for regression, we’ll revisit our regression models and analyze the results using the tools for <em>statistical inference</em> you’ll develop in Chapters <a href="7-sampling.html#sampling">7</a>, <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> on sampling, confidence intervals, and hypothesis test/p-values respectively.</p>
-<p>Let’s now begin with basic regression,  which are linear regression models with a single explanatory variable <span class="math inline">\(x\)</span>. We’ll also discuss important statistical concepts like the <em>correlation coefficient</em>, that “correlation isn’t necessarily causation,” and what it means for a line to be “best-fitting.”</p>
+<p>In Chapter <a href="6-multiple-regression.html#multiple-regression">6</a> on multiple regression, we’ll extend the ideas behind basic regression and consider models with two explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. In Section <a href="6-multiple-regression.html#model4">6.1</a>, we’ll have two numerical explanatory variables. In Section <a href="6-multiple-regression.html#model3">6.2</a>, we’ll have one numerical and one categorical explanatory variable. In particular, we’ll consider two such models: <em>interaction</em> and <em>parallel slopes</em> models.</p>
+<p>In Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on inference for regression, we’ll revisit our regression models and analyze the results using the tools for <em>statistical inference</em> you’ll develop in Chapters <a href="7-sampling.html#sampling">7</a>, <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> on sampling, bootstrapping and confidence intervals, and hypothesis testing and <span class="math inline">\(p\)</span>-values, respectively.</p>
+<p>Let’s now begin with basic regression,  which refers to linear regression models with a single explanatory variable <span class="math inline">\(x\)</span>. We’ll also discuss important statistical concepts like the <em>correlation coefficient</em>, that “correlation isn’t necessarily causation,” and what it means for a line to be “best-fitting.”</p>
 <div id="needed-packages-3" class="section level3 unnumbered">
 <h3>Needed packages</h3>
-<p>Let’s now load all the packages needed for this chapter (this assumes you’ve already installed them). In this chapter we introduce some new packages:</p>
+<p>Let’s now load all the packages needed for this chapter (this assumes you’ve already installed them). In this chapter, we introduce some new packages:</p>
 <ol style="list-style-type: decimal">
-<li>The <code>tidyverse</code> “umbrella” <span class="citation">(Wickham <a href="#ref-R-tidyverse">2017</a>)</span> package. Recall from our discussion in Section <a href="4-tidy.html#tidyverse-package">4.4</a> that loading the <code>tidyverse</code> package by running <code>library(tidyverse)</code> loads the following commonly used data science packages all at once:
+<li>The <code>tidyverse</code> “umbrella” <span class="citation">(Wickham <a href="#ref-R-tidyverse">2019</a><a href="#ref-R-tidyverse">b</a>)</span> package. Recall from our discussion in Section <a href="4-tidy.html#tidyverse-package">4.4</a> that loading the <code>tidyverse</code> package by running <code>library(tidyverse)</code> loads the following commonly used data science packages all at once:
 <ul>
 <li><code>ggplot2</code> for data visualization</li>
 <li><code>dplyr</code> for data wrangling</li>
@@ -600,54 +613,54 @@ <h3>Needed packages</h3>
 <li>As well as the more advanced <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages</li>
 </ul></li>
 <li>The <code>moderndive</code>  package of datasets and functions for tidyverse-friendly introductory linear regression.</li>
-<li>The <code>skimr</code> <span class="citation">(Quinn et al. <a href="#ref-R-skimr">2019</a>)</span> package, which provides a simple to use function to quickly compute a wide array of commonly-used summary statistics. </li>
+<li>The <code>skimr</code> <span class="citation">(Quinn et al. <a href="#ref-R-skimr">2019</a>)</span> package, which provides a simple-to-use function to quickly compute a wide array of commonly used summary statistics. </li>
 </ol>
 <p>If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(moderndive)
-<span class="kw">library</span>(skimr)
-<span class="kw">library</span>(gapminder)</code></pre>
+<div class="sourceCode" id="cb141"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb141-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb141-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb141-3" data-line-number="3"><span class="kw">library</span>(skimr)</a>
+<a class="sourceLine" id="cb141-4" data-line-number="4"><span class="kw">library</span>(gapminder)</a></code></pre></div>
 </div>
 <div id="model1" class="section level2">
 <h2><span class="header-section-number">5.1</span> One numerical explanatory variable</h2>
-<p>Why do some professors and instructors at universities and colleges receive high teaching evaluations from students while others don’t? Are there differences in teaching evaluations between instructors of different demographic groups? Could there be an impact due to student biases? These are all questions that are of interest to university/college administrators, as teaching evaluations are among the many criteria considered in determining which instructors and professors get promoted.</p>
-<p>Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer the following research question: what factors can explain differences in instructor teaching evaluation scores? To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at <a href="https://www.openintro.org/stat/data/?data=evals">openintro.org</a>.</p>
-<p>In this section, we’ll keep things simple for now and try to explain differences in instructor teaching scores as a function of one numerical variable: the instructor’s “beauty” score (we’ll describe how this score was determined shortly). Could it be that instructors with higher “beauty” scores also have higher teaching evaluations? Could it be instead that instructors with higher “beauty” scores tend to have lower teaching evaluations? Or could it be there is no relationship between “beauty” score and teaching evaluations? We’ll answer these questions by modeling the relationship between teaching scores and “beauty” scores using <em>simple linear regression</em>  where we have:</p>
+<p>Why do some professors and instructors at universities and colleges receive high teaching evaluation scores from students while others receive lower ones? Are there differences in teaching evaluations between instructors of different demographic groups? Could there be an impact due to student biases? These are all questions that are of interest to university/college administrators, as teaching evaluations are among the many criteria considered in determining which instructors and professors get promoted.</p>
+<p>Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer the following research question: what factors explain differences in instructor teaching evaluation scores? To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at <a href="https://www.openintro.org/stat/data/?data=evals">openintro.org</a>.</p>
+<p>In this section, we’ll keep things simple for now and try to explain differences in instructor teaching scores as a function of one numerical variable: the instructor’s “beauty” score (we’ll describe how this score was determined shortly). Could it be that instructors with higher “beauty” scores also have higher teaching evaluations? Could it be instead that instructors with higher “beauty” scores tend to have lower teaching evaluations? Or could it be that there is no relationship between “beauty” score and teaching evaluations? We’ll answer these questions by modeling the relationship between teaching scores and “beauty” scores using <em>simple linear regression</em>  where we have:</p>
 <ol style="list-style-type: decimal">
-<li>A numerical outcome variable <span class="math inline">\(y\)</span>, the instructor’s teaching score and</li>
-<li>A single numerical explanatory variable <span class="math inline">\(x\)</span>, the instructor’s “beauty” score.</li>
+<li>A numerical outcome variable <span class="math inline">\(y\)</span> (the instructor’s teaching score) and</li>
+<li>A single numerical explanatory variable <span class="math inline">\(x\)</span> (the instructor’s “beauty” score).</li>
 </ol>
 <div id="model1EDA" class="section level3">
 <h3><span class="header-section-number">5.1.1</span> Exploratory data analysis</h3>
-<p>The data on the 463 courses at UT Austin can be found in the <code>evals</code> data frame included in the <code>moderndive</code> package. However, to keep things simple, let’s <code>select()</code> only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called <code>eval_ch6</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(ID, score, bty_avg, age)</code></pre>
-<p>A crucial step before doing any kind of analysis or modeling is performing an <em>exploratory data analysis</em>,  or EDA for short. Exploratory data analysis gives you a sense of the distributions of the individual variables in your data, whether any potential relationships exist between variables, whether there are outliers and/or missing values, and most importantly, how to build your model. Here are three common steps in an exploratory data analysis.</p>
+<p>The data on the 463 courses at UT Austin can be found in the <code>evals</code> data frame included in the <code>moderndive</code> package. However, to keep things simple, let’s <code>select()</code> only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called <code>evals_ch5</code>:</p>
+<div class="sourceCode" id="cb142"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb142-1" data-line-number="1">evals_ch5 &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb142-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(ID, score, bty_avg, age)</a></code></pre></div>
+<p>A crucial step before doing any kind of analysis or modeling is performing an <em>exploratory data analysis</em>,  or EDA for short. EDA gives you a sense of the distributions of the individual variables in your data, whether any potential relationships exist between variables, whether there are outliers and/or missing values, and (most importantly) how to build your model. Here are three common steps in an EDA:</p>
 <ol style="list-style-type: decimal">
 <li>Most crucially, looking at the raw data values.</li>
 <li>Computing summary statistics, such as means, medians, and interquartile ranges.</li>
 <li>Creating data visualizations.</li>
 </ol>
 <p>Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. Because this step seems so trivial, unfortunately many data analysts ignore it. However, getting an early sense of what your raw data looks like can often prevent many larger issues down the road.</p>
-<p>You can do this by using RStudio’s spreadsheet viewer or by using the <code>glimpse()</code> function as introduced in Section <a href="1-getting-started.html#exploredataframes">1.4.3</a> on exploring data frames:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(evals_ch6)</code></pre>
+<p>You can do this by using RStudio’s spreadsheet viewer or by using the <code>glimpse()</code> function as introduced in Subsection <a href="1-getting-started.html#exploredataframes">1.4.3</a> on exploring data frames:</p>
+<div class="sourceCode" id="cb143"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb143-1" data-line-number="1"><span class="kw">glimpse</span>(evals_ch5)</a></code></pre></div>
 <pre><code>Observations: 463
 Variables: 4
 $ ID      &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
 $ score   &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
 $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3…
 $ age     &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…</code></pre>
-<p>Observe that <code>Observations: 463</code> indicates that there are 463 rows/observations in <code>evals_ch6</code>, where each row corresponds to one observed course at UT Austin. It is important to note that the <em>observational unit</em>  are individual courses and not individual instructors. Recall from Subsection <a href="1-getting-started.html#exploredataframes">1.4.3</a> that the observational unit is the “type of thing” that is being measured by our variables. Since instructors teach more than one course in an academic year, the same instructor will appear more than once in the data. Hence there are fewer than 463 unique instructors being represented in <code>evals_ch6</code>. We’ll revisit this idea in Section <a href="10-inference-for-regression.html#regression-conditions">10.3</a>, when we talk about the “independence assumption” for inference for regression.</p>
-<p>A full description of all the variables included in <code>evals</code> can be found at <a href="https://www.openintro.org/stat/data/?data=evals">openintro.org</a> and by reading the associated help file (run <code>?evals</code> in the console). However, let’s fully describe the 4 variables we selected in <code>evals_ch6</code>:</p>
+<p>Observe that <code>Observations: 463</code> indicates that there are 463 rows/observations in <code>evals_ch5</code>, where each row corresponds to one observed course at UT Austin. It is important to note that the <em>observational unit</em>  is an individual course and not an individual instructor. Recall from Subsection <a href="1-getting-started.html#exploredataframes">1.4.3</a> that the observational unit is the “type of thing” that is being measured by our variables. Since instructors teach more than one course in an academic year, the same instructor will appear more than once in the data. Hence there are fewer than 463 unique instructors being represented in <code>evals_ch5</code>. We’ll revisit this idea in Section <a href="10-inference-for-regression.html#regression-conditions">10.3</a>, when we talk about the “independence assumption” for inference for regression.</p>
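+<p>As a rough check of this point, one could count the distinct instructors directly. The sketch below assumes that the full <code>evals</code> data frame contains an instructor identifier column named <code>prof_ID</code>. That column is not among the four variables selected in <code>evals_ch5</code>, so verify that it exists (for example with <code>glimpse(evals)</code>) before relying on it:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+# Assumption: evals has an instructor identifier column called prof_ID
+evals %&gt;%
+  summarize(n_courses = n(),                      # one row per course
+            n_instructors = n_distinct(prof_ID))  # distinct instructors</code></pre></div>
+<p>If <code>n_instructors</code> comes out smaller than <code>n_courses</code>, then at least one instructor appears in more than one row.</p>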
+<p>A full description of all the variables included in <code>evals</code> can be found at <a href="https://www.openintro.org/stat/data/?data=evals">openintro.org</a> or by reading the associated help file (run <code>?evals</code> in the console). However, let’s fully describe only the 4 variables we selected in <code>evals_ch5</code>:</p>
 <ol style="list-style-type: decimal">
 <li><code>ID</code>: An identification variable used to distinguish between the 1 through 463 courses in the dataset.</li>
 <li><code>score</code>: A numerical variable of the course instructor’s average teaching score, where the average is computed from the evaluation scores from all students in that course. Teaching scores of 1 are lowest and 5 are highest. This is the outcome variable <span class="math inline">\(y\)</span> of interest.</li>
-<li><code>bty_avg</code>: A numerical variable of the course instructor’s average “beauty” score, where the average is computed from a separate panel of 6 students. “Beauty” scores of 1 are lowest and 10 are highest. This is the explanatory variable <span class="math inline">\(x\)</span> of interest.</li>
-<li><code>age</code>: A numerical variable of the course instructor’s age. This will be another explanatory variable <span class="math inline">\(x\)</span> we’ll study later.</li>
+<li><code>bty_avg</code>: A numerical variable of the course instructor’s average “beauty” score, where the average is computed from a separate panel of six students. “Beauty” scores of 1 are lowest and 10 are highest. This is the explanatory variable <span class="math inline">\(x\)</span> of interest.</li>
+<li><code>age</code>: A numerical variable of the course instructor’s age. This will be another explanatory variable <span class="math inline">\(x\)</span> that we’ll use in the <em>Learning check</em> at the end of this subsection.</li>
 </ol>
-<p>An alternative way to look at the raw data values is by choosing a random sample of the rows in <code>evals_ch6</code> by piping it into the <code>sample_n()</code>  function from the <code>dplyr</code> package. Here we set the <code>size</code> argument to be <code>5</code>, indicating that we want a random sample of 5 rows. We display the results Table <a href="5-regression.html#tab:five-random-courses">5.1</a>. Note due to the random nature of the sampling, you will likely end up with a different subset of 5 rows.</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</code></pre>
+<p>An alternative way to look at the raw data values is by choosing a random sample of the rows in <code>evals_ch5</code> by piping it into the <code>sample_n()</code>  function from the <code>dplyr</code> package. Here we set the <code>size</code> argument to be <code>5</code>, indicating that we want a random sample of 5 rows. We display the results in Table <a href="5-regression.html#tab:five-random-courses">5.1</a>. Note that due to the random nature of the sampling, you will likely end up with a different subset of 5 rows.</p>
+<div class="sourceCode" id="cb145"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb145-1" data-line-number="1">evals_ch5 <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb145-2" data-line-number="2"><span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:five-random-courses">TABLE 5.1: </span>A random sample of 5 out of the 463 courses at UT Austin
@@ -741,107 +754,114 @@ <h3><span class="header-section-number">5.1.1</span> Exploratory data analysis</
 </tr>
 </tbody>
 </table>
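+<p>If you want the same five rows every time the code is run, one option is to fix R’s random number seed before sampling. This is a minimal sketch; the seed value <code>76</code> is arbitrary and is our own choice for illustration:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+evals_ch5 &lt;- evals %&gt;%
+  select(ID, score, bty_avg, age)
+
+set.seed(76)  # arbitrary seed; any fixed integer makes the sample repeatable
+evals_ch5 %&gt;%
+  sample_n(size = 5)</code></pre></div>
+<p>Rerunning both the <code>set.seed()</code> line and the sampling code returns the same five courses, which is handy when you want to discuss specific rows with someone else.</p>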
-<p>Now that we’ve looked at the raw values in our <code>evals_ch6</code> data frame and got a preliminary sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s start by computing the mean and median of our numerical outcome variable <code>score</code> and our numerical explanatory variable <code>bty_avg</code> “beauty” score. We’ll do this by using the <code>summarize()</code> function from <code>dplyr</code> along with the <code>mean()</code> and <code>median()</code> summary functions we saw in Section <a href="3-wrangling.html#summarize">3.3</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_bty_avg =</span> <span class="kw">mean</span>(bty_avg), <span class="dt">mean_score =</span> <span class="kw">mean</span>(score),
-            <span class="dt">median_bty_avg =</span> <span class="kw">median</span>(bty_avg), <span class="dt">median_score =</span> <span class="kw">median</span>(score))</code></pre>
+<p>Now that we’ve looked at the raw values in our <code>evals_ch5</code> data frame and got a preliminary sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s start by computing the mean and median of our numerical outcome variable <code>score</code> and our numerical explanatory variable “beauty” score denoted as <code>bty_avg</code>. We’ll do this by using the <code>summarize()</code> function from <code>dplyr</code> along with the <code>mean()</code> and <code>median()</code> summary functions we saw in Section <a href="3-wrangling.html#summarize">3.3</a>.</p>
+<div class="sourceCode" id="cb146"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb146-1" data-line-number="1">evals_ch5 <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb146-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_bty_avg =</span> <span class="kw">mean</span>(bty_avg), <span class="dt">mean_score =</span> <span class="kw">mean</span>(score),</a>
+<a class="sourceLine" id="cb146-3" data-line-number="3">            <span class="dt">median_bty_avg =</span> <span class="kw">median</span>(bty_avg), <span class="dt">median_score =</span> <span class="kw">median</span>(score))</a></code></pre></div>
 <pre><code># A tibble: 1 x 4
   mean_bty_avg mean_score median_bty_avg median_score
          &lt;dbl&gt;      &lt;dbl&gt;          &lt;dbl&gt;        &lt;dbl&gt;
 1         4.42       4.17           4.33          4.3</code></pre>
-<p>However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles? Typing out all these summary statistic functions in <code>summarize()</code> would be long and tedious. Instead, let’s use the convenient <code>skim()</code> function from the <code>skimr</code>  package. This function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our <code>evals_ch6</code> data frame, <code>select()</code> only the outcome and explanatory variables teaching <code>score</code> and <code>bty_avg</code>, and pipe them into the <code>skim()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(score, bty_avg) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">skim</span>()</code></pre>
+<p>However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles?</p>
+<p>Typing out all these summary statistic functions in <code>summarize()</code> would be long and tedious. Instead, let’s use the convenient <code>skim()</code> function from the <code>skimr</code>  package. This function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our <code>evals_ch5</code> data frame, <code>select()</code> only the outcome and explanatory variables teaching <code>score</code> and <code>bty_avg</code>, and pipe them into the <code>skim()</code> function:</p>
+<div class="sourceCode" id="cb148"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb148-1" data-line-number="1">evals_ch5 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(score, bty_avg) <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">skim</span>()</a></code></pre></div>
 <pre><code>Skim summary statistics
  n obs: 463 
  n variables: 2 
 
-── Variable type:numeric ───────────────────────────────────────────────────────
+── Variable type:numeric
  variable missing complete   n mean   sd   p0  p25  p50 p75 p100
   bty_avg       0      463 463 4.42 1.53 1.67 3.17 4.33 5.5 8.17
     score       0      463 463 4.17 0.54 2.3  3.8  4.3  4.6 5   </code></pre>
-<p>(Note that for formatting purposes, the inline histogram that is usually printed with <code>skim()</code> has been removed.)</p>
-<p>For our two numerical variables teaching <code>score</code> and “beauty” score <code>bty_avg</code> it returns:</p>
+<!--
+TODO: 
+Update skimr::skim() output to match v2.0.1
+
+Skipped: Couldn't figure out how to use skim_with(ts = sfl(line_graph = NULL))
+at https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html
+
+Used remotes::install_version("skimr", version = "1.0.6") to use that version
+instead.
+-->
+<p>(For formatting purposes in this book, the inline histogram that is usually printed with <code>skim()</code> has been removed. This can be done by using <code>skim_with(numeric = list(hist = NULL))</code> prior to using the <code>skim()</code> function for version 1.0.6 of <code>skimr</code>.)</p>
+<p>For the numerical variables <code>score</code> and <code>bty_avg</code>, <code>skim()</code> returns:</p>
 <ul>
 <li><code>missing</code>: the number of missing values</li>
 <li><code>complete</code>: the number of non-missing or complete values</li>
 <li><code>n</code>: the total number of values</li>
-<li><code>mean</code>: the mean AKA average</li>
+<li><code>mean</code>: the average</li>
 <li><code>sd</code>: the standard deviation</li>
-<li><code>p0</code>: the 0<sup>th</sup> percentile: the value at which 0% of observations are smaller than it AKA the <em>minimum</em> value</li>
-<li><code>p25</code>: the 25<sup>th</sup> percentile: the value at which 25% of observations are smaller than it AKA the <em>1<sup>st</sup> quartile</em></li>
-<li><code>p50</code>: the 50<sup>th</sup> percentile: the value at which 50% of observations are smaller than it AKA the <em>2<sup>nd</sup></em> quartile and more commonly the <em>median</em></li>
-<li><code>p75</code>: the 75<sup>th</sup> percentile: the value at which 75% of observations are smaller than it AKA the <em>3<sup>rd</sup> quartile</em></li>
-<li><code>p100</code>: the 100<sup>th</sup> percentile: the value at which 100% of observations are smaller than it AKA the <em>maximum</em> value</li>
+<li><code>p0</code>: the 0th percentile: the value at which 0% of observations are smaller than it (the <em>minimum</em> value)</li>
+<li><code>p25</code>: the 25th percentile: the value at which 25% of observations are smaller than it (the <em>1st quartile</em>)</li>
+<li><code>p50</code>: the 50th percentile: the value at which 50% of observations are smaller than it (the <em>2nd</em> quartile and more commonly called the <em>median</em>)</li>
+<li><code>p75</code>: the 75th percentile: the value at which 75% of observations are smaller than it (the <em>3rd quartile</em>)</li>
+<li><code>p100</code>: the 100th percentile: the value at which 100% of observations are smaller than or equal to it (the <em>maximum</em> value)</li>
 </ul>
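+<p>To connect these columns to functions you may already know, the following sketch recomputes the five percentiles of <code>score</code> with base R’s <code>quantile()</code> function inside <code>summarize()</code>, along with the interquartile range via <code>IQR()</code>. The output column names are our own, and we use the full <code>evals</code> data frame so that the code runs on its own. The values should agree with the <code>p0</code> through <code>p100</code> columns for <code>score</code> up to rounding:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+evals %&gt;%
+  summarize(p0   = quantile(score, probs = 0),     # minimum
+            p25  = quantile(score, probs = 0.25),  # 1st quartile
+            p50  = quantile(score, probs = 0.5),   # median
+            p75  = quantile(score, probs = 0.75),  # 3rd quartile
+            p100 = quantile(score, probs = 1),     # maximum
+            iqr  = IQR(score))                     # p75 minus p25</code></pre></div>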
-<p>Looking at this output, we get an idea of how the values of both variables distribute. For example, the mean teaching score was 4.17 out of 5 whereas the mean “beauty” score was 4.42 out of 10. Furthermore, the middle 50% of teaching scores were between 3.80 and 4.6 (the first and third quartiles) whereas the middle 50% of “beauty” scores were between 3.17 and 5.5 out of 10.</p>
-<p>However, the <code>skim()</code> function only returns what are known as <em>univariate</em>  summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist <em>bivariate</em>  summary statistics: functions that take in two variables and return some summary of those two variables. In particular, when the two variables are numerical, we can compute the  <em>correlation coefficient</em>. Generally speaking, <em>coefficients</em> are quantitative expressions of a specific phenomenon. A <em>correlation coefficient</em> is a quantitative expression of the <em>strength of the linear relationship between two numerical variables</em>. Its value ranges between -1 and 1 where:</p>
+<p>Looking at this output, we can see how the values of both variables are distributed. For example, the mean teaching score was 4.17 out of 5, whereas the mean “beauty” score was 4.42 out of 10. Furthermore, the middle 50% of teaching scores were between 3.8 and 4.6 (the first and third quartiles), whereas the middle 50% of “beauty” scores were between 3.17 and 5.5 out of 10.</p>
+<p>The <code>skim()</code> function only returns what are known as <em>univariate</em>  summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist <em>bivariate</em>  summary statistics: functions that take in two variables and return some summary of those two variables. In particular, when the two variables are numerical, we can compute the  <em>correlation coefficient</em>. Generally speaking, <em>coefficients</em> are quantitative expressions of a specific phenomenon. A <em>correlation coefficient</em> is a quantitative expression of the <em>strength of the linear relationship between two numerical variables</em>. Its value ranges between -1 and 1 where:</p>
 <ul>
-<li>-1 indicates a perfect <em>negative relationship</em>: As the value of one variable goes up, the value of the other variable tends to go down.</li>
+<li>-1 indicates a perfect <em>negative relationship</em>: As one variable increases, the value of the other variable tends to go down, following a straight line.</li>
 <li>0 indicates no relationship: The values of both variables go up/down independently of each other.</li>
-<li>+1 indicates a perfect <em>positive relationship</em>: As the value of one variable goes up, the value of the other variable tends to go up as well.</li>
+<li>+1 indicates a perfect <em>positive relationship</em>: As the value of one variable goes up, the value of the other variable tends to go up as well in a linear fashion.</li>
 </ul>
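+<p>These three reference values can be illustrated with small made-up vectors. The data below are purely hypothetical and are not part of <code>evals</code>. The last line also shows that a correlation of 0 only rules out a <em>linear</em> relationship: <code>y_curved</code> is completely determined by <code>x</code>, yet its correlation with <code>x</code> is 0.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x &lt;- -5:5
+
+y_up     &lt;- 2 * x + 1   # points fall exactly on an increasing line
+y_down   &lt;- -3 * x + 5  # points fall exactly on a decreasing line
+y_curved &lt;- x^2         # symmetric U-shape around x = 0
+
+cor(x, y_up)      # perfect positive linear relationship
+cor(x, y_down)    # perfect negative linear relationship
+cor(x, y_curved)  # no linear relationship, despite a clear pattern</code></pre></div>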
 <p>Figure <a href="5-regression.html#fig:correlation1">5.1</a> gives examples of 9 different correlation coefficient values for hypothetical numerical variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. For example, observe in the top right plot that for a correlation coefficient of -0.75 there is a negative linear relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, but it is not as strong as the negative linear relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> when the correlation coefficient is -0.9 or -1.</p>
 <div class="figure" style="text-align: center"><span id="fig:correlation1"></span>
-<img src="moderndive_files/figure-html/correlation1-1.png" alt="Different correlation coefficients." width="\textwidth" />
+<img src="ModernDive_files/figure-html/correlation1-1.png" alt="Nine different correlation coefficients." width="\textwidth" />
 <p class="caption">
-FIGURE 5.1: Different correlation coefficients.
+FIGURE 5.1: Nine different correlation coefficients.
 </p>
 </div>
-<p>The correlation coefficient can be computed using the <code>get_correlation()</code>  function in the <code>moderndive</code> package, where in this case the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient. We put the name of the response variable on the left-hand side of the <code>~</code> “tilde” sign, while putting the name of the explanatory variable on the right-hand side. This is known as R’s  <em>formula notation</em>. We will use this same “formula” syntax with regression later in this chapter.</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">get_correlation</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>bty_avg)</code></pre>
+<p>The correlation coefficient can be computed using the <code>get_correlation()</code>  function in the <code>moderndive</code> package. In this case, the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient.</p>
+<p>We put the name of the outcome variable on the left-hand side of the <code>~</code> “tilde” sign, while putting the name of the explanatory variable on the right-hand side. This is known as R’s  <em>formula notation</em>. We will use this same “formula” syntax with regression later in this chapter.</p>
+<div class="sourceCode" id="cb150"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb150-1" data-line-number="1">evals_ch5 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb150-2" data-line-number="2"><span class="st">  </span><span class="kw">get_correlation</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>bty_avg)</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
-  correlation
-        &lt;dbl&gt;
-1       0.187</code></pre>
-<p>An alternative way to compute the correlation coefficient is to use the <code>cor()</code> function within a <code>summarize()</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch6 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">correlation =</span> <span class="kw">cor</span>(score, bty_avg))</code></pre>
-<pre><code># A tibble: 1 x 1
-  correlation
-        &lt;dbl&gt;
-1       0.187</code></pre>
-<p>In our case, the correlation coefficient of 0.187 indicates that the relationship between teaching evaluation score and “beauty” average is “weakly positive.” There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that aren’t close to the extreme values of -1, 0, and 1. To develop your intuition about correlation coefficients, play the “Guess the Correlation” 1980’s style video game in Subsection <a href="5-regression.html#additional-resources-basic-regression">5.4.1</a>.</p>
-<p>Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Since both the <code>score</code> and <code>bty_avg</code> variables are numerical, a scatterplot is an appropriate graph to visualize this data. Let’s do this using <code>geom_point()</code> and display the result in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>. Furthermore, let’s highlight the 6 points in the top right of the visualization in an orange box.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(evals_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Scatterplot of relationship of teaching and beauty scores&quot;</span>)</code></pre>
+    cor
+  &lt;dbl&gt;
+1 0.187</code></pre>
+<p>An alternative way to compute correlation is to use the <code>cor()</code> summary function within a <code>summarize()</code>:</p>
+<div class="sourceCode" id="cb152"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb152-1" data-line-number="1">evals_ch5 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb152-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">correlation =</span> <span class="kw">cor</span>(score, bty_avg))</a></code></pre></div>
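+<p>To see what <code>cor()</code> is doing, here is a sketch that computes the same quantity from the standard formula for the sample correlation coefficient: the sum of the products of each variable’s deviations from its mean, divided by <span class="math inline">\(n - 1\)</span> times the two standard deviations. The column name <code>correlation_by_hand</code> is our own, and we use the full <code>evals</code> data frame so that the code runs on its own:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+evals %&gt;%
+  summarize(
+    correlation = cor(score, bty_avg),
+    # Same quantity written out from the definition
+    correlation_by_hand =
+      sum((score - mean(score)) * (bty_avg - mean(bty_avg))) /
+      ((n() - 1) * sd(score) * sd(bty_avg))
+  )</code></pre></div>
+<p>Both columns should show the same value.</p>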
+<p>In our case, the correlation coefficient of 0.187 indicates that the relationship between teaching evaluation score and “beauty” average is “weakly positive.” There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that aren’t close to the extreme values of -1, 0, and 1. To develop your intuition about correlation coefficients, play the “Guess the Correlation” 1980s-style video game mentioned in Subsection <a href="5-regression.html#additional-resources-basic-regression">5.4.1</a>.</p>
+<p>Let’s now perform the last of the steps in an exploratory data analysis: creating data visualizations. Since both the <code>score</code> and <code>bty_avg</code> variables are numerical, a scatterplot is an appropriate graph to visualize this data. Let’s do this using <code>geom_point()</code> and display the result in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>. Furthermore, let’s highlight the six points in the top right of the visualization in a box.</p>
+<div class="sourceCode" id="cb153"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb153-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch5, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb153-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb153-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, </a>
+<a class="sourceLine" id="cb153-4" data-line-number="4">       <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,</a>
+<a class="sourceLine" id="cb153-5" data-line-number="5">       <span class="dt">title =</span> <span class="st">&quot;Scatterplot of relationship of teaching and beauty scores&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxplot1"></span>
-<img src="moderndive_files/figure-html/numxplot1-1.png" alt="Instructor evaluation scores at UT Austin." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxplot1-1.png" alt="Instructor evaluation scores at UT Austin." width="\textwidth" />
 <p class="caption">
 FIGURE 5.2: Instructor evaluation scores at UT Austin.
 </p>
 </div>
-<p>Observe that most “beauty” scores lie between 2 and 8 while most teaching scores lie between 3 and 5. Furthermore, while opinions may vary, it is our opinion that the relationship between teaching score and “beauty” score is “weakly positive.” This is consistent with our earlier computed correlation coefficient of 0.187.</p>
-<p>Furthermore, there appear to be 6 points in the top-right of this plot highlighted in the orange box. However, this is not actually the case, as this plot suffers from <em>overplotting</em>. Recall from Subsection <a href="2-viz.html#overplotting">2.3.2</a> that overplotting occurs when several points are stacked directly on top of each other, making it difficult to distinguish them. So while it may appear that there are only 6 points in the orange box, there are actually more. This fact is only apparent when using <code>geom_jitter()</code> in place of <code>geom_point()</code>. We display the resulting plot in Figure <a href="5-regression.html#fig:numxplot2">5.3</a> along with the same orange box as in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(evals_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_jitter</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Scatterplot of relationship of teaching and beauty scores&quot;</span>)</code></pre>
+<p>Observe that most “beauty” scores lie between 2 and 8, while most teaching scores lie between 3 and 5. Furthermore, while opinions may vary, it is our opinion that the relationship between teaching score and “beauty” score is “weakly positive.” This is consistent with our earlier computed correlation coefficient of 0.187.</p>
+<p>Furthermore, there appear to be six points in the top-right of this plot highlighted in the box. However, this is not actually the case, as this plot suffers from <em>overplotting</em>. Recall from Subsection <a href="2-viz.html#overplotting">2.3.2</a> that overplotting occurs when several points are stacked directly on top of each other, making it difficult to distinguish them. So while it may appear that there are only six points in the box, there are actually more. This fact is only apparent when using <code>geom_jitter()</code> in place of <code>geom_point()</code>. We display the resulting plot in Figure <a href="5-regression.html#fig:numxplot2">5.3</a> along with the same small box as in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>.</p>
+<div class="sourceCode" id="cb154"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb154-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch5, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb154-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_jitter</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb154-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,</a>
+<a class="sourceLine" id="cb154-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Scatterplot of relationship of teaching and beauty scores&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxplot2"></span>
-<img src="moderndive_files/figure-html/numxplot2-1.png" alt="Instructor evaluation scores at UT Austin." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxplot2-1.png" alt="Instructor evaluation scores at UT Austin." width="\textwidth" />
 <p class="caption">
 FIGURE 5.3: Instructor evaluation scores at UT Austin.
 </p>
 </div>
-<p>It is now apparent that there are 12 points in the area highlighted in orange and not 6 as originally suggested in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>. Recall from Section <a href="2-viz.html#overplotting">2.3.2</a> on overplotting that jittering adds a little random “nudge” to each of the points to break up these ties. Furthermore, recall that jittering is strictly a visualization tool; it does not alter the original values in the data frame <code>evals_ch6</code>. To keep things simple going forward however, we’ll only present regular scatterplots rather than their jittered counterparts.</p>
-<p>Let’s build on the unjittered scatterplot in Figure <a href="5-regression.html#fig:numxplot1">5.2</a> by adding a “best-fitting” line: of all possible lines we can draw on this scatterplot, it is the line that “best” fits through the cloud of points. We do this by adding a new <code>geom_smooth(method = &quot;lm&quot;, se = FALSE)</code> layer to the <code>ggplot()</code> code that created the scatterplot in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>. The <code>method = &quot;lm&quot;</code> argument sets the line to be a “linear model” i.e. a line, while the <code>se = FALSE</code>  argument suppresses “standard error” uncertainty bars.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(evals_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Relationship between teaching and beauty scores&quot;</span>) <span class="op">+</span><span class="st">  </span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</code></pre>
+<p>It is now apparent that there are 12 points in the area highlighted in the box and not six as originally suggested in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>. Recall from Subsection <a href="2-viz.html#overplotting">2.3.2</a> on overplotting that jittering adds a little random “nudge” to each of the points to break up these ties. Furthermore, recall that jittering is strictly a visualization tool; it does not alter the original values in the data frame <code>evals_ch5</code>. To keep things simple going forward, however, we’ll only present regular scatterplots rather than their jittered counterparts.</p>
+<p>Let’s build on the unjittered scatterplot in Figure <a href="5-regression.html#fig:numxplot1">5.2</a> by adding a “best-fitting” line: of all possible lines we can draw on this scatterplot, it is the line that “best” fits through the cloud of points. We do this by adding a new <code>geom_smooth(method = &quot;lm&quot;, se = FALSE)</code> layer to the <code>ggplot()</code> code that created the scatterplot in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>. The <code>method = &quot;lm&quot;</code> argument sets the line to be a “<code>l</code>inear <code>m</code>odel.” The <code>se = FALSE</code>  argument suppresses <em>standard error</em> uncertainty bars. (We’ll define the concept of <em>standard error</em> later in Subsection <a href="7-sampling.html#sampling-definitions">7.3.2</a>.)</p>
+<div class="sourceCode" id="cb155"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb155-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch5, <span class="kw">aes</span>(<span class="dt">x =</span> bty_avg, <span class="dt">y =</span> score)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb155-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb155-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Beauty Score&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,</a>
+<a class="sourceLine" id="cb155-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Relationship between teaching and beauty scores&quot;</span>) <span class="op">+</span><span class="st">  </span></a>
+<a class="sourceLine" id="cb155-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxplot3"></span>
-<img src="moderndive_files/figure-html/numxplot3-1.png" alt="Regression line." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxplot3-1.png" alt="Regression line." width="\textwidth" />
 <p class="caption">
 FIGURE 5.4: Regression line.
 </p>
 </div>
-<p>The blue line in the resulting Figure <a href="5-regression.html#fig:numxplot3">5.4</a> is called a “regression line.” The regression line  is a visual summary of the relationship between two numerical variables, in our case the outcome variable <code>score</code> and the explanatory variable <code>bty_avg</code>. The positive slope of the blue line is consistent with our earlier observed correlation coefficient of 0.187 suggesting that there is a positive relationship between these two variables: as instructors have higher “beauty” scores, so also do they receive higher teaching evaluations. We’ll see later however that while the correlation coefficient and the slope of a regression line always have the same sign (positive or negative), they do not necessarily have the same value.</p>
-<p>Furthermore, a regression line is “best-fitting” in that it minimizes some mathematical criteria. We present this mathematical criteria in Subsection <a href="5-regression.html#leastsquares">5.3.2</a>, but we suggest you read this subsection only after reading the rest of this section on regression with one numerical explanatory variable.</p>
+<p>The line in the resulting Figure <a href="5-regression.html#fig:numxplot3">5.4</a> is called a “regression line.” The regression line  is a visual summary of the relationship between two numerical variables, in our case the outcome variable <code>score</code> and the explanatory variable <code>bty_avg</code>. The positive slope of the line is consistent with our earlier observed correlation coefficient of 0.187, suggesting that there is a positive relationship between these two variables: as instructors have higher “beauty” scores, so also do they tend to receive higher teaching evaluations. We’ll see later, however, that while the correlation coefficient and the slope of a regression line always have the same sign (positive or negative), they typically do not have the same value.</p>
+<p>Furthermore, a regression line is “best-fitting” in that it minimizes some mathematical criteria. We present these mathematical criteria in Subsection <a href="5-regression.html#leastsquares">5.3.2</a>, but we suggest you read this subsection only after first reading the rest of this section on regression with one numerical explanatory variable.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -860,18 +880,18 @@ <h3><span class="header-section-number">5.1.1</span> Exploratory data analysis</
 </div>
 <div id="model1table" class="section level3">
 <h3><span class="header-section-number">5.1.2</span> Simple linear regression</h3>
-<p>You may recall from secondary/high school algebra that the equation of a line is <span class="math inline">\(y = a + b\cdot x\)</span>. (Note that the <span class="math inline">\(\cdot\)</span> symbol is equivalent to the <span class="math inline">\(\times\)</span> “multiply by” mathematical symbol. We’ll use the <span class="math inline">\(\cdot\)</span> symbol in this book as it is more succinct.) It is defined by two coefficients <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>: the intercept coefficient <span class="math inline">\(a\)</span> i.e. the value of <span class="math inline">\(y\)</span> when <span class="math inline">\(x = 0\)</span> and the slope coefficient <span class="math inline">\(b\)</span> for <span class="math inline">\(x\)</span> i.e. the increase in <span class="math inline">\(y\)</span> for every increase of one in <span class="math inline">\(x\)</span>.</p>
-<p>However, when defining a regression line like the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a>, we use slightly different notation: the equation of the regression line is <span class="math inline">\(\widehat{y} = b_0 + b_1 \cdot x\)</span>  where the intercept coefficient is <span class="math inline">\(b_0\)</span> i.e. the value of <span class="math inline">\(\widehat{y}\)</span> when <span class="math inline">\(x=0\)</span>. The slope coefficient for <span class="math inline">\(x\)</span> is <span class="math inline">\(b_1\)</span> i.e. the increase in <span class="math inline">\(\widehat{y}\)</span> for every increase of one in <span class="math inline">\(x\)</span>. Why do we put a “hat” on top of the <span class="math inline">\(y\)</span>? It’s a form of notation commonly used in regression to indicate that we have a  “fitted value”, or the value of <span class="math inline">\(y\)</span> on the regression line for a given <span class="math inline">\(x\)</span> value. We’ll discuss this more in the upcoming Subsection <a href="5-regression.html#model1points">5.1.3</a>.</p>
-<p>We know that the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a> has a positive slope <span class="math inline">\(b_1\)</span> corresponding to our explanatory <span class="math inline">\(x\)</span> variable <code>bty_avg</code>. Why? Because as instructors have higher <code>bty_avg</code> scores, so also do they tend to have higher teaching evaluation <code>scores</code>. However, what is the numerical value of the slope <span class="math inline">\(b_1\)</span>? What about the intercept <span class="math inline">\(b_0\)</span>? Let’s not compute these two values by hand, but rather let’s use a computer!</p>
+<p>You may recall from secondary/high school algebra that the equation of a line is <span class="math inline">\(y = a + b\cdot x\)</span>. (Note that the <span class="math inline">\(\cdot\)</span> symbol is equivalent to the <span class="math inline">\(\times\)</span> “multiply by” mathematical symbol. We’ll use the <span class="math inline">\(\cdot\)</span> symbol in the rest of this book as it is more succinct.) It is defined by two coefficients <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>. The intercept coefficient <span class="math inline">\(a\)</span> is the value of <span class="math inline">\(y\)</span> when <span class="math inline">\(x = 0\)</span>. The slope coefficient <span class="math inline">\(b\)</span> for <span class="math inline">\(x\)</span> is the increase in <span class="math inline">\(y\)</span> for every increase of one in <span class="math inline">\(x\)</span>. This is also called the “rise over run.”</p>
+<p>However, when defining a regression line like the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a>, we use slightly different notation: the equation of the regression line is <span class="math inline">\(\widehat{y} = b_0 + b_1 \cdot x\)</span>. The intercept coefficient is <span class="math inline">\(b_0\)</span>, so <span class="math inline">\(b_0\)</span> is the value of <span class="math inline">\(\widehat{y}\)</span> when <span class="math inline">\(x = 0\)</span>. The slope coefficient for <span class="math inline">\(x\)</span> is <span class="math inline">\(b_1\)</span>, i.e., the increase in <span class="math inline">\(\widehat{y}\)</span> for every increase of one in <span class="math inline">\(x\)</span>. Why do we put a “hat” on top of the <span class="math inline">\(y\)</span>? It’s a form of notation commonly used in regression to indicate that we have a  “fitted value,” or the value of <span class="math inline">\(y\)</span> on the regression line for a given <span class="math inline">\(x\)</span> value. We’ll discuss this more in the upcoming Subsection <a href="5-regression.html#model1points">5.1.3</a>.</p>
+<p>We know that the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a> has a positive slope <span class="math inline">\(b_1\)</span> corresponding to our explanatory <span class="math inline">\(x\)</span> variable <code>bty_avg</code>. Why? Because as instructors tend to have higher <code>bty_avg</code> scores, so also do they tend to have higher teaching evaluation <code>score</code>s. However, what is the numerical value of the slope <span class="math inline">\(b_1\)</span>? What about the intercept <span class="math inline">\(b_0\)</span>? Let’s not compute these two values by hand, but rather let’s use a computer!</p>
 <p>We can obtain the values of the intercept <span class="math inline">\(b_0\)</span> and the slope for <code>bty_avg</code> <span class="math inline">\(b_1\)</span> by outputting a <em>linear regression table</em>. This is done in two steps:</p>
 <ol style="list-style-type: decimal">
 <li>We first “fit” the linear regression model using the <code>lm()</code> function and save it in <code>score_model</code>.</li>
-<li>We get the regression table by applying the <code>get_regression_table()</code>  from the <code>moderndive</code> package to <code>score_model</code>.</li>
+<li>We get the regression table by applying the <code>get_regression_table()</code>  function from the <code>moderndive</code> package to <code>score_model</code>.</li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch6)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model)</code></pre>
+<div class="sourceCode" id="cb156"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb156-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb156-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb156-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb156-4" data-line-number="4"><span class="kw">get_regression_table</span>(score_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:regtable">TABLE 5.2: </span>Linear regression table
@@ -950,7 +970,7 @@ <h3><span class="header-section-number">5.1.2</span> Simple linear regression</h
 </tr>
 </tbody>
 </table>
-<p>Let’s first focus on interpreting the regression table output in Table <a href="5-regression.html#tab:regtable">5.2</a> and then we’ll later revisit the code that produced it. In the <code>estimate</code> column of Table <a href="5-regression.html#tab:regtable">5.2</a> are the intercept <span class="math inline">\(b_0\)</span> = 3.88 and the slope <span class="math inline">\(b_1\)</span> = 0.067 for <code>bty_avg</code>. Thus the equation of the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a> follows:</p>
+<p>Let’s first focus on interpreting the regression table output in Table <a href="5-regression.html#tab:regtable">5.2</a>, and then we’ll later revisit the code that produced it. In the <code>estimate</code> column of Table <a href="5-regression.html#tab:regtable">5.2</a> are the intercept <span class="math inline">\(b_0\)</span> = 3.88 and the slope <span class="math inline">\(b_1\)</span> = 0.067 for <code>bty_avg</code>. Thus the equation of the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a> follows:</p>
 <p><span class="math display">\[
 \begin{aligned}
 \widehat{y} &amp;= b_0 + b_1 \cdot x\\
@@ -958,47 +978,47 @@ <h3><span class="header-section-number">5.1.2</span> Simple linear regression</h
 &amp;= 3.880 + 0.067\cdot\text{bty}\_\text{avg}
 \end{aligned}
 \]</span></p>
-<p>The intercept <span class="math inline">\(b_0\)</span> = 3.880 is the average teaching score <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> for those courses where the instructor had a “beauty” score <code>bty_avg</code> of 0. Or in graphical terms, it’s where the line intersects the <span class="math inline">\(y\)</span> axis when <span class="math inline">\(x\)</span> = 0. Note however that while the  intercept of the regression line has a mathematical interpretation, it has no <em>practical</em> interpretation, since observing a <code>bty_avg</code> of 0 is impossible; it is the average of six panelists’ “beauty” score ranging from 1 to 10. Furthermore, looking at the scatterplot with the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a>, no instructors had a “beauty” score anywhere near 0.</p>
+<p>The intercept <span class="math inline">\(b_0\)</span> = 3.88 is the average teaching score <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> for those courses where the instructor had a “beauty” score <code>bty_avg</code> of 0. Or in graphical terms, it’s where the line intersects the <span class="math inline">\(y\)</span> axis when <span class="math inline">\(x\)</span> = 0. Note, however, that while the  intercept of the regression line has a mathematical interpretation, it has no <em>practical</em> interpretation here, since observing a <code>bty_avg</code> of 0 is impossible; it is the average of six panelists’ “beauty” scores ranging from 1 to 10. Furthermore, looking at the scatterplot with the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a>, no instructors had a “beauty” score anywhere near 0.</p>
 <p>Of greater interest is the  slope <span class="math inline">\(b_1\)</span> = <span class="math inline">\(b_{\text{bty\_avg}}\)</span> for <code>bty_avg</code> of 0.067, as this summarizes the relationship between the teaching and “beauty” score variables. Note that the sign is positive, suggesting a positive relationship between these two variables, meaning teachers with higher “beauty” scores also tend to have higher teaching scores. Recall from earlier that the correlation coefficient is 0.187. They both have the same positive sign, but have a different value. Recall further that the correlation’s interpretation is the “strength of linear association”. The  slope’s interpretation is a little different:</p>
 <blockquote>
 <p>For every increase of 1 unit in <code>bty_avg</code>, there is an <em>associated</em> increase of, <em>on average</em>, 0.067 units of <code>score</code>.</p>
 </blockquote>
-<p>We only state that there is an <em>associated</em> increase and not necessarily a <em>causal</em> increase. For example, perhaps it’s not that higher “beauty” scores directly cause higher teaching scores per se. Instead it could be that individuals from wealthier backgrounds tend to have stronger educational backgrounds and hence have higher teaching scores, but that these wealthy individuals also have higher “beauty” scores. In other words, just because two variables are strongly associated, it doesn’t necessarily mean that one causes the other. This is summed up in the often quoted phrase “correlation is not necessarily causation.” We discuss this idea further in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a>.</p>
+<p>We only state that there is an <em>associated</em> increase and not necessarily a <em>causal</em> increase. For example, perhaps it’s not that higher “beauty” scores directly cause higher teaching scores per se. Instead, the following could hold true: individuals from wealthier backgrounds tend to have stronger educational backgrounds and hence have higher teaching scores, while at the same time these wealthy individuals also tend to have higher “beauty” scores. In other words, just because two variables are strongly associated, it doesn’t necessarily mean that one causes the other. This is summed up in the often quoted phrase, “correlation is not necessarily causation.” We discuss this idea further in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a>.</p>
 <p>Furthermore, we say that this associated increase is <em>on average</em> 0.067 units of teaching <code>score</code>, because you might have two instructors whose <code>bty_avg</code> scores differ by 1 unit, but their difference in teaching scores won’t necessarily be exactly 0.067. What the slope of 0.067 is saying is that across all possible courses, the <em>average</em> difference in teaching score between two instructors whose “beauty” scores differ by one is 0.067.</p>
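+<p>As a rough illustration of this “on average” reading, the sketch below plugs the rounded estimates from Table <a href="5-regression.html#tab:regtable">5.2</a> into the regression equation. The helper name <code>fitted_score()</code> is our own invention and is not part of any package, and the two <code>bty_avg</code> values are arbitrary.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Rounded estimates taken from the estimate column of Table 5.2
+b0 &lt;- 3.880  # intercept
+b1 &lt;- 0.067  # slope for bty_avg
+
+# Hypothetical helper: fitted teaching score for a given bty_avg
+fitted_score &lt;- function(bty_avg) {
+  b0 + b1 * bty_avg
+}
+
+# Difference in fitted scores for two instructors whose bty_avg differ by one unit
+fitted_score(5) - fitted_score(4)</code></pre>
+<p>The difference between the two fitted values equals the slope (up to floating-point rounding), whereas the observed scores of two such instructors would generally differ by some other amount.</p>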
-<p>Now that we’ve learned how to compute the equation for the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a> using the values in the <code>estimate</code> column of Table <a href="5-regression.html#tab:regtable">5.2</a> and how to interpret the resulting the intercept and slope, let’s revisit the code that generated this table:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch6)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model)</code></pre>
-<p>First, we “fit” the linear regression model to the <code>data</code> using the <code>lm()</code>  function and save this to <code>score_model</code>. When we say “fit”, we mean “find the best fitting line to this data.” <code>lm()</code> stands for “linear model” and is used as follows: <code>lm(y ~ x, data = data_frame_name)</code> where:</p>
+<p>Now that we’ve learned how to compute the equation for the regression line in Figure <a href="5-regression.html#fig:numxplot3">5.4</a> using the values in the <code>estimate</code> column of Table <a href="5-regression.html#tab:regtable">5.2</a>, and how to interpret the resulting intercept and slope, let’s revisit the code that generated this table:</p>
+<div class="sourceCode" id="cb157"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb157-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb157-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb157-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb157-4" data-line-number="4"><span class="kw">get_regression_table</span>(score_model)</a></code></pre></div>
+<p>First, we “fit” the linear regression model to the <code>data</code> using the <code>lm()</code>  function and save this as <code>score_model</code>. When we say “fit”, we mean “find the best fitting line to this data.” <code>lm()</code> stands for “linear model” and is used as follows: <code>lm(y ~ x, data = data_frame_name)</code> where:</p>
 <ul>
 <li><code>y</code> is the outcome variable, followed by a tilde <code>~</code>. In our case, <code>y</code> is set to <code>score</code>.</li>
 <li><code>x</code> is the explanatory variable. In our case, <code>x</code> is set to <code>bty_avg</code>.</li>
 <li>The combination of <code>y ~ x</code> is called a <em>model formula</em>. (Note the order of <code>y</code> and <code>x</code>.) In our case, the model formula is <code>score ~ bty_avg</code>. We saw such model formulas earlier when we computed the correlation coefficient using the <code>get_correlation()</code> function in Subsection <a href="5-regression.html#model1EDA">5.1.1</a>.</li>
-<li><code>data_frame_name</code> is the name of the data frame that contains the variables <code>y</code> and <code>x</code>. In our case, <code>data_frame_name</code> is the <code>evals_ch6</code> data frame.</li>
+<li><code>data_frame_name</code> is the name of the data frame that contains the variables <code>y</code> and <code>x</code>. In our case, <code>data_frame_name</code> is the <code>evals_ch5</code> data frame.</li>
 </ul>
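+<p>To make the <code>lm(y ~ x, data = data_frame_name)</code> template concrete, here is a minimal sketch on a small made-up data frame. The data frame <code>toy_data</code> and its values are invented for illustration and have nothing to do with <code>evals_ch5</code>; only base R functions are used.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Made-up data for illustration only
+toy_data &lt;- data.frame(
+  x = c(1, 2, 3, 4, 5),
+  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
+)
+
+# Model formula: outcome on the left of the tilde, explanatory variable on the right
+toy_model &lt;- lm(y ~ x, data = toy_data)
+
+# Extract the fitted intercept and slope
+coef(toy_model)</code></pre>
+<p>Swapping the order to <code>x ~ y</code> would fit a different model, one with <code>x</code> as the outcome variable, which is why the order in the model formula matters.</p>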
 <p>Second, we take the saved model in <code>score_model</code> and apply the <code>get_regression_table()</code> function from the <code>moderndive</code> package to it to obtain the regression table in Table <a href="5-regression.html#tab:regtable">5.2</a>. This function is an example of what’s known in computer programming as a <em>wrapper function</em>.  Wrapper functions take other pre-existing functions and “wrap” them into a single function that hides their inner workings. This concept is illustrated in Figure <a href="5-regression.html#fig:moderndive-figure-wrapper">5.5</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:moderndive-figure-wrapper"></span>
-<img src="images/shutterstock/wrapper_function.png" alt="The concept of a wrapper function." width="\textwidth" />
+<img src="images/shutterstock/wrapper_function.png" alt="The concept of a wrapper function." width="60%" height="60%" />
 <p class="caption">
 FIGURE 5.5: The concept of a wrapper function.
 </p>
 </div>
-<p>So all you need to worry about is the what the inputs look like and what the outputs look like; you leave all the other details “under the hood of the car.” In our regression modeling example, the <code>get_regression_table()</code> function takes a saved <code>lm()</code> linear regression model as input and returns a data frame of the regression table as output. If you’re interested in learning more about the <code>get_regression_table()</code> function’s design and inner-workings, check out Subsection <a href="5-regression.html#underthehood">5.3.3</a>.</p>
-<p>Lastly, you might be wondering what remaining 5 columns in Table <a href="5-regression.html#tab:regtable">5.2</a> are: <code>std_error</code>, <code>statistic</code>, <code>p_value</code>, <code>lower_ci</code> and <code>upper_ci</code>? They are the “standard error”, “test statistic”, “p-value”, “lower 95% confidence interval bound”, and “upper 95% confidence interval bound.” They tell us about both the <em>statistical significance</em> and <em>practical significance</em> of our results. You can think of this loosely as the “meaningfulness” of our results from a statistical perspective. We are going to put aside these ideas for now and revisit them in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on (statistical) inference for regression. We’ll do this after we’ve had a chance to cover standard errors in Chapter <a href="7-sampling.html#sampling">7</a>, confidence intervals in Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and hypothesis testing and p-values in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a></p>
+<p>So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details “under the hood of the car.” In our regression modeling example, the <code>get_regression_table()</code> function takes a saved <code>lm()</code> linear regression model as input and returns a data frame of the regression table as output. If you’re interested in learning more about the <code>get_regression_table()</code> function’s inner workings, check out Subsection <a href="5-regression.html#underthehood">5.3.3</a>.</p>
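+<p>To give a feel for the wrapper idea, the following sketch bundles a few base R calls into one function. It is a hypothetical stand-in that we name <code>simple_regression_table()</code>; it is not how <code>get_regression_table()</code> is actually implemented, and its column names are merely chosen to resemble those in Table <a href="5-regression.html#tab:regtable">5.2</a>. The example uses R’s built-in <code>cars</code> data frame rather than <code>evals_ch5</code>.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical wrapper: NOT the moderndive implementation
+simple_regression_table &lt;- function(model, digits = 3) {
+  coefs &lt;- summary(model)$coefficients  # estimates, std. errors, t values, p-values
+  ci &lt;- confint(model, level = 0.95)    # 95% confidence interval bounds
+  data.frame(
+    term      = rownames(coefs),
+    estimate  = round(coefs[, &quot;Estimate&quot;], digits),
+    std_error = round(coefs[, &quot;Std. Error&quot;], digits),
+    statistic = round(coefs[, &quot;t value&quot;], digits),
+    p_value   = round(coefs[, &quot;Pr(&gt;|t|)&quot;], digits),
+    lower_ci  = round(ci[, 1], digits),
+    upper_ci  = round(ci[, 2], digits),
+    row.names = NULL
+  )
+}
+
+# Usage: a saved lm() model goes in, a data frame comes out
+cars_model &lt;- lm(dist ~ speed, data = cars)
+simple_regression_table(cars_model)</code></pre>
+<p>The caller only sees a saved model as input and a data frame as output; the calls to <code>summary()</code> and <code>confint()</code> stay “under the hood.”</p>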
+<p>Lastly, you might be wondering what the remaining five columns in Table <a href="5-regression.html#tab:regtable">5.2</a> are: <code>std_error</code>, <code>statistic</code>, <code>p_value</code>, <code>lower_ci</code> and <code>upper_ci</code>. They are the <em>standard error</em>, <em>test statistic</em>, <em>p-value</em>, <em>lower 95% confidence interval bound</em>, and <em>upper 95% confidence interval bound</em>. They tell us about both the <em>statistical significance</em> and <em>practical significance</em> of our results. This is loosely the “meaningfulness” of our results from a statistical perspective. Let’s put aside these ideas for now and revisit them in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on (statistical) inference for regression. We’ll do this after we’ve had a chance to cover standard errors in Chapter <a href="7-sampling.html#sampling">7</a>, confidence intervals in Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, and hypothesis testing and <span class="math inline">\(p\)</span>-values in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC5.2)</strong> Fit a new simple linear regression using <code>lm(score ~ age, data = evals_ch6)</code> where <code>age</code> is the new explanatory variable <span class="math inline">\(x\)</span>. Get information about the “best-fitting” line from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your earlier exploratory data analysis?</p>
+<p><strong>(LC5.2)</strong> Fit a new simple linear regression using <code>lm(score ~ age, data = evals_ch5)</code> where <code>age</code> is the new explanatory variable <span class="math inline">\(x\)</span>. Get information about the “best-fitting” line from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your earlier exploratory data analysis?</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="model1points" class="section level3">
 <h3><span class="header-section-number">5.1.3</span> Observed/fitted values and residuals</h3>
-<p>We just saw how to get the value of the intercept and the slope of a regression line from the <code>estimate</code> column of a regression table generated by the <code>get_regression_table()</code> function. Now instead say we want information on individual observations. For example, let’s focus on the 21<sup>st</sup> of the 463 courses in the <code>evals_ch6</code> data frame in Table <a href="5-regression.html#tab:instructor-21">5.3</a>:</p>
+<p>We just saw how to get the value of the intercept and the slope of a regression line from the <code>estimate</code> column of a regression table generated by the <code>get_regression_table()</code> function. Now instead say we want information on individual observations. For example, let’s focus on the 21st of the 463 courses in the <code>evals_ch5</code> data frame in Table <a href="5-regression.html#tab:instructor-21">5.3</a>:</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:instructor-21">TABLE 5.3: </span>Data for the 21st course out of 463
@@ -1036,25 +1056,25 @@ <h3><span class="header-section-number">5.1.3</span> Observed/fitted values and
 </tr>
 </tbody>
 </table>
-<p>What is the value <span class="math inline">\(\widehat{y}\)</span> on the blue line regression line corresponding to this instructor’s <code>bty_avg</code> “beauty” score of 7.333? In Figure <a href="5-regression.html#fig:numxplot4">5.6</a> we mark three values corresponding to the instructor for this 21<sup>st</sup> course and give their statistical names:</p>
+<p>What is the value <span class="math inline">\(\widehat{y}\)</span> on the regression line corresponding to this instructor’s <code>bty_avg</code> “beauty” score of 7.333? In Figure <a href="5-regression.html#fig:numxplot4">5.6</a> we mark three values corresponding to the instructor for this 21st course and give their statistical names:</p>
 <ul>
 <li>Circle: The <em>observed value</em> <span class="math inline">\(y\)</span> = 4.9 is this course’s instructor’s actual teaching score.</li>
-<li>Square: The <em>fitted value</em> <span class="math inline">\(\widehat{y}\)</span> is value on the regression line for <span class="math inline">\(x\)</span> = <code>bty_avg</code> = 7.333. This value is computed using the intercept and slope in the previous regression table:</li>
+<li>Square: The <em>fitted value</em> <span class="math inline">\(\widehat{y}\)</span> is the value on the regression line for <span class="math inline">\(x\)</span> = <code>bty_avg</code> = 7.333. This value is computed using the intercept and slope in the previous regression table:</li>
 </ul>
 <p><span class="math display">\[\widehat{y} = b_0 + b_1 \cdot x = 3.88 + 0.067 \cdot 7.333 = 4.369\]</span></p>
 <ul>
 <li>Arrow: The length of this arrow is the <em>residual</em>  and is computed by subtracting the fitted value <span class="math inline">\(\widehat{y}\)</span> from the observed value <span class="math inline">\(y\)</span>. The residual can be thought of as a model’s error or “lack of fit” for a particular observation. In the case of this course’s instructor, it is <span class="math inline">\(y - \widehat{y}\)</span> = 4.9 - 4.369 = 0.531.</li>
 </ul>
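+<p>As a minimal sketch, the two calculations above can be reproduced in R. The coefficients here are the rounded values from the regression table, and the object names (<code>b0</code>, <code>b1</code>, and so on) are our own, chosen only for illustration. Because the inputs are rounded, the results differ slightly from the 4.369 and 0.531 reported above, which presumably come from the unrounded coefficients stored in <code>score_model</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Rounded intercept and slope from the regression table
+b0 &lt;- 3.88
+b1 &lt;- 0.067
+
+# Values for the instructor of the 21st course
+x &lt;- 7.333   # bty_avg
+y &lt;- 4.9     # observed teaching score
+
+# Fitted value on the regression line
+y_hat &lt;- b0 + b1 * x
+
+# Residual: observed value minus fitted value
+residual &lt;- y - y_hat
+
+round(c(y_hat = y_hat, residual = residual), 3)</code></pre>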
 <div class="figure" style="text-align: center"><span id="fig:numxplot4"></span>
-<img src="moderndive_files/figure-html/numxplot4-1.png" alt="Example of observed value, fitted value, and residual." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxplot4-1.png" alt="Example of observed value, fitted value, and residual." width="\textwidth" />
 <p class="caption">
 FIGURE 5.6: Example of observed value, fitted value, and residual.
 </p>
 </div>
-<p>Now say we want to compute both the fitted value <span class="math inline">\(\widehat{y} = b_0 + b_1 \cdot x\)</span> and the residual <span class="math inline">\(y - \widehat{y}\)</span> for <em>all</em> 463 courses in the study? Recall that each course corresponds to one of the 463 rows in the <code>evals_ch6</code> data frame and also one of the 463 points in the regression plot in Figure <a href="5-regression.html#fig:numxplot4">5.6</a>.</p>
-<p>We could repeat the previous calculations we performed by hand 463 times, but that would be tedious and time consuming. Instead, let’s do this using a computer with the <code>get_regression_points()</code> function. Just like the <code>get_regression_table()</code> function, the <code>get_regression_points()</code> function is a “wrapper” function. However, this function returns a different output. Let’s apply the <code>get_regression_points()</code> function to <code>score_model</code>, which is where we saved our <code>lm()</code> model in the previous section. In Table <a href="5-regression.html#tab:regression-points-1">5.4</a> we present the results of only the 21<sup>st</sup> through 24<sup>th</sup> courses for brevity’s sake.</p>
-<pre class="sourceCode r"><code class="sourceCode r">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model)
-regression_points</code></pre>
+<p>Now say we want to compute both the fitted value <span class="math inline">\(\widehat{y} = b_0 + b_1 \cdot x\)</span> and the residual <span class="math inline">\(y - \widehat{y}\)</span> for <em>all</em> 463 courses in the study. Recall that each course corresponds to one of the 463 rows in the <code>evals_ch5</code> data frame and also one of the 463 points in the regression plot in Figure <a href="5-regression.html#fig:numxplot4">5.6</a>.</p>
+<p>We could repeat the previous calculations we performed by hand 463 times, but that would be tedious and time consuming. Instead, let’s do this using a computer with the <code>get_regression_points()</code> function. Just like the <code>get_regression_table()</code> function, the <code>get_regression_points()</code> function is a “wrapper” function. However, this function returns a different output. Let’s apply the <code>get_regression_points()</code> function to <code>score_model</code>, which is where we saved our <code>lm()</code> model in the previous section. In Table <a href="5-regression.html#tab:regression-points-1">5.4</a> we present the results of only the 21st through 24th courses for brevity’s sake.</p>
+<div class="sourceCode" id="cb158"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb158-1" data-line-number="1">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model)</a>
+<a class="sourceLine" id="cb158-2" data-line-number="2">regression_points</a></code></pre></div>
 <table>
 <caption>
 <span id="tab:regression-points-1">TABLE 5.4: </span>Regression points (for only the 21st through 24th courses)
@@ -1151,19 +1171,19 @@ <h3><span class="header-section-number">5.1.3</span> Observed/fitted values and
 </table>
 <p>Let’s inspect the individual columns and match them with the elements of Figure <a href="5-regression.html#fig:numxplot4">5.6</a>:</p>
 <ul>
-<li>The <code>score</code> column represents the observed outcome variable <span class="math inline">\(y\)</span> i.e. the y-position of the 463 black points.</li>
-<li>The <code>bty_avg</code> column represents the values of the explanatory variable <span class="math inline">\(x\)</span> i.e. the x-position of the 463 black points.</li>
-<li>The <code>score_hat</code> column represents the fitted values <span class="math inline">\(\widehat{y}\)</span> i.e. the corresponding value on the regression line for the 463 <span class="math inline">\(x\)</span> values.</li>
-<li>The <code>residual</code> column represents the residuals <span class="math inline">\(y - \widehat{y}\)</span> i.e the 463 vertical distances between the 463 black points and the regression line.</li>
+<li>The <code>score</code> column represents the observed outcome variable <span class="math inline">\(y\)</span>. These are the y-positions of the 463 black points.</li>
+<li>The <code>bty_avg</code> column represents the values of the explanatory variable <span class="math inline">\(x\)</span>. These are the x-positions of the 463 black points.</li>
+<li>The <code>score_hat</code> column represents the fitted values <span class="math inline">\(\widehat{y}\)</span>. These are the corresponding values on the regression line for the 463 <span class="math inline">\(x\)</span> values.</li>
+<li>The <code>residual</code> column represents the residuals <span class="math inline">\(y - \widehat{y}\)</span>. These are the 463 vertical distances between the 463 black points and the regression line.</li>
 </ul>
-<p>Just as we did for the instructor of the 21st course in the <code>evals_ch6</code> dataset (in the first row of the table), let’s repeat the calculations for the instructor of the 24th course (in the fourth row of Table <a href="5-regression.html#tab:regression-points-1">5.4</a>):</p>
+<p>Just as we did for the instructor of the 21st course in the <code>evals_ch5</code> dataset (in the first row of the table), let’s repeat the calculations for the instructor of the 24th course (in the fourth row of Table <a href="5-regression.html#tab:regression-points-1">5.4</a>):</p>
 <ul>
 <li><code>score</code> = 4.4 is the observed teaching <code>score</code> <span class="math inline">\(y\)</span> for this course’s instructor.</li>
 <li><code>bty_avg</code> = 5.50 is the value of the explanatory variable <code>bty_avg</code> <span class="math inline">\(x\)</span> for this course’s instructor.</li>
 <li><code>score_hat</code> = 4.25 = 3.88 + 0.067 <span class="math inline">\(\cdot\)</span> 5.50 is the fitted value <span class="math inline">\(\widehat{y}\)</span> on the regression line for this course’s instructor.</li>
-<li><code>residual</code> = 0.153 = 4.4 - 4.25 is the value of the residual for this instructor. In other words, the model was off by 0.153 teaching score units for this course’s instructor.</li>
+<li><code>residual</code> = 0.153 = 4.4 - 4.25 is the value of the residual for this instructor. In other words, the model’s fitted value was off by 0.153 teaching score units for this course’s instructor.</li>
 </ul>
-<p>At this point we suggest you read Section <a href="5-regression.html#leastsquares">5.3.2</a>, where we define what we mean by “best-fitting” regression lines: of all possible lines we can draw through the points, it is the line that minimizes the <em>sum of squared residuals</em>.</p>
+<p>At this point, if you like, you can skip ahead to Subsection <a href="5-regression.html#leastsquares">5.3.2</a> to learn what makes a regression line “best-fitting.” As a primer, the “best-fitting” line is the one that minimizes the <em>sum of squared residuals</em> out of all possible lines we can draw through the points. In Section <a href="5-regression.html#model2">5.2</a>, we’ll discuss another common scenario: a categorical explanatory variable and a numerical outcome variable.</p>
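+<p>To make the idea of minimizing the sum of squared residuals concrete, here is a minimal sketch using a small made-up dataset. The data and the helper function <code>sum_sq_resid()</code> are hypothetical and are not part of the <code>moderndive</code> package. The sketch compares the sum of squared residuals for the line fitted by <code>lm()</code> against that of an arbitrary alternative line:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Small made-up dataset, for illustration only
+toy &lt;- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 2.9, 4.2, 4.8, 6.1))
+
+# Hypothetical helper: sum of squared residuals for a line with
+# intercept b0 and slope b1
+sum_sq_resid &lt;- function(b0, b1, data) {
+  fitted &lt;- b0 + b1 * data$x
+  sum((data$y - fitted)^2)
+}
+
+# Least-squares line fitted by lm()
+toy_model &lt;- lm(y ~ x, data = toy)
+b &lt;- unname(coef(toy_model))
+
+# Sum of squared residuals for the fitted line and for another line
+ssr_best &lt;- sum_sq_resid(b[1], b[2], toy)
+ssr_other &lt;- sum_sq_resid(1, 1, toy)
+
+# The lm() line should never do worse than the alternative
+ssr_best &lt;= ssr_other</code></pre>
+<p>Any other choice of intercept and slope gives a sum of squared residuals at least as large as the one for the <code>lm()</code> line.</p>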
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1177,50 +1197,43 @@ <h3><span class="header-section-number">5.1.3</span> Observed/fitted values and
 </div>
 <div id="model2" class="section level2">
 <h2><span class="header-section-number">5.2</span> One categorical explanatory variable</h2>
-<p>It’s an unfortunate truth that life expectancy is not the same across all countries in the world. International development agencies are very interested in studying these differences in life expectancy in the hopes of identifying where governments should allocate resources to address this problem. In this section, we’ll explore differences in life expectancy in two ways:</p>
+<p>It’s an unfortunate truth that life expectancy is not the same across all countries in the world. International development agencies are interested in studying these differences in life expectancy in the hopes of identifying where governments should allocate resources to address this problem. In this section, we’ll explore differences in life expectancy in two ways:</p>
 <ol style="list-style-type: decimal">
 <li>Differences between continents: Are there significant differences in average life expectancy between the five populated continents of the world: Africa, the Americas, Asia, Europe, and Oceania?</li>
 <li>Differences within continents: How does life expectancy vary within the world’s five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia?</li>
 </ol>
-<p>To answer such questions, we’ll use the <code>gapminder</code> data frame included in the <code>gapminder</code>  package. This dataset has international development statistics such as life expectancy, GDP per capita, and population for 142 countries for 5-year intervals between 1952 and 2007. Recall we visualized some of this data in Figure <a href="2-viz.html#fig:gapminder">2.1</a> in Subsection <a href="2-viz.html#gapminder">2.1.2</a> on the “Grammar of Graphics.”</p>
-<p>We’ll use this data for basic linear regression again, but now using an explanatory variable <span class="math inline">\(x\)</span> that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section <a href="5-regression.html#model1">5.1</a>.</p>
+<p>To answer such questions, we’ll use the <code>gapminder</code> data frame included in the <code>gapminder</code> package. This dataset has international development statistics such as life expectancy, GDP per capita, and population for 142 countries at 5-year intervals between 1952 and 2007. Recall we visualized some of this data in Figure <a href="2-viz.html#fig:gapminder">2.1</a> in Subsection <a href="2-viz.html#gapminder">2.1.2</a> on the grammar of graphics.</p>
+<p>We’ll again perform basic regression with this data, but now using an explanatory variable <span class="math inline">\(x\)</span> that is categorical, as opposed to the numerical explanatory variable we used in Section <a href="5-regression.html#model1">5.1</a>. The model has:</p>
 <ol style="list-style-type: decimal">
-<li>A numerical outcome variable <span class="math inline">\(y\)</span>, a country’s life expectancy and</li>
-<li>A single categorical explanatory variable <span class="math inline">\(x\)</span>, the continent the country is a part of.</li>
+<li>A numerical outcome variable <span class="math inline">\(y\)</span> (a country’s life expectancy) and</li>
+<li>A single categorical explanatory variable <span class="math inline">\(x\)</span> (the continent that the country is a part of).</li>
 </ol>
 <p>When the explanatory variable <span class="math inline">\(x\)</span> is categorical, the concept of a “best-fitting” regression line is a little different than the one we saw previously in Section <a href="5-regression.html#model1">5.1</a> where the explanatory variable <span class="math inline">\(x\)</span> was numerical. We’ll study these differences shortly in Subsection <a href="5-regression.html#model2table">5.2.2</a>, but first we conduct an exploratory data analysis.</p>
 <div id="model2EDA" class="section level3">
 <h3><span class="header-section-number">5.2.1</span> Exploratory data analysis</h3>
-<p>The data on the 142 countries can be found in the <code>gapminder</code> data frame included in the <code>gapminder</code> package. However, to keep things simple, let’s <code>filter()</code> for only those observations/rows corresponding to the year 2007, <code>select()</code> only the subset of the variables we’ll consider in this chapter. We’ll save this data in a new data frame called <code>gapminder2007</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(gapminder)
-gapminder2007 &lt;-<span class="st"> </span>gapminder <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">filter</span>(year <span class="op">==</span><span class="st"> </span><span class="dv">2007</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(country, lifeExp, continent, gdpPercap)</code></pre>
-<p>Recall from Section <a href="5-regression.html#model1EDA">5.1.1</a> that there are three common steps in an exploratory data analysis:</p>
-<ol style="list-style-type: decimal">
-<li>Most crucially: Looking at the raw data values.</li>
-<li>Computing summary statistics, like means, medians, and interquartile ranges.</li>
-<li>Creating data visualizations.</li>
-</ol>
-<p>Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. You can do this by using RStudio’s spreadsheet viewer or by using the <code>glimpse()</code> command as introduced in Section <a href="1-getting-started.html#exploredataframes">1.4.3</a> on exploring data frames:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(gapminder2007)</code></pre>
+<p>The data on the 142 countries can be found in the <code>gapminder</code> data frame included in the <code>gapminder</code> package. However, to keep things simple, let’s <code>filter()</code> for only those observations/rows corresponding to the year 2007. Additionally, let’s <code>select()</code> only the subset of the variables we’ll consider in this chapter. We’ll save this data in a new data frame called <code>gapminder2007</code>:</p>
+<div class="sourceCode" id="cb159"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb159-1" data-line-number="1"><span class="kw">library</span>(gapminder)</a>
+<a class="sourceLine" id="cb159-2" data-line-number="2">gapminder2007 &lt;-<span class="st"> </span>gapminder <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb159-3" data-line-number="3"><span class="st">  </span><span class="kw">filter</span>(year <span class="op">==</span><span class="st"> </span><span class="dv">2007</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb159-4" data-line-number="4"><span class="st">  </span><span class="kw">select</span>(country, lifeExp, continent, gdpPercap)</a></code></pre></div>
+<p>Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. You can do this by using RStudio’s spreadsheet viewer or by using the <code>glimpse()</code> command as introduced in Subsection <a href="1-getting-started.html#exploredataframes">1.4.3</a> on exploring data frames:</p>
+<div class="sourceCode" id="cb160"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb160-1" data-line-number="1"><span class="kw">glimpse</span>(gapminder2007)</a></code></pre></div>
 <pre><code>Observations: 142
 Variables: 4
 $ country   &lt;fct&gt; Afghanistan, Albania, Algeria, Angola, Argentina, Australia…
 $ lifeExp   &lt;dbl&gt; 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,…
 $ continent &lt;fct&gt; Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As…
 $ gdpPercap &lt;dbl&gt; 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33…</code></pre>
-<p>Observe that <code>Observations: 142</code> indicates that there are 142 rows/observations in <code>gapminder2007</code>, where each row corresponds to one country. In other words, the <em>observational unit</em> are individual countries. Furthermore, observe that the variable <code>continent</code> is of type <code>&lt;fct&gt;</code>, which stands for “factor,” which is R’s way of encoding categorical variables.</p>
-<p>A full description of all the variables included in <code>gapminder</code> can be found by reading the associated help file (run <code>?gapminder</code> in the console). However, let’s fully describe the 4 variables we selected in <code>gapminder2007</code>:</p>
+<p>Observe that <code>Observations: 142</code> indicates that there are 142 rows/observations in <code>gapminder2007</code>, where each row corresponds to one country. In other words, the <em>observational unit</em> is an individual country. Furthermore, observe that the variable <code>continent</code> is of type <code>&lt;fct&gt;</code>. This stands for <em>factor</em>, R’s way of encoding categorical variables.</p>
+<p>A full description of all the variables included in <code>gapminder</code> can be found by reading the associated help file (run <code>?gapminder</code> in the console). However, let’s fully describe only the 4 variables we selected in <code>gapminder2007</code>:</p>
 <ol style="list-style-type: decimal">
-<li><code>country</code>: An identification variable used to distinguish the 142 countries in the dataset.</li>
+<li><code>country</code>: An identification variable containing country names, stored as a factor (<code>&lt;fct&gt;</code>) as the <code>glimpse()</code> output shows, used to distinguish the 142 countries in the dataset.</li>
 <li><code>lifeExp</code>: A numerical variable of that country’s life expectancy at birth. This is the outcome variable <span class="math inline">\(y\)</span> of interest.</li>
-<li><code>continent</code>: A categorical variable with 5 levels i.e. possible categories: Africa, Asia, Americas, Europe, and Oceania. This is the explanatory variable <span class="math inline">\(x\)</span> of interest.</li>
-<li><code>gdpPercap</code>: A numerical variable of that country’s GDP per capita in US inflation-adjusted dollars that we’ll use as another outcome variable <span class="math inline">\(y\)</span> in the Learning Check at the end of this section.</li>
+<li><code>continent</code>: A categorical variable with five levels. Here “levels” correspond to the possible categories: Africa, Asia, Americas, Europe, and Oceania. This is the explanatory variable <span class="math inline">\(x\)</span> of interest.</li>
+<li><code>gdpPercap</code>: A numerical variable of that country’s GDP per capita in US inflation-adjusted dollars that we’ll use as another outcome variable <span class="math inline">\(y\)</span> in the <em>Learning check</em> at the end of this subsection.</li>
 </ol>
-<p>Furthermore, let’s look at a random sample of 5 out of the 142 countries in Table <a href="5-regression.html#tab:model2-data-preview">5.5</a>. Note due to the random nature of the sampling, you will likely end up with a different subset of 5 rows.</p>
-<pre class="sourceCode r"><code class="sourceCode r">gapminder2007 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</code></pre>
+<p>Let’s look at a random sample of five out of the 142 countries in Table <a href="5-regression.html#tab:model2-data-preview">5.5</a>.</p>
+<div class="sourceCode" id="cb162"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb162-1" data-line-number="1">gapminder2007 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model2-data-preview">TABLE 5.5: </span>Random sample of 5 out of 142 countries
@@ -1314,71 +1327,73 @@ <h3><span class="header-section-number">5.2.1</span> Exploratory data analysis</
 </tr>
 </tbody>
 </table>
-<p>Now that we’ve looked at the raw values in our <code>gapminder2007</code> data frame and got a sense of the data, let’s move on to computing summary statistics. Let’s once again apply the <code>skim()</code> function from the <code>skimr</code> package. Recall from our previous EDA that this function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our <code>gapminder2007</code> data frame, <code>select()</code> only the outcome and explanatory variables <code>lifeExp</code> and <code>continent</code>, and pipe them into the <code>skim()</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">gapminder2007 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(lifeExp, continent) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">skim</span>()</code></pre>
+<p>Note that random sampling will likely produce a different subset of 5 rows for you than what’s shown. Now that we’ve looked at the raw values in our <code>gapminder2007</code> data frame and got a sense of the data, let’s move on to computing summary statistics. Let’s once again apply the <code>skim()</code> function from the <code>skimr</code> package. Recall from our previous EDA that this function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our <code>gapminder2007</code> data frame, <code>select()</code> only the outcome and explanatory variables <code>lifeExp</code> and <code>continent</code>, and pipe them into the <code>skim()</code> function:</p>
+<div class="sourceCode" id="cb163"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb163-1" data-line-number="1">gapminder2007 <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb163-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(lifeExp, continent) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb163-3" data-line-number="3"><span class="st">  </span><span class="kw">skim</span>()</a></code></pre></div>
 <pre><code>Skim summary statistics
  n obs: 142 
  n variables: 2 
 
-── Variable type:factor ────────────────────────────────────────────────────────
+── Variable type:factor
   variable missing complete   n n_unique                         top_counts ordered
  continent       0      142 142        5 Afr: 52, Asi: 33, Eur: 30, Ame: 25   FALSE
 
-── Variable type:numeric ───────────────────────────────────────────────────────
+── Variable type:numeric
  variable missing complete   n  mean    sd    p0   p25   p50   p75 p100
   lifeExp       0      142 142 67.01 12.07 39.61 57.16 71.94 76.41 82.6</code></pre>
 <p>The <code>skim()</code> output now reports summaries for categorical variables (<code>Variable type:factor</code>) separately from the numerical variables (<code>Variable type:numeric</code>). For the categorical variable <code>continent</code>, it reports:</p>
 <ul>
-<li><code>missing</code>, <code>complete</code>, <code>n</code> which are the number of missing, complete, and total number of values as before.</li>
-<li><code>n_unique</code>: The number of unique levels to this variable, corresponding to Africa, Asia, Americas, Europe, and Oceania.</li>
-<li><code>top_counts</code>: In this case the top four counts: <code>Africa</code> has 52 countries, <code>Asia</code> has 33, <code>Europe</code> has 30, and <code>Americas</code> has 25. Not displayed is <code>Oceania</code> with 2 countries.</li>
-<li><code>ordered</code>: This tells us whether the categorical variable is “ordinal”: whether there is encoded hierarchy (like low, medium, high). In this case, <code>continent</code> is not ordered.</li>
+<li><code>missing</code>, <code>complete</code>, and <code>n</code>, which are, respectively, the number of missing values, the number of complete values, and the total number of values, as before.</li>
+<li><code>n_unique</code>: The number of unique levels of this variable. Here there are five, corresponding to Africa, Asia, Americas, Europe, and Oceania.</li>
+<li><code>top_counts</code>: In this case, the top four counts: <code>Africa</code> has 52 countries, <code>Asia</code> has 33, <code>Europe</code> has 30, and <code>Americas</code> has 25. Not displayed is <code>Oceania</code> with 2 countries.</li>
+<li><code>ordered</code>: This tells us whether the categorical variable is “ordinal”: whether there is an encoded hierarchy (like low, medium, high). In this case, <code>continent</code> is not ordered.</li>
 </ul>
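+<p>The difference between the number of levels and the counts within each level can be checked with base R functions. This is only a sketch on a small made-up factor. The vector <code>continent_toy</code> is hypothetical, and the same functions could be applied to <code>gapminder2007$continent</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Small made-up factor, for illustration only
+continent_toy &lt;- factor(c(&quot;Africa&quot;, &quot;Asia&quot;, &quot;Africa&quot;, &quot;Europe&quot;, &quot;Asia&quot;, &quot;Africa&quot;))
+
+# Number of unique levels (what skim() reports as n_unique)
+nlevels(continent_toy)
+
+# Number of observations in each level, largest first
+# (comparable to what skim() reports as top_counts)
+sort(table(continent_toy), decreasing = TRUE)
+
+# Whether the factor is ordered (what skim() reports as ordered)
+is.ordered(continent_toy)</code></pre>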
-<p>Turning our attention to the summary statistics of the numerical variable <code>lifeExp</code>, we observe that the global median life expectancy in 2007 was 71.94, or in other words, half of the world’s countries (71 countries) had a life expectancy less than 71.94. The mean life expectancy of 67.01 is lower however. Why is the mean life expectancy lower than the median?</p>
+<p>Turning our attention to the summary statistics of the numerical variable <code>lifeExp</code>, we observe that the global median life expectancy in 2007 was 71.94. Thus, half of the world’s countries (71 countries) had a life expectancy less than 71.94. The mean life expectancy of 67.01 is lower, however. Why is the mean life expectancy lower than the median?</p>
 <p>We can answer this question by performing the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s visualize the distribution of our outcome variable <span class="math inline">\(y\)</span> = <code>lifeExp</code> in Figure <a href="5-regression.html#fig:lifeExp2007hist">5.7</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> lifeExp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Life expectancy&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Number of countries&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Histogram of distribution of worldwide life expectancies&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb165"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb165-1" data-line-number="1"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> lifeExp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb165-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb165-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Life expectancy&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Number of countries&quot;</span>,</a>
+<a class="sourceLine" id="cb165-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Histogram of distribution of worldwide life expectancies&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:lifeExp2007hist"></span>
-<img src="moderndive_files/figure-html/lifeExp2007hist-1.png" alt="Histogram of Life Expectancy in 2007." width="\textwidth" />
+<img src="ModernDive_files/figure-html/lifeExp2007hist-1.png" alt="Histogram of life expectancy in 2007." width="\textwidth" />
 <p class="caption">
-FIGURE 5.7: Histogram of Life Expectancy in 2007.
+FIGURE 5.7: Histogram of life expectancy in 2007.
 </p>
 </div>
-<p>We see that this data is <em>left-skewed</em>, also known as <em>negatively</em>  skewed: there are a few countries with very low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers, hence the median is greater than the mean in this case.</p>
-<p>Remember however, that we want to compare life expectancies both between continents and within continents. In other words, our visualizations need to incorporate some notion of the variable <code>continent</code>. We can do this easily with a faceted histogram. Recall from Section <a href="2-viz.html#facets">2.6</a> that facets allow us to split a visualization by the different values of another variable. We display the resulting visualization in Figure <a href="5-regression.html#fig:catxplot0b">5.8</a> by adding a  <code>facet_wrap(~ continent, nrow = 2)</code> layer.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> lifeExp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Life expectancy&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Number of countries&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Histogram of distribution of worldwide life expectancies&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>continent, <span class="dt">nrow =</span> <span class="dv">2</span>)</code></pre>
+<p>We see that this data is <em>left-skewed</em>, also known as <em>negatively</em>  skewed: there are a few countries with low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers; hence, the median is greater than the mean in this case.</p>
+<p>Remember, however, that we want to compare life expectancies both between continents and within continents. In other words, our visualizations need to incorporate some notion of the variable <code>continent</code>. We can do this easily with a faceted histogram. Recall from Section <a href="2-viz.html#facets">2.6</a> that facets allow us to split a visualization by the different values of another variable. We display the resulting visualization in Figure <a href="5-regression.html#fig:catxplot0b">5.8</a> by adding a  <code>facet_wrap(~ continent, nrow = 2)</code> layer.</p>
+<div class="sourceCode" id="cb166"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb166-1" data-line-number="1"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> lifeExp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb166-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">5</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb166-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Life expectancy&quot;</span>, </a>
+<a class="sourceLine" id="cb166-4" data-line-number="4">       <span class="dt">y =</span> <span class="st">&quot;Number of countries&quot;</span>,</a>
+<a class="sourceLine" id="cb166-5" data-line-number="5">       <span class="dt">title =</span> <span class="st">&quot;Histogram of distribution of worldwide life expectancies&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb166-6" data-line-number="6"><span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>continent, <span class="dt">nrow =</span> <span class="dv">2</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:catxplot0b"></span>
-<img src="moderndive_files/figure-html/catxplot0b-1.png" alt="Life expectancy in 2007." width="\textwidth" />
+<img src="ModernDive_files/figure-html/catxplot0b-1.png" alt="Life expectancy in 2007." width="\textwidth" />
 <p class="caption">
 FIGURE 5.8: Life expectancy in 2007.
 </p>
 </div>
 <p>Observe that, unfortunately, the distribution of African life expectancies is much lower than those of the other continents, while in Europe life expectancies tend to be higher and, furthermore, do not vary as much. On the other hand, both Asia and Africa have the most variation in life expectancies. There is the least variation in Oceania, but keep in mind that there are only two countries in Oceania: Australia and New Zealand.</p>
 <p>Recall that an alternative method to visualize the distribution of a numerical variable split by a categorical variable is by using a side-by-side boxplot. We map the categorical variable <code>continent</code> to the <span class="math inline">\(x\)</span>-axis and the different life expectancies within each continent on the <span class="math inline">\(y\)</span>-axis in Figure <a href="5-regression.html#fig:catxplot1">5.9</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> continent, <span class="dt">y =</span> lifeExp)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Continent&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Life expectancy (years)&quot;</span>,
-       <span class="dt">title =</span> <span class="st">&quot;Life expectancy by continent&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb167"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb167-1" data-line-number="1"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> continent, <span class="dt">y =</span> lifeExp)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb167-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb167-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Continent&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Life expectancy&quot;</span>,</a>
+<a class="sourceLine" id="cb167-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Life expectancy by continent&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:catxplot1"></span>
-<img src="moderndive_files/figure-html/catxplot1-1.png" alt="Life expectancy in 2007." width="\textwidth" />
+<img src="ModernDive_files/figure-html/catxplot1-1.png" alt="Life expectancy in 2007." width="\textwidth" />
 <p class="caption">
 FIGURE 5.9: Life expectancy in 2007.
 </p>
 </div>
-<p>Some people prefer comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram. This is because we can make quick comparisons between the categorical variable’s levels with imaginary horizontal lines. For example, observe in Figure <a href="5-regression.html#fig:catxplot1">5.9</a> that we can quickly convince ourselves that Oceania has the highest median life expectancies by drawing an imaginary horizontal line at <span class="math inline">\(y\)</span> = 80. Furthermore, as we observed in the faceted histogram in Figure <a href="5-regression.html#fig:catxplot0b">5.8</a>, Africa and Asia have the largest variation in life expectancy as evidenced by their large interquartile ranges i.e. the heights of the boxes.</p>
-<p>It’s important to remember however that the solid lines in the middle of the boxes correspond to the medians (i.e. the middle value) rather than the mean (the average). So for example, if you look at Asia, the solid line denotes the median life expectancy of around 72 years. This tells us that half of all countries in Asia have a life expectancy below 72 years whereas half have a life expectancy above 72 years.</p>
+<p>Some people prefer comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram. This is because we can make quick comparisons between the categorical variable’s levels with imaginary horizontal lines. For example, observe in Figure <a href="5-regression.html#fig:catxplot1">5.9</a> that we can quickly convince ourselves that Oceania has the highest median life expectancy by drawing an imaginary horizontal line at <span class="math inline">\(y\)</span> = 80. Furthermore, as we observed in the faceted histogram in Figure <a href="5-regression.html#fig:catxplot0b">5.8</a>, Africa and Asia have the largest variation in life expectancy as evidenced by their large interquartile ranges (the heights of the boxes).</p>
+<p>It’s important to remember, however, that the solid lines in the middle of the boxes correspond to the medians (the middle values) rather than the means (the averages). So, for example, if you look at Asia, the solid line denotes the median life expectancy of around 72 years. This tells us that half of all countries in Asia have a life expectancy below 72 years, whereas half have a life expectancy above 72 years.</p>
 <p>Let’s compute the median and mean life expectancy for each continent with a little more data wrangling and display the results in Table <a href="5-regression.html#tab:catxplot0">5.6</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">lifeExp_by_continent &lt;-<span class="st"> </span>gapminder2007 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">group_by</span>(continent) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">median =</span> <span class="kw">median</span>(lifeExp), <span class="dt">mean =</span> <span class="kw">mean</span>(lifeExp))</code></pre>
+<div class="sourceCode" id="cb168"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb168-1" data-line-number="1">lifeExp_by_continent &lt;-<span class="st"> </span>gapminder2007 <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb168-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(continent) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb168-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">median =</span> <span class="kw">median</span>(lifeExp), </a>
+<a class="sourceLine" id="cb168-4" data-line-number="4">            <span class="dt">mean =</span> <span class="kw">mean</span>(lifeExp))</a></code></pre></div>
 <table>
 <caption>
 <span id="tab:catxplot0">TABLE 5.6: </span>Life expectancy by continent
@@ -1455,17 +1470,16 @@ <h3><span class="header-section-number">5.2.1</span> Exploratory data analysis</
 </tbody>
 </table>
 <p>Observe the order of the second column <code>median</code> life expectancy: Africa is lowest, the Americas and Asia are next with similar medians, then Europe, then Oceania. This ordering corresponds to the ordering of the solid black lines inside the boxes in our side-by-side boxplot in Figure <a href="5-regression.html#fig:catxplot1">5.9</a>.</p>
-<p>Let’s now turn our attention to the values in the third column <code>mean</code>. Using Africa’s mean life expectancy of 54.8 as a <em>baseline for comparison</em>, let’s start making relative comparisons to the life expectancies of the other four continents:</p>
+<p>Let’s now turn our attention to the values in the third column <code>mean</code>. Using Africa’s mean life expectancy of 54.8 as a <em>baseline for comparison</em>, let’s start making comparisons to the mean life expectancies of the other four continents and put these values in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>, which we’ll revisit later on in this section.</p>
 <ol style="list-style-type: decimal">
-<li>The mean life expectancy of the Americas is 73.6 - 54.8 = 18.8 years higher.</li>
-<li>The mean life expectancy of Asia is 70.7 - 54.8 = 15.9 years higher.</li>
-<li>The mean life expectancy of Europe is 77.6 - 54.8 = 22.8 years higher.</li>
-<li>The mean life expectancy of Oceania is 80.7 - 54.8 = 25.9 years higher.</li>
+<li>For the Americas, it is 73.6 - 54.8 = 18.8 years higher.</li>
+<li>For Asia, it is 70.7 - 54.8 = 15.9 years higher.</li>
+<li>For Europe, it is 77.6 - 54.8 = 22.8 years higher.</li>
+<li>For Oceania, it is 80.7 - 54.8 = 25.9 years higher.</li>
 </ol>
-<p>Let’s put these values Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>, which we’ll revisit later on in this section.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:continent-mean-life-expectancies">TABLE 5.7: </span>Mean life expectancy by continent and relative differences from mean for Africa.
+<span id="tab:continent-mean-life-expectancies">TABLE 5.7: </span>Mean life expectancy by continent and relative differences from mean for Africa
 </caption>
 <thead>
 <tr>
@@ -1543,29 +1557,21 @@ <h3><span class="header-section-number">5.2.1</span> Exploratory data analysis</
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC5.4)</strong> Conduct a new exploratory data analysis with the same explanatory variable <span class="math inline">\(x\)</span> being <code>continent</code> but with <code>gdpPercap</code> as the new outcome variable <span class="math inline">\(y\)</span>. Remember, this involves three things:</p>
-<ol style="list-style-type: decimal">
-<li>Most crucially: Looking at the raw data values.</li>
-<li>Computing summary statistics, such as means, medians, and interquartile ranges.</li>
-<li>Creating data visualizations.</li>
-</ol>
-<p>What can you say about the differences in GDP per capita between continents based on this exploration?</p>
+<p><strong>(LC5.4)</strong> Conduct a new exploratory data analysis with the same explanatory variable <span class="math inline">\(x\)</span> being <code>continent</code> but with <code>gdpPercap</code> as the new outcome variable <span class="math inline">\(y\)</span>. What can you say about the differences in GDP per capita between continents based on this exploration?</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="model2table" class="section level3">
 <h3><span class="header-section-number">5.2.2</span> Linear regression</h3>
-<p>In Subsection <a href="5-regression.html#model1table">5.1.2</a> we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable <span class="math inline">\(y\)</span> and a numerical explanatory variable <span class="math inline">\(x\)</span>. In our life expectancy example, we now instead have a categorical explanatory variable <span class="math inline">\(x\)</span> <code>continent</code>. Our model will not yield a “best-fitting” regression line like in Figure <a href="5-regression.html#fig:numxplot3">5.4</a>, but rather <em>offsets</em> relative to a baseline for comparison.</p>
-<p>As we did in Section <a href="5-regression.html#model1table">5.1.2</a> when studying the relationship between teaching scores and “beauty” scores, let’s output the regression table for this model. Recall that this is done in two steps:</p>
+<p>In Subsection <a href="5-regression.html#model1table">5.1.2</a> we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable <span class="math inline">\(y\)</span> and a numerical explanatory variable <span class="math inline">\(x\)</span>. In our life expectancy example, we now instead have a categorical explanatory variable <code>continent</code>. Our model will not yield a “best-fitting” regression line like in Figure <a href="5-regression.html#fig:numxplot3">5.4</a>, but rather <em>offsets</em> relative to a baseline for comparison.</p>
+<p>As we did in Subsection <a href="5-regression.html#model1table">5.1.2</a> when studying the relationship between teaching scores and “beauty” scores, let’s output the regression table for this model. Recall that this is done in two steps:</p>
 <ol style="list-style-type: decimal">
-<li>We first “fit” the linear regression model using the <code>lm(y~x, data)</code> function and save it in <code>lifeExp_model</code>.</li>
-<li>We get the regression table by applying the <code>get_regression_table()</code> from the <code>moderndive</code> package to <code>lifeExp_model</code>.</li>
+<li>We first “fit” the linear regression model using the <code>lm(y ~ x, data)</code> function and save it in <code>lifeExp_model</code>.</li>
+<li>We get the regression table by applying the <code>get_regression_table()</code> function from the <code>moderndive</code> package to <code>lifeExp_model</code>.</li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-lifeExp_model &lt;-<span class="st"> </span><span class="kw">lm</span>(lifeExp <span class="op">~</span><span class="st"> </span>continent, <span class="dt">data =</span> gapminder2007)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(lifeExp_model)</code></pre>
+<div class="sourceCode" id="cb169"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb169-1" data-line-number="1">lifeExp_model &lt;-<span class="st"> </span><span class="kw">lm</span>(lifeExp <span class="op">~</span><span class="st"> </span>continent, <span class="dt">data =</span> gapminder2007)</a>
+<a class="sourceLine" id="cb169-2" data-line-number="2"><span class="kw">get_regression_table</span>(lifeExp_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:catxplot4b">TABLE 5.8: </span>Linear regression table
@@ -1715,24 +1721,24 @@ <h3><span class="header-section-number">5.2.2</span> Linear regression</h3>
 </table>
 <p>Let’s once again focus on the values in the <code>term</code> and <code>estimate</code> columns of Table <a href="5-regression.html#tab:catxplot4b">5.8</a>. Why are there now 5 rows? Let’s break them down one-by-one:</p>
 <ol style="list-style-type: decimal">
-<li><code>intercept</code> here corresponds to the mean life expectancy of countries in Africa of 54.8 years.</li>
-<li><code>continentAmericas</code> corresponds to countries in the Americas and the value +18.8 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in the Americas is 54.8 + 18.8 = 73.6.</li>
-<li><code>continentAsia</code> corresponds to countries in Asia and the value +15.9 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in Asia is 54.8 + 15.9 = 70.7.</li>
-<li><code>continentEurope</code> corresponds to countries in Europe and the value +22.8 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in Europe is 54.8 + 22.8 = 77.6.</li>
-<li><code>continentOceania</code> corresponds to countries in Oceania and the value +25.9 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in the Oceania is 54.8 + 25.9 = 80.7.</li>
+<li><code>intercept</code> corresponds to the mean life expectancy of countries in Africa of 54.8 years.</li>
+<li><code>continentAmericas</code> corresponds to countries in the Americas and the value +18.8 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in the Americas is <span class="math inline">\(54.8 + 18.8 = 73.6\)</span>.</li>
+<li><code>continentAsia</code> corresponds to countries in Asia and the value +15.9 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in Asia is <span class="math inline">\(54.8 + 15.9 = 70.7\)</span>.</li>
+<li><code>continentEurope</code> corresponds to countries in Europe and the value +22.8 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in Europe is <span class="math inline">\(54.8 + 22.8 = 77.6\)</span>.</li>
+<li><code>continentOceania</code> corresponds to countries in Oceania and the value +25.9 is the same difference in mean life expectancy relative to Africa we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. In other words, the mean life expectancy of countries in Oceania is <span class="math inline">\(54.8 + 25.9 = 80.7\)</span>.</li>
 </ol>
 <p>To summarize, the 5 values in the <code>estimate</code> column in Table <a href="5-regression.html#tab:catxplot4b">5.8</a> correspond to the “baseline for comparison” continent Africa (the intercept) as well as four “offsets” from this baseline for the remaining 4 continents: the Americas, Asia, Europe, and Oceania.</p>
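+<p>As a quick check of this reading of the table, the following minimal sketch adds each offset to the intercept and compares the results with the continent means computed directly. It assumes <code>gapminder2007</code> can be rebuilt from the <code>gapminder</code> package by keeping only the 2007 rows; the names <code>means_from_model</code> and <code>means_direct</code> are hypothetical and used only for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(gapminder)
+# Assumption: the 2007 subset of gapminder stands in for gapminder2007
+gapminder2007 &lt;- gapminder %&gt;%
+  filter(year == 2007)
+lifeExp_model &lt;- lm(lifeExp ~ continent, data = gapminder2007)
+b &lt;- coef(lifeExp_model)
+# Intercept (baseline mean) plus each offset; the baseline's own offset is 0
+means_from_model &lt;- as.numeric(b[1] + c(0, b[-1]))
+# Continent means computed directly, in factor-level (alphabetical) order
+means_direct &lt;- as.numeric(tapply(gapminder2007$lifeExp, gapminder2007$continent, mean))
+# Compare the two sets of five means up to numerical tolerance
+all.equal(means_from_model, means_direct)</code></pre>
+<p>If the two vectors agree, the regression table is simply another way of presenting the five continent means.</p>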
-<p>You might be asking at this point why was Africa chosen as the “baseline for comparison” group. This is the case for no other reason than it comes first alphabetically of the five continents; by default R arranges factors/categorical variables in alphanumeric order. You can change this baseline group to be another continent if you manipulate the variable <code>continent</code>’s factor “levels” using the <code>forcats</code> package. See <a href="https://r4ds.had.co.nz/factors.html">Chapter 15</a> of Garrett Grolemund and Hadley Wickham’s book “R for Data Science” <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span> for examples.</p>
+<p>You might be asking at this point why Africa was chosen as the “baseline for comparison” group. This is the case for no other reason than it comes first alphabetically among the five continents; by default R arranges factors/categorical variables in alphanumeric order. You can change this baseline group to be another continent if you manipulate the variable <code>continent</code>’s factor “levels” using the <code>forcats</code> package. See <a href="https://r4ds.had.co.nz/factors.html">Chapter 15</a> of <em>R for Data Science</em> <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2017</a>)</span> for examples.</p>
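+<p>One way this could look in practice is sketched below, using <code>fct_relevel()</code> from <code>forcats</code> to move Europe to the front of the factor levels. This is only an illustration: the choice of Europe, the reconstruction of <code>gapminder2007</code> from the <code>gapminder</code> package, and the names <code>gapminder2007_europe</code> and <code>lifeExp_model_europe</code> are all assumptions that do not appear elsewhere in this chapter.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(forcats)
+library(gapminder)
+# Assumption: the 2007 subset of gapminder stands in for gapminder2007
+gapminder2007 &lt;- gapminder %&gt;%
+  filter(year == 2007)
+# Move &quot;Europe&quot; to the first factor level so that it becomes the baseline
+gapminder2007_europe &lt;- gapminder2007 %&gt;%
+  mutate(continent = fct_relevel(continent, &quot;Europe&quot;))
+# Refit the model; the offsets are now relative to Europe
+lifeExp_model_europe &lt;- lm(lifeExp ~ continent, data = gapminder2007_europe)
+coef(lifeExp_model_europe)</code></pre>
+<p>With Europe as the baseline, the intercept should correspond to Europe’s mean life expectancy, and the remaining four estimates become offsets relative to Europe rather than Africa.</p>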
 <p>Let’s now write the equation for our fitted values <span class="math inline">\(\widehat{y} = \widehat{\text{life exp}}\)</span>.</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{life exp}} &amp;= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\mbox{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\mbox{Asia}}(x) + \\
-&amp; \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\mbox{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\mbox{Ocean}}(x)\\
-&amp;= 54.8 + 18.8\cdot\mathbb{1}_{\mbox{Amer}}(x) + 15.9\cdot\mathbb{1}_{\mbox{Asia}}(x) + \\
-&amp; \qquad 22.8\cdot\mathbb{1}_{\mbox{Euro}}(x) + 25.9\cdot\mathbb{1}_{\mbox{Ocean}}(x)
+\widehat{y} = \widehat{\text{life exp}} &amp;= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\text{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\text{Asia}}(x) + \\
+&amp; \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\text{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\text{Ocean}}(x)\\
+&amp;= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x) + \\
+&amp; \qquad 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + 25.9\cdot\mathbb{1}_{\text{Ocean}}(x)
 \end{aligned}
 \]</span></p>
-<p>Whoa! That looks very daunting! Don’t fret however, as once you understand what all the elements mean, things simply greatly. First, <span class="math inline">\(\mathbb{1}_{A}(x)\)</span> is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where</p>
+<p>Whoa! That looks daunting! Don’t fret, however, as once you understand what all the elements mean, things simplify greatly. First, <span class="math inline">\(\mathbb{1}_{A}(x)\)</span> is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where</p>
 <p><span class="math display">\[
 \mathbb{1}_{A}(x) = \left\{
 \begin{array}{ll}
@@ -1740,19 +1746,19 @@ <h3><span class="header-section-number">5.2.2</span> Linear regression</h3>
 0 &amp; \text{otherwise} \end{array}
 \right.
 \]</span></p>
-<p>In a statistical modeling context this is also known as a <em>dummy variable</em>.  In our case, let’s consider the first such indicator variable <span class="math inline">\(\mathbb{1}_{\mbox{Amer}}(x)\)</span>. This indicator function returns 1 if a country is in the Americas, 0 otherwise:</p>
+<p>In a statistical modeling context, this is also known as a <em>dummy variable</em>.  In our case, let’s consider the first such indicator variable <span class="math inline">\(\mathbb{1}_{\text{Amer}}(x)\)</span>. This indicator function returns 1 if a country is in the Americas, 0 otherwise:</p>
 <p><span class="math display">\[
-\mathbb{1}_{\mbox{Amer}}(x) = \left\{
+\mathbb{1}_{\text{Amer}}(x) = \left\{
 \begin{array}{ll}
 1 &amp; \text{if } \text{country } x \text{ is in the Americas} \\
 0 &amp; \text{otherwise}\end{array}
 \right.
 \]</span></p>
-<p>Second, <span class="math inline">\(b_0\)</span> corresponds to the intercept as before; in this case it’s the mean life expectancy of all countries in Africa. Third, the <span class="math inline">\(b_{\text{Amer}}\)</span>, <span class="math inline">\(b_{\text{Asia}}\)</span>, <span class="math inline">\(b_{\text{Euro}}\)</span>, and <span class="math inline">\(b_{\text{Ocean}}\)</span> represent the 4 “offsets relative to the baseline for comparison” in the regression table output in Table <a href="5-regression.html#tab:catxplot4b">5.8</a>: <code>continentAmericas</code>, <code>continentAsia</code>, <code>continentEurope</code>, and <code>continentOceania</code>.</p>
-<p>Let’s put this all together and compute the fitted value <span class="math inline">\(\widehat{y} = \widehat{\text{life exp}}\)</span> for a country in Africa. Since the country is in Africa, all four indicator functions <span class="math inline">\(\mathbb{1}_{\mbox{Amer}}(x)\)</span>, <span class="math inline">\(\mathbb{1}_{\mbox{Asia}}(x)\)</span>, <span class="math inline">\(\mathbb{1}_{\mbox{Euro}}(x)\)</span>, and <span class="math inline">\(\mathbb{1}_{\mbox{Ocean}}(x)\)</span> will equal 0, and thus:</p>
+<p>Second, <span class="math inline">\(b_0\)</span> corresponds to the intercept as before; in this case, it’s the mean life expectancy of all countries in Africa. Third, the <span class="math inline">\(b_{\text{Amer}}\)</span>, <span class="math inline">\(b_{\text{Asia}}\)</span>, <span class="math inline">\(b_{\text{Euro}}\)</span>, and <span class="math inline">\(b_{\text{Ocean}}\)</span> represent the 4 “offsets relative to the baseline for comparison” in the regression table output in Table <a href="5-regression.html#tab:catxplot4b">5.8</a>: <code>continentAmericas</code>, <code>continentAsia</code>, <code>continentEurope</code>, and <code>continentOceania</code>.</p>
+<p>Let’s put this all together and compute the fitted value <span class="math inline">\(\widehat{y} = \widehat{\text{life exp}}\)</span> for a country in Africa. Since the country is in Africa, all four indicator functions <span class="math inline">\(\mathbb{1}_{\text{Amer}}(x)\)</span>, <span class="math inline">\(\mathbb{1}_{\text{Asia}}(x)\)</span>, <span class="math inline">\(\mathbb{1}_{\text{Euro}}(x)\)</span>, and <span class="math inline">\(\mathbb{1}_{\text{Ocean}}(x)\)</span> will equal 0, and thus:</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{\text{life exp}} &amp;= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\mbox{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\mbox{Asia}}(x)
+\widehat{\text{life exp}} &amp;= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\text{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\text{Asia}}(x)
 + \\
 &amp; \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\text{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\text{Ocean}}(x)\\
 &amp;= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x)
@@ -1762,32 +1768,33 @@ <h3><span class="header-section-number">5.2.2</span> Linear regression</h3>
 &amp;= 54.8
 \end{aligned}
 \]</span></p>
-<p>In other words, all that’s left is the intercept <span class="math inline">\(b_0\)</span>, corresponding to the average life expectancy of African countries of 54.8 years. Next, say we are considering a country in the Americas. In this case only the indicator function <span class="math inline">\(\mathbb{1}_{\mbox{Amer}}(x)\)</span> for the Americas will equal 1, while all the others will equal 0, and thus:</p>
+<p>In other words, all that’s left is the intercept <span class="math inline">\(b_0\)</span>, corresponding to the average life expectancy of African countries of 54.8 years. Next, say we are considering a country in the Americas. In this case, only the indicator function <span class="math inline">\(\mathbb{1}_{\text{Amer}}(x)\)</span> for the Americas will equal 1, while all the others will equal 0, and thus:</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{\text{life exp}} &amp;= 54.8 + 18.8\cdot\mathbb{1}_{\mbox{Amer}}(x) + 15.9\cdot\mathbb{1}_{\mbox{Asia}}(x)
-+ 22.8\cdot\mathbb{1}_{\mbox{Euro}}(x) + \\
-&amp; \qquad 25.9\cdot\mathbb{1}_{\mbox{Ocean}}(x)\\
+\widehat{\text{life exp}} &amp;= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x)
++ 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + \\
+&amp; \qquad 25.9\cdot\mathbb{1}_{\text{Ocean}}(x)\\
 &amp;= 54.8 + 18.8\cdot 1 + 15.9\cdot 0 + 22.8\cdot 0 + 25.9\cdot 0\\
-&amp;= 54.8 + 18.8\\
-&amp;= 73.6
+&amp;= 54.8 + 18.8 \\
+&amp; = 73.6
 \end{aligned}
 \]</span></p>
-<p>which is the mean life expectancy for countries in the Americas of 73.6 years we computed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. Note the “offset from the baseline for comparison” here is +18.8 years.</p>
-<p>Let’s do one more. Say we are considering a country in Asia. In this case only the indicator function <span class="math inline">\(\mathbb{1}_{\mbox{Asia}}(x)\)</span> for Asia will equal 1, while all the others will equal 0, and thus:</p>
+<p>which is the mean life expectancy for countries in the Americas of 73.6 years in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. Note the “offset from the baseline for comparison” is +18.8 years.</p>
+<p>Let’s do one more. Say we are considering a country in Asia. In this case, only the indicator function <span class="math inline">\(\mathbb{1}_{\text{Asia}}(x)\)</span> for Asia will equal 1, while all the others will equal 0, and thus:</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{\text{life exp}} &amp;= 54.8 + 18.8\cdot\mathbb{1}_{\mbox{Amer}}(x) + 15.9\cdot\mathbb{1}_{\mbox{Asia}}(x)
-+ 22.8\cdot\mathbb{1}_{\mbox{Euro}}(x) + \\
-&amp; \qquad 25.9\cdot\mathbb{1}_{\mbox{Ocean}}(x)\\
+\widehat{\text{life exp}} &amp;= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x)
++ 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + \\
+&amp; \qquad 25.9\cdot\mathbb{1}_{\text{Ocean}}(x)\\
 &amp;= 54.8 + 18.8\cdot 0 + 15.9\cdot 1 + 22.8\cdot 0 + 25.9\cdot 0\\
-&amp;= 54.8 + 15.9\\
-&amp;= 70.7
+&amp;= 54.8 + 15.9 \\
+&amp; = 70.7
 \end{aligned}
 \]</span></p>
-<p>which is the mean life expectancy for countries in Asia of 70.7 years we computed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. Note the “offset from the baseline for comparison” here is +15.9 years.</p>
-<p>Let’s generalize this idea a bit. If we fit a linear regression model using a categorical explanatory variable <span class="math inline">\(x\)</span> that has <span class="math inline">\(k\)</span> levels i.e. possible categories, the regression table will return an intercept and <span class="math inline">\(k - 1\)</span> “offsets.” In our case, since there are <span class="math inline">\(k = 5\)</span> continents, the regression model returns an intercept corresponding to the baseline for comparison group of Africa and <span class="math inline">\(k - 1 = 4\)</span> offsets corresponding to the Americas, Asia, Europe, and Oceania.</p>
-<p>Phew! That was a lot of work! Understanding a regression table output when you’re using a categorical explanatory variable is a topic those new to regression often struggle with. The only real remedy for these struggles is practice, practice, practice. However, once you equip yourselves with an understanding of how to create regression models using categorical explanatory variables, you’ll be able to incorporate many new variables into your models given the large amount of the world’s data that is categorical. If you feel like you’re still struggling at this point however, we suggest you closely compare Tables <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a> and <a href="5-regression.html#tab:catxplot4b">5.8</a> and note how you can compute all the values from one table using the values in the other.</p>
+<p>which is the mean life expectancy for Asian countries of 70.7 years in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a>. The “offset from the baseline for comparison” here is +15.9 years.</p>
+<p>Let’s generalize this idea a bit. If we fit a linear regression model using a categorical explanatory variable <span class="math inline">\(x\)</span> that has <span class="math inline">\(k\)</span> possible categories, the regression table will return an intercept and <span class="math inline">\(k - 1\)</span> “offsets.” In our case, since there are <span class="math inline">\(k = 5\)</span> continents, the regression model returns an intercept corresponding to the baseline for comparison group of Africa and <span class="math inline">\(k - 1 = 4\)</span> offsets corresponding to the Americas, Asia, Europe, and Oceania.</p>
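+<p>As a rough sketch of this generalization, we could write the fitted model for a categorical explanatory variable with <span class="math inline">\(k\)</span> categories as shown next. The symbols <span class="math inline">\(b_0\)</span> for the intercept and <span class="math inline">\(b_2, \ldots, b_k\)</span> for the offsets, along with the numbering of the non-baseline categories from 2 to <span class="math inline">\(k\)</span>, are our own assumptions for illustration:</p>
+<p><span class="math display">\[
+\widehat{y} = b_0 + b_2\cdot\mathbb{1}_{\text{category 2}}(x) + \cdots + b_k\cdot\mathbb{1}_{\text{category } k}(x)
+\]</span></p>
+<p>For an observation in the baseline category, every indicator function equals 0, so the fitted value is the intercept <span class="math inline">\(b_0\)</span>. For any other category, exactly one indicator function equals 1, so the fitted value is the intercept plus that category’s offset. Applying the same steps to the two remaining continents, using the rounded coefficients from the equations above, gives:</p>
+<p><span class="math display">\[
+\begin{aligned}
+\text{Europe: } \widehat{\text{life exp}} &amp;= 54.8 + 22.8 = 77.6\\
+\text{Oceania: } \widehat{\text{life exp}} &amp;= 54.8 + 25.9 = 80.7
+\end{aligned}
+\]</span></p>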
+<!--Phew! That was a lot of work! -->
+<p>Understanding a regression table output when you’re using a categorical explanatory variable is a topic those new to regression often struggle with. The only real remedy for these struggles is practice, practice, practice. However, once you equip yourselves with an understanding of how to create regression models using categorical explanatory variables, you’ll be able to incorporate many new variables into your models, given the large amount of the world’s data that is categorical. If you feel like you’re still struggling at this point, however, we suggest you closely compare Tables <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a> and <a href="5-regression.html#tab:catxplot4b">5.8</a> and note how you can compute all the values from one table using the values in the other.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1806,9 +1813,9 @@ <h3><span class="header-section-number">5.2.3</span> Observed/fitted values and
 <li>Fitted values <span class="math inline">\(\widehat{y}\)</span>, or the value on the regression line for a given <span class="math inline">\(x\)</span> value</li>
 <li>Residuals <span class="math inline">\(y - \widehat{y}\)</span>, or the error between the observed value and the fitted value</li>
 </ol>
-<p>We obtained these values and other values using the <code>get_regression_points()</code> function from the <code>moderndive</code> package. This time however, let’s add an <code>ID = &quot;country&quot;</code> argument: this is telling the function to use the variable <code>country</code> in <code>gapminder2007</code> as an <em>identification variable</em> in the output. This will help contextualize our analysis by matching values to countries.</p>
-<pre class="sourceCode r"><code class="sourceCode r">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(lifeExp_model, <span class="dt">ID =</span> <span class="st">&quot;country&quot;</span>)
-regression_points</code></pre>
+<p>We obtained these values and other values using the <code>get_regression_points()</code> function from the <code>moderndive</code> package. This time, however, let’s add an argument setting <code>ID = &quot;country&quot;</code>: this is telling the function to use the variable <code>country</code> in <code>gapminder2007</code> as an <em>identification variable</em> in the output. This will help contextualize our analysis by matching values to countries.</p>
+<div class="sourceCode" id="cb170"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb170-1" data-line-number="1">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(lifeExp_model, <span class="dt">ID =</span> <span class="st">&quot;country&quot;</span>)</a>
+<a class="sourceLine" id="cb170-2" data-line-number="2">regression_points</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model2-residuals">TABLE 5.9: </span>Regression points (First 10 out of 142 countries)
@@ -2005,15 +2012,15 @@ <h3><span class="header-section-number">5.2.3</span> Observed/fitted values and
 </tr>
 </tbody>
 </table>
-<p>Observe in Table <a href="5-regression.html#tab:model2-residuals">5.9</a> that <code>lifeExp_hat</code> are the fitted values <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{lifeexp}}\)</span>. If you look closely, there are only 5 possible values for <code>lifeExp_hat</code>. These correspond to the 5 mean life expectancies for the 5 continents that we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a> and computed using the values in the <code>estimate</code> column of the regression table in Table <a href="5-regression.html#tab:catxplot4b">5.8</a>.</p>
-<p>The <code>residual</code> column is simply <span class="math inline">\(y - \widehat{y}\)</span> = <code>lifeexp - lifeexp_hat</code>. These values can be interpreted as the deviation of a country’s life expectancy from its continent’s average life expectancy. For example, look at the first row of Table <a href="5-regression.html#tab:model2-residuals">5.9</a> corresponding to Afghanistan. The residual of <span class="math inline">\(y - \widehat{y}\)</span> = 43.8 - 70.7 = -26.9 is telling us that Afghanistan’s life expectancy is a whopping 26.9 years lower than the mean life expectancy of all Asian countries. This can in part be explained by the many years of war that country has suffered.</p>
+<p>Observe in Table <a href="5-regression.html#tab:model2-residuals">5.9</a> that <code>lifeExp_hat</code> contains the fitted values <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{lifeExp}}\)</span>. If you look closely, there are only five possible values for <code>lifeExp_hat</code>. These correspond to the five mean life expectancies for the five continents that we displayed in Table <a href="5-regression.html#tab:continent-mean-life-expectancies">5.7</a> and computed using the values in the <code>estimate</code> column of the regression table in Table <a href="5-regression.html#tab:catxplot4b">5.8</a>.</p>
+<p>The <code>residual</code> column is simply <span class="math inline">\(y - \widehat{y}\)</span> = <code>lifeExp - lifeExp_hat</code>. These values can be interpreted as the deviation of a country’s life expectancy from its continent’s average life expectancy. For example, look at the first row of Table <a href="5-regression.html#tab:model2-residuals">5.9</a> corresponding to Afghanistan. The residual of <span class="math inline">\(y - \widehat{y} = 43.8 - 70.7 = -26.9\)</span> is telling us that Afghanistan’s life expectancy is a whopping 26.9 years lower than the mean life expectancy of all Asian countries. This can in part be explained by the many years of war that country has suffered.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC5.6)</strong> Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter <a href="3-wrangling.html#wrangling">3</a>, identify the 5 countries with the 5 smallest (most negative) residuals? What do these negative residuals say about their life expectancy relative to their continents?</p>
-<p><strong>(LC5.7)</strong> Repeat this process, but identify the 5 countries with the 5 largest (most positive) residuals. What do these negative residuals say about their life expectancy relative to their continents?</p>
+<p><strong>(LC5.6)</strong> Using either the sorting functionality of RStudio’s spreadsheet viewer or the data wrangling tools you learned in Chapter <a href="3-wrangling.html#wrangling">3</a>, identify the five countries with the five smallest (most negative) residuals. What do these negative residuals say about their life expectancy relative to their continents’ life expectancy?</p>
+<p><strong>(LC5.7)</strong> Repeat this process, but identify the five countries with the five largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents’ life expectancy?</p>
 <div class="learncheck">
 
 </div>
@@ -2023,61 +2030,61 @@ <h3><span class="header-section-number">5.2.3</span> Observed/fitted values and
 <h2><span class="header-section-number">5.3</span> Related topics</h2>
 <div id="correlation-is-not-causation" class="section level3">
 <h3><span class="header-section-number">5.3.1</span> Correlation is not necessarily causation</h3>
-<p>Throughout this chapter we’ve been very cautious when interpreting regression slope coefficients. We always discussed the “associated” effect of an explanatory variable <span class="math inline">\(x\)</span> on an outcome variable <span class="math inline">\(y\)</span>. For example our statement from Subsection <a href="5-regression.html#model1table">5.1.2</a> that “for every increase of 1 unit in <code>bty_avg</code>, there is an <em>associated</em> increase of on average 0.067 units of <code>score</code>.” We include the term “associated” to be extra careful not suggest we are making a <em>causal</em> statement. So while “beauty” score <code>bty_avg</code> is positively correlated with teaching <code>score</code>, we can’t necessarily make any statements about “beauty” scores’ direct causal effect on teaching score without more information on how this study was conducted.</p>
-<p>Here is another example: a not-so-great medical doctor goes through their medical records and finds that patients who slept with their shoes on tended to wake up more with headaches. So this doctor declares “Sleeping with shoes on causes headaches!”</p>
+<p>Throughout this chapter we’ve been cautious when interpreting regression slope coefficients. We always discussed the “associated” effect of an explanatory variable <span class="math inline">\(x\)</span> on an outcome variable <span class="math inline">\(y\)</span>. For example, recall our statement from Subsection <a href="5-regression.html#model1table">5.1.2</a> that “for every increase of 1 unit in <code>bty_avg</code>, there is an <em>associated</em> increase of on average 0.067 units of <code>score</code>.” We include the term “associated” to be extra careful not to suggest we are making a <em>causal</em> statement. So while the “beauty” score <code>bty_avg</code> is positively correlated with teaching <code>score</code>, we can’t necessarily make any statements about “beauty” scores’ direct causal effect on teaching score without more information on how this study was conducted. Here is another example: a not-so-great medical doctor goes through medical records and finds that patients who slept with their shoes on tended to wake up with headaches more often. So this doctor declares, “Sleeping with shoes on causes headaches!”</p>
 <div class="figure" style="text-align: center"><span id="fig:moderndive-figure-causal-graph-2"></span>
-<img src="images/shutterstock/shoes_headache.png" alt="Does sleeping with shoes on cause headaches?" width="\textwidth" />
+<img src="images/shutterstock/shoes_headache.png" alt="Does sleeping with shoes on cause headaches?" width="60%" height="60%" />
 <p class="caption">
 FIGURE 5.10: Does sleeping with shoes on cause headaches?
 </p>
 </div>
-<p>However, there is a good chance that if someone is sleeping with their shoes on, it’s potentially likely because they are intoxicated from alcohol. Furthermore, higher levels of drinking leads to more hangovers, and hence more headaches. In this instance, the amount of alcohol consumption is what’s known as a <em>confounding/lurking</em> variable. It “lurks” behind the scenes, confounding the causal relationship (if any) of “sleeping with shoes on” with “waking up with a headache.” We can summarize this notion in Figure <a href="5-regression.html#fig:moderndive-figure-causal-graph">5.11</a> with a <em>causal graph</em> where:</p>
+<p>However, there is a good chance that if someone is sleeping with their shoes on, it’s because they are intoxicated from alcohol. Furthermore, higher levels of drinking lead to more hangovers, and hence more headaches. The amount of alcohol consumption here is what’s known as a <em>confounding/lurking</em> variable. It “lurks” behind the scenes, confounding the causal relationship (if any) between “sleeping with shoes on” and “waking up with a headache.” We can summarize this in Figure <a href="5-regression.html#fig:moderndive-figure-causal-graph">5.11</a> with a <em>causal graph</em> where:</p>
 <ul>
-<li>Y is a <em>response</em> variable; here “waking up with a headache.”</li>
-<li>X is a <em>treatment</em> variable whose causal effect we are interested in; here “sleeping with shoes on.”</li>
+<li>Y is a <em>response</em> variable; here it is “waking up with a headache.” </li>
+<li>X is a <em>treatment</em> variable whose causal effect we are interested in; here it is “sleeping with shoes on.”</li>
 </ul>
 <div class="figure" style="text-align: center"><span id="fig:moderndive-figure-causal-graph"></span>
-<img src="images/flowcharts/flowchart.009-cropped.png" alt="Causal graph." width="\textwidth" />
+<img src="images/flowcharts/flowchart.009-cropped.png" alt="Causal graph." width="50%" />
 <p class="caption">
 FIGURE 5.11: Causal graph.
 </p>
 </div>
-<p>To study the relationship between Y and X, we could use a regression model where the response variable is set to Y and the explanatory variable is set to be X, as you’ve been doing throughout this chapter. However, Figure <a href="5-regression.html#fig:moderndive-figure-causal-graph">5.11</a> also includes a third variable with arrows pointing at both X and Y:</p>
+<p>To study the relationship between Y and X, we could use a regression model where the outcome variable is set to Y and the explanatory variable is set to be X, as you’ve been doing throughout this chapter. However, Figure <a href="5-regression.html#fig:moderndive-figure-causal-graph">5.11</a> also includes a third variable with arrows pointing at both X and Y:</p>
 <ul>
-<li>Z is a <em>confounding</em> variable  that affects both X &amp; Y, thereby “confounding” their relationship. Here the confounding variable is alcohol.</li>
+<li>Z is a <em>confounding</em> variable  that affects both X and Y, thereby “confounding” their relationship. Here the confounding variable is alcohol.</li>
 </ul>
-<p>Alcohol will cause people to be both more likely to sleep with their shoes on as well as be more likely to wake up with a headache. Thus any regression model of the relationship between X and Y should also use Z as an explanatory variable. In other words, our doctor needs to take into account who had been drinking the night before. In the next chapter we’ll start covering multiple regression models that allow us to incorporate more than one variable in our regression models.</p>
-<p>Establishing causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of potential confounding variables. Both these approaches attempt to, as best they can, either take all possible confounding variables into account or negate their impact. This allows researchers to focus only on the relationship of interest: the relationship between the response variable Y and the treatment variable X.</p>
-<p>As you read news stories, be careful to not fall into the trap of thinking the correlation necessarily implies causation. Check out <a href="http://www.tylervigen.com/spurious-correlations">Spurious Correlations</a> for some rather comical examples of variables that are correlated, but are definitely not causally related.</p>
+<p>Alcohol will cause people to be both more likely to sleep with their shoes on as well as be more likely to wake up with a headache. Thus any regression model of the relationship between X and Y should also use Z as an explanatory variable. In other words, our doctor needs to take into account who had been drinking the night before. In the next chapter, we’ll start covering multiple regression models that allow us to incorporate more than one variable in our regression models.</p>
+<p>Establishing causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of confounding variables. Both these approaches attempt, as best they can, either to take all possible confounding variables into account or to negate their impact. This allows researchers to focus only on the relationship of interest: the relationship between the outcome variable Y and the treatment variable X.</p>
+<p>As you read news stories, be careful not to fall into the trap of thinking that correlation necessarily implies causation. Check out the <a href="http://www.tylervigen.com/spurious-correlations">Spurious Correlations</a> website for some rather comical examples of variables that are correlated, but are definitely not causally related.</p>
 </div>
 <div id="leastsquares" class="section level3">
 <h3><span class="header-section-number">5.3.2</span> Best-fitting line</h3>
-<p>Regression lines are also known as “best-fitting” lines. But what do we mean by “best”? Let’s unpack the criteria that is used in regression to determine “best.” Recall Figure <a href="5-regression.html#fig:numxplot4">5.6</a>, where for an instructor with a beauty score of <span class="math inline">\(x\)</span> = 7.333 we mark with the <em>observed value</em> <span class="math inline">\(y\)</span> with a circle, the <em>fitted value</em> <span class="math inline">\(\widehat{y}\)</span> with a square, and the <em>residual</em> <span class="math inline">\(y - \widehat{y}\)</span> with an arrow.</p>
-<p>We re-display Figure <a href="5-regression.html#fig:numxplot4">5.6</a> in the top-left plot of Figure <a href="5-regression.html#fig:best-fitting-line">5.12</a>. Furthermore, let’s repeat this for three more arbitrarily chosen course’s instructors:</p>
-<ol style="list-style-type: decimal">
-<li>A course whose instructor had a “beauty” score <span class="math inline">\(x\)</span> = 2.333 and teaching score <span class="math inline">\(y\)</span> = 2.7. The residual in this case is 2.7 - 4.036 = -1.336, which we mark with a new blue arrow in the top-right plot.</li>
-<li>A course whose instructor had a “beauty” score <span class="math inline">\(x\)</span> = 3.667 and teaching score <span class="math inline">\(y\)</span> = 4.4. The residual in this case is 4.4 - 4.125 = 0.2753, which we mark with a new blue arrow in the bottom-left plot.</li>
-<li>A course whose instructor had a “beauty” score <span class="math inline">\(x\)</span> = 6 and teaching score <span class="math inline">\(y\)</span> = 3.8. The residual in this case is 3.8 - 4.28 = -0.4802, which we mark with a new blue arrow in the bottom-right plot.</li>
-</ol>
+<p>Regression lines are also known as “best-fitting” lines. But what do we mean by “best”? Let’s unpack the criterion that is used in regression to determine “best.” Recall Figure <a href="5-regression.html#fig:numxplot4">5.6</a>, where for an instructor with a beauty score of <span class="math inline">\(x = 7.333\)</span> we mark the <em>observed value</em> <span class="math inline">\(y\)</span> with a circle, the <em>fitted value</em> <span class="math inline">\(\widehat{y}\)</span> with a square, and the <em>residual</em> <span class="math inline">\(y - \widehat{y}\)</span> with an arrow. We re-display Figure <a href="5-regression.html#fig:numxplot4">5.6</a> in the top-left plot of Figure <a href="5-regression.html#fig:best-fitting-line">5.12</a> in addition to three more arbitrarily chosen course instructors:</p>
 <div class="figure" style="text-align: center"><span id="fig:best-fitting-line"></span>
-<img src="moderndive_files/figure-html/best-fitting-line-1.png" alt="Example of observed value, fitted value, and residual." width="\textwidth" />
+<img src="ModernDive_files/figure-html/best-fitting-line-1.png" alt="Example of observed value, fitted value, and residual." width="\textwidth" />
 <p class="caption">
 FIGURE 5.12: Example of observed value, fitted value, and residual.
 </p>
 </div>
-<p>Now say we repeated this process of computing residuals for all 463 courses’ instructors, then we squared all the residuals, and then we summed them. We call this quantity the <em>sum of squared residuals</em> and it is a measure of the “lack of fit” of a model. Larger values of the sum of squared residuals indicate a bigger “lack of fit,” in other words a worse fitting model.</p>
-<p>If the regression line perfectly fits all the points perfectly, then the sum of squared residuals is 0. This is because if the regression line fits all the points perfectly, then the fitted value <span class="math inline">\(\widehat{y}\)</span> equals the observed value <span class="math inline">\(y\)</span> in all cases, and hence the residual <span class="math inline">\(y-\widehat{y}\)</span> = 0 in all cases, and the sum of a large number of 0’s is still 0.</p>
+<p>The three other plots refer to:</p>
+<ol style="list-style-type: decimal">
+<li>A course whose instructor had a “beauty” score <span class="math inline">\(x = 2.333\)</span> and teaching score <span class="math inline">\(y = 2.7\)</span>. The residual in this case is <span class="math inline">\(2.7 - 4.036 = -1.336\)</span>, which we mark with a new blue arrow in the top-right plot.</li>
+<li>A course whose instructor had a “beauty” score <span class="math inline">\(x = 3.667\)</span> and teaching score <span class="math inline">\(y = 4.4\)</span>. The residual in this case is <span class="math inline">\(4.4 - 4.125 = 0.2753\)</span>, which we mark with a new blue arrow in the bottom-left plot.</li>
+<li>A course whose instructor had a “beauty” score <span class="math inline">\(x = 6\)</span> and teaching score <span class="math inline">\(y = 3.8\)</span>. The residual in this case is <span class="math inline">\(3.8 - 4.28 = -0.4802\)</span>, which we mark with a new blue arrow in the bottom-right plot.</li>
+</ol>
+<p>Now say we repeated this process of computing residuals for all 463 courses’ instructors, then we squared all the residuals, and then we summed them. We call this quantity the <em>sum of squared residuals</em>; it is a measure of the <em>lack of fit</em> of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse fitting model.</p>
+<p>If the regression line fits all the points perfectly, then the sum of squared residuals is 0. This is because if the regression line fits all the points perfectly, then the fitted value <span class="math inline">\(\widehat{y}\)</span> equals the observed value <span class="math inline">\(y\)</span> in all cases, and hence the residual <span class="math inline">\(y-\widehat{y}\)</span> = 0 in all cases, and the sum of even a large number of 0’s is still 0.</p>
 <p>Furthermore, of all possible lines we can draw through the cloud of 463 points, the regression line minimizes this value. In other words, the regression line and its corresponding fitted values <span class="math inline">\(\widehat{y}\)</span> minimize the sum of the squared residuals:</p>
 <p><span class="math display">\[
 \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2
 \]</span></p>
 <p>Let’s use our data wrangling tools from Chapter <a href="3-wrangling.html#wrangling">3</a> to compute the sum of squared residuals exactly:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch6)
-
-<span class="co"># Get regression points:</span>
-regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model)
-regression_points</code></pre>
+<div class="sourceCode" id="cb171"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb171-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb171-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, </a>
+<a class="sourceLine" id="cb171-3" data-line-number="3">                  <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb171-4" data-line-number="4"></a>
+<a class="sourceLine" id="cb171-5" data-line-number="5"><span class="co"># Get regression points:</span></a>
+<a class="sourceLine" id="cb171-6" data-line-number="6">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model)</a>
+<a class="sourceLine" id="cb171-7" data-line-number="7">regression_points</a></code></pre></div>
 <pre><code># A tibble: 463 x 5
       ID score bty_avg score_hat residual
    &lt;int&gt; &lt;dbl&gt;   &lt;dbl&gt;     &lt;dbl&gt;    &lt;dbl&gt;
@@ -2092,28 +2099,28 @@ <h3><span class="header-section-number">5.3.2</span> Best-fitting line</h3>
  9     9   3.4    3.33      4.10   -0.702
 10    10   4.5    3.17      4.09    0.409
 # … with 453 more rows</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Compute sum of squared residuals</span>
-regression_points <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">squared_residuals =</span> residual<span class="op">^</span><span class="dv">2</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sum_of_squared_residuals =</span> <span class="kw">sum</span>(squared_residuals))</code></pre>
+<div class="sourceCode" id="cb173"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb173-1" data-line-number="1"><span class="co"># Compute sum of squared residuals</span></a>
+<a class="sourceLine" id="cb173-2" data-line-number="2">regression_points <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb173-3" data-line-number="3"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">squared_residuals =</span> residual<span class="op">^</span><span class="dv">2</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb173-4" data-line-number="4"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sum_of_squared_residuals =</span> <span class="kw">sum</span>(squared_residuals))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   sum_of_squared_residuals
                      &lt;dbl&gt;
 1                     132.</code></pre>
-<p>Any other line drawn in the figure would yield a sum of squared residuals greater than 132. This is a mathematically guaranteed fact that you can prove using calculus and linear algebra. That’s why alternative names for the linear regression line are the <em>best-fitting line</em> as well as the <em>least-squares line</em>. Why do we square the residuals (i.e. the arrow lengths)? We do this so that both positive and negative deviations of the same amount are treated equally.</p>
+<p>Any other straight line drawn in the figure would yield a sum of squared residuals greater than 132. This is a mathematically guaranteed fact that you can prove using calculus and linear algebra. That’s why alternative names for the linear regression line are the <em>best-fitting line</em> and the <em>least-squares line</em>. Why do we square the residuals (i.e., the arrow lengths)? So that both positive and negative deviations of the same amount are treated equally.</p>
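+<p>For readers curious about the calculus argument, here is a brief sketch. The symbols <span class="math inline">\(b_0\)</span> and <span class="math inline">\(b_1\)</span> for a candidate intercept and slope, <span class="math inline">\(\bar{x}\)</span> and <span class="math inline">\(\bar{y}\)</span> for the sample means, and the shorthand SSR are notation we assume here for illustration. Any straight line has fitted values <span class="math inline">\(\widehat{y}_i = b_0 + b_1 x_i\)</span>, so the sum of squared residuals is a function of <span class="math inline">\(b_0\)</span> and <span class="math inline">\(b_1\)</span>. Setting its partial derivatives with respect to <span class="math inline">\(b_0\)</span> and <span class="math inline">\(b_1\)</span> equal to 0 and solving gives the standard least-squares solution:</p>
+<p><span class="math display">\[
+\begin{aligned}
+\text{SSR}(b_0, b_1) &amp;= \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2\\
+b_1 &amp;= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\\
+b_0 &amp;= \bar{y} - b_1\bar{x}
+\end{aligned}
+\]</span></p>
+<p>As a small illustration of why squaring treats deviations equally, a residual of <span class="math inline">\(+1\)</span> and a residual of <span class="math inline">\(-1\)</span> each contribute <span class="math inline">\((\pm 1)^2 = 1\)</span> to the sum, whereas without squaring they would cancel to 0.</p>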
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC5.8)</strong> Note in the following plot there are 3 points marked with black dots along with:</p>
+<p><strong>(LC5.8)</strong> Note that in Figure <a href="5-regression.html#fig:three-lines">5.13</a> there are 3 points marked with dots, along with:</p>
 <ul>
-<li>The “best” fitting regression line in blue</li>
-<li>An arbitrarily chosen line in dashed red</li>
-<li>Another arbitrarily chosen line in dashed green</li>
+<li>The “best” fitting solid regression line in blue</li>
+<li>An arbitrarily chosen dotted red line</li>
+<li>Another arbitrarily chosen dashed green line</li>
 </ul>
 <div class="figure" style="text-align: center"><span id="fig:three-lines"></span>
-<img src="moderndive_files/figure-html/three-lines-1.png" alt="Regression line and two others." width="80%" />
+<img src="ModernDive_files/figure-html/three-lines-1.png" alt="Regression line and two others." width="85%" />
 <p class="caption">
 FIGURE 5.13: Regression line and two others.
 </p>
@@ -2127,18 +2134,18 @@ <h3><span class="header-section-number">5.3.2</span> Best-fitting line</h3>
 <h3><span class="header-section-number">5.3.3</span> <code>get_regression_x()</code> functions</h3>
 <p>Recall in this chapter we introduced two functions from the <code>moderndive</code> package:</p>
 <ol style="list-style-type: decimal">
-<li><code>get_regression_table()</code> function that returns a regression table in Subsection <a href="5-regression.html#model1table">5.1.2</a> and the</li>
-<li><code>get_regression_points()</code> function that returns point-by-point information from a regression model in Subsection <a href="5-regression.html#model1points">5.1.3</a>.</li>
+<li><code>get_regression_table()</code>, introduced in Subsection <a href="5-regression.html#model1table">5.1.2</a>, which returns a regression table, and</li>
+<li><code>get_regression_points()</code>, introduced in Subsection <a href="5-regression.html#model1points">5.1.3</a>, which returns point-by-point information from a regression model.</li>
 </ol>
-<p>What is going on behind the scenes with the <code>get_regression_table()</code> and <code>get_regression_points()</code> functions? We mentioned in Section <a href="5-regression.html#model1table">5.1.2</a> that these were examples of <em>wrapper functions</em>. Such functions take other pre-existing functions and “wrap” them into single functions that hide the user from their inner workings. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection we’ll “get under the hood” of these functions and see how the “engine” of these wrapper functions work.</p>
+<p>What is going on behind the scenes with the <code>get_regression_table()</code> and <code>get_regression_points()</code> functions? We mentioned in Subsection <a href="5-regression.html#model1table">5.1.2</a> that these were examples of <em>wrapper functions</em>. Such functions take other pre-existing functions and “wrap” them into single functions that hide their inner workings from the user. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection, we’ll “get under the hood” of these functions and see how the “engine” of these wrapper functions works.</p>
 <p>Recall our two-step process to generate a regression table from Subsection <a href="5-regression.html#model1table">5.1.2</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch6)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model)</code></pre>
+<div class="sourceCode" id="cb175"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb175-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb175-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb175-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb175-4" data-line-number="4"><span class="kw">get_regression_table</span>(score_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:recall-table">TABLE 5.10: </span>Regression table.
+<span id="tab:recall-table">TABLE 5.10: </span>Regression table
 </caption>
 <thead>
 <tr>
@@ -2214,23 +2221,22 @@ <h3><span class="header-section-number">5.3.3</span> <code>get_regression_x()</c
 </tr>
 </tbody>
 </table>
-<p>The <code>get_regression_table()</code> wrapper function takes two pre-existing functions in other R packages</p>
+<p>The <code>get_regression_table()</code> wrapper function takes two pre-existing functions in other R packages:</p>
 <ul>
-<li>the <code>tidy()</code>  function from the <a href="https://broom.tidyverse.org/"><code>broom</code> package</a> <span class="citation">(Robinson and Hayes <a href="#ref-R-broom">2019</a>)</span> and</li>
-<li>the <code>clean_names()</code>  function from the <a href="https://github.com/sfirke/janitor"><code>janitor</code> package</a> <span class="citation">(Firke <a href="#ref-R-janitor">2019</a>)</span></li>
+<li><code>tidy()</code>  from the <a href="https://broom.tidyverse.org/"><code>broom</code> package</a> <span class="citation">(Robinson and Hayes <a href="#ref-R-broom">2019</a>)</span> and</li>
+<li><code>clean_names()</code>  from the <a href="https://github.com/sfirke/janitor"><code>janitor</code> package</a> <span class="citation">(Firke <a href="#ref-R-janitor">2019</a>)</span></li>
 </ul>
-<p>and “wraps” them into a single function that takes in a saved <code>lm()</code> linear model model, here <code>score_model</code>, and returns a regression table saved as a “tidy” data frame. Here is how we used the <code>tidy()</code> and <code>clean_names()</code> functions:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(broom)
-<span class="kw">library</span>(janitor)
-score_model <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">tidy</span>(<span class="dt">conf.int =</span> <span class="ot">TRUE</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">mutate_if</span>(is.numeric, round, <span class="dt">digits =</span> <span class="dv">3</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">clean_names</span>() <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">lower_ci =</span> conf_low,
-         <span class="dt">upper_ci =</span> conf_high)</code></pre>
+<p>and “wraps” them into a single function that takes in a saved <code>lm()</code> linear model, here <code>score_model</code>, and returns a regression table saved as a “tidy” data frame. Here is how we used the <code>tidy()</code> and <code>clean_names()</code> functions to produce Table <a href="5-regression.html#tab:regtable-broom">5.11</a>:</p>
+<div class="sourceCode" id="cb176"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb176-1" data-line-number="1"><span class="kw">library</span>(broom)</a>
+<a class="sourceLine" id="cb176-2" data-line-number="2"><span class="kw">library</span>(janitor)</a>
+<a class="sourceLine" id="cb176-3" data-line-number="3">score_model <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb176-4" data-line-number="4"><span class="st">  </span><span class="kw">tidy</span>(<span class="dt">conf.int =</span> <span class="ot">TRUE</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb176-5" data-line-number="5"><span class="st">  </span><span class="kw">mutate_if</span>(is.numeric, round, <span class="dt">digits =</span> <span class="dv">3</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb176-6" data-line-number="6"><span class="st">  </span><span class="kw">clean_names</span>() <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb176-7" data-line-number="7"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">lower_ci =</span> conf_low, <span class="dt">upper_ci =</span> conf_high)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:regtable-broom">TABLE 5.11: </span>Regression table using tidy() from broom package.
+<span id="tab:regtable-broom">TABLE 5.11: </span>Regression table using tidy() from broom package
 </caption>
 <thead>
 <tr>
@@ -2306,18 +2312,18 @@ <h3><span class="header-section-number">5.3.3</span> <code>get_regression_x()</c
 </tr>
 </tbody>
 </table>
-<p>Yikes! That’s a lot of code! So in order to simplify your lives, we made the editorial decision to “wrap” all the code into <code>get_regression_table()</code>, freeing you from the need to understand the inner workings of the function. Note that the <code>mutate_if()</code> function is from the <code>dplyr</code> package and applies the <code>round()</code> function to 3 significant digits precision only to those variables that are numerical.</p>
-<p>Similarly, the <code>get_regression_points()</code> function is another wrapper function, but this time returning information the individual points involved in a regression model like the fitted values, observed values, and the residuals. <code>get_regression_points()</code>  uses the <code>augment()</code>  function in the <a href="https://broom.tidyverse.org/"><code>broom</code> package</a> instead of the <code>tidy()</code> function as with <code>get_regression_table()</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(broom)
-<span class="kw">library</span>(janitor)
-score_model <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">augment</span>() <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">mutate_if</span>(is.numeric, round, <span class="dt">digits =</span> <span class="dv">3</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">clean_names</span>() <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(<span class="op">-</span><span class="kw">c</span>(<span class="st">&quot;se_fit&quot;</span>, <span class="st">&quot;hat&quot;</span>, <span class="st">&quot;sigma&quot;</span>, <span class="st">&quot;cooksd&quot;</span>, <span class="st">&quot;std_resid&quot;</span>))</code></pre>
+<p>Yikes! That’s a lot of code! So, in order to simplify your lives, we made the editorial decision to “wrap” all the code into <code>get_regression_table()</code>, freeing you from the need to understand the inner workings of the function. Note that the <code>mutate_if()</code> function is from the <code>dplyr</code> package and applies the <code>round()</code> function, rounding to three decimal places, only to those variables that are numerical.</p>
+<p>Similarly, the <code>get_regression_points()</code> function is another wrapper function, but this time it returns information about the individual points involved in a regression model, such as the fitted values, observed values, and residuals. <code>get_regression_points()</code> uses the <code>augment()</code> function in the <a href="https://broom.tidyverse.org/"><code>broom</code> package</a>, instead of the <code>tidy()</code> function used by <code>get_regression_table()</code>, to produce the data shown in Table <a href="5-regression.html#tab:regpoints-augment">5.12</a>:</p>
+<div class="sourceCode" id="cb177"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb177-1" data-line-number="1"><span class="kw">library</span>(broom)</a>
+<a class="sourceLine" id="cb177-2" data-line-number="2"><span class="kw">library</span>(janitor)</a>
+<a class="sourceLine" id="cb177-3" data-line-number="3">score_model <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb177-4" data-line-number="4"><span class="st">  </span><span class="kw">augment</span>() <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb177-5" data-line-number="5"><span class="st">  </span><span class="kw">mutate_if</span>(is.numeric, round, <span class="dt">digits =</span> <span class="dv">3</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb177-6" data-line-number="6"><span class="st">  </span><span class="kw">clean_names</span>() <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb177-7" data-line-number="7"><span class="st">  </span><span class="kw">select</span>(<span class="op">-</span><span class="kw">c</span>(<span class="st">&quot;se_fit&quot;</span>, <span class="st">&quot;hat&quot;</span>, <span class="st">&quot;sigma&quot;</span>, <span class="st">&quot;cooksd&quot;</span>, <span class="st">&quot;std_resid&quot;</span>))</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:regpoints-augment">TABLE 5.12: </span>Regression points using augment() from broom package.
+<span id="tab:regpoints-augment">TABLE 5.12: </span>Regression points using augment() from broom package
 </caption>
 <thead>
 <tr>
@@ -2487,17 +2493,18 @@ <h2><span class="header-section-number">5.4</span> Conclusion</h2>
 <div id="additional-resources-basic-regression" class="section level3">
 <h3><span class="header-section-number">5.4.1</span> Additional resources</h3>
 <p>An R script file of all R code used in this chapter is available <a href="scripts/05-regression.R">here</a>.</p>
-<p>As we suggested in Subsection <a href="5-regression.html#model1EDA">5.1.1</a>, interpreting coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the following 80’s-style video game called “Guess the correlation” at <a href="http://guessthecorrelation.com/" class="uri">http://guessthecorrelation.com/</a>.</p>
+<p>As we suggested in Subsection <a href="5-regression.html#model1EDA">5.1.1</a>, interpreting correlation coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the 80s-style video game called “Guess the Correlation” at <a href="http://guessthecorrelation.com/" class="uri">http://guessthecorrelation.com/</a>.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:guess-the-correlation"></span>
-<img src="images/copyright/guess_the_correlation.png" alt="Preview of &quot;Guess the Correlation&quot; Game." width="70%" />
+<img src="images/copyright/guess_the_correlation.png" alt="Preview of “Guess the Correlation” game." width="70%" />
 <p class="caption">
-FIGURE 5.14: Preview of “Guess the Correlation” Game.
+FIGURE 5.14: Preview of “Guess the Correlation” game.
 </p>
 </div>
 </div>
 <div id="whats-to-come-4" class="section level3">
 <h3><span class="header-section-number">5.4.2</span> What’s to come?</h3>
-<p>In this chapter, you’ve studied what term “basic regression,” where you fit models that only have one explanatory variable. In Chapter <a href="6-multiple-regression.html#multiple-regression">6</a>, we’ll study <em>multiple regression</em>, where our regression models can now have more than one explanatory variable! In particular, we’ll consider two scenarios: regression models with one numerical and one categorical explanatory variable and regression models with two numerical explanatory variables. This will allow you to construct more sophisticated and more powerful models, all in the hopes of better explaining your outcome variable <span class="math inline">\(y\)</span>.</p>
+<p>In this chapter, you’ve studied what we term <em>basic regression</em>, where you fit models that have only one explanatory variable. In Chapter <a href="6-multiple-regression.html#multiple-regression">6</a>, we’ll study <em>multiple regression</em>, where our regression models can now have more than one explanatory variable! In particular, we’ll consider two scenarios: regression models with one numerical and one categorical explanatory variable and regression models with two numerical explanatory variables. This will allow you to construct more sophisticated and more powerful models, all in the hopes of better explaining your outcome variable <span class="math inline">\(y\)</span>.</p>
 
 </div>
 </div>
@@ -2508,16 +2515,19 @@ <h3>References</h3>
 <p>Firke, Sam. 2019. <em>Janitor: Simple Tools for Examining and Cleaning Dirty Data</em>. <a href="https://CRAN.R-project.org/package=janitor">https://CRAN.R-project.org/package=janitor</a>.</p>
 </div>
 <div id="ref-rds2016">
-<p>Grolemund, Garrett, and Hadley Wickham. 2016. <em>R for Data Science</em>. <a href="http://r4ds.had.co.nz/">http://r4ds.had.co.nz/</a>.</p>
+<p>Grolemund, Garrett, and Hadley Wickham. 2017. <em>R for Data Science</em>. First. Sebastopol, CA: O’Reilly Media. <a href="https://r4ds.had.co.nz/">https://r4ds.had.co.nz/</a>.</p>
+</div>
+<div id="ref-islr2017">
+<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. <em>An Introduction to Statistical Learning: With Applications in R</em>. First. New York, NY: Springer.</p>
 </div>
 <div id="ref-R-skimr">
-<p>Quinn, Michael, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2019. <em>Skimr: Compact and Flexible Summaries of Data</em>. <a href="https://CRAN.R-project.org/package=skimr">https://CRAN.R-project.org/package=skimr</a>.</p>
+<p>Quinn, Michael, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2019. <em>Skimr: Compact and Flexible Summaries of Data</em>. <a href="https://github.com/ropenscilabs/skimr">https://github.com/ropenscilabs/skimr</a>.</p>
 </div>
 <div id="ref-R-broom">
 <p>Robinson, David, and Alex Hayes. 2019. <em>Broom: Convert Statistical Analysis Objects into Tidy Tibbles</em>. <a href="https://CRAN.R-project.org/package=broom">https://CRAN.R-project.org/package=broom</a>.</p>
 </div>
 <div id="ref-R-tidyverse">
-<p>Wickham, Hadley. 2017. <em>Tidyverse: Easily Install and Load the ’Tidyverse’</em>. <a href="https://CRAN.R-project.org/package=tidyverse">https://CRAN.R-project.org/package=tidyverse</a>.</p>
+<p>Wickham, Hadley. 2019b. <em>Tidyverse: Easily Install and Load the ’Tidyverse’</em>. <a href="https://CRAN.R-project.org/package=tidyverse">https://CRAN.R-project.org/package=tidyverse</a>.</p>
 </div>
 </div>
             </section>
@@ -2531,11 +2541,13 @@ <h3>References</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -2543,12 +2555,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -2563,6 +2574,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -2579,8 +2594,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/6-multiple-regression.html b/docs/6-multiple-regression.html
index 0efc4ba56..4fa0bdcde 100644
--- a/docs/6-multiple-regression.html
+++ b/docs/6-multiple-regression.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 6 Multiple Regression | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 6 Multiple Regression | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 6 Multiple Regression | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="5-regression.html">
-<link rel="next" href="7-sampling.html">
+<link rel="prev" href="5-regression.html"/>
+<link rel="next" href="7-sampling.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,8 +583,8 @@ <h1>
 </html>
 <div id="multiple-regression" class="section level1">
 <h1><span class="header-section-number">Chapter 6</span> Multiple Regression</h1>
-<p>In Chapter <a href="5-regression.html#regression">5</a> we introduced ideas related to modeling for explanation, in particular that the goal of modeling is to make explicit the relationship between some outcome variable <span class="math inline">\(y\)</span> and some explanatory variable <span class="math inline">\(x\)</span>. While there are many approaches to modeling, we focused on one particular technique: <em>linear regression</em>, one of the most commonly-used and easy-to-understand approaches to modeling. Furthermore to keep things simple we only considered models with one explanatory <span class="math inline">\(x\)</span> variable that was either numerical in Section <a href="5-regression.html#model1">5.1</a> or categorical in Section <a href="5-regression.html#model2">5.2</a>.</p>
-<p>In this chapter on multiple regression, we’ll start considering models that include more than one explanatory variable <span class="math inline">\(x\)</span>. You can imagine when trying to model a particular outcome variable, like teaching evaluation scores as in Section <a href="5-regression.html#model1">5.1</a> or life expectancy as in Section <a href="5-regression.html#model2">5.2</a>, that it would be very useful to include more than just one explanatory variable’s worth of information.</p>
+<p>In Chapter <a href="5-regression.html#regression">5</a> we introduced ideas related to modeling for explanation, in particular that the goal of modeling is to make explicit the relationship between some outcome variable <span class="math inline">\(y\)</span> and some explanatory variable <span class="math inline">\(x\)</span>. While there are many approaches to modeling, we focused on one particular technique: <em>linear regression</em>, one of the most commonly used and easy-to-understand approaches to modeling. Furthermore, to keep things simple, we only considered models with one explanatory <span class="math inline">\(x\)</span> variable that was either numerical in Section <a href="5-regression.html#model1">5.1</a> or categorical in Section <a href="5-regression.html#model2">5.2</a>.</p>
+<p>In this chapter on multiple regression, we’ll start considering models that include more than one explanatory variable <span class="math inline">\(x\)</span>. You can imagine when trying to model a particular outcome variable, like teaching evaluation scores as in Section <a href="5-regression.html#model1">5.1</a> or life expectancy as in Section <a href="5-regression.html#model2">5.2</a>, that it would be useful to include more than just one explanatory variable’s worth of information.</p>
 <p>Since our regression models will now consider more than one explanatory variable, the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables included in your model. Let’s begin!</p>
 <div id="needed-packages-4" class="section level3 unnumbered">
 <h3>Needed packages</h3>
@@ -584,37 +597,37 @@ <h3>Needed packages</h3>
 <li>As well as the more advanced <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages</li>
 </ul>
 <p>If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(moderndive)
-<span class="kw">library</span>(skimr)
-<span class="kw">library</span>(ISLR)</code></pre>
+<div class="sourceCode" id="cb178"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb178-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb178-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb178-3" data-line-number="3"><span class="kw">library</span>(skimr)</a>
+<a class="sourceLine" id="cb178-4" data-line-number="4"><span class="kw">library</span>(ISLR)</a></code></pre></div>
 </div>
 <div id="model4" class="section level2">
-<h2><span class="header-section-number">6.1</span> One numerical &amp; one categorical explanatory variable</h2>
-<p>Let’s revisit the instructor evaluation data from UT Austin we introduced in Section <a href="5-regression.html#model1">5.1</a>. We studied the relationship between teaching evaluation scores as given by students and “beauty” scores.The variable teaching <code>score</code> was the numerical outcome variable <span class="math inline">\(y\)</span> and the variable “beauty” score <code>bty_avg</code> was the numerical explanatory <span class="math inline">\(x\)</span> variable.</p>
-<p>In this section, we are going to consider a different model. Our outcome variable will still be teaching score, but now we’ll now include two different explanatory variables: age and gender. Could it be that instructors who are older receive better teaching evaluations from students? Or could it instead be that younger instructors receive better evaluations? Are there differences in evaluations given by students for instructors of different genders? We’ll answer these questions by modeling the relationship between these variables using <em>multiple regression</em>, where we have:</p>
+<h2><span class="header-section-number">6.1</span> One numerical and one categorical explanatory variable</h2>
+<p>Let’s revisit the instructor evaluation data from UT Austin we introduced in Section <a href="5-regression.html#model1">5.1</a>. We studied the relationship between teaching evaluation scores as given by students and “beauty” scores. The variable teaching <code>score</code> was the numerical outcome variable <span class="math inline">\(y\)</span>, and the variable “beauty” score (<code>bty_avg</code>) was the numerical explanatory <span class="math inline">\(x\)</span> variable.</p>
+<p>In this section, we are going to consider a different model. Our outcome variable will still be teaching score, but we’ll now include two different explanatory variables: age and (binary) gender. Could it be that instructors who are older receive better teaching evaluations from students? Or could it instead be that younger instructors receive better evaluations? Are there differences in evaluations given by students for instructors of different genders? We’ll answer these questions by modeling the relationship between these variables using <em>multiple regression</em>, where we have:</p>
 <ol style="list-style-type: decimal">
-<li>A numerical outcome variable <span class="math inline">\(y\)</span> the instructor’s teaching score and</li>
+<li>A numerical outcome variable <span class="math inline">\(y\)</span>, the instructor’s teaching score, and</li>
 <li>Two explanatory variables:
 <ol style="list-style-type: decimal">
-<li>A numerical explanatory variable <span class="math inline">\(x_1\)</span>, the instructor’s age</li>
-<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>, the instructor’s gender.</li>
+<li>A numerical explanatory variable <span class="math inline">\(x_1\)</span>, the instructor’s age.</li>
+<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>, the instructor’s (binary) gender.</li>
 </ol></li>
 </ol>
-<p>It is important to note that at the time of this study due to then commonly held beliefs about gender, this variable was often recorded as a binary variable. While the results of a model that oversimplifies gender this way may be imperfect, we still found the results to be very pertinent and relevant today.</p>
+<p>It is important to note that at the time of this study, due to then commonly held beliefs about gender, this variable was often recorded as a binary variable. While the results of a model that oversimplifies gender this way may be imperfect, we still found the results to be pertinent and relevant today.</p>
 <div id="model4EDA" class="section level3">
 <h3><span class="header-section-number">6.1.1</span> Exploratory data analysis</h3>
-<p>Recall that data on the 463 courses at UT Austin can be found in the <code>evals</code> data frame included in the <code>moderndive</code> package. However, to keep things simple, let’s <code>select()</code> only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called <code>eval_ch7</code>. Note that these are different than the variables chosen in Chapter 6.</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch7 &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(ID, score, age, gender)</code></pre>
-<p>Recall the three common steps in an exploratory data analysis we saw in Section <a href="5-regression.html#model1EDA">5.1.1</a>:</p>
+<p>Recall that data on the 463 courses at UT Austin can be found in the <code>evals</code> data frame included in the <code>moderndive</code> package. However, to keep things simple, let’s <code>select()</code> only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called <code>evals_ch6</code>. Note that these are different than the variables chosen in Chapter <a href="5-regression.html#regression">5</a>.</p>
+<div class="sourceCode" id="cb179"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb179-1" data-line-number="1">evals_ch6 &lt;-<span class="st"> </span>evals <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb179-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(ID, score, age, gender)</a></code></pre></div>
+<p>Recall the three common steps in an exploratory data analysis we saw in Subsection <a href="5-regression.html#model1EDA">5.1.1</a>:</p>
 <ol style="list-style-type: decimal">
 <li>Looking at the raw data values.</li>
 <li>Computing summary statistics.</li>
 <li>Creating data visualizations.</li>
 </ol>
-<p>Let’s first look at the raw data values by either looking at <code>evals_ch7</code> using RStudio’s spreadsheet viewer or by using the <code>glimpse()</code> function from the <code>dplyr</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(evals_ch7)</code></pre>
+<p>Let’s first look at the raw data values by either looking at <code>evals_ch6</code> using RStudio’s spreadsheet viewer or by using the <code>glimpse()</code> function from the <code>dplyr</code> package:</p>
+<div class="sourceCode" id="cb180"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb180-1" data-line-number="1"><span class="kw">glimpse</span>(evals_ch6)</a></code></pre></div>
 <pre><code>Observations: 463
 Variables: 4
 $ ID     &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
@@ -622,8 +635,7 @@ <h3><span class="header-section-number">6.1.1</span> Exploratory data analysis</
 $ age    &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…
 $ gender &lt;fct&gt; female, female, female, female, male, male, male, male, male, …</code></pre>
 <p>Let’s also display a random sample of 5 rows of the 463 rows corresponding to different courses in Table <a href="6-multiple-regression.html#tab:model4-data-preview">6.1</a>. Remember that, due to the random nature of the sampling, you will likely end up with a different subset of 5 rows.</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch7 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</code></pre>
+<div class="sourceCode" id="cb182"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb182-1" data-line-number="1">evals_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model4-data-preview">TABLE 6.1: </span>A random sample of 5 out of the 463 courses at UT Austin
@@ -717,55 +729,53 @@ <h3><span class="header-section-number">6.1.1</span> Exploratory data analysis</
 </tr>
 </tbody>
 </table>
-<p>Now that we’ve looked at the raw values in our <code>evals_ch7</code> data frame and got a sense of the data, let’s computing summary statistics. As we did in our exploratory data analyses in Sections <a href="5-regression.html#model1EDA">5.1.1</a> and <a href="5-regression.html#model2EDA">5.2.1</a> from the previous chapter, let’s use the <code>skim()</code> function from the <code>skimr</code> package, being sure to only <code>select()</code> the variables of interest in our model:</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch7 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(score, age, gender) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">skim</span>()</code></pre>
+<p>Now that we’ve looked at the raw values in our <code>evals_ch6</code> data frame and got a sense of the data, let’s compute summary statistics. As we did in our exploratory data analyses in Sections <a href="5-regression.html#model1EDA">5.1.1</a> and <a href="5-regression.html#model2EDA">5.2.1</a> from the previous chapter, let’s use the <code>skim()</code> function from the <code>skimr</code> package, being sure to only <code>select()</code> the variables of interest in our model:</p>
+<div class="sourceCode" id="cb183"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb183-1" data-line-number="1">evals_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(score, age, gender) <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">skim</span>()</a></code></pre></div>
 <pre><code>Skim summary statistics
  n obs: 463 
  n variables: 3 
 
-── Variable type:factor ────────────────────────────────────────────────────────
+── Variable type:factor 
  variable missing complete   n n_unique                top_counts ordered
    gender       0      463 463        2 mal: 268, fem: 195, NA: 0   FALSE
 
-── Variable type:integer ───────────────────────────────────────────────────────
+── Variable type:integer 
  variable missing complete   n  mean  sd p0 p25 p50 p75 p100
       age       0      463 463 48.37 9.8 29  42  48  57   73
 
-── Variable type:numeric ───────────────────────────────────────────────────────
+── Variable type:numeric
  variable missing complete   n mean   sd  p0 p25 p50 p75 p100
     score       0      463 463 4.17 0.54 2.3 3.8 4.3 4.6    5</code></pre>
-<p>Observe for example that we have no missing data, that there are 268 courses taught by male instructors and 195 courses taught by female instructors, and that the average instructor age is 48.37. Recall however that each row of our data represents a particular course and that the same instructor often teaches more than one course. Therefore the average age of the unique instructors may differ.</p>
-<p>Furthermore, let’s compute the correlation coefficient between our two numerical variables: <code>score</code> and <code>age</code>. Recall from Section <a href="5-regression.html#model1EDA">5.1.1</a> that correlation coefficients only exist between numerical variables. We observe that they are “weakly negatively” correlated.</p>
-<pre class="sourceCode r"><code class="sourceCode r">evals_ch7 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_correlation</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>age)</code></pre>
+<p>Observe that we have no missing data, that there are 268 courses taught by male instructors and 195 courses taught by female instructors, and that the average instructor age is 48.37. Recall that each row represents a particular course and that the same instructor often teaches more than one course. Therefore, the average age of the unique instructors may differ.</p>
+<p>Furthermore, let’s compute the correlation coefficient between our two numerical variables: <code>score</code> and <code>age</code>. Recall from Subsection <a href="5-regression.html#model1EDA">5.1.1</a> that correlation coefficients only exist between numerical variables. We observe that they are “weakly negatively” correlated.</p>
+<div class="sourceCode" id="cb185"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb185-1" data-line-number="1">evals_ch6 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb185-2" data-line-number="2"><span class="st">  </span><span class="kw">get_correlation</span>(<span class="dt">formula =</span> score <span class="op">~</span><span class="st"> </span>age)</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
-  correlation
-        &lt;dbl&gt;
-1      -0.107</code></pre>
-<p>Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Given that the outcome variable <code>score</code> and explanatory variable <code>age</code> are both numerical, we’ll use a scatterplot to display their relationship. How can we incorporate the categorical variable <code>gender</code> however? By mapping the variable <code>gender</code> to the color aesthetic, thereby creating a <em>colored</em> scatterplot. The following code is very similar to the code that created the scatterplot of teaching score over “beauty” score in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>, but with <code>color = gender</code> added to the <code>aes()</code>thetic mapping.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(evals_ch7, <span class="kw">aes</span>(<span class="dt">x =</span> age, <span class="dt">y =</span> score, <span class="dt">color =</span> gender)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Age&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;Gender&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</code></pre>
+     cor
+   &lt;dbl&gt;
+1 -0.107</code></pre>
+<p>Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Given that the outcome variable <code>score</code> and explanatory variable <code>age</code> are both numerical, we’ll use a scatterplot to display their relationship. How can we incorporate the categorical variable <code>gender</code>, however? By <code>mapping</code> the variable <code>gender</code> to the <code>color</code> aesthetic, thereby creating a <em>colored</em> scatterplot. The following code is similar to the code that created the scatterplot of teaching score over “beauty” score in Figure <a href="5-regression.html#fig:numxplot1">5.2</a>, but with <code>color = gender</code> added to the <code>aes()</code>thetic mapping.</p>
+<div class="sourceCode" id="cb187"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb187-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> age, <span class="dt">y =</span> score, <span class="dt">color =</span> gender)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb187-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb187-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Age&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;Gender&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb187-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxcatxplot1"></span>
-<img src="moderndive_files/figure-html/numxcatxplot1-1.png" alt="Colored scatterplot of relationship of teaching and beauty scores." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxcatxplot1-1.png" alt="Colored scatterplot of relationship of teaching score and age." width="\textwidth" />
 <p class="caption">
 FIGURE 6.1: Colored scatterplot of relationship of teaching score and age.
 </p>
 </div>
-<p>In the resulting Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, observe that <code>ggplot</code> assigns a default red/blue color scheme to the points and to the lines associated with the two levels of <code>gender</code>: <code>female</code> and <code>male</code>. Furthermore the <code>geom_smooth(method = &quot;lm&quot;, se = FALSE)</code> layer automatically fits a different regression line for each group.</p>
-<p>We notice some interesting trends. First, there are almost no women faculty over the age of 60 as evidenced by lack of red dots above <span class="math inline">\(x\)</span> = 60. Second, while both regression lines are negatively sloped with age (i.e. older instructors tend to have lower scores), the slope for age for the female instructors is <em>more</em> negative. In other words, female instructors are paying a harsher penalty for their age than the male instructors.</p>
+<p>In the resulting Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, observe that <code>ggplot()</code> assigns a default red/blue color scheme to the points and to the lines associated with the two levels of <code>gender</code>: <code>female</code> and <code>male</code>. Furthermore, the <code>geom_smooth(method = &quot;lm&quot;, se = FALSE)</code> layer automatically fits a different regression line for each group.</p>
+<p>We notice some interesting trends. First, there are almost no women faculty over the age of 60 as evidenced by the lack of red dots to the right of <span class="math inline">\(x\)</span> = 60. Second, while both regression lines are negatively sloped with age (i.e., older instructors tend to have lower scores), the slope for age for the female instructors is <em>more</em> negative. In other words, female instructors are paying a harsher penalty for advanced age than the male instructors.</p>
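+<p>One way to check what “a different regression line for each group” means is to fit a separate simple regression to each gender’s courses and compare the two slopes for age. The following is only a minimal sketch: the object names <code>fit_female</code> and <code>fit_male</code> are hypothetical and do not appear elsewhere in this book, and <code>evals_ch6</code> is re-created so that the code runs on its own.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(moderndive)
+library(dplyr)
+
+# Re-create the data frame used in this chapter
+evals_ch6 &lt;- evals %&gt;%
+  select(ID, score, age, gender)
+
+# Fit one simple linear regression per gender (hypothetical object names)
+fit_female &lt;- lm(score ~ age, data = filter(evals_ch6, gender == &quot;female&quot;))
+fit_male &lt;- lm(score ~ age, data = filter(evals_ch6, gender == &quot;male&quot;))
+
+# Intercept and slope for age of each group's line
+coef(fit_female)
+coef(fit_male)</code></pre>
+<p>Both slopes for age should come out negative, with the one for the female instructors being the more negative of the two, in line with what the plot suggests.</p>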
 </div>
 <div id="model4interactiontable" class="section level3">
 <h3><span class="header-section-number">6.1.2</span> Interaction model</h3>
 <p>Let’s now quantify the relationship of our outcome variable <span class="math inline">\(y\)</span> and the two explanatory variables using one type of multiple regression model known as an <em>interaction model</em>.  We’ll explain where the term “interaction” comes from at the end of this section.</p>
-<p>In particular, we’ll write out the equation of the two regression lines in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> using the values from a regression table. Before we do this however, let’s go over a brief refresher of regression when you have a categorical explanatory variable <span class="math inline">\(x\)</span>.</p>
-<p>Recall in Section <a href="5-regression.html#model2table">5.2.2</a> we fit a regression model for countries’ life expectancies as a function of which continent the country was in. In other words, we had a numerical outcome variable <span class="math inline">\(y\)</span> = <code>lifeExp</code> and a categorical explanatory variable <span class="math inline">\(x\)</span> = <code>continent</code> which had 5 levels: <code>Africa</code>, <code>Americas</code>, <code>Asia</code>, <code>Europe</code>, and <code>Oceania</code>. Let’s re-display the regression table you saw in Table <a href="5-regression.html#tab:catxplot4b">5.8</a>:</p>
+<p>In particular, we’ll write out the equation of the two regression lines in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> using the values from a regression table. Before we do this, however, let’s go over a brief refresher of regression when you have a categorical explanatory variable <span class="math inline">\(x\)</span>.</p>
+<p>Recall in Subsection <a href="5-regression.html#model2table">5.2.2</a> we fit a regression model for countries’ life expectancies as a function of which continent the country was in. In other words, we had a numerical outcome variable <span class="math inline">\(y\)</span> = <code>lifeExp</code> and a categorical explanatory variable <span class="math inline">\(x\)</span> = <code>continent</code> which had 5 levels: <code>Africa</code>, <code>Americas</code>, <code>Asia</code>, <code>Europe</code>, and <code>Oceania</code>. Let’s re-display the regression table you saw in Table <a href="5-regression.html#tab:catxplot4b">5.8</a>:</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:unnamed-chunk-198">TABLE 6.2: </span>Regression table for life expectancy as a function of continent.
+<span id="tab:unnamed-chunk-207">TABLE 6.2: </span>Regression table for life expectancy as a function of continent
 </caption>
 <thead>
 <tr>
@@ -910,15 +920,16 @@ <h3><span class="header-section-number">6.1.2</span> Interaction model</h3>
 </tr>
 </tbody>
 </table>
-<p>Recall our interpretation of the <code>estimate</code> column. Since <code>Africa</code> was the “baseline for comparison” group, the <code>intercept</code> term corresponds to the mean life expectancy for all countries in Africa of 54.8 years. The other 4 values of <code>estimate</code> correspond to “offsets” relative to the baseline group. So for example, the “offset” corresponding to the Americas is +18.8 as compared to the baseline for comparison group Africa. In other words, the average life expectancy for countries in the Americas is 18.8 years <em>higher</em>. Thus the mean life expectancy for all countries in the Americas is 54.8 + 18.8 = 73.6. The same interpretation holds for Asia, Europe, and Oceania.</p>
-<p>Going back to our multiple regression model for teaching <code>score</code> using <code>age</code> and <code>gender</code> in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, we generate the regression table using the same two-step approach from Chapter <a href="5-regression.html#regression">5</a>: we first “fit” the model using the <code>lm()</code> “linear model” function and then we apply the <code>get_regression_table()</code> function. This time however, our model formula won’t be of the form <code>y ~ x</code>, but rather of the form <code>y ~ x1 * x2</code>. In other words, our two explanatory variables <code>x1</code> and <code>x2</code> are separated by a <code>*</code> sign:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model_interaction &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>age <span class="op">*</span><span class="st"> </span>gender, <span class="dt">data =</span> evals_ch7)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model_interaction)</code></pre>
+<p>Recall our interpretation of the <code>estimate</code> column. Since <code>Africa</code> was the “baseline for comparison” group, the <code>intercept</code> term corresponds to the mean life expectancy for all countries in Africa of 54.8 years. The other four values of <code>estimate</code> correspond to “offsets” relative to the baseline group. So, for example, the “offset” corresponding to the Americas is +18.8 as compared to the baseline for comparison group Africa. In other words, the average life expectancy for countries in the Americas is 18.8 years <em>higher</em>. Thus the mean life expectancy for all countries in the Americas is 54.8 + 18.8 = 73.6. The same interpretation holds for Asia, Europe, and Oceania.</p>
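+<p>The “baseline plus offset” arithmetic can be checked against the group means directly. The sketch below assumes the life expectancy data are the year-2007 rows of the <code>gapminder</code> data frame from the <code>gapminder</code> package; the names <code>gapminder2007</code>, <code>lifeExp_model</code>, <code>offset_means</code>, and <code>group_means</code> are chosen here for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(gapminder)
+library(dplyr)
+
+# Assumption: the regression table is based on the 2007 data only
+gapminder2007 &lt;- gapminder %&gt;%
+  filter(year == 2007)
+lifeExp_model &lt;- lm(lifeExp ~ continent, data = gapminder2007)
+b &lt;- coef(lifeExp_model)
+
+# Baseline mean is the intercept; every other mean is intercept + offset
+offset_means &lt;- c(Africa = b[[&quot;(Intercept)&quot;]], b[[&quot;(Intercept)&quot;]] + b[-1])
+
+# Mean life expectancy computed directly for each continent
+group_means &lt;- gapminder2007 %&gt;%
+  group_by(continent) %&gt;%
+  summarize(mean_lifeExp = mean(lifeExp))
+
+offset_means
+group_means</code></pre>
+<p>The two sets of values should agree, since a regression on a single categorical explanatory variable reproduces the group means.</p>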
+<p>Going back to our multiple regression model for teaching <code>score</code> using <code>age</code> and <code>gender</code> in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, we generate the regression table using the same two-step approach from Chapter <a href="5-regression.html#regression">5</a>: we first “fit” the model using the <code>lm()</code> “linear model” function and then we apply the <code>get_regression_table()</code> function. This time, however, our model formula won’t be of the form <code>y ~ x</code>, but rather of the form <code>y ~ x1 * x2</code>. In other words, our two explanatory variables <code>x1</code> and <code>x2</code> are separated by a <code>*</code> sign:</p>
+<div class="sourceCode" id="cb188"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb188-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb188-2" data-line-number="2">score_model_interaction &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>age <span class="op">*</span><span class="st"> </span>gender, <span class="dt">data =</span> evals_ch6)</a>
+<a class="sourceLine" id="cb188-3" data-line-number="3"></a>
+<a class="sourceLine" id="cb188-4" data-line-number="4"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb188-5" data-line-number="5"><span class="kw">get_regression_table</span>(score_model_interaction)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:regtable-interaction">TABLE 6.3: </span>Regression table for interaction model.
+<span id="tab:regtable-interaction">TABLE 6.3: </span>Regression table for interaction model
 </caption>
 <thead>
 <tr>
@@ -1040,14 +1051,14 @@ <h3><span class="header-section-number">6.1.2</span> Interaction model</h3>
 </tr>
 </tbody>
 </table>
-<p>Looking at the regression table output in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>, we see there are four rows of values in the <code>estimate</code> column. While it is not immediately apparent, using these four values we can write out the equations of both lines in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>.</p>
-<p>First, since the word <code>female</code> comes alphabetically before <code>male</code>, female instructors are the “baseline for comparison” group. Therefore <code>intercept</code> is the intercept <em>for only the female instructors</em>. This holds similarly for <code>age</code>. It is the slope for age <em>for only the female instructors</em>. Thus the red regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> has an intercept of 4.883 and slope for age of -0.018. Remember that for this particular data, while the intercept has a mathematical interpretation, it has no <em>practical</em> interpretation since there can’t be any instructors with age = 0.</p>
-<p>What about the intercept and slope for age of the male instructors? In other words, the blue line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>? This is where our notion of “offsets” comes into play once again. The value for <code>gendermale</code> of -0.446 is not the intercept for the male instructors, but rather the <em>offset</em> in intercept for male instructors relative to female instructors. Therefore, the intercept for the male instructors is <code>intercept + gendermale</code> = 4.883 + (-0.446) = 4.883 - 0.446 = 4.437.</p>
-<p>Similarly, <code>age:gendermale</code> = 0.014 is not the slope for age for the male instructors, but rather the <em>offset</em> in slope for the male instructors. Therefore, the slope for age for the male instructors is <code>age + age:gendermale</code> = -0.018 + 0.014 = -0.004. Therefore the blue regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> has intercept 4.437 and slope for age of -0.004.</p>
-<p>Let’s summarize these values in Table <a href="6-multiple-regression.html#tab:interaction-summary">6.4</a> and focus on the two slopes for age:</p>
+<p>Looking at the regression table output in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>, there are four rows of values in the <code>estimate</code> column. While it is not immediately apparent, using these four values we can write out the equations of both lines in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>. First, since the word <code>female</code> comes alphabetically before <code>male</code>, female instructors are the “baseline for comparison” group. Thus, <code>intercept</code> is the intercept <em>for only the female instructors</em>.</p>
+<p>This holds similarly for <code>age</code>. It is the slope for age <em>for only the female instructors</em>. Thus, the red regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> has an intercept of 4.883 and slope for age of -0.018. Remember that for this data, while the intercept has a mathematical interpretation, it has no <em>practical</em> interpretation since instructors can’t have zero age.</p>
+<p>What about the intercept and slope for age of the male instructors in the blue line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>? This is where our notion of “offsets” comes into play once again.</p>
+<p>The value for <code>gendermale</code> of -0.446 is not the intercept for the male instructors, but rather the <em>offset</em> in intercept for male instructors relative to female instructors. The intercept for the male instructors is <code>intercept + gendermale</code> = 4.883 + (-0.446) = 4.883 - 0.446 = 4.437.</p>
+<p>Similarly, <code>age:gendermale</code> = 0.014 is not the slope for age for the male instructors, but rather the <em>offset</em> in slope for the male instructors. Therefore, the slope for age for the male instructors is <code>age + age:gendermale</code> <span class="math inline">\(= -0.018 + 0.014 = -0.004\)</span>. Thus, the blue regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> has intercept 4.437 and slope for age of -0.004. Let’s summarize these values in Table <a href="6-multiple-regression.html#tab:interaction-summary">6.4</a> and focus on the two slopes for age:</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:interaction-summary">TABLE 6.4: </span>Comparison of intercepts and slopes for interaction model.
+<span id="tab:interaction-summary">TABLE 6.4: </span>Comparison of intercepts and slopes for interaction model
 </caption>
 <thead>
 <tr>
@@ -1087,17 +1098,17 @@ <h3><span class="header-section-number">6.1.2</span> Interaction model</h3>
 </tr>
 </tbody>
 </table>
-<p>Since the slope for age for the female instructors was -0.018, it means that on average, a female instructor who is a year older would have a teaching score that is 0.018 units <strong>lower</strong>. For the male instructors however, the corresponding associated decrease was on average only 0.004 units. While both slopes for age were negative, the slope for age for the female instructors is <em>more negative</em>. This is consistent with our observation from Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, that this model is suggesting that age is impacts teaching scores for female instructors more than for male instructors.</p>
+<p>Since the slope for age for the female instructors was -0.018, it means that on average, a female instructor who is a year older would have a teaching score that is 0.018 units <strong>lower</strong>. For the male instructors, however, the corresponding associated decrease was on average only 0.004 units. While both slopes for age were negative, the slope for age for the female instructors is <em>more negative</em>. This is consistent with our observation from Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, that this model is suggesting that age impacts teaching scores for female instructors more than for male instructors.</p>
 <p>Let’s now write the equation for our regression lines, which we can use to compute our fitted values <span class="math inline">\(\widehat{y} = \widehat{\text{score}}\)</span>.</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{score}} &amp;= b_0 + b_{\mbox{age}} \cdot \mbox{age} + b_{\mbox{male}} \cdot \mathbb{1}_{\mbox{is male}}(x) + b_{\mbox{age,male}} \cdot \mbox{age} \cdot \mathbb{1}_{\mbox{is male}}\\
-&amp;= 4.883 -0.018 \cdot \mbox{age} - 0.446 \cdot \mathbb{1}_{\mbox{is male}}(x) + 0.014 \cdot \mbox{age} \cdot \mathbb{1}_{\mbox{is male}}
+\widehat{y} = \widehat{\text{score}} &amp;= b_0 + b_{\text{age}} \cdot \text{age} + b_{\text{male}} \cdot \mathbb{1}_{\text{is male}}(x) + b_{\text{age,male}} \cdot \text{age} \cdot \mathbb{1}_{\text{is male}}\\
+&amp;= 4.883 -0.018 \cdot \text{age} - 0.446 \cdot \mathbb{1}_{\text{is male}}(x) + 0.014 \cdot \text{age} \cdot \mathbb{1}_{\text{is male}}
 \end{aligned}
 \]</span></p>
-<p>Whoa! That’s even more daunting than the equation you saw for the life expectancy as a function of continent in Section <a href="5-regression.html#model2table">5.2.2</a>! However if you recall what an “indicator function” AKA “dummy variable” does, the equation simplifies greatly. In the previous equation, we have one indicator function of interest:</p>
+<p>Whoa! That’s even more daunting than the equation you saw for the life expectancy as a function of continent in Subsection <a href="5-regression.html#model2table">5.2.2</a>! However, if you recall what an “indicator function” does, the equation simplifies greatly. In the previous equation, we have one indicator function of interest:</p>
 <p><span class="math display">\[
-\mathbb{1}_{\mbox{is male}}(x) = \left\{
+\mathbb{1}_{\text{is male}}(x) = \left\{
 \begin{array}{ll}
 1 &amp; \text{if } \text{instructor } x \text{ is male} \\
 0 &amp; \text{otherwise}\end{array}
@@ -1106,56 +1117,56 @@ <h3><span class="header-section-number">6.1.2</span> Interaction model</h3>
 <p>Second, let’s match coefficients in the previous equation with values in the <code>estimate</code> column in our regression table in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>:</p>
 <ol style="list-style-type: decimal">
 <li><span class="math inline">\(b_0\)</span> is the <code>intercept</code> = 4.883 for the female instructors</li>
-<li><span class="math inline">\(b_{\mbox{age}}\)</span> is the slope for <code>age</code> = -0.018 for the female instructors</li>
-<li><span class="math inline">\(b_{\mbox{male}}\)</span> is the offset in intercept for the male instructors</li>
-<li><span class="math inline">\(b_{\mbox{age,male}}\)</span> is the offset in slope for age for the male instructors</li>
+<li><span class="math inline">\(b_{\text{age}}\)</span> is the slope for <code>age</code> = -0.018 for the female instructors</li>
+<li><span class="math inline">\(b_{\text{male}}\)</span> is the offset in intercept = -0.446 for the male instructors</li>
+<li><span class="math inline">\(b_{\text{age,male}}\)</span> is the offset in slope for age = 0.014 for the male instructors</li>
 </ol>
-<p>Let’s put this all together and compute the fitted value <span class="math inline">\(\widehat{y} = \widehat{\text{score}}\)</span> for female instructors. Since for female instructors <span class="math inline">\(\mathbb{1}_{\mbox{is male}}(x)\)</span> = 0, the previous equation becomes</p>
+<p>Let’s put this all together and compute the fitted value <span class="math inline">\(\widehat{y} = \widehat{\text{score}}\)</span> for female instructors. Since for female instructors <span class="math inline">\(\mathbb{1}_{\text{is male}}(x)\)</span> = 0, the previous equation becomes</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{score}} &amp;= 4.883 - 0.018   \cdot \mbox{age} - 0.446 \cdot 0 + 0.014 \cdot \mbox{age} \cdot 0\\
-&amp;= 4.883 - 0.018    \cdot \mbox{age} - 0 + 0\\
-&amp;= 4.883 - 0.018    \cdot \mbox{age}\\
+\widehat{y} = \widehat{\text{score}} &amp;= 4.883 - 0.018   \cdot \text{age} - 0.446 \cdot 0 + 0.014 \cdot \text{age} \cdot 0\\
+&amp;= 4.883 - 0.018    \cdot \text{age} - 0 + 0\\
+&amp;= 4.883 - 0.018    \cdot \text{age}\\
 \end{aligned}
 \]</span></p>
-<p>which is the equation of the red regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> corresponding to the female instructors in Table <a href="6-multiple-regression.html#tab:interaction-summary">6.4</a>. Correspondingly, since for male instructors <span class="math inline">\(\mathbb{1}_{\mbox{is male}}(x)\)</span> = 1, the previous equation becomes</p>
+<p>which is the equation of the red regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> corresponding to the female instructors in Table <a href="6-multiple-regression.html#tab:interaction-summary">6.4</a>. Correspondingly, since for male instructors <span class="math inline">\(\mathbb{1}_{\text{is male}}(x)\)</span> = 1, the previous equation becomes</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{score}} &amp;= 4.883 - 0.018   \cdot \mbox{age} - 0.446 + 0.014 \cdot \mbox{age}\\
-&amp;= (4.883 - 0.446) + (- 0.018 + 0.014) * \mbox{age}\\
-&amp;= 4.437 - 0.004    \cdot \mbox{age}\\
+\widehat{y} = \widehat{\text{score}} &amp;= 4.883 - 0.018   \cdot \text{age} - 0.446 + 0.014 \cdot \text{age}\\
+&amp;= (4.883 - 0.446) + (- 0.018 + 0.014) \cdot \text{age}\\
+&amp;= 4.437 - 0.004    \cdot \text{age}\\
 \end{aligned}
 \]</span></p>
 <p>which is the equation of the blue regression line in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a> corresponding to the male instructors in Table <a href="6-multiple-regression.html#tab:interaction-summary">6.4</a>.</p>
-<p>Phew! That was a lot of arithmetic! Don’t fret however, this is as hard as modeling will get in this book. If you’re still a little unsure about using indicator functions and using categorical explanatory variables in a regression model, we <em>highly</em> suggest you re-read Section <a href="5-regression.html#model2table">5.2.2</a>. This involves only a single categorical explanatory variable and thus is much simpler.</p>
-<p>Before we end this section, we explain why we refer to this type of model as an “interaction model.” The <span class="math inline">\(b_{\mbox{age,male}}\)</span> term in the equation for the fitted value <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> is what’s known in statistical modeling as an “interaction effect.” The interaction term corresponds to the <code>age:gendermale</code> = 0.014 in the final row of the regression table in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>.</p>
-<p>We say there is an interaction effect if the associated effect of one variable <em>depends on the value of another variable</em>. In other words, the two variables are “interacting” with each other. In our case, the associated effect of the variable age <em>depends</em> on the value of the other variable gender. This was evidenced by the difference in slopes for age of +0.014 of male instructors relative to female instructors. </p>
-<p>Another way of thinking about interaction effects on teaching scores is as follows. For a given instructor at UT Austin, there might be an associated effect of their age <em>by itself</em>, there might be an associated effect of their gender <em>by itself</em>, but when age and gender are considered <em>together</em> there might an <em>additional effect</em> above and beyond the two individual effects.</p>
+<p>Phew! That was a lot of arithmetic! Don’t fret, however; this is as hard as modeling will get in this book. If you’re still a little unsure about using indicator functions and using categorical explanatory variables in a regression model, we <em>highly</em> suggest you re-read Subsection <a href="5-regression.html#model2table">5.2.2</a>. That subsection involves only a single categorical explanatory variable and thus is much simpler.</p>
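+<p>If you would like to double-check this arithmetic, the following is a minimal sketch in base R. The coefficient values are copied by hand from Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a> (rounded to three decimal places), and the variable names are ours for illustration only; they are not part of the <code>moderndive</code> package.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Estimates copied by hand from the regression table (rounded values)
+b0 &lt;- 4.883          # intercept for female instructors (baseline group)
+b_age &lt;- -0.018      # slope for age for female instructors
+b_male &lt;- -0.446     # offset in intercept for male instructors
+b_age_male &lt;- 0.014  # offset in slope for age for male instructors
+
+# Intercept and slope of each group's regression line
+female_line &lt;- c(intercept = b0, slope = b_age)
+male_line &lt;- c(intercept = b0 + b_male, slope = b_age + b_age_male)
+
+female_line
+male_line</code></pre>
+<p>The two printed vectors should agree with the intercepts and slopes for age in Table <a href="6-multiple-regression.html#tab:interaction-summary">6.4</a>.</p>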
+<p>Before we end this section, we explain why we refer to this type of model as an “interaction model.” The <span class="math inline">\(b_{\text{age,male}}\)</span> term in the equation for the fitted value <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> is what’s known in statistical modeling as an “interaction effect.” The interaction term corresponds to the <code>age:gendermale</code> = 0.014 in the final row of the regression table in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>.</p>
+<p>We say there is an interaction effect if the associated effect of one variable <em>depends on the value of another variable</em>. That is to say, the two variables are “interacting” with each other. Here, the associated effect of the variable age <em>depends</em> on the value of the other variable gender. The difference in slopes for age of +0.014 of male instructors relative to female instructors shows this. </p>
+<p>Another way of thinking about interaction effects on teaching scores is as follows. For a given instructor at UT Austin, there might be an associated effect of their age <em>by itself</em>, there might be an associated effect of their gender <em>by itself</em>, but when age and gender are considered <em>together</em> there might be an <em>additional effect</em> above and beyond the two individual effects.</p>
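+<p>In R’s model formula syntax, the interaction term comes from the <code>*</code> operator: <code>age * gender</code> is shorthand for <code>age + gender + age:gender</code>. As a small sketch, base R’s <code>terms()</code> function lists the terms a formula expands to, without needing any data:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Terms implied by a model formula that uses the * operator
+attr(terms(score ~ age * gender), &quot;term.labels&quot;)
+
+# Terms implied by a model formula that uses only a + sign
+attr(terms(score ~ age + gender), &quot;term.labels&quot;)</code></pre>
+<p>The first call includes an <code>age:gender</code> term, which is where the <code>age:gendermale</code> row of the regression table originates; the second call does not.</p>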
 </div>
 <div id="model4table" class="section level3">
 <h3><span class="header-section-number">6.1.3</span> Parallel slopes model</h3>
-<p>When creating regression models with one numerical and one categorical explanatory variable, we are not just limited to interaction models as we just saw. Another type of model we can use is known as a <em>parallel slopes</em> model. Unlike interaction models where the regression lines can have different intercepts and different slopes, parallel slopes models still allow for different intercepts but <em>force</em> all lines to have the same slope. The resulting regression lines are thus parallel. Let’s visualize the best-fitting parallel slopes model to our <code>evals_ch7</code> data.</p>
-<p>Unfortunately, the <code>ggplot2</code> package does not have a convenient way to plot a parallel slopes model. We therefore created our own special purpose function <code>gg_parallel_slopes()</code> and included it in the <code>moderndive</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">gg_parallel_slopes</span>(<span class="dt">y =</span> <span class="st">&quot;score&quot;</span>, <span class="dt">num_x =</span> <span class="st">&quot;age&quot;</span>, <span class="dt">cat_x =</span> <span class="st">&quot;gender&quot;</span>, 
-                   <span class="dt">data =</span> evals_ch7)</code></pre>
+<p>When creating regression models with one numerical and one categorical explanatory variable, we are not just limited to interaction models as we just saw. Another type of model we can use is known as a <em>parallel slopes</em> model. Unlike interaction models where the regression lines can have different intercepts and different slopes, parallel slopes models still allow for different intercepts but <em>force</em> all lines to have the same slope. The resulting regression lines are thus parallel. Let’s visualize the best-fitting parallel slopes model to <code>evals_ch6</code>.</p>
+<p>Unfortunately, the <code>geom_smooth()</code> function in the <code>ggplot2</code> package does not have a convenient way to plot parallel slopes models. Evgeni Chasnovski thus created a special purpose function called <code>geom_parallel_slopes()</code>. You won’t find this function in the <code>ggplot2</code> package; it is included in the <code>moderndive</code> package, so to use it you will need to load both the <code>ggplot2</code> and <code>moderndive</code> packages. Using this function, let’s now plot the parallel slopes model for teaching score. Notice how the code is identical to the code that produced the visualization of the interaction model in Figure <a href="6-multiple-regression.html#fig:numxcatxplot1">6.1</a>, except that the <code>geom_smooth(method = &quot;lm&quot;, se = FALSE)</code> layer is replaced with <code>geom_parallel_slopes(se = FALSE)</code>.</p>
+<div class="sourceCode" id="cb189"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb189-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> age, <span class="dt">y =</span> score, <span class="dt">color =</span> gender)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb189-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb189-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Age&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;Gender&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb189-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_parallel_slopes</span>(<span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxcatx-parallel"></span>
-<img src="moderndive_files/figure-html/numxcatx-parallel-1.png" alt="Parallel slopes model of relationship of score with age and gender." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxcatx-parallel-1.png" alt="Parallel slopes model of score with age and gender." width="\textwidth" />
 <p class="caption">
-FIGURE 6.2: Parallel slopes model of relationship of score with age and gender.
+FIGURE 6.2: Parallel slopes model of score with age and gender.
 </p>
 </div>
-<p>Note the arguments to this function: the outcome variable <code>y = &quot;score&quot;</code>, the numerical explanatory variable <code>num_x = &quot;age&quot;</code>, the categorical explanatory variable <code>cat_x = &quot;gender&quot;</code>, and the data frame that includes this <code>data = evals_ch7</code>. Be careful to include the quotation marks when specifying all variables.</p>
-<p>Note that the <code>gg_parallel_slopes()</code> function is quite different than all the <code>ggplot()</code> code you saw in Chapter <a href="2-viz.html#viz">2</a>. This is because the <code>ggplot2</code> package does not include a function for plotting parallel slopes models. Thus we had to write a new function for ourselves and include it in the <code>moderndive</code> package. If you’re curious, you can see the code for this function on <a href="https://github.com/moderndive/moderndive/blob/master/R/ggplot_parallel_slopes.R">GitHub</a>.</p>
-<p>Observe in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> that we now have parallel lines corresponding to the female and male instructors respectively: here they have the same negative slope. This is telling us that instructors who are older will tend to receive lower teaching scores than instructors who are younger. Furthermore, since the lines are parallel, the associated penalty for aging is assumed to be the same for both female and male instructors.</p>
+<p>Observe in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> that we now have parallel lines corresponding to the female and male instructors, respectively: here they have the same negative slope. This is telling us that instructors who are older will tend to receive lower teaching scores than instructors who are younger. Furthermore, since the lines are parallel, the associated penalty for being older is assumed to be the same for both female and male instructors.</p>
 <p>However, observe also in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> that these two lines have different intercepts as evidenced by the fact that the blue line corresponding to the male instructors is higher than the red line corresponding to the female instructors. This is telling us that irrespective of age, female instructors tended to receive lower teaching scores than male instructors.</p>
-<p>In order to obtain the precise numerical values of the two intercepts and the single common slope, we once again “fit” the model using the <code>lm()</code> “linear model” function and then apply the <code>get_regression_table()</code> function. However, unlike the interaction model which had a model formula of the form <code>y ~ x1 * x2</code>, our model formula is now of the form <code>y ~ x1 + x2</code>. In other words our two explanatory variables <code>x1</code> and <code>x2</code> are separated by a <code>+</code> sign:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model_parallel_slopes &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>age <span class="op">+</span><span class="st"> </span>gender, <span class="dt">data =</span> evals_ch7)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model_parallel_slopes)</code></pre>
+<p>In order to obtain the precise numerical values of the two intercepts and the single common slope, we once again “fit” the model using the <code>lm()</code> “linear model” function and then apply the <code>get_regression_table()</code> function. However, unlike the interaction model which had a model formula of the form <code>y ~ x1 * x2</code>, our model formula is now of the form <code>y ~ x1 + x2</code>. In other words, our two explanatory variables <code>x1</code> and <code>x2</code> are separated by a <code>+</code> sign:</p>
+<div class="sourceCode" id="cb190"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb190-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb190-2" data-line-number="2">score_model_parallel_slopes &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>age <span class="op">+</span><span class="st"> </span>gender, <span class="dt">data =</span> evals_ch6)</a>
+<a class="sourceLine" id="cb190-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb190-4" data-line-number="4"><span class="kw">get_regression_table</span>(score_model_parallel_slopes)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:regtable-parallel-slopes">TABLE 6.5: </span>Regression table for parallel slopes model.
+<span id="tab:regtable-parallel-slopes">TABLE 6.5: </span>Regression table for parallel slopes model
 </caption>
 <thead>
 <tr>
@@ -1255,11 +1266,11 @@ <h3><span class="header-section-number">6.1.3</span> Parallel slopes model</h3>
 </tbody>
 </table>
 <p>Similarly to the regression table for the interaction model from Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>, we have an <code>intercept</code> term corresponding to the intercept for the “baseline for comparison” female instructor group and a <code>gendermale</code> term corresponding to the <em>offset</em> in intercept for the male instructors relative to female instructors. In other words, in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> the red regression line corresponding to the female instructors has an intercept of 4.484 while the blue regression line corresponding to the male instructors has an intercept of 4.484 + 0.191 = 4.675. Once again, since there aren’t any instructors of age 0, the intercepts only have a mathematical interpretation but no practical one.</p>
-<p>Unlike in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a> however, we now only have a single slope for age of -0.009. This is because model specifies that both the female and male instructors have a common slope for age.  This is telling us that an instructor who is a year older than another instructor received a teaching score that is on average 0.018 units <em>lower</em>. This penalty for aging applies equally for both female and male instructors.</p>
+<p>Unlike in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>, however, we now only have a single slope for age of -0.009. This is because the model dictates that both the female and male instructors have a common slope for age. This is telling us that an instructor who is a year older than another instructor received a teaching score that is on average 0.009 units <em>lower</em>. This penalty for being older applies equally to both female and male instructors.</p>
 <p>Let’s summarize these values in Table <a href="6-multiple-regression.html#tab:parallel-slopes-summary">6.6</a>, noting the different intercepts but common slopes:</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:parallel-slopes-summary">TABLE 6.6: </span>Comparison of intercepts and slope for parallel slopes model.
+<span id="tab:parallel-slopes-summary">TABLE 6.6: </span>Comparison of intercepts and slope for parallel slopes model
 </caption>
 <thead>
 <tr>
@@ -1302,55 +1313,55 @@ <h3><span class="header-section-number">6.1.3</span> Parallel slopes model</h3>
 <p>Let’s now write the equation for our regression lines, which we can use to compute our fitted values <span class="math inline">\(\widehat{y} = \widehat{\text{score}}\)</span>.</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{score}} &amp;= b_0 + b_{\mbox{age}} \cdot \mbox{age} + b_{\mbox{male}} \cdot \mathbb{1}_{\mbox{is male}}(x)\\
-&amp;= 4.484 -0.009 \cdot \mbox{age} + 0.191 \cdot \mathbb{1}_{\mbox{is male}}(x) 
+\widehat{y} = \widehat{\text{score}} &amp;= b_0 + b_{\text{age}} \cdot \text{age} + b_{\text{male}} \cdot \mathbb{1}_{\text{is male}}(x)\\
+&amp;= 4.484 -0.009 \cdot \text{age} + 0.191 \cdot \mathbb{1}_{\text{is male}}(x) 
 \end{aligned}
 \]</span></p>
-<p>Let’s put this all together and compute the fitted value <span class="math inline">\(\widehat{y} = \widehat{\text{score}}\)</span> for female instructors. Since for female instructors the indicator function <span class="math inline">\(\mathbb{1}_{\mbox{is male}}(x)\)</span> = 0, the previous equation becomes</p>
+<p>Let’s put this all together and compute the fitted value <span class="math inline">\(\widehat{y} = \widehat{\text{score}}\)</span> for female instructors. Since for female instructors the indicator function <span class="math inline">\(\mathbb{1}_{\text{is male}}(x)\)</span> = 0, the previous equation becomes</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{score}} &amp;= 4.484 -0.009    \cdot \mbox{age} + 0.191 \cdot 0\\
-&amp;= 4.484 -0.009 \cdot \mbox{age}
+\widehat{y} = \widehat{\text{score}} &amp;= 4.484 -0.009    \cdot \text{age} + 0.191 \cdot 0\\
+&amp;= 4.484 -0.009 \cdot \text{age}
 \end{aligned}
 \]</span></p>
-<p>which is the equation of the red regression line in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> corresponding to the female instructors. Correspondingly, since for male instructors the indicator function <span class="math inline">\(\mathbb{1}_{\mbox{is male}}(x)\)</span> = 1, the previous equation becomes</p>
+<p>which is the equation of the red regression line in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> corresponding to the female instructors. Correspondingly, since for male instructors the indicator function <span class="math inline">\(\mathbb{1}_{\text{is male}}(x)\)</span> = 1, the previous equation becomes</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{score}} &amp;= 4.484 -0.009    \cdot \mbox{age} + 0.191 \cdot 1\\
-&amp;= (4.484 + 0.191) - 0.009 \cdot \mbox{age}\\
-&amp;= 4.67 -0.009 \cdot \mbox{age}
+\widehat{y} = \widehat{\text{score}} &amp;= 4.484 -0.009    \cdot \text{age} + 0.191 \cdot 1\\
+&amp;= (4.484 + 0.191) - 0.009 \cdot \text{age}\\
+&amp;= 4.675 -0.009 \cdot \text{age}
 \end{aligned}
 \]</span></p>
 <p>which is the equation of the blue regression line in Figure <a href="6-multiple-regression.html#fig:numxcatx-parallel">6.2</a> corresponding to the male instructors.</p>
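+<p>One way to see the “parallel” property numerically is the sketch below. It uses the rounded estimates from Table <a href="6-multiple-regression.html#tab:regtable-parallel-slopes">6.5</a>, copied by hand, and a hypothetical helper function <code>fitted_parallel()</code> that we define here only for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Estimates copied by hand from the regression table (rounded values)
+b0 &lt;- 4.484      # intercept for female instructors (baseline group)
+b_age &lt;- -0.009  # common slope for age
+b_male &lt;- 0.191  # offset in intercept for male instructors
+
+# Hypothetical helper: fitted score under the parallel slopes model,
+# where is_male is 1 for male instructors and 0 otherwise
+fitted_parallel &lt;- function(age, is_male) {
+  b0 + b_age * age + b_male * is_male
+}
+
+# Male minus female fitted score at a few example ages
+ages &lt;- c(30, 45, 60)
+fitted_parallel(ages, 1) - fitted_parallel(ages, 0)</code></pre>
+<p>Whatever ages are chosen, the difference between the two fitted scores equals the offset in intercept of 0.191, which is exactly what it means for the two lines to be parallel.</p>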
 <p>Great! We’ve considered both an interaction model and a parallel slopes model for our data. Let’s compare the visualizations for both models side-by-side in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison">6.3</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:numxcatx-comparison"></span>
-<img src="moderndive_files/figure-html/numxcatx-comparison-1.png" alt="Comparison of interaction and parallel slopes models." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxcatx-comparison-1.png" alt="Comparison of interaction and parallel slopes models." width="\textwidth" />
 <p class="caption">
 FIGURE 6.3: Comparison of interaction and parallel slopes models.
 </p>
 </div>
-<p>At this point, you might be asking yourself: “Why would we ever use a parallel slopes model?” Looking at the left-hand plot in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison">6.3</a>, the two lines definitely do not appear to be parallel, so why would we <em>force</em> them to be parallel? For this data, we agree! It can easily be argued that the interaction model is more appropriate. However, in the upcoming Section <a href="6-multiple-regression.html#model-selection">6.3.1</a> on model selection, we’ll present an example where it can be argued that the case for a parallel slopes model might be stronger.</p>
+<p>At this point, you might be asking yourself: “Why would we ever use a parallel slopes model?” Looking at the left-hand plot in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison">6.3</a>, the two lines definitely do not appear to be parallel, so why would we <em>force</em> them to be parallel? For this data, we agree! It can easily be argued that the interaction model on the left is more appropriate. However, in the upcoming Subsection <a href="6-multiple-regression.html#model-selection">6.3.1</a> on model selection, we’ll present an example where it can be argued that the case for a parallel slopes model might be stronger.</p>
 </div>
 <div id="model4points" class="section level3">
 <h3><span class="header-section-number">6.1.4</span> Observed/fitted values and residuals</h3>
-<p>For brevity’s sake, in this section we’ll only compute the observed values, fitted values, and residuals for the interaction model which we saved in <code>score_model_interaction</code>. You’ll have an opportunity to study these values for the parallel slopes model in the upcoming Learning Check.</p>
-<p>Say you have a professor who is female and is 36 years old. What fitted value <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> would our model yield? Say you have another professor who is male and is 59 years old. What would their fitted value <span class="math inline">\(\widehat{y}\)</span> be?</p>
-<p>We answer this question visually first by finding the intersection of the red regression line and the vertical line at <span class="math inline">\(x\)</span> = age = 36. We mark this value with a large red dot in Figure <a href="6-multiple-regression.html#fig:fitted-values">6.4</a>. Similarly, we can identify the fitted value <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> for the male instructor by finding the intersection of the blue regression line and the vertical line at <span class="math inline">\(x\)</span> = age = 59. We mark this value with a large blue dot in Figure <a href="6-multiple-regression.html#fig:fitted-values">6.4</a>.</p>
+<p>For brevity’s sake, in this section we’ll only compute the observed values, fitted values, and residuals for the interaction model which we saved in <code>score_model_interaction</code>. You’ll have an opportunity to study the corresponding values for the parallel slopes model in the upcoming <em>Learning check</em>.</p>
+<p>Say you have an instructor who identifies as female and is 36 years old. What fitted value <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> would our model yield? Say you have another instructor who identifies as male and is 59 years old. What would their fitted value <span class="math inline">\(\widehat{y}\)</span> be?</p>
+<p>We answer this question visually first for the female instructor by finding the intersection of the red regression line and the vertical line at <span class="math inline">\(x\)</span> = age = 36. We mark this value with a large red dot in Figure <a href="6-multiple-regression.html#fig:fitted-values">6.4</a>. Similarly, we can identify the fitted value <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> for the male instructor by finding the intersection of the blue regression line and the vertical line at <span class="math inline">\(x\)</span> = age = 59. We mark this value with a large blue dot in Figure <a href="6-multiple-regression.html#fig:fitted-values">6.4</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:fitted-values"></span>
-<img src="moderndive_files/figure-html/fitted-values-1.png" alt="Fitted values for two new professors." width="\textwidth" />
+<img src="ModernDive_files/figure-html/fitted-values-1.png" alt="Fitted values for two new professors." width="\textwidth" />
 <p class="caption">
 FIGURE 6.4: Fitted values for two new professors.
 </p>
 </div>
-<p>What are these two values of <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> precisely? We can use the equations of the two regression lines we computed in Section <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a>, which in turn were based on values from the regression table in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>:</p>
+<p>What are these two values of <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> precisely? We can use the equations of the two regression lines we computed in Subsection <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a>, which in turn were based on values from the regression table in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a>:</p>
 <ul>
-<li>For all female instructors: <span class="math inline">\(\widehat{y} = \widehat{\text{score}} = 4.883 - 0.018 \cdot \mbox{age}\)</span></li>
-<li>For all male instructors: <span class="math inline">\(\widehat{y} = \widehat{\text{score}} = 4.437 - 0.004 \cdot \mbox{age}\)</span></li>
+<li>For all female instructors: <span class="math inline">\(\widehat{y} = \widehat{\text{score}} = 4.883 - 0.018 \cdot \text{age}\)</span></li>
+<li>For all male instructors: <span class="math inline">\(\widehat{y} = \widehat{\text{score}} = 4.437 - 0.004 \cdot \text{age}\)</span></li>
 </ul>
-<p>So our fitted values would be: 4.883 - 0.018 <span class="math inline">\(\cdot\)</span> 36 = 4.25 and 4.437 - 0.004 <span class="math inline">\(\cdot\)</span> 59 = 4.20 respectively.</p>
-<p>Now say we want the fitted values not just for the instructors of these two courses, but for the instructors of all 463 courses included in the <code>evals_ch7</code> data frame? Doing this by hand would be long and tedious! This is where the <code>get_regression_points()</code> function from the <code>moderndive</code> package can help: it will quickly automate this for all 463 courses. We present a preview of just the first 10 rows out of 463 in Table <a href="6-multiple-regression.html#tab:model4-points-table">6.7</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model_interaction)
-regression_points</code></pre>
+<p>So our fitted values would be: <span class="math inline">\(4.883 - 0.018 \cdot 36 = 4.25\)</span> and <span class="math inline">\(4.437 - 0.004 \cdot 59 = 4.20\)</span>, respectively.</p>
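+<p>As a quick check, the same arithmetic can be sketched in R. The helper <code>fitted_score()</code> is a hypothetical name introduced only for this illustration, and it uses the rounded coefficients from the two equations above, so its results may differ slightly in the second decimal place from values computed with unrounded coefficients.</p>

```r
# A minimal sketch: plug an age into one of the two rounded regression lines.
# fitted_score() is a hypothetical helper, not a function from moderndive.
fitted_score <- function(intercept, slope, age) {
  intercept + slope * age
}

# Female instructor, age 36: intercept 4.883, slope -0.018
fitted_female <- fitted_score(4.883, -0.018, age = 36)

# Male instructor, age 59: intercept 4.437, slope -0.004
fitted_male <- fitted_score(4.437, -0.004, age = 59)

c(female = fitted_female, male = fitted_male)
```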
+<p>Now what if we want the fitted values not just for these two instructors, but for the instructors of all 463 courses included in the <code>evals_ch6</code> data frame? Doing this by hand would be long and tedious! This is where the <code>get_regression_points()</code> function from the <code>moderndive</code> package can help: it will quickly automate the above calculations for all 463 courses. We present a preview of just the first 10 rows out of 463 in Table <a href="6-multiple-regression.html#tab:model4-points-table">6.7</a>.</p>
+<div class="sourceCode" id="cb191"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb191-1" data-line-number="1">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model_interaction)</a>
+<a class="sourceLine" id="cb191-2" data-line-number="2">regression_points</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model4-points-table">TABLE 6.7: </span>Regression points (First 10 out of 463 courses)
@@ -1580,7 +1591,7 @@ <h3><span class="header-section-number">6.1.4</span> Observed/fitted values and
 </tr>
 </tbody>
 </table>
-<p>In fact, it turns out that the female instructor of age 36 taught the first four courses, while the male instructor taught the next 3. The resulting <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> fitted values are in the <code>score_hat</code> column. Furthermore, the <code>get_regression_points()</code> function also returns the residuals <span class="math inline">\(y-\widehat{y}\)</span>. Notice for example the first and fourth courses the female instructor of age 36 taught had positive residuals, indicating that the actual teaching score they received from students was less than their fitted score of 4.25. On the other hand, the second and third course this instructor taught had negative residuals, indicating that the actual teaching score they received from students was more than their fitted score of 4.25.</p>
+<p>It turns out that the female instructor of age 36 taught the first four courses, while the male instructor taught the next three. The resulting <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{score}}\)</span> fitted values are in the <code>score_hat</code> column. Furthermore, the <code>get_regression_points()</code> function also returns the residuals <span class="math inline">\(y-\widehat{y}\)</span>. Notice, for example, that the first and fourth courses the female instructor of age 36 taught had positive residuals, indicating that the actual teaching scores they received from students were greater than their fitted score of 4.25. On the other hand, the second and third courses this instructor taught had negative residuals, indicating that the actual teaching scores they received from students were less than 4.25.</p>
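+<p>The link between the sign of a residual and whether the observed score sits above or below the fitted score can be sketched as follows. The observed scores here are illustrative values chosen for this sketch (an assumption, not values read from the table), and the fitted score of 4.25 is the one quoted above.</p>

```r
# Illustrative observed scores only (hypothetical, not taken from the table).
observed <- c(4.7, 4.1, 3.9, 4.8)

# The same instructor has the same fitted score for every course they taught.
fitted <- rep(4.25, length(observed))

# Residual = observed value minus fitted value.
residual <- observed - fitted

# A positive residual means the observed score is above the fitted score.
position <- ifelse(residual > 0, "above fitted", "below fitted")

data.frame(observed, fitted, residual, position)
```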
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1594,7 +1605,7 @@ <h3><span class="header-section-number">6.1.4</span> Observed/fitted values and
 </div>
 <div id="model3" class="section level2">
 <h2><span class="header-section-number">6.2</span> Two numerical explanatory variables</h2>
-<p>Let’s now switch gears and consider multiple regression models where instead of one numerical and one categorical explanatory variable, we now have two numerical explanatory variables. The dataset we’ll use is from <a href="http://www-bcf.usc.edu/~gareth/ISL/">“An Introduction to Statistical Learning with Applications in R (ISLR)”</a>, an intermediate-level textbook on statistical and machine learning. Its accompanying <code>ISLR</code> R package contains the datasets that the authors apply various machine learning methods to.</p>
+<p>Let’s now switch gears and consider multiple regression models where instead of one numerical and one categorical explanatory variable, we now have two numerical explanatory variables. The dataset we’ll use is from <a href="http://www-bcf.usc.edu/~gareth/ISL/"><em>An Introduction to Statistical Learning with Applications in R (ISLR)</em></a>, an intermediate-level textbook on statistical and machine learning <span class="citation">(James et al. <a href="#ref-islr2017">2017</a>)</span>. Its accompanying <code>ISLR</code> R package contains the datasets to which the authors apply various machine learning methods.</p>
 <p>One frequently used dataset in this book is the <code>Credit</code> dataset, where the outcome variable of interest is the credit card debt of 400 individuals. Other variables like income, credit limit, credit rating, and age are included as well. Note that the <code>Credit</code> data is not based on real individuals’ financial information, but rather is a simulated dataset used for educational purposes.</p>
 <p>In this section, we’ll fit a regression model where we have</p>
 <ol style="list-style-type: decimal">
@@ -1606,7 +1617,7 @@ <h2><span class="header-section-number">6.2</span> Two numerical explanatory var
 </ol></li>
 </ol>
 <!--
-In the forthcoming Learning Checks, we'll consider a different regression model
+In the forthcoming Learning checks, we'll consider a different regression model
 
 1. The same numerical outcome variable $y$, the cardholder's credit card debt
 1. Two different explanatory variables:
@@ -1615,14 +1626,13 @@ <h2><span class="header-section-number">6.2</span> Two numerical explanatory var
 -->
 <div id="model3EDA" class="section level3">
 <h3><span class="header-section-number">6.2.1</span> Exploratory data analysis</h3>
-<p>Let’s load the <code>Credit</code> dataset, but to keep things simple let’s <code>select()</code> only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called <code>credit_ch7</code>. Notice our slightly different use of the <code>select()</code> verb here than we introduced in Subsection <a href="3-wrangling.html#select">3.8.1</a>. For example, we’ll select the <code>Balance</code> variable from <code>Credit</code> but then save it with a new variable name <code>debt</code>. We do this because here the term “debt” is a little more interpretable than “balance.”</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ISLR)
-credit_ch7 &lt;-<span class="st"> </span>Credit <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">as_tibble</span>() <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(ID, <span class="dt">debt =</span> Balance, <span class="dt">credit_limit =</span> Limit, 
-         <span class="dt">income =</span> Income, <span class="dt">credit_rating =</span> Rating, <span class="dt">age =</span> Age)</code></pre>
-<p>You can observe the effect of our use of<code>select()</code> in the first common step of an exploratory data analysis: looking at the raw values either in RStudio’s spreadsheet viewer or by using <code>glimpse()</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">glimpse</span>(credit_ch7)</code></pre>
+<p>Let’s load the <code>Credit</code> dataset. To keep things simple, let’s <code>select()</code> the subset of the variables we’ll consider in this chapter, and save this data in the new data frame <code>credit_ch6</code>. Notice that we use the <code>select()</code> verb slightly differently here than when we introduced it in Subsection <a href="3-wrangling.html#select">3.8.1</a>. For example, we’ll select the <code>Balance</code> variable from <code>Credit</code> but then save it with a new variable name <code>debt</code>. We do this because here the term “debt” is easier to interpret than “balance.”</p>
+<div class="sourceCode" id="cb192"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb192-1" data-line-number="1"><span class="kw">library</span>(ISLR)</a>
+<a class="sourceLine" id="cb192-2" data-line-number="2">credit_ch6 &lt;-<span class="st"> </span>Credit <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">as_tibble</span>() <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb192-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(ID, <span class="dt">debt =</span> Balance, <span class="dt">credit_limit =</span> Limit, </a>
+<a class="sourceLine" id="cb192-4" data-line-number="4">         <span class="dt">income =</span> Income, <span class="dt">credit_rating =</span> Rating, <span class="dt">age =</span> Age)</a></code></pre></div>
+<p>You can observe the effect of our use of <code>select()</code> in the first common step of an exploratory data analysis: looking at the raw values either in RStudio’s spreadsheet viewer or by using <code>glimpse()</code>.</p>
+<div class="sourceCode" id="cb193"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb193-1" data-line-number="1"><span class="kw">glimpse</span>(credit_ch6)</a></code></pre></div>
 <pre><code>Observations: 400
 Variables: 6
 $ ID            &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
@@ -1631,13 +1641,11 @@ <h3><span class="header-section-number">6.2.1</span> Exploratory data analysis</
 $ income        &lt;dbl&gt; 14.9, 106.0, 104.6, 148.9, 55.9, 80.2, 21.0, 71.4, 15.1…
 $ credit_rating &lt;int&gt; 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, …
 $ age           &lt;int&gt; 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49,…</code></pre>
-<p>Furthermore, let’s look at a random sample of five out of the 400 credit card holders in Table <a href="6-multiple-regression.html#tab:model3-data-preview">6.8</a>. Note due to the random nature of the sampling, you will likely end up with a different subset of five rows.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(<span class="dv">9</span>)
-credit_ch7 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</code></pre>
+<p>Furthermore, let’s look at a random sample of five out of the 400 credit card holders in Table <a href="6-multiple-regression.html#tab:model3-data-preview">6.8</a>. Once again, note that due to the random nature of the sampling, you will likely end up with a different subset of five rows.</p>
+<div class="sourceCode" id="cb195"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb195-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">5</span>)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:model3-data-preview">TABLE 6.8: </span>Random sample of 5 credit card holders.
+<span id="tab:model3-data-preview">TABLE 6.8: </span>Random sample of 5 credit card holders
 </caption>
 <thead>
 <tr>
@@ -1764,35 +1772,31 @@ <h3><span class="header-section-number">6.2.1</span> Exploratory data analysis</
 </tr>
 </tbody>
 </table>
-<p>Now that we’ve looked at the raw values in our <code>credit_ch7</code> data frame and got a sense of the data, let’s move on to next common step in an exploratory data analysis: computing summary statistics. Let’s use the <code>skim()</code> function from the <code>skimr</code> package, being sure to only <code>select()</code> the columns of interest for our model:</p>
-<pre class="sourceCode r"><code class="sourceCode r">credit_ch7 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(debt, credit_limit, income) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">skim</span>()</code></pre>
+<p>Now that we’ve looked at the raw values in our <code>credit_ch6</code> data frame and got a sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s use the <code>skim()</code> function from the <code>skimr</code> package, being sure to only <code>select()</code> the columns of interest for our model:</p>
+<div class="sourceCode" id="cb196"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb196-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(debt, credit_limit, income) <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">skim</span>()</a></code></pre></div>
 <pre><code>Skim summary statistics
  n obs: 400 
  n variables: 3 
 
-── Variable type:integer ───────────────────────────────────────────────────────
+── Variable type:integer 
   variable missing complete   n    mean      sd  p0     p25    p50     p75  p100
 credit_limit     0      400 400 4735.6  2308.2  855 3088    4622.5 5872.75 13913
          debt    0      400 400  520.01  459.76   0   68.75  459.5  863     1999
 
-── Variable type:numeric ───────────────────────────────────────────────────────
+── Variable type:numeric 
  variable missing complete   n  mean    sd    p0   p25   p50   p75   p100
    income       0      400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63</code></pre>
-<p>Observe the summary statistics for the outcome variable <code>debt</code>: the mean and median credit card debt are $520.01 and $459.50 respectively and that 25% of card holders had debts of $68.75 or less. Let’s now look at one of the explanatory variables <code>credit_limit</code>: the mean and median credit card limit are $4735.6 and $4622.50 respectively while 75% of card holders had incomes of $57,470 or less.</p>
+<p>Observe the summary statistics for the outcome variable <code>debt</code>: the mean and median credit card debt are $520.01 and $459.50, respectively, and 25% of card holders had debts of $68.75 or less. Let’s now look at the explanatory variables. For <code>credit_limit</code>, the mean and median credit card limit are $4735.60 and $4622.50, respectively. For <code>income</code>, which is recorded in thousands of dollars, 75% of card holders had incomes of $57,470 or less.</p>
 <p>Since our outcome variable <code>debt</code> and the explanatory variables <code>credit_limit</code> and <code>income</code> are numerical, we can compute the correlation coefficient between the different possible pairs of these variables. First, we can run the <code>get_correlation()</code> command as seen in Subsection <a href="5-regression.html#model1EDA">5.1.1</a> twice, once for each explanatory variable:</p>
-<pre class="sourceCode r"><code class="sourceCode r">credit_ch7 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_correlation</span>(debt <span class="op">~</span><span class="st"> </span>credit_limit)
-credit_ch7 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_correlation</span>(debt <span class="op">~</span><span class="st"> </span>income)</code></pre>
-<p>Or we can simultaneously compute them by returning a <em>correlation matrix</em> which we display in Table <a href="6-multiple-regression.html#tab:model3-correlation">6.9</a>.  We can read off the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination.</p>
-<pre class="sourceCode r"><code class="sourceCode r">credit_ch7 <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">select</span>(debt, credit_limit, income) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">cor</span>()</code></pre>
+<div class="sourceCode" id="cb198"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb198-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">get_correlation</span>(debt <span class="op">~</span><span class="st"> </span>credit_limit)</a>
+<a class="sourceLine" id="cb198-2" data-line-number="2">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">get_correlation</span>(debt <span class="op">~</span><span class="st"> </span>income)</a></code></pre></div>
+<p>Or we can simultaneously compute them by returning a <em>correlation matrix</em>, which we display in Table <a href="6-multiple-regression.html#tab:model3-correlation">6.9</a>. We can see the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination.</p>
+<div class="sourceCode" id="cb199"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb199-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb199-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(debt, credit_limit, income) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb199-3" data-line-number="3"><span class="st">  </span><span class="kw">cor</span>()</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:model3-correlation">TABLE 6.9: </span>Correlation coefficients between credit card debt, credit limit, and income.
+<span id="tab:model3-correlation">TABLE 6.9: </span>Correlation coefficients between credit card debt, credit limit, and income
 </caption>
 <thead>
 <tr>
@@ -1859,70 +1863,58 @@ <h3><span class="header-section-number">6.2.1</span> Exploratory data analysis</
 <li><code>debt</code> with itself is 1 as we would expect based on the definition of the correlation coefficient.</li>
 <li><code>debt</code> with <code>credit_limit</code> is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card debts.</li>
 <li><code>debt</code> with <code>income</code> is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between <code>debt</code> and <code>credit_limit</code>.</li>
-<li>As an added bonus, we can read off the correlation coefficient between the two explanatory variables, <code>credit_limit</code> and <code>income</code> of 0.792.</li>
+<li>As an added bonus, we can read off the correlation coefficient between the two explanatory variables, <code>credit_limit</code> and <code>income</code>, as 0.792.</li>
 </ol>
 <p>We say there is a high degree of <em>collinearity</em> between the <code>credit_limit</code> and <code>income</code> explanatory variables. Collinearity (or multicollinearity) is a phenomenon where one explanatory variable in a multiple regression model is highly correlated with another.</p>
-<p>So in our case since <code>credit_limit</code> and <code>income</code> are highly correlated, if we knew someone’s <code>credit_limit</code>, we could make pretty good guesses about their <code>income</code> as well. Thus, these two variables provided somewhat redundant information. However, we’ll leave discussion on how to work with collinear explanatory variables to a more intermediate-level book on regression modeling.</p>
+<p>So in our case since <code>credit_limit</code> and <code>income</code> are highly correlated, if we knew someone’s <code>credit_limit</code>, we could make pretty good guesses about their <code>income</code> as well. Thus, these two variables provide somewhat redundant information. However, we’ll leave discussion on how to work with collinear explanatory variables to a more intermediate-level book on regression modeling.</p>
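+<p>To see how a correlation matrix exposes collinearity, here is a minimal sketch on simulated data. The variables <code>x1</code>, <code>x2</code>, and <code>y</code> are made up for this illustration and are not the <code>Credit</code> data; <code>x2</code> is deliberately constructed from <code>x1</code> so that the two explanatory variables are highly correlated.</p>

```r
# Simulated data only: x2 is built from x1, so the two are collinear by design.
set.seed(76)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
y  <- 2 * x1 + rnorm(n)
sim <- data.frame(y, x1, x2)

# cor() on a data frame of numerical columns returns the full correlation matrix.
cor_matrix <- cor(sim)
cor_matrix

# Look up one pair by row and column name: the two explanatory variables.
cor_matrix["x1", "x2"]
```

+<p>The diagonal entries are all 1 because each variable is perfectly correlated with itself, and the matrix is symmetric, so each pair can be looked up in either order.</p>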
 <p>Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots in Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(credit_ch7, <span class="kw">aes</span>(<span class="dt">x =</span> credit_limit, <span class="dt">y =</span> debt)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Credit limit (in $)&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Credit card debt (in $)&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;Debt and credit limit&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)
-  
-<span class="kw">ggplot</span>(credit_ch7, <span class="kw">aes</span>(<span class="dt">x =</span> income, <span class="dt">y =</span> debt)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Income (in $1000)&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Credit card debt (in $)&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;Debt and income&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</code></pre>
+<div class="sourceCode" id="cb200"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb200-1" data-line-number="1"><span class="kw">ggplot</span>(credit_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> credit_limit, <span class="dt">y =</span> debt)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb200-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb200-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Credit limit (in $)&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Credit card debt (in $)&quot;</span>, </a>
+<a class="sourceLine" id="cb200-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Debt and credit limit&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb200-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a>
+<a class="sourceLine" id="cb200-6" data-line-number="6"></a>
+<a class="sourceLine" id="cb200-7" data-line-number="7"><span class="kw">ggplot</span>(credit_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> income, <span class="dt">y =</span> debt)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb200-8" data-line-number="8"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb200-9" data-line-number="9"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Income (in $1000)&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Credit card debt (in $)&quot;</span>, </a>
+<a class="sourceLine" id="cb200-10" data-line-number="10">       <span class="dt">title =</span> <span class="st">&quot;Debt and income&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb200-11" data-line-number="11"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:2numxplot1"></span>
-<img src="moderndive_files/figure-html/2numxplot1-1.png" alt="Relationship between credit card debt and credit limit/income." width="\textwidth" />
+<img src="ModernDive_files/figure-html/2numxplot1-1.png" alt="Relationship between credit card debt and credit limit/income." width="\textwidth" />
 <p class="caption">
 FIGURE 6.5: Relationship between credit card debt and credit limit/income.
 </p>
 </div>
 <p>Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases so also does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier. In the case of income, the positive relationship doesn’t appear as strong, given the weakly positive correlation coefficient of 0.464.</p>
-<p>However, the two plots in Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a> only focus on the relationship of the outcome variable with each of the two explanatory variables <em>separately</em>. To visualize the <em>joint</em> relationship of all three variables simultaneously, we need a 3-dimensional (3D) scatterplot as seen in Figure <a href="6-multiple-regression.html#fig:3D-scatterplot">6.6</a>. Each of the 400 observations in the <code>credit_ch7</code> data frame are marked with a blue point where</p>
+<p>However, the two plots in Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a> only focus on the relationship of the outcome variable with each of the two explanatory variables <em>separately</em>. To visualize the <em>joint</em> relationship of all three variables simultaneously, we need a 3-dimensional (3D) scatterplot as seen in Figure <a href="6-multiple-regression.html#fig:3D-scatterplot">6.6</a>. Each of the 400 observations in the <code>credit_ch6</code> data frame is marked with a blue point where</p>
 <ol style="list-style-type: decimal">
-<li>The numerical outcome variable <span class="math inline">\(y\)</span> <code>debt</code> is on the vertical axis</li>
+<li>The numerical outcome variable <span class="math inline">\(y\)</span> <code>debt</code> is on the vertical axis.</li>
 <li>The two numerical explanatory variables, <span class="math inline">\(x_1\)</span> <code>income</code> and <span class="math inline">\(x_2\)</span> <code>credit_limit</code>, are on the two axes that form the bottom plane.</li>
 </ol>
 <div class="figure" style="text-align: center"><span id="fig:3D-scatterplot"></span>
-<img src="images/credit_card_balance_regression_plane.png" alt="3D scatterplot and regression plane." width="60%" />
+<img src="images/credit_card_balance_regression_plane.png" alt="3D scatterplot and regression plane." width="75%" />
 <p class="caption">
 FIGURE 6.6: 3D scatterplot and regression plane.
 </p>
 </div>
-<p>Furthermore, we also include the <em>regression plane</em>. Recall from Section <a href="5-regression.html#leastsquares">5.3.2</a> that regression lines are “best-fitting” in that of all possible lines we can draw through a cloud of points, the regression line minimizes the <em>sum of squared residuals</em>. This concept also extends to models with two numerical explanatory variables. The difference is instead of a “best-fitting” line, we now have a “best-fitting” plane that similarly minimizes the sum of squared residuals. Head to <a href="https://beta.rstudioconnect.com/connect/#/apps/3214/">here</a> to open an interactive version of this plot in your browser.</p>
+<p>Furthermore, we also include the <em>regression plane</em>. Recall from Subsection <a href="5-regression.html#leastsquares">5.3.2</a> that regression lines are “best-fitting” in that of all possible lines we can draw through a cloud of points, the regression line minimizes the <em>sum of squared residuals</em>. This concept also extends to models with two numerical explanatory variables. The difference is instead of a “best-fitting” line, we now have a “best-fitting” plane that similarly minimizes the sum of squared residuals. Head to <a href="https://moderndive.com/regression-plane">this website</a> to open an interactive version of this plot in your browser.</p>
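+<p>The “best-fitting” property of the regression plane can be checked numerically. The sketch below uses simulated data (an assumption made for illustration, not the <code>Credit</code> data) and a hypothetical helper <code>ssr()</code>; it compares the sum of squared residuals of the plane fitted by <code>lm()</code> with that of two slightly perturbed planes.</p>

```r
# Simulated data only: an outcome y and two numerical explanatory variables.
set.seed(2019)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 3 + 1.5 * x1 - 2 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

# Fit the regression plane y ~ x1 + x2 by least squares.
plane_fit <- lm(y ~ x1 + x2, data = sim)
b <- unname(coef(plane_fit))  # intercept, slope for x1, slope for x2

# Hypothetical helper: sum of squared residuals for any plane b0 + b1*x1 + b2*x2.
ssr <- function(b0, b1, b2) {
  sum((sim$y - (b0 + b1 * sim$x1 + b2 * sim$x2))^2)
}

ssr_best    <- ssr(b[1], b[2], b[3])        # the fitted plane
ssr_shifted <- ssr(b[1] + 0.5, b[2], b[3])  # same plane moved up by 0.5
ssr_tilted  <- ssr(b[1], b[2] + 0.1, b[3])  # same plane tilted along x1

c(best = ssr_best, shifted = ssr_shifted, tilted = ssr_tilted)
```

+<p>Any plane other than the fitted one, such as the shifted and tilted planes here, yields a larger sum of squared residuals.</p>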
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC6.2)</strong> Conduct a new exploratory data analysis with the same outcome variable <span class="math inline">\(y\)</span> being <code>debt</code> but with <code>credit_rating</code> and <code>age</code> as the new explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. Remember, this involves three things:</p>
-<ol style="list-style-type: decimal">
-<li>Most crucially: Looking at the raw data values.</li>
-<li>Computing summary statistics, such as means, medians, and interquartile ranges.</li>
-<li>Creating data visualizations.</li>
-</ol>
-<p>What can you say about the relationship between a credit card holder’s debt and their credit rating and age?</p>
+<p><strong>(LC6.2)</strong> Conduct a new exploratory data analysis with the same outcome variable <span class="math inline">\(y\)</span> <code>debt</code> but with <code>credit_rating</code> and <code>age</code> as the new explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. What can you say about the relationship between a credit card holder’s debt and their credit rating and age?</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="model3table" class="section level3">
 <h3><span class="header-section-number">6.2.2</span> Regression plane</h3>
-<p>Let’s now fit a regression model and get the regression table corresponding to the regression plane in Figure <a href="6-multiple-regression.html#fig:3D-scatterplot">6.6</a>. To keep things brief in this subsection, we won’t consider an interaction model for the two numerical explanatory variables <code>income</code> and <code>credit_limit</code> like we did in Section <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a> using the model formula <code>score ~ age * gender</code>.</p>
-<p>Rather we’ll only consider a model fit with a formula of the form <code>y ~ x1 + x2</code>. Somewhat confusing however, since we now have a regression plane instead of multiple lines, the label “parallel slopes” doesn’t apply when you have two numerical explanatory variables.</p>
-<p>Just as we have done multiple times throughout Chapters <a href="5-regression.html#regression">5</a> and this chapter, let’s get the regression table for this model using our two-step process and display the results in Table <a href="6-multiple-regression.html#tab:model3-table-output">6.10</a></p>
-<ol style="list-style-type: decimal">
-<li>We first “fit” the linear regression model using the <code>lm(y ~ x1 + x2, data)</code> function and save it in <code>debt_model</code>.</li>
-<li>We get the regression table by applying the <code>get_regression_table()</code> from the <code>moderndive</code> package to <code>debt_model</code>.</li>
-</ol>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-debt_model &lt;-<span class="st"> </span><span class="kw">lm</span>(debt <span class="op">~</span><span class="st"> </span>credit_limit <span class="op">+</span><span class="st"> </span>income, <span class="dt">data =</span> credit_ch7)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(debt_model)</code></pre>
+<p>Let’s now fit a regression model and get the regression table corresponding to the regression plane in Figure <a href="6-multiple-regression.html#fig:3D-scatterplot">6.6</a>. To keep things brief in this subsection, we won’t consider an interaction model for the two numerical explanatory variables <code>income</code> and <code>credit_limit</code> like we did in Subsection <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a> using the model formula <code>score ~ age * gender</code>. Rather, we’ll only consider a model fit with a formula of the form <code>y ~ x1 + x2</code>. Confusingly, however, since we now have a regression plane instead of multiple lines, the label “parallel slopes” doesn’t apply when you have two numerical explanatory variables. Just as we have done multiple times throughout Chapter <a href="5-regression.html#regression">5</a> and this chapter, we obtain the regression table for this model using our two-step process and display it in Table <a href="6-multiple-regression.html#tab:model3-table-output">6.10</a>.</p>
+<div class="sourceCode" id="cb201"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb201-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb201-2" data-line-number="2">debt_model &lt;-<span class="st"> </span><span class="kw">lm</span>(debt <span class="op">~</span><span class="st"> </span>credit_limit <span class="op">+</span><span class="st"> </span>income, <span class="dt">data =</span> credit_ch6)</a>
+<a class="sourceLine" id="cb201-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb201-4" data-line-number="4"><span class="kw">get_regression_table</span>(debt_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model3-table-output">TABLE 6.10: </span>Multiple regression table
@@ -2024,25 +2016,29 @@ <h3><span class="header-section-number">6.2.2</span> Regression plane</h3>
 </tr>
 </tbody>
 </table>
-<p>Let’s interpret the three values in the <code>estimate</code> column. First, <code>intercept</code> = -$385.179. The intercept represents the credit card debt for an individual who has <code>credit_limit</code> of $0 and <code>income</code> of $0. In our data however, the intercept has limited practical interpretation since no individuals had <code>credit_limit</code> or <code>income</code> values of $0. Rather, the intercept is used to situate the regression plane in 3D space.</p>
-<p>Second, <code>credit_limit</code> = $0.264. Taking into account all the other explanatory variables in our model, for every increase of one dollar in <code>credit_limit</code>, there is an associated increase of on average $0.26 in credit card debt. Just as we did in Subsection <a href="5-regression.html#model1table">5.1.2</a>, we are cautious <em>not</em> imply causality as we saw in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a> that “correlation is not necessarily causation.” We do this merely stating there was an <em>associated</em> increase.</p>
-<p>Furthermore, we preface our interpretation with the statement “taking into account all the other explanatory variables in our model.” Here, by all other explanatory variables we mean <code>income</code>. We do this to emphasize that we are now jointly interpreting the associated effect of multiple explanatory variables in the same model at the same time.</p>
-<p>Third, <code>income</code> = -$7.663. Taking into account all the other explanatory variables in our model, for every increase of one unit in the variable <code>income</code>, in other words $1000 in actual income, there is an associated decrease of on average $7.663 in credit card debt.</p>
+<ol style="list-style-type: decimal">
+<li>We first “fit” the linear regression model using the <code>lm(y ~ x1 + x2, data)</code> function and save it in <code>debt_model</code>.</li>
+<li>We get the regression table by applying the <code>get_regression_table()</code> function from the <code>moderndive</code> package to <code>debt_model</code>.</li>
+</ol>
+<p>Let’s interpret the three values in the <code>estimate</code> column. First, the <code>intercept</code> value is -$385.179. This intercept represents the credit card debt for an individual who has <code>credit_limit</code> of $0 and <code>income</code> of $0. In our data, however, the intercept has no practical interpretation since no individuals had <code>credit_limit</code> or <code>income</code> values of $0. Rather, the intercept is used to situate the regression plane in 3D space.</p>
+<p>Second, the <code>credit_limit</code> value is $0.264. Taking into account all the other explanatory variables in our model, for every increase of one dollar in <code>credit_limit</code>, there is an associated increase of on average $0.26 in credit card debt. Just as we did in Subsection <a href="5-regression.html#model1table">5.1.2</a>, we are cautious <em>not</em> to imply causality, as we saw in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a> that “correlation is not necessarily causation.” We do this by merely stating there was an <em>associated</em> increase.</p>
+<p>Furthermore, we preface our interpretation with the statement, “taking into account all the other explanatory variables in our model.” Here, by all other explanatory variables we mean <code>income</code>. We do this to emphasize that we are now jointly interpreting the associated effect of multiple explanatory variables in the same model at the same time.</p>
+<p>Third, the <code>income</code> value is -$7.663. Taking into account all the other explanatory variables in our model, for every increase of one unit of <code>income</code> ($1000 in actual income), there is an associated decrease of on average $7.66 in credit card debt.</p>
 <p>Putting these results together, the equation of the regression plane that gives us fitted values <span class="math inline">\(\widehat{y}\)</span> = <span class="math inline">\(\widehat{\text{debt}}\)</span> is:</p>
 <p><span class="math display">\[
 \begin{aligned}
 \widehat{y} &amp;= b_0 + b_1 \cdot x_1 +  b_2 \cdot x_2\\
 \widehat{\text{debt}} &amp;= b_0 + b_{\text{limit}} \cdot \text{limit} + b_{\text{income}} \cdot \text{income}\\
-&amp;= -387.179 + 0.263 \cdot\text{limit} - 7.663 \cdot\text{income}
+&amp;= -385.179 + 0.264 \cdot\text{limit} - 7.663 \cdot\text{income}
 \end{aligned}
 \]</span></p>
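+<p>As a minimal sketch of how this equation produces a fitted value, the code below plugs one hypothetical credit card holder into the regression plane. The values <code>credit_limit = 5000</code> and <code>income = 50</code> (that is, $50,000 in actual income) and the name <code>new_holder</code> are made up purely for illustration; <code>predict()</code> and <code>coef()</code> are base R functions, and <code>debt_model</code> is the model fit earlier in this subsection.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical credit card holder (illustrative values, not from the data)
+new_holder &lt;- data.frame(credit_limit = 5000, income = 50)
+
+# Fitted value on the regression plane, computed by R:
+predict(debt_model, newdata = new_holder)
+
+# The same fitted value computed "by hand" as b0 + b1 * x1 + b2 * x2:
+b &lt;- coef(debt_model)
+unname(b["(Intercept)"] + b["credit_limit"] * 5000 + b["income"] * 50)</code></pre>
+<p>Both computations should return the same number, since <code>predict()</code> applies the equation of the regression plane using the unrounded estimates from the regression table.</p>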
-<p>Recall in the right-hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a> that when plotting the relationship between <code>debt</code> and <code>income</code> in isolation, there appeared to be a <em>positive</em> relationship. In the last discussed multiple regression however, when <em>jointly</em> modeling the relationship between <code>debt</code>, <code>credit_limit</code>, and <code>income</code>, there appears to be a <em>negative</em> relationship of <code>debt</code> and <code>income</code> as evidenced by the negative slope for <code>income</code> of -$7.663. What explains these contradictory results? A phenomenon known as <em>Simpson’s Paradox</em>, whereby overall trends that exist in aggregate either disappear or reverse when the data are broken down into groups. In Subsection <a href="6-multiple-regression.html#simpsonsparadox">6.3.3</a> we elaborate on this idea by looking at the relationship between <code>credit_limit</code> and credit card <code>debt</code>, but split along different <code>income</code> brackets.</p>
+<p>Recall in the right-hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a> that when plotting the relationship between <code>debt</code> and <code>income</code> in isolation, there appeared to be a <em>positive</em> relationship. In the multiple regression we just discussed, however, when <em>jointly</em> modeling the relationship between <code>debt</code>, <code>credit_limit</code>, and <code>income</code>, there appears to be a <em>negative</em> relationship between <code>debt</code> and <code>income</code>, as evidenced by the negative slope for <code>income</code> of -$7.663. What explains these contradictory results? A phenomenon known as <em>Simpson’s Paradox</em>, whereby overall trends that exist in aggregate either disappear or reverse when the data are broken down into groups. In Subsection <a href="6-multiple-regression.html#simpsonsparadox">6.3.3</a> we elaborate on this idea by looking at the relationship between <code>credit_limit</code> and credit card <code>debt</code>, but split along different <code>income</code> brackets.</p>
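+<p>One way to see this reversal in numbers rather than in plots is to compare the slope for <code>income</code> from a simple linear regression with the slope from the multiple regression. The sketch below assumes <code>credit_ch6</code> and <code>debt_model</code> exist as defined earlier in this chapter; the name <code>debt_income_model</code> is introduced here only for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Simple linear regression: debt as a function of income in isolation
+debt_income_model &lt;- lm(debt ~ income, data = credit_ch6)
+get_regression_table(debt_income_model)
+
+# Compare the two slopes for income side by side:
+coef(debt_income_model)["income"]  # income on its own
+coef(debt_model)["income"]         # income, taking credit_limit into account</code></pre>
+<p>If the simple regression agrees with the right-hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a>, its slope for <code>income</code> should be positive, while the slope from <code>debt_model</code> is the negative value reported in the regression table.</p>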
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC6.3)</strong> Fit a new simple linear regression using <code>lm(debt ~ credit_rating + age, data = credit_ch7)</code> where <code>credit_rating</code> and <code>age</code> are the new numerical explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. Get information about the “best-fitting” regression plane from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your previous exploratory data analysis?</p>
+<p><strong>(LC6.3)</strong> Fit a new multiple regression model using <code>lm(debt ~ credit_rating + age, data = credit_ch6)</code> where <code>credit_rating</code> and <code>age</code> are the new numerical explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. Get information about the “best-fitting” regression plane from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your previous exploratory data analysis?</p>
 <div class="learncheck">
 
 </div>
@@ -2052,14 +2048,13 @@ <h3><span class="header-section-number">6.2.3</span> Observed/fitted values and
 <p>Let’s also compute all fitted values and residuals for our regression model using the <code>get_regression_points()</code> function and present only the first 10 rows of output in Table <a href="6-multiple-regression.html#tab:model3-points-table">6.11</a>. Remember that the coordinates of each of the blue points in our 3D scatterplot in Figure <a href="6-multiple-regression.html#fig:3D-scatterplot">6.6</a> can be found in the <code>income</code>, <code>credit_limit</code>, and <code>debt</code> columns. The fitted values on the regression plane are found in the <code>debt_hat</code> column and are computed using our equation for the regression plane in the previous section:</p>
 <p><span class="math display">\[
 \begin{aligned}
-\widehat{y} = \widehat{\text{debt}} &amp;= -387.179 + 0.263 \cdot \text{limit} - 7.663 \cdot \text{income}
+\widehat{y} = \widehat{\text{debt}} &amp;= -385.179 + 0.264 \cdot \text{limit} - 7.663 \cdot \text{income}
 \end{aligned}
 \]</span></p>
-<pre class="sourceCode r"><code class="sourceCode r">regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(debt_model)
-regression_points</code></pre>
+<div class="sourceCode" id="cb202"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb202-1" data-line-number="1"><span class="kw">get_regression_points</span>(debt_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:model3-points-table">TABLE 6.11: </span>Regression points (First 10 credit card holders out of 400).
+<span id="tab:model3-points-table">TABLE 6.11: </span>Regression points (First 10 credit card holders out of 400)
 </caption>
 <thead>
 <tr>
@@ -2292,57 +2287,55 @@ <h3><span class="header-section-number">6.2.3</span> Observed/fitted values and
 <h2><span class="header-section-number">6.3</span> Related topics</h2>
 <div id="model-selection" class="section level3">
 <h3><span class="header-section-number">6.3.1</span> Model selection</h3>
-<p>When do we use an interaction model versus a parallel slopes model? Recall in Sections <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a> and <a href="6-multiple-regression.html#model4table">6.1.3</a> we fit both interaction and parallel slopes models for the outcome variable <span class="math inline">\(y\)</span> teaching score using a numerical explanatory variable <span class="math inline">\(x_1\)</span> age and a categorical explanatory variable <span class="math inline">\(x_2\)</span> gender (recorded as a binary variable). We compared these models in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison">6.3</a>, which we display again now.</p>
+<p>When should we use an interaction model versus a parallel slopes model? Recall in Sections <a href="6-multiple-regression.html#model4interactiontable">6.1.2</a> and <a href="6-multiple-regression.html#model4table">6.1.3</a> we fit both interaction and parallel slopes models for the outcome variable <span class="math inline">\(y\)</span> (teaching score) using a numerical explanatory variable <span class="math inline">\(x_1\)</span> (age) and a categorical explanatory variable <span class="math inline">\(x_2\)</span> (gender recorded as a binary variable). We compared these models in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison">6.3</a>, which we display again now.</p>
 <div class="figure" style="text-align: center"><span id="fig:recall-parallel-vs-interaction"></span>
-<img src="moderndive_files/figure-html/recall-parallel-vs-interaction-1.png" alt="Previously seen comparison of interaction and parallel slopes models." width="\textwidth" />
+<img src="ModernDive_files/figure-html/recall-parallel-vs-interaction-1.png" alt="Previously seen comparison of interaction and parallel slopes models." width="\textwidth" />
 <p class="caption">
 FIGURE 6.7: Previously seen comparison of interaction and parallel slopes models.
 </p>
 </div>
-<p>A lot of you might have asked yourselves: “Why would I force the lines to have parallel slopes (as seen in the right-hand plot) when they clearly have different slopes (as seen in the left-hand plot).”</p>
-<p>The answer lies in a philosophical principle known as “Occam’s Razor.” It states that “all other things being equal, simpler solutions are more likely to be correct than complex ones.” When viewed in a modeling framework, Occam’s Razor  can be restated as “all other things being equal, simpler models are to be preferred over complex ones.” In other words, we should only favor the more complex model if the additional complexity is <em>warranted</em>.</p>
+<p>A lot of you might have asked yourselves: “Why would I force the lines to have parallel slopes (as seen in the right-hand plot) when they clearly have different slopes (as seen in the left-hand plot)?”</p>
+<p>The answer lies in a philosophical principle known as “Occam’s Razor.” It states that, “all other things being equal, simpler solutions are more likely to be correct than complex ones.” When viewed in a modeling framework, Occam’s Razor  can be restated as, “all other things being equal, simpler models are to be preferred over complex ones.” In other words, we should only favor the more complex model if the additional complexity is <em>warranted</em>.</p>
 <p>Let’s revisit the equations for the regression line for both the interaction and parallel slopes model:</p>
 <p><span class="math display">\[
 \begin{aligned}
-\text{Interaction} &amp;: \widehat{y} = \widehat{\text{score}} = b_0 + b_{\mbox{age}} \cdot \mbox{age} + b_{\mbox{male}} \cdot \mathbb{1}_{\mbox{is male}}(x) + \\
-&amp; \qquad b_{\mbox{age,male}} \cdot \mbox{age} \cdot \mathbb{1}_{\mbox{is male}}\\
-\text{Parallel slopes} &amp;: \widehat{y} = \widehat{\text{score}} = b_0 + b_{\mbox{age}} \cdot \mbox{age} + b_{\mbox{male}} \cdot \mathbb{1}_{\mbox{is male}}(x)
+\text{Interaction} &amp;: \widehat{y} = \widehat{\text{score}} = b_0 + b_{\text{age}} \cdot \text{age} + b_{\text{male}} \cdot \mathbb{1}_{\text{is male}}(x) + \\
+&amp; \qquad b_{\text{age,male}} \cdot \text{age} \cdot \mathbb{1}_{\text{is male}}\\
+\text{Parallel slopes} &amp;: \widehat{y} = \widehat{\text{score}} = b_0 + b_{\text{age}} \cdot \text{age} + b_{\text{male}} \cdot \mathbb{1}_{\text{is male}}(x)
 \end{aligned}
 \]</span></p>
-<p>The interaction model is “more complex” in that there is an additional <span class="math inline">\(b_{\mbox{age,male}} \cdot \mbox{age} \cdot \mathbb{1}_{\mbox{is male}}\)</span> element to the equation not present for the parallel slopes model. Or viewed alternatively, the regression table for the interaction model in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a> has <em>four</em> rows, whereas the regression table for the parallel slopes model in Table <a href="6-multiple-regression.html#tab:regtable-parallel-slopes">6.5</a> has <em>three</em> rows. The question becomes: “Is this additional complexity warranted?” In this case, it can be argued that this additional complexity is warranted, as evidenced by the clear x-shaped pattern of the two regression lines in the left-hand plot of Figure <a href="6-multiple-regression.html#fig:recall-parallel-vs-interaction">6.7</a>.</p>
-<p>However, let’s consider an example where the additional complexity might <em>not</em> be warranted. Let’s consider the <code>MA_schools</code> data which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education; read the help file for this data by running <code>?MA_schools</code> if you would like more details.</p>
-<p>Let’s model the numerical outcome variable <span class="math inline">\(y\)</span>, average SAT math score for that high school, as a function of two explanatory variables:</p>
+<p>The interaction model is “more complex” in that there is an additional <span class="math inline">\(b_{\text{age,male}} \cdot \text{age} \cdot \mathbb{1}_{\text{is male}}\)</span> interaction term in the equation not present for the parallel slopes model. Or viewed alternatively, the regression table for the interaction model in Table <a href="6-multiple-regression.html#tab:regtable-interaction">6.3</a> has <em>four</em> rows, whereas the regression table for the parallel slopes model in Table <a href="6-multiple-regression.html#tab:regtable-parallel-slopes">6.5</a> has <em>three</em> rows. The question becomes: “Is this additional complexity warranted?” In this case, it can be argued that this additional complexity is warranted, as evidenced by the clear x-shaped pattern of the two regression lines in the left-hand plot of Figure <a href="6-multiple-regression.html#fig:recall-parallel-vs-interaction">6.7</a>.</p>
+<p>However, let’s consider an example where the additional complexity might <em>not</em> be warranted. Let’s consider the <code>MA_schools</code> data included in the <code>moderndive</code> package which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education. For more details, read the help file for this data by running <code>?MA_schools</code> in the console.</p>
+<p>Let’s model the numerical outcome variable <span class="math inline">\(y\)</span>, average SAT math score for a given high school, as a function of two explanatory variables:</p>
 <ol style="list-style-type: decimal">
 <li>A numerical explanatory variable <span class="math inline">\(x_1\)</span>, the percentage of that high school’s student body that is economically disadvantaged, and</li>
-<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>, the school size as measured by enrollment: small (13-341 students), medium (342-541 students), and large (542-4264 students)</li>
+<li>A categorical explanatory variable <span class="math inline">\(x_2\)</span>, the school size as measured by enrollment: small (13-341 students), medium (342-541 students), and large (542-4264 students).</li>
 </ol>
-<p>Let’s create visualizations of both the interaction and parallel slopes model once again and display the output in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison-2">6.8</a>. Recall from Subsection <a href="6-multiple-regression.html#model4table">6.1.3</a> that the <code>gg_parallel_slopes()</code> function is a special purpose function included in the <code>moderndive</code> package, since the <code>ggplot2</code> package does not include a function for plotting parallel slopes models.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Interaction model</span>
-<span class="kw">ggplot</span>(MA_schools, 
-       <span class="kw">aes</span>(<span class="dt">x =</span> perc_disadvan, <span class="dt">y =</span> average_sat_math, <span class="dt">color =</span> size)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.25</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span> ) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Percent economically disadvantaged&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Math SAT Score&quot;</span>, 
-       <span class="dt">color =</span> <span class="st">&quot;School size&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;Interaction model&quot;</span>)
-
-<span class="co"># Parallel slopes model</span>
-<span class="kw">gg_parallel_slopes</span>(<span class="dt">y =</span> <span class="st">&quot;average_sat_math&quot;</span>, <span class="dt">num_x =</span> <span class="st">&quot;perc_disadvan&quot;</span>, 
-                   <span class="dt">cat_x =</span> <span class="st">&quot;size&quot;</span>, <span class="dt">data =</span> MA_schools, <span class="dt">alpha =</span> <span class="fl">0.25</span>) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Percent economically disadvantaged&quot;</span>, 
-       <span class="dt">y =</span> <span class="st">&quot;Math SAT Score&quot;</span>, 
-       <span class="dt">color =</span> <span class="st">&quot;School size&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;Parallel slopes model&quot;</span>) </code></pre>
+<p>Let’s create visualizations of both the interaction and parallel slopes model once again and display the output in Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison-2">6.8</a>. Recall from Subsection <a href="6-multiple-regression.html#model4table">6.1.3</a> that the <code>geom_parallel_slopes()</code> function is a special purpose function included in the <code>moderndive</code> package, since the <code>geom_smooth()</code> method in the <code>ggplot2</code> package does not have a convenient way to plot parallel slopes models.</p>
+<div class="sourceCode" id="cb203"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb203-1" data-line-number="1"><span class="co"># Interaction model</span></a>
+<a class="sourceLine" id="cb203-2" data-line-number="2"><span class="kw">ggplot</span>(MA_schools, </a>
+<a class="sourceLine" id="cb203-3" data-line-number="3">       <span class="kw">aes</span>(<span class="dt">x =</span> perc_disadvan, <span class="dt">y =</span> average_sat_math, <span class="dt">color =</span> size)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb203-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.25</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb203-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb203-6" data-line-number="6"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Percent economically disadvantaged&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Math SAT Score&quot;</span>, </a>
+<a class="sourceLine" id="cb203-7" data-line-number="7">       <span class="dt">color =</span> <span class="st">&quot;School size&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;Interaction model&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb204"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb204-1" data-line-number="1"><span class="co"># Parallel slopes model</span></a>
+<a class="sourceLine" id="cb204-2" data-line-number="2"><span class="kw">ggplot</span>(MA_schools, </a>
+<a class="sourceLine" id="cb204-3" data-line-number="3">       <span class="kw">aes</span>(<span class="dt">x =</span> perc_disadvan, <span class="dt">y =</span> average_sat_math, <span class="dt">color =</span> size)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb204-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> <span class="fl">0.25</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb204-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_parallel_slopes</span>(<span class="dt">se =</span> <span class="ot">FALSE</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb204-6" data-line-number="6"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Percent economically disadvantaged&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Math SAT Score&quot;</span>, </a>
+<a class="sourceLine" id="cb204-7" data-line-number="7">       <span class="dt">color =</span> <span class="st">&quot;School size&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;Parallel slopes model&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:numxcatx-comparison-2"></span>
-<img src="moderndive_files/figure-html/numxcatx-comparison-2-1.png" alt="Comparison of interaction and parallel slopes models for MA schools." width="\textwidth" />
+<img src="ModernDive_files/figure-html/numxcatx-comparison-2-1.png" alt="Comparison of interaction and parallel slopes models for Massachusetts schools." width="\textwidth" />
 <p class="caption">
-FIGURE 6.8: Comparison of interaction and parallel slopes models for MA schools.
+FIGURE 6.8: Comparison of interaction and parallel slopes models for Massachusetts schools.
 </p>
 </div>
-<p>Look closely at the left-hand plot of Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison-2">6.8</a> corresponding to an interaction model. While the slopes are indeed different, they do not differ <em>by much</em>. In other words, they are near identical. Now look compare the left-hand plot with the right-hand plot corresponding to a parallel slopes model. The two models don’t appear all that different. Therefore in this case, it can be argued that the additional complexity of the interaction model is <em>not warranted</em>. Thus following Occam’s Razor, we should prefer the “simpler” parallel slopes model.</p>
-<p>Let’s explicitly define what “simpler” means in this case. Let’s compare the regression tables for the interaction and parallel slopes models in Tables <a href="6-multiple-regression.html#tab:model2-interaction">6.12</a> and <a href="6-multiple-regression.html#tab:model2-parallel-slopes">6.13</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">model_<span class="dv">2</span>_interaction &lt;-<span class="st"> </span><span class="kw">lm</span>(average_sat_math <span class="op">~</span><span class="st"> </span>perc_disadvan <span class="op">*</span><span class="st"> </span>size, 
-                          <span class="dt">data =</span> MA_schools)
-<span class="kw">get_regression_table</span>(model_<span class="dv">2</span>_interaction)</code></pre>
+<p>Look closely at the left-hand plot of Figure <a href="6-multiple-regression.html#fig:numxcatx-comparison-2">6.8</a> corresponding to an interaction model. While the slopes are indeed different, they do not differ <em>by much</em> and are nearly identical. Now compare the left-hand plot with the right-hand plot corresponding to a parallel slopes model. The two models don’t appear all that different. So in this case, it can be argued that the additional complexity of the interaction model is <em>not warranted</em>. Thus following Occam’s Razor, we should prefer the “simpler” parallel slopes model. Let’s explicitly define what “simpler” means in this case. Let’s compare the regression tables for the interaction and parallel slopes models in Tables <a href="6-multiple-regression.html#tab:model2-interaction">6.12</a> and <a href="6-multiple-regression.html#tab:model2-parallel-slopes">6.13</a>.</p>
+<div class="sourceCode" id="cb205"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb205-1" data-line-number="1">model_<span class="dv">2</span>_interaction &lt;-<span class="st"> </span><span class="kw">lm</span>(average_sat_math <span class="op">~</span><span class="st"> </span>perc_disadvan <span class="op">*</span><span class="st"> </span>size, </a>
+<a class="sourceLine" id="cb205-2" data-line-number="2">                          <span class="dt">data =</span> MA_schools)</a>
+<a class="sourceLine" id="cb205-3" data-line-number="3"><span class="kw">get_regression_table</span>(model_<span class="dv">2</span>_interaction)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model2-interaction">TABLE 6.12: </span>Interaction model regression table
@@ -2513,9 +2506,9 @@ <h3><span class="header-section-number">6.3.1</span> Model selection</h3>
 </tr>
 </tbody>
 </table>
-<pre class="sourceCode r"><code class="sourceCode r">model_<span class="dv">2</span>_parallel_slopes &lt;-<span class="st"> </span><span class="kw">lm</span>(average_sat_math <span class="op">~</span><span class="st"> </span>perc_disadvan <span class="op">+</span><span class="st"> </span>size, 
-                              <span class="dt">data =</span> MA_schools)
-<span class="kw">get_regression_table</span>(model_<span class="dv">2</span>_parallel_slopes)</code></pre>
+<div class="sourceCode" id="cb206"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb206-1" data-line-number="1">model_<span class="dv">2</span>_parallel_slopes &lt;-<span class="st"> </span><span class="kw">lm</span>(average_sat_math <span class="op">~</span><span class="st"> </span>perc_disadvan <span class="op">+</span><span class="st"> </span>size, </a>
+<a class="sourceLine" id="cb206-2" data-line-number="2">                              <span class="dt">data =</span> MA_schools)</a>
+<a class="sourceLine" id="cb206-3" data-line-number="3"><span class="kw">get_regression_table</span>(model_<span class="dv">2</span>_parallel_slopes)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:model2-parallel-slopes">TABLE 6.13: </span>Parallel slopes regression table
@@ -2641,16 +2634,19 @@ <h3><span class="header-section-number">6.3.1</span> Model selection</h3>
 </tbody>
 </table>
 <p>Observe how the regression table for the interaction model has 2 more rows (6 versus 4). This reflects the additional “complexity” of the interaction model over the parallel slopes model.</p>
-<p>Furthermore, note in Table <a href="6-multiple-regression.html#tab:model2-interaction">6.12</a> how the <em>offsets for the slopes</em> <code>perc_disadvan:sizemedium</code> = 0.146 and <code>perc_disadvan:sizelarge</code> = 0.189 are very small relative to the <em>slope for the baseline group</em> of small schools. In other words, all three slopes for are similarly negative: -2.932 for small schools, -2.786 (= -2.932 + 0.146) for medium schools, and -2.743 (= -2.932 + 0.146) for large schools. These results are suggesting that irrespective of school size, the relationship between average math SAT scores and the percent of the student body that is economically disadvantaged is similar and alas very negative.</p>
-<p>What you have just performed is a rudimentary <em>model selection</em>: choosing which model fits data best among a set of candidate models. While the model selection you just performed was somewhat qualitative fashion, more statistically rigorous methods exist. If you’re curious, take a course on multiple regression!</p>
+<p>Furthermore, note in Table <a href="6-multiple-regression.html#tab:model2-interaction">6.12</a> how the <em>offsets for the slopes</em>, 0.146 for <code>perc_disadvan:sizemedium</code> and 0.189 for <code>perc_disadvan:sizelarge</code>, are small relative to the <em>slope for the baseline group</em> of small schools, which is <span class="math inline">\(-2.932\)</span>. In other words, all three slopes are similarly negative: <span class="math inline">\(-2.932\)</span> for small schools, <span class="math inline">\(-2.786\)</span> <span class="math inline">\((=-2.932 + 0.146)\)</span> for medium schools, and <span class="math inline">\(-2.743\)</span> <span class="math inline">\((=-2.932 + 0.189)\)</span> for large schools. These results suggest that irrespective of school size, the relationship between average math SAT scores and the percent of the student body that is economically disadvantaged is similar and, alas, quite negative.</p>
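+<p>As a quick check of the slope arithmetic, the following sketch adds each offset to the baseline slope. The three numbers are copied from the interaction model regression table; the object names <code>baseline_slope</code>, <code>offsets</code>, and <code>group_slopes</code> are made up for this illustration and do not appear elsewhere in the book.</p>

```r
# Values copied from the interaction model regression table (Table 6.12).
# Object names are hypothetical and used only for this illustration.
baseline_slope <- -2.932   # slope of perc_disadvan for small schools (baseline)
offsets <- c(small = 0, medium = 0.146, large = 0.189)

# Each group's slope is the baseline slope plus that group's offset
group_slopes <- baseline_slope + offsets
group_slopes
```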
+<p>What you have just performed is a rudimentary <em>model selection</em>: choosing which model fits data best among a set of candidate models. While the model selection approach we just took was visual in nature and hence somewhat qualitative, more statistically rigorous methods for model selection exist in the fields of multiple regression and statistical/machine learning.</p>
+<!--
+TODO:
+Given that intercepts in the parallel slopes model in the right-hand plot of Figure \@ref(fig:numxcatx-comparison-2) are also similar as well, it can be argued that an even better model is one without the categorical variable school size. 
+-->
 </div>
 <div id="correlationcoefficient2" class="section level3">
 <h3><span class="header-section-number">6.3.2</span> Correlation coefficient</h3>
 <p>Recall from Table <a href="6-multiple-regression.html#tab:model3-correlation">6.9</a> that the correlation coefficient between <code>income</code> in thousands of dollars and credit card <code>debt</code> was 0.464. What if instead we looked at the correlation coefficient between <code>income</code> and credit card <code>debt</code>, but where <code>income</code> was in dollars and not thousands of dollars? This can be done by multiplying <code>income</code> by 1000.</p>
-<pre class="sourceCode r"><code class="sourceCode r">credit_ch7 <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">select</span>(debt, income) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">income =</span> income <span class="op">*</span><span class="st"> </span><span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">cor</span>()</code></pre>
+<div class="sourceCode" id="cb207"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb207-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">select</span>(debt, income) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb207-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">income =</span> income <span class="op">*</span><span class="st"> </span><span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb207-3" data-line-number="3"><span class="st">  </span><span class="kw">cor</span>()</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:cor-credit-2">TABLE 6.14: </span>Correlation between income (in dollars) and credit card debt
@@ -2692,21 +2688,21 @@ <h3><span class="header-section-number">6.3.2</span> Correlation coefficient</h3
 </tr>
 </tbody>
 </table>
-<p>We see it is the same! We say that the correlation coefficient is <em>invariant to linear transformations</em>! In other words, the correlation between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> will be the same as the correlation between <span class="math inline">\(a\cdot x + b\)</span> and <span class="math inline">\(y\)</span> for any numerical values <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>.</p>
+<p>We see it is the same! We say that the correlation coefficient is <em>invariant to linear transformations</em>. The correlation between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> will be the same as the correlation between <span class="math inline">\(a\cdot x + b\)</span> and <span class="math inline">\(y\)</span> for any positive numerical value <span class="math inline">\(a\)</span> and any numerical value <span class="math inline">\(b\)</span>. (A negative <span class="math inline">\(a\)</span> flips the sign of the correlation but leaves its magnitude unchanged.)</p>
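+<p>The invariance can also be checked on simulated data. The sketch below is only an illustration: the vectors <code>income</code> and <code>debt</code> are randomly generated stand-ins, not the book's <code>credit_ch6</code> data, and the coefficients used to generate them are arbitrary. It compares the correlation before and after rescaling and shifting <code>income</code>, and also shows that multiplying by a negative number flips the sign.</p>

```r
# Simulated stand-in data (hypothetical; NOT the book's credit_ch6 data frame)
set.seed(76)
income <- runif(100, min = 10, max = 190)         # "income" in thousands of dollars
debt <- 250 + 6 * income + rnorm(100, sd = 300)   # arbitrary linear trend plus noise

r_thousands <- cor(income, debt)                  # original units
r_dollars   <- cor(income * 1000, debt)           # a = 1000, b = 0
r_shifted   <- cor(2.5 * income + 40, debt)       # a = 2.5,  b = 40
r_flipped   <- cor(-1 * income, debt)             # a = -1 reverses the sign

c(thousands = r_thousands, dollars = r_dollars,
  shifted = r_shifted, flipped = r_flipped)
```

+<p>The first three values should agree up to floating-point rounding, while the last should be their negative.</p>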
 </div>
 <div id="simpsonsparadox" class="section level3">
 <h3><span class="header-section-number">6.3.3</span> Simpson’s Paradox</h3>
-<p>Recall in Section <a href="6-multiple-regression.html#model3">6.2</a>, we saw the two seemingly contradictory results when studying the relationship between credit card debt and income. On the one hand, the right hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a> suggested that the relationship between credit card debt and income was <em>positive</em>. We re-display this plot in Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a>.</p>
+<p>Recall that in Section <a href="6-multiple-regression.html#model3">6.2</a> we saw two seemingly contradictory results when studying the relationship between credit card <code>debt</code> and <code>income</code>. On the one hand, the right-hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot1">6.5</a> suggested that the relationship between credit card <code>debt</code> and <code>income</code> was <em>positive</em>. We re-display this in Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:2numxplot1-repeat"></span>
-<img src="moderndive_files/figure-html/2numxplot1-repeat-1.png" alt="Relationship between credit card debt and income." width="\textwidth" />
+<img src="ModernDive_files/figure-html/2numxplot1-repeat-1.png" alt="Relationship between credit card debt and income." width="\textwidth" />
 <p class="caption">
 FIGURE 6.9: Relationship between credit card debt and income.
 </p>
 </div>
-<p>On the other hand, the multiple regression table in Table <a href="6-multiple-regression.html#tab:model3-table-output">6.10</a> suggested that the relationship between debt and income was <em>negative</em>. We re-display this table in Table <a href="6-multiple-regression.html#tab:model3-table-output-repeat">6.15</a>.</p>
+<p>On the other hand, the multiple regression results in Table <a href="6-multiple-regression.html#tab:model3-table-output">6.10</a> suggested that the relationship between <code>debt</code> and <code>income</code> was <em>negative</em>. We re-display this information in Table <a href="6-multiple-regression.html#tab:model3-table-output-repeat">6.15</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:model3-table-output-repeat">TABLE 6.15: </span>Multiple regression table
+<span id="tab:model3-table-output-repeat">TABLE 6.15: </span>Multiple regression results
 </caption>
 <thead>
 <tr>
@@ -2805,37 +2801,37 @@ <h3><span class="header-section-number">6.3.3</span> Simpson’s Paradox</h3>
 </tr>
 </tbody>
 </table>
-<p>Observe how the slope for income is -7.663 and, most importantly for now, it is negative. This contradicts our observation in Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a> that the relationship is positive. How can this be? Recall the interpretation of the slope for <code>income</code> in the context of a multiple regression model: <em>taking into account all the other explanatory variables in our model</em>, for every increase of one unit in income (i.e. $1000), there is an associated decrease of on average $7.663 in debt.</p>
-<p>In other words, while in <em>isolation</em> the relationship between debt and income may be positive, when taking into account credit limit as well, this relationship becomes negative. These seemingly paradoxical results are due to a phenomenon aptly named <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox"><em>Simpson’s Paradox</em></a>. Simpson’s paradox occurs when trends that exist for the data in aggregate either disappear or reverse when the data are broken down into groups.</p>
-<p>Let’s show how Simpson’s Paradox manifests itself in the <code>credit_ch7</code> data. Let’s first visualize the distribution of the numerical explanatory variable credit limit with a histogram in Figure <a href="6-multiple-regression.html#fig:credit-limit-quartiles">6.10</a>.</p>
+<p>Observe how the slope for <code>income</code> is <span class="math inline">\(-7.663\)</span> and, most importantly for now, it is negative. This contradicts our observation in Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a> that the relationship is positive. How can this be? Recall the interpretation of the slope for <code>income</code> in the context of a multiple regression model: <em>taking into account all the other explanatory variables in our model</em>, for every increase of one unit in <code>income</code> (i.e., $1000), there is an associated decrease of on average $7.663 in <code>debt</code>.</p>
+<p>In other words, while in <em>isolation</em>, the relationship between <code>debt</code> and <code>income</code> may be positive, when taking into account <code>credit_limit</code> as well, this relationship becomes negative. These seemingly paradoxical results are due to a phenomenon aptly named <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox"><em>Simpson’s Paradox</em></a>. Simpson’s Paradox occurs when trends that exist for the data in aggregate either disappear or reverse when the data are broken down into groups.</p>
+<p>Let’s show how Simpson’s Paradox manifests itself in the <code>credit_ch6</code> data. Let’s first visualize the distribution of the numerical explanatory variable <code>credit_limit</code> with a histogram in Figure <a href="6-multiple-regression.html#fig:credit-limit-quartiles">6.10</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:credit-limit-quartiles"></span>
-<img src="moderndive_files/figure-html/credit-limit-quartiles-1.png" alt="Histogram of credit limits and brackets." width="\textwidth" />
+<img src="ModernDive_files/figure-html/credit-limit-quartiles-1.png" alt="Histogram of credit limits and brackets." width="\textwidth" />
 <p class="caption">
 FIGURE 6.10: Histogram of credit limits and brackets.
 </p>
 </div>
-<p>The vertical dashed lines are the <em>quartiles</em> that cut up the variable credit limit into four equally sized groups. Let’s think of these quartiles as converting our numerical variable credit limit into a categorical variable “credit limit bracket” with 4 levels. This means</p>
+<p>The vertical dashed lines are the <em>quartiles</em> that cut up the variable <code>credit_limit</code> into four equally sized groups. Let’s think of these quartiles as converting our numerical variable <code>credit_limit</code> into a categorical variable “<code>credit_limit</code> bracket” with four levels. This means that</p>
 <ol style="list-style-type: decimal">
-<li>25% of credit limits were between $0 and $3088. Let’s assign these 100 people to the “low” credit limit bracket.</li>
-<li>25% of credit limits were between $3088 and $4622. Let’s assign these 100 people to the “medium-low” credit limit bracket.</li>
-<li>25% of credit limits were between $4622 and $5873. Let’s assign these 100 people to the “medium-high” credit limit bracket.</li>
-<li>25% of credit limits were over $5873. Let’s assign these 100 people to the “high” credit limit bracket.</li>
+<li>25% of credit limits were between $0 and $3088. Let’s assign these 100 people to the “low” <code>credit_limit</code> bracket.</li>
+<li>25% of credit limits were between $3088 and $4622. Let’s assign these 100 people to the “medium-low” <code>credit_limit</code> bracket.</li>
+<li>25% of credit limits were between $4622 and $5873. Let’s assign these 100 people to the “medium-high” <code>credit_limit</code> bracket.</li>
+<li>25% of credit limits were over $5873. Let’s assign these 100 people to the “high” <code>credit_limit</code> bracket.</li>
 </ol>
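+<p>One way to build such a bracket variable is with base R's <code>quantile()</code> and <code>cut()</code> functions. The sketch below is a minimal illustration under stated assumptions: the <code>credit_limit</code> vector is simulated (400 made-up values, not the book's data), and the names <code>breaks</code> and <code>bracket</code> are hypothetical. It is not necessarily how the book's figure was produced.</p>

```r
# Simulated stand-in for the credit limits of 400 people (hypothetical values)
set.seed(2019)
credit_limit <- runif(400, min = 1000, max = 14000)

# Minimum, the three quartiles, and maximum serve as the bracket boundaries
breaks <- quantile(credit_limit, probs = c(0, 0.25, 0.5, 0.75, 1))

# Convert the numerical variable into a categorical one with four levels
bracket <- cut(credit_limit, breaks = breaks, include.lowest = TRUE,
               labels = c("low", "medium-low", "medium-high", "high"))

table(bracket)   # number of people in each bracket
```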
-<p>Now in Figure <a href="6-multiple-regression.html#fig:2numxplot4">6.11</a> let’s re-display two versions of the scatterplot of debt and income from Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a>, but with a slight twist:</p>
+<p>Now in Figure <a href="6-multiple-regression.html#fig:2numxplot4">6.11</a> let’s re-display two versions of the scatterplot of <code>debt</code> and <code>income</code> from Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a>, but with a slight twist:</p>
 <ol style="list-style-type: decimal">
-<li>The left-hand plot shows the regular scatterplot and the single regression line, just as you saw previously.</li>
-<li>The right-hand plot shows the <em>colored scatterplot</em>, where the color aesthetic is mapped to “credit limit bracket.” Furthermore, there are now four separate regression lines.</li>
-<li>In other words, the location of the 400 points are the same in both scatterplots, but the right-hand plot shows an additional variable of information: credit limit bracket.</li>
+<li>The left-hand plot shows the regular scatterplot and the single regression line, just as you saw in Figure <a href="6-multiple-regression.html#fig:2numxplot1-repeat">6.9</a>.</li>
+<li>The right-hand plot shows the <em>colored scatterplot</em>, where the color aesthetic is mapped to “<code>credit_limit</code> bracket.” Furthermore, there are now four separate regression lines.</li>
 </ol>
+<p>In other words, the locations of the 400 points are the same in both scatterplots, but the right-hand plot shows one additional variable: <code>credit_limit</code> bracket.</p>
 <div class="figure" style="text-align: center"><span id="fig:2numxplot4"></span>
-<img src="moderndive_files/figure-html/2numxplot4-1.png" alt="Relationship between credit card debt and income by credit limit bracket." width="\textwidth" />
+<img src="ModernDive_files/figure-html/2numxplot4-1.png" alt="Relationship between credit card debt and income by credit limit bracket." width="\textwidth" />
 <p class="caption">
 FIGURE 6.11: Relationship between credit card debt and income by credit limit bracket.
 </p>
 </div>
-<p>The left-hand plot of <a href="6-multiple-regression.html#fig:2numxplot4">6.11</a> focuses on the relationship between debt and income in <em>aggregate</em>. It is suggesting that overall there exists a positive relationship between debt and income. However, the right-hand plot of <a href="6-multiple-regression.html#fig:2numxplot4">6.11</a> focuses on the relationship between debt and income <em>broken down by credit limit bracket</em>. In other words, we focus on four <em>separate</em> relationships between debt and income: one for the “low” credit limit bracket, one for the “medium-low” credit limit bracket, and so on.</p>
-<p>Observe in the right-hand plot that the relationship between debt and income is clearly negative for the “medium-low” and “medium-high” credit limit brackets, while the relationship is somewhat flat for the “low” credit limit bracket. The only credit limit bracket where the relationship remains positive is for the “high” credit limit bracket. However, this relationship is less positive than in the relationship in aggregate, since the slope is shallower than the slope of the regression line in the left-hand plot.</p>
-<p>In this example of Simpson’s Paradox, credit limit is a <em>confounding variable</em> of the relationship between credit card debt and income as we defined in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a>, as thus needs to be accounted for in any appropriate model for the relationship between debt and income.</p>
+<p>The left-hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot4">6.11</a> focuses on the relationship between <code>debt</code> and <code>income</code> in <em>aggregate</em>. It is suggesting that overall there exists a positive relationship between <code>debt</code> and <code>income</code>. However, the right-hand plot of Figure <a href="6-multiple-regression.html#fig:2numxplot4">6.11</a> focuses on the relationship between <code>debt</code> and <code>income</code> <em>broken down by <code>credit_limit</code> bracket</em>. In other words, we focus on four <em>separate</em> relationships between <code>debt</code> and <code>income</code>: one for the “low” <code>credit_limit</code> bracket, one for the “medium-low” <code>credit_limit</code> bracket, and so on.</p>
+<p>Observe in the right-hand plot that the relationship between <code>debt</code> and <code>income</code> is clearly negative for the “medium-low” and “medium-high” <code>credit_limit</code> brackets, while the relationship is somewhat flat for the “low” <code>credit_limit</code> bracket. The only <code>credit_limit</code> bracket where the relationship remains positive is the “high” <code>credit_limit</code> bracket. However, this relationship is less positive than the relationship in aggregate, since the slope is shallower than the slope of the regression line in the left-hand plot.</p>
+<p>In this example of Simpson’s Paradox, the <code>credit_limit</code> is a <em>confounding variable</em> of the relationship between credit card <code>debt</code> and <code>income</code> as we defined in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a>. Thus, <code>credit_limit</code> needs to be accounted for in any appropriate model for the relationship between <code>debt</code> and <code>income</code>.</p>
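+<p>The reversal can be reproduced with a small simulation. The sketch below uses entirely made-up data (four groups of 100 points, with arbitrary coefficients) rather than the book's <code>credit_ch6</code> data: within each group <code>debt</code> falls as <code>income</code> rises, yet groups with higher incomes also have higher baseline debt, so the aggregate slope comes out positive. All object names are hypothetical.</p>

```r
# Made-up data illustrating Simpson's Paradox (NOT the book's credit data)
set.seed(42)
n_per_group <- 100
group <- rep(c("low", "medium-low", "medium-high", "high"), each = n_per_group)
group_center <- rep(c(20, 40, 60, 80), each = n_per_group)  # typical income per group

income <- group_center + rnorm(4 * n_per_group, sd = 5)
# Baseline debt rises with the group's center, but within a group
# debt decreases as income increases
debt <- 30 * group_center - 5 * (income - group_center) +
  rnorm(4 * n_per_group, sd = 20)

# Slope of a single regression line fit to all 400 points
aggregate_slope <- coef(lm(debt ~ income))[["income"]]

# Slope of a separate regression line fit within each group
within_slopes <- sapply(
  split(data.frame(income, debt), group),
  function(d) coef(lm(debt ~ income, data = d))[["income"]]
)

aggregate_slope   # positive in aggregate
within_slopes     # negative within every group
```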
 </div>
 </div>
 <div id="conclusion-5" class="section level2">
@@ -2846,23 +2842,31 @@ <h3><span class="header-section-number">6.4.1</span> Additional resources</h3>
 </div>
 <div id="whats-to-come-5" class="section level3">
 <h3><span class="header-section-number">6.4.2</span> What’s to come?</h3>
-<p>Congratulations! We’ve completed our first pass through the “Data modeling with moderndive” portion of this book. We’re ready to proceed to the next portion of this book: “Statistical inference with infer”. Statistical inference is the science of inferring about some unknown quantity using sampling.</p>
-<p>For example, among the most well-known examples of sampling involved <em>polls</em>. Because asking an entire population about their opinions would be a long and arduous task, pollsters often take a smaller sample that is hopefully representative of the population. Based on the results of this sample, pollsters hope to make claims about the entire population.</p>
-<p>Once we’ve covered Chapters <a href="7-sampling.html#sampling">7</a> on sampling, <a href="8-confidence-intervals.html#confidence-intervals">8</a> on confidence intervals, and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> on hypothesis testing, in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on inference for regression we’ll revisit the regression models we studied in Chapter <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>. So far we’ve only studied the <code>estimate</code> column of all our regression tables. The next 4 chapters focus on what the remaining columns mean: the <code>std_error</code> standard error, the <code>statistic</code> test statistic, the <code>p_value</code> p-value, and the <code>lower_ci</code> and <code>upper_ci</code> lower and upper bounds of confidence intervals.</p>
-<p>Furthermore in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>, we’ll revisit the concept of residuals <span class="math inline">\(y - \widehat{y}\)</span> and discuss their importance when interpreting the results of a regression model. We’ll perform what is known as a <em>residual analysis</em> of the <code>residual</code> variable of all <code>get_regression_points()</code> outputs. Residual analyses allow you to verify what are known as the <em>conditions for inference for regression</em>. On to Chapter <a href="7-sampling.html#sampling">7</a> on sampling!</p>
-<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-224"></span>
-<img src="images/flowcharts/flowchart/flowchart.006.png" alt="ModernDive flowchart - On to Part III!" width="\textwidth" />
+<p>Congratulations! We’ve completed the “Data Modeling with <code>moderndive</code>” portion of this book. We’re ready to proceed to Part III of this book: “Statistical Inference with <code>infer</code>.” Statistical inference is the science of inferring about some unknown quantity using sampling.</p>
+<p>For example, one of the most well-known examples of sampling involves <em>polls</em>. Because asking an entire population about their opinions would be a long and arduous task, pollsters often take a smaller sample that is hopefully representative of the population. Based on the results of this sample, pollsters hope to make claims about the entire population.</p>
+<p>Once we’ve covered Chapters <a href="7-sampling.html#sampling">7</a> on sampling, <a href="8-confidence-intervals.html#confidence-intervals">8</a> on confidence intervals, and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> on hypothesis testing, we’ll revisit the regression models we studied in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a> in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on inference for regression. So far, we’ve only studied the <code>estimate</code> column of all our regression tables. The next four chapters focus on what the remaining columns mean: the standard error (<code>std_error</code>), the test <code>statistic</code>, the <code>p_value</code>, and the lower and upper bounds of confidence intervals (<code>lower_ci</code> and <code>upper_ci</code>).</p>
+<p>Furthermore in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>, we’ll revisit the concept of residuals <span class="math inline">\(y - \widehat{y}\)</span> and discuss their importance when interpreting the results of a regression model. We’ll perform what is known as a <em>residual analysis</em> of the <code>residual</code> variable of all <code>get_regression_points()</code> outputs. Residual analyses allow you to verify what are known as the <em>conditions for inference for regression</em>. On to Chapter <a href="7-sampling.html#sampling">7</a> on sampling in Part III as shown in Figure <a href="6-multiple-regression.html#fig:part3">6.12</a>!</p>
+
+<div class="figure" style="text-align: center"><span id="fig:part3"></span>
+<img src="images/flowcharts/flowchart/flowchart.006.png" alt="ModernDive flowchart - on to Part III!" width="\textwidth" />
 <p class="caption">
-FIGURE 6.12: ModernDive flowchart - On to Part III!
+FIGURE 6.12: <em>ModernDive</em> flowchart - on to Part III!
 </p>
 </div>
 
+
 </div>
 </div>
 </div>
 
 
 
+<h3>References</h3>
+<div id="refs" class="references">
+<div id="ref-islr2017">
+<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. <em>An Introduction to Statistical Learning: With Applications in R</em>. First. New York, NY: Springer.</p>
+</div>
+</div>
             </section>
 
           </div>
@@ -2874,11 +2878,13 @@ <h3><span class="header-section-number">6.4.2</span> What’s to come?</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -2886,12 +2892,11 @@ <h3><span class="header-section-number">6.4.2</span> What’s to come?</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -2906,6 +2911,10 @@ <h3><span class="header-section-number">6.4.2</span> What’s to come?</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -2922,8 +2931,9 @@ <h3><span class="header-section-number">6.4.2</span> What’s to come?</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/7-sampling.html b/docs/7-sampling.html
index 3107a5ff4..317ae32fc 100644
--- a/docs/7-sampling.html
+++ b/docs/7-sampling.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 7 Sampling | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 7 Sampling | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 7 Sampling | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="6-multiple-regression.html">
-<link rel="next" href="8-confidence-intervals.html">
+<link rel="prev" href="6-multiple-regression.html"/>
+<link rel="next" href="8-confidence-intervals.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,7 +583,7 @@ <h1>
 </html>
 <div id="sampling" class="section level1">
 <h1><span class="header-section-number">Chapter 7</span> Sampling</h1>
-<p>In this chapter, we kick off the third portion of this book on statistical inference by learning about <em>sampling</em>. The concepts behind sampling form the basis of confidence intervals and hypothesis testing, which we’ll cover in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>. We will see that the tools that you learned in the data science portion of this book, in particular data visualization and data wrangling, will also play an important role in the development of your understanding. As mentioned before, the concepts throughout this text all build into a culmination allowing you to “tell the story with data.”</p>
+<p>In this chapter, we kick off the third portion of this book on statistical inference by learning about <em>sampling</em>. The concepts behind sampling form the basis of confidence intervals and hypothesis testing, which we’ll cover in Chapters <a href="8-confidence-intervals.html#confidence-intervals">8</a> and <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>. We will see that the tools that you learned in the data science portion of this book, in particular data visualization and data wrangling, will also play an important role in the development of your understanding. As mentioned before, the concepts throughout this text all build into a culmination allowing you to “tell your story with data.”</p>
 <div id="needed-packages-5" class="section level3 unnumbered">
 <h3>Needed packages</h3>
 <p>Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section <a href="4-tidy.html#tidyverse-package">4.4</a> that loading the <code>tidyverse</code> package by running <code>library(tidyverse)</code> loads the following commonly used data science packages all at once:</p>
@@ -582,8 +595,8 @@ <h3>Needed packages</h3>
 <li>As well as the more advanced <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages</li>
 </ul>
 <p>If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(moderndive)</code></pre>
+<div class="sourceCode" id="cb208"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb208-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb208-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a></code></pre></div>
 </div>
 <div id="sampling-activity" class="section level2">
 <h2><span class="header-section-number">7.1</span> Sampling bowl activity</h2>
@@ -593,7 +606,7 @@ <h3><span class="header-section-number">7.1.1</span> What proportion of this bow
 <p>Take a look at the bowl in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a>. It has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand, as there does not seem to be any coherent pattern to the spatial distribution of the red and white balls.</p>
 <p>Let’s now ask ourselves, what proportion of this bowl’s balls are red?</p>
 <div class="figure" style="text-align: center"><span id="fig:sampling-exercise-1"></span>
-<img src="images/sampling/balls/sampling_bowl_1.jpg" alt="A bowl with red and white balls." width="80%" />
+<img src="images/sampling/balls/sampling_bowl_1.jpg" alt="A bowl with red and white balls." width="95%" />
 <p class="caption">
 FIGURE 7.1: A bowl with red and white balls.
 </p>
@@ -602,22 +615,22 @@ <h3><span class="header-section-number">7.1.1</span> What proportion of this bow
 </div>
 <div id="using-the-shovel-once" class="section level3">
 <h3><span class="header-section-number">7.1.2</span> Using the shovel once</h3>
-<p>Instead of performing an exhaustive count, let’s insert a shovel into the bowl as seen in Figure <a href="7-sampling.html#fig:sampling-exercise-2">7.2</a>. Using the shovel let’s remove 5 <span class="math inline">\(\times\)</span> 10 = 50 balls, as seen in Figure <a href="7-sampling.html#fig:sampling-exercise-3">7.3</a>.</p>
+<p>Instead of performing an exhaustive count, let’s insert a shovel into the bowl as seen in Figure <a href="7-sampling.html#fig:sampling-exercise-2">7.2</a>. Using the shovel, let’s remove <span class="math inline">\(5 \cdot 10 = 50\)</span> balls, as seen in Figure <a href="7-sampling.html#fig:sampling-exercise-3">7.3</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:sampling-exercise-2"></span>
-<img src="images/sampling/balls/sampling_bowl_2.jpg" alt="Inserting a shovel into the bowl." width="80%" />
+<img src="images/sampling/balls/sampling_bowl_2.jpg" alt="Inserting a shovel into the bowl." width="100%" />
 <p class="caption">
 FIGURE 7.2: Inserting a shovel into the bowl.
 </p>
 </div>
 <div class="figure" style="text-align: center"><span id="fig:sampling-exercise-3"></span>
-<img src="images/sampling/balls/sampling_bowl_3_cropped.jpg" alt="Fifty balls from the bowl." width="80%" />
+<img src="images/sampling/balls/sampling_bowl_3_cropped.jpg" alt="Removing 50 balls from the bowl." width="100%" />
 <p class="caption">
-FIGURE 7.3: Fifty balls from the bowl.
+FIGURE 7.3: Removing 50 balls from the bowl.
 </p>
 </div>
 <p>Observe that 17 of the balls are red and thus 0.34 = 34% of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count of all the balls in the bowl, our guess of 34% took much less time and energy to make.</p>
 <p>However, say, we started this activity over from the beginning. In other words, we replace the 50 balls back into the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe?</p>
-<p>What if we repeated this activity several times? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not. Let’s repeat this exercise several times with the help of 33 groups of friends to understand how the value differs with repetition.</p>
+<p>What if we repeated this activity several times following the process shown in Figure <a href="7-sampling.html#fig:sampling-exercise-3b">7.4</a>? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not. Let’s repeat this exercise several times with the help of 33 groups of friends to understand how the value differs with repetition.</p>
 </div>
 <div id="student-shovels" class="section level3">
 <h3><span class="header-section-number">7.1.3</span> Using the shovel 33 times</h3>
@@ -634,7 +647,7 @@ <h3><span class="header-section-number">7.1.3</span> Using the shovel 33 times</
 FIGURE 7.4: Repeating sampling activity 33 times.
 </p>
 </div>
-<p>Before returning the balls into the bowl, each of our 33 groups of friends are going to mark their proportion of the 50 balls that were red in a hand-drawn histogram as seen in Figure <a href="7-sampling.html#fig:sampling-exercise-4">7.5</a>.</p>
+<p>Each of our 33 groups of friends makes note of the proportion of red balls in the sample they collected. Each group then marks the proportion of their 50 balls that were red in the appropriate bin of a hand-drawn histogram, as seen in Figure <a href="7-sampling.html#fig:sampling-exercise-4">7.5</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:sampling-exercise-4"></span>
 <img src="images/sampling/balls/tactile_3_a.jpg" alt="Constructing a histogram of proportions." width="80%" />
 <p class="caption">
@@ -643,20 +656,20 @@ <h3><span class="header-section-number">7.1.3</span> Using the shovel 33 times</
 </div>
 <p>Recall from Section <a href="2-viz.html#histograms">2.5</a> that histograms allow us to visualize the <em>distribution</em>  of a numerical variable. In particular, where the center of the values falls and how the values vary. A partially completed histogram of the first 10 out of 33 groups of friends’ results can be seen in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:sampling-exercise-5"></span>
-<img src="images/sampling/balls/tactile_3_c.jpg" alt="Hand-drawn histogram of first 10 out of 33 proportions." width="80%" />
+<img src="images/sampling/balls/tactile_3_c.jpg" alt="Hand-drawn histogram of first 10 out of 33 proportions." width="70%" />
 <p class="caption">
 FIGURE 7.6: Hand-drawn histogram of first 10 out of 33 proportions.
 </p>
 </div>
 <p>Observe the following in the histogram in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>:</p>
 <ul>
-<li>At the low end, one group removed 50 balls from the bowl with proportion between 0.20 and 0.25.</li>
+<li>At the low end, one group removed 50 balls from the bowl with proportion red between 0.20 and 0.25.</li>
 <li>At the high end, another group removed 50 balls from the bowl with proportion between 0.45 and 0.5 red.</li>
-<li>However the most frequently occurring proportions were between 0.30 and 0.35 red, right in the middle of the distribution.</li>
+<li>However, the most frequently occurring proportions were between 0.30 and 0.35 red, right in the middle of the distribution.</li>
 <li>The shape of this distribution is somewhat bell-shaped.</li>
 </ul>
-<p>Let’s construct this same hand-drawn histogram in R using your data visualization skills that you honed in Chapter <a href="2-viz.html#viz">2</a>. We saved our 33 groups of friends’ results in a data frame <code>tactile_prop_red</code> included in the <code>moderndive</code> package. Run the following to display the first 10 of 33 rows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">tactile_prop_red</code></pre>
+<p>Let’s construct this same hand-drawn histogram in R using your data visualization skills that you honed in Chapter <a href="2-viz.html#viz">2</a>. We saved our 33 groups of friends’ results in the <code>tactile_prop_red</code> data frame included in the <code>moderndive</code> package. Run the following to display the first 10 of 33 rows:</p>
+<div class="sourceCode" id="cb209"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb209-1" data-line-number="1">tactile_prop_red</a></code></pre></div>
 <pre><code># A tibble: 33 x 4
    group            replicate red_balls prop_red
    &lt;chr&gt;                &lt;int&gt;     &lt;int&gt;    &lt;dbl&gt;
@@ -671,14 +684,14 @@ <h3><span class="header-section-number">7.1.3</span> Using the shovel 33 times</
  9 Daniel, Caroline         9        15     0.3 
 10 Josh, Maeve             10        17     0.34
 # … with 23 more rows</code></pre>
-<p>Observe for each <code>group</code> that we have their names, the number of <code>red_balls</code> they obtained, and the corresponding proportion out of 50 balls that were red named <code>prop_red</code>. We also have a variable <code>replicate</code> enumerating each of the 33 groups; we chose this name because each row can be viewed as one instance of a replicated (in other words repeated) activity: using the shovel to remove 50 balls and computing the proportion of those balls that are red.</p>
-<p>Let’s visualize the distribution of these 33 proportions using a <code>geom_histogram()</code> with <code>binwidth = 0.05</code> in Figure <a href="7-sampling.html#fig:samplingdistribution-tactile">7.7</a>. This is a computerized and complete version of the partially completed hand-drawn histogram you saw in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(tactile_prop_red, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;Distribution of 33 proportions red&quot;</span>) </code></pre>
+<p>Observe for each <code>group</code> that we have their names, the number of <code>red_balls</code> they obtained, and the corresponding proportion out of 50 balls that were red named <code>prop_red</code>. We also have a <code>replicate</code> variable enumerating each of the 33 groups. We chose this name because each row can be viewed as one instance of a replicated (in other words repeated) activity: using the shovel to remove 50 balls and computing the proportion of those balls that are red.</p>
+<p>Let’s visualize the distribution of these 33 proportions using <code>geom_histogram()</code> with <code>binwidth = 0.05</code> in Figure <a href="7-sampling.html#fig:samplingdistribution-tactile">7.7</a>. This is a computerized and complete version of the partially completed hand-drawn histogram you saw in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>. Note that setting <code>boundary = 0.4</code> indicates that we want a binning scheme in which one of the bin boundaries is at 0.4. This helps us more closely align this histogram with the hand-drawn histogram in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>.</p>
+<div class="sourceCode" id="cb211"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb211-1" data-line-number="1"><span class="kw">ggplot</span>(tactile_prop_red, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb211-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb211-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, </a>
+<a class="sourceLine" id="cb211-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Distribution of 33 proportions red&quot;</span>) </a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:samplingdistribution-tactile"></span>
-<img src="moderndive_files/figure-html/samplingdistribution-tactile-1.png" alt="Distribution of 33 proportions based on 33 samples of size 50." width="\textwidth" />
+<img src="ModernDive_files/figure-html/samplingdistribution-tactile-1.png" alt="Distribution of 33 proportions based on 33 samples of size 50." width="\textwidth" />
 <p class="caption">
 FIGURE 7.7: Distribution of 33 proportions based on 33 samples of size 50.
 </p>
@@ -686,7 +699,7 @@ <h3><span class="header-section-number">7.1.3</span> Using the shovel 33 times</
 </div>
 <div id="what-did-we-just-do" class="section level3">
 <h3><span class="header-section-number">7.1.4</span> What did we just do?</h3>
-<p>What we just demonstrated in this activity is the statistical concept of  <em>sampling</em>. We would like to know the proportion of the bowl’s balls that are red. However, because the bowl has a very large number of balls, performing an exhaustive count of the red and white balls would be very time-consuming. We therefore extracted a <em>sample</em> of 50 balls using the shovel to make an <em>estimate</em>. Using this sample of 50 balls, we estimated the proportion of the <em>bowl’s</em> balls that are red to be 34%.</p>
+<p>What we just demonstrated in this activity is the statistical concept of  <em>sampling</em>. We would like to know the proportion of the bowl’s balls that are red. Because the bowl has a large number of balls, performing an exhaustive count of the red and white balls would be time-consuming. We thus extracted a <em>sample</em> of 50 balls using the shovel to make an <em>estimate</em>. Using this sample of 50 balls, we estimated the proportion of the <em>bowl’s</em> balls that are red to be 34%.</p>
 <p>Moreover, because we mixed the balls before each use of the shovel, the samples were randomly drawn. Because each sample was drawn at random, the samples were different from each other. Because the samples were different from each other, we obtained the different proportions red observed in Figure <a href="7-sampling.html#fig:samplingdistribution-tactile">7.7</a>. This is known as the concept of <em>sampling variation</em>. </p>
 <p>The purpose of this sampling activity was to develop an understanding of two key concepts relating to sampling:</p>
 <ol style="list-style-type: decimal">
@@ -694,9 +707,8 @@ <h3><span class="header-section-number">7.1.4</span> What did we just do?</h3>
 <li>Understanding the effect of sample size on sampling variation.</li>
 </ol>
 <p>In Section <a href="7-sampling.html#sampling-simulation">7.2</a>, we’ll mimic the hands-on sampling activity we just performed on a computer. This will allow us not only to repeat the sampling exercise much more than 33 times, but it will also allow us to use shovels with different numbers of slots than just 50.</p>
-<p>Afterwards, we’ll present you with definitions, terminology, and notation related to sampling in Section <a href="7-sampling.html#sampling-framework">7.3</a>. As in many disciplines, such necessary background knowledge may seem very inaccessible and even confusing at first. However, as with many difficult topics, if you truly understand the underlying concepts and practice, practice, practice, you’ll be able to master them.</p>
-<p>To tie the contents of this chapter to the real-word, we’ll present an example of one of the most recognizable uses of sampling: polls. In Section <a href="7-sampling.html#sampling-case-study">7.4</a> we’ll look at a particular case study: a 2013 poll on then U.S. President Obama’s popularity among young Americans, conducted by the Harvard Kennedy School’s Institute of Politics.</p>
-<p>To close this chapter we’ll generalize the previous “sampling from a bowl” exercise to other sampling scenarios, present an important theoretical result known as the <em>Central Limit Theorem</em>, and present a few mathematical formulas related to sampling.</p>
+<p>Afterwards, we’ll present you with definitions, terminology, and notation related to sampling in Section <a href="7-sampling.html#sampling-framework">7.3</a>. As in many disciplines, such necessary background knowledge may seem inaccessible and even confusing at first. However, as with many difficult topics, if you truly understand the underlying concepts and practice, practice, practice, you’ll be able to master them.</p>
+<p>To tie the contents of this chapter to the real world, we’ll present an example of one of the most recognizable uses of sampling: polls. In Section <a href="7-sampling.html#sampling-case-study">7.4</a>, we’ll look at a particular case study: a 2013 poll on then-U.S. President Barack Obama’s popularity among young Americans, conducted by the Kennedy School’s Institute of Politics at Harvard University. To close this chapter, we’ll generalize the “sampling from a bowl” exercise to other sampling scenarios and present a theoretical result known as the <em>Central Limit Theorem</em>.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -711,11 +723,11 @@ <h3><span class="header-section-number">7.1.4</span> What did we just do?</h3>
 </div>
 <div id="sampling-simulation" class="section level2">
 <h2><span class="header-section-number">7.2</span> Virtual sampling</h2>
-<p>In the previous Section <a href="7-sampling.html#sampling-activity">7.1</a>, we performed a <em>tactile</em> sampling activity by hand. In other words, we used a physical bowl of balls and a physical shovel. We performed this sampling activity by hand first so that we develop a firm understanding of the root ideas behind sampling. In this section, we’ll mimic this tactile sampling activity with a <em>virtual</em> sampling activity using a computer. In other words, we’ll use a virtual analog to the bowl of balls and a virtual analog to the shovel.</p>
+<p>In the previous Section <a href="7-sampling.html#sampling-activity">7.1</a>, we performed a <em>tactile</em> sampling activity by hand. In other words, we used a physical bowl of balls and a physical shovel. We performed this sampling activity by hand first so that we could develop a firm understanding of the root ideas behind sampling. In this section, we’ll mimic this tactile sampling activity with a <em>virtual</em> sampling activity using a computer. In other words, we’ll use a virtual analog to the bowl of balls and a virtual analog to the shovel.</p>
 <div id="using-the-virtual-shovel-once" class="section level3">
 <h3><span class="header-section-number">7.2.1</span> Using the virtual shovel once</h3>
-<p>Let’s start by performing the virtual analog of the tactile sampling exercise we performed in Section <a href="7-sampling.html#sampling-activity">7.1</a>. We first need a virtual analog of the bowl seen in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a>. To this end, we included a data frame <code>bowl</code> in the <code>moderndive</code> package. The rows of <code>bowl</code> correspond exactly with the contents of the actual bowl.</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl</code></pre>
+<p>Let’s start by performing the virtual analog of the tactile sampling exercise we performed in Section <a href="7-sampling.html#sampling-activity">7.1</a>. We first need a virtual analog of the bowl seen in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a>. To this end, we included a data frame named <code>bowl</code> in the <code>moderndive</code> package. The rows of <code>bowl</code> correspond exactly with the contents of the actual bowl.</p>
+<div class="sourceCode" id="cb212"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb212-1" data-line-number="1">bowl</a></code></pre></div>
 <pre><code># A tibble: 2,400 x 2
    ball_ID color
      &lt;int&gt; &lt;chr&gt;
@@ -730,12 +742,12 @@ <h3><span class="header-section-number">7.2.1</span> Using the virtual shovel on
  9       9 red  
 10      10 white
 # … with 2,390 more rows</code></pre>
-<p>Observe that <code>bowl</code> has 2400 rows, telling us that the bowl contains 2400 equally-sized balls. The first variable <code>ball_ID</code> is used as an <em>identification variable</em> as discussed in Subsection <a href="1-getting-started.html#identification-vs-measurement-variables">1.4.4</a>; none of the balls in the actual bowl are marked with numbers. The second variable <code>color</code> indicates whether a particular virtual ball is red or white. View the contents of the bowl in RStudio’s data viewer and scroll through the contents to convince yourself that <code>bowl</code> is indeed a virtual analog of the actual bowl in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a>.</p>
+<p>Observe that <code>bowl</code> has 2400 rows, telling us that the bowl contains 2400 equally sized balls. The first variable <code>ball_ID</code> is used as an <em>identification variable</em> as discussed in Subsection <a href="1-getting-started.html#identification-vs-measurement-variables">1.4.4</a>; none of the balls in the actual bowl are marked with numbers. The second variable <code>color</code> indicates whether a particular virtual ball is red or white. View the contents of the bowl in RStudio’s data viewer and scroll through the contents to convince yourself that <code>bowl</code> is indeed a virtual analog of the actual bowl in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a>.</p>
 <p>Now that we have a virtual analog of our bowl, we need a virtual analog of the shovel seen in Figure <a href="7-sampling.html#fig:sampling-exercise-2">7.2</a> to generate virtual samples of 50 balls. We’re going to use the <code>rep_sample_n()</code> function included in the <code>moderndive</code> package. This function allows us to take <code>rep</code>eated, or <code>rep</code>licated, <code>sample</code>s of size <code>n</code>.</p>
 <!--
 Note: Put this back in if people have trouble understanding rep_sample_n() at first:
 
-Let's show an example of this function in action. Let's first use the `tibble()` function to manually create a data frame of 5 fruit called `fruit_basket`. 
+Let's show an example of this function in action. Let's first use the `tibble()` function to manually create a data frame of five fruit called `fruit_basket`. 
 
 
 ```r
@@ -762,7 +774,7 @@ <h3><span class="header-section-number">7.2.1</span> Using the virtual shovel on
 3         1 Pamplemousse
 ```
 
-Your results will likely be different, since we are taking a *random* sample of size 3. Now let's see what happens when we try to sample 6 fruit:
+Your results will likely be different, since we are taking a *random* sample of size 3. Now let's see what happens when we try to sample six fruit:
 
 
 ```r
@@ -776,9 +788,9 @@ <h3><span class="header-section-number">7.2.1</span> Using the virtual shovel on
 
 We get an error message telling us that we cannot take a sample that has more rows than the original data frame. This is because `rep_sample_n()` by default samples *without replacement*\index{sampling without replacement}. Once it samples a fruit from the basket, it does not put it back in. 
 -->
-<pre class="sourceCode r"><code class="sourceCode r">virtual_shovel &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>)
-virtual_shovel</code></pre>
+<div class="sourceCode" id="cb214"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb214-1" data-line-number="1">virtual_shovel &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb214-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>)</a>
+<a class="sourceLine" id="cb214-3" data-line-number="3">virtual_shovel</a></code></pre></div>
 <pre><code># A tibble: 50 x 3
 # Groups:   replicate [1]
    replicate ball_ID color
@@ -794,10 +806,10 @@ <h3><span class="header-section-number">7.2.1</span> Using the virtual shovel on
  9         1     910 white
 10         1    1485 white
 # … with 40 more rows</code></pre>
-<p>Observe that <code>virtual_shovel</code> has 50 rows corresponding to our virtual sample of size 50. The <code>ball_ID</code> variable identifies which of the 2400 balls from <code>bowl</code> are included in our sample of 50 balls while <code>color</code> denotes its color. However what does the <code>replicate</code> variable indicate? In <code>virtual_shovel</code>’s case, <code>replicate</code> is equal to 1 for all 50 rows. This is telling us that these 50 rows correspond to the first repeated/replicated use of the shovel, in our case our first sample. We’ll see in what follows when we “virtually” take 33 samples, <code>replicate</code> will take values between 1 and 33.</p>
-<p>Let’s compute the proportion of balls in our virtual sample that are red using the <code>dplyr</code> data wrangling verbs you learned in Chapter <a href="3-wrangling.html#wrangling">3</a>. First, for each of our 50 sampled balls, let’s identify if it is red or not using a test for equality using <code>==</code>. Let’s create a new Boolean variable <code>is_red</code> using the <code>mutate()</code> function from Section <a href="3-wrangling.html#mutate">3.5</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">is_red =</span> (color <span class="op">==</span><span class="st"> &quot;red&quot;</span>))</code></pre>
+<p>Observe that <code>virtual_shovel</code> has 50 rows corresponding to our virtual sample of size 50. The <code>ball_ID</code> variable identifies which of the 2400 balls from <code>bowl</code> are included in our sample of 50 balls while <code>color</code> denotes each ball’s color. However, what does the <code>replicate</code> variable indicate? In <code>virtual_shovel</code>’s case, <code>replicate</code> is equal to 1 for all 50 rows. This is telling us that these 50 rows correspond to the first repeated/replicated use of the shovel, in our case our first sample. We’ll see shortly that when we “virtually” take 33 samples, <code>replicate</code> will take values between 1 and 33.</p>
+<p>Let’s compute the proportion of balls in our virtual sample that are red using the <code>dplyr</code> data wrangling verbs you learned in Chapter <a href="3-wrangling.html#wrangling">3</a>. First, for each of our 50 sampled balls, let’s identify if it is red or not using a test for equality with <code>==</code>. Let’s create a new Boolean variable <code>is_red</code> using the <code>mutate()</code> function from Section <a href="3-wrangling.html#mutate">3.5</a>:</p>
+<div class="sourceCode" id="cb216"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb216-1" data-line-number="1">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb216-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">is_red =</span> (color <span class="op">==</span><span class="st"> &quot;red&quot;</span>))</a></code></pre></div>
 <pre><code># A tibble: 50 x 4
 # Groups:   replicate [1]
    replicate ball_ID color is_red
@@ -813,42 +825,42 @@ <h3><span class="header-section-number">7.2.1</span> Using the virtual shovel on
  9         1     910 white FALSE 
 10         1    1485 white FALSE 
 # … with 40 more rows</code></pre>
-<p>Observe that for every row where <code>color == &quot;red&quot;</code>, the Boolean <code>TRUE</code> is returned and for every row where <code>color</code> is not equal to <code>&quot;red&quot;</code>, the Boolean <code>FALSE</code> is returned.</p>
+<p>Observe that for every row where <code>color == &quot;red&quot;</code>, the Boolean (logical) value <code>TRUE</code> is returned and for every row where <code>color</code> is not equal to <code>&quot;red&quot;</code>, the Boolean <code>FALSE</code> is returned.</p>
 <p>Second, let’s compute the number of balls out of 50 that are red using the <code>summarize()</code> function. Recall from Section <a href="3-wrangling.html#summarize">3.3</a> that <code>summarize()</code> takes a data frame with many rows and returns a data frame with a single row containing summary statistics, such as the <code>mean()</code> or <code>median()</code>. In this case, we use <code>sum()</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">is_red =</span> (color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_red =</span> <span class="kw">sum</span>(is_red))</code></pre>
+<div class="sourceCode" id="cb218"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb218-1" data-line-number="1">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb218-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">is_red =</span> (color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb218-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_red =</span> <span class="kw">sum</span>(is_red))</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
   replicate num_red
       &lt;int&gt;   &lt;int&gt;
 1         1      12</code></pre>
-<p>Why does this work? Because R treats <code>TRUE</code> like the number <code>1</code> and <code>FALSE</code> like the number <code>0</code>. So summing the number of <code>TRUE</code>’s and <code>FALSE</code>’s is equivalent to summing <code>1</code>’s and <code>0</code>’s. In the end, this operation counts the number of balls where <code>color</code> is <code>red</code>. In our case, 12 of the 50 balls were red. However, you might’ve gotten a different number red because of the randomness of the virtual sampling.</p>
+<p>Why does this work? Because R treats <code>TRUE</code> like the number <code>1</code> and <code>FALSE</code> like the number <code>0</code>. So summing the number of <code>TRUE</code>s and <code>FALSE</code>s is equivalent to summing <code>1</code>s and <code>0</code>s. In the end, this operation counts the number of balls where <code>color</code> is <code>&quot;red&quot;</code>. In our case, 12 of the 50 balls were red. However, you might have gotten a different number of red balls because of the randomness of the virtual sampling.</p>
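+<p>As a minimal sketch of this coercion in base R, consider a small made-up logical vector <code>is_red_example</code> (a hypothetical name, not an object from this chapter). Summing it counts the <code>TRUE</code> values, and taking its mean gives their proportion:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical example vector: TRUE marks a red ball
+is_red_example &lt;- c(TRUE, FALSE, TRUE, FALSE)
+sum(is_red_example)   # counts the TRUE values: 2
+mean(is_red_example)  # proportion of TRUE values: 0.5</code></pre>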
 <p>Third and lastly, let’s compute the proportion of the 50 sampled balls that are red by dividing <code>num_red</code> by 50:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">is_red =</span> color <span class="op">==</span><span class="st"> &quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_red =</span> <span class="kw">sum</span>(is_red)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> num_red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</code></pre>
+<div class="sourceCode" id="cb220"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb220-1" data-line-number="1">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb220-2" data-line-number="2"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">is_red =</span> color <span class="op">==</span><span class="st"> &quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb220-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_red =</span> <span class="kw">sum</span>(is_red)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb220-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> num_red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 3
   replicate num_red prop_red
       &lt;int&gt;   &lt;int&gt;    &lt;dbl&gt;
 1         1      12     0.24</code></pre>
-<p>In other words, 34% of this virtual sample’s balls were red. Let’s make this code a little more compact and succinct by combining the first <code>mutate()</code> and the <code>summarize()</code> as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> num_red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</code></pre>
+<p>In other words, 24% of this virtual sample’s balls were red. Let’s make this code a little more compact and succinct by combining the first <code>mutate()</code> and the <code>summarize()</code> as follows:</p>
+<div class="sourceCode" id="cb222"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb222-1" data-line-number="1">virtual_shovel <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb222-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">num_red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb222-3" data-line-number="3"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> num_red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 3
   replicate num_red prop_red
       &lt;int&gt;   &lt;int&gt;    &lt;dbl&gt;
 1         1      12     0.24</code></pre>
-<p>Great! 34% of <code>virtual_shovel</code>’s 50 balls were red! So based on this particular sample of 50 balls, our guess at the proportion of the <code>bowl</code>’s balls that are red is 34%. But remember from our earlier tactile sampling activity that if we repeat this sampling, we will not necessarily obtain the same value of 34% again. There will likely be some variation. In fact, our 33 groups of friends computed 33 such proportions whose distribution we visualized in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>. We saw that these estimates <em>varied</em>. Let’s now perform the virtual analog of having 33 groups of students use the sampling shovel!</p>
+<p>Great! 24% of <code>virtual_shovel</code>’s 50 balls were red! So based on this particular sample of 50 balls, our guess at the proportion of the <code>bowl</code>’s balls that are red is 24%. But remember from our earlier tactile sampling activity that if we repeat this sampling, we will not necessarily obtain the same value of 24% again. There will likely be some variation. In fact, our 33 groups of friends computed 33 such proportions whose distribution we visualized in Figure <a href="7-sampling.html#fig:sampling-exercise-5">7.6</a>. We saw that these estimates <em>varied</em>. Let’s now perform the virtual analog of having 33 groups of students use the sampling shovel!</p>
 </div>
 <div id="using-the-virtual-shovel-33-times" class="section level3">
 <h3><span class="header-section-number">7.2.2</span> Using the virtual shovel 33 times</h3>
-<p>Recall that in our tactile sampling exercise in Section <a href="7-sampling.html#sampling-activity">7.1</a> we had 33 groups of students each use the shovel, yielding 33 samples of size 50 balls. We then used these 33 samples to compute 33 proportions. In other words we repeated/replicated using the shovel 33 times. We can perform this repeated/replicated sampling virtually by once again using our virtual shovel function <code>rep_sample_n()</code>, but by adding the <code>reps = 33</code> argument. This is telling R that we want to repeat the sampling 33 times.</p>
+<p>Recall that in our tactile sampling exercise in Section <a href="7-sampling.html#sampling-activity">7.1</a>, we had 33 groups of students each use the shovel, yielding 33 samples of size 50 balls. We then used these 33 samples to compute 33 proportions. In other words, we repeated/replicated using the shovel 33 times. We can perform this repeated/replicated sampling virtually by once again using our virtual shovel function <code>rep_sample_n()</code>, but by adding the <code>reps = 33</code> argument. This is telling R that we want to repeat the sampling 33 times.</p>
 <p>We’ll save these results in a data frame called <code>virtual_samples</code>. While we provide a preview of the first 10 rows of <code>virtual_samples</code> in what follows, we highly suggest you scroll through its contents using RStudio’s spreadsheet viewer by running <code>View(virtual_samples)</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_samples &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">33</span>)
-virtual_samples</code></pre>
+<div class="sourceCode" id="cb224"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb224-1" data-line-number="1">virtual_samples &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb224-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">33</span>)</a>
+<a class="sourceLine" id="cb224-3" data-line-number="3">virtual_samples</a></code></pre></div>
 <pre><code># A tibble: 1,650 x 3
 # Groups:   replicate [33]
    replicate ball_ID color
@@ -864,13 +876,13 @@ <h3><span class="header-section-number">7.2.2</span> Using the virtual shovel 33
  9         1     740 red  
 10         1     179 red  
 # … with 1,640 more rows</code></pre>
-<p>Observe in the spreadsheet viewer that the first 50 rows of <code>replicate</code> are equal to <code>1</code> while the next 50 rows of <code>replicate</code> are equal to <code>2</code>. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 rows correspond to the second sample of 50 balls. This pattern continues for all <code>reps = 33</code> replicates and thus <code>virtual_samples</code> has 33 <span class="math inline">\(\times\)</span> 50 = 1650 rows.</p>
+<p>Observe in the spreadsheet viewer that the first 50 rows of <code>replicate</code> are equal to <code>1</code> while the next 50 rows of <code>replicate</code> are equal to <code>2</code>. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 rows correspond to the second sample of 50 balls. This pattern continues for all <code>reps = 33</code> replicates and thus <code>virtual_samples</code> has 33 <span class="math inline">\(\cdot\)</span> 50 = 1650 rows.</p>
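+<p>This row count can be checked with a short base R sketch. The vector <code>replicate_ids</code> is a hypothetical stand-in for the <code>replicate</code> column, built by hand rather than taken from <code>virtual_samples</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical stand-in for the replicate column: 33 samples of 50 balls each
+replicate_ids &lt;- rep(1:33, each = 50)
+length(replicate_ids)            # 33 * 50 = 1650 rows in total
+all(table(replicate_ids) == 50)  # TRUE: every replicate has exactly 50 rows</code></pre>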
 <p>Let’s now take <code>virtual_samples</code> and compute the resulting 33 proportions red. We’ll use the same <code>dplyr</code> verbs as before, but this time with an additional <code>group_by()</code> of the <code>replicate</code> variable. Recall from Section <a href="3-wrangling.html#groupby">3.4</a> that by assigning the grouping variable “meta-data” before we <code>summarize()</code>, we’ll obtain 33 different proportions red. We display a preview of the first 10 out of 33 rows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_prop_red &lt;-<span class="st"> </span>virtual_samples <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)
-virtual_prop_red</code></pre>
+<div class="sourceCode" id="cb226"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb226-1" data-line-number="1">virtual_prop_red &lt;-<span class="st"> </span>virtual_samples <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb226-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb226-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb226-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</a>
+<a class="sourceLine" id="cb226-5" data-line-number="5">virtual_prop_red</a></code></pre></div>
 <pre><code># A tibble: 33 x 3
    replicate   red prop_red
        &lt;int&gt; &lt;int&gt;    &lt;dbl&gt;
@@ -885,13 +897,13 @@ <h3><span class="header-section-number">7.2.2</span> Using the virtual shovel 33
  9         9    24     0.48
 10        10    14     0.28
 # … with 23 more rows</code></pre>
-<p>As with our 33 groups of friends’ tactile samples, there is variation in the resulting 33 virtual proportions red. Let’s visualize this variation in a histogram in Figure <a href="7-sampling.html#fig:samplingdistribution-virtual">7.8</a>. Note that we add <code>binwidth = 0.05</code> and <code>boundary = 0.4</code> arguments as well. Setting <code>boundary = 0.4</code> indicates that we want a binning scheme such that one of the bins’ boundary is at 0.4. Since the <code>binwidth = 0.05</code> is also set, this will create bins with boundaries at 0.30, 0.35, 0.45, 0.5, etc as well.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(virtual_prop_red, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;Distribution of 33 proportions red&quot;</span>) </code></pre>
+<p>As with our 33 groups of friends’ tactile samples, there is variation in the resulting 33 virtual proportions red. Let’s visualize this variation in a histogram in Figure <a href="7-sampling.html#fig:samplingdistribution-virtual">7.8</a>. Note that we add <code>binwidth = 0.05</code> and <code>boundary = 0.4</code> arguments as well. Recall that setting <code>boundary = 0.4</code> ensures a binning scheme with one of the bins’ boundaries at 0.4. Since <code>binwidth = 0.05</code> is also set, this will create bins with boundaries at 0.30, 0.35, 0.45, 0.50, etc. as well.</p>
+<div class="sourceCode" id="cb228"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb228-1" data-line-number="1"><span class="kw">ggplot</span>(virtual_prop_red, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb228-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb228-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, </a>
+<a class="sourceLine" id="cb228-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Distribution of 33 proportions red&quot;</span>) </a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:samplingdistribution-virtual"></span>
-<img src="moderndive_files/figure-html/samplingdistribution-virtual-1.png" alt="Distribution of 33 proportions based on 33 samples of size 50." width="\textwidth" />
+<img src="ModernDive_files/figure-html/samplingdistribution-virtual-1.png" alt="Distribution of 33 proportions based on 33 samples of size 50." width="\textwidth" />
 <p class="caption">
 FIGURE 7.8: Distribution of 33 proportions based on 33 samples of size 50.
 </p>
@@ -899,7 +911,7 @@ <h3><span class="header-section-number">7.2.2</span> Using the virtual shovel 33
 <p>Observe that we occasionally obtained proportions red that are less than 30%. On the other hand, we occasionally obtained proportions that are greater than 45%. However, the most frequently occurring proportions were between 35% and 40% (for 11 out of 33 samples). Why do we have these differences in proportions red? Because of <em>sampling variation</em>.</p>
 <p>Let’s now compare our virtual results with our tactile results from the previous section in Figure <a href="7-sampling.html#fig:tactile-vs-virtual">7.9</a>. Observe that both histograms are somewhat similar in their center and variation, although not identical. These slight differences are again due to random sampling variation. Furthermore, observe that both distributions are somewhat bell-shaped.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-vs-virtual"></span>
-<img src="moderndive_files/figure-html/tactile-vs-virtual-1.png" alt="Comparing 33 virtual and 33 tactile proportions red." width="\textwidth" />
+<img src="ModernDive_files/figure-html/tactile-vs-virtual-1.png" alt="Comparing 33 virtual and 33 tactile proportions red." width="\textwidth" />
 <p class="caption">
 FIGURE 7.9: Comparing 33 virtual and 33 tactile proportions red.
 </p>
@@ -916,10 +928,10 @@ <h3><span class="header-section-number">7.2.2</span> Using the virtual shovel 33
 </div>
 <div id="shovel-1000-times" class="section level3">
 <h3><span class="header-section-number">7.2.3</span> Using the virtual shovel 1000 times</h3>
-<p>Now say we want to study the effects of sampling variation not for 33 samples, but rather for a very large number of samples, say 1000. We have two choices at this point. We could have our groups of friends manually take 1000 samples of 50 balls and compute the corresponding 1000 proportions. However, this would be a very tedious and time-consuming task. This is where computers excel: automating long and repetitive tasks while performing them very quickly. Thus at this point we will abandon tactile sampling in favor of only virtual sampling. Let’s once again use the <code>rep_sample_n()</code> function with sample <code>size</code> set to be 50 once again, but this time with the number of replicates <code>reps = 1000</code>. Be sure to scroll through the contents of <code>virtual_samples</code> in RStudio’s viewer.</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_samples &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)
-virtual_samples</code></pre>
+<p>Now say we want to study the effects of sampling variation not for 33 samples, but rather for a larger number of samples, say 1000. We have two choices at this point. We could have our groups of friends manually take 1000 samples of 50 balls and compute the corresponding 1000 proportions. However, this would be a tedious and time-consuming task. This is where computers excel: automating long and repetitive tasks while performing them quite quickly. Thus, at this point we will abandon tactile sampling in favor of only virtual sampling. Let’s once again use the <code>rep_sample_n()</code> function with the sample <code>size</code> set to 50, but this time with the number of replicates <code>reps</code> set to <code>1000</code>. Be sure to scroll through the contents of <code>virtual_samples</code> in RStudio’s viewer.</p>
+<div class="sourceCode" id="cb229"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb229-1" data-line-number="1">virtual_samples &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb229-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a>
+<a class="sourceLine" id="cb229-3" data-line-number="3">virtual_samples</a></code></pre></div>
 <pre><code># A tibble: 50,000 x 3
 # Groups:   replicate [1,000]
    replicate ball_ID color
@@ -935,12 +947,12 @@ <h3><span class="header-section-number">7.2.3</span> Using the virtual shovel 10
  9         1     782 white
 10         1     898 white
 # … with 49,990 more rows</code></pre>
-<p>Observe that now <code>virtual_samples</code> has 1000 <span class="math inline">\(\times\)</span> 50 = 50,000 rows, instead of the 33 <span class="math inline">\(\times\)</span> 50 = 1650 rows from earlier. Using the same data wrangling code as earlier, let’s take the data frame <code>virtual_samples</code> with 1000 <span class="math inline">\(\times\)</span> 50 = 50,000 and compute the resulting 1000 proportions red.</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_prop_red &lt;-<span class="st"> </span>virtual_samples <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)
-virtual_prop_red</code></pre>
+<p>Observe that now <code>virtual_samples</code> has 1000 <span class="math inline">\(\cdot\)</span> 50 = 50,000 rows, instead of the 33 <span class="math inline">\(\cdot\)</span> 50 = 1650 rows from earlier. Using the same data wrangling code as earlier, let’s take the data frame <code>virtual_samples</code> with 1000 <span class="math inline">\(\cdot\)</span> 50 = 50,000 rows and compute the resulting 1000 proportions of red balls.</p>
+<div class="sourceCode" id="cb231"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb231-1" data-line-number="1">virtual_prop_red &lt;-<span class="st"> </span>virtual_samples <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb231-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb231-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb231-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</a>
+<a class="sourceLine" id="cb231-5" data-line-number="5">virtual_prop_red</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 3
    replicate   red prop_red
        &lt;int&gt; &lt;int&gt;    &lt;dbl&gt;
@@ -956,17 +968,17 @@ <h3><span class="header-section-number">7.2.3</span> Using the virtual shovel 10
 10        10    18     0.36
 # … with 990 more rows</code></pre>
 <p>Observe that we now have 1000 replicates of <code>prop_red</code>, the proportion of 50 balls that are red. Using the same code as earlier, let’s now visualize the distribution of these 1000 replicates of <code>prop_red</code> in a histogram in Figure <a href="7-sampling.html#fig:samplingdistribution-virtual-1000">7.10</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(virtual_prop_red, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, 
-       <span class="dt">title =</span> <span class="st">&quot;Distribution of 1000 proportions red&quot;</span>) </code></pre>
+<div class="sourceCode" id="cb233"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb233-1" data-line-number="1"><span class="kw">ggplot</span>(virtual_prop_red, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb233-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb233-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, </a>
+<a class="sourceLine" id="cb233-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Distribution of 1000 proportions red&quot;</span>) </a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:samplingdistribution-virtual-1000"></span>
-<img src="moderndive_files/figure-html/samplingdistribution-virtual-1000-1.png" alt="Distribution of 1000 proportions based on 33 samples of size 50." width="\textwidth" />
+<img src="ModernDive_files/figure-html/samplingdistribution-virtual-1000-1.png" alt="Distribution of 1000 proportions based on 1000 samples of size 50." width="\textwidth" />
 <p class="caption">
-FIGURE 7.10: Distribution of 1000 proportions based on 33 samples of size 50.
+FIGURE 7.10: Distribution of 1000 proportions based on 1000 samples of size 50.
 </p>
 </div>
-<p>Once again, the most frequently occurring proportions red occur between 35% and 40%. Every now and then, we obtain proportions as low as between 20% and 25%, and others as high as between 55% and 60%. These are rare, however. Furthermore, observe that we now have a much more symmetric and smoother bell-shaped distribution. This distribution is, in fact, a Normal distribution. At this point we recommend you read the “Normal distribution” section of Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a> for a brief discussion on the properties of the Normal distribution.</p>
+<p>Once again, the most frequently occurring proportions of red balls fall between 35% and 40%. Every now and then, we obtain proportions as low as 20% to 25%, and others as high as 55% to 60%. These are rare, however. Furthermore, observe that we now have a much more symmetric and smoother bell-shaped distribution. This distribution is, in fact, approximated well by a normal distribution. At this point, we recommend you read the “Normal distribution” section (Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>) for a brief discussion of the properties of the normal distribution.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -993,86 +1005,86 @@ <h3><span class="header-section-number">7.2.4</span> Using different shovels</h3
 </p>
 </div>
 <p>If your goal is still to estimate the proportion of the bowl’s balls that are red, which shovel would you choose? In our experience, most people would choose the largest shovel with 100 slots because it would yield the “best” guess of the proportion of the bowl’s balls that are red. Let’s define some criteria for “best” in this subsection.</p>
-<p>Using our newly developed tools for virtual sampling, let’s unpack the effect of having different sample sizes! In other words, let’s use <code>rep_sample_n()</code> with <code>size = 25</code>, <code>size = 50</code>, and <code>size = 100</code>, while keeping the number of repeated/replicated samples at 1000:</p>
+<p>Using our newly developed tools for virtual sampling, let’s unpack the effect of having different sample sizes! In other words, let’s use <code>rep_sample_n()</code> with <code>size</code> set to <code>25</code>, <code>50</code>, and <code>100</code>, respectively, while keeping the number of repeated/replicated samples at 1000:</p>
 <ol style="list-style-type: decimal">
 <li>Virtually use the appropriate shovel to generate 1000 samples with <code>size</code> balls.</li>
 <li>Compute the resulting 1000 replicates of the proportion of the shovel’s balls that are red.</li>
 <li>Visualize the distribution of these 1000 proportions red using a histogram.</li>
 </ol>
 <p>Run each of the following code segments individually and then compare the three resulting histograms.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Segment 1: sample size = 25 ------------------------------</span>
-<span class="co"># 1.a) Virtually use shovel 1000 times</span>
-virtual_samples_<span class="dv">25</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">25</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)
-
-<span class="co"># 1.b) Compute resulting 1000 replicates of proportion red</span>
-virtual_prop_red_<span class="dv">25</span> &lt;-<span class="st"> </span>virtual_samples_<span class="dv">25</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">25</span>)
-
-<span class="co"># 1.c) Plot distribution via a histogram</span>
-<span class="kw">ggplot</span>(virtual_prop_red_<span class="dv">25</span>, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 25 balls that were red&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;25&quot;</span>) 
-
-
-<span class="co"># Segment 2: sample size = 50 ------------------------------</span>
-<span class="co"># 2.a) Virtually use shovel 1000 times</span>
-virtual_samples_<span class="dv">50</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)
-
-<span class="co"># 2.b) Compute resulting 1000 replicates of proportion red</span>
-virtual_prop_red_<span class="dv">50</span> &lt;-<span class="st"> </span>virtual_samples_<span class="dv">50</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)
-
-<span class="co"># 2.c) Plot distribution via a histogram</span>
-<span class="kw">ggplot</span>(virtual_prop_red_<span class="dv">50</span>, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;50&quot;</span>)  
-
-
-<span class="co"># Segment 3: sample size = 100 ------------------------------</span>
-<span class="co"># 3.a) Virtually using shovel with 100 slots 1000 times</span>
-virtual_samples_<span class="dv">100</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">100</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)
-
-<span class="co"># 3.b) Compute resulting 1000 replicates of proportion red</span>
-virtual_prop_red_<span class="dv">100</span> &lt;-<span class="st"> </span>virtual_samples_<span class="dv">100</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">100</span>)
-
-<span class="co"># 3.c) Plot distribution via a histogram</span>
-<span class="kw">ggplot</span>(virtual_prop_red_<span class="dv">100</span>, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 100 balls that were red&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;100&quot;</span>) </code></pre>
+<div class="sourceCode" id="cb234"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb234-1" data-line-number="1"><span class="co"># Segment 1: sample size = 25 ------------------------------</span></a>
+<a class="sourceLine" id="cb234-2" data-line-number="2"><span class="co"># 1.a) Virtually use shovel 1000 times</span></a>
+<a class="sourceLine" id="cb234-3" data-line-number="3">virtual_samples_<span class="dv">25</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-4" data-line-number="4"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">25</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a>
+<a class="sourceLine" id="cb234-5" data-line-number="5"></a>
+<a class="sourceLine" id="cb234-6" data-line-number="6"><span class="co"># 1.b) Compute resulting 1000 replicates of proportion red</span></a>
+<a class="sourceLine" id="cb234-7" data-line-number="7">virtual_prop_red_<span class="dv">25</span> &lt;-<span class="st"> </span>virtual_samples_<span class="dv">25</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-8" data-line-number="8"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-9" data-line-number="9"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-10" data-line-number="10"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">25</span>)</a>
+<a class="sourceLine" id="cb234-11" data-line-number="11"></a>
+<a class="sourceLine" id="cb234-12" data-line-number="12"><span class="co"># 1.c) Plot distribution via a histogram</span></a>
+<a class="sourceLine" id="cb234-13" data-line-number="13"><span class="kw">ggplot</span>(virtual_prop_red_<span class="dv">25</span>, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb234-14" data-line-number="14"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb234-15" data-line-number="15"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 25 balls that were red&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;25&quot;</span>) </a>
+<a class="sourceLine" id="cb234-16" data-line-number="16"></a>
+<a class="sourceLine" id="cb234-17" data-line-number="17"></a>
+<a class="sourceLine" id="cb234-18" data-line-number="18"><span class="co"># Segment 2: sample size = 50 ------------------------------</span></a>
+<a class="sourceLine" id="cb234-19" data-line-number="19"><span class="co"># 2.a) Virtually use shovel 1000 times</span></a>
+<a class="sourceLine" id="cb234-20" data-line-number="20">virtual_samples_<span class="dv">50</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-21" data-line-number="21"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a>
+<a class="sourceLine" id="cb234-22" data-line-number="22"></a>
+<a class="sourceLine" id="cb234-23" data-line-number="23"><span class="co"># 2.b) Compute resulting 1000 replicates of proportion red</span></a>
+<a class="sourceLine" id="cb234-24" data-line-number="24">virtual_prop_red_<span class="dv">50</span> &lt;-<span class="st"> </span>virtual_samples_<span class="dv">50</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-25" data-line-number="25"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-26" data-line-number="26"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-27" data-line-number="27"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</a>
+<a class="sourceLine" id="cb234-28" data-line-number="28"></a>
+<a class="sourceLine" id="cb234-29" data-line-number="29"><span class="co"># 2.c) Plot distribution via a histogram</span></a>
+<a class="sourceLine" id="cb234-30" data-line-number="30"><span class="kw">ggplot</span>(virtual_prop_red_<span class="dv">50</span>, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb234-31" data-line-number="31"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb234-32" data-line-number="32"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;50&quot;</span>)  </a>
+<a class="sourceLine" id="cb234-33" data-line-number="33"></a>
+<a class="sourceLine" id="cb234-34" data-line-number="34"></a>
+<a class="sourceLine" id="cb234-35" data-line-number="35"><span class="co"># Segment 3: sample size = 100 ------------------------------</span></a>
+<a class="sourceLine" id="cb234-36" data-line-number="36"><span class="co"># 3.a) Virtually use shovel 1000 times</span></a>
+<a class="sourceLine" id="cb234-37" data-line-number="37">virtual_samples_<span class="dv">100</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-38" data-line-number="38"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">100</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a>
+<a class="sourceLine" id="cb234-39" data-line-number="39"></a>
+<a class="sourceLine" id="cb234-40" data-line-number="40"><span class="co"># 3.b) Compute resulting 1000 replicates of proportion red</span></a>
+<a class="sourceLine" id="cb234-41" data-line-number="41">virtual_prop_red_<span class="dv">100</span> &lt;-<span class="st"> </span>virtual_samples_<span class="dv">100</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-42" data-line-number="42"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-43" data-line-number="43"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb234-44" data-line-number="44"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">100</span>)</a>
+<a class="sourceLine" id="cb234-45" data-line-number="45"></a>
+<a class="sourceLine" id="cb234-46" data-line-number="46"><span class="co"># 3.c) Plot distribution via a histogram</span></a>
+<a class="sourceLine" id="cb234-47" data-line-number="47"><span class="kw">ggplot</span>(virtual_prop_red_<span class="dv">100</span>, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb234-48" data-line-number="48"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb234-49" data-line-number="49"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 100 balls that were red&quot;</span>, <span class="dt">title =</span> <span class="st">&quot;100&quot;</span>) </a></code></pre></div>
 <p>For easy comparison, we present the three resulting histograms in a single row with matching x and y axes in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:comparing-sampling-distributions"></span>
-<img src="moderndive_files/figure-html/comparing-sampling-distributions-1.png" alt="Comparing the distributions of proportion red for different sample sizes." width="\textwidth" />
+<img src="ModernDive_files/figure-html/comparing-sampling-distributions-1.png" alt="Comparing the distributions of proportion red for different sample sizes." width="\textwidth" />
 <p class="caption">
 FIGURE 7.12: Comparing the distributions of proportion red for different sample sizes.
 </p>
 </div>
 <p>Observe that as the sample size increases, the variation of the 1000 replicates of the proportion of red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a>, all three histograms appear to center around roughly 40%.</p>
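+<p>A back-of-the-envelope sketch of why the spread shrinks, under a simplifying assumption that is ours rather than part of the simulation: suppose each ball in a sample is red independently with the same probability <span class="math inline">\(p\)</span>. Under that assumption, a standard probability result gives the spread, measured by the standard deviation, of the proportion of red balls in a sample of size <span class="math inline">\(n\)</span> as</p>
+<p><span class="math display">\[\sqrt{\frac{p(1-p)}{n}}.\]</span></p>
+<p>Plugging in <span class="math inline">\(p \approx 0.4\)</span>, the center eyeballed from the histograms, gives roughly <span class="math inline">\(\sqrt{0.24/25} \approx 0.098\)</span>, <span class="math inline">\(\sqrt{0.24/50} \approx 0.069\)</span>, and <span class="math inline">\(\sqrt{0.24/100} \approx 0.049\)</span> for the three shovels. Because <span class="math inline">\(n\)</span> sits under a square root, quadrupling the sample size from 25 to 100 only halves the spread. Treat these figures as approximations: the shovel draws without replacement from a finite bowl, so simulated values can come out somewhat smaller.</p>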
-<p>We can be numerically explicit about the amount of variation in our 3 sets of 1000 values of <code>prop_red</code> using the <em>standard deviation</em> . A standard deviation is a summary statistic that measures the amount of variation within a numerical variable (see Appendix <a href="A-appendixA.html#appendix-stat-terms">A.1</a> for a brief discussion on the properties of the standard deviation). For all three sample sizes, let’s compute the standard deviation of the 1000 proportions red by running the following data wrangling code that uses the <code>sd()</code> summary function.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># n = 25</span>
-virtual_prop_red_<span class="dv">25</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sd =</span> <span class="kw">sd</span>(prop_red))
-
-<span class="co"># n = 50</span>
-virtual_prop_red_<span class="dv">50</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sd =</span> <span class="kw">sd</span>(prop_red))
-
-<span class="co"># n = 100</span>
-virtual_prop_red_<span class="dv">100</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sd =</span> <span class="kw">sd</span>(prop_red))</code></pre>
-<p>Let’s compare these three measures of variation of the distributions in Table <a href="7-sampling.html#tab:comparing-n">7.1</a>.</p>
+<p>We can be numerically explicit about the amount of variation in our three sets of 1000 values of <code>prop_red</code> using the <em>standard deviation</em>. A standard deviation is a summary statistic that measures the amount of variation within a numerical variable (see Appendix <a href="A-appendixA.html#appendix-stat-terms">A.1</a> for a brief discussion on the properties of the standard deviation). For all three sample sizes, let’s compute the standard deviation of the 1000 proportions red by running the following data wrangling code that uses the <code>sd()</code> summary function.</p>
+<div class="sourceCode" id="cb235"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb235-1" data-line-number="1"><span class="co"># n = 25</span></a>
+<a class="sourceLine" id="cb235-2" data-line-number="2">virtual_prop_red_<span class="dv">25</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb235-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sd =</span> <span class="kw">sd</span>(prop_red))</a>
+<a class="sourceLine" id="cb235-4" data-line-number="4"></a>
+<a class="sourceLine" id="cb235-5" data-line-number="5"><span class="co"># n = 50</span></a>
+<a class="sourceLine" id="cb235-6" data-line-number="6">virtual_prop_red_<span class="dv">50</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb235-7" data-line-number="7"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sd =</span> <span class="kw">sd</span>(prop_red))</a>
+<a class="sourceLine" id="cb235-8" data-line-number="8"></a>
+<a class="sourceLine" id="cb235-9" data-line-number="9"><span class="co"># n = 100</span></a>
+<a class="sourceLine" id="cb235-10" data-line-number="10">virtual_prop_red_<span class="dv">100</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb235-11" data-line-number="11"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sd =</span> <span class="kw">sd</span>(prop_red))</a></code></pre></div>
+<p>Let’s compare these three measures of distributional variation in Table <a href="7-sampling.html#tab:comparing-n">7.1</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:comparing-n">TABLE 7.1: </span>Comparing standard deviations of proportions red for 3 different shovels.
+<span id="tab:comparing-n">TABLE 7.1: </span>Comparing standard deviations of proportions red for three different shovels
 </caption>
 <thead>
 <tr>
@@ -1090,7 +1102,7 @@ <h3><span class="header-section-number">7.2.4</span> Using different shovels</h3
 25
 </td>
 <td style="text-align:right;">
-0.099
+0.094
 </td>
 </tr>
 <tr>
@@ -1098,7 +1110,7 @@ <h3><span class="header-section-number">7.2.4</span> Using different shovels</h3
 50
 </td>
 <td style="text-align:right;">
-0.071
+0.069
 </td>
 </tr>
 <tr>
@@ -1106,7 +1118,7 @@ <h3><span class="header-section-number">7.2.4</span> Using different shovels</h3
 100
 </td>
 <td style="text-align:right;">
-0.048
+0.045
 </td>
 </tr>
 </tbody>
@@ -1118,13 +1130,17 @@ <h3><span class="header-section-number">7.2.4</span> Using different shovels</h3
 </p>
 </div>
 <p><strong>(LC7.6)</strong> In Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a>, we used shovels to take 1000 samples each, computed the resulting 1000 proportions of the shovel’s balls that were red, and then visualized the distribution of these 1000 proportions in a histogram. We did this for shovels with 25, 50, and 100 slots in them. As the size of the shovels increased, the histograms got narrower. In other words, as the size of the shovels increased from 25 to 50 to 100, did the 1000 proportions</p>
-<p>A. Vary less,
-B. Vary by the same amount, or
-C. Vary more?</p>
+<ul>
+<li>A. vary less,</li>
+<li>B. vary by the same amount, or</li>
+<li>C. vary more?</li>
+</ul>
 <p><strong>(LC7.7)</strong> What summary statistic did we use to quantify how much the 1000 proportions red varied?</p>
-<p>A. The inter-quartile range
-B. The standard deviation
-C. The range: the largest value minus the smallest</p>
+<ul>
+<li>A. The interquartile range</li>
+<li>B. The standard deviation</li>
+<li>C. The range: the largest value minus the smallest</li>
+</ul>
 <div class="learncheck">
 
 </div>
@@ -1132,26 +1148,26 @@ <h3><span class="header-section-number">7.2.4</span> Using different shovels</h3
 </div>
 <div id="sampling-framework" class="section level2">
 <h2><span class="header-section-number">7.3</span> Sampling framework</h2>
-<p>In both our tactile and our virtual sampling activities, we used sampling for the purpose of estimation. We extracted samples in order to <em>estimate</em> the proportion of the bowl’s balls that are red. We used sampling as a less time consuming approach than to perform an exhaustive count of all the balls. Our virtual sampling activity built up to the results shown in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> and Table <a href="7-sampling.html#tab:comparing-n">7.1</a>: comparing 1000 proportions red based on samples of size 25, 50, and 100. This was our first attempt at understanding two key concepts relating to sampling for estimation:</p>
+<p>In both our tactile and our virtual sampling activities, we used sampling for the purpose of estimation. We extracted samples in order to <em>estimate</em> the proportion of the bowl’s balls that are red. We used sampling as a less time-consuming approach than performing an exhaustive count of all the balls. Our virtual sampling activity built up to the results shown in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> and Table <a href="7-sampling.html#tab:comparing-n">7.1</a>: comparing 1000 proportions red based on samples of size 25, 50, and 100. This was our first attempt at understanding two key concepts relating to sampling for estimation:</p>
 <ol style="list-style-type: decimal">
 <li>The effect of <em>sampling variation</em> on our estimates.</li>
 <li>The effect of sample size on <em>sampling variation</em>.</li>
 </ol>
 <p>Let’s now introduce some terminology and notation as well as statistical definitions related to sampling. Given the number of new words you’ll need to learn, you will likely have to read this section a few times. Keep in mind, however, that all of the concepts underlying the terminology, notation, and definitions tie directly to the concepts underlying our tactile and virtual sampling activities. It will simply take time and practice to master them.</p>
 <div id="terminology-and-notation" class="section level3">
-<h3><span class="header-section-number">7.3.1</span> Terminology &amp; notation</h3>
+<h3><span class="header-section-number">7.3.1</span> Terminology and notation</h3>
 <p>Here is a list of terminology and mathematical notation relating to sampling.</p>
-<p>First, A <strong>(study) population</strong> is a collection of individuals or observations about which we are interested in. We mathematically denote the population’s size using upper case <span class="math inline">\(N\)</span>. In our sampling activities, the (study) population is the collection of <span class="math inline">\(N\)</span> = 2400 identically sized red and white balls contained in the bowl.</p>
-<p>Second, a <strong>population parameter</strong> is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the <em>population mean</em>. This is mathematically denoted with the Greek letter <span class="math inline">\(\mu\)</span> pronounced “mu” (We’ll see a sampling activity involving means in the upcoming Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a>). In our earlier sampling from the bowl activity however, since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the <em>population proportion</em> . This is mathematically denoted with the letter <span class="math inline">\(p\)</span>.</p>
-<p>Third, a <strong>census</strong> is an exhaustive enumeration or counting of all <span class="math inline">\(N\)</span> individuals or observations in the population in order to compute the population parameter’s value <em>exactly</em>. In our sampling activity, this would correspond to counting the number of balls out of <span class="math inline">\(N\)</span> = 2400 that are red and computing the <em>population proportion</em> <span class="math inline">\(p\)</span> that are red <em>exactly</em>. When the number <span class="math inline">\(N\)</span> of individuals or observations in our population is large as was the case with our bowl, a census can be very expensive in terms of time, energy, and money.</p>
-<p>Fourth, <strong>sampling</strong> is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case <span class="math inline">\(n\)</span>, as opposed to upper case <span class="math inline">\(N\)</span> which denotes the population’s size. Typically the sample size <span class="math inline">\(n\)</span> is much smaller than the population size <span class="math inline">\(N\)</span>. Thus sampling is a much cheaper alternative than performing a census. In our sampling activities, we used shovels with 25, 50, and 100 slots to extract a sample of size <span class="math inline">\(n\)</span> = 25, <span class="math inline">\(n\)</span> = 50, and <span class="math inline">\(n\)</span> = 100.</p>
-<p>Fifth, A <strong>point estimate (AKA sample statistic)</strong> is a summary statistic computed from a sample that <em>estimates</em> an unknown population parameter. In our sampling activities, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with <span class="math inline">\(p\)</span>. Our point estimate is the <em>sample proportion</em>: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls balls that are red. We mathematically denote the sample proportion using <span class="math inline">\(\widehat{p}\)</span>. The “hat” on top of the <span class="math inline">\(p\)</span> indicates that it is an estimate of the unknown population proportion <span class="math inline">\(p\)</span>.</p>
-<p>Sixth, the idea of <strong>representative sampling</strong>. A sample is said to be a <em>representative sample</em> if it roughly <em>looks like</em> the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our sampling activity, are the samples of <span class="math inline">\(n\)</span> balls extracted using our shovels representative of the bowl’s <span class="math inline">\(N\)</span> = 2400 balls?</p>
-<p>Seventh, the idea of <strong>generalizability</strong>. We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, does the value of the point estimate <em>generalize</em> to the population? In our sampling activity, can we generalize the sample proportion from our shovels to the entire bowl? Using our mathematical notation, this is akin to asking if <span class="math inline">\(\widehat{p}\)</span> a “good guess” of <span class="math inline">\(p\)</span>?</p>
-<p>Eighth, we say <strong>biased sampling</strong> occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is <em>unbiased</em> if every observation in a population had an equal chance of being sampled. In our sampling activities, since each equally sized balls had an equal chance of being sampled in our shovels, our samples were unbiased.</p>
+<p>First, a <strong>population</strong> is a collection of individuals or observations we are interested in. This is also commonly denoted as a <strong>study population</strong>. We mathematically denote the population’s size using upper-case <span class="math inline">\(N\)</span>. In our sampling activities, the (study) population is the collection of <span class="math inline">\(N\)</span> = 2400 identically sized red and white balls contained in the bowl.</p>
+<p>Second, a <strong>population parameter</strong> is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the <em>population mean</em>. This is mathematically denoted with the Greek letter <span class="math inline">\(\mu\)</span> pronounced “mu” (we’ll see a sampling activity involving means in the upcoming Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a>). In our earlier sampling from the bowl activity, however, since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the <em>population proportion</em>. This is mathematically denoted with the letter <span class="math inline">\(p\)</span>.</p>
+<p>Third, a <strong>census</strong> is an exhaustive enumeration or counting of all <span class="math inline">\(N\)</span> individuals or observations in the population in order to compute the population parameter’s value <em>exactly</em>. In our sampling activity, this would correspond to counting the number of balls out of <span class="math inline">\(N\)</span> = 2400 that are red and computing the <em>population proportion</em> <span class="math inline">\(p\)</span> that are red <em>exactly</em>. When the number <span class="math inline">\(N\)</span> of individuals or observations in our population is large as was the case with our bowl, a census can be quite expensive in terms of time, energy, and money.</p>
+<p>Fourth, <strong>sampling</strong> is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case <span class="math inline">\(n\)</span>, as opposed to upper case <span class="math inline">\(N\)</span> which denotes the population’s size. Typically the sample size <span class="math inline">\(n\)</span> is much smaller than the population size <span class="math inline">\(N\)</span>. Thus sampling is a much cheaper alternative than performing a census. In our sampling activities, we used shovels with 25, 50, and 100 slots to extract samples of size <span class="math inline">\(n\)</span> = 25, <span class="math inline">\(n\)</span> = 50, and <span class="math inline">\(n\)</span> = 100.</p>
+<p>Fifth, a <strong>point estimate (AKA sample statistic)</strong> is a summary statistic computed from a sample that <em>estimates</em> an unknown population parameter. In our sampling activities, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with <span class="math inline">\(p\)</span>. Our point estimate is the <em>sample proportion</em>: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls that are red. We mathematically denote the sample proportion using <span class="math inline">\(\widehat{p}\)</span>. The “hat” on top of the <span class="math inline">\(p\)</span> indicates that it is an estimate of the unknown population proportion <span class="math inline">\(p\)</span>.</p>
+<p>Sixth is the idea of <strong>representative sampling</strong>. A sample is said to be a <em>representative sample</em> if it roughly <em>looks like</em> the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our sampling activity, are the samples of <span class="math inline">\(n\)</span> balls extracted using our shovels representative of the bowl’s <span class="math inline">\(N\)</span> = 2400 balls?</p>
+<p>Seventh is the idea of <strong>generalizability</strong>. We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, does the value of the point estimate <em>generalize</em> to the population? In our sampling activity, can we generalize the sample proportion from our shovels to the entire bowl? Using our mathematical notation, this is akin to asking if <span class="math inline">\(\widehat{p}\)</span> is a “good guess” of <span class="math inline">\(p\)</span>?</p>
+<p>Eighth, we say <strong>biased sampling</strong> occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is <em>unbiased</em> if every observation in a population has an equal chance of being sampled. In our sampling activities, since we mixed all <span class="math inline">\(N = 2400\)</span> balls prior to each group’s sampling and since each of the equally sized balls had an equal chance of being sampled, our samples were unbiased.</p>
 <p>Ninth and lastly is the idea of <strong>random sampling</strong>. We say a sampling procedure is <em>random</em> if we sample randomly from the population in an unbiased fashion. In our sampling activities, this would correspond to sufficiently mixing the bowl before each use of the shovel.</p>
 <p>Phew, that’s a lot of new terminology and notation to learn! Let’s put them all together to describe the paradigm of sampling.</p>
-<p><strong>In general</strong>:</p>
+<p><strong>In general:</strong></p>
 <ul>
 <li>If the sampling of a sample of size <span class="math inline">\(n\)</span> is done at <strong>random</strong>, then</li>
 <li>the sample is <strong>unbiased</strong> and <strong>representative</strong> of the population of size <span class="math inline">\(N\)</span>, thus</li>
@@ -1159,15 +1175,15 @@ <h3><span class="header-section-number">7.3.1</span> Terminology &amp; notation<
 <li>the point estimate is a <strong>“good guess”</strong> of the unknown population parameter, thus</li>
 <li>instead of performing a census, we can <strong>infer</strong> about the population using sampling.</li>
 </ul>
-<p><strong>Specific to our sampling activity:</strong>:</p>
+<p><strong>Specific to our sampling activity:</strong></p>
 <ul>
-<li>If we extract a sample of <span class="math inline">\(n=50\)</span> balls at <strong>random</strong>, in other words, we mix e equally-sized balls before using the shovel, then</li>
+<li>If we extract a sample of <span class="math inline">\(n=50\)</span> balls at <strong>random</strong>, in other words, we mix all of the equally sized balls before using the shovel, then</li>
 <li>the contents of the shovel are an <strong>unbiased representation</strong> of the contents of the bowl’s 2400 balls, thus</li>
 <li>any result based on the shovel’s balls can <strong>generalize</strong> to the bowl, thus</li>
-<li>the sample proportion <span class="math inline">\(\widehat{p}\)</span> of the <span class="math inline">\(n=50\)</span> balls in the shovel that are red is a <strong>“good guess”</strong> of the population proportion <span class="math inline">\(p\)</span> of the <span class="math inline">\(N\)</span>=2400 balls that are red, thus</li>
+<li>the sample proportion <span class="math inline">\(\widehat{p}\)</span> of the <span class="math inline">\(n=50\)</span> balls in the shovel that are red is a <strong>“good guess”</strong> of the population proportion <span class="math inline">\(p\)</span> of the <span class="math inline">\(N=2400\)</span> balls that are red, thus</li>
 <li>instead of manually going over all 2400 balls in the bowl, we can <strong>infer</strong> about the bowl using the shovel.</li>
 </ul>
-<p>Note that last word we wrote in bold: <strong>infer</strong>. The act of “inferring” means to deduce or conclude (information) from evidence and reasoning. In our sampling activities, we wanted to infer about the proportion of the bowl’s balls that are red. <em>Statistical inference</em> is the “theory, methods, and practice of forming judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling” (Wikipedia). In other words, statistical inference is the act of inference via sampling. In the upcoming Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a> on confidence intervals, we’ll introduce the <code>infer</code> package, which makes statistical inference “tidy” and transparent. It is why this third portion of the book is called “Statistical inference via infer”.</p>
+<p>Note that last word we wrote in bold: <strong>infer</strong>. The act of “inferring” means to deduce or conclude information from evidence and reasoning. In our sampling activities, we wanted to infer about the proportion of the bowl’s balls that are red. <a href="https://en.wikipedia.org/wiki/Statistical_inference"><em>Statistical inference</em></a> is the “theory, methods, and practice of forming judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.” In other words, statistical inference is the act of inference via sampling. In the upcoming Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a> on confidence intervals, we’ll introduce the <code>infer</code> package, which makes statistical inference “tidy” and transparent. It is why this third portion of the book is called “Statistical inference via infer.”</p>
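<p>The sampling paradigm above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the vector <code>bowl_colors</code> is a hypothetical stand-in for the book's <code>bowl</code> data frame, built from the counts of 900 red and 1500 white balls reported later in this section, and the seed is arbitrary.</p>

```r
# Hypothetical stand-in for the bowl: 900 red and 1500 white balls (N = 2400).
bowl_colors <- c(rep("red", 900), rep("white", 1500))

set.seed(76)  # arbitrary seed, used only so the sketch is reproducible

# One use of the shovel: a random sample of n = 50 balls without replacement.
n <- 50
shovel <- sample(bowl_colors, size = n, replace = FALSE)

# Point estimate: the sample proportion p-hat of the shovel's balls that are red.
p_hat <- mean(shovel == "red")

# Population parameter p, known here only because we built the bowl ourselves.
p <- mean(bowl_colors == "red")

p_hat
p
```

<p>In a real study only <code>p_hat</code> would be available; <code>p</code> is computed here purely to show what the point estimate is trying to guess.</p>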
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1185,18 +1201,18 @@ <h3><span class="header-section-number">7.3.1</span> Terminology &amp; notation<
 </div>
 <div id="sampling-definitions" class="section level3">
 <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3>
-<p>Now for some important statistical definitions related to sampling. As a refresher of our 1000 repeated/replicated virtual samples of size <span class="math inline">\(n\)</span> = 25, <span class="math inline">\(n\)</span> = 50, and <span class="math inline">\(n\)</span> = 100 in Section <a href="7-sampling.html#sampling-simulation">7.2</a>, let’s display Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> again.</p>
+<p>Now, for some important statistical definitions related to sampling. As a refresher of our 1000 repeated/replicated virtual samples of size <span class="math inline">\(n\)</span> = 25, <span class="math inline">\(n\)</span> = 50, and <span class="math inline">\(n\)</span> = 100 in Section <a href="7-sampling.html#sampling-simulation">7.2</a>, let’s display Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> again as Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-1b">7.13</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:comparing-sampling-distributions-1b"></span>
-<img src="moderndive_files/figure-html/comparing-sampling-distributions-1b-1.png" alt="Previously seen three sampling distributions of the sample proportion $\widehat{p}$." width="\textwidth" />
+<img src="ModernDive_files/figure-html/comparing-sampling-distributions-1b-1.png" alt="Previously seen three distributions of the sample proportion $\widehat{p}$." width="\textwidth" />
 <p class="caption">
-FIGURE 7.13: Previously seen three sampling distributions of the sample proportion <span class="math inline">\(\widehat{p}\)</span>.
+FIGURE 7.13: Previously seen three distributions of the sample proportion <span class="math inline">\(\widehat{p}\)</span>.
 </p>
 </div>
 <p>These types of distributions have a special name: <strong>sampling distributions</strong>;  their visualization displays the effect of sampling variation on the distribution of any point estimate, in this case, the sample proportion <span class="math inline">\(\widehat{p}\)</span>. Using these sampling distributions, for a given sample size <span class="math inline">\(n\)</span>, we can make statements about what values we can typically expect.</p>
-<p>For example, observe the centers of all three sampling distributions: they are all roughly centered around 0.4 = 40%. Furthermore, observe that while we are somewhat likely to observe sample proportions red of 0.2 = 20% when using the shovel with 25 slots, we will almost never observe a proportion of 20% when using the shovel with 100 slots. Observe also the effect of sample size on the sampling variation. As the sample size <span class="math inline">\(n\)</span> increases from 25 to 50 to 100,  the variation of the sampling distribution decreases and thus the values cluster more and more tightly around the same center of around 40%. We quantified this variation using the standard deviation of our sample proportions in Table <a href="7-sampling.html#tab:comparing-n">7.1</a>, which we display again:</p>
+<p>For example, observe the centers of all three sampling distributions: they are all roughly centered around <span class="math inline">\(0.4 = 40\%\)</span>. Furthermore, observe that while we are somewhat likely to observe sample proportions of red balls of <span class="math inline">\(0.2 = 20\%\)</span> when using the shovel with 25 slots, we will almost never observe a proportion of 20% when using the shovel with 100 slots. Observe also the effect of sample size on the sampling variation. As the sample size <span class="math inline">\(n\)</span> increases from 25 to 50 to 100,  the variation of the sampling distribution decreases and thus the values cluster more and more tightly around the same center of around 40%. We quantified this variation using the standard deviation of our sample proportions in Table <a href="7-sampling.html#tab:comparing-n">7.1</a>, which we display again as Table <a href="7-sampling.html#tab:comparing-n-repeat">7.2</a>:</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:comparing-n-repeat">TABLE 7.2: </span>Previously seen comparing standard deviations of proportions red for 3 different shovels.
+<span id="tab:comparing-n-repeat">TABLE 7.2: </span>Previously seen comparison of standard deviations of proportions red for three different shovels
 </caption>
 <thead>
 <tr>
@@ -1214,7 +1230,7 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 25
 </td>
 <td style="text-align:right;">
-0.099
+0.094
 </td>
 </tr>
 <tr>
@@ -1222,7 +1238,7 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 50
 </td>
 <td style="text-align:right;">
-0.071
+0.069
 </td>
 </tr>
 <tr>
@@ -1230,16 +1246,16 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 100
 </td>
 <td style="text-align:right;">
-0.048
+0.045
 </td>
 </tr>
 </tbody>
 </table>
-<p>So as the sample size increases, the standard deviation decreases. This type of standard deviation has another special name:  <strong>standard error</strong>. Standard errors quantify the effect of sampling variation induced on our estimates. In other words, they quantify how much we can expect different proportions of a shovel’s balls that are red <em>to vary</em> from one sample to another sample to another sample, and so on.</p>
-<p>Unfortunately, these names confuse many people new to statistical inference. For example, it’s common for people new to statistical inference to call the “sampling distribution” the “sample distribution.” Another additional source of confusion is the name “standard deviation” and “standard error.” Remember that a standard error is merely a <em>kind</em> of standard deviation: the standard deviation of any point estimate from sampling. In other words, all standard errors are standard deviations, but not all standard deviations are necessarily a standard error.</p>
+<p>So as the sample size increases, the standard deviation of the proportion of red balls decreases. This type of standard deviation has another special name:  <strong>standard error</strong>. Standard errors quantify the effect of sampling variation induced on our estimates. In other words, they quantify how much we can expect different proportions of a shovel’s balls that are red <em>to vary</em> from one sample to another sample to another sample, and so on. As a general rule, as sample size increases, the standard error decreases.</p>
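<p>As a rough check of this rule, the following base R sketch simulates 1000 sample proportions for each shovel size and takes their standard deviation. The vector <code>bowl_colors</code> and the helper <code>simulate_se()</code> are hypothetical names introduced for this illustration, and the comparison formula <span class="math inline">\(\sqrt{p(1-p)/n}\)</span> is a standard approximation that is not derived in this section; it ignores the small correction for sampling without replacement from a finite bowl.</p>

```r
# Hypothetical stand-in for the bowl: 900 red and 1500 white balls.
bowl_colors <- c(rep("red", 900), rep("white", 1500))

set.seed(76)  # arbitrary seed for reproducibility

# Standard deviation of `reps` sample proportions for a given sample size n.
simulate_se <- function(n, reps = 1000) {
  p_hats <- replicate(reps, mean(sample(bowl_colors, size = n) == "red"))
  sd(p_hats)
}

sample_sizes <- c(25, 50, 100)
simulated_se <- sapply(sample_sizes, simulate_se)

# Approximation sqrt(p * (1 - p) / n), shown only for comparison.
p <- mean(bowl_colors == "red")
formula_se <- sqrt(p * (1 - p) / sample_sizes)

data.frame(n = sample_sizes, simulated_se, formula_se)
```

<p>The simulated values will differ slightly from run to run and from the table above, but they should shrink as <span class="math inline">\(n\)</span> grows.</p>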
+<p>Unfortunately, these names confuse many people who are new to statistical inference. For example, it’s common for people who are new to statistical inference to call the “sampling distribution” the “sample distribution.” Another source of confusion is the pair of names “standard deviation” and “standard error.” Remember that a standard error is merely a <em>kind</em> of standard deviation: the standard deviation of any point estimate from sampling. In other words, all standard errors are standard deviations, but not every standard deviation is necessarily a standard error.</p>
 <p>To help reinforce these concepts, let’s re-display Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> but using our new terminology, notation, and definitions relating to sampling in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-2">7.14</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:comparing-sampling-distributions-2"></span>
-<img src="moderndive_files/figure-html/comparing-sampling-distributions-2-1.png" alt="Three sampling distributions of the sample proportion $\widehat{p}$." width="\textwidth" />
+<img src="ModernDive_files/figure-html/comparing-sampling-distributions-2-1.png" alt="Three sampling distributions of the sample proportion $\widehat{p}$." width="\textwidth" />
 <p class="caption">
 FIGURE 7.14: Three sampling distributions of the sample proportion <span class="math inline">\(\widehat{p}\)</span>.
 </p>
@@ -1247,7 +1263,7 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 <p>Furthermore, let’s re-display Table <a href="7-sampling.html#tab:comparing-n">7.1</a> but using our new terminology, notation, and definitions relating to sampling in Table <a href="7-sampling.html#tab:comparing-n-2">7.3</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:comparing-n-2">TABLE 7.3: </span>Three standard errors of the sample proportion based on n = 25, 50, 100.
+<span id="tab:comparing-n-2">TABLE 7.3: </span>Standard errors of the sample proportion based on sample sizes of 25, 50, and 100
 </caption>
 <thead>
 <tr>
@@ -1265,7 +1281,7 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 n = 25
 </td>
 <td style="text-align:right;">
-0.099
+0.094
 </td>
 </tr>
 <tr>
@@ -1273,7 +1289,7 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 n = 50
 </td>
 <td style="text-align:right;">
-0.071
+0.069
 </td>
 </tr>
 <tr>
@@ -1281,12 +1297,12 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 n = 100
 </td>
 <td style="text-align:right;">
-0.048
+0.045
 </td>
 </tr>
 </tbody>
 </table>
-<p>Remember the key message of this last table: that as the sample size <span class="math inline">\(n\)</span> goes up, the “typical” error of your point estimate will go down (as quantified by the <em>standard error</em>).</p>
+<p>Remember the key message of this last table: that as the sample size <span class="math inline">\(n\)</span> goes up, the “typical” error of your point estimate will go down, as quantified by the <em>standard error</em>.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1301,38 +1317,38 @@ <h3><span class="header-section-number">7.3.2</span> Statistical definitions</h3
 <div id="moral-of-the-story" class="section level3">
 <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 <p>Let’s recap this section so far. We’ve seen that if a sample is generated at random, then the resulting point estimate is a “good guess” of the true unknown population parameter. In our sampling activities, since we made sure to mix the balls first before extracting a sample with the shovel, the resulting sample proportion <span class="math inline">\(\widehat{p}\)</span> of the shovel’s balls that were red was a “good guess” of the population proportion <span class="math inline">\(p\)</span> of the bowl’s balls that were red.</p>
-<p>However, what do we mean by our point estimate being a “good guess”? Sometimes we’ll get an estimate that is less than the true value of the population parameter, while at other times we’ll get an estimate that is greater. This is due to sampling variation. However, despite this sampling variation, our estimates will “on average” be correct and thus will be centered at the true value. This is because our sampling was done at random and thus in an unbiased fashion.</p>
+<p>However, what do we mean by our point estimate being a “good guess”? Sometimes, we’ll get an estimate that is less than the true value of the population parameter, while at other times we’ll get an estimate that is greater. This is due to sampling variation. However, despite this sampling variation, our estimates will “on average” be correct and thus will be centered at the true value. This is because our sampling was done at random and thus in an unbiased fashion.</p>
 <p>In our sampling activities, sometimes our sample proportion <span class="math inline">\(\widehat{p}\)</span> was less than the true population proportion <span class="math inline">\(p\)</span>, while at other times it was greater. This was due to sampling variation. However, despite this sampling variation, our sample proportions <span class="math inline">\(\widehat{p}\)</span> were “on average” correct and thus were centered at the true value of the population proportion <span class="math inline">\(p\)</span>. This is because we mixed our bowl before taking samples, so the sampling was done at random and thus in an unbiased fashion. This is also known as having an <em>accurate</em> estimate.</p>
 <p>What was the value of the population proportion <span class="math inline">\(p\)</span> of the <span class="math inline">\(N\)</span> = 2400 balls in the actual bowl that were red? There were 900 red balls, for a proportion red of 900/2400 = 0.375 = 37.5%! How do we know this? Did the authors do an exhaustive count of all the balls? No! They were listed in the contents of the box that the bowl came in! Hence we were able to make the contents of the virtual <code>bowl</code> match the tactile bowl:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sum_red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>), 
-            <span class="dt">sum_not_red =</span> <span class="kw">sum</span>(color <span class="op">!=</span><span class="st"> &quot;red&quot;</span>))</code></pre>
+<div class="sourceCode" id="cb236"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb236-1" data-line-number="1">bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb236-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sum_red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>), </a>
+<a class="sourceLine" id="cb236-3" data-line-number="3">            <span class="dt">sum_not_red =</span> <span class="kw">sum</span>(color <span class="op">!=</span><span class="st"> &quot;red&quot;</span>))</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
   sum_red sum_not_red
     &lt;int&gt;       &lt;int&gt;
 1     900        1500</code></pre>
 <p>Let’s re-display our sampling distributions from Figures <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> and <a href="7-sampling.html#fig:comparing-sampling-distributions-2">7.14</a>, but now with a vertical red line marking the true proportion of balls that are red, <span class="math inline">\(p\)</span> = 37.5%, in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a>. We see that while there is a certain amount of error in the sample proportions <span class="math inline">\(\widehat{p}\)</span> for all three sampling distributions, on average the <span class="math inline">\(\widehat{p}\)</span> are centered at the true population proportion red <span class="math inline">\(p\)</span>.</p>
 <div class="figure" style="text-align: center"><span id="fig:comparing-sampling-distributions-3"></span>
-<img src="moderndive_files/figure-html/comparing-sampling-distributions-3-1.png" alt="Three sampling distributions with population proportion $p$ marked in red." width="\textwidth" />
+<img src="ModernDive_files/figure-html/comparing-sampling-distributions-3-1.png" alt="Three sampling distributions with population proportion $p$ marked by vertical line." width="\textwidth" />
 <p class="caption">
-FIGURE 7.15: Three sampling distributions with population proportion <span class="math inline">\(p\)</span> marked in red.
+FIGURE 7.15: Three sampling distributions with population proportion <span class="math inline">\(p\)</span> marked by vertical line.
 </p>
 </div>
 <p>We also saw in this section that as your sample size <span class="math inline">\(n\)</span> increases, your point estimates will vary less and less and be more and more concentrated around the true population parameter. This variation is quantified by the decreasing <em>standard error</em>. In other words, the typical error of your point estimates will decrease. In our sampling exercise, as the sample size increased, the variation of our sample proportions <span class="math inline">\(\widehat{p}\)</span> decreased. You can observe this behavior in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a>. This is also known as having a <em>precise</em> estimate.</p>
-<p>So random and unbiased sampling ensures our point estimates are <em>accurate</em>, while on the other hand having a large sample size ensures our point estimates are <em>precise</em>. While the terms “accuracy” and “precision” may sound like they mean the same thing, there is a subtle difference. Accuracy describes how “on target” our estimates are, whereas precision describes how “consistent” our estimates are. Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a> illustrates the difference.</p>
+<p>So random sampling ensures our point estimates are <em>accurate</em>, while on the other hand having a large sample size ensures our point estimates are <em>precise</em>. While the terms “accuracy” and “precision” may sound like they mean the same thing, there is a subtle difference. Accuracy describes how “on target” our estimates are, whereas precision describes how “consistent” our estimates are. Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a> illustrates the difference.</p>
 <div class="figure" style="text-align: center"><span id="fig:accuracy-vs-precision"></span>
-<img src="images/accuracy_vs_precision.jpg" alt="Comparing accuracy and precision." width="50%" />
+<img src="images/accuracy_vs_precision.jpg" alt="Comparing accuracy and precision." width="75%" height="75%" />
 <p class="caption">
 FIGURE 7.16: Comparing accuracy and precision.
 </p>
 </div>
-<p>As this point, you might be asking yourself: “If we already knew the true proportion of the bowl’s balls that are red was 37.5%, then why did do any sampling?” You might also be asking: “Why did we take 1000 repeated samples of size n = 25, 50, and 100? Shouldn’t we be taking only <em>one</em> sample that’s as large as possible?” If you did ask yourself these questions, your suspicion is merited!</p>
-<p>The sampling activity involving the bowl is merely an <em>idealized version</em> of how sampling is done in real-life. We performed this exercise only to study and understand:</p>
+<p>At this point, you might be asking yourself: “If we already knew the true proportion of the bowl’s balls that are red was 37.5%, then why did we do any sampling?”. You might also be asking: “Why did we take 1000 repeated samples of size n = 25, 50, and 100? Shouldn’t we be taking only <em>one</em> sample that’s as large as possible?”. If you did ask yourself these questions, your suspicion is merited!</p>
+<p>The sampling activity involving the bowl is merely an <em>idealized version</em> of how sampling is done in real life. We performed this exercise only to study and understand:</p>
 <ol style="list-style-type: decimal">
 <li>The effect of sampling variation.</li>
 <li>The effect of sample size on sampling variation.</li>
 </ol>
-<p>This not how sampling is done in real-life. In a real-life scenario, we won’t know what the true value of the population parameter is. Furthermore we wouldn’t take 1000 repeated/replicated samples, but rather a single sample that’s as large as we can afford. In the next section, let’s now study a real-life example of sampling: polls.</p>
+<p>This is not how sampling is done in real life. In a real-life scenario, we won’t know what the true value of the population parameter is. Furthermore, we wouldn’t take 1000 repeated/replicated samples, but rather a single sample that’s as large as we can afford. In the next section, let’s now study a real-life example of sampling: polls.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1341,7 +1357,7 @@ <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 <p><strong>(LC7.16)</strong> The table that follows is a version of Table <a href="7-sampling.html#tab:comparing-n-2">7.3</a> matching sample sizes <span class="math inline">\(n\)</span> to different <em>standard errors</em> of the sample proportion <span class="math inline">\(\widehat{p}\)</span>, but with the rows randomly re-ordered and the sample sizes removed. Fill in the table by matching the correct sample sizes to the correct standard errors.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:comparing-n-3">TABLE 7.4: </span>Three standard errors of the sample proportion based on n = 25, 50, 100.
+<span id="tab:comparing-n-3">TABLE 7.4: </span>Standard errors of <span class="math inline">\(\widehat{p}\)</span> based on n = 25, 50, 100
 </caption>
 <thead>
 <tr>
@@ -1349,7 +1365,7 @@ <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 Sample size
 </th>
 <th style="text-align:right;">
-Standard error of p-hat
+Standard error of <span class="math inline">\(\widehat{p}\)</span>
 </th>
 </tr>
 </thead>
@@ -1359,7 +1375,7 @@ <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 n =
 </td>
 <td style="text-align:right;">
-0.099
+0.094
 </td>
 </tr>
 <tr>
@@ -1367,7 +1383,7 @@ <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 n =
 </td>
 <td style="text-align:right;">
-0.048
+0.045
 </td>
 </tr>
 <tr>
@@ -1375,16 +1391,16 @@ <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 n =
 </td>
 <td style="text-align:right;">
-0.071
+0.069
 </td>
 </tr>
 </tbody>
 </table>
-<p>For the following four learning checks, let the <em>estimate</em> be the sample proportion <span class="math inline">\(\widehat{p}\)</span>: the proportion of a shovel’s balls that were red. It estimates the population proportion <span class="math inline">\(p\)</span>: the proportion of the bowl’s balls that were red.</p>
-<p><strong>(LC7.17)</strong> What is the difference between an <em>accurate</em> estimate and a <em>precise</em> estimate?</p>
+<p>For the following four <em>Learning checks</em>, let the <em>estimate</em> be the sample proportion <span class="math inline">\(\widehat{p}\)</span>: the proportion of a shovel’s balls that were red. It estimates the population proportion <span class="math inline">\(p\)</span>: the proportion of the bowl’s balls that were red.</p>
+<p><strong>(LC7.17)</strong> What is the difference between an <em>accurate</em> and a <em>precise</em> estimate?</p>
 <p><strong>(LC7.18)</strong> How do we ensure that an estimate is <em>accurate</em>? How do we ensure that an estimate is <em>precise</em>?</p>
-<p><strong>(LC7.19)</strong> In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then what was the purpose of our exercises where we took 1000 different samples?</p>
-<p><strong>(LC7.20)</strong> Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a> with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding <em>sampling distributions</em> of the sample proportion <span class="math inline">\(\widehat{p}\)</span>, like the one in the left-most plot in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a>.</p>
+<p><strong>(LC7.19)</strong> In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then, what was the purpose of our exercises where we took 1000 different samples?</p>
+<p><strong>(LC7.20)</strong> Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a> with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding <em>sampling distributions</em> of the sample proportion <span class="math inline">\(\widehat{p}\)</span>, like the one in the leftmost plot in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a>.</p>
 <div class="learncheck">
 
 </div>
@@ -1393,15 +1409,15 @@ <h3><span class="header-section-number">7.3.3</span> The moral of the story</h3>
 <div id="sampling-case-study" class="section level2">
 <h2><span class="header-section-number">7.4</span> Case study: Polls</h2>
 <p>Let’s now switch gears to a more realistic sampling scenario than our bowl activity: a poll. In practice, pollsters do not take 1000 repeated samples as we did in our previous sampling activities, but rather take only a <em>single sample</em> that’s as large as possible.</p>
-<p>On December 4, 2013, National Public Radio in the US reported on a poll of President Obama’s approval rating among young Americans aged 18-29 in an article <a href="https://www.npr.org/sections/itsallpolitics/2013/12/04/248793753/poll-support-for-obama-among-young-americans-eroding">“Poll: Support For Obama Among Young Americans Eroding”</a>. The poll was conducted by the Harvard University Institute of Politics. A quote from the article:</p>
+<p>On December 4, 2013, National Public Radio in the US reported on a poll of President Obama’s approval rating among young Americans aged 18-29 in an article, <a href="https://www.npr.org/sections/itsallpolitics/2013/12/04/248793753/poll-support-for-obama-among-young-americans-eroding">“Poll: Support For Obama Among Young Americans Eroding.”</a> The poll was conducted by the Kennedy School’s Institute of Politics at Harvard University. A quote from the article:</p>
 <blockquote>
 <p>After voting for him in large numbers in 2008 and 2012, young Americans are souring on President Obama.</p>
 <p>According to a new Harvard University Institute of Politics poll, just 41 percent of millennials — adults ages 18-29 — approve of Obama’s job performance, his lowest-ever standing among the group and an 11-point drop from April.</p>
 </blockquote>
-<p>Let’s tie elements of the real-life poll in this new article with our “tactile” and “virtual” bowl activity from Sections <a href="7-sampling.html#sampling-activity">7.1</a> and <a href="7-sampling.html#sampling-simulation">7.2</a> using the terminology, notations, and definitions we learned in Section <a href="7-sampling.html#sampling-framework">7.3</a>. You see that our sampling activity with the bowl is an idealized version of what pollsters are trying to do in real-life.</p>
+<p>Let’s tie elements of the real-life poll in this news article with our “tactile” and “virtual” bowl activity from Sections <a href="7-sampling.html#sampling-activity">7.1</a> and <a href="7-sampling.html#sampling-simulation">7.2</a> using the terminology, notation, and definitions we learned in Section <a href="7-sampling.html#sampling-framework">7.3</a>. You’ll see that our sampling activity with the bowl is an idealized version of what pollsters are trying to do in real life.</p>
 <p>First, who is the <strong>(Study) Population</strong> of <span class="math inline">\(N\)</span> individuals or observations of interest? </p>
 <ul>
-<li>Bowl: <span class="math inline">\(N\)</span> = 2400 identically-sized red and white balls</li>
+<li>Bowl: <span class="math inline">\(N\)</span> = 2400 identically sized red and white balls</li>
 <li>Obama poll: <span class="math inline">\(N\)</span> = ? young Americans aged 18-29</li>
 </ul>
 <p>Second, what is the <strong>population parameter</strong>? </p>
@@ -1412,27 +1428,27 @@ <h2><span class="header-section-number">7.4</span> Case study: Polls</h2>
 <p>Third, what would a <strong>census</strong> look like? </p>
 <ul>
 <li>Bowl: Manually going over all <span class="math inline">\(N\)</span> = 2400 balls and exactly computing the population proportion <span class="math inline">\(p\)</span> of the balls that are red.</li>
-<li>Obama poll: Locating all <span class="math inline">\(N\)</span> young Americans and asking them all if they approve of Obama’s job performance. In the case, we don’t even know what the population size <span class="math inline">\(N\)</span> is!</li>
+<li>Obama poll: Locating all <span class="math inline">\(N\)</span> young Americans and asking them all if they approve of Obama’s job performance. In this case, we don’t even know what the population size <span class="math inline">\(N\)</span> is!</li>
 </ul>
 <p>Fourth, how do you perform <strong>sampling</strong> to obtain a sample of size <span class="math inline">\(n\)</span>? </p>
 <ul>
 <li>Bowl: Using a shovel with <span class="math inline">\(n\)</span> slots.</li>
-<li>Obama poll: One method is to get a list of phone numbers of all young Americans and pick out <span class="math inline">\(n\)</span> phone numbers. In this poll’s case, the sample size of this poll was <span class="math inline">\(n\)</span> = 2089 young Americans.</li>
+<li>Obama poll: One method is to get a list of phone numbers of all young Americans and pick out <span class="math inline">\(n\)</span> phone numbers. In this poll’s case, the sample size was <span class="math inline">\(n = 2089\)</span> young Americans.</li>
 </ul>
 <p>Fifth, what is your <strong>point estimate (AKA sample statistic)</strong> of the unknown population parameter?</p>
 <ul>
 <li>Bowl: The sample proportion <span class="math inline">\(\widehat{p}\)</span> of the balls in the shovel that were red.</li>
-<li>Obama poll: The sample proportion <span class="math inline">\(\widehat{p}\)</span> of young Americans in the sample that approve of Obama’s job performance. In this poll’s case, <span class="math inline">\(\widehat{p}\)</span> = 0.41 = 41%, the quoted percentage in the second paragraph of the article.  </li>
+<li>Obama poll: The sample proportion <span class="math inline">\(\widehat{p}\)</span> of young Americans in the sample that approve of Obama’s job performance. In this poll’s case, <span class="math inline">\(\widehat{p} = 0.41 = 41\%\)</span>, the quoted percentage in the second paragraph of the article.  </li>
 </ul>
 <p>Sixth, is the sampling procedure <strong>representative</strong>? </p>
 <ul>
 <li>Bowl: Are the contents of the shovel representative of the contents of the bowl? Because we mixed the bowl before sampling, we can feel confident that they are.</li>
-<li>Obama poll: Is the sample of <span class="math inline">\(n\)</span> = 2089 young Americans representative of <em>all</em> young Americans aged 18-29? This depends on whether the sampling was random.</li>
+<li>Obama poll: Is the sample of <span class="math inline">\(n = 2089\)</span> young Americans representative of <em>all</em> young Americans aged 18-29? This depends on whether the sampling was random.</li>
 </ul>
 <p>Seventh, are the samples <strong>generalizable</strong> to the greater population? </p>
 <ul>
 <li>Bowl: Is the sample proportion <span class="math inline">\(\widehat{p}\)</span> of the shovel’s balls that are red a “good guess” of the population proportion <span class="math inline">\(p\)</span> of the bowl’s balls that are red? Given that the sample was representative, the answer is yes.</li>
-<li>Obama poll: Is the sample proportion <span class="math inline">\(\widehat{p}\)</span> = 0.41 of the sample of young Americans who support Obama a “good guess” of the population proportion <span class="math inline">\(p\)</span> of all young Americans who support Obama? In other words, can we confidently say that roughly 41% of <em>all</em> young Americans approve of Obama? Again, this depends on whether the sampling was random.</li>
+<li>Obama poll: Is the sample proportion <span class="math inline">\(\widehat{p} = 0.41\)</span> of the sample of young Americans who supported Obama a “good guess” of the population proportion <span class="math inline">\(p\)</span> of all young Americans who supported Obama at this time in 2013? In other words, can we confidently say that roughly 41% of <em>all</em> young Americans approved of Obama at the time of the poll? Again, this depends on whether the sampling was random.</li>
 </ul>
 <p>Eighth, is the sampling procedure <strong>unbiased</strong>? In other words, do all observations have an equal chance of being included in the sample? </p>
 <ul>
@@ -1442,10 +1458,10 @@ <h2><span class="header-section-number">7.4</span> Case study: Polls</h2>
 <p>Ninth and lastly, was the sampling done at <strong>random</strong>? </p>
 <ul>
 <li>Bowl: As long as you mixed the bowl sufficiently before sampling, your samples would be random.</li>
-<li>Obama poll: Was the sample conducted at random? We can’t answer this question without knowing about the <em>sampling methodology</em> used by the Harvard University Institute of Politics. We’ll discuss this more at the end of this section.</li>
+<li>Obama poll: Was the sampling conducted at random? We can’t answer this question without knowing about the <em>sampling methodology</em> used by the Kennedy School’s Institute of Politics at Harvard University. We’ll discuss this more at the end of this section.</li>
 </ul>
-<p>In other words, the Harvard University Institute of Politics poll can be thought of as <em>an instance</em> of using the shovel to sample balls from the bowl. Furthermore, if another polling company conducted a similar poll of young Americans at roughly the same time, they would likely get a different estimate than 41%. This is due to <em>sampling variation</em>.</p>
-<p>Let’s now revisit the sampling paradigm from Section <a href="7-sampling.html#terminology-and-notation">7.3.1</a>:</p>
+<p>In other words, the poll by the Kennedy School’s Institute of Politics at Harvard University can be thought of as <em>an instance</em> of using the shovel to sample balls from the bowl. Furthermore, if another polling company conducted a similar poll of young Americans at roughly the same time, they would likely get a different estimate than 41%. This is due to <em>sampling variation</em>.</p>
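+<p>As a rough sense of the size of this sampling variation, and under the assumption (not verified here) that the poll was a simple random sample, the standard error of the sample proportion can be approximated with a standard formula that this chapter does not derive, substituting <span class="math inline">\(\widehat{p} = 0.41\)</span> for the unknown <span class="math inline">\(p\)</span> and using <span class="math inline">\(n = 2089\)</span>:</p>
+<p><span class="math display">\[\text{SE}(\widehat{p}) \approx \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} = \sqrt{\frac{0.41 \times 0.59}{2089}} \approx 0.011\]</span></p>
+<p>Under that assumption, estimates from similar polls of this size would typically differ from the true proportion by roughly one percentage point.</p>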
+<p>Let’s now revisit the sampling paradigm from Subsection <a href="7-sampling.html#terminology-and-notation">7.3.1</a>:</p>
 <p><strong>In general</strong>:</p>
 <ul>
 <li>If the sampling of a sample of size <span class="math inline">\(n\)</span> is done at <strong>random</strong>, then</li>
@@ -1454,23 +1470,23 @@ <h2><span class="header-section-number">7.4</span> Case study: Polls</h2>
 <li>the point estimate is a <strong>“good guess”</strong> of the unknown population parameter, thus</li>
 <li>instead of performing a census, we can <strong>infer</strong> about the population using sampling.</li>
 </ul>
-<p><strong>Specific to the bowl:</strong>:</p>
+<p><strong>Specific to the bowl:</strong></p>
 <ul>
-<li>If we extract a sample of <span class="math inline">\(n=50\)</span> balls at <strong>random</strong>, in other words, we mix e equally-sized balls before using the shovel, then</li>
+<li>If we extract a sample of <span class="math inline">\(n = 50\)</span> balls at <strong>random</strong>, in other words, we mix all of the equally sized balls before using the shovel, then</li>
 <li>the contents of the shovel are an <strong>unbiased representation</strong> of the contents of the bowl’s 2400 balls, thus</li>
 <li>any result based on the shovel’s balls can <strong>generalize</strong> to the bowl, thus</li>
-<li>the sample proportion <span class="math inline">\(\widehat{p}\)</span> of the <span class="math inline">\(n=50\)</span> balls in the shovel that are red is a <strong>“good guess”</strong> of the population proportion <span class="math inline">\(p\)</span> of the <span class="math inline">\(N\)</span>=2400 balls that are red, thus</li>
+<li>the sample proportion <span class="math inline">\(\widehat{p}\)</span> of the <span class="math inline">\(n = 50\)</span> balls in the shovel that are red is a <strong>“good guess”</strong> of the population proportion <span class="math inline">\(p\)</span> of the <span class="math inline">\(N = 2400\)</span> balls that are red, thus</li>
 <li>instead of manually going over all 2400 balls in the bowl, we can <strong>infer</strong> about the bowl using the shovel.</li>
 </ul>
-<p><strong>Specific to the Obama poll:</strong>:</p>
+<p><strong>Specific to the Obama poll:</strong></p>
 <ul>
-<li>If we had a way of contacting a <strong>randomly</strong> chosen sample of 2089 young Americans and poll their approval of President Obama, then</li>
-<li>these 2089 young Americans would be an <strong>unbiased</strong> and <strong>representative</strong> sample of <em>all</em> young Americans, thus</li>
-<li>any results based on this sample of 2089 young Americans can <strong>generalize</strong> to the entire population of <em>all</em> young Americans, thus</li>
-<li>the reported sample approval rating of 41% of these 2089 young Americans is a <strong>good guess</strong> of the true approval rating among all young Americans, thus</li>
-<li>instead of performing an expensive census of all young Americans, we can <strong>infer</strong> about all young Americans using polling.</li>
+<li>If we had a way of contacting a <strong>randomly</strong> chosen sample of 2089 young Americans and polling their approval of President Obama in 2013, then</li>
+<li>these 2089 young Americans would be an <strong>unbiased</strong> and <strong>representative</strong> sample of <em>all</em> young Americans in 2013, thus</li>
+<li>any results based on this sample of 2089 young Americans can <strong>generalize</strong> to the entire population of <em>all</em> young Americans in 2013, thus</li>
+<li>the reported sample approval rating of 41% of these 2089 young Americans is a <strong>“good guess”</strong> of the true approval rating among all young Americans in 2013, thus</li>
+<li>instead of performing an expensive census of all young Americans in 2013, we can <strong>infer</strong> about all young Americans in 2013 using polling.</li>
 </ul>
-<p>So as you can see, it was critical for the Harvard University Institute of Politics sample to be truly random in order to infer about <em>all</em> young Americans’ opinions about Obama. Was their sample truly random? It’s hard to answer such questions without knowing about the <em>sampling methodology</em> used. For example, if this poll was conducted using only mobile phone numbers, people without mobile phones would be left out and therefore not represented in the sample. What about if the Harvard University Institute of Politics conducted this poll on an internet news site? Then people who don’t read this internet news site would be left out. Ensuring that our samples were random was easy to do in our sampling bowl exercises, however in a real-life situation like the Obama poll, this is much harder to do.</p>
+<p>So as you can see, it was critical for the sample obtained by the Kennedy School’s Institute of Politics at Harvard University to be truly random in order to infer about <em>all</em> young Americans’ opinions about Obama. Was their sample truly random? It’s hard to answer such questions without knowing about the <em>sampling methodology</em> they used. For example, if this poll was conducted using only mobile phone numbers, people without mobile phones would be left out and therefore not represented in the sample. What if the Kennedy School’s Institute of Politics at Harvard University conducted this poll on an internet news site? Then people who don’t read this particular internet news site would be left out. Ensuring that our samples were random was easy to do in our sampling bowl exercises; however, in a real-life situation like the Obama poll, this is much harder to do.</p>
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -1479,8 +1495,8 @@ <h2><span class="header-section-number">7.4</span> Case study: Polls</h2>
 <p>Comment on the representativeness of the following <em>sampling methodologies</em>:</p>
 <p><strong>(LC7.21)</strong> The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force).</p>
 <p><strong>(LC7.22)</strong> Imagine it is 1993, a time when almost all households had landlines. You want to know the average number of people in each household in your city. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey.</p>
-<p><strong>(LC7.23)</strong> You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them “How many times did you download a pirated TV show last week?”</p>
-<p><strong>(LC7.24)</strong> A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of 5 randomly chosen graduates, contact them, and obtain their answers.</p>
+<p><strong>(LC7.23)</strong> You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them, “How many times did you download a pirated TV show last week?”.</p>
+<p><strong>(LC7.24)</strong> A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of five randomly chosen graduates, contact them, and obtain their answers.</p>
 <div class="learncheck">
 
 </div>
@@ -1488,7 +1504,7 @@ <h2><span class="header-section-number">7.4</span> Case study: Polls</h2>
 <div id="sampling-conclusion" class="section level2">
 <h2><span class="header-section-number">7.5</span> Conclusion</h2>
 <!--
-TODO:
+TODO: Contrast random sampling vs assignment
 
 ### Random sampling vs random assignment {#sampling-conclusion-sampling-vs-assignment}
 
@@ -1496,7 +1512,7 @@ <h2><span class="header-section-number">7.5</span> Conclusion</h2>
 -->
 <div id="sampling-conclusion-table" class="section level3">
 <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
-<p>In this chapter, we performed both tactile and virtual sampling exercises to infer about an unknown proportion. We also presented a case study of sampling in real-life: polls. In both cases, we used the sample proportion <span class="math inline">\(\widehat{p}\)</span> to estimate the population proportion <span class="math inline">\(p\)</span>. However, we are not just limited to scenarios related to proportions. In other words, we can use sampling to estimate other population parameters using other point estimates as well. We present 5 more such scenarios in Table <a href="7-sampling.html#tab:table-ch8">7.5</a>.</p>
+<p>In this chapter, we performed both tactile and virtual sampling exercises to infer about an unknown proportion. We also presented a case study of sampling in real life with polls. In each case, we used the sample proportion <span class="math inline">\(\widehat{p}\)</span> to estimate the population proportion <span class="math inline">\(p\)</span>. However, we are not just limited to scenarios related to proportions. In other words, we can use sampling to estimate other population parameters using other point estimates as well. We present four more such scenarios in Table <a href="7-sampling.html#tab:table-ch8">7.5</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:table-ch8">TABLE 7.5: </span>Scenarios of sampling for inference
@@ -1516,7 +1532,7 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -1525,16 +1541,16 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 <td style="text-align:right;width: 0.5in; ">
 1
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.2in; ">
 Population proportion
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.8in; ">
 <span class="math inline">\(p\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.5in; ">
 Sample proportion
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.6in; ">
 <span class="math inline">\(\widehat{p}\)</span>
 </td>
 </tr>
@@ -1542,16 +1558,16 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 <td style="text-align:right;width: 0.5in; ">
 2
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.2in; ">
 Population mean
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.8in; ">
 <span class="math inline">\(\mu\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.5in; ">
 Sample mean
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.6in; ">
 <span class="math inline">\(\overline{x}\)</span> or <span class="math inline">\(\widehat{\mu}\)</span>
 </td>
 </tr>
@@ -1559,16 +1575,16 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 <td style="text-align:right;width: 0.5in; ">
 3
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.2in; ">
 Difference in population proportions
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.8in; ">
 <span class="math inline">\(p_1 - p_2\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.5in; ">
 Difference in sample proportions
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.6in; ">
 <span class="math inline">\(\widehat{p}_1 - \widehat{p}_2\)</span>
 </td>
 </tr>
@@ -1576,16 +1592,16 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 <td style="text-align:right;width: 0.5in; ">
 4
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.2in; ">
 Difference in population means
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.8in; ">
 <span class="math inline">\(\mu_1 - \mu_2\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.5in; ">
 Difference in sample means
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.6in; ">
 <span class="math inline">\(\overline{x}_1 - \overline{x}_2\)</span>
 </td>
 </tr>
@@ -1593,36 +1609,19 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 <td style="text-align:right;width: 0.5in; ">
 5
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.2in; ">
 Population regression slope
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.8in; ">
 <span class="math inline">\(\beta_1\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.5in; ">
 Fitted regression slope
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.6in; ">
 <span class="math inline">\(b_1\)</span> or <span class="math inline">\(\widehat{\beta}_1\)</span>
 </td>
 </tr>
-<tr>
-<td style="text-align:right;width: 0.5in; ">
-6
-</td>
-<td style="text-align:left;width: 0.7in; ">
-Population regression intercept
-</td>
-<td style="text-align:left;width: 1in; ">
-<span class="math inline">\(\beta_0\)</span>
-</td>
-<td style="text-align:left;width: 1.1in; ">
-Fitted regression intercept
-</td>
-<td style="text-align:left;width: 1in; ">
-<span class="math inline">\(b_0\)</span> or <span class="math inline">\(\widehat{\beta}_0\)</span>
-</td>
-</tr>
 </tbody>
 </table>
 <p>In the rest of this book, we’ll cover all the remaining scenarios as follows:</p>
@@ -1636,20 +1635,19 @@ <h3><span class="header-section-number">7.5.1</span> Sampling scenarios</h3>
 <ul>
 <li>Scenario 4: The difference <span class="math inline">\(\mu_1 - \mu_2\)</span> in mean IMDb ratings for action and romance movies. This is another example of <em>two-sample</em> inference.</li>
 </ul></li>
-<li>In Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>, we’ll cover an example of statistical inference for regression by revisiting the regression models for teaching score as a function of various instructor demographic variables you saw in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>. Specifically
+<li>In Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>, we’ll cover an example of statistical inference for regression by revisiting the regression models for teaching score as a function of various instructor demographic variables you saw in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a>.
 <ul>
-<li>Scenario 5: The intercept <span class="math inline">\(\beta_0\)</span> of the population regression line.</li>
-<li>Scenario 6: The slope <span class="math inline">\(\beta_1\)</span> of the population regression line.</li>
+<li>Scenario 5: The slope <span class="math inline">\(\beta_1\)</span> of the population regression line.</li>
 </ul></li>
 </ul>
 </div>
 <div id="sampling-conclusion-central-limit-theorem" class="section level3">
 <h3><span class="header-section-number">7.5.2</span> Central Limit Theorem</h3>
-<p>What you visualized in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> and summarized in Table <a href="7-sampling.html#tab:comparing-n">7.1</a> was a demonstration of a very famous theorem, or mathematically proven truth, called the  <em>Central Limit Theorem</em>. It loosely states that when sample means are based on larger and larger sample sizes, the sampling distribution of these sample means both more and more normally shaped and more and more narrow.</p>
+<p>What you visualized in Figures <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a> and <a href="7-sampling.html#fig:comparing-sampling-distributions-2">7.14</a> and summarized in Tables <a href="7-sampling.html#tab:comparing-n">7.1</a> and <a href="7-sampling.html#tab:comparing-n-2">7.3</a> was a demonstration of a famous theorem, or mathematically proven truth, called the <em>Central Limit Theorem</em>. It loosely states that when sample means are based on larger and larger sample sizes, the sampling distribution of these sample means becomes both more and more normally shaped and narrower and narrower.</p>
 <p>In other words, their sampling distribution increasingly follows a <em>normal distribution</em> and the variation of these sampling distributions gets smaller, as quantified by their standard errors.</p>
-<p>Shuyi Chiou, Casey Dunn, and Pathikrit Bhattacharyya created a 3 minute and 38 second video at <a href="https://youtu.be/jvoxEYmQHNM" class="uri">https://youtu.be/jvoxEYmQHNM</a> explaining this crucial statistical theorem using the average weight of wild bunny rabbits and the average wingspan of dragons as examples. Figure <a href="7-sampling.html#fig:CLT-video-preview">7.17</a> shows a preview of this video.</p>
+<p>Shuyi Chiou, Casey Dunn, and Pathikrit Bhattacharyya created a 3-minute and 38-second video at <a href="https://youtu.be/jvoxEYmQHNM" class="uri">https://youtu.be/jvoxEYmQHNM</a> explaining this crucial statistical theorem using the average weight of wild bunny rabbits and the average wingspan of dragons as examples. Figure <a href="7-sampling.html#fig:CLT-video-preview">7.17</a> shows a preview of this video.</p>
 <div class="figure" style="text-align: center"><span id="fig:CLT-video-preview"></span>
-<img src="images/copyright/CLT_video_preview.png" alt="Preview of Central Limit Theorem video." width="80%" />
+<img src="images/copyright/CLT_video_preview.png" alt="Preview of Central Limit Theorem video." width="75%" />
 <p class="caption">
 FIGURE 7.17: Preview of Central Limit Theorem video.
 </p>
@@ -1661,11 +1659,11 @@ <h3><span class="header-section-number">7.5.3</span> Additional resources</h3>
 </div>
 <div id="whats-to-come-6" class="section level3">
 <h3><span class="header-section-number">7.5.4</span> What’s to come?</h3>
-<p>Recall in our Obama poll case study in Section <a href="7-sampling.html#sampling-case-study">7.4</a> that based on this particular sample, the Harvard University Institute of Politics’ best guess of the U.S. President Obama’s approval rating among all young Americans was 41%. However, this isn’t the end of the story. If you read the article further, it states:</p>
+<p>Recall from our Obama poll case study in Section <a href="7-sampling.html#sampling-case-study">7.4</a> that, based on this particular sample, the best guess by the Kennedy School’s Institute of Politics at Harvard University of U.S. President Obama’s approval rating among all young Americans was 41%. However, this isn’t the end of the story. If you read the article further, it states:</p>
 <blockquote>
 <p>The online survey of 2,089 adults was conducted from Oct. 30 to Nov. 11, just weeks after the federal government shutdown ended and the problems surrounding the implementation of the Affordable Care Act began to take center stage. The poll’s margin of error was plus or minus 2.1 percentage points.</p>
 </blockquote>
-<p>Note the term <em>margin of error</em>, which here is plus or minus 2.1 percentage points. Most polls won’t produce an estimate that’s perfectly right; there will always be a certain amount of error caused by <em>sampling variation</em>. The margin of error of plus or minus 2.1 percentage points is saying that a typical range of errors for polls of this type is about <span class="math inline">\(\pm\)</span> 2.1%, in words from about 2.1% too small to about 2.1% too big. We can restate this as interval of [41% - 2.1%, 41% + 2.1%] = [37.9%, 43.1%] (this notation indicates the interval contains all values 37.9% and 43.1% inclusively). We’ll see in the next chapter that such intervals are known as <em>confidence intervals</em>.</p>
+<p>Note the term <em>margin of error</em>, which here is “plus or minus 2.1 percentage points.” Most polls won’t produce an estimate that’s perfectly right; there will always be a certain amount of error caused by <em>sampling variation</em>. The margin of error of plus or minus 2.1 percentage points says that a typical range of errors for polls of this type is about <span class="math inline">\(\pm\)</span> 2.1 percentage points; in words, from about 2.1 percentage points too small to about 2.1 percentage points too big. We can restate this as the interval <span class="math inline">\([41\% - 2.1\%, 41\% + 2.1\%] = [38.9\%, 43.1\%]\)</span> (this notation indicates the interval contains all values between 38.9% and 43.1%, including the end points of 38.9% and 43.1%). We’ll see in the next chapter that such intervals are known as <em>confidence intervals</em>.</p>
 
 </div>
 </div>
@@ -1681,11 +1679,13 @@ <h3><span class="header-section-number">7.5.4</span> What’s to come?</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1693,12 +1693,11 @@ <h3><span class="header-section-number">7.5.4</span> What’s to come?</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1713,6 +1712,10 @@ <h3><span class="header-section-number">7.5.4</span> What’s to come?</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1729,8 +1732,9 @@ <h3><span class="header-section-number">7.5.4</span> What’s to come?</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/8-confidence-intervals.html b/docs/8-confidence-intervals.html
index 4ae786320..5eabbe714 100644
--- a/docs/8-confidence-intervals.html
+++ b/docs/8-confidence-intervals.html
@@ -4,35 +4,35 @@
 
   <meta charset="utf-8" />
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
-  <title>Chapter 8 Bootstrapping &amp; Confidence Intervals | Statistical Inference via Data Science</title>
+  <title>Chapter 8 Bootstrapping and Confidence Intervals | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
-  <meta property="og:title" content="Chapter 8 Bootstrapping &amp; Confidence Intervals | Statistical Inference via Data Science" />
+  <meta property="og:title" content="Chapter 8 Bootstrapping and Confidence Intervals | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
-  <meta name="twitter:title" content="Chapter 8 Bootstrapping &amp; Confidence Intervals | Statistical Inference via Data Science" />
+  <meta name="twitter:title" content="Chapter 8 Bootstrapping and Confidence Intervals | Statistical Inference via Data Science" />
   <meta name="twitter:site" content="@ModernDive" />
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="7-sampling.html">
-<link rel="next" href="9-hypothesis-testing.html">
+<link rel="prev" href="7-sampling.html"/>
+<link rel="next" href="9-hypothesis-testing.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -569,15 +582,15 @@ <h1>
 <img src='https://moderndive.com/wide_format.png' alt="ModernDive">
 </html>
 <div id="confidence-intervals" class="section level1">
-<h1><span class="header-section-number">Chapter 8</span> Bootstrapping &amp; Confidence Intervals</h1>
-<p>In Chapter <a href="7-sampling.html#sampling">7</a>, we studied sampling. We started with a “tactile” exercise where we wanted to know the proportion of balls in the sampling bowl in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a> that are red. While we could have performed an exhaustive count, this would have been a tedious process. So instead we used a shovel to extract a sample of 50 balls and used the resulting proportion that were red as an <em>estimate</em>. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness induced by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red.</p>
-<p>We then mimicked this “tactile” sampling exercise with an equivalent “virtual” sampling exercise performed on the computer. Using our computers’ random number generator, we very quickly mimicked the above sampling procedure a large number of times. In Section <a href="7-sampling.html#different-shovels">7.2.4</a>, we quickly repeated this sampling procedure 1000 times, using three different “virtual” shovels with 25, 50, and 100 slots. We visualized these three sets of 1000 estimates in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a> and saw that as the sample size increased, the variation in the estimates decreased.</p>
-<p>What we did was construct <em>sampling distributions</em>. The motivation for taking 1000 repeated samples and visualizing the resulting estimates was to study how these estimates varied from one sample to another; in other words we wanted to study the effect of <em>sampling variation</em>. We quantified the variation of these estimates using their standard deviation, which has a special name: the <em>standard error</em>. In particular, we saw that as the sample size increased from 25 to 50 to 100, the standard error decreased and thus the sampling distributions narrowed. In other words, larger sample sizes lead to more <em>precise</em> estimates that varied less around the center.</p>
-<p>We then tied these sampling exercises to terminology and mathematical notation related to sampling in Section <a href="7-sampling.html#terminology-and-notation">7.3.1</a>. Our <em>study population</em> was the large bowl with <span class="math inline">\(N\)</span> = 2400 balls, while the <em>population parameter</em>, the unknown quantity of interest, here was the population proportion <span class="math inline">\(p\)</span> of the bowl’s balls that are red. Since performing a <em>census</em> would be very expensive in terms of time and energy, we instead extracted a <em>sample</em> of size <span class="math inline">\(n\)</span> = 50. The <em>point estimate</em>, also known as a <em>sample statistic</em>, used to estimate <span class="math inline">\(p\)</span> was the sample proportion <span class="math inline">\(\widehat{p}\)</span> of these 50 sampled balls that were red. Furthermore, since the sample was obtained at <em>random</em>, it can be considered as <em>unbiased</em> and <em>representative</em> of the population. Thus any results based on the sample could be <em>generalized</em> to the population. Thus, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that are red. In other words, we used the sample to <em>infer</em> about the population.</p>
-<p>However, as described in Section <a href="7-sampling.html#sampling-simulation">7.2</a>, both the tactile and virtual sampling exercises are not what one would do in real life; this was merely an activity used to study the effects of sampling variation. In a real life situation, we would not take 1000 samples of size <span class="math inline">\(n\)</span>, but rather take a <em>single</em> representative sample that’s as large as possible. Additionally, we knew what the true proportion of the bowl’s balls that were red was 37.5%. In a real life situation, we will not know what this value is. Because if we did, then why would we take a sample to estimate it?</p>
-<p>An example of a realistic sampling situation would be a poll, like the <a href="https://www.npr.org/sections/itsallpolitics/2013/12/04/248793753/poll-support-for-obama-among-young-americans-eroding">Obama poll</a> you saw in Section <a href="7-sampling.html#sampling-case-study">7.4</a>. Pollsters did not know the true proportion of <em>all</em> young Americans who supported President Obama, and thus they took a single sample of size <span class="math inline">\(n\)</span> = 2089 young Americans to estimate this value.</p>
+<h1><span class="header-section-number">Chapter 8</span> Bootstrapping and Confidence Intervals</h1>
+<p>In Chapter <a href="7-sampling.html#sampling">7</a>, we studied sampling. We started with a “tactile” exercise where we wanted to know the proportion of balls in the sampling bowl in Figure <a href="7-sampling.html#fig:sampling-exercise-1">7.1</a> that are red. While we could have performed an exhaustive count, this would have been a tedious process. So instead, we used a shovel to extract a sample of 50 balls and used the resulting proportion that were red as an <em>estimate</em>. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red.</p>
+<p>We then mimicked this “tactile” sampling exercise with an equivalent “virtual” sampling exercise performed on the computer. Using our computer’s random number generator, we quickly mimicked the above sampling procedure a large number of times. In Subsection <a href="7-sampling.html#different-shovels">7.2.4</a>, we quickly repeated this sampling procedure 1000 times, using three different “virtual” shovels with 25, 50, and 100 slots. We visualized these three sets of 1000 estimates in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a> and saw that as the sample size increased, the variation in the estimates decreased.</p>
+<p>In doing so, we constructed <em>sampling distributions</em>. The motivation for taking 1000 repeated samples and visualizing the resulting estimates was to study how these estimates varied from one sample to another; in other words, we wanted to study the effect of <em>sampling variation</em>. We quantified the variation of these estimates using their standard deviation, which has a special name: the <em>standard error</em>. In particular, we saw that as the sample size increased from 25 to 50 to 100, the standard error decreased and thus the sampling distributions narrowed. Larger sample sizes led to more <em>precise</em> estimates that varied less around the center.</p>
+<p>We then tied these sampling exercises to terminology and mathematical notation related to sampling in Subsection <a href="7-sampling.html#terminology-and-notation">7.3.1</a>. Our <em>study population</em> was the large bowl with <span class="math inline">\(N\)</span> = 2400 balls, while the <em>population parameter</em>, the unknown quantity of interest, was the population proportion <span class="math inline">\(p\)</span> of the bowl’s balls that were red. Since performing a <em>census</em> would be expensive in terms of time and energy, we instead extracted a <em>sample</em> of size <span class="math inline">\(n\)</span> = 50. The <em>point estimate</em>, also known as a <em>sample statistic</em>, used to estimate <span class="math inline">\(p\)</span> was the sample proportion <span class="math inline">\(\widehat{p}\)</span> of these 50 sampled balls that were red. Furthermore, since the sample was obtained at <em>random</em>, it can be considered <em>unbiased</em> and <em>representative</em> of the population. Thus, any results based on the sample could be <em>generalized</em> to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that were red. In other words, we used the sample to <em>infer</em> about the population.</p>
+<p>However, as described in Section <a href="7-sampling.html#sampling-simulation">7.2</a>, neither the tactile nor the virtual sampling exercise is what one would do in real life; they were merely activities used to study the effects of sampling variation. In a real-life situation, we would not take 1000 samples of size <span class="math inline">\(n\)</span>, but rather take a <em>single</em> representative sample that’s as large as possible. Additionally, we knew that the true proportion of the bowl’s balls that were red was 37.5%. In a real-life situation, we would not know this value; if we did, why would we take a sample to estimate it?</p>
+<p>An example of a realistic sampling situation would be a poll, like the <a href="https://www.npr.org/sections/itsallpolitics/2013/12/04/248793753/poll-support-for-obama-among-young-americans-eroding">Obama poll</a> you saw in Section <a href="7-sampling.html#sampling-case-study">7.4</a>. Pollsters did not know the true proportion of <em>all</em> young Americans who supported President Obama in 2013, and thus they took a single sample of size <span class="math inline">\(n\)</span> = 2089 young Americans to estimate this value.</p>
 <p>So how does one quantify the effects of sampling variation when you only have a <em>single sample</em> to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is <em>bootstrapping resampling</em>, which will be the focus of the earlier sections of this chapter.</p>
-<p>Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a <em>range of highly plausible</em> values? Going back to the Obama poll article, it stated that the pollsters’ estimate of the proportion of all young Americans who supported President Obama was 41%. But in addition it stated that the poll’s “margin of error was plus or minus 2.1 percentage points.” In other words, this “plausible range” was [41% - 2.1%, 41% + 2.1%] = [37.9%, 43.1%]. This range of plausible values is what’s known as a <em>confidence interval</em>, which will be the focus of the later sections of this chapter.</p>
+<p>Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a <em>range of highly plausible</em> values? Going back to the Obama poll article, it stated that the pollsters’ estimate of the proportion of all young Americans who supported President Obama was 41%. In addition, it stated that the poll’s “margin of error was plus or minus 2.1 percentage points.” This “plausible range” was [41% - 2.1%, 41% + 2.1%] = [38.9%, 43.1%]. This range of plausible values is what’s known as a <em>confidence interval</em>, which will be the focus of the later sections of this chapter.</p>
 <!--
 Create graphic illustrating two-step process of 1) construct bootstrap distribution
 and then 2) based on bootstrap dist'n create CI?
@@ -588,28 +601,28 @@ <h3>Needed packages</h3>
 <ul>
 <li><code>ggplot2</code> for data visualization</li>
 <li><code>dplyr</code> for data wrangling</li>
-<li><code>tidyr</code> for converting data to “tidy” format</li>
+<li><code>tidyr</code> for converting data to tidy format</li>
 <li><code>readr</code> for importing spreadsheet data into R</li>
 <li>As well as the more advanced <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages</li>
 </ul>
 <p>If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(moderndive)
-<span class="kw">library</span>(infer)</code></pre>
+<div class="sourceCode" id="cb238"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb238-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb238-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb238-3" data-line-number="3"><span class="kw">library</span>(infer)</a></code></pre></div>
 </div>
 <div id="resampling-tactile" class="section level2">
 <h2><span class="header-section-number">8.1</span> Pennies activity</h2>
 <p>As we did in Chapter <a href="7-sampling.html#sampling">7</a>, we’ll begin with a hands-on tactile activity.</p>
 <div id="what-is-the-average-year-on-us-pennies-in-2019" class="section level3">
 <h3><span class="header-section-number">8.1.1</span> What is the average year on US pennies in 2019?</h3>
-<p>Try to imagine all the pennies being used in the United States in 2019. That’s a lot of pennies! Now say we’re interested in the average year of minting of <em>all</em> these pennies. One way to compute this value would be to gather up all pennies being used in the US, record the year, and compute the average. However, this would be near impossible! So instead, let’s collect a <em>sample</em> of 50 pennies collected from a local bank in downtown Northampton, Massachusetts, USA as seen in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-a">8.1</a>.</p>
+<p>Try to imagine all the pennies being used in the United States in 2019. That’s a lot of pennies! Now say we’re interested in the average year of minting of <em>all</em> these pennies. One way to compute this value would be to gather up all pennies being used in the US, record the year, and compute the average. However, this would be near impossible! So instead, let’s collect a <em>sample</em> of 50 pennies from a local bank in downtown Northampton, Massachusetts, USA as seen in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-a">8.1</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:resampling-exercise-a"></span>
 <img src="images/sampling/pennies/bank.jpg" alt="Collecting a sample of 50 US pennies from a local bank." width="40%" /><img src="images/sampling/pennies/roll.jpg" alt="Collecting a sample of 50 US pennies from a local bank." width="40%" />
 <p class="caption">
 FIGURE 8.1: Collecting a sample of 50 US pennies from a local bank.
 </p>
 </div>
-<p>An image of these 50 pennies can be seen in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a>. For each the 50 pennies starting in the top left, progressing row-by-row, and ending in the bottom right, we assigned an “ID” identification variable and marked the year of minting.</p>
+<p>An image of these 50 pennies can be seen in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a>. For each of the 50 pennies starting in the top left, progressing row-by-row, and ending in the bottom right, we assigned an “ID” identification variable and marked the year of minting.</p>
 <div class="figure" style="text-align: center"><span id="fig:resampling-exercise-c"></span>
 <img src="images/sampling/pennies/deliverable/3.jpg" alt="50 US pennies labelled." width="100%" />
 <p class="caption">
@@ -617,7 +630,7 @@ <h3><span class="header-section-number">8.1.1</span> What is the average year on
 </p>
 </div>
 <p>The <code>moderndive</code>  package contains this data on our 50 sampled pennies in the <code>pennies_sample</code> data frame:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample</code></pre>
+<div class="sourceCode" id="cb239"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb239-1" data-line-number="1">pennies_sample</a></code></pre></div>
 <pre><code># A tibble: 50 x 2
       ID  year
    &lt;int&gt; &lt;dbl&gt;
@@ -632,27 +645,27 @@ <h3><span class="header-section-number">8.1.1</span> What is the average year on
  9     9  2004
 10    10  2000
 # … with 40 more rows</code></pre>
-<p>The <code>pennies_sample</code> data frame has 50 rows corresponding to each penny with two variables. The first variable <code>ID</code> corresponds to the ID labels in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a> whereas the second variable <code>year</code> corresponds to the year of minting saved as an integer, in other words a whole number.</p>
+<p>The <code>pennies_sample</code> data frame has 50 rows, one for each penny, and two variables. The first variable <code>ID</code> corresponds to the ID labels in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a>, whereas the second variable <code>year</code> corresponds to the year of minting saved as a numeric variable, also known as a double (<code>dbl</code>).</p>
 <p>Based on these 50 sampled pennies, what can we say about <em>all</em> US pennies in 2019? Let’s study some properties of our sample by performing an exploratory data analysis. Let’s first visualize the distribution of the year of these 50 pennies using our data visualization tools from Chapter <a href="2-viz.html#viz">2</a>. Since <code>year</code> is a numerical variable, we use a histogram in Figure <a href="8-confidence-intervals.html#fig:pennies-sample-histogram">8.3</a> to visualize its distribution.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(pennies_sample, <span class="kw">aes</span>(<span class="dt">x =</span> year)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb241"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb241-1" data-line-number="1"><span class="kw">ggplot</span>(pennies_sample, <span class="kw">aes</span>(<span class="dt">x =</span> year)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb241-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:pennies-sample-histogram"></span>
-<img src="moderndive_files/figure-html/pennies-sample-histogram-1.png" alt="Distribution of year on 50 US pennies." width="\textwidth" />
+<img src="ModernDive_files/figure-html/pennies-sample-histogram-1.png" alt="Distribution of year on 50 US pennies." width="\textwidth" />
 <p class="caption">
 FIGURE 8.3: Distribution of year on 50 US pennies.
 </p>
 </div>
-<p>Observe a slightly left-skewed  distribution, since most pennies fall in between 1980 and 2010 with only a few pennies older than 1970. What is the average year for the 50 sampled pennies? Eyeballing the histogram it appears to be around 1990. Let’s now compute this value exactly using our data wrangling tools from Chapter <a href="3-wrangling.html#wrangling">3</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</code></pre>
+<p>Observe a slightly left-skewed distribution, since most pennies fall between 1980 and 2010 with only a few pennies older than 1970. What is the average year for the 50 sampled pennies? Eyeballing the histogram, it appears to be around 1990. Let’s now compute this value exactly using our data wrangling tools from Chapter <a href="3-wrangling.html#wrangling">3</a>.</p>
+<div class="sourceCode" id="cb242"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb242-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb242-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   mean_year
       &lt;dbl&gt;
 1   1995.44</code></pre>
 <p>Thus, if we’re willing to assume that <code>pennies_sample</code> is a representative sample from <em>all</em> US pennies, a “good guess” of the average year of minting of all US pennies would be 1995.44. In other words, around 1995. This should all start sounding similar to what we did previously in Chapter <a href="7-sampling.html#sampling">7</a>!</p>
-<p>In Chapter <a href="7-sampling.html#sampling">7</a>, our <em>study population</em> was the bowl of <span class="math inline">\(N\)</span> = 2400 balls. Our <em>population parameter</em> was the <em>population proportion</em> of these balls that were red, denoted mathematically by <span class="math inline">\(p\)</span>. In order to estimate <span class="math inline">\(p\)</span>, we extracted a sample of 50 balls using the shovel. We then computed the relevant <em>point estimate</em>: the <em>sample proportion</em> of these 50 balls that were red, denoted mathematically by <span class="math inline">\(\widehat{p}\)</span>.</p>
-<p>Here our population is <span class="math inline">\(N\)</span> = whatever the number of pennies are being used in the US, a value which we don’t know and probably never will. The population parameter of interest is now the <em>population mean</em> year of all these pennies, a value denoted mathematically by the Greek letter <span class="math inline">\(\mu\)</span> (pronounced “mu”). In order to estimate <span class="math inline">\(\mu\)</span>, we went to the bank and obtained a sample of 50 pennies and computed the relevant point estimate: the <em>sample mean</em> year of these 50 pennies, denoted mathematically by <span class="math inline">\(\overline{x}\)</span> (pronounced “x-bar”). An alternative and more intuitive notation for the sample mean is <span class="math inline">\(\widehat{\mu}\)</span>. However this is unfortunately not as commonly used, so in this book we’ll stick with convention and always denote the sample mean as <span class="math inline">\(\overline{x}\)</span>.</p>
-<p>We summarize the correspondence between the sampling bowl exercise in Chapter <a href="7-sampling.html#sampling">7</a> and our pennies exercise in Table <a href="8-confidence-intervals.html#tab:table-ch8-b">8.1</a>, which are the first two rows of the previously seen Table <a href="7-sampling.html#tab:table-ch8">7.5</a> of the various sampling scenarios we’ll cover in this text.</p>
+<p>In Chapter <a href="7-sampling.html#sampling">7</a>, our <em>study population</em> was the bowl of <span class="math inline">\(N\)</span> = 2400 balls. Our <em>population parameter</em> was the <em>population proportion</em> of these balls that were red, denoted by <span class="math inline">\(p\)</span>. In order to estimate <span class="math inline">\(p\)</span>, we extracted a sample of 50 balls using the shovel. We then computed the relevant <em>point estimate</em>: the <em>sample proportion</em> of these 50 balls that were red, denoted mathematically by <span class="math inline">\(\widehat{p}\)</span>.</p>
+<p>Here our population is all <span class="math inline">\(N\)</span> pennies being used in the US, where <span class="math inline">\(N\)</span> is a value which we don’t know and probably never will. The population parameter of interest is now the <em>population mean</em> year of all these pennies, a value denoted mathematically by the Greek letter <span class="math inline">\(\mu\)</span> (pronounced “mu”). In order to estimate <span class="math inline">\(\mu\)</span>, we went to the bank and obtained a sample of 50 pennies and computed the relevant point estimate: the <em>sample mean</em> year of these 50 pennies, denoted mathematically by <span class="math inline">\(\overline{x}\)</span> (pronounced “x-bar”). An alternative and more intuitive notation for the sample mean is <span class="math inline">\(\widehat{\mu}\)</span>. However, this is unfortunately not as commonly used, so in this book we’ll stick with convention and always denote the sample mean as <span class="math inline">\(\overline{x}\)</span>.</p>
+<p>We summarize the correspondence between the sampling bowl exercise in Chapter <a href="7-sampling.html#sampling">7</a> and our pennies exercise in Table <a href="8-confidence-intervals.html#tab:table-ch8-b">8.1</a>, which consists of the first two rows of the previously seen Table <a href="7-sampling.html#tab:table-ch8">7.5</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:table-ch8-b">TABLE 8.1: </span>Scenarios of sampling for inference
@@ -672,7 +685,7 @@ <h3><span class="header-section-number">8.1.1</span> What is the average year on
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -714,28 +727,29 @@ <h3><span class="header-section-number">8.1.1</span> What is the average year on
 </tbody>
 </table>
 <p>Going back to our 50 sampled pennies in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a>, the point estimate of interest is the sample mean <span class="math inline">\(\overline{x}\)</span> of 1995.44. This quantity is an <em>estimate</em> of the population mean year of <em>all</em> US pennies <span class="math inline">\(\mu\)</span>.</p>
-<p>Recall that we also saw in Chapter <a href="7-sampling.html#sampling">7</a> that such estimates are prone to <em>sampling variation</em>. For example, in this particular sample in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a>, we observed three pennies with the year of 1999. If we sampled another 50 pennies, would we observe exactly three pennies with the year of 1999 again? More than likely not. We might observe none, or one, or two, or maybe even all 50! The same can be said for the other 26 unique years that are represented in our sample of 50 pennies.</p>
-<p>To study the effects of <em>sampling variation</em> in Chapter <a href="7-sampling.html#sampling">7</a> we took many samples, something we could easily do with our shovel. In our case with pennies however, how would we obtain another sample? By going to the bank and getting another roll of 50 pennies. Say we’re feeling lazy however and don’t want to go back to the bank. How can we study the effects of sampling variation using our <em>single sample</em>. We will do so using a technique known as “bootstrap resampling with replacement,” which we now illustrate.</p>
+<p>Recall that we also saw in Chapter <a href="7-sampling.html#sampling">7</a> that such estimates are prone to <em>sampling variation</em>. For example, in this particular sample in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-c">8.2</a>, we observed three pennies with the year 1999. If we sampled another 50 pennies, would we observe exactly three pennies with the year 1999 again? More than likely not. We might observe none, one, two, or maybe even all 50! The same can be said for the other 26 unique years that are represented in our sample of 50 pennies.</p>
+<p>To study the effects of <em>sampling variation</em> in Chapter <a href="7-sampling.html#sampling">7</a>, we took many samples, something we could easily do with our shovel. In our case with pennies, however, how would we obtain another sample? By going to the bank and getting another roll of 50 pennies.</p>
+<p>Say we’re feeling lazy, however, and don’t want to go back to the bank. How can we study the effects of sampling variation using our <em>single sample</em>? We will do so using a technique known as <em>bootstrap resampling with replacement</em>, which we now illustrate.</p>
 </div>
 <div id="resampling-once" class="section level3">
 <h3><span class="header-section-number">8.1.2</span> Resampling once</h3>
-<p><strong>Step 1</strong>: Let’s print out identically-sized slips of paper representing our 50 pennies as seen in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-1">8.4</a>.</p>
+<p><strong>Step 1</strong>: Let’s print out identically sized slips of paper representing our 50 pennies as seen in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-1">8.4</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-resampling-1"></span>
-<img src="images/sampling/pennies/tactile_simulation/1_paper_slips.png" alt="Step 1: 50 slips of paper representing 50 US pennies." width="50%" />
+<img src="images/sampling/pennies/tactile_simulation/1_paper_slips.png" alt="Step 1: 50 slips of paper representing 50 US pennies." width="100%" />
 <p class="caption">
 FIGURE 8.4: Step 1: 50 slips of paper representing 50 US pennies.
 </p>
 </div>
 <p><strong>Step 2</strong>: Put the 50 slips of paper into a hat or tuque as seen in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-2">8.5</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-resampling-2"></span>
-<img src="images/sampling/pennies/tactile_simulation/2_insert_in_hat.png" alt="Step 2: Putting 50 slips of paper in a hat." width="50%" />
+<img src="images/sampling/pennies/tactile_simulation/2_insert_in_hat.png" alt="Step 2: Putting 50 slips of paper in a hat." width="60%" />
 <p class="caption">
 FIGURE 8.5: Step 2: Putting 50 slips of paper in a hat.
 </p>
 </div>
 <p><strong>Step 3</strong>: Mix the hat’s contents and draw one slip of paper at random as seen in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-3">8.6</a>. Record the year.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-resampling-3"></span>
-<img src="images/sampling/pennies/tactile_simulation/3_draw_at_random.png" alt="Step 3: Drawing one slip of paper at random." width="50%" />
+<img src="images/sampling/pennies/tactile_simulation/3_draw_at_random.png" alt="Step 3: Drawing one slip of paper at random." width="60%" />
 <p class="caption">
 FIGURE 8.6: Step 3: Drawing one slip of paper at random.
 </p>
@@ -747,21 +761,18 @@ <h3><span class="header-section-number">8.1.2</span> Resampling once</h3>
 FIGURE 8.7: Step 4: Replacing slip of paper.
 </p>
 </div>
-<p><strong>Step 5</strong>: Repeat Steps 3 and 4 49 more times, resulting in 50 recorded years.</p>
+<p><strong>Step 5</strong>: Repeat Steps 3 and 4 a total of 49 more times, resulting in 50 recorded years.</p>
 <p>What we just performed was a <em>resampling</em>  of the original sample of 50 pennies. We are not sampling 50 pennies from the population of all US pennies as we did in our trip to the bank. Instead, we are mimicking this act by resampling 50 pennies from our original sample of 50 pennies.</p>
 <p>Now ask yourselves, why did we replace our resampled slip of paper back into the hat in Step 4? Because if we left the slip of paper out of the hat each time we performed Step 4, we would end up with the same 50 original pennies! In other words, replacing the slips of paper induces <em>sampling variation</em>.</p>
 <p>Being more precise with our terminology, we just performed a <em>resampling with replacement</em> from the original sample of 50 pennies. Had we left the slip of paper out of the hat each time we performed Step 4, this would be <em>resampling without replacement</em>.</p>
 <p>Let’s study our 50 resampled pennies via an exploratory data analysis. First, let’s load the data into R by manually creating a data frame <code>pennies_resample</code> of our 50 resampled values. We’ll do this using the <code>tibble()</code> command from the <code>dplyr</code> package. Note that the 50 values you resample will almost certainly not be the same as ours given the inherent randomness.</p>
-<!--
-TODO: Add this data frame to moderndive package.
--->
-<pre class="sourceCode r"><code class="sourceCode r">pennies_resample &lt;-<span class="st"> </span><span class="kw">tibble</span>(
-  <span class="dt">year =</span> <span class="kw">c</span>(<span class="dv">1976</span>, <span class="dv">1962</span>, <span class="dv">1976</span>, <span class="dv">1983</span>, <span class="dv">2017</span>, <span class="dv">2015</span>, <span class="dv">2015</span>, <span class="dv">1962</span>, <span class="dv">2016</span>, <span class="dv">1976</span>, 
-           <span class="dv">2006</span>, <span class="dv">1997</span>, <span class="dv">1988</span>, <span class="dv">2015</span>, <span class="dv">2015</span>, <span class="dv">1988</span>, <span class="dv">2016</span>, <span class="dv">1978</span>, <span class="dv">1979</span>, <span class="dv">1997</span>, 
-           <span class="dv">1974</span>, <span class="dv">2013</span>, <span class="dv">1978</span>, <span class="dv">2015</span>, <span class="dv">2008</span>, <span class="dv">1982</span>, <span class="dv">1986</span>, <span class="dv">1979</span>, <span class="dv">1981</span>, <span class="dv">2004</span>, 
-           <span class="dv">2000</span>, <span class="dv">1995</span>, <span class="dv">1999</span>, <span class="dv">2006</span>, <span class="dv">1979</span>, <span class="dv">2015</span>, <span class="dv">1979</span>, <span class="dv">1998</span>, <span class="dv">1981</span>, <span class="dv">2015</span>, 
-           <span class="dv">2000</span>, <span class="dv">1999</span>, <span class="dv">1988</span>, <span class="dv">2017</span>, <span class="dv">1992</span>, <span class="dv">1997</span>, <span class="dv">1990</span>, <span class="dv">1988</span>, <span class="dv">2006</span>, <span class="dv">2000</span>)
-)</code></pre>
+<div class="sourceCode" id="cb244"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb244-1" data-line-number="1">pennies_resample &lt;-<span class="st"> </span><span class="kw">tibble</span>(</a>
+<a class="sourceLine" id="cb244-2" data-line-number="2">  <span class="dt">year =</span> <span class="kw">c</span>(<span class="dv">1976</span>, <span class="dv">1962</span>, <span class="dv">1976</span>, <span class="dv">1983</span>, <span class="dv">2017</span>, <span class="dv">2015</span>, <span class="dv">2015</span>, <span class="dv">1962</span>, <span class="dv">2016</span>, <span class="dv">1976</span>, </a>
+<a class="sourceLine" id="cb244-3" data-line-number="3">           <span class="dv">2006</span>, <span class="dv">1997</span>, <span class="dv">1988</span>, <span class="dv">2015</span>, <span class="dv">2015</span>, <span class="dv">1988</span>, <span class="dv">2016</span>, <span class="dv">1978</span>, <span class="dv">1979</span>, <span class="dv">1997</span>, </a>
+<a class="sourceLine" id="cb244-4" data-line-number="4">           <span class="dv">1974</span>, <span class="dv">2013</span>, <span class="dv">1978</span>, <span class="dv">2015</span>, <span class="dv">2008</span>, <span class="dv">1982</span>, <span class="dv">1986</span>, <span class="dv">1979</span>, <span class="dv">1981</span>, <span class="dv">2004</span>, </a>
+<a class="sourceLine" id="cb244-5" data-line-number="5">           <span class="dv">2000</span>, <span class="dv">1995</span>, <span class="dv">1999</span>, <span class="dv">2006</span>, <span class="dv">1979</span>, <span class="dv">2015</span>, <span class="dv">1979</span>, <span class="dv">1998</span>, <span class="dv">1981</span>, <span class="dv">2015</span>, </a>
+<a class="sourceLine" id="cb244-6" data-line-number="6">           <span class="dv">2000</span>, <span class="dv">1999</span>, <span class="dv">1988</span>, <span class="dv">2017</span>, <span class="dv">1992</span>, <span class="dv">1997</span>, <span class="dv">1990</span>, <span class="dv">1988</span>, <span class="dv">2006</span>, <span class="dv">2000</span>)</a>
+<a class="sourceLine" id="cb244-7" data-line-number="7">)</a></code></pre></div>
 <p>The 50 values of <code>year</code> in <code>pennies_resample</code> represent a resample of size 50 from the original sample of 50 pennies. We display the 50 resampled pennies in Figure <a href="8-confidence-intervals.html#fig:resampling-exercise-d">8.8</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:resampling-exercise-d"></span>
 <img src="images/sampling/pennies/deliverable/4.jpg" alt="50 resampled US pennies labelled." width="100%" />
@@ -770,122 +781,118 @@ <h3><span class="header-section-number">8.1.2</span> Resampling once</h3>
 </p>
 </div>
 <p>Let’s compare the distribution of the numerical variable <code>year</code> of our 50 resampled pennies with the distribution of the numerical variable <code>year</code> of our original sample of 50 pennies in Figure <a href="8-confidence-intervals.html#fig:origandresample">8.9</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(pennies_resample, <span class="kw">aes</span>(<span class="dt">x =</span> year)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">title =</span> <span class="st">&quot;Resample of 50 pennies&quot;</span>)
-<span class="kw">ggplot</span>(pennies_sample, <span class="kw">aes</span>(<span class="dt">x =</span> year)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">title =</span> <span class="st">&quot;Original sample of 50 pennies&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb245"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb245-1" data-line-number="1"><span class="kw">ggplot</span>(pennies_resample, <span class="kw">aes</span>(<span class="dt">x =</span> year)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb245-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb245-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">title =</span> <span class="st">&quot;Resample of 50 pennies&quot;</span>)</a>
+<a class="sourceLine" id="cb245-4" data-line-number="4"><span class="kw">ggplot</span>(pennies_sample, <span class="kw">aes</span>(<span class="dt">x =</span> year)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb245-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">10</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb245-6" data-line-number="6"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">title =</span> <span class="st">&quot;Original sample of 50 pennies&quot;</span>)</a></code></pre></div>
 
 <div class="figure" style="text-align: center"><span id="fig:origandresample"></span>
-<img src="moderndive_files/figure-html/origandresample-1.png" alt="Comparing year in the resampled pennies_resample with the original sample pennies_sample." width="\textwidth" />
+<img src="ModernDive_files/figure-html/origandresample-1.png" alt="Comparing year in the resampled pennies_resample with the original sample pennies_sample." width="\textwidth" />
 <p class="caption">
 FIGURE 8.9: Comparing <code>year</code> in the resampled <code>pennies_resample</code> with the original sample <code>pennies_sample</code>.
 </p>
 </div>
-<p>Observe in Figure <a href="8-confidence-intervals.html#fig:origandresample">8.9</a> that while the general shapes of both distributions of <code>year</code> is roughly similar, they are not identical.</p>
+<p>Observe in Figure <a href="8-confidence-intervals.html#fig:origandresample">8.9</a> that while the general shapes of both distributions of <code>year</code> are roughly similar, they are not identical.</p>
 <p>Recall from the previous section that the sample mean of the original sample of 50 pennies from the bank was 1995.44. What about for our resample? Any guesses? Let’s have <code>dplyr</code> help us out as before:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_resample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</code></pre>
+<div class="sourceCode" id="cb246"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb246-1" data-line-number="1">pennies_resample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb246-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   mean_year
       &lt;dbl&gt;
-1   1994.82</code></pre>
-<p>We obtained a different mean year of 1994.82. This variation is induced by resampling <em>with replacement</em> we performed earlier.</p>
-<p>What if we repeated this resampling exercise many times? Would we obtain the same mean <code>year</code> each time? In other words, would our guess at the mean year of all pennies in the US in 2019 be exactly 1994.82 every time? Just as we did in Chapter <a href="7-sampling.html#sampling">7</a>, let’s perform this resampling activity with the help of 35 of our friends.</p>
+1      1996</code></pre>
+<p>We obtained a different mean year of 1996. This variation is induced by the resampling <em>with replacement</em> we performed earlier.</p>
+<p>What if we repeated this resampling exercise many times? Would we obtain the same mean <code>year</code> each time? In other words, would our guess at the mean year of all pennies in the US in 2019 be exactly 1996 every time? Just as we did in Chapter <a href="7-sampling.html#sampling">7</a>, let’s perform this resampling activity with the help of some of our friends: 35 friends in total.</p>
 </div>
 <div id="student-resamples" class="section level3">
 <h3><span class="header-section-number">8.1.3</span> Resampling 35 times</h3>
-<p>Each of our 35 friends will repeat the same 5 steps:</p>
+<p>Each of our 35 friends will repeat the same five steps:</p>
 <ol style="list-style-type: decimal">
-<li>Start with 50 identically-sized slips of paper representing the 50 pennies.</li>
+<li>Start with 50 identically sized slips of paper representing the 50 pennies.</li>
 <li>Put the 50 small pieces of paper into a hat or beanie cap.</li>
 <li>Mix the hat’s contents and draw one slip of paper at random. Record the year in a spreadsheet.</li>
 <li>Replace the slip of paper back in the hat!</li>
-<li>Repeat Steps 3 and 4 49 more times, resulting in 50 recorded years.</li>
+<li>Repeat Steps 3 and 4 a total of 49 more times, resulting in 50 recorded years.</li>
 </ol>
-<p>Since we had 35 of our friends perform this task, we ended up with 35 <span class="math inline">\(\times\)</span> 50 = 1750 values. We recorded these values in a <a href="https://docs.google.com/spreadsheets/d/1y3kOsU_wDrDd5eiJbEtLeHT9L5SvpZb_TrzwFBsouk0/">shared spreadsheet</a> with 50 rows (plus a header row) and 35 columns. We display a snapshot of the first 10 rows and 5 columns of this shared spreadsheet in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-5">8.10</a>.</p>
-<!--
-TODO: Change header row in both spreadsheet and in corresponding pennies_resamples
-data frame in moderndive pkg.
--->
+<p>Since we had 35 of our friends perform this task, we ended up with <span class="math inline">\(35 \cdot 50 = 1750\)</span> values. We recorded these values in a <a href="https://docs.google.com/spreadsheets/d/1y3kOsU_wDrDd5eiJbEtLeHT9L5SvpZb_TrzwFBsouk0/">shared spreadsheet</a> with 50 rows (plus a header row) and 35 columns. We display a snapshot of the first 10 rows and five columns of this shared spreadsheet in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-5">8.10</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-resampling-5"></span>
 <img src="images/sampling/pennies/tactile_simulation/5_shared_spreadsheet.png" alt="Snapshot of shared spreadsheet of resampled pennies." width="70%" />
 <p class="caption">
 FIGURE 8.10: Snapshot of shared spreadsheet of resampled pennies.
 </p>
 </div>
-<p>For your convenience, we’ve taken these 35 <span class="math inline">\(\times\)</span> 50 = 1750 values and saved them in <code>pennies_resamples</code>, a “tidy” data frame included in the <code>moderndive</code> package. We saw what it means for a data frame to be “tidy” in Subsection <a href="4-tidy.html#tidy-definition">4.2.1</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_resamples</code></pre>
+<p>For your convenience, we’ve taken these <span class="math inline">\(35 \cdot 50 = 1750\)</span> values and saved them in <code>pennies_resamples</code>, a “tidy” data frame included in the <code>moderndive</code> package. We saw what it means for a data frame to be “tidy” in Subsection <a href="4-tidy.html#tidy-definition">4.2.1</a>.</p>
+<div class="sourceCode" id="cb248"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb248-1" data-line-number="1">pennies_resamples</a></code></pre></div>
 <pre><code># A tibble: 1,750 x 3
-   replicate name   year
-       &lt;int&gt; &lt;chr&gt; &lt;dbl&gt;
- 1         1 A      1988
- 2         1 A      2002
- 3         1 A      2015
- 4         1 A      1998
- 5         1 A      1979
- 6         1 A      1971
- 7         1 A      1971
- 8         1 A      2015
- 9         1 A      1988
-10         1 A      1979
+# Groups:   name [35]
+   replicate name     year
+       &lt;int&gt; &lt;chr&gt;   &lt;dbl&gt;
+ 1         1 Arianna  1988
+ 2         1 Arianna  2002
+ 3         1 Arianna  2015
+ 4         1 Arianna  1998
+ 5         1 Arianna  1979
+ 6         1 Arianna  1971
+ 7         1 Arianna  1971
+ 8         1 Arianna  2015
+ 9         1 Arianna  1988
+10         1 Arianna  1979
 # … with 1,740 more rows</code></pre>
 <p>What did each of our 35 friends obtain as the mean year? Once again, <code>dplyr</code> to the rescue! After grouping the rows by <code>name</code>, we summarize each group of 50 rows by their mean <code>year</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">resampled_means &lt;-<span class="st"> </span>pennies_resamples <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(name) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))
-resampled_means</code></pre>
+<div class="sourceCode" id="cb250"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb250-1" data-line-number="1">resampled_means &lt;-<span class="st"> </span>pennies_resamples <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb250-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(name) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb250-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a>
+<a class="sourceLine" id="cb250-4" data-line-number="4">resampled_means</a></code></pre></div>
 <pre><code># A tibble: 35 x 2
-   name  mean_year
-   &lt;chr&gt;     &lt;dbl&gt;
- 1 A       1992.5 
- 2 AA      1995.86
- 3 B       1996.42
- 4 BB      1992.4 
- 5 C       1996.32
- 6 CC      1995.88
- 7 D       1996.9 
- 8 DD      1997.46
- 9 E       1991.22
-10 EE      1998.44
+   name      mean_year
+   &lt;chr&gt;         &lt;dbl&gt;
+ 1 Arianna     1992.5 
+ 2 Artemis     1996.42
+ 3 Bea         1996.32
+ 4 Camryn      1996.9 
+ 5 Cassandra   1991.22
+ 6 Cindy       1995.48
+ 7 Claire      1995.52
+ 8 Dahlia      1998.48
+ 9 Dan         1993.86
+10 Eindra      1993.56
 # … with 25 more rows</code></pre>
-<p>Observe that <code>resampled_means</code> has 35 rows corresponding to the 35 means based on the 35 resamples. Furthermore, observe the variation in the 35 values in the variable <code>mean_year</code>. Let’s visualize this variation using a histogram in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a>. Recall that adding the argument <code>boundary = 1990</code> to the <code>geom_histogram()</code> sets the binning structure so that one of the bin boundaries is 1990 exactly.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(resampled_means, <span class="kw">aes</span>(<span class="dt">x =</span> mean_year)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">1</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">boundary =</span> <span class="dv">1990</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Sampled mean year&quot;</span>)</code></pre>
+<p>Observe that <code>resampled_means</code> has 35 rows corresponding to the 35 means based on the 35 resamples. Furthermore, observe the variation in the 35 values in the variable <code>mean_year</code>. Let’s visualize this variation using a histogram in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a>. Recall that adding the argument <code>boundary = 1990</code> to <code>geom_histogram()</code> sets the binning structure so that one of the bin boundaries is at exactly 1990.</p>
+<div class="sourceCode" id="cb252"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb252-1" data-line-number="1"><span class="kw">ggplot</span>(resampled_means, <span class="kw">aes</span>(<span class="dt">x =</span> mean_year)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb252-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">1</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">boundary =</span> <span class="dv">1990</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb252-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Sampled mean year&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:tactile-resampling-6"></span>
-<img src="moderndive_files/figure-html/tactile-resampling-6-1.png" alt="Distribution of 35 sample means from 35 resamples." width="\textwidth" />
+<img src="ModernDive_files/figure-html/tactile-resampling-6-1.png" alt="Distribution of 35 sample means from 35 resamples." width="\textwidth" />
 <p class="caption">
 FIGURE 8.11: Distribution of 35 sample means from 35 resamples.
 </p>
 </div>
-<p>Observe in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a> that the distribution looks roughly normal and that we rarely observe sample mean years less than in 1992 or greater than 2000. Also observe how the distribution is roughly centered at 1995, which is the sample mean of 1995.44 of the <em>original sample</em> of 50 pennies from the bank.</p>
+<p>Observe in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a> that the distribution looks roughly normal and that we rarely observe sample mean years less than 1992 or greater than 2000. Also observe how the distribution is roughly centered at 1995, which is close to the sample mean of 1995.44 of the <em>original sample</em> of 50 pennies from the bank.</p>
 </div>
 <div id="what-did-we-just-do-1" class="section level3">
 <h3><span class="header-section-number">8.1.4</span> What did we just do?</h3>
-<p>What we just demonstrated in this activity is the statistical procedure known as <em>bootstrap resampling with replacement</em> . We used <em>resampling</em> to mimic the sampling variation we studied in Chapter <a href="7-sampling.html#sampling">7</a> on sampling. However in this case, we did so using only a <em>single</em> sample from the population.</p>
-<p>In fact, the histogram of sample means from 35 resamples in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a> is called the <em>bootstrap distribution</em> . It is an <em>approximation</em> to the <em>sampling distribution</em> of the sample mean, in the sense that both distributions will have a similar shape and similar spread. In fact in the upcoming Section <a href="8-confidence-intervals.html#ci-conclusion">8.7</a>, we’ll show you that this is the case.</p>
-<p>Using this bootstrap distribution, we can study the effect of sampling variation on our estimates. In particular, we’ll study the typical “error” of our estimates, known as the <em>standard error</em> .</p>
-<p>In Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a> we’ll mimic our tactile resampling activity virtually on the computer, allowing us to quickly perform the resampling many more than 35 times. In Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a> we’ll define the statistical concept of a <em>confidence interval</em>, which builds off bootstrap distributions.</p>
-<p>In Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>, construct confidence intervals using the <code>dplyr</code> package, as well as a new package: the <code>infer</code> package for “tidy” and transparent statistical inference. We’ve already used one of the <code>infer</code> package’s functions, <code>rep_sample_n()</code>, but there’s a lot more. We’ll introduce the “tidy” statistical inference framework that was the motivation for the <code>infer</code> package pipeline that will be the driving package throughout the rest of this book.</p>
-<p>As we did in Chapter <a href="7-sampling.html#sampling">7</a>, we’ll tie all these ideas together with a real-life case study in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a>. This time we’ll look at data from an experiment about yawning from the US television show Mythbusters.</p>
+<p>What we just demonstrated in this activity is the statistical procedure known as <em>bootstrap resampling with replacement</em>. We used <em>resampling</em> to mimic the sampling variation we studied in Chapter <a href="7-sampling.html#sampling">7</a> on sampling. However, in this case, we did so using only a <em>single</em> sample from the population.</p>
+<p>In fact, the histogram of sample means from 35 resamples in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a> is called the <em>bootstrap distribution</em>. It is an <em>approximation</em> to the <em>sampling distribution</em> of the sample mean, in the sense that both distributions will have a similar shape and similar spread. In the upcoming Section <a href="8-confidence-intervals.html#ci-conclusion">8.7</a>, we’ll show you that this is the case. Using this bootstrap distribution, we can study the effect of sampling variation on our estimates. In particular, we’ll study the typical “error” of our estimates, known as the <em>standard error</em>.</p>
+<p>In Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a> we’ll mimic our tactile resampling activity virtually on the computer, allowing us to quickly perform the resampling many more than 35 times. In Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a> we’ll define the statistical concept of a <em>confidence interval</em>, which builds off the concept of bootstrap distributions.</p>
+<p>In Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>, we’ll construct confidence intervals using the <code>dplyr</code> package, as well as a new package: the <code>infer</code> package for “tidy” and transparent statistical inference. We’ll introduce the “tidy” statistical inference framework that was the motivation for the <code>infer</code> package pipeline. The <code>infer</code> package will be the driving package throughout the rest of this book.</p>
+<p>As we did in Chapter <a href="7-sampling.html#sampling">7</a>, we’ll tie all these ideas together with a real-life case study in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a>. This time we’ll look at data from an experiment about yawning from the US television show <em>Mythbusters</em>.</p>
 </div>
 </div>
 <div id="resampling-simulation" class="section level2">
 <h2><span class="header-section-number">8.2</span> Computer simulation of resampling</h2>
-<p>Let’s now mimic our tactile resampling activity virtually by using our computer.</p>
+<p>Let’s now mimic our tactile resampling activity virtually with a computer.</p>
 <div id="virtually-resampling-once" class="section level3">
 <h3><span class="header-section-number">8.2.1</span> Virtually resampling once</h3>
 <p>First, let’s perform the virtual analog of resampling once. Recall that the <code>pennies_sample</code> data frame included in the <code>moderndive</code> package contains the years of our original sample of 50 pennies from the bank. Furthermore, recall in Chapter <a href="7-sampling.html#sampling">7</a> on sampling that we used the <code>rep_sample_n()</code> function as a virtual shovel to sample balls from our virtual bowl of 2400 balls as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_shovel &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>)</code></pre>
+<div class="sourceCode" id="cb253"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb253-1" data-line-number="1">virtual_shovel &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb253-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>)</a></code></pre></div>
 <p>Let’s modify this code to perform the resampling with replacement of the 50 slips of paper representing our original sample of 50 pennies:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resample &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>)</code></pre>
+<div class="sourceCode" id="cb254"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb254-1" data-line-number="1">virtual_resample &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb254-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>)</a></code></pre></div>
 <p>Observe how we explicitly set the <code>replace</code> argument to <code>TRUE</code> in order to tell <code>rep_sample_n()</code> that we would like to sample pennies <em>with</em> replacement. Had we not set <code>replace = TRUE</code>, the function would’ve assumed the default value of <code>FALSE</code> and hence done resampling <em>without</em> replacement. Additionally, since we didn’t specify the number of replicates via the <code>reps</code> argument, the function assumes the default of one replicate (<code>reps = 1</code>). Lastly, observe also that the <code>size</code> argument is set to match the original sample size of 50 pennies.</p>
 <p>Let’s look at only the first 10 out of 50 rows of <code>virtual_resample</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resample</code></pre>
+<div class="sourceCode" id="cb255"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb255-1" data-line-number="1">virtual_resample</a></code></pre></div>
 <pre><code># A tibble: 50 x 3
 # Groups:   replicate [1]
    replicate    ID  year
@@ -901,10 +908,9 @@ <h3><span class="header-section-number">8.2.1</span> Virtually resampling once</
  9         1    23  1998
 10         1    44  2015
 # … with 40 more rows</code></pre>
-<p>The <code>replicate</code> variable only takes on the value of 1 corresponding to us only having <code>reps = 1</code>, the <code>ID</code> variable indicates which of the 50 pennies from <code>pennies_sample</code> was resampled, and <code>year</code> denotes the year of minting.</p>
-<p>Let’s now compute the mean <code>year</code> in our virtual resample of size 50 using data wrangling functions included in the <code>dplyr</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">resample_mean =</span> <span class="kw">mean</span>(year))</code></pre>
+<p>The <code>replicate</code> variable only takes on the value 1 because we only have <code>reps = 1</code>, the <code>ID</code> variable indicates which of the 50 pennies from <code>pennies_sample</code> was resampled, and <code>year</code> denotes the year of minting. Let’s now compute the mean <code>year</code> in our virtual resample of size 50 using data wrangling functions included in the <code>dplyr</code> package:</p>
+<div class="sourceCode" id="cb257"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb257-1" data-line-number="1">virtual_resample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb257-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">resample_mean =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
   replicate resample_mean
       &lt;int&gt;         &lt;dbl&gt;
@@ -921,9 +927,9 @@ <h3><span class="header-section-number">8.2.1</span> Virtually resampling once</
 <div id="bootstrap-35-replicates" class="section level3">
 <h3><span class="header-section-number">8.2.2</span> Virtually resampling 35 times</h3>
 <p>Let’s now perform the virtual analog of our 35 friends’ resampling. Using these results, we’ll be able to study the variability in the sample means from 35 resamples of size 50. Let’s first add a <code>reps = 35</code> argument to <code>rep_sample_n()</code> to indicate we would like 35 replicates. Thus, we want to repeat the resampling with replacement of 50 pennies 35 times.</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resamples &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">35</span>)
-virtual_resamples</code></pre>
+<div class="sourceCode" id="cb259"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb259-1" data-line-number="1">virtual_resamples &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb259-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">35</span>)</a>
+<a class="sourceLine" id="cb259-3" data-line-number="3">virtual_resamples</a></code></pre></div>
 <pre><code># A tibble: 1,750 x 3
 # Groups:   replicate [35]
    replicate    ID  year
@@ -939,11 +945,11 @@ <h3><span class="header-section-number">8.2.2</span> Virtually resampling 35 tim
  9         1    49  2006
 10         1     2  1986
 # … with 1,740 more rows</code></pre>
-<p>The resulting <code>virtual_resamples</code> data frame has 35 <span class="math inline">\(\times\)</span> 50 = 1750 rows corresponding to 35 resamples of 50 pennies. Let’s now compute the resulting 35 sample means using the same <code>dplyr</code> code as we did in the previous section, but this time adding a <code>group_by(replicate)</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resampled_means &lt;-<span class="st"> </span>virtual_resamples <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))
-virtual_resampled_means</code></pre>
+<p>The resulting <code>virtual_resamples</code> data frame has <span class="math inline">\(35 \cdot 50 = 1750\)</span> rows corresponding to 35 resamples of 50 pennies. Let’s now compute the resulting 35 sample means using the same <code>dplyr</code> code as we did in the previous section, but this time adding a <code>group_by(replicate)</code>:</p>
+<div class="sourceCode" id="cb261"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb261-1" data-line-number="1">virtual_resampled_means &lt;-<span class="st"> </span>virtual_resamples <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb261-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb261-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a>
+<a class="sourceLine" id="cb261-4" data-line-number="4">virtual_resampled_means</a></code></pre></div>
 <pre><code># A tibble: 35 x 2
    replicate mean_year
        &lt;int&gt;     &lt;dbl&gt;
@@ -959,23 +965,23 @@ <h3><span class="header-section-number">8.2.2</span> Virtually resampling 35 tim
 10        10   1996.88
 # … with 25 more rows</code></pre>
 <p>Observe that <code>virtual_resampled_means</code> has 35 rows, corresponding to the 35 resampled means. Furthermore, observe that the values of <code>mean_year</code> vary. Let’s visualize this variation using a histogram in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-7">8.12</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(virtual_resampled_means, <span class="kw">aes</span>(<span class="dt">x =</span> mean_year)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">1</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">boundary =</span> <span class="dv">1990</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Resample mean year&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb263"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb263-1" data-line-number="1"><span class="kw">ggplot</span>(virtual_resampled_means, <span class="kw">aes</span>(<span class="dt">x =</span> mean_year)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb263-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">1</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">boundary =</span> <span class="dv">1990</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb263-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Resample mean year&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:tactile-resampling-7"></span>
-<img src="moderndive_files/figure-html/tactile-resampling-7-1.png" alt="Distribution of 35 sample means from 35 resamples." width="\textwidth" />
+<img src="ModernDive_files/figure-html/tactile-resampling-7-1.png" alt="Distribution of 35 sample means from 35 resamples." width="\textwidth" />
 <p class="caption">
 FIGURE 8.12: Distribution of 35 sample means from 35 resamples.
 </p>
 </div>
 <p>Let’s compare our virtually constructed bootstrap distribution with the one our 35 friends constructed via our tactile resampling exercise in Figure <a href="8-confidence-intervals.html#fig:orig-and-resample-means">8.13</a>. Observe how they are somewhat similar, but not identical.</p>
 <div class="figure" style="text-align: center"><span id="fig:orig-and-resample-means"></span>
-<img src="moderndive_files/figure-html/orig-and-resample-means-1.png" alt="Comparing distributions of means from resamples." width="\textwidth" />
+<img src="ModernDive_files/figure-html/orig-and-resample-means-1.png" alt="Comparing distributions of means from resamples." width="\textwidth" />
 <p class="caption">
 FIGURE 8.13: Comparing distributions of means from resamples.
 </p>
 </div>
-<p>Recall that in the “resampling with replacement” scenario we are illustrating here both of these histograms have a special name: the <em>bootstrap distribution of the sample mean</em>. Furthermore, they are an approximation to the <em>sampling distribution</em> of the sample mean, a concept you saw in Chapter <a href="7-sampling.html#sampling">7</a> on sampling. These distributions allow us to study the effect of sampling variation on our estimates of the true population mean, in this case the true mean year for <em>all</em> US pennies. However, unlike in Chapter <a href="7-sampling.html#sampling">7</a> where took multiple samples (something one would never do in practice), bootstrap distributions are constructed by taking multiple resamples from a <em>single</em> sample. In this case the 50 original pennies from the bank.</p>
+<p>Recall that in the “resampling with replacement” scenario we are illustrating here, both of these histograms have a special name: the <em>bootstrap distribution of the sample mean</em>. Furthermore, recall they are an approximation to the <em>sampling distribution</em> of the sample mean, a concept you saw in Chapter <a href="7-sampling.html#sampling">7</a> on sampling. These distributions allow us to study the effect of sampling variation on our estimates of the true population mean, in this case the true mean year for <em>all</em> US pennies. However, unlike in Chapter <a href="7-sampling.html#sampling">7</a> where we took multiple samples (something one would never do in practice), bootstrap distributions are constructed by taking multiple resamples from a <em>single</em> sample: in this case, the 50 original pennies from the bank.</p>
 <!--
 <div class="learncheck">
 
@@ -994,20 +1000,20 @@ <h3><span class="header-section-number">8.2.2</span> Virtually resampling 35 tim
 <div id="bootstrap-1000-replicates" class="section level3">
 <h3><span class="header-section-number">8.2.3</span> Virtually resampling 1000 times</h3>
 <p>Remember that one of the goals of resampling with replacement is to construct the bootstrap distribution, which is an approximation of the sampling distribution. However, the bootstrap distribution in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-7">8.12</a> is based only on 35 resamples and hence looks a little coarse. Let’s increase the number of resamples to 1000, so that we can hopefully better see the shape and the variability between different resamples.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Repeat resampling 1000 times</span>
-virtual_resamples &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)
-
-<span class="co"># Compute 1000 sample means</span>
-virtual_resampled_means &lt;-<span class="st"> </span>virtual_resamples <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</code></pre>
-<p>However, in the interest of brevity, going forward let’s combine these two operations into a single chain of <code>%&gt;%</code> pipe operators:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resampled_means &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))
-virtual_resampled_means</code></pre>
+<div class="sourceCode" id="cb264"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb264-1" data-line-number="1"><span class="co"># Repeat resampling 1000 times</span></a>
+<a class="sourceLine" id="cb264-2" data-line-number="2">virtual_resamples &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb264-3" data-line-number="3"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a>
+<a class="sourceLine" id="cb264-4" data-line-number="4"></a>
+<a class="sourceLine" id="cb264-5" data-line-number="5"><span class="co"># Compute 1000 sample means</span></a>
+<a class="sourceLine" id="cb264-6" data-line-number="6">virtual_resampled_means &lt;-<span class="st"> </span>virtual_resamples <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb264-7" data-line-number="7"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb264-8" data-line-number="8"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
+<p>However, in the interest of brevity, going forward let’s combine these two operations into a single chain of pipe (<code>%&gt;%</code>) operators:</p>
+<div class="sourceCode" id="cb265"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb265-1" data-line-number="1">virtual_resampled_means &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb265-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb265-3" data-line-number="3"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb265-4" data-line-number="4"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a>
+<a class="sourceLine" id="cb265-5" data-line-number="5">virtual_resampled_means</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate mean_year
        &lt;int&gt;     &lt;dbl&gt;
@@ -1022,24 +1028,24 @@ <h3><span class="header-section-number">8.2.3</span> Virtually resampling 1000 t
  9         9   1994.88
 10        10   1996.3 
 # … with 990 more rows</code></pre>
-<p>In Figure <a href="8-confidence-intervals.html#fig:one-thousand-sample-means">8.14</a> let’s visualize the bootstrap distribution of these 1000 means based 1000 virtual resamples:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(virtual_resampled_means, <span class="kw">aes</span>(<span class="dt">x =</span> mean_year)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">1</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">boundary =</span> <span class="dv">1990</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;sample mean&quot;</span>)</code></pre>
+<p>In Figure <a href="8-confidence-intervals.html#fig:one-thousand-sample-means">8.14</a> let’s visualize the bootstrap distribution of these 1000 means based on 1000 virtual resamples:</p>
+<div class="sourceCode" id="cb267"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb267-1" data-line-number="1"><span class="kw">ggplot</span>(virtual_resampled_means, <span class="kw">aes</span>(<span class="dt">x =</span> mean_year)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb267-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">1</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">boundary =</span> <span class="dv">1990</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb267-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;sample mean&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:one-thousand-sample-means"></span>
-<img src="moderndive_files/figure-html/one-thousand-sample-means-1.png" alt="Bootstrap resampling distribution based on 1000 resamples." width="\textwidth" />
+<img src="ModernDive_files/figure-html/one-thousand-sample-means-1.png" alt="Bootstrap resampling distribution based on 1000 resamples." width="\textwidth" />
 <p class="caption">
 FIGURE 8.14: Bootstrap resampling distribution based on 1000 resamples.
 </p>
 </div>
 <p>Note here that the bell shape is starting to become much more apparent. We now have a general sense for the range of values that the sample mean may take on. But where is this histogram centered? Let’s compute the mean of the 1000 resample means:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resampled_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_of_means =</span> <span class="kw">mean</span>(mean_year))</code></pre>
+<div class="sourceCode" id="cb268"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb268-1" data-line-number="1">virtual_resampled_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb268-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_of_means =</span> <span class="kw">mean</span>(mean_year))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   mean_of_means
           &lt;dbl&gt;
 1       1995.36</code></pre>
-<p>The mean of these 1000 means is 1995.36, which is quite close to the mean of our original sample of 50 pennies of 1995.44. This is the case since each of the 1000 resamples are based on the original sample of 50 pennies.</p>
+<p>The mean of these 1000 means is 1995.36, which is quite close to the mean of our original sample of 50 pennies of 1995.44. This is the case since each of the 1000 resamples is based on the original sample of 50 pennies.</p>
 <p>Congratulations! You’ve just constructed your first bootstrap distribution! In the next section, you’ll see how to use this bootstrap distribution to construct <em>confidence intervals</em>.</p>
 <div class="learncheck">
 <p>
@@ -1073,25 +1079,26 @@ <h2><span class="header-section-number">8.3</span> Understanding confidence inte
 </div>
 <p>Our proposed interval of 1992 to 2000 was constructed by eye and was thus somewhat subjective. We now introduce two methods for constructing such intervals in a more exact fashion: the <em>percentile method</em> and the <em>standard error method</em>.</p>
 <p>Both methods for confidence interval construction share some commonalities. First, they are both constructed from a bootstrap distribution, as you constructed in Subsection <a href="8-confidence-intervals.html#bootstrap-1000-replicates">8.2.3</a> and visualized in Figure <a href="8-confidence-intervals.html#fig:one-thousand-sample-means">8.14</a>.</p>
-<p>Second, they both require you to specify the  <em>confidence level</em>. Commonly used confidence levels include 90%, 95%, and 99%. All other things being equal, higher confidence levels correspond to wider confidence intervals and lower confidence levels correspond to narrower confidence intervals. In this book, we’ll be mostly using 95% and hence constructing “95% confidence intervals for <span class="math inline">\(\mu\)</span>.”</p>
+<p>Second, they both require you to specify the <em>confidence level</em>. Commonly used confidence levels include 90%, 95%, and 99%. All other things being equal, higher confidence levels correspond to wider confidence intervals, and lower confidence levels correspond to narrower confidence intervals. In this book, we’ll mostly use 95% and hence construct “95% confidence intervals for <span class="math inline">\(\mu\)</span>” for our pennies activity.</p>
 <div id="percentile-method" class="section level3">
 <h3><span class="header-section-number">8.3.1</span> Percentile method</h3>
-<p>One method to construct a confidence interval is to use the middle 95% of values of the bootstrap distribution. We can do this by computing the 2.5<sup>th</sup> and 97.5<sup>th</sup> percentiles, which are 1991.059 and 1999.283 respectively. This is known as the <em>percentile method</em> for constructing confidence intervals.</p>
-<p>For now, let’s focus only on the concepts behind a percentile method constructed confidence interval; we’ll show you the code to compute these values in the next section.</p>
-<p>Let’s mark these percentiles on the bootstrap distribution with vertical lines in Figure <a href="8-confidence-intervals.html#fig:percentile-method">8.16</a>. About 95% of the values in the <code>mean_year</code> variable in <code>virtual_resampled_means</code> fall between the 1991.059 and 1999.283 endpoints, with 2.5% to the left of the left-most line and 2.5% to the right of the right-most line.</p>
+<p>One method to construct a confidence interval is to use the middle 95% of values of the bootstrap distribution. We can do this by computing the 2.5th and 97.5th percentiles, which are 1991.059 and 1999.283, respectively. This is known as the <em>percentile method</em> for constructing confidence intervals.</p>
+<p>For now, let’s focus only on the concepts behind a confidence interval constructed using the percentile method; we’ll show you the code that computes these values in the next section.</p>
+<p>Let’s mark these percentiles on the bootstrap distribution with vertical lines in Figure <a href="8-confidence-intervals.html#fig:percentile-method">8.16</a>. About 95% of the <code>mean_year</code> variable values in <code>virtual_resampled_means</code> fall between 1991.059 and 1999.283, with 2.5% to the left of the leftmost line and 2.5% to the right of the rightmost line.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:percentile-method"></span>
-<img src="moderndive_files/figure-html/percentile-method-1.png" alt="Percentile method 95 percent confidence interval. Interval marked by vertical lines." width="\textwidth" />
+<img src="ModernDive_files/figure-html/percentile-method-1.png" alt="Percentile method 95% confidence interval. Interval endpoints marked by vertical lines." width="\textwidth" />
 <p class="caption">
-FIGURE 8.16: Percentile method 95 percent confidence interval. Interval marked by vertical lines.
+FIGURE 8.16: Percentile method 95% confidence interval. Interval endpoints marked by vertical lines.
 </p>
 </div>
 </div>
 <div id="se-method" class="section level3">
 <h3><span class="header-section-number">8.3.2</span> Standard error method</h3>
-<p>Recall in Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>, we saw that if a numerical variable follows a normal distribution, or in other words the histogram of this variable is bell-shaped, then roughly 95% of values fall between <span class="math inline">\(\pm\)</span> 1.96 standard deviations of the mean. Given that our bootstrap distribution based on 1000 resamples with replacement in Figure <a href="8-confidence-intervals.html#fig:one-thousand-sample-means">8.14</a> is normally shaped, let’s use this fact about normal distributions to construct a confidence interval in a different way.</p>
+<p>Recall from Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a> that if a numerical variable follows a normal distribution, or, in other words, the histogram of this variable is bell-shaped, then roughly 95% of values fall within <span class="math inline">\(\pm\)</span> 1.96 standard deviations of the mean. Given that our bootstrap distribution based on 1000 resamples with replacement in Figure <a href="8-confidence-intervals.html#fig:one-thousand-sample-means">8.14</a> is normally shaped, let’s use this fact about normal distributions to construct a confidence interval in a different way.</p>
 <p>First, recall the bootstrap distribution has a mean equal to 1995.36. This value almost coincides exactly with the value of the sample mean <span class="math inline">\(\overline{x}\)</span> of our original 50 pennies of 1995.44. Second, let’s compute the standard deviation of the bootstrap distribution using the values of <code>mean_year</code> in the <code>virtual_resampled_means</code> data frame:</p>
-<pre class="sourceCode r"><code class="sourceCode r">virtual_resampled_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">SE =</span> <span class="kw">sd</span>(mean_year))</code></pre>
+<div class="sourceCode" id="cb270"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb270-1" data-line-number="1">virtual_resampled_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb270-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">SE =</span> <span class="kw">sd</span>(mean_year))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
        SE
     &lt;dbl&gt;
@@ -1106,13 +1113,14 @@ <h3><span class="header-section-number">8.3.2</span> Standard error method</h3>
 \end{aligned}
 \]</span></p>
 <p>Let’s now add the SE method confidence interval with dashed lines in Figure <a href="8-confidence-intervals.html#fig:percentile-and-se-method">8.17</a>.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:percentile-and-se-method"></span>
-<img src="moderndive_files/figure-html/percentile-and-se-method-1.png" alt="Comparing two 95 percent confidence interval methods." width="\textwidth" />
+<img src="ModernDive_files/figure-html/percentile-and-se-method-1.png" alt="Comparing two 95% confidence interval methods." width="\textwidth" />
 <p class="caption">
-FIGURE 8.17: Comparing two 95 percent confidence interval methods.
+FIGURE 8.17: Comparing two 95% confidence interval methods.
 </p>
 </div>
-<p>We see that both methods produce nearly identical 95% confidence intervals for <span class="math inline">\(\mu\)</span> with the percentile method yielding <span class="math inline">\((1991.06, 1999.28)\)</span> while the standard error method being <span class="math inline">\((1991.22, 1999.66)\)</span>. However, recall that we can only use the standard error rule when the bootstrap distribution is roughly normally-shaped.</p>
+<p>We see that both methods produce nearly identical 95% confidence intervals for <span class="math inline">\(\mu\)</span>, with the percentile method yielding <span class="math inline">\((1991.06, 1999.28)\)</span> and the standard error method yielding <span class="math inline">\((1991.22, 1999.66)\)</span>. However, recall that we can only use the standard error rule when the bootstrap distribution is roughly normally shaped.</p>
 <p>Now that we’ve introduced the concept of confidence intervals and laid out the intuition behind two methods for constructing them, let’s explore the code that allows us to construct them.</p>
 <!--
 The variability of the sampling distribution may be approximated by the variability of the resampling distribution. Traditional theory-based methodologies for inference also have formulas for standard errors, assuming some conditions are met.
@@ -1146,29 +1154,30 @@ <h3><span class="header-section-number">8.3.2</span> Standard error method</h3>
 </div>
 <div id="bootstrap-process" class="section level2">
 <h2><span class="header-section-number">8.4</span> Constructing confidence intervals</h2>
-<p>Recall that the process of resampling with a replacement we performed by hand in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a> and virtually in Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a> is known as  <em>bootstrapping</em>. The term bootstrapping originates in the expression of “pulling oneself up by their bootstraps,” meaning to <a href="https://en.wiktionary.org/wiki/pull_oneself_up_by_one%27s_bootstraps">“succeed only by one’s own efforts or abilities.”</a> From a statistical perspective, it alludes to succeeding in being able to study the effects of sampling variation on estimates from the “effort” of a single sample. Or more precisely,  constructing an approximation to the sampling distribution using only one sample.</p>
+<p>Recall that the process of resampling with replacement we performed by hand in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a> and virtually in Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a> is known as <em>bootstrapping</em>. The term bootstrapping originates from the expression “pulling oneself up by one’s bootstraps,” meaning to <a href="https://en.wiktionary.org/wiki/pull_oneself_up_by_one%27s_bootstraps">“succeed only by one’s own efforts or abilities.”</a></p>
+<p>From a statistical perspective, bootstrapping alludes to being able to study the effects of sampling variation on estimates from the “effort” of a single sample. More precisely, it refers to constructing an approximation to the sampling distribution using only one sample.</p>
 <p>To perform this resampling with replacement virtually in Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a>, we used the <code>rep_sample_n()</code> function, making sure that the size of the resamples matched the original sample size of 50. In this section, we’ll build off these ideas to construct confidence intervals using a new package: the <code>infer</code> package for “tidy” and transparent statistical inference.</p>
 <div id="original-workflow" class="section level3">
 <h3><span class="header-section-number">8.4.1</span> Original workflow</h3>
-<p>Recall that in Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a>, we virtually performed bootstrap resampling with replacement to construct bootstrap distributions. Such distributions are approximations to the sampling distributions we saw in Chapter <a href="7-sampling.html#sampling">7</a>, but are constructed using only a single sample. Let’s revisit the original workflow using the <code>%&gt;%</code> pipe operator:</p>
+<p>Recall that in Section <a href="8-confidence-intervals.html#resampling-simulation">8.2</a>, we virtually performed bootstrap resampling with replacement to construct bootstrap distributions. Such distributions are approximations to the sampling distributions we saw in Chapter <a href="7-sampling.html#sampling">7</a>, but are constructed using only a single sample. Let’s revisit the original workflow using the <code>%&gt;%</code> pipe operator.</p>
 <p>First, we used the <code>rep_sample_n()</code> function to resample <code>size = 50</code> pennies with replacement from the original sample of 50 pennies in <code>pennies_sample</code> by setting <code>replace = TRUE</code>. Furthermore, we repeated this resampling 1000 times by setting <code>reps = 1000</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</code></pre>
+<div class="sourceCode" id="cb272"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb272-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb272-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a></code></pre></div>
 <p>Second, since for each of our 1000 resamples of size 50, we wanted to compute a separate sample mean, we used the <code>dplyr</code> verb <code>group_by()</code> to group observations/rows together by the <code>replicate</code> variable…</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) </code></pre>
+<div class="sourceCode" id="cb273"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb273-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb273-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb273-3" data-line-number="3"><span class="st">  </span><span class="kw">group_by</span>(replicate) </a></code></pre></div>
 <p>… followed by using <code>summarize()</code> to compute the sample <code>mean()</code> year for each <code>replicate</code> group:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</code></pre>
+<div class="sourceCode" id="cb274"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb274-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb274-2" data-line-number="2"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, <span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb274-3" data-line-number="3"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb274-4" data-line-number="4"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
 <p>For this simple case, we can get by with using the <code>rep_sample_n()</code> function and a couple of <code>dplyr</code> verbs to construct the bootstrap distribution. However, using only <code>dplyr</code> verbs only provides us with a limited set of tools. For more complicated situations, we’ll need a little more firepower. Let’s repeat this using the <code>infer</code> package.</p>
 </div>
 <div id="infer-workflow" class="section level3">
-<h3><span class="header-section-number">8.4.2</span> infer package workflow</h3>
+<h3><span class="header-section-number">8.4.2</span> <code>infer</code> package workflow</h3>
 <!--
-TODO: In future, consider
+TODO: Using infer to compute observed point estimate
 
 1. Showing `dplyr` code to compute observed point estimate
 1. Showing `infer` verbs to compute observed point estimate. i.e. no generate()
@@ -1177,30 +1186,30 @@ <h3><span class="header-section-number">8.4.2</span> infer package workflow</h3>
 bootstrap distribution of point estimate. i.e. with generate() and showing
 diagram.
 -->
-<p>The <code>infer</code> package is an R package for statistical inference. It makes efficient use of the <code>%&gt;%</code> pipe operator we saw in Section <a href="3-wrangling.html#piping">3.1</a> to spell out the sequence of steps necessary to perform statistical inference in a “tidy” and transparent fashion. Furthermore, just as the <code>dplyr</code> package provides functions with intuitive verb-like names to perform data wrangling, the <code>infer</code> package provides functions intuitive verb-like names to perform statistical inference.</p>
+<p>The <code>infer</code> package is an R package for statistical inference. It makes efficient use of the <code>%&gt;%</code> pipe operator we introduced in Section <a href="3-wrangling.html#piping">3.1</a> to spell out the sequence of steps necessary to perform statistical inference in a “tidy” and transparent fashion. Furthermore, just as the <code>dplyr</code> package provides functions with verb-like names to perform data wrangling, the <code>infer</code> package provides functions with intuitive verb-like names to perform statistical inference.</p>
 <p>Let’s go back to our pennies. Previously, we computed the value of the sample mean <span class="math inline">\(\overline{x}\)</span> using the <code>dplyr</code> function <code>summarize()</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">stat =</span> <span class="kw">mean</span>(year))</code></pre>
+<div class="sourceCode" id="cb275"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb275-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb275-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">stat =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
 <p>We’ll see that we can also do this using <code>infer</code> functions <code>specify()</code> and <code>calculate()</code>: </p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</code></pre>
-<p>You might be asking yourself: “Isn’t the <code>infer</code> code longer? Why would I use that code?” While not immediately apparent, you’ll see that there are three chief benefits to the <code>infer</code> workflow as opposed to the <code>dplyr</code> workflow.</p>
-<p>First, the <code>infer</code> verb names better align with the overall resampling framework you need to understand to construct confidence intervals and to conduct hypothesis tests (in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>). We’ll see flowchart diagrams of this framework in the upcoming Figures <a href="8-confidence-intervals.html#fig:infer-workflow-ci">8.23</a> and <a href="9-hypothesis-testing.html#fig:htdowney">9.14</a>.</p>
-<p>Second, you can jump back and forth seamlessly between confidence intervals and hypothesis testing with minimal changes to your code. This will become apparent in Subsection <a href="9-hypothesis-testing.html#comparing-infer-workflows">9.3.2</a> when we’ll compare the <code>infer</code> code for both these inferential methods.</p>
-<p>Third, the <code>infer</code> workflow is much simpler for conducting inference when you have <em>more than one variable</em>. We’ll see two such situations. We’ll first see situations of <em>two-sample</em> inference where the sample data is collected from two groups, such as in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> where we study the contagiousness of yawning and in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> where we compare promotion rates of two groups at banks in the 1970s. Then in Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll see situations of <em>inference for regression</em> using the regression models you fit in Chapter <a href="5-regression.html#regression">5</a>.</p>
-<p>Let’s now illustrate the sequence of verbs necessary to construct a confidence interval for <span class="math inline">\(\mu\)</span>, the population mean year of minting of all pennies in the US.</p>
+<div class="sourceCode" id="cb276"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb276-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb276-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb276-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a></code></pre></div>
+<p>You might be asking yourself: “Isn’t the <code>infer</code> code longer? Why would I use that code?” While not immediately apparent, you’ll see that there are three chief benefits to the <code>infer</code> workflow as opposed to the <code>dplyr</code> workflow.</p>
+<p>First, the <code>infer</code> verb names better align with the overall resampling framework you need to understand to construct confidence intervals and to conduct hypothesis tests (in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>). We’ll see flowchart diagrams of this framework in the upcoming Figure <a href="8-confidence-intervals.html#fig:infer-workflow-ci">8.23</a> and in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> with Figure <a href="9-hypothesis-testing.html#fig:htdowney">9.14</a>.</p>
+<p>Second, you can jump back and forth seamlessly between confidence intervals and hypothesis testing with minimal changes to your code. This will become apparent in Subsection <a href="9-hypothesis-testing.html#comparing-infer-workflows">9.3.2</a> when we’ll compare the <code>infer</code> code for both of these inferential methods.</p>
+<p>Third, the <code>infer</code> workflow is much simpler for conducting inference when you have <em>more than one variable</em>. We’ll see two such situations. We’ll first see situations of <em>two-sample</em> inference where the sample data is collected from two groups, such as in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> where we study the contagiousness of yawning and in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a> where we compare promotion rates of two groups at banks in the 1970s. Then in Section <a href="10-inference-for-regression.html#infer-regression">10.4</a>, we’ll see situations of <em>inference for regression</em> using the regression models you fit in Chapter <a href="5-regression.html#regression">5</a>.</p>
+<p>Let’s now illustrate the sequence of verbs necessary to construct a confidence interval for <span class="math inline">\(\mu\)</span>, the population mean year of minting of all US pennies in 2019.</p>
 <div id="specify-variables" class="section level4 unnumbered">
 <h4>1. <code>specify</code> variables</h4>
 <div class="figure" style="text-align: center"><span id="fig:infer-specify"></span>
-<img src="images/flowcharts/infer/specify.png" alt="Diagram of specify() variables." width="20%" />
+<img src="images/flowcharts/infer/specify.png" alt="Diagram of the specify() verb." width="20%" height="20%" />
 <p class="caption">
-FIGURE 8.18: Diagram of specify() variables.
+FIGURE 8.18: Diagram of the specify() verb.
 </p>
 </div>
-<p>The <code>specify()</code>  function is used to choose which variables in a data frame will be the focus of our statistical inference. We do this by specifying the <code>response</code> argument. For example, in our <code>pennies_sample</code> data frame of the 50 pennies sampled from the bank, the variable of interest is <code>year</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year)</code></pre>
+<p>As shown in Figure <a href="8-confidence-intervals.html#fig:infer-specify">8.18</a>, the <code>specify()</code>  function is used to choose which variables in a data frame will be the focus of our statistical inference. We do this by <code>specify</code>ing the <code>response</code> argument. For example, in our <code>pennies_sample</code> data frame of the 50 pennies sampled from the bank, the variable of interest is <code>year</code>:</p>
+<div class="sourceCode" id="cb277"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb277-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb277-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year)</a></code></pre></div>
 <pre><code>Response: year (numeric)
 # A tibble: 50 x 1
     year
@@ -1217,170 +1226,179 @@ <h4>1. <code>specify</code> variables</h4>
 10  2000
 # … with 40 more rows</code></pre>
 <p>Notice how the data itself doesn’t change, but the <code>Response: year (numeric)</code> <em>meta-data</em> does. This is similar to how the <code>group_by()</code> verb from <code>dplyr</code> doesn’t change the data, but only adds “grouping” meta-data, as we saw in Section <a href="3-wrangling.html#groupby">3.4</a>.</p>
-<p>We can also specify which variables will be the focus of our statistical inference using a <code>formula = y ~ x</code>. This is the same formula notation you saw in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a> on regression models: the response variable <code>y</code> is separated from the explanatory variable <code>x</code> by a <code>~</code> “tilde.” The following use of <code>specify()</code> with the <code>formula</code> argument yields the same result seen previously:</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> year <span class="op">~</span><span class="st"> </span><span class="ot">NULL</span>)</code></pre>
+<p>We can also specify which variables will be the focus of our statistical inference using a <code>formula = y ~ x</code>. This is the same formula notation you saw in Chapters <a href="5-regression.html#regression">5</a> and <a href="6-multiple-regression.html#multiple-regression">6</a> on regression models: the response variable <code>y</code> is separated from the explanatory variable <code>x</code> by a <code>~</code> (“tilde”). The following use of <code>specify()</code> with the <code>formula</code> argument yields the same result seen previously:</p>
+<div class="sourceCode" id="cb279"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb279-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb279-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> year <span class="op">~</span><span class="st"> </span><span class="ot">NULL</span>)</a></code></pre></div>
 <p>Since in the case of pennies we only have a response variable and no explanatory variable of interest, we set the <code>x</code> on the right-hand side of the <code>~</code> to be <code>NULL</code>.</p>
-<p>While in the case of the pennies either specification works just fine, we’ll see examples later on where we have no choice but to use the <code>formula</code> specification. In particular in the upcoming Sections <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> on comparing two proportions and <a href="10-inference-for-regression.html#infer-regression">10.4</a> on inference for regression.</p>
+<p>While in the case of the pennies either specification works just fine, we’ll see examples later on where the <code>formula</code> specification is simpler. In particular, this comes up in the upcoming Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> on comparing two proportions and Section <a href="10-inference-for-regression.html#infer-regression">10.4</a> on inference for regression.</p>
 </div>
 <div id="generate-replicates" class="section level4 unnumbered">
 <h4>2. <code>generate</code> replicates</h4>
 <div class="figure" style="text-align: center"><span id="fig:infer-generate"></span>
-<img src="images/flowcharts/infer/generate.png" alt="Diagram of generate() replicates." width="50%" />
+<img src="images/flowcharts/infer/generate.png" alt="Diagram of generate() replicates." width="60%" height="60%" />
 <p class="caption">
 FIGURE 8.19: Diagram of generate() replicates.
 </p>
 </div>
-<p>After we <code>specify()</code> the variables of interest, we pipe the results into the <code>generate()</code> function to generate replicates. In other words, repeat the resampling process a large number of times. Recall in Sections <a href="8-confidence-intervals.html#bootstrap-35-replicates">8.2.2</a> and <a href="8-confidence-intervals.html#bootstrap-1000-replicates">8.2.3</a> we did this 35 and 1000 times.</p>
+<p>After we <code>specify()</code> the variables of interest, we pipe the results into the <code>generate()</code> function to generate replicates. In other words, we repeat the resampling process a large number of times. Recall that in Sections <a href="8-confidence-intervals.html#bootstrap-35-replicates">8.2.2</a> and <a href="8-confidence-intervals.html#bootstrap-1000-replicates">8.2.3</a> we did this 35 and 1000 times, respectively. Figure <a href="8-confidence-intervals.html#fig:infer-generate">8.19</a> shows how <code>generate()</code> is combined with <code>specify()</code> to start the pipeline.</p>
 <p>The <code>generate()</code>  function’s first argument is <code>reps</code>, which sets the number of replicates we would like to generate. Since we want to resample the 50 pennies in <code>pennies_sample</code> with replacement 1000 times, we set <code>reps = 1000</code>. The second argument <code>type</code> determines the type of computer simulation we’d like to perform. We set this to <code>type = &quot;bootstrap&quot;</code> indicating that we want to perform bootstrap resampling. You’ll see different options for <code>type</code> in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb280"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb280-1" data-line-number="1">pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb280-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb280-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>)</a></code></pre></div>
 <pre><code>Response: year (numeric)
 # A tibble: 50,000 x 2
 # Groups:   replicate [1,000]
    replicate  year
        &lt;int&gt; &lt;dbl&gt;
- 1         1  1996
+ 1         1  1981
  2         1  1988
- 3         1  1979
- 4         1  1978
- 5         1  1983
- 6         1  1981
- 7         1  1993
- 8         1  1996
- 9         1  1992
-10         1  1978
+ 3         1  2006
+ 4         1  2016
+ 5         1  2002
+ 6         1  1985
+ 7         1  1979
+ 8         1  2000
+ 9         1  2006
+10         1  2016
 # … with 49,990 more rows</code></pre>
-<p>Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 pennies with replacement 1000 times and 50,000 = 50 <span class="math inline">\(\times\)</span> 1000. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 50 times, the value <code>2</code> 50 times, all the way through to the value <code>1000</code> 50 times.</p>
-<p>The default value of the <code>type</code> argument is <code>&quot;bootstrap&quot;</code>, so if the last line was written as <code>generate(reps = 1000)</code>, we’d obtain the same results.</p>
-<p><strong>Comparing with original workflow</strong>: Note that the steps up of the infer workflow so far produce the same results as the original workflow using the <code>rep_sample_n()</code> function we saw earlier. In other words, the following two code chunks produce similar results:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># infer workflow:                   # Original workflow:</span>
-pennies_sample <span class="op">%&gt;%</span><span class="st">                  </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st">        </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, 
-  <span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>)                            <span class="dt">reps =</span> <span class="dv">1000</span>)              </code></pre>
+<p>Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 pennies with replacement 1000 times and 50,000 = 50 <span class="math inline">\(\cdot\)</span> 1000.</p>
+<p>The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 50 times, the value <code>2</code> 50 times, all the way through to the value <code>1000</code> 50 times. The default value of the <code>type</code> argument is <code>&quot;bootstrap&quot;</code> in this scenario, so if the last line were written as <code>generate(reps = 1000)</code>, we’d obtain the same results.</p>
+<p><strong>Comparing with original workflow</strong>: Note that the steps of the <code>infer</code> workflow so far mirror the original workflow using the <code>rep_sample_n()</code> function we saw earlier. Because the resampling is random, the individual resampled values will differ from run to run, but the following two code chunks produce similar results:</p>
+<div class="sourceCode" id="cb282"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb282-1" data-line-number="1"><span class="co"># infer workflow:                   # Original workflow:</span></a>
+<a class="sourceLine" id="cb282-2" data-line-number="2">pennies_sample <span class="op">%&gt;%</span><span class="st">                  </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb282-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st">        </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, </a>
+<a class="sourceLine" id="cb282-4" data-line-number="4">  <span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>)                            <span class="dt">reps =</span> <span class="dv">1000</span>)              </a></code></pre></div>
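+<p>As a quick sanity check on the structure described above, the following sketch counts the rows in each resample using base R. It assumes the <code>dplyr</code>, <code>infer</code>, and <code>moderndive</code> packages are installed (we assume <code>moderndive</code> is what provides <code>pennies_sample</code>). The names <code>resamples</code> and <code>rows_per_replicate</code> are ours, introduced only for illustration. Based on the output shown earlier, we’d expect 50,000 rows in total, split into 1000 replicates of 50 rows each.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(infer)
+library(moderndive)  # assumed source of the pennies_sample data frame
+
+# Hypothetical name for the 1000 bootstrap resamples of size 50
+resamples &lt;- pennies_sample %&gt;%
+  specify(response = year) %&gt;%
+  generate(reps = 1000, type = &quot;bootstrap&quot;)
+
+# Total number of rows across all resamples
+nrow(resamples)
+
+# Number of rows in each resample, tabulated by replicate
+rows_per_replicate &lt;- table(resamples$replicate)
+length(rows_per_replicate)
+range(rows_per_replicate)</code></pre></div>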
 </div>
 <div id="calculate-summary-statistics" class="section level4 unnumbered">
 <h4>3. <code>calculate</code> summary statistics</h4>
 <div class="figure" style="text-align: center"><span id="fig:infer-calculate"></span>
-<img src="images/flowcharts/infer/calculate.png" alt="Diagram of calculate() summary statistics." width="70%" />
+<img src="images/flowcharts/infer/calculate.png" alt="Diagram of calculate() summary statistics." width="80%" height="80%" />
 <p class="caption">
 FIGURE 8.20: Diagram of calculate() summary statistics.
 </p>
 </div>
-<p>After we <code>generate()</code> many replicates of bootstrap resampling with replacement, we next want to summarize each of 1000 resamples of size 50 to a single statistic value. As seen in the diagram, the <code>calculate()</code>  function does this.</p>
-<p>In our case, we want to calculate the mean <code>year</code> for each bootstrap resample of size 50. To do so, we set the <code>stat</code> argument to <code>&quot;mean&quot;</code>. You can also set the <code>stat</code> argument to a variety of other common summary statistics, like <code>&quot;median&quot;</code>, <code>&quot;sum&quot;</code>, <code>&quot;sd&quot;</code> (standard deviation), and <code>&quot;prop&quot;</code> (proportion). To see a list of all possible summary statistics you can use, type <code>?calculate</code> to read the help file. We’ll use these <code>stat</code> functions throughout this book.</p>
-<p>Let’s save the result in a data frame called <code>bootstrap_distribution</code> and explore it’s contents:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distribution &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)
-bootstrap_distribution</code></pre>
+<p>After we <code>generate()</code> many replicates of bootstrap resampling with replacement, we next want to summarize each of the 1000 resamples of size 50 to a single sample statistic value. As seen in the diagram, the <code>calculate()</code>  function does this.</p>
+<p>In our case, we want to calculate the mean <code>year</code> for each bootstrap resample of size 50. To do so, we set the <code>stat</code> argument to <code>&quot;mean&quot;</code>. You can also set the <code>stat</code> argument to a variety of other common summary statistics, like <code>&quot;median&quot;</code>, <code>&quot;sum&quot;</code>, <code>&quot;sd&quot;</code> (standard deviation), and <code>&quot;prop&quot;</code> (proportion). To see a list of all possible summary statistics you can use, type <code>?calculate</code> and read the help file.</p>
+<p>Let’s save the result in a data frame called <code>bootstrap_distribution</code> and explore its contents:</p>
+<div class="sourceCode" id="cb283"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb283-1" data-line-number="1">bootstrap_distribution &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb283-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb283-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb283-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a>
+<a class="sourceLine" id="cb283-5" data-line-number="5">bootstrap_distribution</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate    stat
        &lt;int&gt;   &lt;dbl&gt;
- 1         1 1993.48
- 2         2 1993.8 
- 3         3 1996.88
- 4         4 1995.34
- 5         5 1996.98
- 6         6 1995.72
- 7         7 1995.36
- 8         8 1992.6 
- 9         9 1994.24
-10        10 1993.16
+ 1         1 1995.7 
+ 2         2 1994.04
+ 3         3 1993.62
+ 4         4 1994.5 
+ 5         5 1994.08
+ 6         6 1993.6 
+ 7         7 1995.26
+ 8         8 1996.64
+ 9         9 1994.3 
+10        10 1995.94
 # … with 990 more rows</code></pre>
-<p>Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 <code>replicate</code> values and the mean year for each bootstrap resample saved in the variable <code>stat</code>.</p>
-<p><strong>Comparing with original workflow</strong>: You may have recognized at this point that the <code>calculate()</code> step in the <code>infer</code> workflow produces the same output as the <code>group_by() %&gt;% summarize()</code> steps in the original workflow:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># infer workflow:                   # Original workflow:</span>
-pennies_sample <span class="op">%&gt;%</span><span class="st">                  </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st">        </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, 
-  <span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st">                        </span><span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st">              </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)            <span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">                                      </span><span class="kw">summarize</span>(<span class="dt">mean_year =</span> <span class="kw">mean</span>(year))</code></pre>
+<p>Observe that the resulting data frame has 1000 rows and 2 columns. The 1000 rows correspond to the 1000 <code>replicate</code> values, and the mean year for each bootstrap resample is saved in the variable <code>stat</code>.</p>
+<p><strong>Comparing with original workflow</strong>: You may have recognized at this point that the <code>calculate()</code> step in the <code>infer</code> workflow produces the same output as the <code>group_by() %&gt;% summarize()</code> steps in the original workflow.</p>
+<div class="sourceCode" id="cb285"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb285-1" data-line-number="1"><span class="co"># infer workflow:                   # Original workflow:</span></a>
+<a class="sourceLine" id="cb285-2" data-line-number="2">pennies_sample <span class="op">%&gt;%</span><span class="st">                  </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb285-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st">        </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>, </a>
+<a class="sourceLine" id="cb285-4" data-line-number="4">  <span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st">                        </span><span class="dt">reps =</span> <span class="dv">1000</span>) <span class="op">%&gt;%</span><span class="st">              </span></a>
+<a class="sourceLine" id="cb285-5" data-line-number="5"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)            <span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb285-6" data-line-number="6"><span class="st">                                      </span><span class="kw">summarize</span>(<span class="dt">stat =</span> <span class="kw">mean</span>(year))</a></code></pre></div>
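+<p>One way to convince yourself of this equivalence is to apply both approaches to the <em>same</em> set of resamples, so that randomness doesn’t get in the way of the comparison. The following is a minimal sketch under that assumption. The names <code>resamples</code>, <code>means_infer</code>, and <code>means_dplyr</code> are hypothetical and used only for illustration.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(infer)
+library(moderndive)  # assumed source of the pennies_sample data frame
+
+# Generate one set of bootstrap resamples to feed into both approaches
+resamples &lt;- pennies_sample %&gt;%
+  specify(response = year) %&gt;%
+  generate(reps = 1000, type = &quot;bootstrap&quot;)
+
+# infer approach: one mean per replicate, stored in stat
+means_infer &lt;- resamples %&gt;%
+  calculate(stat = &quot;mean&quot;)
+
+# dplyr approach: group by replicate, then summarize
+means_dplyr &lt;- resamples %&gt;%
+  group_by(replicate) %&gt;%
+  summarize(stat = mean(year))
+
+# Compare the two columns of resample means
+all.equal(means_infer$stat, means_dplyr$stat)</code></pre></div>
+<p>Since both approaches compute the mean of <code>year</code> within each <code>replicate</code>, we’d expect the two <code>stat</code> columns to agree.</p>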
 </div>
 <div id="visualize-the-results" class="section level4 unnumbered">
 <h4>4. <code>visualize</code> the results</h4>
 <div class="figure" style="text-align: center"><span id="fig:infer-visualize"></span>
-<img src="images/flowcharts/infer/visualize.png" alt="Diagram of visualize() results." width="100%" />
+<img src="images/flowcharts/infer/visualize.png" alt="Diagram of visualize() results." width="70%" />
 <p class="caption">
 FIGURE 8.21: Diagram of visualize() results.
 </p>
 </div>
-<p>The <code>visualize()</code>  verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical <code>stat</code> variable’s values.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution)</code></pre>
+<p>The <code>visualize()</code>  verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical <code>stat</code> variable’s values. The pipeline of the main <code>infer</code> verbs used for exploring bootstrap distribution results is shown in Figure <a href="8-confidence-intervals.html#fig:infer-visualize">8.21</a>.</p>
+<div class="sourceCode" id="cb286"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb286-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:boostrap-distribution-infer"></span>
-<img src="moderndive_files/figure-html/boostrap-distribution-infer-1.png" alt="Bootstrap distribution." width="\textwidth" />
+<img src="ModernDive_files/figure-html/boostrap-distribution-infer-1.png" alt="Bootstrap distribution." width="\textwidth" />
 <p class="caption">
 FIGURE 8.22: Bootstrap distribution.
 </p>
 </div>
-<p><strong>Comparing with original workflow</strong>: In fact, <code>visualize()</code> is a <em>wrapper function</em> for the <code>ggplot()</code> function that uses a <code>geom_histogram()</code> layer. Recall that we illustrated the concept of a wrapper function in Figure <a href="5-regression.html#fig:moderndive-figure-wrapper">5.5</a> in Section <a href="5-regression.html#model1table">5.1.2</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># infer workflow:                    # Original workflow:</span>
-<span class="kw">visualize</span>(bootstrap_distribution)    <span class="kw">ggplot</span>(bootstrap_distribution, 
-                                            <span class="kw">aes</span>(<span class="dt">x =</span> stat)) <span class="op">+</span>
-<span class="st">                                       </span><span class="kw">geom_histogram</span>()</code></pre>
+<p><strong>Comparing with original workflow</strong>: In fact, <code>visualize()</code> is a <em>wrapper function</em> for the <code>ggplot()</code> function that uses a <code>geom_histogram()</code> layer. Recall that we illustrated the concept of a wrapper function in Figure <a href="5-regression.html#fig:moderndive-figure-wrapper">5.5</a> in Subsection <a href="5-regression.html#model1table">5.1.2</a>.</p>
+<div class="sourceCode" id="cb287"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb287-1" data-line-number="1"><span class="co"># infer workflow:                    # Original workflow:</span></a>
+<a class="sourceLine" id="cb287-2" data-line-number="2"><span class="kw">visualize</span>(bootstrap_distribution)    <span class="kw">ggplot</span>(bootstrap_distribution, </a>
+<a class="sourceLine" id="cb287-3" data-line-number="3">                                            <span class="kw">aes</span>(<span class="dt">x =</span> stat)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb287-4" data-line-number="4"><span class="st">                                       </span><span class="kw">geom_histogram</span>()</a></code></pre></div>
 <p>The <code>visualize()</code> function can take many other arguments which we’ll see momentarily to customize the plot further. It also works with helper functions to do the shading of the histogram values corresponding to the confidence interval values.</p>
-<p>Let’s recap the steps of the <code>infer</code> workflow for constructing a bootstrap distribution and then visualizing it.</p>
+<p>Let’s recap the steps of the <code>infer</code> workflow for constructing a bootstrap distribution and then visualizing it in Figure <a href="8-confidence-intervals.html#fig:infer-workflow-ci">8.23</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:infer-workflow-ci"></span>
 <img src="images/flowcharts/infer/ci_diagram.png" alt="infer package workflow for confidence intervals." width="100%" />
 <p class="caption">
 FIGURE 8.23: infer package workflow for confidence intervals.
 </p>
 </div>
-<p>Recall how we introduced two different methods for constructing 95% confidence intervals for an unknown population parameter in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>: the <em>percentile method</em> and the <em>standard error method</em>. Let’s now check out the <code>infer</code> package code that explicitly constructs these. There are also some additional neat functions to visualize the resulting confidence intervals built-in!</p>
+<p>Recall how we introduced two different methods for constructing 95% confidence intervals for an unknown population parameter in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>: the <em>percentile method</em> and the <em>standard error method</em>. Let’s now check out the <code>infer</code> package code that explicitly constructs these. There are also some additional neat functions built into the <code>infer</code> package to visualize the resulting confidence intervals!</p>
 </div>
 </div>
 <div id="percentile-method-infer" class="section level3">
-<h3><span class="header-section-number">8.4.3</span> Percentile method with infer</h3>
-<p>Recall the percentile method for constructing 95% confidence intervals we introduced in Section <a href="8-confidence-intervals.html#percentile-method">8.3.1</a>. This method sets the lower endpoint of the confidence interval at the 2.5<sup>th</sup> percentile of the bootstrap distribution and similarly sets the upper endpoint at the 97.5<sup>th</sup> percentile. The resulting interval captures the middle 95% of the values of the sample mean in the bootstrap distribution.</p>
-<p>We can compute the 95% confidence interval by piping the <code>bootstrap_distribution</code> data frame we created into the <code>get_confidence_interval()</code>  function from the <code>infer</code> package, with the confidence <code>level</code> set to 0.95 and the confidence interval <code>type</code> to be percentile. Let’s save the results in <code>percentile_ci</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">percentile_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)
-percentile_ci</code></pre>
+<h3><span class="header-section-number">8.4.3</span> Percentile method with <code>infer</code></h3>
+<p>Recall the percentile method for constructing 95% confidence intervals we introduced in Subsection <a href="8-confidence-intervals.html#percentile-method">8.3.1</a>. This method sets the lower endpoint of the confidence interval at the 2.5th percentile of the bootstrap distribution and similarly sets the upper endpoint at the 97.5th percentile. The resulting interval captures the middle 95% of the values of the sample mean in the bootstrap distribution.</p>
+<p>We can compute the 95% confidence interval by piping <code>bootstrap_distribution</code> into the <code>get_confidence_interval()</code>  function from the <code>infer</code> package, with the confidence <code>level</code> set to 0.95 and the confidence interval <code>type</code> to be <code>&quot;percentile&quot;</code>. Let’s save the results in <code>percentile_ci</code>.</p>
+<div class="sourceCode" id="cb288"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb288-1" data-line-number="1">percentile_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb288-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)</a>
+<a class="sourceLine" id="cb288-3" data-line-number="3">percentile_ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
    `2.5%` `97.5%`
     &lt;dbl&gt;   &lt;dbl&gt;
-1 1991.16 1999.58</code></pre>
-<p>Alternatively, we can visualize the interval (1991.16, 1999.58) by piping the <code>bootstrap_distribution</code> data frame into the <code>visualize()</code> function and adding a <code>shade_confidence_interval()</code>  layer. We set the <code>endpoints</code> argument to be <code>percentile_ci</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci)</code></pre>
+1 1991.24 1999.42</code></pre>
+<p>Alternatively, we can visualize the interval (1991.24, 1999.42) by piping the <code>bootstrap_distribution</code> data frame into the <code>visualize()</code> function and adding a <code>shade_confidence_interval()</code>  layer. We set the <code>endpoints</code> argument to be <code>percentile_ci</code>.</p>
+<div class="sourceCode" id="cb290"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb290-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb290-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci)</a></code></pre></div>
+
 <div class="figure" style="text-align: center"><span id="fig:percentile-ci-viz"></span>
-<img src="moderndive_files/figure-html/percentile-ci-viz-1.png" alt="Percentile method 95 percent confidence interval shaded corresponding to potential values." width="\textwidth" />
+<img src="ModernDive_files/figure-html/percentile-ci-viz-1.png" alt="Percentile method 95% confidence interval shaded corresponding to potential values." width="\textwidth" />
 <p class="caption">
-FIGURE 8.24: Percentile method 95 percent confidence interval shaded corresponding to potential values.
+FIGURE 8.24: Percentile method 95% confidence interval shaded corresponding to potential values.
 </p>
 </div>
 <p>Observe in Figure <a href="8-confidence-intervals.html#fig:percentile-ci-viz">8.24</a> that 95% of the sample means stored in the <code>stat</code> variable in <code>bootstrap_distribution</code> fall between the two endpoints marked with the darker lines, with 2.5% of the sample means to the left of the shaded area and 2.5% of the sample means to the right. You also have the option to change the colors of the shading using the <code>color</code> and <code>fill</code> arguments.</p>
-<p>You can also use the shorter named function <code>shade_ci()</code> and the results will be the same. This is for folks that don’t want to type out all of <code>confidence_interval</code> and prefer to type out <code>ci</code> instead. Try out the following code!</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_ci</span>(<span class="dt">endpoints =</span> percentile_ci, <span class="dt">color =</span> <span class="st">&quot;hotpink&quot;</span>, <span class="dt">fill =</span> <span class="st">&quot;khaki&quot;</span>)</code></pre>
+<p>You can also use the shorter named function <code>shade_ci()</code> and the results will be the same. This is for folks who don’t want to type out all of <code>confidence_interval</code> and prefer to type out <code>ci</code> instead. Try out the following code!</p>
+<div class="sourceCode" id="cb291"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb291-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb291-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_ci</span>(<span class="dt">endpoints =</span> percentile_ci, <span class="dt">color =</span> <span class="st">&quot;hotpink&quot;</span>, <span class="dt">fill =</span> <span class="st">&quot;khaki&quot;</span>)</a></code></pre></div>
 </div>
 <div id="infer-se" class="section level3">
-<h3><span class="header-section-number">8.4.4</span> Standard error method with infer</h3>
-<p>Recall the standard error method for constructing 95% confidence intervals we introduced in Section <a href="8-confidence-intervals.html#se-method">8.3.2</a>. For any distribution that is normally shaped, roughly 95% of the values lie within two standard deviations of the mean. In the case of the bootstrap distribution, the standard deviation has a special name: the standard error.</p>
-<p>So in our case, 95% of values of the bootstrap distribution will lie within <span class="math inline">\(\pm\)</span> 1.96 standard errors of <span class="math inline">\(\overline{x}\)</span>. Thus, a 95% confidence interval is <span class="math inline">\(\overline{x} \pm 1.96 \cdot SE\)</span> = <span class="math inline">\((\overline{x} - 1.96 \cdot SE,\)</span> <span class="math inline">\(\overline{x} + 1.96 \cdot SE)\)</span>.</p>
+<h3><span class="header-section-number">8.4.4</span> Standard error method with <code>infer</code></h3>
+<p>Recall the standard error method for constructing 95% confidence intervals we introduced in Subsection <a href="8-confidence-intervals.html#se-method">8.3.2</a>. For any distribution that is normally shaped, roughly 95% of the values lie within two standard deviations of the mean. In the case of the bootstrap distribution, the standard deviation has a special name: the <em>standard error</em>.</p>
+<p>So in our case, 95% of values of the bootstrap distribution will lie within <span class="math inline">\(\pm 1.96\)</span> standard errors of <span class="math inline">\(\overline{x}\)</span>. Thus, a 95% confidence interval is</p>
+<p><span class="math display">\[\overline{x} \pm 1.96 \cdot SE = (\overline{x} - 1.96 \cdot SE, \, \overline{x} + 1.96 \cdot SE).\]</span></p>
 <p>Computation of the 95% confidence interval can once again be done by piping the <code>bootstrap_distribution</code> data frame we created into the <code>get_confidence_interval()</code> function. However, this time we first set the <code>type</code> argument to be <code>&quot;se&quot;</code>. Second, we must specify the <code>point_estimate</code> argument in order to set the center of the confidence interval. We set this to be the sample mean of the original sample of 50 pennies, 1995.44.</p>
-<pre class="sourceCode r"><code class="sourceCode r">standard_error_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, <span class="dt">point_estimate =</span> <span class="fl">1995.44</span>)
-standard_error_ci</code></pre>
+<!-- point_estimate = 1995.44 -->
+<div class="sourceCode" id="cb292"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb292-1" data-line-number="1">x_bar</a></code></pre></div>
+<pre><code># A tibble: 1 x 1
+  mean_year
+      &lt;dbl&gt;
+1   1995.44</code></pre>
+<div class="sourceCode" id="cb294"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb294-1" data-line-number="1">standard_error_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb294-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, <span class="dt">point_estimate =</span> x_bar)</a>
+<a class="sourceLine" id="cb294-3" data-line-number="3">standard_error_ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
     lower   upper
     &lt;dbl&gt;   &lt;dbl&gt;
-1 1991.16 1999.72</code></pre>
-<p>If we would like to visualize the interval (1991.16, 1999.72), we can once again pipe the <code>bootstrap_distribution</code> data frame into the <code>visualize()</code> function and add a <code>shade_confidence_interval()</code> layer to our plot. We set the <code>endpoints</code> argument to be <code>standard_error_ci</code>. The resulting standard-error method based 95% confidence interval for <span class="math inline">\(\mu\)</span> can be seen in Figure <a href="8-confidence-intervals.html#fig:se-ci-viz">8.25</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> standard_error_ci)</code></pre>
+1 1991.35 1999.53</code></pre>
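+<p>As a rough check of the formula, we can back out the standard error implied by this output. Note that this is an illustrative calculation from the rounded endpoints printed above, not a value reported by <code>infer</code>. The half-width of the interval is (1999.53 - 1991.35)/2 = 4.09, so</p>
+<p><span class="math display">\[SE \approx \frac{4.09}{1.96} \approx 2.09, \qquad \overline{x} \pm 1.96 \cdot SE \approx 1995.44 \pm 4.09 = (1991.35, \, 1999.53).\]</span></p>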
+<p>If we would like to visualize the interval (1991.35, 1999.53), we can once again pipe the <code>bootstrap_distribution</code> data frame into the <code>visualize()</code> function and add a <code>shade_confidence_interval()</code> layer to our plot. We set the <code>endpoints</code> argument to be <code>standard_error_ci</code>. The resulting standard-error-method-based 95% confidence interval for <span class="math inline">\(\mu\)</span> can be seen in Figure <a href="8-confidence-intervals.html#fig:se-ci-viz">8.25</a>.</p>
+
+<div class="sourceCode" id="cb296"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb296-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb296-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> standard_error_ci)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:se-ci-viz"></span>
-<img src="moderndive_files/figure-html/se-ci-viz-1.png" alt="Standard error method 95 percent confidence interval." width="\textwidth" />
+<img src="ModernDive_files/figure-html/se-ci-viz-1.png" alt="Standard-error-method 95% confidence interval." width="\textwidth" />
 <p class="caption">
-FIGURE 8.25: Standard error method 95 percent confidence interval.
+FIGURE 8.25: Standard-error-method 95% confidence interval.
 </p>
 </div>
 <p>As noted in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>, both methods produce similar confidence intervals:</p>
 <ul>
-<li>Percentile method: (1991.16, 1999.58)</li>
-<li>Standard error method: (1991.16, 1999.72)</li>
+<li>Percentile method: (1991.24, 1999.42)</li>
+<li>Standard error method: (1991.35, 1999.53)</li>
 </ul>
 <div class="learncheck">
 <p>
@@ -1395,23 +1413,23 @@ <h3><span class="header-section-number">8.4.4</span> Standard error method with
 </div>
 <div id="one-prop-ci" class="section level2">
 <h2><span class="header-section-number">8.5</span> Interpreting confidence intervals</h2>
-<p>Now that we’ve shown you how to construct confidence intervals using a sample drawn from a population, let’s now focus on how to interpret their effectiveness. The effectiveness of a confidence interval is judged by whether or not it contains the true value of the population parameter. Going back to our fishing analogy in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>, this is like asking “Did our net capture the fish?”</p>
-<p>So for example, does our percentile-based confidence interval of (1991.16, 1999.58) “capture” the true mean year <span class="math inline">\(\mu\)</span> of <em>all</em> US pennies? Alas, we’ll never know, because we don’t know what the true value of <span class="math inline">\(\mu\)</span> is. After all, we’re sampling to estimate it!</p>
+<p>Now that we’ve shown you how to construct confidence intervals using a sample drawn from a population, let’s focus on how to interpret their effectiveness. The effectiveness of a confidence interval is judged by whether or not it contains the true value of the population parameter. Going back to our fishing analogy in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a>, this is like asking, “Did our net capture the fish?”</p>
+<p>So, for example, does our percentile-based confidence interval of (1991.24, 1999.42) “capture” the true mean year <span class="math inline">\(\mu\)</span> of <em>all</em> US pennies? Alas, we’ll never know, because we don’t know what the true value of <span class="math inline">\(\mu\)</span> is. After all, we’re sampling to estimate it!</p>
 <p>In order to interpret a confidence interval’s effectiveness, we need to <em>know</em> what the value of the population parameter is. That way we can say whether or not a confidence interval “captured” this value.</p>
 <p>Let’s revisit our sampling bowl from Chapter <a href="7-sampling.html#sampling">7</a>. What proportion of the bowl’s 2400 balls are red? Let’s compute this:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">p_red =</span> <span class="kw">mean</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>))</code></pre>
+<div class="sourceCode" id="cb297"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb297-1" data-line-number="1">bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb297-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">p_red =</span> <span class="kw">mean</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_red
   &lt;dbl&gt;
 1 0.375</code></pre>
 <p>In this case, we <em>know</em> what the value of the population parameter is: we know that the population proportion <span class="math inline">\(p\)</span> is 0.375. In other words, we know that 37.5% of the bowl’s balls are red.</p>
-<p>As we stated in Subsection <a href="7-sampling.html#moral-of-the-story">7.3.3</a>, the sampling bowl exercise doesn’t really reflect how sampling is done in real-life, but rather was an <em>idealized</em> activity. In real-life, we won’t know what the true value of the population parameter is, hence the need for estimation.</p>
-<p>Let’s now construct confidence intervals for <span class="math inline">\(p\)</span> using our 33 groups of friends’ samples from the bowl in Chapter <a href="7-sampling.html#sampling">7</a>. We’ll then see if the confidence intervals “captured” the true value of <span class="math inline">\(p\)</span>, which we know to be 37.5%. In other words: “Did net capture the fish?”</p>
+<p>As we stated in Subsection <a href="7-sampling.html#moral-of-the-story">7.3.3</a>, the sampling bowl exercise doesn’t really reflect how sampling is done in real life, but rather was an <em>idealized</em> activity. In real life, we won’t know what the true value of the population parameter is, hence the need for estimation.</p>
+<p>Let’s now construct confidence intervals for <span class="math inline">\(p\)</span> using our 33 groups of friends’ samples from the bowl in Chapter <a href="7-sampling.html#sampling">7</a>. We’ll then see if the confidence intervals “captured” the true value of <span class="math inline">\(p\)</span>, which we know to be 37.5%. That is to say, “Did the net capture the fish?”</p>
 <div id="ilyas-yohan" class="section level3">
 <h3><span class="header-section-number">8.5.1</span> Did the net capture the fish?</h3>
-<p>Recall that we had 33 groups of friends each take samples of size 50 from the bowl and then compute the sample proportion of red <span class="math inline">\(\widehat{p}\)</span>. This resulted in 33 such estimates of <span class="math inline">\(p\)</span>. Let’s focus on Ilyas and Yohan’s sample, which is saved in the <code>bowl_sample_1</code> data frame in the <code>moderndive</code> package:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl_sample_<span class="dv">1</span></code></pre>
+<p>Recall that we had 33 groups of friends each take samples of size 50 from the bowl and then compute the sample proportion of red balls <span class="math inline">\(\widehat{p}\)</span>. This resulted in 33 such estimates of <span class="math inline">\(p\)</span>. Let’s focus on Ilyas and Yohan’s sample, which is saved in the <code>bowl_sample_1</code> data frame in the <code>moderndive</code> package:</p>
+<div class="sourceCode" id="cb299"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb299-1" data-line-number="1">bowl_sample_<span class="dv">1</span></a></code></pre></div>
 <pre><code># A tibble: 50 x 1
    color
    &lt;chr&gt;
@@ -1427,17 +1445,17 @@ <h3><span class="header-section-number">8.5.1</span> Did the net capture the fis
 10 white
 # … with 40 more rows</code></pre>
 <p>They observed 21 red balls out of 50 and thus their sample proportion <span class="math inline">\(\widehat{p}\)</span> was 21/50 = 0.42 = 42%. Think of this as the “spear” from our fishing analogy.</p>
-<p>Let’s now follow the <code>infer</code> package workflow from Section <a href="8-confidence-intervals.html#infer-workflow">8.4.2</a> to create a percentile method based 95% confidence interval for <span class="math inline">\(p\)</span> using Ilyas and Yohan’s sample. Think of this as the “net.”</p>
+<p>Let’s now follow the <code>infer</code> package workflow from Subsection <a href="8-confidence-intervals.html#infer-workflow">8.4.2</a> to create a percentile-method-based 95% confidence interval for <span class="math inline">\(p\)</span> using Ilyas and Yohan’s sample. Think of this as the “net.”</p>
 <div id="specify-variables-1" class="section level4 unnumbered">
 <h4>1. <code>specify</code> variables</h4>
 <p>First, we <code>specify()</code> the <code>response</code> variable of interest <code>color</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color)</code></pre>
+<div class="sourceCode" id="cb301"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb301-1" data-line-number="1">bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb301-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color)</a></code></pre></div>
 <pre><code>Error: A level of the response variable `color` needs to be specified for the `success`
 argument in `specify()`.</code></pre>
 <p>Whoops! We need to define which event is of interest! <code>red</code> or <code>white</code> balls? Since we are interested in the proportion red, let’s set <code>success</code> to be <code>&quot;red&quot;</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb303"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb303-1" data-line-number="1">bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb303-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>)</a></code></pre></div>
 <pre><code>Response: color (factor)
 # A tibble: 50 x 1
    color
@@ -1457,9 +1475,9 @@ <h4>1. <code>specify</code> variables</h4>
 <div id="generate-replicates-1" class="section level4 unnumbered">
 <h4>2. <code>generate</code> replicates</h4>
 <p>Second, we <code>generate()</code> 1000 replicates of <em>bootstrap resampling with replacement</em> from <code>bowl_sample_1</code> by setting <code>reps = 1000</code> and <code>type = &quot;bootstrap&quot;</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb305"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb305-1" data-line-number="1">bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb305-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb305-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>)</a></code></pre></div>
 <pre><code>Response: color (factor)
 # A tibble: 50,000 x 2
 # Groups:   replicate [1,000]
@@ -1467,68 +1485,67 @@ <h4>2. <code>generate</code> replicates</h4>
        &lt;int&gt; &lt;fct&gt;
  1         1 white
  2         1 white
- 3         1 red  
+ 3         1 white
  4         1 white
- 5         1 white
+ 5         1 red  
  6         1 white
  7         1 white
- 8         1 red  
+ 8         1 white
  9         1 white
-10         1 white
+10         1 red  
 # … with 49,990 more rows</code></pre>
-<p>Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 balls with replacement 1000 times and thus 50,000 = 50 <span class="math inline">\(\times\)</span> 1000. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 50 times, the value <code>2</code> 50 times, all the way through to the value <code>1000</code> 50 times.</p>
+<p>Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 balls with replacement 1000 times and thus 50,000 = 50 <span class="math inline">\(\cdot\)</span> 1000. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 50 times, the value <code>2</code> 50 times, all the way through to the value <code>1000</code> 50 times.</p>
 </div>
 <div id="calculate-summary-statistics-1" class="section level4 unnumbered">
 <h4>3. <code>calculate</code> summary statistics</h4>
-<p>Third, we summarize each of 1000 resamples of size 50 with the proportion of “successes”. In other words, the proportion of the balls that are <code>&quot;red&quot;</code>. We can set the summary statistic to be calculated to be the proportion by setting the <code>stat</code> argument to be <code>&quot;prop&quot;</code>. Let’s save the result in a data frame called <code>sample_1_bootstrap</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">sample_<span class="dv">1</span>_bootstrap &lt;-<span class="st"> </span>bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)
-sample_<span class="dv">1</span>_bootstrap</code></pre>
+<p>Third, we summarize each of the 1000 resamples of size 50 with the proportion of <em>successes</em>, in other words, the proportion of the balls that are <code>&quot;red&quot;</code>. We can set the summary statistic to be calculated as the proportion by setting the <code>stat</code> argument to be <code>&quot;prop&quot;</code>. Let’s save the result as <code>sample_1_bootstrap</code>:</p>
+<div class="sourceCode" id="cb307"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb307-1" data-line-number="1">sample_<span class="dv">1</span>_bootstrap &lt;-<span class="st"> </span>bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb307-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb307-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb307-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</a>
+<a class="sourceLine" id="cb307-5" data-line-number="5">sample_<span class="dv">1</span>_bootstrap</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate  stat
        &lt;int&gt; &lt;dbl&gt;
- 1         1  0.36
+ 1         1  0.32
  2         2  0.42
- 3         3  0.52
- 4         4  0.38
- 5         5  0.38
- 6         6  0.38
- 7         7  0.46
- 8         8  0.3 
- 9         9  0.5 
-10        10  0.46
+ 3         3  0.44
+ 4         4  0.4 
+ 5         5  0.44
+ 6         6  0.52
+ 7         7  0.38
+ 8         8  0.44
+ 9         9  0.34
+10        10  0.42
 # … with 990 more rows</code></pre>
 <p>Observe there are 1000 rows in this data frame and thus 1000 values of the variable <code>stat</code>. These 1000 values of <code>stat</code> represent our 1000 replicated values of the proportion, each based on a different resample.</p>
 </div>
 <div id="visualize-the-results-1" class="section level4 unnumbered">
 <h4>4. <code>visualize</code> the results</h4>
 <p>Fourth and lastly, let’s compute the resulting 95% confidence interval.</p>
-<pre class="sourceCode r"><code class="sourceCode r">percentile_ci_<span class="dv">1</span> &lt;-<span class="st"> </span>sample_<span class="dv">1</span>_bootstrap <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)
-percentile_ci_<span class="dv">1</span></code></pre>
+<div class="sourceCode" id="cb309"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb309-1" data-line-number="1">percentile_ci_<span class="dv">1</span> &lt;-<span class="st"> </span>sample_<span class="dv">1</span>_bootstrap <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb309-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)</a>
+<a class="sourceLine" id="cb309-3" data-line-number="3">percentile_ci_<span class="dv">1</span></a></code></pre></div>
 <pre><code># A tibble: 1 x 2
-  `2.5%`  `97.5%`
-   &lt;dbl&gt;    &lt;dbl&gt;
-1   0.28 0.540500</code></pre>
+  `2.5%` `97.5%`
+   &lt;dbl&gt;   &lt;dbl&gt;
+1    0.3    0.56</code></pre>
 <p>Let’s visualize the bootstrap distribution along with the <code>percentile_ci_1</code> percentile-based 95% confidence interval for <span class="math inline">\(p\)</span> in Figure <a href="8-confidence-intervals.html#fig:shovel-bootstrap-1-infer">8.26</a>. We’ll adjust the number of bins to better see the resulting shape. Furthermore, we’ll add a dashed vertical line at Ilyas and Yohan’s observed <span class="math inline">\(\widehat{p}\)</span> = 21/50 = 0.42 = 42% using <code>geom_vline()</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">sample_<span class="dv">1</span>_bootstrap <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">bins =</span> <span class="dv">15</span>) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci_<span class="dv">1</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_vline</span>(<span class="dt">xintercept =</span> <span class="fl">0.375</span>, <span class="dt">linetype =</span> <span class="st">&quot;dashed&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb311"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb311-1" data-line-number="1">sample_<span class="dv">1</span>_bootstrap <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb311-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">bins =</span> <span class="dv">15</span>) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb311-3" data-line-number="3"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci_<span class="dv">1</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb311-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_vline</span>(<span class="dt">xintercept =</span> <span class="fl">0.42</span>, <span class="dt">linetype =</span> <span class="st">&quot;dashed&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:shovel-bootstrap-1-infer"></span>
-<img src="moderndive_files/figure-html/shovel-bootstrap-1-infer-1.png" alt="Bootstrap distribution." width="\textwidth" />
+<img src="ModernDive_files/figure-html/shovel-bootstrap-1-infer-1.png" alt="Bootstrap distribution." width="\textwidth" />
 <p class="caption">
 FIGURE 8.26: Bootstrap distribution.
 </p>
 </div>
-<p>Did Ilyas and Yohan’s net capture the fish? In other words, did their 95% confidence interval for <span class="math inline">\(p\)</span> based on their sample contain the true value of <span class="math inline">\(p\)</span> of 0.375? Yes! 0.375 is between the endpoints of our confidence interval (0.28, 0.54).</p>
+<p>Did Ilyas and Yohan’s net capture the fish? Did their 95% confidence interval for <span class="math inline">\(p\)</span> based on their sample contain the true value of <span class="math inline">\(p\)</span> of 0.375? Yes! 0.375 is between the endpoints of their confidence interval (0.3, 0.56).</p>
 <p>However, will <em>every</em> 95% confidence interval for <span class="math inline">\(p\)</span> capture this value? In other words, if we had a different sample of 50 balls and constructed a different confidence interval, would it necessarily contain <span class="math inline">\(p\)</span> = 0.375 as well? Let’s see!</p>
 <p>Let’s first take a different sample from the bowl, this time using the computer as we did in Chapter <a href="7-sampling.html#sampling">7</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bowl_sample_<span class="dv">2</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>)
-bowl_sample_<span class="dv">2</span></code></pre>
+<div class="sourceCode" id="cb312"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb312-1" data-line-number="1">bowl_sample_<span class="dv">2</span> &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>)</a>
+<a class="sourceLine" id="cb312-2" data-line-number="2">bowl_sample_<span class="dv">2</span></a></code></pre></div>
 <pre><code># A tibble: 50 x 3
 # Groups:   replicate [1]
    replicate ball_ID color
@@ -1544,77 +1561,81 @@ <h4>4. <code>visualize</code> the results</h4>
  9         1    1951 white
 10         1    2061 white
 # … with 40 more rows</code></pre>
-<p>Let’s reapply the same <code>infer</code> functions on <code>bowl_sample_2</code> to generate a different 95% confidence interval for <span class="math inline">\(p\)</span>. First we create the new bootstrap distribution and save the results in <code>sample_2_bootstrap</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">sample_<span class="dv">2</span>_bootstrap &lt;-<span class="st"> </span>bowl_sample_<span class="dv">2</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)
-sample_<span class="dv">2</span>_bootstrap</code></pre>
+<p>Let’s reapply the same <code>infer</code> functions on <code>bowl_sample_2</code> to generate a different 95% confidence interval for <span class="math inline">\(p\)</span>. First, we create the new bootstrap distribution and save the results in <code>sample_2_bootstrap</code>:</p>
+<div class="sourceCode" id="cb314"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb314-1" data-line-number="1">sample_<span class="dv">2</span>_bootstrap &lt;-<span class="st"> </span>bowl_sample_<span class="dv">2</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb314-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, </a>
+<a class="sourceLine" id="cb314-3" data-line-number="3">          <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb314-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, </a>
+<a class="sourceLine" id="cb314-5" data-line-number="5">           <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb314-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</a>
+<a class="sourceLine" id="cb314-7" data-line-number="7">sample_<span class="dv">2</span>_bootstrap</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate  stat
        &lt;int&gt; &lt;dbl&gt;
- 1         1  0.36
+ 1         1  0.48
  2         2  0.38
- 3         3  0.42
- 4         4  0.26
- 5         5  0.5 
- 6         6  0.32
- 7         7  0.4 
- 8         8  0.32
- 9         9  0.5 
-10        10  0.44
+ 3         3  0.32
+ 4         4  0.32
+ 5         5  0.34
+ 6         6  0.26
+ 7         7  0.3 
+ 8         8  0.36
+ 9         9  0.44
+10        10  0.36
 # … with 990 more rows</code></pre>
 <p>We once again compute a percentile-based 95% confidence interval for <span class="math inline">\(p\)</span>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">percentile_ci_<span class="dv">2</span> &lt;-<span class="st"> </span>sample_<span class="dv">2</span>_bootstrap <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)
-percentile_ci_<span class="dv">2</span></code></pre>
+<div class="sourceCode" id="cb316"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb316-1" data-line-number="1">percentile_ci_<span class="dv">2</span> &lt;-<span class="st"> </span>sample_<span class="dv">2</span>_bootstrap <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb316-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)</a>
+<a class="sourceLine" id="cb316-3" data-line-number="3">percentile_ci_<span class="dv">2</span></a></code></pre></div>
 <pre><code># A tibble: 1 x 2
   `2.5%` `97.5%`
    &lt;dbl&gt;   &lt;dbl&gt;
-1   0.22     0.5</code></pre>
-<p>Does this new net capture the fish? In other words, does the 95% confidence interval for <span class="math inline">\(p\)</span> based on the new sample contain the true value of <span class="math inline">\(p\)</span> of 0.375? Yes again! 0.375 is between the endpoints of our confidence interval (0.22, 0.5).</p>
+1    0.2    0.48</code></pre>
+<p>Does this new net capture the fish? In other words, does the 95% confidence interval for <span class="math inline">\(p\)</span> based on the new sample contain the true value of <span class="math inline">\(p\)</span> of 0.375? Yes again! 0.375 is between the endpoints of our confidence interval (0.2, 0.48).</p>
 <p>Let’s now repeat this process 100 more times: we take 100 virtual samples from the bowl and construct 100 95% confidence intervals. Let’s visualize the results in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a> where:</p>
 <ol style="list-style-type: decimal">
-<li>We mark the true value of <span class="math inline">\(p\)</span> = 0.375 with a vertical line.</li>
+<li>We mark the true value of <span class="math inline">\(p = 0.375\)</span> with a vertical line.</li>
 <li>We mark each of the 100 95% confidence intervals with horizontal lines. These are the “nets.”</li>
 <li>The horizontal line is colored grey if the confidence interval “captures” the true value of <span class="math inline">\(p\)</span> marked with the vertical line. The horizontal line is colored black otherwise.</li>
 </ol>
+
 <div class="figure" style="text-align: center"><span id="fig:reliable-percentile"></span>
-<img src="moderndive_files/figure-html/reliable-percentile-1.png" alt="100 percentile-based 95 percent confidence intervals for $p$." width="\textwidth" />
+<img src="ModernDive_files/figure-html/reliable-percentile-1.png" alt="100 percentile-based 95% confidence intervals for \(p\)." width="\textwidth" />
 <p class="caption">
-FIGURE 8.27: 100 percentile-based 95 percent confidence intervals for <span class="math inline">\(p\)</span>.
+FIGURE 8.27: 100 percentile-based 95% confidence intervals for <span class="math inline">\(p\)</span>.
 </p>
 </div>
-<p>Of the 100 95% confidence intervals, 96 of them captured the true value <span class="math inline">\(p\)</span> = 0.375, whereas 4 of them didn’t. In other words, 96 of our nets caught the fish, whereas 4 of our nets didn’t.</p>
-<p>This is where the “95% confidence level” we defined in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a> comes into play: for every 100 95% confidence intervals, we <em>expect</em> that 95 of them will capture <span class="math inline">\(p\)</span> and that 5 of them won’t.</p>
-<p>Note that “expect” is a probabilistic statement referring to a long-run average. In other words, for every 100 confidence intervals, we will observe <em>about</em> 95 confidence intervals that capture <span class="math inline">\(p\)</span>, but not necessarily exactly 95. In Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a> for example, 96 of the confidence intervals capture <span class="math inline">\(p\)</span>.</p>
+<p>Of the 100 95% confidence intervals, 95 of them captured the true value <span class="math inline">\(p = 0.375\)</span>, whereas 5 of them didn’t. In other words, 95 of our nets caught the fish, whereas 5 of our nets didn’t.</p>
+<p>This is where the “95% confidence level” we defined in Section <a href="8-confidence-intervals.html#ci-build-up">8.3</a> comes into play: for every 100 95% confidence intervals, we <em>expect</em> that 95 of them will capture <span class="math inline">\(p\)</span> and that five of them won’t.</p>
+<p>Note that “expect” is a probabilistic statement referring to a long-run average. In other words, for every 100 confidence intervals, we will observe <em>about</em> 95 confidence intervals that capture <span class="math inline">\(p\)</span>, but not necessarily exactly 95. In Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a> for example, 95 of the confidence intervals capture <span class="math inline">\(p\)</span>.</p>
 <p>To further accentuate our point about confidence levels, let’s generate a figure similar to Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a>, but this time constructing 80% confidence intervals using the standard error method instead. Let’s visualize the results in Figure <a href="8-confidence-intervals.html#fig:reliable-se">8.28</a> with the scale on the x-axis being the same as in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a> to make comparison easy. Furthermore, since all standard error method confidence intervals for <span class="math inline">\(p\)</span> are centered at their respective point estimates <span class="math inline">\(\widehat{p}\)</span>, we mark this value on each line with dots.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:reliable-se"></span>
-<img src="moderndive_files/figure-html/reliable-se-1.png" alt="100 SE-based 80 percent confidence intervals for $p$ with point estimate center marked with dots." width="\textwidth" />
+<img src="ModernDive_files/figure-html/reliable-se-1.png" alt="100 SE-based 80% confidence intervals for \(p\) with point estimate center marked with dots." width="\textwidth" />
 <p class="caption">
-FIGURE 8.28: 100 SE-based 80 percent confidence intervals for <span class="math inline">\(p\)</span> with point estimate center marked with dots.
+FIGURE 8.28: 100 SE-based 80% confidence intervals for <span class="math inline">\(p\)</span> with point estimate center marked with dots.
 </p>
 </div>
-<p>Observe how the 80% confidence intervals are narrower than the 95% confidence intervals, reflecting our lower degree of confidence. Think of this as using a smaller “net.” We’ll explore other determinants of confidence interval width in the upcoming Section <a href="8-confidence-intervals.html#ci-width">8.5.3</a>.</p>
+<p>Observe how the 80% confidence intervals are narrower than the 95% confidence intervals, reflecting our lower degree of confidence. Think of this as using a smaller “net.” We’ll explore other determinants of confidence interval width in the upcoming Subsection <a href="8-confidence-intervals.html#ci-width">8.5.3</a>.</p>
 <p>Furthermore, observe that of the 100 80% confidence intervals, 82 of them captured the population proportion <span class="math inline">\(p\)</span> = 0.375, whereas 18 of them did not. Since we lowered the confidence level from 95% to 80%, we now have a much larger number of confidence intervals that failed to “catch the fish.”</p>
 </div>
 </div>
 <div id="shorthand" class="section level3">
-<h3><span class="header-section-number">8.5.2</span> Precise &amp; shorthand interpretation</h3>
+<h3><span class="header-section-number">8.5.2</span> Precise and shorthand interpretation</h3>
 <p></p>
 <p>Let’s return our attention to 95% confidence intervals. The precise and mathematically correct interpretation of a 95% confidence interval is a little long-winded:</p>
 <blockquote>
 <p>Precise interpretation: If we repeated our sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population parameter.</p>
 </blockquote>
-<p>This is what we observed in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a>. Our confidence interval construction procedure is 95% “reliable.” In other words, we can expect our confidence intervals to include the true population parameter about 95% of the time.</p>
+<p>This is what we observed in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a>. Our confidence interval construction procedure is 95% <em>reliable</em>. That is to say, we can expect our confidence intervals to include the true population parameter about 95% of the time.</p>
 <p>A common but incorrect interpretation is: “There is a 95% probability that the confidence interval contains <span class="math inline">\(p\)</span>.” Looking at Figure <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a>, each of the confidence intervals either does or doesn’t contain <span class="math inline">\(p\)</span>. In other words, the probability is either a 1 or a 0.</p>
-<p>So if the 95% confidence level only relates to the reliability of the confidence interval construction procedure and not to a given confidence interval itself, what insight can be derived from a given confidence interval? For example, going back to the pennies example, we found that the percentile method 95% confidence interval for <span class="math inline">\(\mu\)</span> was (1991.16, 1999.58) whereas the standard error method 95% confidence interval was (1991.16, 1999.72). What can be said about these two intervals?</p>
+<p>So if the 95% confidence level only relates to the reliability of the confidence interval construction procedure and not to a given confidence interval itself, what insight can be derived from a given confidence interval? For example, going back to the pennies example, we found that the percentile method 95% confidence interval for <span class="math inline">\(\mu\)</span> was (1991.24, 1999.42), whereas the standard error method 95% confidence interval was (1991.35, 1999.53). What can be said about these two intervals?</p>
 <p>Loosely speaking, we can think of these intervals as our “best guess” of a plausible range of values for the mean year <span class="math inline">\(\mu\)</span> of <em>all</em> US pennies. For the rest of this book, we’ll use the following shorthand summary of the precise interpretation.</p>
 <blockquote>
 <p>Shorthand interpretation: We are 95% “confident” that a 95% confidence interval captures the value of the population parameter.</p>
 </blockquote>
 <p>We use quotation marks around “confident” to emphasize that while 95% relates to the reliability of our confidence interval construction procedure, ultimately a constructed confidence interval is our best guess of an interval that contains the population parameter. In other words, it’s our best net.</p>
-<p>So returning to our pennies example and focusing on the percentile-method, we are 95% “confident” that the true mean year of pennies in circulation in 2019 is somewhere between 1991.16 and 1999.58.</p>
+<p>So returning to our pennies example and focusing on the percentile method, we are 95% “confident” that the true mean year of pennies in circulation in 2019 is somewhere between 1991.24 and 1999.42.</p>
 </div>
 <div id="ci-width" class="section level3">
 <h3><span class="header-section-number">8.5.3</span> Width of confidence intervals</h3>
@@ -1622,14 +1643,16 @@ <h3><span class="header-section-number">8.5.3</span> Width of confidence interva
 <div id="impact-of-confidence-level" class="section level4 unnumbered">
 <h4>Impact of confidence level</h4>
 <p>One factor that determines confidence interval widths is the pre-specified confidence level. For example, in Figures <a href="8-confidence-intervals.html#fig:reliable-percentile">8.27</a> and <a href="8-confidence-intervals.html#fig:reliable-se">8.28</a>, we compared the widths of 95% and 80% confidence intervals and observed that the 95% confidence intervals were wider. The quantification of the confidence level should match what many expect of the word “confident.” In order to be more confident in our best guess of a range of values, we need to widen the range of values.</p>
-<p>To elaborate on this, imagine we want to guess the forecasted high temperature in Seoul, South Korea on August 15th. Given Seoul’s temperate climate with 4 distinct seasons, we could say somewhat confidently that the high temperature would be between 50°F - 95°F (10°C - 35°C). However, if we wanted a temperature range we were <em>absolutely</em> confident about, would we need to widen it.</p>
-<p>We need this wider range to allow for the possibility of anomalous weather, like a freak cold spell or an extreme heat wave. So a range of temperatures we could be near certain about would be between 32°F - 110°F (0°C - 43°C). On the other hand, if could tolerate being a little less confident, we could narrow this range to between 70°F - 85°F (21°C - 30°C).</p>
-<p>Let’s revisit our sampling bowl from Chapter <a href="7-sampling.html#sampling">7</a>. Let’s compare <span class="math inline">\(10 \times 3 = 30\)</span> confidence intervals for <span class="math inline">\(p\)</span> based on three different confidence levels: 80%, 95%, and 99%. Specifically, we’ll first take 30 different random samples of size <span class="math inline">\(n\)</span> = 50 balls from the bowl. Then we’ll construct 10 percentile-based confidence intervals using each of the three different confidence levels. Finally, we’ll compare the widths of these intervals. We visualize the resulting confidence intervals in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile-80-95-99">8.29</a> along with a vertical line marking the true value of <span class="math inline">\(p\)</span> = 0.375.</p>
+<p>To elaborate on this, imagine we want to guess the forecasted high temperature in Seoul, South Korea, on August 15th. Given Seoul’s temperate climate with four distinct seasons, we could say somewhat confidently that the high temperature would be between 50°F and 95°F (10°C to 35°C). However, if we wanted a temperature range we were <em>absolutely</em> confident about, we would need to widen it.</p>
+<p>We need this wider range to allow for the possibility of anomalous weather, like a freak cold spell or an extreme heat wave. So a range of temperatures we could be nearly certain about would be between 32°F and 110°F (0°C to 43°C). On the other hand, if we could tolerate being a little less confident, we could narrow this range to between 70°F and 85°F (21°C to 30°C).</p>
+<p>Let’s revisit our sampling bowl from Chapter <a href="7-sampling.html#sampling">7</a>. Let’s compare <span class="math inline">\(10 \cdot 3 = 30\)</span> confidence intervals for <span class="math inline">\(p\)</span> based on three different confidence levels: 80%, 95%, and 99%.</p>
+<p>Specifically, we’ll first take 30 different random samples of size <span class="math inline">\(n\)</span> = 50 balls from the bowl. Then we’ll construct 10 percentile-based confidence intervals using each of the three different confidence levels.</p>
+<p>Finally, we’ll compare the widths of these intervals. We visualize the resulting confidence intervals in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile-80-95-99">8.29</a> along with a vertical line marking the true value of <span class="math inline">\(p\)</span> = 0.375.</p>
 <!-- 
 Chester says: Should we load the perc_cis_by_level and percentile_cis_by_n data
-frames into the moderndive package too so that readers can explore them a bit?
+frames into the moderndive package too so that readers can explore them a bit? No need to include the code as well that generates them in the book.
 
-Albert says: I totally agree. However making the code to replicate this process
+Albert says: I totally agree. However, making the code to replicate this process
 student-friendly is going to take a lot of work and this chapter is getting
 rather large as is, so let's punt until next edition. For now, let's just show
 the resulting faceted plots comparing:
@@ -1646,16 +1669,17 @@ <h4>Impact of confidence level</h4>
 
 We see that the sample proportion of reds varies in the `point_estimate` column with varying `lower` and `upper` bounds as well depending on the variability of the bootstrap distribution. The width of the confidence intervals appears to increase from left to right going from 80% confidence levels to 95% and then to 99%. Let's now compute the confidence interval (CI) width for each of these intervals and then get the median and mean length.
 -->
+
 <div class="figure" style="text-align: center"><span id="fig:reliable-percentile-80-95-99"></span>
-<img src="moderndive_files/figure-html/reliable-percentile-80-95-99-1.png" alt="Ten 80, 95, and 99 percent confidence intervals for $p$ based on $n = 50$." width="\textwidth" />
+<img src="ModernDive_files/figure-html/reliable-percentile-80-95-99-1.png" alt="Ten 80, 95, and 99% confidence intervals for \(p\) based on \(n = 50\)." width="\textwidth" />
 <p class="caption">
-FIGURE 8.29: Ten 80, 95, and 99 percent confidence intervals for <span class="math inline">\(p\)</span> based on <span class="math inline">\(n = 50\)</span>.
+FIGURE 8.29: Ten 80, 95, and 99% confidence intervals for <span class="math inline">\(p\)</span> based on <span class="math inline">\(n = 50\)</span>.
 </p>
 </div>
-<p>Observe that as the confidence level increases from 80% to 95% to 99%, the confidence intervals tend to get wider. Let’s compare their average widths in Table <a href="8-confidence-intervals.html#tab:perc-cis-average-width">8.2</a>.</p>
+<p>Observe that as the confidence level increases from 80% to 95% to 99%, the confidence intervals tend to get wider, as seen in Table <a href="8-confidence-intervals.html#tab:perc-cis-average-width">8.2</a>, where we compare their average widths.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:perc-cis-average-width">TABLE 8.2: </span>Average width of 80, 95, and 99 percent confidence intervals.
+<span id="tab:perc-cis-average-width">TABLE 8.2: </span>Average width of 80, 95, and 99% confidence intervals
 </caption>
 <thead>
 <tr>
@@ -1673,7 +1697,7 @@ <h4>Impact of confidence level</h4>
 80%
 </td>
 <td style="text-align:right;">
-0.166
+0.162
 </td>
 </tr>
 <tr>
@@ -1681,7 +1705,7 @@ <h4>Impact of confidence level</h4>
 95%
 </td>
 <td style="text-align:right;">
-0.264
+0.262
 </td>
 </tr>
 <tr>
@@ -1694,24 +1718,23 @@ <h4>Impact of confidence level</h4>
 </tr>
 </tbody>
 </table>
-<p>So in order to have a higher confidence level, our confidence intervals must be wider. Ideally, we would have both a high confidence level and narrow confidence intervals. However, we cannot have it both ways. If we want to “be more confident”, we need to allow for wider intervals. Conversely, if we would like a narrow interval, we must tolerate a lower confidence level.</p>
-<p>The moral of the story is:  <strong>Higher confidence levels tend to produce wider confidence intervals.</strong> However, when looking at Figure <a href="8-confidence-intervals.html#fig:reliable-percentile-80-95-99">8.29</a> it is important to keep in mind that we kept the sample size fixed at <span class="math inline">\(n\)</span> = 50. In other words, all <span class="math inline">\(10 \times 3 = 30\)</span> random samples from the <code>bowl</code> had the same sample size.</p>
-<p>What happens if instead we took samples of different sizes? Recall that we did this in Section <a href="7-sampling.html#different-shovels">7.2.4</a> using virtual shovels with 25, 50, and 100 slots. We delve into this next.</p>
-
+<p>So in order to have a higher confidence level, our confidence intervals must be wider. Ideally, we would have both a high confidence level and narrow confidence intervals. However, we cannot have it both ways. If we want to <em>be more confident</em>, we need to allow for wider intervals. Conversely, if we would like a narrow interval, we must tolerate a lower confidence level.</p>
+<p>The moral of the story is:  <strong>Higher confidence levels tend to produce wider confidence intervals.</strong> When looking at Figure <a href="8-confidence-intervals.html#fig:reliable-percentile-80-95-99">8.29</a>, it is important to keep in mind that we kept the sample size fixed at <span class="math inline">\(n\)</span> = 50. Thus, all <span class="math inline">\(10 \cdot 3 = 30\)</span> random samples from the <code>bowl</code> had the same sample size. What happens if instead we took samples of different sizes? Recall that we did this in Subsection <a href="7-sampling.html#different-shovels">7.2.4</a> using virtual shovels with 25, 50, and 100 slots. <!-- We delve into this next. --></p>
 </div>
 <div id="impact-of-sample-size" class="section level4 unnumbered">
 <h4>Impact of sample size</h4>
-<p>This time, let’s fix the confidence level at 95%, but consider three different sample sizes <span class="math inline">\(n\)</span>: 25, 50, and 100. Specifically, we’ll first take 10 different random samples of size 25, 10 different random samples of size 50, and 10 different random samples of size 100. We’ll then construct 95% percentile-based confidence intervals. Finally, we’ll compare the widths of these intervals. We visualize the resulting 30 confidence intervals in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile-n-25-50-100">8.30</a>. Note also the vertical line marking the true value of <span class="math inline">\(p\)</span> = 0.375.</p>
+<p>This time, let’s fix the confidence level at 95%, but consider three different sample sizes for <span class="math inline">\(n\)</span>: 25, 50, and 100. Specifically, we’ll first take 10 different random samples of size 25, 10 different random samples of size 50, and 10 different random samples of size 100. We’ll then construct 95% percentile-based confidence intervals for each sample. Finally, we’ll compare the widths of these intervals. We visualize the resulting 30 confidence intervals in Figure <a href="8-confidence-intervals.html#fig:reliable-percentile-n-25-50-100">8.30</a>. Note also the vertical line marking the true value of <span class="math inline">\(p\)</span> = 0.375.</p>
+
 <div class="figure" style="text-align: center"><span id="fig:reliable-percentile-n-25-50-100"></span>
-<img src="moderndive_files/figure-html/reliable-percentile-n-25-50-100-1.png" alt="Ten 95 percent confidence intervals for $p$ based on n = 25, 50, and 100." width="\textwidth" />
+<img src="ModernDive_files/figure-html/reliable-percentile-n-25-50-100-1.png" alt="Ten 95% confidence intervals for \(p\) with \(n = 25, 50,\) and \(100\)." width="\textwidth" />
 <p class="caption">
-FIGURE 8.30: Ten 95 percent confidence intervals for <span class="math inline">\(p\)</span> based on n = 25, 50, and 100.
+FIGURE 8.30: Ten 95% confidence intervals for <span class="math inline">\(p\)</span> with <span class="math inline">\(n = 25, 50,\)</span> and <span class="math inline">\(100\)</span>.
 </p>
 </div>
 <p>Observe that as the confidence intervals are constructed from larger and larger sample sizes, they tend to get narrower. Let’s compare the average widths in Table <a href="8-confidence-intervals.html#tab:perc-cis-average-width-2">8.3</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:perc-cis-average-width-2">TABLE 8.3: </span>Average width of 95 percent confidence intervals based on n = 25, 50, and 100.
+<span id="tab:perc-cis-average-width-2">TABLE 8.3: </span>Average width of 95% confidence intervals based on <span class="math inline">\(n = 25\)</span>, <span class="math inline">\(50\)</span>, and <span class="math inline">\(100\)</span>
 </caption>
 <thead>
 <tr>
@@ -1737,7 +1760,7 @@ <h4>Impact of sample size</h4>
 n = 50
 </td>
 <td style="text-align:right;">
-0.270
+0.268
 </td>
 </tr>
 <tr>
@@ -1745,27 +1768,27 @@ <h4>Impact of sample size</h4>
 n = 100
 </td>
 <td style="text-align:right;">
-0.183
+0.189
 </td>
 </tr>
 </tbody>
 </table>
-<p>The moral of the story is:  <strong>Larger sample sizes tend to produce narrower confidence intervals.</strong> Recall that this was a key message in Section <a href="7-sampling.html#moral-of-the-story">7.3.3</a>. As we used larger and larger shovels for our samples, the sample proportions red <span class="math inline">\(\widehat{p}\)</span> tended to vary less. In other words, our estimates got more and more <em>precise</em>.</p>
+<p>The moral of the story is:  <strong>Larger sample sizes tend to produce narrower confidence intervals.</strong> Recall that this was a key message in Subsection <a href="7-sampling.html#moral-of-the-story">7.3.3</a>. As we used larger and larger shovels for our samples, the sample proportions red <span class="math inline">\(\widehat{p}\)</span> tended to vary less. In other words, our estimates got more and more <em>precise</em>.</p>
 <p>Recall that we visualized these results in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a>, where we compared the <em>sampling distributions</em> for <span class="math inline">\(\widehat{p}\)</span> based on samples of size <span class="math inline">\(n\)</span> equal to 25, 50, and 100. We also quantified the sampling variation of these sampling distributions using their standard deviation, which has that special name: the <em>standard error</em>. So as the sample size increases, the standard error decreases.</p>
 <p>In fact, the standard error is another related factor in determining confidence interval width. We’ll explore this fact in Subsection <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> when we discuss theory-based methods for constructing confidence intervals using mathematical formulas. Such methods are an alternative to the computer-based methods we’ve been using so far.</p>
 <!-- 
-A good learning check might be to have the readers calculate confidence intervals when n = 1000, 2000, 2400. To their astonishment (maybe), they'll see that the size of the confidence interval is 0 when they get to 2400. 
+A good Learning check might be to have the readers calculate confidence intervals when n = 1000, 2000, 2400. To their astonishment (maybe), they'll see that the size of the confidence interval is 0 when they get to 2400. 
 -->
 </div>
 </div>
 </div>
 <div id="case-study-two-prop-ci" class="section level2">
 <h2><span class="header-section-number">8.6</span> Case study: Is yawning contagious?</h2>
-<p>Let’s apply our knowledge of confidence intervals to answer the question: “Is yawning contagious?” If you see someone else yawn, are you more likely to yawn? In an episode of the US show <a href="http://www.discovery.com/tv-shows/mythbusters/mythbusters-database/yawning-contagious/"><em>Mythbusters</em></a>, the hosts conducted an experiment to answer this question. The episode is available to view in the United States on the Discovery Network website <a href="https://www.discovery.com/tv-shows/mythbusters/videos/is-yawning-contagious">here</a> and more information about the episode is also available on <a href="https://www.imdb.com/title/tt0768479/">IMDb</a>.</p>
+<p>Let’s apply our knowledge of confidence intervals to answer the question: “Is yawning contagious?”. If you see someone else yawn, are you more likely to yawn? In an episode of the US show <a href="http://www.discovery.com/tv-shows/mythbusters/mythbusters-database/yawning-contagious/"><em>Mythbusters</em></a>, the hosts conducted an experiment to answer this question. The episode is available to view in the United States on the Discovery Network website <a href="https://www.discovery.com/tv-shows/mythbusters/videos/is-yawning-contagious">here</a> and more information about the episode is also available on <a href="https://www.imdb.com/title/tt0768479/">IMDb</a>.</p>
 <div id="mythbusters-study-data" class="section level3">
-<h3><span class="header-section-number">8.6.1</span> Mythbusters study data</h3>
-<p>Fifty adult participants who thought they were being considered for an appearance on the show were interviewed by a show recruiter. In the interview, the recruiter either yawned or did not. Participants then sat by themselves in a large van and were asked to wait. While in the van, the Mythbusters team watched the participants using a hidden camera to see if they yawned. The data frame containing the results of their experiment is available in the <code>mythbusters_yawn</code> data frame included in the <code>moderndive</code> package: </p>
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn</code></pre>
+<h3><span class="header-section-number">8.6.1</span> <em>Mythbusters</em> study data</h3>
+<p>Fifty adult participants who thought they were being considered for an appearance on the show were interviewed by a show recruiter. In the interview, the recruiter either yawned or did not. Participants then sat by themselves in a large van and were asked to wait. While in the van, the <em>Mythbusters</em> team watched the participants using a hidden camera to see if they yawned. The results of their experiment are available in the <code>mythbusters_yawn</code> data frame included in the <code>moderndive</code> package: </p>
+<div class="sourceCode" id="cb318"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb318-1" data-line-number="1">mythbusters_yawn</a></code></pre></div>
 <pre><code># A tibble: 50 x 3
     subj group   yawn 
    &lt;int&gt; &lt;chr&gt;   &lt;chr&gt;
@@ -1788,9 +1811,9 @@ <h3><span class="header-section-number">8.6.1</span> Mythbusters study data</h3>
 </ul>
 <p>Recall that you learned about treatment and response variables in Subsection <a href="5-regression.html#correlation-is-not-causation">5.3.1</a> in our discussion on confounding variables. </p>
 <p>Let’s use some data wrangling to obtain counts of the four possible outcomes:</p>
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(group, yawn) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</code></pre>
+<div class="sourceCode" id="cb320"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb320-1" data-line-number="1">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb320-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(group, yawn) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb320-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a></code></pre></div>
 <pre><code># A tibble: 4 x 3
 # Groups:   group [2]
   group   yawn  count
@@ -1800,17 +1823,13 @@ <h3><span class="header-section-number">8.6.1</span> Mythbusters study data</h3>
 3 seed    no       24
 4 seed    yes      10</code></pre>
 <p>Let’s first focus on the <code>&quot;control&quot;</code> group participants who were not exposed to yawning. 12 such participants did not yawn, while 4 such participants did. So out of the 16 people who were not exposed to yawning, 4/16 = 0.25 = 25% did yawn.</p>
-<p>Let’s now focus on the <code>&quot;seed&quot;</code> group participants who were exposed to yawning. 24 such participants did not yawn, while 10 such participants did yawn. So out of the 34 people who were exposed to yawning, 10/34 = 0.294 = 29.4% did yawn.</p>
-<p>Comparing these two percentages, the participants who were exposed to yawning yawned 29.4% - 25% = 4.4% more often than those who were not.</p>
+<p>Let’s now focus on the <code>&quot;seed&quot;</code> group participants who were exposed to yawning. 24 such participants did not yawn, while 10 such participants did yawn. So out of the 34 people who were exposed to yawning, 10/34 = 0.294 = 29.4% did yawn. Comparing these two percentages, the participants who were exposed to yawning yawned 29.4% - 25% = 4.4% more often than those who were not.</p>
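+<p>As a check on the arithmetic above, here is a minimal sketch that computes the two sample proportions directly from the data. It assumes the <code>dplyr</code> and <code>moderndive</code> packages are installed; the column names <code>n</code> and <code>prop_yawned</code> are illustrative choices, not names used elsewhere in the book.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+# For each group, count participants and compute the proportion who yawned
+mythbusters_yawn %&gt;% 
+  group_by(group) %&gt;% 
+  summarize(n = n(), prop_yawned = mean(yawn == &quot;yes&quot;))</code></pre>
+<p>The <code>prop_yawned</code> values should match the 25% and 29.4% figures computed by hand.</p>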
 </div>
 <div id="sampling-scenario" class="section level3">
 <h3><span class="header-section-number">8.6.2</span> Sampling scenario</h3>
-<p>Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Section <a href="7-sampling.html#terminology-and-notation">7.3.1</a>.</p>
-<p>In Chapter <a href="7-sampling.html#sampling">7</a> our <em>study population</em> was the bowl of <span class="math inline">\(N\)</span> = 2400 balls. Our <em>population parameter</em> of interest was the <em>population proportion</em> of these balls that were red, denoted mathematically by <span class="math inline">\(p\)</span>. In order to estimate <span class="math inline">\(p\)</span>, we extracted a sample of 50 balls using the shovel and computed the relevant <em>point estimate</em>: the <em>sample proportion</em> that were red, denoted mathematically by <span class="math inline">\(\widehat{p}\)</span>.</p>
-<p>Who is the study population here? All humans? All the people who watch the show Mythbusters? It’s hard to say! This question can only be answered if we know how the show’s hosts recruited participants! In other words, what was the <em>sampling methodology</em> used by the Mythbusters to recruit participants?</p>
-<p>We alas are not provided with this information. Only for the purposes of this case study, however, we’ll <em>assume</em> that the 50 participants are a representative sample of all Americans given the popularity of this show. Thus, we’ll be assuming that any results of this experiment will generalize to all <span class="math inline">\(N\)</span> = 327 million Americans (2018 population).</p>
-<p>Just like with our sampling bowl, the population parameter here will involve proportions. However, in this case it will be the <em>difference in population proportions</em> <span class="math inline">\(p_{seed} - p_{control}\)</span>, where <span class="math inline">\(p_{seed}\)</span> is the proportion of <em>all</em> Americans who if exposed to yawning will yawn themselves, and <span class="math inline">\(p_{control}\)</span> is the proportion of <em>all</em> Americans who if not exposed to yawning still yawn themselves.</p>
-<p>Correspondingly, the point estimate/sample statistic based the Mythbusters’ sample of participants will be the <em>difference in sample proportions</em> <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span>. Let’s extend Table <a href="7-sampling.html#tab:table-ch8">7.5</a> of scenarios of sampling for inference to include our latest scenario.</p>
+<p>Let’s review the terminology and notation related to sampling we studied in Subsection <a href="7-sampling.html#terminology-and-notation">7.3.1</a>. In Chapter <a href="7-sampling.html#sampling">7</a> our <em>study population</em> was the bowl of <span class="math inline">\(N\)</span> = 2400 balls. Our <em>population parameter</em> of interest was the <em>population proportion</em> of these balls that were red, denoted mathematically by <span class="math inline">\(p\)</span>. In order to estimate <span class="math inline">\(p\)</span>, we extracted a sample of 50 balls using the shovel and computed the relevant <em>point estimate</em>: the <em>sample proportion</em> that were red, denoted mathematically by <span class="math inline">\(\widehat{p}\)</span>.</p>
+<p>Who is the study population here? All humans? All the people who watch the show <em>Mythbusters</em>? It’s hard to say! This question can only be answered if we know how the show’s hosts recruited participants! In other words, what was the <em>sampling methodology</em> used by the <em>Mythbusters</em> to recruit participants? Alas, we are not provided with this information. For the purposes of this case study only, however, we’ll <em>assume</em> that the 50 participants are a representative sample of all Americans given the popularity of this show. Thus, we’ll be assuming that any results of this experiment will generalize to all <span class="math inline">\(N\)</span> = 327 million Americans (2018 population).</p>
+<p>Just like with our sampling bowl, the population parameter here will involve proportions. However, in this case it will be the <em>difference in population proportions</em> <span class="math inline">\(p_{seed} - p_{control}\)</span>, where <span class="math inline">\(p_{seed}\)</span> is the proportion of <em>all</em> Americans who if exposed to yawning will yawn themselves, and <span class="math inline">\(p_{control}\)</span> is the proportion of <em>all</em> Americans who if not exposed to yawning still yawn themselves. Correspondingly, the point estimate/sample statistic based on the <em>Mythbusters</em>’ sample of participants will be the <em>difference in sample proportions</em> <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span>. Let’s extend Table <a href="7-sampling.html#tab:table-ch8">7.5</a> of scenarios of sampling for inference to include our latest scenario.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:table-ch8-c">TABLE 8.4: </span>Scenarios of sampling for inference
@@ -1830,7 +1849,7 @@ <h3><span class="header-section-number">8.6.2</span> Sampling scenario</h3>
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -1839,16 +1858,16 @@ <h3><span class="header-section-number">8.6.2</span> Sampling scenario</h3>
 <td style="text-align:right;width: 0.5in; ">
 1
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Population proportion
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(p\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Sample proportion
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\widehat{p}\)</span>
 </td>
 </tr>
@@ -1856,16 +1875,16 @@ <h3><span class="header-section-number">8.6.2</span> Sampling scenario</h3>
 <td style="text-align:right;width: 0.5in; ">
 2
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Population mean
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\mu\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Sample mean
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\overline{x}\)</span> or <span class="math inline">\(\widehat{\mu}\)</span>
 </td>
 </tr>
@@ -1873,33 +1892,33 @@ <h3><span class="header-section-number">8.6.2</span> Sampling scenario</h3>
 <td style="text-align:right;width: 0.5in; ">
 3
 </td>
-<td style="text-align:left;width: 0.7in; ">
+<td style="text-align:left;width: 1.5in; ">
 Difference in population proportions
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(p_1 - p_2\)</span>
 </td>
-<td style="text-align:left;width: 1.1in; ">
+<td style="text-align:left;width: 1.6in; ">
 Difference in sample proportions
 </td>
-<td style="text-align:left;width: 1in; ">
+<td style="text-align:left;width: 0.65in; ">
 <span class="math inline">\(\widehat{p}_1 - \widehat{p}_2\)</span>
 </td>
 </tr>
 </tbody>
 </table>
-<p>This is known as a <em>two-sample</em> inference situation since we have two separate samples. Based on their two-samples of size <span class="math inline">\(n_{seed}\)</span> = 34 and <span class="math inline">\(n_{control}\)</span> = 16, their point estimate is</p>
+<p>This is known as a <em>two-sample</em> inference situation since we have two separate samples. Based on the two samples of sizes <span class="math inline">\(n_{seed}\)</span> = 34 and <span class="math inline">\(n_{control}\)</span> = 16, the point estimate is</p>
 <p><span class="math display">\[
 \widehat{p}_{seed} - \widehat{p}_{control} = \frac{10}{34} - \frac{4}{16} = 0.04411765 \approx 4.4\%
 \]</span></p>
-<p>However, say the Mythbusters repeated this experiment. In other words, say they recruited 50 new participants and exposed 34 of them to yawning and 16 not. Would they obtain the exact same estimated difference of 4.4%? Probably not, again, because of <em>sampling variation</em>.</p>
-<p>How does this sampling variation affect their estimate of 4.4%? In other words, what would be a plausible range of values for this difference that accounts for this sampling variation? We can answer this question with confidence intervals! Furthermore, since the Mythbusters only have a single two-sample of 50 participants, the would have to construct a 95% confidence interval for <span class="math inline">\(p_{seed} - p_{control}\)</span> using <em>bootstrap resampling with replacement</em>.</p>
-<p>We make a couple of important notes. First, for the comparison between the <code>&quot;seed&quot;</code> and <code>&quot;control&quot;</code> groups to make sense however, both groups need to be <em>independent</em> from each other. Otherwise, they could influence each other’s results.</p>
+<p>However, say the <em>Mythbusters</em> repeated this experiment. In other words, say they recruited 50 new participants and exposed 34 of them to yawning and 16 not. Would they obtain the exact same estimated difference of 4.4%? Probably not, again, because of <em>sampling variation</em>.</p>
+<p>How does this sampling variation affect their estimate of 4.4%? In other words, what would be a plausible range of values for this difference that accounts for this sampling variation? We can answer this question with confidence intervals! Furthermore, since the <em>Mythbusters</em> only have a single two-sample of 50 participants, they would have to construct a 95% confidence interval for <span class="math inline">\(p_{seed} - p_{control}\)</span> using <em>bootstrap resampling with replacement</em>.</p>
+<p>We make a couple of important notes. First, for the comparison between the <code>&quot;seed&quot;</code> and <code>&quot;control&quot;</code> groups to make sense, both groups need to be <em>independent</em> from each other. Otherwise, they could influence each other’s results. This means that a participant being selected for the <code>&quot;seed&quot;</code> or <code>&quot;control&quot;</code> group has no influence on another participant being assigned to one of the two groups. As an example, if there were a mother and her child as participants in the study, they wouldn’t necessarily be in the same group. They would each be assigned randomly to one of the two groups of the explanatory variable.</p>
 <p>Second, the order of the subtraction in the difference doesn’t matter so long as you are consistent and tailor your interpretations accordingly. In other words, using a point estimate of <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span> or <span class="math inline">\(\widehat{p}_{control} - \widehat{p}_{seed}\)</span> does not make a material difference; you just need to stay consistent and interpret your results accordingly.</p>
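+<p>To make this concrete, here is the same point estimate with the subtraction reversed, using the counts of yawners in each group (4 of 16 in the control group, 10 of 34 in the seed group). Only the sign changes, so the interpretation flips from “the seed group yawned more often” to “the control group yawned less often”:</p>
+<p><span class="math display">\[
+\widehat{p}_{control} - \widehat{p}_{seed} = \frac{4}{16} - \frac{10}{34} = -0.04411765 \approx -4.4\%
+\]</span></p>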
 </div>
 <div id="ci-build" class="section level3">
 <h3><span class="header-section-number">8.6.3</span> Constructing the confidence interval</h3>
-<p>As we did in Section <a href="8-confidence-intervals.html#infer-workflow">8.4.2</a>, let’s first construct the bootstrap distribution for <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span> and then use this to construct 95% confidence intervals for <span class="math inline">\(p_{seed} - p_{control}\)</span>. We’ll do this using the <code>infer</code> workflow again. However, since the difference in proportions is a new scenario for inference, we’ll need to use some new arguments in the <code>infer</code> functions along the way.</p>
+<p>As we did in Subsection <a href="8-confidence-intervals.html#infer-workflow">8.4.2</a>, let’s first construct the bootstrap distribution for <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span> and then use this to construct 95% confidence intervals for <span class="math inline">\(p_{seed} - p_{control}\)</span>. We’ll do this using the <code>infer</code> workflow again. However, since the difference in proportions is a new scenario for inference, we’ll need to use some new arguments in the <code>infer</code> functions along the way.</p>
 <div id="specify-variables-2" class="section level4 unnumbered">
 <h4>1. <code>specify</code> variables</h4>
 <p>Let’s take our <code>mythbusters_yawn</code> data frame and <code>specify()</code> which variables are of interest using the <code>y ~ x</code> formula interface where:</p>
@@ -1907,13 +1926,13 @@ <h4>1. <code>specify</code> variables</h4>
 <li>Our response variable is <code>yawn</code>: whether or not a participant yawned. It has levels <code>&quot;yes&quot;</code> and <code>&quot;no&quot;</code>.</li>
 <li>The explanatory variable is <code>group</code>: whether or not a participant was exposed to yawning. It has levels <code>&quot;seed&quot;</code> (exposed to yawning) and <code>&quot;control&quot;</code> (not exposed to yawning).</li>
 </ul>
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group)</code></pre>
+<div class="sourceCode" id="cb322"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb322-1" data-line-number="1">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb322-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group)</a></code></pre></div>
 <pre><code>Error: A level of the response variable `yawn` needs to be 
 specified for the `success` argument in `specify()`.</code></pre>
-<p>Alas, we got an error message similar to the one from Subsection <a href="8-confidence-intervals.html#ilyas-yohan">8.5.1</a>: <code>infer</code> is telling us that one of the levels of the categorical variable <code>yawn</code> needs to be defined as the <code>success</code>. Recall that we define <code>success</code> to be the event of interest we are trying to count and compute proportions of. Are we interested in those participants who <code>&quot;yes&quot;</code> yawned or those who <code>&quot;no&quot;</code> didn’t yawn? This isn’t clear to R, so we need to set the <code>success</code> argument to <code>&quot;yes&quot;</code> as follows:</p>
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>)</code></pre>
+<p>Alas, we got an error message similar to the one from Subsection <a href="8-confidence-intervals.html#ilyas-yohan">8.5.1</a>: <code>infer</code> is telling us that one of the levels of the categorical variable <code>yawn</code> needs to be defined as the <code>success</code>. Recall that we define <code>success</code> to be the event of interest we are trying to count and compute proportions of. Are we interested in those participants who <code>&quot;yes&quot;</code> yawned or those who <code>&quot;no&quot;</code> didn’t yawn? This isn’t clear to R or someone just picking up the code and results for the first time, so we need to set the <code>success</code> argument to <code>&quot;yes&quot;</code> as follows to improve the transparency of the code:</p>
+<div class="sourceCode" id="cb324"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb324-1" data-line-number="1">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb324-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>)</a></code></pre></div>
 <pre><code>Response: yawn (factor)
 Explanatory: group (factor)
 # A tibble: 50 x 2
@@ -1933,9 +1952,10 @@ <h4>1. <code>specify</code> variables</h4>
 </div>
 <div id="generate-replicates-2" class="section level4 unnumbered">
 <h4>2. <code>generate</code> replicates</h4>
-<p>Our next step is to perform <em>bootstrap resampling with replacement</em> like we did with the slips of paper in our pennies activity in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a>. We saw how it works with both a single variable in computing bootstrap means in Subsection <a href="8-confidence-intervals.html#bootstrap-process">8.4</a> and in computing bootstrap proportions in Section <a href="8-confidence-intervals.html#one-prop-ci">8.5</a>, but we haven’t yet worked with bootstrapping involving multiple variables though.</p>
-<p>In the <code>infer</code> package, bootstrapping with multiple variables means that each <em>row</em> is potentially resampled. Let’s investigate this by looking at the first few rows of <code>mythbusters_yawn</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(mythbusters_yawn)</code></pre>
+<p>Our next step is to perform <em>bootstrap resampling with replacement</em> like we did with the slips of paper in our pennies activity in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a>. We saw how it works with both a single variable in computing bootstrap means in Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a> and in computing bootstrap proportions in Section <a href="8-confidence-intervals.html#one-prop-ci">8.5</a>, but we haven’t yet worked with bootstrapping involving multiple variables.</p>
+<p>In the <code>infer</code> package, bootstrapping with multiple variables means that each <em>row</em> is potentially resampled. Let’s investigate this by focusing only on the first six rows of <code>mythbusters_yawn</code>:</p>
+<div class="sourceCode" id="cb326"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb326-1" data-line-number="1">first_six_rows &lt;-<span class="st"> </span><span class="kw">head</span>(mythbusters_yawn)</a>
+<a class="sourceLine" id="cb326-2" data-line-number="2">first_six_rows</a></code></pre></div>
 <pre><code># A tibble: 6 x 3
    subj group   yawn 
   &lt;int&gt; &lt;chr&gt;   &lt;chr&gt;
@@ -1946,8 +1966,8 @@ <h4>2. <code>generate</code> replicates</h4>
 5     5 seed    no   
 6     6 control no   </code></pre>
 <p>When we bootstrap this data, we are potentially pulling the subject’s readings multiple times. Thus, we could see the entries of <code>&quot;seed&quot;</code> for <code>group</code> and <code>&quot;no&quot;</code> for <code>yawn</code> together in a new row in a bootstrap sample. This is further seen by exploring the <code>sample_n()</code> function in <code>dplyr</code> on this smaller 6-row data frame comprised of <code>head(mythbusters_yawn)</code>. The <code>sample_n()</code> function can perform this bootstrapping procedure and is similar to the <code>rep_sample_n()</code> function in <code>infer</code>, except that it is not repeated, but rather only performs one sample with or without replacement.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(mythbusters_yawn) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">6</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>)</code></pre>
+<div class="sourceCode" id="cb328"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb328-1" data-line-number="1">first_six_rows <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb328-2" data-line-number="2"><span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">6</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>)</a></code></pre></div>
 <pre><code># A tibble: 6 x 3
    subj group   yawn 
   &lt;int&gt; &lt;chr&gt;   &lt;chr&gt;
@@ -1957,120 +1977,124 @@ <h4>2. <code>generate</code> replicates</h4>
 4     5 seed    no   
 5     4 seed    yes  
 6     4 seed    yes  </code></pre>
-<p>We can see that in this bootstrap sample generated from the first six rows of <code>mythbusters_yawn</code>, we have some rows repeated. The same is true when we perform the <code>generate()</code> step in <code>infer</code> as done in what follows. Using this fact, we <code>generate</code> 1000 replicates, or in other words, we bootstrap resample the 50 participants with replacement 1000 times.</p>
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>)</code></pre>
+<p>We can see that in this bootstrap sample generated from the first six rows of <code>mythbusters_yawn</code>, we have some rows repeated. The same is true when we perform the <code>generate()</code> step in <code>infer</code> as done in what follows. Using this fact, we <code>generate</code> 1000 replicates, or, in other words, we bootstrap resample the 50 participants with replacement 1000 times.</p>
+<div class="sourceCode" id="cb330"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb330-1" data-line-number="1">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb330-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb330-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>)</a></code></pre></div>
 <pre><code>Response: yawn (factor)
 Explanatory: group (factor)
 # A tibble: 50,000 x 3
 # Groups:   replicate [1,000]
    replicate yawn  group  
        &lt;int&gt; &lt;fct&gt; &lt;fct&gt;  
- 1         1 no    seed   
- 2         1 no    seed   
- 3         1 yes   control
- 4         1 yes   seed   
- 5         1 no    control
+ 1         1 yes   seed   
+ 2         1 yes   control
+ 3         1 no    control
+ 4         1 no    control
+ 5         1 yes   seed   
  6         1 yes   seed   
- 7         1 no    control
- 8         1 no    seed   
+ 7         1 yes   seed   
+ 8         1 yes   seed   
  9         1 no    seed   
-10         1 no    seed   
+10         1 yes   seed   
 # … with 49,990 more rows</code></pre>
-<p>Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 participants with replacement 1000 times and 50,000 = 1000 <span class="math inline">\(\times\)</span> 50. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 50 times, the value <code>2</code> 50 times, all the way through to the value <code>1000</code> 50 times.</p>
+<p>Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 participants with replacement 1000 times and 50,000 = 1000 <span class="math inline">\(\cdot\)</span> 50. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 50 times, the value <code>2</code> 50 times, all the way through to the value <code>1000</code> 50 times.</p>
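+<p>One way to confirm this structure is to count the rows in each replicate. The following is a minimal sketch, assuming the <code>dplyr</code>, <code>infer</code>, and <code>moderndive</code> packages are installed; the names <code>rows</code>, <code>n_replicates</code>, <code>min_rows</code>, and <code>max_rows</code> are illustrative and do not appear elsewhere in the book.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(infer)
+library(moderndive)
+
+mythbusters_yawn %&gt;% 
+  specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% 
+  generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% 
+  # One row per replicate, recording how many resampled participants it holds
+  group_by(replicate) %&gt;% 
+  summarize(rows = n()) %&gt;% 
+  # Collapse to a single row: number of replicates and range of their sizes
+  summarize(n_replicates = n(), min_rows = min(rows), max_rows = max(rows))</code></pre>
+<p>If the resampling worked as described, this should report 1000 replicates, each containing exactly 50 rows.</p>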
 </div>
 <div id="calculate-summary-statistics-2" class="section level4 unnumbered">
 <h4>3. <code>calculate</code> summary statistics</h4>
 <p>After we <code>generate()</code> many replicates of bootstrap resampling with replacement, we next want to summarize the bootstrap resamples of size 50 with a single summary statistic, the difference in proportions. We do this by setting the <code>stat</code> argument to <code>&quot;diff in props&quot;</code>:</p>
 <!-- 
-Chester: A challenging learning check for those {dplyr} diehards is to get these values 
+Chester: A challenging Learning check for those {dplyr} diehards is to get these values 
 without using {infer}. It takes a double group_by() and some trickery, but could 
 be a good exercise for those that don't quite see the power of {infer}.
 
 Albert: Great idea!
 -->
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb332"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb332-1" data-line-number="1">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb332-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb332-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb332-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>)</a></code></pre></div>
 <pre><code>Error: Statistic is based on a difference; specify the `order` in which to
 subtract the levels of the explanatory variable.</code></pre>
 <p>We see another error here. We need to specify the order of the subtraction. Is it <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span> or <span class="math inline">\(\widehat{p}_{control} - \widehat{p}_{seed}\)</span>? We specify it to be <span class="math inline">\(\widehat{p}_{seed} - \widehat{p}_{control}\)</span> by setting <code>order = c(&quot;seed&quot;, &quot;control&quot;)</code>. Note that you could’ve also set <code>order = c(&quot;control&quot;, &quot;seed&quot;)</code>. As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly.</p>
 <p>Let’s save the output in a data frame <code>bootstrap_distribution_yawning</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distribution_yawning &lt;-<span class="st"> </span>mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;seed&quot;</span>, <span class="st">&quot;control&quot;</span>))
-bootstrap_distribution_yawning</code></pre>
+<div class="sourceCode" id="cb334"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb334-1" data-line-number="1">bootstrap_distribution_yawning &lt;-<span class="st"> </span>mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb334-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb334-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb334-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;seed&quot;</span>, <span class="st">&quot;control&quot;</span>))</a>
+<a class="sourceLine" id="cb334-5" data-line-number="5">bootstrap_distribution_yawning</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
-   replicate       stat
-       &lt;int&gt;      &lt;dbl&gt;
- 1         1 -0.0213904
- 2         2  0.0459770
- 3         3  0        
- 4         4 -0.0129870
- 5         5  0.326765 
- 6         6  0.122807 
- 7         7  0.293718 
- 8         8  0.0761905
- 9         9  0.0679117
-10        10 -0.0231729
+   replicate        stat
+       &lt;int&gt;       &lt;dbl&gt;
+ 1         1  0.0357143 
+ 2         2  0.229167  
+ 3         3  0.00952381
+ 4         4  0.0106952 
+ 5         5  0.00483092
+ 6         6  0.00793651
+ 7         7 -0.0845588 
+ 8         8 -0.00466200
+ 9         9  0.164686  
+10        10  0.124777  
 # … with 990 more rows</code></pre>
-<p>Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 <code>replicate</code> ID’s and the 1000 difference in proportions for each bootstrap resample in <code>stat</code>.</p>
+<p>Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 <code>replicate</code> IDs and the 1000 differences in proportions, one for each bootstrap resample, in <code>stat</code>.</p>
 </div>
 <div id="visualize-the-results-2" class="section level4 unnumbered">
 <h4>4. <code>visualize</code> the results</h4>
 <p>In Figure <a href="8-confidence-intervals.html#fig:bootstrap-distribution-mythbusters">8.31</a> we <code>visualize()</code> the resulting bootstrap resampling distribution. Let’s also add a vertical line at 0 by adding a <code>geom_vline()</code> layer.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution_yawning) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_vline</span>(<span class="dt">xintercept =</span> <span class="dv">0</span>)</code></pre>
+<div class="sourceCode" id="cb336"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb336-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution_yawning) <span class="op">+</span></a>
+<a class="sourceLine" id="cb336-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_vline</span>(<span class="dt">xintercept =</span> <span class="dv">0</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-mythbusters"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-mythbusters-1.png" alt="Bootstrap distribution." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-mythbusters-1.png" alt="Bootstrap distribution." width="\textwidth" />
 <p class="caption">
 FIGURE 8.31: Bootstrap distribution.
 </p>
 </div>
-<p>First, let’s compute the 95% confidence interval for <span class="math inline">\(p_{seed} - p_{control}\)</span> using the percentile method, in other words by identifying the 2.5<sup>th</sup> and 97.5<sup>th</sup> percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped.</p>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distribution_yawning <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>, <span class="dt">level =</span> <span class="fl">0.95</span>)</code></pre>
+<p>First, let’s compute the 95% confidence interval for <span class="math inline">\(p_{seed} - p_{control}\)</span> using the percentile method, in other words, by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped.</p>
+<div class="sourceCode" id="cb337"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb337-1" data-line-number="1">bootstrap_distribution_yawning <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb337-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>, <span class="dt">level =</span> <span class="fl">0.95</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
      `2.5%`  `97.5%`
       &lt;dbl&gt;    &lt;dbl&gt;
-1 -0.218313 0.304763</code></pre>
-<p>Second, since the bootstrap distribution is roughly bell-shaped, we can construct a confidence interval using the standard error method as well. Recall that to construct a confidence interval using the standard error method, we need to specify the center of the interval using the <code>point_estimate</code> argument. In our case, we need to set it to be the difference in sample proportions of 4.4% that the Mythbusters observed.</p>
-<p>However, we can also use the <code>infer</code> workflow to compute this value by excluding the <code>generate()</code> 1000 bootstrap replicates step. In other words, do not generate replicates, but rather use only the original sample data. We can achieve this by commenting out the <code>generate()</code> line, telling R to ignore it:</p>
-<pre class="sourceCode r"><code class="sourceCode r">mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="co"># generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;seed&quot;</span>, <span class="st">&quot;control&quot;</span>))</code></pre>
+1 -0.238276 0.302464</code></pre>
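+<p>To see what the percentile method is doing under the hood, here is a minimal base R sketch. It assumes the <em>Mythbusters</em> counts were 10 yawners out of 34 in the seed group and 4 out of 16 in the control group (these counts are not shown in this section), and it assumes that each bootstrap resample draws whole rows with replacement. The names <code>yawn_data</code>, <code>diff_in_props()</code>, and <code>boot_stats</code> are hypothetical, and the endpoints will differ somewhat from the <code>infer</code> output above because the random resamples differ.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Assumed counts: 10 of 34 (seed) and 4 of 16 (control) yawned
+yawn_data &lt;- data.frame(
+  group = rep(c(&quot;seed&quot;, &quot;control&quot;), times = c(34, 16)),
+  yawn  = c(rep(c(&quot;yes&quot;, &quot;no&quot;), times = c(10, 24)),
+            rep(c(&quot;yes&quot;, &quot;no&quot;), times = c(4, 12)))
+)
+
+# Difference in proportions of &quot;yes&quot;: seed minus control
+diff_in_props &lt;- function(d) {
+  mean(d$yawn[d$group == &quot;seed&quot;] == &quot;yes&quot;) -
+    mean(d$yawn[d$group == &quot;control&quot;] == &quot;yes&quot;)
+}
+
+set.seed(76)  # arbitrary seed, only for reproducibility
+boot_stats &lt;- replicate(1000, {
+  rows &lt;- sample(nrow(yawn_data), replace = TRUE)  # resample rows
+  diff_in_props(yawn_data[rows, ])
+})
+
+# Percentile method: the 2.5th and 97.5th percentiles of the statistics
+quantile(boot_stats, probs = c(0.025, 0.975), na.rm = TRUE)</code></pre></div>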
+<p>Second, since the bootstrap distribution is roughly bell-shaped, we can construct a confidence interval using the standard error method as well. Recall that to construct a confidence interval using the standard error method, we need to specify the center of the interval using the <code>point_estimate</code> argument. In our case, we need to set it to be the difference in sample proportions of 4.4% that the <em>Mythbusters</em> observed.</p>
+<p>We can also use the <code>infer</code> workflow to compute this value by skipping the <code>generate()</code> step that creates the 1000 bootstrap replicates. In other words, we do not generate replicates, but rather use only the original sample data. We can achieve this by commenting out the <code>generate()</code> line, telling R to ignore it:</p>
+<div class="sourceCode" id="cb339"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb339-1" data-line-number="1">obs_diff_in_props &lt;-<span class="st"> </span>mythbusters_yawn <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb339-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> yawn <span class="op">~</span><span class="st"> </span>group, <span class="dt">success =</span> <span class="st">&quot;yes&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb339-3" data-line-number="3"><span class="st">  </span><span class="co"># generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% </span></a>
+<a class="sourceLine" id="cb339-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;seed&quot;</span>, <span class="st">&quot;control&quot;</span>))</a>
+<a class="sourceLine" id="cb339-5" data-line-number="5">obs_diff_in_props</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
        stat
       &lt;dbl&gt;
 1 0.0441176</code></pre>
-<p>We thus plug this value as the <code>point_estimate</code> argument.</p>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distribution_yawning <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, <span class="dt">point_estimate =</span> <span class="fl">0.0441176</span>)</code></pre>
+<p>We thus plug this value in as the <code>point_estimate</code> argument.</p>
+<div class="sourceCode" id="cb341"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb341-1" data-line-number="1">myth_ci_se &lt;-<span class="st"> </span>bootstrap_distribution_yawning <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb341-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, <span class="dt">point_estimate =</span> obs_diff_in_props)</a>
+<a class="sourceLine" id="cb341-3" data-line-number="3">myth_ci_se</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
       lower    upper
       &lt;dbl&gt;    &lt;dbl&gt;
-1 -0.213435 0.301670</code></pre>
-<p>Let’s visualize both confidence intervals in Figure <a href="8-confidence-intervals.html#fig:bootstrap-distribution-mythbusters-CI">8.32</a>, with the percentile method interval marked with solid lines and the standard error method marked with dashed lines. Observe that they are both similar to each other.</p>
+1 -0.227291 0.315526</code></pre>
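+<p>As a quick sanity check on the standard error method, we can work backwards from the printed interval. A standard-error interval is symmetric around the point estimate, so its midpoint should recover the observed difference of 0.0441176. Dividing the half-width by the normal multiplier gives the bootstrap standard error that was used. This sketch assumes the multiplier is <code>qnorm(0.975)</code>, roughly 1.96. The names <code>center</code>, <code>half_width</code>, and <code>implied_se</code> are hypothetical.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Endpoints copied from the output above
+lower &lt;- -0.227291
+upper &lt;-  0.315526
+
+# Midpoint of the interval: should match the observed difference
+center &lt;- (lower + upper) / 2
+center
+
+# Half-width divided by the assumed multiplier: the implied standard error
+half_width &lt;- (upper - lower) / 2
+implied_se &lt;- half_width / qnorm(0.975)
+implied_se</code></pre></div>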
+<p>Let’s visualize both confidence intervals in Figure <a href="8-confidence-intervals.html#fig:bootstrap-distribution-mythbusters-CI">8.32</a>, with the percentile-method interval marked with black lines and the standard-error-method interval marked with grey lines. Observe that the two intervals are similar.</p>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-mythbusters-CI"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-mythbusters-CI-1.png" alt="Two 95 percent confidence intervals: percentile method (solid) and standard error method (dashed)." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-mythbusters-CI-1.png" alt="Two 95% confidence intervals: percentile method (black) and standard error method (grey)." width="\textwidth" />
 <p class="caption">
-FIGURE 8.32: Two 95 percent confidence intervals: percentile method (solid) and standard error method (dashed).
+FIGURE 8.32: Two 95% confidence intervals: percentile method (black) and standard error method (grey).
 </p>
 </div>
 </div>
 </div>
 <div id="interpreting-the-confidence-interval" class="section level3">
 <h3><span class="header-section-number">8.6.4</span> Interpreting the confidence interval</h3>
-<p>Given that both confidence intervals are quite similar, let’s focus our interpretation to only the percentile method confidence interval of (-0.218, 0.305). Recall from Subsection <a href="8-confidence-intervals.html#shorthand">8.5.2</a> that the precise statistical interpretation of a 95% confidence interval is: if repeated this construction procedure 100 times, then we expect about 95 of the confidence intervals to capture the true value of <span class="math inline">\(p_{seed} - p_{control}\)</span>. In other words, if we gathered 100 samples of <span class="math inline">\(n\)</span> = 50 participants from a similar pool of people and constructed 100 confidence intervals, about 95 of them will contain the true value of <span class="math inline">\(p_{seed} - p_{control}\)</span> while about 5 won’t. Given that this is a little long winded, we use the shorthand interpretation: we’re 95% “confident” that the true difference in proportions <span class="math inline">\(p_{seed} - p_{control}\)</span> is between (-0.22, 0.3).</p>
-<p>There is one value of particular interest that this 95% confidence interval contains: zero. If <span class="math inline">\(p_{seed} - p_{control}\)</span> were equal to 0, then there would be no difference in proportion yawning between the two groups. This would suggest that there is no associated effect of being exposed to yawning on whether you yawn yourself.</p>
+<p>Given that both confidence intervals are quite similar, let’s focus our interpretation on only the percentile-method confidence interval of (-0.238, 0.302). Recall from Subsection <a href="8-confidence-intervals.html#shorthand">8.5.2</a> that the precise statistical interpretation of a 95% confidence interval is: if this construction procedure is repeated 100 times, then we expect about 95 of the confidence intervals to capture the true value of <span class="math inline">\(p_{seed} - p_{control}\)</span>. In other words, if we gathered 100 samples of <span class="math inline">\(n\)</span> = 50 participants from a similar pool of people and constructed one confidence interval from each of the 100 samples, about 95 of them would contain the true value of <span class="math inline">\(p_{seed} - p_{control}\)</span> while about five wouldn’t. Given that this is a little long-winded, we use the shorthand interpretation: we’re 95% “confident” that the true difference in proportions <span class="math inline">\(p_{seed} - p_{control}\)</span> is between -0.238 and 0.302.</p>
+<p>There is one value of particular interest that this 95% confidence interval contains: zero. If <span class="math inline">\(p_{seed} - p_{control}\)</span> were equal to 0, then there would be no difference in proportion yawning between the two groups. This would suggest that there is no associated effect of being exposed to a yawning recruiter on whether you yawn yourself.</p>
 <p>In our case, since the 95% confidence interval includes 0, we cannot conclusively say if either proportion is larger. Of our 1000 bootstrap resamples with replacement, sometimes <span class="math inline">\(\widehat{p}_{seed}\)</span> was higher and thus those exposed to yawning yawned themselves more often. At other times, the reverse happened.</p>
-<p>Say on the other hand the 95% confidence interval was entirely above zero. This would suggestive that <span class="math inline">\(p_{seed} - p_{control} &gt; 0\)</span>, or in other words <span class="math inline">\(p_{seed} &gt; p_{control}\)</span>, and thus we’d have evidence suggesting those exposed to yawning do yawn more often.</p>
+<p>Say, on the other hand, the 95% confidence interval was entirely above zero. This would suggest that <span class="math inline">\(p_{seed} - p_{control} &gt; 0\)</span>, or, in other words, <span class="math inline">\(p_{seed} &gt; p_{control}\)</span>, and thus we’d have evidence suggesting those exposed to yawning do yawn more often.</p>
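+<p>The “repeat the procedure 100 times” interpretation can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population proportions 0.30 and 0.25 are made up, the group sizes 34 and 16 are assumed, and each group is resampled separately, which is a simplification and not necessarily what <code>infer</code> does. The names <code>true_diff</code> and <code>covers</code> are hypothetical.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(2019)  # arbitrary seed, only for reproducibility
+
+# Hypothetical population proportions, chosen for illustration
+p_seed    &lt;- 0.30
+p_control &lt;- 0.25
+true_diff &lt;- p_seed - p_control
+
+# Draw 100 samples; build one 95% percentile interval from each
+covers &lt;- replicate(100, {
+  seed    &lt;- rbinom(34, size = 1, prob = p_seed)     # 1 = yawned
+  control &lt;- rbinom(16, size = 1, prob = p_control)
+  boot &lt;- replicate(500, {
+    mean(sample(seed, replace = TRUE)) -
+      mean(sample(control, replace = TRUE))
+  })
+  ci &lt;- unname(quantile(boot, probs = c(0.025, 0.975)))
+  ci[1] &lt;= true_diff &amp;&amp; true_diff &lt;= ci[2]
+})
+
+# Number of the 100 intervals that captured the true difference
+sum(covers)</code></pre></div>
+<p>We would expect this count to land in the neighborhood of 95, though with samples this small the percentile method can fall somewhat short of its nominal level.</p>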
 <!--
-TODO: Add this back once we add a discussion on random assignment and 
+TODO: Talk about randomized experiment nature of Mythbusters data
+
+Add this back once we add a discussion on random assignment and 
 randomized experiments in Conclusion of sampling chapter
 
 Furthermore, if the 50 participants were randomly allocated to the `"seed"` and `"control"` groups, then this would be suggestive that being exposed to yawning does not *cause* yawning. In other words, yawning is not contagious. However, no information on how participants were assigned to be exposed to yawning or not could be found, so we cannot make such a causal statement. 
@@ -2082,28 +2106,40 @@ <h2><span class="header-section-number">8.7</span> Conclusion</h2>
 <div id="bootstrap-vs-sampling" class="section level3">
 <h3><span class="header-section-number">8.7.1</span> Comparing bootstrap and sampling distributions</h3>
 <p>Let’s talk more about the relationship between <em>sampling distributions</em> and <em>bootstrap distributions</em>.</p>
-<p>Recall back in Section <a href="7-sampling.html#shovel-1000-times">7.2.3</a>, we took 1000 virtual samples from the <code>bowl</code> using a virtual shovel, computed 1000 values of the sample proportion red <span class="math inline">\(\widehat{p}\)</span>, then visualized their distribution in a histogram. Recall that this distribution is called the <em>sampling distribution of</em> <span class="math inline">\(\widehat{p}\)</span> . Furthermore, the standard deviation of the sampling distribution has a special name: the <em>standard error</em>.</p>
-<p>We also mentioned that this sampling activity does not reflect how sampling is done in real-life. Rather, it was an <em>idealized version</em> of sampling so that we could study the effects of sampling variation on estimates, like the proportion of the shovel’s balls that are red. In real-life however, one would take a single sample that’s as large as possible, much like in the Obama poll we saw in Section <a href="7-sampling.html#sampling-case-study">7.4</a>. But how can we get a sense of the effect of sampling variation on estimates if we only have one sample and thus only one estimate? Don’t we need many samples and hence many estimates?</p>
+<p>Recall that back in Subsection <a href="7-sampling.html#shovel-1000-times">7.2.3</a>, we took 1000 virtual samples from the <code>bowl</code> using a virtual shovel, computed 1000 values of the sample proportion red <span class="math inline">\(\widehat{p}\)</span>, then visualized their distribution in a histogram. Recall that this distribution is called the <em>sampling distribution of</em> <span class="math inline">\(\widehat{p}\)</span>. Furthermore, the standard deviation of the sampling distribution has a special name: the <em>standard error</em>.</p>
+<p>We also mentioned that this sampling activity does not reflect how sampling is done in real life. Rather, it was an <em>idealized version</em> of sampling so that we could study the effects of sampling variation on estimates, like the proportion of the shovel’s balls that are red. In real life, however, one would take a single sample that’s as large as possible, much like in the Obama poll we saw in Section <a href="7-sampling.html#sampling-case-study">7.4</a>. But how can we get a sense of the effect of sampling variation on estimates if we only have one sample and thus only one estimate? Don’t we need many samples and hence many estimates?</p>
 <p>The workaround to having a <em>single</em> sample was to perform <em>bootstrap resampling with replacement</em> from the single sample. We did this in the resampling activity in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a> where we focused on the mean year of minting of pennies. We used pieces of paper representing the original sample of 50 pennies from the bank and resampled them with replacement from a hat. We had 35 of our friends perform this activity and visualized the resulting 35 sample means <span class="math inline">\(\overline{x}\)</span> in a histogram in Figure <a href="8-confidence-intervals.html#fig:tactile-resampling-6">8.11</a>.</p>
 <p>This distribution was called the <em>bootstrap distribution</em> of <span class="math inline">\(\overline{x}\)</span>. We stated at the time that the bootstrap distribution is an <em>approximation</em> to the sampling distribution of <span class="math inline">\(\overline{x}\)</span> in the sense that both distributions will have a similar shape and similar spread.  Thus the <em>standard error</em> of the bootstrap distribution can be used as an approximation to the <em>standard error</em> of the sampling distribution.</p>
-<p>Let’s show you that this is the case by now compare these two types of distributions. Specifically, we’ll compare the</p>
+<p>Let’s show you that this is the case by now comparing these two types of distributions. Specifically, we’ll compare</p>
 <ol style="list-style-type: decimal">
-<li>The sampling distribution of <span class="math inline">\(\widehat{p}\)</span> based on 1000 virtual samples from the <code>bowl</code> from Section <a href="7-sampling.html#shovel-1000-times">7.2.3</a>.</li>
-<li>The bootstrap distribution of <span class="math inline">\(\widehat{p}\)</span> based on 1000 virtual resamples with replacement from Ilyas and Yohan’s single sample <code>bowl_sample_1</code> from Section <a href="8-confidence-intervals.html#ilyas-yohan">8.5.1</a></li>
+<li>the sampling distribution of <span class="math inline">\(\widehat{p}\)</span> based on 1000 virtual samples from the <code>bowl</code> from Subsection <a href="7-sampling.html#shovel-1000-times">7.2.3</a> to</li>
+<li>the bootstrap distribution of <span class="math inline">\(\widehat{p}\)</span> based on 1000 virtual resamples with replacement from Ilyas and Yohan’s single sample <code>bowl_sample_1</code> from Subsection <a href="8-confidence-intervals.html#ilyas-yohan">8.5.1</a>.</li>
 </ol>
 <div id="sampling-distribution" class="section level4 unnumbered">
 <h4>Sampling distribution</h4>
-<p>Here is the code you previously saw in Section <a href="7-sampling.html#shovel-1000-times">7.2.3</a> to construct the sampling distribution of <span class="math inline">\(\widehat{p}\)</span>, with some small changes to incorporate the statistical terminology relating to sampling you learned in Section <a href="7-sampling.html#terminology-and-notation">7.3.1</a>.</p>
+<p>Here is the code you saw in Subsection <a href="7-sampling.html#shovel-1000-times">7.2.3</a> to construct the sampling distribution of <span class="math inline">\(\widehat{p}\)</span> shown again in Figure <a href="8-confidence-intervals.html#fig:sampling-distribution-part-deux">8.33</a>, with some changes to incorporate the statistical terminology relating to sampling from Subsection <a href="7-sampling.html#terminology-and-notation">7.3.1</a>.</p>
+<div class="sourceCode" id="cb343"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb343-1" data-line-number="1"><span class="co"># Take 1000 virtual samples of size 50 from the bowl:</span></a>
+<a class="sourceLine" id="cb343-2" data-line-number="2">virtual_samples &lt;-<span class="st"> </span>bowl <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb343-3" data-line-number="3"><span class="st">  </span><span class="kw">rep_sample_n</span>(<span class="dt">size =</span> <span class="dv">50</span>, <span class="dt">reps =</span> <span class="dv">1000</span>)</a>
+<a class="sourceLine" id="cb343-4" data-line-number="4"><span class="co"># Compute the sampling distribution of 1000 values of p-hat</span></a>
+<a class="sourceLine" id="cb343-5" data-line-number="5">sampling_distribution &lt;-<span class="st"> </span>virtual_samples <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb343-6" data-line-number="6"><span class="st">  </span><span class="kw">group_by</span>(replicate) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb343-7" data-line-number="7"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">red =</span> <span class="kw">sum</span>(color <span class="op">==</span><span class="st"> &quot;red&quot;</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb343-8" data-line-number="8"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">prop_red =</span> red <span class="op">/</span><span class="st"> </span><span class="dv">50</span>)</a>
+<a class="sourceLine" id="cb343-9" data-line-number="9"><span class="co"># Visualize sampling distribution of p-hat</span></a>
+<a class="sourceLine" id="cb343-10" data-line-number="10"><span class="kw">ggplot</span>(sampling_distribution, <span class="kw">aes</span>(<span class="dt">x =</span> prop_red)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb343-11" data-line-number="11"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.05</span>, <span class="dt">boundary =</span> <span class="fl">0.4</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb343-12" data-line-number="12"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Proportion of 50 balls that were red&quot;</span>, </a>
+<a class="sourceLine" id="cb343-13" data-line-number="13">       <span class="dt">title =</span> <span class="st">&quot;Sampling distribution&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:sampling-distribution-part-deux"></span>
-<img src="moderndive_files/figure-html/sampling-distribution-part-deux-1.png" alt="Previously seen sampling distribution of sample proportion red for $n = 1000$." width="\textwidth" />
+<img src="ModernDive_files/figure-html/sampling-distribution-part-deux-1.png" alt="Previously seen sampling distribution of sample proportion red, based on 1000 samples of size $n = 50$." width="\textwidth" />
 <p class="caption">
 FIGURE 8.33: Previously seen sampling distribution of sample proportion red, based on 1000 samples of size <span class="math inline">\(n = 50\)</span>.
 </p>
 </div>
 <p>An important thing to keep in mind is that the default value for <code>replace</code> is <code>FALSE</code> when using <code>rep_sample_n()</code>. This is because when sampling 50 balls with a shovel, we are extracting 50 balls one-by-one <em>without</em> replacing them. This is in contrast to bootstrap resampling <em>with</em> replacement, where we resample a ball and put it back, and repeat this process 50 times.</p>
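+<p>The difference between the two kinds of sampling is easy to see with base R’s <code>sample()</code> function. In this small sketch, the vector <code>balls</code> is a hypothetical stand-in for 50 labeled balls and is not one of the book’s datasets.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(2019)  # arbitrary seed, only for reproducibility
+balls &lt;- 1:50   # hypothetical labels for 50 balls
+
+# Without replacement: every ball appears exactly once
+without_replacement &lt;- sample(balls, size = 50, replace = FALSE)
+length(unique(without_replacement))
+
+# With replacement: some balls repeat and others are left out
+with_replacement &lt;- sample(balls, size = 50, replace = TRUE)
+length(unique(with_replacement))</code></pre></div>
+<p>The first count is always 50, while the second is almost always smaller because some balls are drawn more than once.</p>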
-<p>Let’s quantify the variability in this sampling distribution by calculating the standard deviation of the <code>propr_red</code> variable representing 1000 values of the sample proportion <span class="math inline">\(\widehat{p}\)</span>. Remember that the standard deviation of the sampling distribution is the <em>standard error</em>, frequently denoted as <code>se</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">sampling_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">se =</span> <span class="kw">sd</span>(prop_red))</code></pre>
+<p>Let’s quantify the variability in this sampling distribution by calculating the standard deviation of the <code>prop_red</code> variable representing 1000 values of the sample proportion <span class="math inline">\(\widehat{p}\)</span>. Remember that the standard deviation of the sampling distribution is the <em>standard error</em>, frequently denoted as <code>se</code>.</p>
+<div class="sourceCode" id="cb344"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb344-1" data-line-number="1">sampling_distribution <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">summarize</span>(<span class="dt">se =</span> <span class="kw">sd</span>(prop_red))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
          se
       &lt;dbl&gt;
@@ -2111,45 +2147,42 @@ <h4>Sampling distribution</h4>
 </div>
 <div id="bootstrap-distribution" class="section level4 unnumbered">
 <h4>Bootstrap distribution</h4>
-<p>Here is the code you previously saw in Section <a href="8-confidence-intervals.html#ilyas-yohan">8.5.1</a> to construct the bootstrap distribution of <span class="math inline">\(\widehat{p}\)</span> based on Ilyas and Yohan’s original sample of 50 balls saved in <code>bowl_sample_1</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Compute the bootstrap distribution using infer workflow:</span>
-bootstrap_distribution &lt;-<span class="st"> </span>bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</code></pre>
+<p>Here is the code you previously saw in Subsection <a href="8-confidence-intervals.html#ilyas-yohan">8.5.1</a> to construct the bootstrap distribution of <span class="math inline">\(\widehat{p}\)</span> based on Ilyas and Yohan’s original sample of 50 balls saved in <code>bowl_sample_1</code>.</p>
+<div class="sourceCode" id="cb346"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb346-1" data-line-number="1">bootstrap_distribution &lt;-<span class="st"> </span>bowl_sample_<span class="dv">1</span> <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb346-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> color, <span class="dt">success =</span> <span class="st">&quot;red&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb346-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb346-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-part-deux"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-part-deux-1.png" alt="Bootstrap distribution of sample proportion red for $n = 1000$." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-part-deux-1.png" alt="Bootstrap distribution of proportion red for $n = 1000$." width="\textwidth" />
 <p class="caption">
-FIGURE 8.34: Bootstrap distribution of sample proportion red for <span class="math inline">\(n = 1000\)</span>.
+FIGURE 8.34: Bootstrap distribution of proportion red for <span class="math inline">\(n = 1000\)</span>.
 </p>
 </div>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">se =</span> <span class="kw">sd</span>(stat))</code></pre>
+<div class="sourceCode" id="cb347"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb347-1" data-line-number="1">bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">summarize</span>(<span class="dt">se =</span> <span class="kw">sd</span>(stat))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
          se
       &lt;dbl&gt;
-1 0.0693340</code></pre>
+1 0.0712212</code></pre>
 </div>
 <div id="comparison" class="section level4 unnumbered">
 <h4>Comparison</h4>
-<p>Now that we have computed both the sampling distribution and the bootstrap distributions, let’s compare them side-by-side in Figure <a href="8-confidence-intervals.html#fig:side-by-side">8.35</a>. We’ll make both histograms have matching scales on the x and y-axes to make them more comparable. Furthermore, we’ll add:</p>
+<p>Now that we have computed both the sampling distribution and the bootstrap distribution, let’s compare them side-by-side in Figure <a href="8-confidence-intervals.html#fig:side-by-side">8.35</a>. We’ll make both histograms have matching scales on the x- and y-axes to make them more comparable. Furthermore, we’ll add:</p>
 <ol style="list-style-type: decimal">
 <li>To the sampling distribution on the top: a solid line denoting the proportion of the bowl’s balls that are red <span class="math inline">\(p\)</span> = 0.375.</li>
 <li>To the bootstrap distribution on the bottom: a dashed line at the sample proportion <span class="math inline">\(\widehat{p}\)</span> = 21/50 = 0.42 = 42% that Ilyas and Yohan observed.</li>
 </ol>
 <div class="figure" style="text-align: center"><span id="fig:side-by-side"></span>
-<img src="moderndive_files/figure-html/side-by-side-1.png" alt="Comparing the sampling and bootstrap distributions of $\widehat{p}$" width="\textwidth" />
+<img src="ModernDive_files/figure-html/side-by-side-1.png" alt="Comparing the sampling and bootstrap distributions of $\widehat{p}$." width="\textwidth" />
 <p class="caption">
-FIGURE 8.35: Comparing the sampling and bootstrap distributions of <span class="math inline">\(\widehat{p}\)</span>
+FIGURE 8.35: Comparing the sampling and bootstrap distributions of <span class="math inline">\(\widehat{p}\)</span>.
 </p>
 </div>
-<p>There is a lot going on in Figure <a href="8-confidence-intervals.html#fig:side-by-side">8.35</a>, so let’s break down all the comparisons slowly.</p>
-<p>First, observe how the sampling distribution on top is centered at <span class="math inline">\(p\)</span> = 0.375. This is because the sampling is done at random and in an unbiased fashion. So the estimates <span class="math inline">\(\widehat{p}\)</span> are centered at the true value of <span class="math inline">\(p\)</span>.</p>
+<p>There is a lot going on in Figure <a href="8-confidence-intervals.html#fig:side-by-side">8.35</a>, so let’s break down all the comparisons slowly. First, observe how the sampling distribution on top is centered at <span class="math inline">\(p\)</span> = 0.375. This is because the sampling is done at random and in an unbiased fashion. So the estimates <span class="math inline">\(\widehat{p}\)</span> are centered at the true value of <span class="math inline">\(p\)</span>.</p>
 <p>However, this is not the case with the following bootstrap distribution. The bootstrap distribution is centered at 0.42, which is the proportion red of Ilyas and Yohan’s 50 sampled balls. This is because we are resampling from the same sample over and over again. Since the bootstrap distribution is centered at the original sample’s proportion, it doesn’t necessarily provide a better estimate of <span class="math inline">\(p\)</span> = 0.375. This leads us to our first lesson about bootstrapping:</p>
 <blockquote>
 <p>The bootstrap distribution will likely not have the same center as the sampling distribution. In other words, bootstrapping cannot improve the quality of an estimate.</p>
 </blockquote>
-<p>Second, let’s now compare the spread (in the words the variation) of the two distributions: they are somewhat similar. In the previous code, we computed the standard deviations of both distributions as well. Recall that such standard deviations have a special name: <em>standard errors</em>. Let’s compare them in Table <a href="8-confidence-intervals.html#tab:comparing-se">8.5</a>.</p>
+<p>Second, let’s now compare the spread of the two distributions: they are somewhat similar. In the previous code, we computed the standard deviations of both distributions as well. Recall that such standard deviations have a special name: <em>standard errors</em>. Let’s compare them in Table <a href="8-confidence-intervals.html#tab:comparing-se">8.5</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:comparing-se">TABLE 8.5: </span>Comparing standard errors
@@ -2178,7 +2211,7 @@ <h4>Comparison</h4>
 Bootstrap distribution
 </td>
 <td style="text-align:right;">
-0.069
+0.071
 </td>
 </tr>
 </tbody>
@@ -2192,16 +2225,16 @@ <h4>Comparison</h4>
 </div>
 <div id="theory-ci" class="section level3">
 <h3><span class="header-section-number">8.7.2</span> Theory-based confidence intervals</h3>
-<p>So far in this chapter, we’ve constructed confidence intervals using two methods: the percentile method and the standard error method. Recall also from Section <a href="8-confidence-intervals.html#se-method">8.3.2</a> that we can only use the standard-error method if the bootstrap distribution is bell-shaped i.e. normally distributed.</p>
+<p>So far in this chapter, we’ve constructed confidence intervals using two methods: the percentile method and the standard error method. Recall also from Subsection <a href="8-confidence-intervals.html#se-method">8.3.2</a> that we can only use the standard error method if the bootstrap distribution is bell-shaped (i.e., normally distributed).</p>
 <p>In a similar vein, if the sampling distribution is normally shaped, there is another method for constructing confidence intervals that does not involve using your computer. You can use a <em>theory-based method</em> involving mathematical formulas!</p>
 <p>The formula uses the rule of thumb we saw in Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a> that 95% of values in a normal distribution are within <span class="math inline">\(\pm 1.96\)</span> standard deviations of the mean. In the case of sampling and bootstrap distributions, recall that the standard deviation has a special name: the <em>standard error</em>.</p>
 <div id="theory-based-method-for-computing-standard-errors" class="section level4 unnumbered">
 <h4>Theory-based method for computing standard errors</h4>
 <p>There exists in many cases a formula that approximates the standard error! In the case of our <code>bowl</code> where we used the sample proportion red <span class="math inline">\(\widehat{p}\)</span> to estimate the proportion of the bowl’s balls that are red, the formula that approximates the standard error is:</p>
 <p><span class="math display">\[\text{SE}_{\widehat{p}} \approx \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]</span></p>
-<p>For example, recall from <code>bowl_sample_1</code> that Yohan and Ilyas sampled <span class="math inline">\(n\)</span> = 50 balls and observed a sample proportion <span class="math inline">\(\widehat{p}\)</span> of 21/50 = 0.42. So using the formula, an approximation of the standard error of <span class="math inline">\(\widehat{p}\)</span> is</p>
+<p>For example, recall from <code>bowl_sample_1</code> that Yohan and Ilyas sampled <span class="math inline">\(n = 50\)</span> balls and observed a sample proportion <span class="math inline">\(\widehat{p}\)</span> of 21/50 = 0.42. So, using the formula, an approximation of the standard error of <span class="math inline">\(\widehat{p}\)</span> is</p>
 <p><span class="math display">\[\text{SE}_{\widehat{p}} \approx \sqrt{\frac{0.42(1-0.42)}{50}} = \sqrt{0.004872} = 0.0698 \approx 0.070\]</span></p>
-<p>The key observation to make here is that there is an <span class="math inline">\(n\)</span> in the denominator. In other words, as the sample size <span class="math inline">\(n\)</span> increases, the standard error decreases. We’ve demonstrated this fact this using our virtual shovels in Section <a href="7-sampling.html#moral-of-the-story">7.3.3</a>. If you don’t recall this demonstration, we highly recommend you go back and read that section.</p>
+<p>The key observation to make here is that there is an <span class="math inline">\(n\)</span> in the denominator. So as the sample size <span class="math inline">\(n\)</span> increases, the standard error decreases. We’ve demonstrated this fact using our virtual shovels in Subsection <a href="7-sampling.html#moral-of-the-story">7.3.3</a>. If you don’t recall this demonstration, we highly recommend you go back and read that subsection.</p>
 <p>Let’s compare this theory-based standard error to the standard error of the sampling and bootstrap distributions you computed previously in Subsection <a href="8-confidence-intervals.html#bootstrap-vs-sampling">8.7.1</a> in Table <a href="8-confidence-intervals.html#tab:comparing-se-2">8.6</a>. Notice how they are all similar!</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
@@ -2231,7 +2264,7 @@ <h4>Theory-based method for computing standard errors</h4>
 Bootstrap distribution
 </td>
 <td style="text-align:right;">
-0.069
+0.071
 </td>
 </tr>
 <tr>
@@ -2246,23 +2279,23 @@ <h4>Theory-based method for computing standard errors</h4>
 </table>
 <p>Going back to Yohan and Ilyas’ sample proportion <span class="math inline">\(\widehat{p}\)</span> of 21/50 = 0.42, say this were based on a sample of size <span class="math inline">\(n\)</span> = 100 instead of 50. Then the standard error would be:</p>
 <p><span class="math display">\[\text{SE}_{\widehat{p}} \approx \sqrt{\frac{0.42(1-0.42)}{100}} = \sqrt{0.002436} = 0.0494\]</span></p>
-<p>Observe that the standard error has gone done from 0.0698 to 0.0494. In other words, the “typical” error of our estimates using <span class="math inline">\(n\)</span> = 100 will go down and hence are more <em>precise</em>. Recall we illustrated the difference between accuracy and precision of estimates in Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a>.</p>
-<p>Why is this formula true? Unfortunately, we don’t have the tools at this point to prove this; you’ll need to take a more advanced course in probability and statistics.</p>
+<p>Observe that the standard error has gone down from 0.0698 to 0.0494. In other words, the “typical” error of our estimates using <span class="math inline">\(n\)</span> = 100 will go down and hence be more <em>precise</em>. Recall that we illustrated the difference between accuracy and precision of estimates in Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a>.</p>
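+<p>As a quick check of this arithmetic, here is a minimal sketch in base R. The helper name <code>se_prop()</code> is our own, used only for illustration; it is not a function from the <code>moderndive</code> or <code>infer</code> packages.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: theory-based approximation to the standard error of p_hat
+se_prop &lt;- function(p_hat, n) {
+  sqrt(p_hat * (1 - p_hat) / n)
+}
+
+# Same sample proportion, two different sample sizes
+se_50 &lt;- se_prop(p_hat = 0.42, n = 50)
+se_100 &lt;- se_prop(p_hat = 0.42, n = 100)
+round(c(se_50, se_100), 4)</code></pre>
+<p>Note that doubling the sample size shrinks the standard error by a factor of <span class="math inline">\(\sqrt{2}\)</span>, not by half, because <span class="math inline">\(n\)</span> sits under a square root.</p>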
+<p>Why is this formula true? Unfortunately, we don’t have the tools at this point to prove this; you’ll need to take a more advanced course in probability and statistics. (It is related to the concepts of Bernoulli and Binomial Distributions. You can read more about its derivation <a href="http://onlinestatbook.com/2/sampling_distributions/samp_dist_p.html">here</a> if you like.)</p>
 </div>
 <div id="theory-based-method-for-constructing-confidence-intervals" class="section level4 unnumbered">
 <h4>Theory-based method for constructing confidence intervals</h4>
-<p>Using these theory-based standard errors, let’s present a theory-based method for constructing 95% confidence intervals that does not involve using a computer, but rather mathematical formulas. Note that this theory-based method only holds if the sampling distribution is normally shaped, so that we can use the 95% rule of thumb about normal distributions in Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>.</p>
+<p>Using these theory-based standard errors, let’s present a theory-based method for constructing 95% confidence intervals that does not involve using a computer, but rather mathematical formulas. Note that this theory-based method only holds if the sampling distribution is normally shaped, so that we can use the 95% rule of thumb about normal distributions discussed in Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>.</p>
 <ol style="list-style-type: decimal">
 <li>Collect a single representative sample of size <span class="math inline">\(n\)</span> that’s as large as possible.</li>
 <li>Compute the <em>point estimate</em>: the <em>sample proportion</em> <span class="math inline">\(\widehat{p}\)</span>. Think of this as the center of your “net.”</li>
 <li>Compute the approximation to the standard error</li>
 </ol>
 <p><span class="math display">\[\text{SE}_{\widehat{p}} \approx \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]</span></p>
-<ol style="list-style-type: decimal">
-<li>Compute a quantity known as the <em>margin of error</em> (more later):</li>
+<ol start="4" style="list-style-type: decimal">
+<li>Compute a quantity known as the <em>margin of error</em> (more on this later after we list the five steps):</li>
 </ol>
 <p><span class="math display">\[\text{MoE}_{\widehat{p}} = 1.96 \cdot \text{SE}_{\widehat{p}} =  1.96 \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]</span></p>
-<ol style="list-style-type: decimal">
+<ol start="5" style="list-style-type: decimal">
 <li>Compute both endpoints of the confidence interval.
 <ul>
 <li><p>The lower end-point. Think of this as the left end-point of the net:
@@ -2271,23 +2304,30 @@ <h4>Theory-based method for constructing confidence intervals</h4>
 <span class="math display">\[\widehat{p} + \text{MoE}_{\widehat{p}} = \widehat{p} + 1.96 \cdot \text{SE}_{\widehat{p}} = \widehat{p} + 1.96 \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]</span></p></li>
 <li><p>Alternatively, you can succinctly summarize a 95% confidence interval for <span class="math inline">\(p\)</span> using the <span class="math inline">\(\pm\)</span> symbol:</p></li>
 </ul>
-<span class="math display">\[\widehat{p} \pm \text{MoE}_{\widehat{p}} = \widehat{p} \pm 1.96 \cdot \text{SE}_{\widehat{p}} = \widehat{p} \pm 1.96 \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]</span></li>
+<span class="math display">\[\widehat{p} \pm \text{MoE}_{\widehat{p}} = \widehat{p} \pm (1.96 \cdot \text{SE}_{\widehat{p}}) = \widehat{p} \pm \left( 1.96 \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \right)\]</span></li>
 </ol>
-<p>So going back to Yohan and Ilyas’ sample of <span class="math inline">\(n=50\)</span> balls that had 21 red balls, the 95% confidence interval for <span class="math inline">\(p\)</span> is 0.42 <span class="math inline">\(\pm\)</span> 1.96 <span class="math inline">\(\cdot\)</span> 0.0698 = 0.42 <span class="math inline">\(\pm\)</span> 0.137 = (0.42 - 0.137, 0.42 + 0.137) = (0.283, 0.557). In other words, Yohan and Ilyas are 95% “confident” that the true proportion red of the bowl’s balls is between 28.3% and 55.7%. Given that the true population proportion <span class="math inline">\(p\)</span> was 0.375, in this case they successfully captured the fish.</p>
-<p>In Step 4, we defined a statistical quantity known as the <em>margin of error</em>. You can think of this quantity as how much the net extends to the left and to the right of the center of our net. The 1.96 multiplier roots in the 95% rule of thumb we introduced earlier and the fact that we want the confidence level to be 95%. The value of the margin error entirely determines the width of the confidence interval. Recall from Section <a href="8-confidence-intervals.html#ci-width">8.5.3</a> that confidence interval widths are determined by an interplay of the confidence level, the sample size <span class="math inline">\(n\)</span>, and the standard error.</p>
-<p>Let’s revisit the poll of President Obama’s approval rating among young Americans aged 18-29 we introduced in Section <a href="7-sampling.html#sampling-case-study">7.4</a>. Pollsters found that based on a representative sample of <span class="math inline">\(n\)</span> = 2089 young Americans, <span class="math inline">\(\widehat{p}\)</span> = 0.41 = 41% supported President Obama.</p>
+<p>So going back to Yohan and Ilyas’ sample of <span class="math inline">\(n = 50\)</span> balls that had 21 red balls, the 95% confidence interval for <span class="math inline">\(p\)</span> is</p>
+<p><span class="math display">\[
+\begin{aligned}
+0.42 \pm 1.96 \cdot 0.0698 &amp;= 0.42 \, \pm \, 0.137 \\ &amp;= (0.42 - 0.137, \, 0.42 + 0.137) \\ &amp;= (0.283, \, 0.557).
+\end{aligned}
+\]</span></p>
+<p>Yohan and Ilyas are 95% “confident” that the true proportion red of the bowl’s balls is between 28.3% and 55.7%. Given that the true population proportion <span class="math inline">\(p\)</span> was 0.375, in this case they successfully captured the fish.</p>
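+<p>The five steps can also be sketched in base R for this one sample. The variable names are our own, chosen to mirror the notation in the formulas.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Steps 1 and 2: sample size and point estimate (21 red balls out of 50)
+n &lt;- 50
+p_hat &lt;- 21 / 50
+
+# Step 3: approximate standard error
+SE &lt;- sqrt(p_hat * (1 - p_hat) / n)
+
+# Step 4: margin of error for a 95% confidence level
+MoE &lt;- 1.96 * SE
+
+# Step 5: endpoints of the confidence interval
+c(lower_ci = p_hat - MoE, upper_ci = p_hat + MoE)</code></pre>
+<p>Up to rounding, the two endpoints should agree with the 28.3% and 55.7% quoted above.</p>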
+<p>In Step 4, we defined a statistical quantity known as the <em>margin of error</em>. You can think of this quantity as how much the net extends to the left and to the right of the center of our net. The 1.96 multiplier is rooted in the 95% rule of thumb we introduced earlier and the fact that we want the confidence level to be 95%. The value of the margin of error entirely determines the width of the confidence interval. Recall from Subsection <a href="8-confidence-intervals.html#ci-width">8.5.3</a> that confidence interval widths are determined by an interplay of the confidence level, the sample size <span class="math inline">\(n\)</span>, and the standard error.</p>
+<p>Let’s revisit the poll of President Obama’s approval rating among young Americans aged 18-29 which we introduced in Section <a href="7-sampling.html#sampling-case-study">7.4</a>. Pollsters found that based on a representative sample of <span class="math inline">\(n\)</span> = 2089 young Americans, <span class="math inline">\(\widehat{p}\)</span> = 0.41 = 41% supported President Obama.</p>
 <p>If you look towards the end of the article, it also states: “The poll’s margin of error was plus or minus 2.1 percentage points.” This is precisely the <span class="math inline">\(\text{MoE}\)</span>:</p>
 <p><span class="math display">\[
 \begin{aligned}
 \text{MoE} &amp;= 1.96 \cdot \text{SE} =  1.96 \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} = 1.96 \cdot \sqrt{\frac{0.41(1-0.41)}{2089}} \\
-&amp;= 1.96 \cdot 0.0108 = 0.021 = 2.1%
+&amp;= 1.96 \cdot 0.0108 = 0.021 = 2.1\%
 \end{aligned}
 \]</span></p>
-<p>Their poll results are based on a confidence level of 95% and the resulting 95% confidence interval for the proportion of all young Americans who support Obama is: <span class="math inline">\(\widehat{p} \pm \text{MoE}\)</span> = 0.42 <span class="math inline">\(\pm\)</span> 0.021 = (0.339, 0.441) = (33.9%, 44.1%).</p>
+<p>Their poll results are based on a confidence level of 95% and the resulting 95% confidence interval for the proportion of all young Americans who support Obama is:</p>
+<p><span class="math display">\[\widehat{p} \pm \text{MoE} = 0.41 \pm 0.021 = (0.389, \, 0.431) = (38.9\%, \, 43.1\%).\]</span></p>
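+<p>You can check the reported margin of error yourself. This is a sketch in base R using only the two numbers quoted from the poll; the variable names are ours.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Sample size and sample proportion reported by the pollsters
+n_poll &lt;- 2089
+p_hat_poll &lt;- 0.41
+
+# Margin of error at the 95% confidence level
+MoE_poll &lt;- 1.96 * sqrt(p_hat_poll * (1 - p_hat_poll) / n_poll)
+
+# Express the margin of error in percentage points
+round(100 * MoE_poll, 1)</code></pre>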
 </div>
 <div id="confidence-intervals-based-on-33-tactile-samples" class="section level4 unnumbered">
 <h4>Confidence intervals based on 33 tactile samples</h4>
-<p>Let’s revisit our 33 friends’ samples from the <code>bowl</code> from Section <a href="7-sampling.html#student-shovels">7.1.3</a>. We’ll use their 33 samples to construct 33 theory-based 95% confidence intervals for <span class="math inline">\(p\)</span>. Recall this data was saved in the <code>tactile_prop_red</code> data frame included in the <code>moderndive</code> package:</p>
+<p>Let’s revisit our 33 friends’ samples from the <code>bowl</code> from Subsection <a href="7-sampling.html#student-shovels">7.1.3</a>. We’ll use their 33 samples to construct 33 theory-based 95% confidence intervals for <span class="math inline">\(p\)</span>. Recall this data was saved in the <code>tactile_prop_red</code> data frame included in the <code>moderndive</code> package:</p>
 <ol style="list-style-type: decimal">
 <li><code>rename()</code> the variable <code>prop_red</code> to <code>p_hat</code>, the statistical name of the sample proportion <span class="math inline">\(\widehat{p}\)</span>.</li>
 <li><code>mutate()</code> a new variable <code>n</code> making explicit the sample size of 50.</li>
@@ -2299,16 +2339,15 @@ <h4>Confidence intervals based on 33 tactile samples</h4>
 <li>The right endpoint of the confidence interval <code>upper_ci</code></li>
 </ul></li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r">conf_ints &lt;-<span class="st"> </span>tactile_prop_red <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">p_hat =</span> prop_red) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">mutate</span>(
-    <span class="dt">n =</span> <span class="dv">50</span>,
-    <span class="dt">SE =</span> <span class="kw">sqrt</span>(p_hat <span class="op">*</span><span class="st"> </span>(<span class="dv">1</span> <span class="op">-</span><span class="st"> </span>p_hat) <span class="op">/</span><span class="st"> </span>n),
-    <span class="dt">MoE =</span> <span class="fl">1.96</span> <span class="op">*</span><span class="st"> </span>SE,
-    <span class="dt">lower_ci =</span> p_hat <span class="op">-</span><span class="st"> </span>MoE,
-    <span class="dt">upper_ci =</span> p_hat <span class="op">+</span><span class="st"> </span>MoE
-  )
-conf_ints</code></pre>
+<div class="sourceCode" id="cb349"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb349-1" data-line-number="1">conf_ints &lt;-<span class="st"> </span>tactile_prop_red <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb349-2" data-line-number="2"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">p_hat =</span> prop_red) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb349-3" data-line-number="3"><span class="st">  </span><span class="kw">mutate</span>(</a>
+<a class="sourceLine" id="cb349-4" data-line-number="4">    <span class="dt">n =</span> <span class="dv">50</span>,</a>
+<a class="sourceLine" id="cb349-5" data-line-number="5">    <span class="dt">SE =</span> <span class="kw">sqrt</span>(p_hat <span class="op">*</span><span class="st"> </span>(<span class="dv">1</span> <span class="op">-</span><span class="st"> </span>p_hat) <span class="op">/</span><span class="st"> </span>n),</a>
+<a class="sourceLine" id="cb349-6" data-line-number="6">    <span class="dt">MoE =</span> <span class="fl">1.96</span> <span class="op">*</span><span class="st"> </span>SE,</a>
+<a class="sourceLine" id="cb349-7" data-line-number="7">    <span class="dt">lower_ci =</span> p_hat <span class="op">-</span><span class="st"> </span>MoE,</a>
+<a class="sourceLine" id="cb349-8" data-line-number="8">    <span class="dt">upper_ci =</span> p_hat <span class="op">+</span><span class="st"> </span>MoE</a>
+<a class="sourceLine" id="cb349-9" data-line-number="9">  )</a></code></pre></div>
 <pre><code># A tibble: 33 x 9
    group    replicate red_balls p_hat     n        SE      MoE lower_ci upper_ci
    &lt;chr&gt;        &lt;int&gt;     &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;     &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt;
@@ -2323,16 +2362,16 @@ <h4>Confidence intervals based on 33 tactile samples</h4>
  9 Daniel,…         9        15  0.3     50 0.0648074 0.127023 0.172977 0.427023
 10 Josh, M…        10        17  0.34    50 0.0669925 0.131305 0.208695 0.471305
 # … with 23 more rows</code></pre>
-<p>In Figure <a href="8-confidence-intervals.html#fig:tactile-conf-int">8.36</a>, let’s plot the 33 confidence intervals for <span class="math inline">\(p\)</span> saved in <code>conf_ints</code> along with a vertical line at <span class="math inline">\(p\)</span> = 0.375 indicating the true proportion of the <code>bowl</code>’s balls that are red. Furthermore, let’s mark the sample proportions <span class="math inline">\(\widehat{p}\)</span> that are the centers of the confidence intervals with dots.</p>
+<p>In Figure <a href="8-confidence-intervals.html#fig:tactile-conf-int">8.36</a>, let’s plot the 33 confidence intervals for <span class="math inline">\(p\)</span> saved in <code>conf_ints</code> along with a vertical line at <span class="math inline">\(p\)</span> = 0.375 indicating the true proportion of the <code>bowl</code>’s balls that are red. Furthermore, let’s mark the sample proportions <span class="math inline">\(\widehat{p}\)</span> with dots since they represent the centers of these confidence intervals.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-conf-int"></span>
-<img src="moderndive_files/figure-html/tactile-conf-int-1.png" alt="33 95 percent confidence intervals based on 33 tactile samples of size n = 50." width="\textwidth" />
+<img src="ModernDive_files/figure-html/tactile-conf-int-1.png" alt="33 confidence intervals at the 95\% level based on 33 tactile samples of size $n = 50$." width="\textwidth" />
 <p class="caption">
-FIGURE 8.36: 33 95 percent confidence intervals based on 33 tactile samples of size n = 50.
+FIGURE 8.36: 33 confidence intervals at the 95% level based on 33 tactile samples of size <span class="math inline">\(n = 50\)</span>.
 </p>
 </div>
 <p>Observe that 31 of the 33 confidence intervals “captured” the true value of <span class="math inline">\(p\)</span>, for a success rate of 31 / 33 = 93.94%. While this is not quite 95%, recall that we <em>expect</em> about 95% of such confidence intervals to capture <span class="math inline">\(p\)</span>. The actual observed success rate will vary slightly.</p>
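+<p>Rather than counting intervals by eye, you could compute this success rate directly. The sketch below is written in base R; the helper <code>capture_rate()</code> is hypothetical (our own name), and the commented-out last line assumes the <code>conf_ints</code> data frame created earlier.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: proportion of intervals [lower, upper] that contain p
+capture_rate &lt;- function(lower, upper, p) {
+  mean(lower &lt;= p &amp; p &lt;= upper)
+}
+
+# Toy check: the first interval contains 0.375, the second does not
+capture_rate(lower = c(0.1, 0.4), upper = c(0.5, 0.6), p = 0.375)
+
+# With the data frame from the earlier code, you would run:
+# capture_rate(conf_ints$lower_ci, conf_ints$upper_ci, p = 0.375)</code></pre>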
-<p>Theory-based methods like this have largely been used in the past because we didn’t have the computing power to perform simulation-based methods such as bootstrapping. They are still commonly used however and if the sampling is normally distributed, we have access to an alternative method for constructing confidence intervals as well as performing hypothesis tests as we will see in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>.</p>
-<p>The kind of computer-based statistical inference we’ve seen so far has a particular name in the field of statistics: <em>simulation-based inference</em>. This is because we are performing statistical inference using computer simulations. In our opinion, two large benefits of simulation-based methods over theory-based methods are that 1) they are easier for people new to statistical inference to understand and 2) they also work in situations where theory-based methods and mathematical formulas don’t exist.</p>
+<p>Theory-based methods like this have largely been used in the past because we didn’t have the computing power to perform simulation-based methods such as bootstrapping. They are still commonly used, however, and if the sampling distribution is normally distributed, we have access to an alternative method for constructing confidence intervals as well as performing hypothesis tests as we will see in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a>.</p>
+<p>The kind of computer-based statistical inference we’ve seen so far has a particular name in the field of statistics: <em>simulation-based inference</em>. This is because we are performing statistical inference using computer simulations. In our opinion, two large benefits of simulation-based methods over theory-based methods are that (1) they are easier for people new to statistical inference to understand and (2) they also work in situations where theory-based methods and mathematical formulas don’t exist.</p>
 <!--
 Albert: This section is getting hella long, so will comment this out for now. Let's consider adding 
 it back if we have room in book.
@@ -2341,7 +2380,7 @@ <h4>Confidence intervals based on 33 tactile samples</h4>
 
 #### Confidence intervals based on 100 virtual samples {-}
 
-Let's say however, we repeated this 100 times, not tactilely, but virtually. Let's do this only 100 times instead of 1000 like we did before so that the results can fit on the screen. Again, the steps for compute a 95% confidence interval for $p$ are:
+Let's say, however, we repeated this 100 times, not tactilely, but virtually. Let's do this only 100 times instead of 1000 like we did before so that the results can fit on the screen. Again, the steps to compute a 95% confidence interval for $p$ are:
 
 1. Collect a sample of size $n = 50$ as we did in Chapter \@ref(sampling)
 1. Compute $\widehat{p}$: the sample proportion red of these $n$ = 50 balls
@@ -2381,7 +2420,7 @@ <h4>Confidence intervals based on 33 tactile samples</h4>
 
 
 
-We see that of our 100 confidence intervals based on samples of size $n$ = 50, `sum(virtual_prop_red[["captured"]])` of them captured the true $p = 900/2400$, whereas `100 - sum(virtual_prop_red[["captured"]])` of them missed. As we create more and more confidence intervals based on more and more samples, about 95% of these intervals will capture. In other words our procedure is "95% reliable." 
+We see that of our 100 confidence intervals based on samples of size $n$ = 50, `sum(virtual_prop_red[["captured"]])` of them captured the true $p = 900/2400$, whereas `100 - sum(virtual_prop_red[["captured"]])` of them missed. As we create more and more confidence intervals based on more and more samples, about 95% of these intervals will capture. In other words, our procedure is "95% reliable." 
 -->
 </div>
 </div>
@@ -2392,10 +2431,7 @@ <h3><span class="header-section-number">8.7.3</span> Additional resources</h3>
 </div>
 <div id="whats-to-come-7" class="section level3">
 <h3><span class="header-section-number">8.7.4</span> What’s to come?</h3>
-<p>Now that we’ve equipped ourselves with confidence intervals, in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> we’ll cover the other common tool for statistical inference: hypothesis testing.</p>
-<!--
-TODO: Bridge confidence intervals and hypothesis tests better.
--->
+<p>Now that we’ve equipped ourselves with confidence intervals, in Chapter <a href="9-hypothesis-testing.html#hypothesis-testing">9</a> we’ll cover the other common tool for statistical inference: hypothesis testing. Just like confidence intervals, hypothesis tests are used to infer about a population using a sample. However, we’ll see that the framework for making such inferences is slightly different.</p>
 
 </div>
 </div>
@@ -2411,11 +2447,13 @@ <h3><span class="header-section-number">8.7.4</span> What’s to come?</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -2423,12 +2461,11 @@ <h3><span class="header-section-number">8.7.4</span> What’s to come?</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -2443,6 +2480,10 @@ <h3><span class="header-section-number">8.7.4</span> What’s to come?</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -2459,8 +2500,9 @@ <h3><span class="header-section-number">8.7.4</span> What’s to come?</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/9-hypothesis-testing.html b/docs/9-hypothesis-testing.html
index d0e145551..68fe7d8a3 100644
--- a/docs/9-hypothesis-testing.html
+++ b/docs/9-hypothesis-testing.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Chapter 9 Hypothesis Testing | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Chapter 9 Hypothesis Testing | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Chapter 9 Hypothesis Testing | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="8-confidence-intervals.html">
-<link rel="next" href="10-inference-for-regression.html">
+<link rel="prev" href="8-confidence-intervals.html"/>
+<link rel="next" href="10-inference-for-regression.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,9 +583,10 @@ <h1>
 </html>
 <div id="hypothesis-testing" class="section level1">
 <h1><span class="header-section-number">Chapter 9</span> Hypothesis Testing</h1>
-<p>Now that we’ve studied confidence intervals in Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, let’s study the commonly used method for statistical inference: hypothesis testing. Hypothesis tests allow us to take a sample of data from a population and infer about the plausibility of competing hypotheses. For example, in the upcoming “promotions” activity in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a>, you’ll study the data collected from a psychology study in the 1970’s to investigate whether there exists gender-based discrimination in promotion rates in the banking industry.</p>
+<p>Now that we’ve studied confidence intervals in Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, let’s study another commonly used method for statistical inference: hypothesis testing. Hypothesis tests allow us to take a sample of data from a population and infer about the plausibility of competing hypotheses. For example, in the upcoming “promotions” activity in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a>, you’ll study the data collected from a psychology study in the 1970s to investigate whether gender-based discrimination in promotion rates existed in the banking industry at the time of the study.</p>
 <p>The good news is we’ve already covered many of the necessary concepts to understand hypothesis testing in Chapters <a href="7-sampling.html#sampling">7</a> and <a href="8-confidence-intervals.html#confidence-intervals">8</a>. We will expand further on these ideas here and also provide a general framework for understanding hypothesis tests. By understanding this general framework, you’ll be able to adapt it to many different scenarios.</p>
-<p>The same can be said for confidence intervals. There was one general framework that applies to <em>all</em> confidence intervals and the <code>infer</code> package was designed around this framework. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same. We believe that this approach is much better for long-term learning than focusing on specific details for specific confidence intervals and as you’ll now see, hypothesis tests as well.</p>
+<p>The same can be said for confidence intervals. There was one general framework that applies to <em>all</em> confidence intervals and the <code>infer</code> package was designed around this framework. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same.</p>
+<p>We believe that this approach is much better for long-term learning than focusing on specific details for specific confidence intervals using theory-based approaches. As you’ll now see, we prefer this general framework for hypothesis tests as well.</p>
 <p>If you’d like more practice or you’re curious to see how this framework applies to different scenarios, you can find fully worked-out examples for many common hypothesis tests and their corresponding confidence intervals in Appendix B. We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional theory-based methods like the <span class="math inline">\(t\)</span>-test and normal-theory confidence intervals. You’ll see there that these traditional methods are just approximations for the computer-based methods we’ve been focusing on. However, they also require conditions to be met for their results to be valid. Computer-based methods using randomization, simulation, and bootstrapping have far fewer restrictions. Furthermore, they help develop your computational thinking, which is one big reason they are emphasized throughout this book.</p>
 <div id="needed-packages-7" class="section level3 unnumbered">
 <h3>Needed packages</h3>
@@ -585,52 +599,49 @@ <h3>Needed packages</h3>
 <li>As well as the more advanced <code>purrr</code>, <code>tibble</code>, <code>stringr</code>, and <code>forcats</code> packages</li>
 </ul>
 <p>If needed, read Section <a href="1-getting-started.html#packages">1.3</a> for information on how to install and load R packages.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
-<span class="kw">library</span>(infer)
-<span class="kw">library</span>(moderndive)
-<span class="kw">library</span>(nycflights13)
-<span class="kw">library</span>(ggplot2movies)</code></pre>
+<div class="sourceCode" id="cb351"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb351-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb351-2" data-line-number="2"><span class="kw">library</span>(infer)</a>
+<a class="sourceLine" id="cb351-3" data-line-number="3"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb351-4" data-line-number="4"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb351-5" data-line-number="5"><span class="kw">library</span>(ggplot2movies)</a></code></pre></div>
 </div>
 <div id="ht-activity" class="section level2">
 <h2><span class="header-section-number">9.1</span> Promotions activity</h2>
 <p>Let’s start with an activity studying the effect of gender on promotions at a bank.</p>
-<div id="does-gender-affect-promotions-at-bank" class="section level3">
-<h3><span class="header-section-number">9.1.1</span> Does gender affect promotions at bank?</h3>
-<p>Say you are working at a bank in the 1970’s and you are submitting your resume to apply for a promotion. Will your gender affect your chances of getting promoted? To answer this question, we’ll focus on data from a study published in the “Journal of Applied Psychology” in 1974. This data is also used in the <a href="https://www.openintro.org/">OpenIntro</a> series of statistics textbooks.</p>
-<p>To begin the study, 48 bank supervisors were asked to assume the role of a hypothetical director of a bank with multiple branches. Every one of the bank supervisors was given a resume and asked whether or not the candidate on the resume was fit to be promoted to a new position in one of their branches.</p>
-<p>However, each of these 48 resumes were identical in all respects except one: the name of the applicant at the top of the resume. 24 of the supervisors were randomly given resumes with stereotypically “male” names while 24 of the supervisors were randomly given resumes with stereotypically “female” names. Since only (binary) gender varied from resume to resume, researchers could isolate the effect of this variable in promotion rates.</p>
+<div id="does-gender-affect-promotions-at-a-bank" class="section level3">
+<h3><span class="header-section-number">9.1.1</span> Does gender affect promotions at a bank?</h3>
+<p>Say you are working at a bank in the 1970s and you are submitting your résumé to apply for a promotion. Will your gender affect your chances of getting promoted? To answer this question, we’ll focus on data from a study published in the <em>Journal of Applied Psychology</em> in 1974. This data is also used in the <a href="https://www.openintro.org/"><em>OpenIntro</em></a> series of statistics textbooks.</p>
+<p>To begin the study, 48 bank supervisors were asked to assume the role of a hypothetical director of a bank with multiple branches. Every one of the bank supervisors was given a résumé and asked whether or not the candidate on the résumé was fit to be promoted to a new position in one of their branches.</p>
+<p>However, these 48 résumés were identical in all respects except one: the name of the applicant at the top of the résumé. Of the supervisors, 24 were randomly given résumés with stereotypically “male” names, while the other 24 were randomly given résumés with stereotypically “female” names. Since only (binary) gender varied from résumé to résumé, researchers could isolate the effect of this variable on promotion rates.</p>
 <p>While many people today (including us, the authors) disagree with such binary views of gender, it is important to remember that this study was conducted at a time when more nuanced views of gender were not as prevalent. Despite this imperfection, we decided to still use this example as we feel it presents ideas still relevant today about how we could study discrimination in the workplace.</p>
-<p>The <code>moderndive</code> package contains the data on the 48 applicants in the <code>promotions</code> data frame. Let’s explore this data first:</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions</code></pre>
-<pre><code># A tibble: 48 x 3
-      id decision gender
-   &lt;int&gt; &lt;fct&gt;    &lt;fct&gt; 
- 1     1 promoted male  
- 2     2 promoted male  
- 3     3 promoted male  
- 4     4 promoted male  
- 5     5 promoted male  
- 6     6 promoted male  
- 7     7 promoted male  
- 8     8 promoted male  
- 9     9 promoted male  
-10    10 promoted male  
-# … with 38 more rows</code></pre>
-<p>The variable <code>id</code> acts as an identification variable for all 48 rows, the <code>decision</code> variable indicates whether the applicant was selected for promotion or not, while the <code>gender</code> variable indicates the gender of the name used on the resume. Recall that this data does not pertain to 24 actual men and 24 actual women, but rather 48 identical resumes of which 24 were assigned stereotypically “male” names and 24 were assigned stereotypical “female” names.</p>
-<p>Let’s perform an exploratory data analysis of the relationship between the two categorical variables <code>decision</code> and <code>gender</code>. Recall that we saw in Section <a href="2-viz.html#two-categ-barplot">2.8.3</a> that one way we can visualize such a relationship is using a stacked barplot.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(promotions, <span class="kw">aes</span>(<span class="dt">x =</span> gender, <span class="dt">fill =</span> decision)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Gender of name on resume&quot;</span>)</code></pre>
+<p>The <code>moderndive</code> package contains the data on the 48 applicants in the <code>promotions</code> data frame. Let’s explore this data by looking at six randomly selected rows:</p>
+<div class="sourceCode" id="cb352"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb352-1" data-line-number="1">promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb352-2" data-line-number="2"><span class="st">  </span><span class="kw">sample_n</span>(<span class="dt">size =</span> <span class="dv">6</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb352-3" data-line-number="3"><span class="st">  </span><span class="kw">arrange</span>(id)</a></code></pre></div>
+<pre><code># A tibble: 6 x 3
+     id decision gender
+  &lt;int&gt; &lt;fct&gt;    &lt;fct&gt; 
+1    11 promoted male  
+2    26 promoted female
+3    28 promoted female
+4    36 not      male  
+5    37 not      male  
+6    46 not      female</code></pre>
+<p>The variable <code>id</code> acts as an identification variable for all 48 rows, the <code>decision</code> variable indicates whether the applicant was selected for promotion or not, while the <code>gender</code> variable indicates the gender of the name used on the résumé. Recall that this data does not pertain to 24 actual men and 24 actual women, but rather 48 identical résumés of which 24 were assigned stereotypically “male” names and 24 were assigned stereotypically “female” names.</p>
+<p>Let’s perform an exploratory data analysis of the relationship between the two categorical variables <code>decision</code> and <code>gender</code>. Recall that we saw in Subsection <a href="2-viz.html#two-categ-barplot">2.8.3</a> that one way we can visualize such a relationship is by using a stacked barplot.</p>
+<div class="sourceCode" id="cb354"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb354-1" data-line-number="1"><span class="kw">ggplot</span>(promotions, <span class="kw">aes</span>(<span class="dt">x =</span> gender, <span class="dt">fill =</span> decision)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb354-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb354-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Gender of name on résumé&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:promotions-barplot"></span>
-<img src="moderndive_files/figure-html/promotions-barplot-1.png" alt="Barplot of relationship between gender and promotion decision." width="\textwidth" />
+<img src="ModernDive_files/figure-html/promotions-barplot-1.png" alt="Barplot relating gender to promotion decision." width="\textwidth" />
 <p class="caption">
-FIGURE 9.1: Barplot of relationship between gender and promotion decision.
+FIGURE 9.1: Barplot relating gender to promotion decision.
 </p>
 </div>
-<p>Observe in Figure <a href="9-hypothesis-testing.html#fig:promotions-barplot">9.1</a> that it appears that resumes with female names were much less likely to be accepted for promotion. Let’s quantify these promotion rates by computing the proportion of resumes accepted for promotion for each group using the <code>dplyr</code> package for data wrangling:</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(gender, decision) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>())</code></pre>
+<p>Observe in Figure <a href="9-hypothesis-testing.html#fig:promotions-barplot">9.1</a> that résumés with female names appear much less likely to have been accepted for promotion. Let’s quantify these promotion rates by computing the proportion of résumés accepted for promotion for each group using the <code>dplyr</code> package for data wrangling. Note the use of the <code>tally()</code> function here, which is a shortcut for <code>summarize(n = n())</code> to get counts.</p>
+<div class="sourceCode" id="cb355"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb355-1" data-line-number="1">promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb355-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(gender, decision) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb355-3" data-line-number="3"><span class="st">  </span><span class="kw">tally</span>()</a></code></pre></div>
 <pre><code># A tibble: 4 x 3
 # Groups:   gender [2]
   gender decision     n
@@ -639,22 +650,22 @@ <h3><span class="header-section-number">9.1.1</span> Does gender affect promotio
 2 male   promoted    21
 3 female not         10
 4 female promoted    14</code></pre>
-<p>So of the 24 resumes with male names, 21 were selected for promotion, for a proportion of 21/24 = 0.875 = 87.5%. On the other hand, of the 24 resumes with female names, 14 were selected for promotion, for a proportion of 14/24 = 0.583 = 58.3%. Comparing these two rates of promotion, it appears that resumes with male names were selected for promotion at a rate 0.875 - 0.583 = 0.292 = 29.2% higher than resumes with female names. This is suggestive of an advantage for resumes with a male name on it.</p>
-<p>The question is however, does this provide <em>conclusive</em> evidence that there is gender discrimination in promotions at banks? Could a difference in promotion rates of 29.2% still occur by chance, even in a hypothetical world where no gender-based discrimination existed? In other words, what is the role of <em>sampling variation</em>? To answer this question, we’ll again rely on a computer to run <em>simulations</em>.</p>
+<p>So of the 24 résumés with male names, 21 were selected for promotion, for a proportion of 21/24 = 0.875 = 87.5%. On the other hand, of the 24 résumés with female names, 14 were selected for promotion, for a proportion of 14/24 = 0.583 = 58.3%. Comparing these two rates of promotion, it appears that résumés with male names were selected for promotion at a rate 0.875 - 0.583 = 0.292, or 29.2 percentage points, higher than résumés with female names. This is suggestive of an advantage for résumés with male names.</p>
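+<p>As a quick check of this arithmetic, here is a minimal sketch in base R. It assumes only the four counts from the output above; the object names <code>promoted</code>, <code>total</code>, <code>prop_promoted</code>, and <code>diff_prop</code> are ours for illustration and do not come from the <code>moderndive</code> package.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Counts of promoted résumés per group, copied from the tally() output above
+promoted &lt;- c(male = 21, female = 14)
+# Each group received 24 résumés
+total &lt;- c(male = 24, female = 24)
+
+# Proportion promoted within each group
+prop_promoted &lt;- promoted / total
+round(prop_promoted, 3)
+
+# Difference in proportions (male minus female), in percentage points
+diff_prop &lt;- unname(prop_promoted[&quot;male&quot;] - prop_promoted[&quot;female&quot;])
+round(100 * diff_prop, 1)</code></pre>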
+<p>The question is, however, does this provide <em>conclusive</em> evidence that there is gender discrimination in promotions at banks? Could a difference in promotion rates of 29.2% still occur by chance, even in a hypothetical world where no gender-based discrimination existed? In other words, what is the role of <em>sampling variation</em> in this hypothesized world? To answer this question, we’ll again rely on a computer to run <em>simulations</em>.</p>
 </div>
 <div id="shuffling-once" class="section level3">
 <h3><span class="header-section-number">9.1.2</span> Shuffling once</h3>
 <p>First, try to imagine a hypothetical universe where no gender discrimination in promotions existed. In such a hypothetical universe, the gender of an applicant would have no bearing on their chances of promotion. Bringing things back to our <code>promotions</code> data frame, the <code>gender</code> variable would thus be an irrelevant label. If these <code>gender</code> labels were irrelevant, then we could randomly reassign them by “shuffling” them to no consequence!</p>
-<p>To illustrate this idea, let’s narrow our focus to six arbitrarily chosen resumes of the 48 in Table <a href="9-hypothesis-testing.html#tab:compare-six">9.1</a>. The <code>decision</code> column shows that three resumes resulted in promotion while three didn’t. The <code>gender</code> column shows what the original gender of the resume name was.</p>
-<p>However, in our hypothesized universe of no gender discrimination, gender is irrelevant and thus it is of no consequence to randomly “shuffle” the values of <code>gender</code>. The <code>shuffled_gender</code> column shows one such possible random shuffling. Observe how the number of male and female names remains the same at three each, but they are now listed in a different order.</p>
+<p>To illustrate this idea, let’s narrow our focus to 6 arbitrarily chosen résumés of the 48 in Table <a href="9-hypothesis-testing.html#tab:compare-six">9.1</a>. The <code>decision</code> column shows that 3 résumés resulted in promotion while 3 didn’t. The <code>gender</code> column shows what the original gender of the résumé name was.</p>
+<p>However, in our hypothesized universe of no gender discrimination, gender is irrelevant and thus it is of no consequence to randomly “shuffle” the values of <code>gender</code>. The <code>shuffled_gender</code> column shows one such possible random shuffling. Observe in the fourth column how the number of male and female names remains the same at 3 each, but they are now listed in a different order.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:compare-six">TABLE 9.1: </span>One example of shuffling gender variable.
+<span id="tab:compare-six">TABLE 9.1: </span>One example of shuffling gender variable
 </caption>
 <thead>
 <tr>
 <th style="text-align:right;">
-resume number
+résumé number
 </th>
 <th style="text-align:left;">
 decision
@@ -754,56 +765,53 @@ <h3><span class="header-section-number">9.1.2</span> Shuffling once</h3>
 </tr>
 </tbody>
 </table>
-<p>Again, such random shuffling of the gender label only makes sense in our hypothesized universe of no gender discrimination. How could we extend this shuffling of the gender variable to all 48 resumes by hand? One way would be by using standard deck of 52 playing cards, which we display in Figure <a href="9-hypothesis-testing.html#fig:deck-of-cards">9.2</a>.</p>
+<p>Again, such random shuffling of the gender label only makes sense in our hypothesized universe of no gender discrimination. How could we extend this shuffling of the gender variable to all 48 résumés by hand? One way would be by using a standard deck of 52 playing cards, which we display in Figure <a href="9-hypothesis-testing.html#fig:deck-of-cards">9.2</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:deck-of-cards"></span>
 <img src="images/shutterstock/shutterstock_670789453.jpg" alt="Standard deck of 52 playing cards." width="100%" />
 <p class="caption">
 FIGURE 9.2: Standard deck of 52 playing cards.
 </p>
 </div>
-<p>Since half the cards are red and the other half are black, by removing 2 red cards and 2 black cards, we would end up with 24 red cards and 24 black cards. After shuffling these 48 cards as seen in Figure <a href="9-hypothesis-testing.html#fig:shuffling">9.3</a>, we can flip the cards over one-by-one, assigning “male” for each red card and “female” for each black card.</p>
+<p>Since half the cards are red (diamonds and hearts) and the other half are black (spades and clubs), by removing two red cards and two black cards, we would end up with 24 red cards and 24 black cards. After shuffling these 48 cards as seen in Figure <a href="9-hypothesis-testing.html#fig:shuffling">9.3</a>, we can flip the cards over one-by-one, assigning “male” for each red card and “female” for each black card.</p>
 <div class="figure" style="text-align: center"><span id="fig:shuffling"></span>
-<img src="images/shutterstock/shutterstock_128283971.jpg" alt="Shuffling a deck of cards." width="66%" />
+<img src="images/shutterstock/shutterstock_128283971.jpg" alt="Shuffling a deck of cards." width="100%" height="100%" />
 <p class="caption">
 FIGURE 9.3: Shuffling a deck of cards.
 </p>
 </div>
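+<p>As a rough sketch of the same idea on a computer, the card shuffle can be mimicked in base R. The vector names <code>gender</code> and <code>shuffled_gender</code> and the seed value are our own illustrative choices here, not objects from the <code>moderndive</code> package.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># 24 red cards stand for &quot;male&quot;, 24 black cards stand for &quot;female&quot;
+gender &lt;- rep(c(&quot;male&quot;, &quot;female&quot;), each = 24)
+
+# Arbitrary seed (an assumption), only so that the shuffle is reproducible
+set.seed(76)
+
+# With its default arguments, sample() permutes the vector without replacement
+shuffled_gender &lt;- sample(gender)
+
+# Still 24 of each label; only their order has changed
+table(shuffled_gender)</code></pre>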
 <!--
-Going back to our index cards, pick up each of the 24 cards corresponding to males and females that you placed on top of the manager cards. The next step is to put the two stacks of index cards together, creating a new set of 48 cards.  If we assume that the two population means are equal, we are saying that there is no association between promotion and gender (male vs female). If there really is no association between these two variables than for each of the 48 managers, it wouldn't matter whether they saw the name of a male or female candidate on the resume they were given. They'd each be equally likely of granting a promotion for each of the two binary genders. So how do we do this with the cards?
+Going back to our index cards, pick up each of the 24 cards corresponding to males and females that you placed on top of the supervisor cards. The next step is to put the two stacks of index cards together, creating a new set of 48 cards.  If we assume that the two population means are equal, we are saying that there is no association between promotion and gender (male vs female). If there really is no association between these two variables then for each of the 48 managers, it wouldn't matter whether they saw the name of a male or female candidate on the résumé they were given. They'd each be equally likely of granting a promotion for each of the two binary genders. So how do we do this with the cards?
 
 Now that we have our 48 cards corresponding to gender in a single pile, shuffle them. Feel free to do this a couple times. Now take each of the cards off the top of the pile and assign them to the 48 different supervisors. Keep the supervisor cards in the same place they were before. We are, thus, randomly assigning the different values of the **explanatory** variable to each of the entries of the **response** variable. To reiterate, we hold the response variable of `promotion` fixed by not shuffling those cards but we shuffle the values of `gender` as the explanatory variable. Let's check out what the first few rows of this permutation of the gender cards onto the supervisors might look like as data.
 -->
-<p>We’ve saved one such shuffling in the <code>promotions_shuffled</code> data frame of the <code>moderndive</code> package. If you view both the original <code>promotions</code> and the shuffled <code>promotions_shuffled</code> data frames and compare them, you’ll see that while the <code>decision</code> variables are identical, the <code>gender</code> variables are different.</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions_shuffled</code></pre>
-<pre><code># A tibble: 48 x 3
-      id decision gender
-   &lt;int&gt; &lt;fct&gt;    &lt;fct&gt; 
- 1     1 promoted female
- 2     2 promoted female
- 3     3 promoted male  
- 4     4 promoted female
- 5     5 promoted male  
- 6     6 promoted male  
- 7     7 promoted male  
- 8     8 promoted female
- 9     9 promoted male  
-10    10 promoted female
-# … with 38 more rows</code></pre>
-<p>Let’s repeat the same exploratory data analysis we did for the original <code>promotions</code> data on our shuffled <code>promotions_shuffled</code> data frame. Let’s create a barplot visualizing the relationship between <code>decision</code> and the new shuffled <code>gender</code> variable and compare this to the original unshuffled version in Figure <a href="9-hypothesis-testing.html#fig:promotions-barplot-permuted">9.4</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(promotions_shuffled, <span class="kw">aes</span>(<span class="dt">x =</span> gender, <span class="dt">fill =</span> decision)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Gender of resume name&quot;</span>)</code></pre>
+<p>We’ve saved one such shuffling in the <code>promotions_shuffled</code> data frame of the <code>moderndive</code> package. If you compare the original <code>promotions</code> and the shuffled <code>promotions_shuffled</code> data frames, you’ll see that while the <code>decision</code> variable is identical, the <code>gender</code> variable has changed.</p>
+<!--
+Albert: not sure what this is for?
+Let's look at the six rows that we selected at random before when viewing the `promotions` data frame using the `slice()` function in the `dplyr` package.
+
+Chester: Just as a way to show that the particular entries that we should before have now changed via a shuffling. It was to make sure readers could actually see the shuffle happen since it's a little harder to see looking at the entire data frames. Fine with not including it though.
+
+
+```r
+promotions_shuffled %>% slice(c(11, 26, 28, 36, 37, 46))
+```
+-->
+<p>Let’s repeat the same exploratory data analysis we did for the original <code>promotions</code> data on our <code>promotions_shuffled</code> data frame. Let’s create a barplot visualizing the relationship between <code>decision</code> and the new shuffled <code>gender</code> variable and compare this to the original unshuffled version in Figure <a href="9-hypothesis-testing.html#fig:promotions-barplot-permuted">9.4</a>.</p>
+<div class="sourceCode" id="cb357"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb357-1" data-line-number="1"><span class="kw">ggplot</span>(promotions_shuffled, </a>
+<a class="sourceLine" id="cb357-2" data-line-number="2">       <span class="kw">aes</span>(<span class="dt">x =</span> gender, <span class="dt">fill =</span> decision)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb357-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb357-4" data-line-number="4"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Gender of résumé name&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:promotions-barplot-permuted"></span>
-<img src="moderndive_files/figure-html/promotions-barplot-permuted-1.png" alt="Barplots of relationship of promotion with gender (left) and shuffled gender (right)." width="\textwidth" />
+<img src="ModernDive_files/figure-html/promotions-barplot-permuted-1.png" alt="Barplots of relationship of promotion with gender (left) and shuffled gender (right)." width="\textwidth" />
 <p class="caption">
 FIGURE 9.4: Barplots of relationship of promotion with gender (left) and shuffled gender (right).
 </p>
 </div>
 <p>It appears the difference in “male names” versus “female names” promotion rates is now different. Compared to the original data in the left barplot, the new “shuffled” data in the right barplot has promotion rates that are much more similar.</p>
-<p>Let’s also compute the proportion of resumes accepted for promotion for each group:</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions_shuffled <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(gender, decision) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>())</code></pre>
+<p>Let’s also compute the proportion of résumés accepted for promotion for each group:</p>
+<div class="sourceCode" id="cb358"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb358-1" data-line-number="1">promotions_shuffled <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb358-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(gender, decision) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb358-3" data-line-number="3"><span class="st">  </span><span class="kw">tally</span>() <span class="co"># Same as summarize(n = n())</span></a></code></pre></div>
 <pre><code># A tibble: 4 x 3
 # Groups:   gender [2]
   gender decision     n
@@ -812,21 +820,22 @@ <h3><span class="header-section-number">9.1.2</span> Shuffling once</h3>
 2 male   promoted    18
 3 female not          7
 4 female promoted    17</code></pre>
-<p>So in this hypothetical universe of no discrimination, 18/24 = 0.75 = 75% of “male” resumes were selected for promotion. On the other hand, 17/24 = 0.708 = 70.8% of “female” resumes were selected for promotion. Comparing these two values, it appears that resumes with male names were selected for promotion at a rate that was 0.75 - 0.708 = 0.042 = 4.2% different that resumes with female names.</p>
-<p>Observe how this difference in rates is different than the difference in rates of 0.292 = 29.2% we originally observed. This is once again due to <em>sampling variation</em>. How can we better understand the effect of this sampling variation? By repeating this shuffling several times!</p>
+<p>So in this hypothetical universe of no discrimination, <span class="math inline">\(18/24 = 0.75 = 75\%\)</span> of “male” résumés were selected for promotion. On the other hand, <span class="math inline">\(17/24 = 0.708 = 70.8\%\)</span> of “female” résumés were selected for promotion.</p>
+<p>Let’s next compare these two values. It appears that résumés with stereotypically male names were selected for promotion at a rate that was <span class="math inline">\(0.75 - 0.708 = 0.042 = 4.2\%\)</span> different than résumés with stereotypically female names.</p>
+<p>Observe how this difference in rates is not the same as the difference in rates of 0.292 = 29.2% we originally observed. This is once again due to <em>sampling variation</em>. How can we better understand the effect of this sampling variation? By repeating this shuffling several times!</p>
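+<p>The arithmetic above can be checked with a few lines of base R. This is only a sketch: the counts are typed in by hand from the tally shown earlier, and the object names <code>promoted</code>, <code>n_resumes</code>, and <code>diff_shuffled</code> are hypothetical.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Promoted counts read off the tally of promotions_shuffled
+promoted &lt;- c(male = 18, female = 17)
+# Each group has 24 résumés
+n_resumes &lt;- c(male = 24, female = 24)
+
+# Promotion rate within each group
+p_hat &lt;- promoted / n_resumes
+
+# Difference in promotion rates, male minus female
+diff_shuffled &lt;- unname(p_hat[&quot;male&quot;] - p_hat[&quot;female&quot;])
+round(diff_shuffled, 3)  # 0.042</code></pre>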
 </div>
 <div id="shuffling-16-times" class="section level3">
 <h3><span class="header-section-number">9.1.3</span> Shuffling 16 times</h3>
-<p>We recruited 16 groups of our friends to repeat this shuffling exercise. They recorded these values in a <a href="https://docs.google.com/spreadsheets/d/1Q-ENy3o5IrpJshJ7gn3hJ5A0TOWV2AZrKNHMsshQtiE/">shared spreadsheet</a>; we display a snapshot of the first 10 rows and 5 columns in Figure <a href="9-hypothesis-testing.html#fig:tactile-shuffling">9.5</a></p>
+<p>We recruited 16 groups of our friends to repeat this shuffling exercise. They recorded these values in a <a href="https://docs.google.com/spreadsheets/d/1Q-ENy3o5IrpJshJ7gn3hJ5A0TOWV2AZrKNHMsshQtiE/">shared spreadsheet</a>; we display a snapshot of the first 10 rows and 5 columns in Figure <a href="9-hypothesis-testing.html#fig:tactile-shuffling">9.5</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:tactile-shuffling"></span>
-<img src="images/sampling/promotions/shared_spreadsheet.png" alt="Snapshot of shared spreadsheet of shuffling results." width="100%" />
+<img src="images/sampling/promotions/shared_spreadsheet.png" alt="Snapshot of shared spreadsheet of shuffling results (m for male, f for female)." width="100%" />
 <p class="caption">
-FIGURE 9.5: Snapshot of shared spreadsheet of shuffling results.
+FIGURE 9.5: Snapshot of shared spreadsheet of shuffling results (m for male, f for female).
 </p>
 </div>
-<p>For each of these 16 columns of “shuffles”, we computed the difference in promotion rates, and in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-1">9.6</a> we display their distribution in a histogram. We also mark the observed difference in promotion rate that happened in real-life of 0.292 = 29.2% with a red line.</p>
+<p>For each of these 16 columns of <em>shuffles</em>, we computed the difference in promotion rates, and in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-1">9.6</a> we display their distribution in a histogram. We also mark the observed difference in promotion rate that occurred in real life of 0.292 = 29.2% with a dark line.</p>
 <div class="figure" style="text-align: center"><span id="fig:null-distribution-1"></span>
-<img src="moderndive_files/figure-html/null-distribution-1-1.png" alt="Distribution of shuffled differences in promotions." width="\textwidth" />
+<img src="ModernDive_files/figure-html/null-distribution-1-1.png" alt="Distribution of shuffled differences in promotions." width="\textwidth" />
 <p class="caption">
 FIGURE 9.6: Distribution of shuffled differences in promotions.
 </p>
@@ -834,7 +843,7 @@ <h3><span class="header-section-number">9.1.3</span> Shuffling 16 times</h3>
 <p>Before we discuss the distribution of the histogram, we emphasize the key thing to remember: this histogram represents differences in promotion rates that one would observe in our <em>hypothesized universe</em> of no gender discrimination.</p>
 <p>Observe first that the histogram is roughly centered at 0. Saying that the difference in promotion rates is 0 is equivalent to saying that both genders had the same promotion rate. In other words, the center of these 16 values is consistent with what we would expect in our hypothesized universe of no gender discrimination.</p>
 <p>However, while the values are centered at 0, there is variation about 0. This is because even in a hypothesized universe of no gender discrimination, you will still likely observe small differences in promotion rates because of chance <em>sampling variation</em>. Looking at the histogram in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-1">9.6</a>, such differences could even be as extreme as -0.292 or 0.208.</p>
-<p>Turning our attention to what we observed in real-life: the difference of 0.292 = 29.2% is marked with a red line. Ask yourself: in a hypothesized world of no gender discrimination, how likely would it be that we observe this difference? While opinions may differ, in our opinion not often! Now ask yourself: what does these results say about our hypothesized universe of no gender discrimination?</p>
+<p>Turning our attention to what we observed in real life: the difference of 0.292 = 29.2% is marked with a vertical dark line. Ask yourself: in a hypothesized world of no gender discrimination, how likely would it be that we observe this difference? While opinions here may differ, in our opinion not often! Now ask yourself: what do these results say about our hypothesized universe of no gender discrimination?</p>
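+<p>To get a feel for what our 16 groups of friends did by hand, here is a minimal base R sketch of 16 shuffles. It assumes that 35 of the 48 résumés were promoted, which follows from the counts in this section (18 + 17 in the shuffled tally). The helper <code>diff_in_props()</code>, the object names, and the seed are hypothetical, and the simulated values will not match the spreadsheet in Figure 9.5.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Assumption: 35 promoted and 13 not promoted, as implied by the tallies
+decision &lt;- rep(c(&quot;promoted&quot;, &quot;not&quot;), times = c(35, 13))
+gender &lt;- rep(c(&quot;male&quot;, &quot;female&quot;), each = 24)
+
+# Hypothetical helper: male promotion rate minus female promotion rate
+diff_in_props &lt;- function(gender, decision) {
+  mean(decision[gender == &quot;male&quot;] == &quot;promoted&quot;) -
+    mean(decision[gender == &quot;female&quot;] == &quot;promoted&quot;)
+}
+
+# Arbitrary seed (an assumption), only for reproducibility
+set.seed(16)
+
+# Shuffle the gender labels 16 times, keeping the decisions fixed
+shuffled_diffs &lt;- replicate(16, diff_in_props(sample(gender), decision))
+
+# Histogram of the 16 shuffled differences with the observed 0.292 marked
+hist(shuffled_diffs, xlab = &quot;Difference in promotion rates&quot;)
+abline(v = 0.292, lwd = 2)</code></pre>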
 <!-- 
 Now each of our 33 friends does the following:
 
@@ -862,16 +871,16 @@ <h3><span class="header-section-number">9.1.3</span> Shuffling 16 times</h3>
 
 
 
-We see that of the 33 samples we selected only one is close to as extreme as what we observed. Thus, we might guess that we are starting to see some data suggesting that gender discrimination might be at play. Many the statistics calculated appear close to 0 with the vast remainder appearing around values of a difference of -0.1 and 0.1. So what further evidence would we need to make this suggestion a little clearer? More simulations! As we've done before in Chapters \@ref(sampling) and \@ref(confidence-intervals), we'll use the computer to simulate these permutations and calculations many times. Let's do just that with the `infer` package in the next section.
+We see that of the 33 samples we selected only one is close to as extreme as what we observed. Thus, we might guess that we are starting to see some data suggesting that gender discrimination might be at play. Many of the statistics calculated appear close to 0 with the vast remainder appearing around values of a difference of -0.1 and 0.1. So what further evidence would we need to make this suggestion a little clearer? More simulations! As we've done before in Chapters \@ref(sampling) and \@ref(confidence-intervals), we'll use the computer to simulate these permutations and calculations many times. Let's do just that with the `infer` package in the next section.
 -->
 </div>
 <div id="what-did-we-just-do-2" class="section level3">
 <h3><span class="header-section-number">9.1.4</span> What did we just do?</h3>
-<p>What we just demonstrated in this activity is the statistical procedure known as <em>hypothesis testing</em> using a <em>permutation test</em>. The term “permutation”  is the mathematical term for “shuffling”: take a series of values and reorder them randomly, as you did with the playing cards.</p>
+<p>What we just demonstrated in this activity is the statistical procedure known as <em>hypothesis testing</em> using a <em>permutation test</em>. The term “permutation”  is the mathematical term for “shuffling”: taking a series of values and reordering them randomly, as you did with the playing cards.</p>
 <p>In fact, permutations are another form of <em>resampling</em>, like the bootstrap method you performed in Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>. While the bootstrap method involves resampling <em>with</em> replacement, permutation methods involve resampling <em>without</em> replacement.</p>
 <p>Think of our exercise involving the slips of paper representing pennies and the hat in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a>: after sampling a penny, you put it back in the hat. Now think of our deck of cards. After drawing a card, you laid it out in front of you, recorded the color, and then you <em>did not</em> put it back in the deck.</p>
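+<p>The contrast between the two kinds of resampling can be sketched in base R with the <code>replace</code> argument of <code>sample()</code>. The vector names and the seed below are illustrative assumptions.</p>
+<pre class="sourceCode r"><code class="sourceCode r">labels &lt;- rep(c(&quot;male&quot;, &quot;female&quot;), each = 24)
+
+# Arbitrary seed (an assumption), only for reproducibility
+set.seed(2019)
+
+# Permutation: resampling without replacement, like dealing out the cards
+permuted &lt;- sample(labels)
+
+# Bootstrap: resampling with replacement, like returning each penny to the hat
+bootstrapped &lt;- sample(labels, replace = TRUE)
+
+table(permuted)      # always 24 of each label
+table(bootstrapped)  # counts can drift away from 24 and 24</code></pre>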
-<p>In our previous example, we tested the validity of the hypothesized universe of no gender discrimination. The evidence contained in our observed sample of 48 resumes was somewhat inconsistent with our hypothesized universe. Thus, we would be inclined to <em>reject</em> this hypothesized universe and declare that the evidence suggests there is gender discrimination.</p>
-<p>Recall our case study on whether yawning is contagious from Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a>. The previous example involves inference about an unknown difference of population proportions as well. This time it will be <span class="math inline">\(p_{m} - p_{f}\)</span>, where <span class="math inline">\(p_{m}\)</span> is the population proportion of resumes with male names being recommended for promotion and <span class="math inline">\(p_{f}\)</span> is the equivalent for resumes with female names. Recall that this is one of the scenarios for inference we’ve seen so far in Table <a href="9-hypothesis-testing.html#tab:table-diff-prop">9.2</a>.</p>
+<p>In our previous example, we tested the validity of the hypothesized universe of no gender discrimination. The evidence contained in our observed sample of 48 résumés was somewhat inconsistent with our hypothesized universe. Thus, we would be inclined to <em>reject</em> this hypothesized universe and declare that the evidence suggests there is gender discrimination.</p>
+<p>Recall our case study on whether yawning is contagious from Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a>. The previous example involves inference about an unknown difference of population proportions as well. This time, it will be <span class="math inline">\(p_{m} - p_{f}\)</span>, where <span class="math inline">\(p_{m}\)</span> is the population proportion of résumés with male names being recommended for promotion and <span class="math inline">\(p_{f}\)</span> is the equivalent for résumés with female names. Recall that this is one of the scenarios for inference we’ve seen so far in Table <a href="9-hypothesis-testing.html#tab:table-diff-prop">9.2</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:table-diff-prop">TABLE 9.2: </span>Scenarios of sampling for inference
@@ -891,7 +900,7 @@ <h3><span class="header-section-number">9.1.4</span> What did we just do?</h3>
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -949,23 +958,23 @@ <h3><span class="header-section-number">9.1.4</span> What did we just do?</h3>
 </tr>
 </tbody>
 </table>
-<p>So based on our sample of <span class="math inline">\(n_m\)</span> = 24 “male” applicants and <span class="math inline">\(n_w\)</span> = 24 “female” applicants, the <em>point estimate</em> for <span class="math inline">\(p_{m} - p_{f}\)</span> is the <em>difference in sample proportions</em> <span class="math inline">\(\widehat{p}_{m} -\widehat{p}_{f}\)</span> = 0.875 - 0.583 = 0.292 = 29.2%. This difference in favor of “male” resumes of 0.292 is greater than 0, suggesting discrimination in favor of men.</p>
-<p>However the question we asked ourselves was “is this difference meaningfully different than 0?” In other words, is that difference indicative of true discrimination, or can we just attribute it to <em>sampling variation</em>? Hypothesis testing allows us to make such distinctions.</p>
+<p>So, based on our sample of <span class="math inline">\(n_m\)</span> = 24 “male” applicants and <span class="math inline">\(n_f\)</span> = 24 “female” applicants, the <em>point estimate</em> for <span class="math inline">\(p_{m} - p_{f}\)</span> is the <em>difference in sample proportions</em> <span class="math inline">\(\widehat{p}_{m} -\widehat{p}_{f}\)</span> = 0.875 - 0.583 = 0.292 = 29.2%. This difference in favor of “male” résumés of 0.292 is greater than 0, suggesting discrimination in favor of men.</p>
+<p>However, the question we asked ourselves was “is this difference meaningfully greater than 0?” In other words, is that difference indicative of true discrimination, or can we just attribute it to <em>sampling variation</em>? Hypothesis testing allows us to make such distinctions.</p>
 </div>
 </div>
 <div id="understanding-ht" class="section level2">
 <h2><span class="header-section-number">9.2</span> Understanding hypothesis tests</h2>
-<p>Much like the terminology, notation, and definitions relating to sampling you saw in Section <a href="7-sampling.html#sampling-framework">7.3</a>, there is a lot of terminology, notation, and definitions related to hypothesis testing. Learning these may seem like a very daunting task at first. However with practice, practice, and practice, anyone can master them.</p>
-<p>First, a <strong>hypothesis</strong>  is a statement about the value of an unknown population parameter. In our resume activity, our population parameter of interest is the difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span>. Hypothesis tests can involve any of the population parameters in Table <a href="7-sampling.html#tab:table-ch8">7.5</a> of the 6 inference scenarios we’ll cover in this book and more.</p>
-<p>Second, a <strong>hypothesis test</strong>  consists of a test between two competing hypotheses: 1) a <strong>null hypothesis</strong> <span class="math inline">\(H_0\)</span> (pronounced “H-naught”) versus 2) an <strong>alternative hypothesis</strong> <span class="math inline">\(H_A\)</span> (also denoted <span class="math inline">\(H_1\)</span>).</p>
-<p>Generally the null hypothesis  is a claim that there is “no effect” or “no difference of interest.” In many cases, the null hypothesis represents the status quo or a situation that nothing interesting is happening. Furthermore, generally the alternative hypothesis  is the claim the experimenter or researcher wants to establish or find evidence to support. It is viewed as a “challenger” hypothesis to the null hypothesis <span class="math inline">\(H_0\)</span>. In our resume activity, an appropriate hypothesis test would be:</p>
+<p>Much like the terminology, notation, and definitions relating to sampling you saw in Section <a href="7-sampling.html#sampling-framework">7.3</a>, there are a lot of terminology, notation, and definitions related to hypothesis testing as well. Learning these may seem like a very daunting task at first. However, with practice, practice, and more practice, anyone can master them.</p>
+<p>First, a <strong>hypothesis</strong>  is a statement about the value of an unknown population parameter. In our résumé activity, our population parameter of interest is the difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span>. Hypothesis tests can involve any of the population parameters in Table <a href="7-sampling.html#tab:table-ch8">7.5</a> of the five inference scenarios we’ll cover in this book and also more advanced types we won’t cover here.</p>
+<p>Second, a <strong>hypothesis test</strong>  consists of a test between two competing hypotheses: (1) a <strong>null hypothesis</strong> <span class="math inline">\(H_0\)</span> (pronounced “H-naught”) versus (2) an <strong>alternative hypothesis</strong> <span class="math inline">\(H_A\)</span> (also denoted <span class="math inline">\(H_1\)</span>).</p>
+<p>Generally, the null hypothesis  is a claim that there is “no effect” or “no difference of interest.” In many cases, the null hypothesis represents the status quo or a situation in which nothing interesting is happening. Furthermore, the alternative hypothesis  is generally the claim the experimenter or researcher wants to establish or find evidence to support. It is viewed as a “challenger” hypothesis to the null hypothesis <span class="math inline">\(H_0\)</span>. In our résumé activity, an appropriate hypothesis test would be:</p>
 <p><span class="math display">\[
 \begin{aligned}
 H_0 &amp;: \text{men and women are promoted at the same rate}\\
 \text{vs } H_A &amp;: \text{men are promoted at a higher rate than women}
 \end{aligned}
 \]</span></p>
-<p>Note some of the choices we have made. First, we set the null hypothesis <span class="math inline">\(H_0\)</span> to be that there is no difference in promotion rate and the “challenger” alternative hypothesis <span class="math inline">\(H_A\)</span> to be that there is a difference. While it would not be wrong in principle to reverse the two, it is a convention in statistical inference that the null hypothesis is set to reflect a “null” situation where “nothing is going on.” As we discussed earlier, in this case, that there is no difference in promotion rates. Furthermore we set <span class="math inline">\(H_A\)</span> to be that men are promoted at a <em>higher</em> rate, a subjective choice reflecting a prior suspicion we have that this is the case. We call such alternative hypotheses  <em>one-sided alternatives</em>. If someone else however does not share such suspicions and only wants to investigate that there is a difference, whether higher or lower, they would set what is known as a  <em>two-sided alternative</em>.</p>
+<p>Note some of the choices we have made. First, we set the null hypothesis <span class="math inline">\(H_0\)</span> to be that there is no difference in promotion rate and the “challenger” alternative hypothesis <span class="math inline">\(H_A\)</span> to be that there is a difference. While it would not be wrong in principle to reverse the two, it is a convention in statistical inference that the null hypothesis is set to reflect a “null” situation where “nothing is going on.” As we discussed earlier, in this case, <span class="math inline">\(H_0\)</span> corresponds to there being no difference in promotion rates. Furthermore, we set <span class="math inline">\(H_A\)</span> to be that men are promoted at a <em>higher</em> rate, a subjective choice reflecting a prior suspicion we have that this is the case. We call such alternative hypotheses  <em>one-sided alternatives</em>. If someone else, however, does not share such suspicions and only wants to investigate whether there is a difference, be it higher or lower, they would set what is known as a  <em>two-sided alternative</em>.</p>
 <p>We can re-express the formulation of our hypothesis test using the mathematical notation for our population parameter of interest, the difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span>:</p>
 <p><span class="math display">\[
 \begin{aligned}
@@ -973,31 +982,32 @@ <h2><span class="header-section-number">9.2</span> Understanding hypothesis test
 \text{vs } H_A&amp;: p_{m} - p_{f} &gt; 0
 \end{aligned}
 \]</span></p>
-<p>Observe how the alternative hypothesis <span class="math inline">\(H_A\)</span> is one-sided <span class="math inline">\(p_{m} - p_{f} &gt; 0\)</span>. Had we opted for a two-sided alternative, we would have set <span class="math inline">\(p_{m} - p_{f} \neq 0\)</span>. To keep things simple for now, we’ll stick with the simpler one-sided alternative. We’ll present an example of a two-sided alternative in Section <a href="9-hypothesis-testing.html#ht-case-study">9.5</a>.</p>
-<p>Third, a <strong>test statistic</strong>  is a <em>point estimate/sample statistic</em> formula used for hypothesis testing. Note that a sample statistic is merely a summary statistic based on a sample of observations. Recall we saw in Section <a href="3-wrangling.html#summarize">3.3</a> that a summary statistic takes in many values and returns only one. Here, the sample would be the <span class="math inline">\(n_m\)</span> = 24 resumes with male names and the <span class="math inline">\(n_f\)</span> = 24 resumes with female names. Hence, the point estimate of interest is the difference in sample proportions <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span>.</p>
-<p>Fourth, the <strong>observed test statistic</strong>  is the value of the test statistic that we observed in real-life. In our case, we computed this value using the data saved in the <code>promotions</code> data frame. It was the observed difference of <span class="math inline">\(\widehat{p}_{m} -\widehat{p}_{f}\)</span> = 0.875 - 0.583 = 0.292 = 29.2% in favor of resumes with male names.</p>
-<p>Fifth, the <strong>null distribution</strong>  is the sampling distribution of the test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>. Ooof! That’s a long one! Let’s unpack it slowly. The key to understanding the null distribution is that the null hypothesis <span class="math inline">\(H_0\)</span> <em>assumed</em> to be true. We’re not saying that <span class="math inline">\(H_0\)</span> is true at this point, we’re only assuming it to be true for hypothesis testing purposes. In our case, this corresponds to our hypothesized universe of no gender discrimination in promotion rates. Assuming the null hypothesis <span class="math inline">\(H_0\)</span>, also stated as “Under <span class="math inline">\(H_0\)</span>,” how does the test statistic vary due to sampling variation? In our case, how will the difference in sample proportions <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> vary due to sampling? Recall from Section <a href="7-sampling.html#sampling-definitions">7.3.2</a> that distributions that display how point estimates vary due to sampling variation are called <em>sampling distributions</em>. The only additional thing to keep in mind about null distributions is that they are sampling distributions <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
-<p>In our case, we previously visualized a null distribution in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-1">9.6</a>, which we re-display in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-2">9.7</a> using our new notation and terminology. It is the distribution of the 16 different difference in sample proportions our friends computed <em>assuming</em> a hypothetical universe of no gender discrimination. We also mark the value of the observed test statistic of 0.292 with a vertical line.</p>
+<p>Observe how the alternative hypothesis <span class="math inline">\(H_A\)</span> is one-sided with <span class="math inline">\(p_{m} - p_{f} &gt; 0\)</span>. Had we opted for a two-sided alternative, we would have set <span class="math inline">\(p_{m} - p_{f} \neq 0\)</span>. To keep things simple for now, we’ll stick with the simpler one-sided alternative. We’ll present an example of a two-sided alternative in Section <a href="9-hypothesis-testing.html#ht-case-study">9.5</a>.</p>
+<p>Third, a <strong>test statistic</strong>  is a <em>point estimate/sample statistic</em> formula used for hypothesis testing. Note that a sample statistic is merely a summary statistic based on a sample of observations. Recall we saw in Section <a href="3-wrangling.html#summarize">3.3</a> that a summary statistic takes in many values and returns only one. Here, the samples would be the <span class="math inline">\(n_m\)</span> = 24 résumés with male names and the <span class="math inline">\(n_f\)</span> = 24 résumés with female names. Hence, the point estimate of interest is the difference in sample proportions <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span>.</p>
+<p>Fourth, the <strong>observed test statistic</strong>  is the value of the test statistic that we observed in real life. In our case, we computed this value using the data saved in the <code>promotions</code> data frame. It was the observed difference of <span class="math inline">\(\widehat{p}_{m} -\widehat{p}_{f} = 0.875 - 0.583 = 0.292 = 29.2\%\)</span> in favor of résumés with male names.</p>
+<p>Fifth, the <strong>null distribution</strong> is the sampling distribution of the test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>. Ooof! That’s a long one! Let’s unpack it slowly. The key to understanding the null distribution is that the null hypothesis <span class="math inline">\(H_0\)</span> is <em>assumed</em> to be true. We’re not saying that <span class="math inline">\(H_0\)</span> is true at this point; we’re only assuming it to be true for hypothesis testing purposes. In our case, this corresponds to our hypothesized universe of no gender discrimination in promotion rates. Assuming the null hypothesis <span class="math inline">\(H_0\)</span>, also stated as “Under <span class="math inline">\(H_0\)</span>,” how does the test statistic vary due to sampling variation? In our case, how will the difference in sample proportions <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> vary due to sampling under <span class="math inline">\(H_0\)</span>? Recall from Subsection <a href="7-sampling.html#sampling-definitions">7.3.2</a> that distributions displaying how point estimates vary due to sampling variation are called <em>sampling distributions</em>. The only additional thing to keep in mind about null distributions is that they are sampling distributions <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
+<p>In our case, we previously visualized a null distribution in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-1">9.6</a>, which we re-display in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-2">9.7</a> using our new notation and terminology. It is the distribution of the 16 differences in sample proportions our friends computed <em>assuming</em> a hypothetical universe of no gender discrimination. We also mark the value of the observed test statistic of 0.292 with a vertical line.</p>
 <div class="figure" style="text-align: center"><span id="fig:null-distribution-2"></span>
-<img src="moderndive_files/figure-html/null-distribution-2-1.png" alt="Null distribution and observed test statistic." width="\textwidth" />
+<img src="ModernDive_files/figure-html/null-distribution-2-1.png" alt="Null distribution and observed test statistic." width="\textwidth" />
 <p class="caption">
 FIGURE 9.7: Null distribution and observed test statistic.
 </p>
 </div>
-<p>Sixth, the <strong>p-value</strong>  is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>. Double ooof! Let’s unpack this slowly as well. You can think of the p-value as a quantification of “surprise”: assuming <span class="math inline">\(H_0\)</span> is true, how surprised are we with what we observed? Or in our case, in our hypothesized universe of no gender discrimination, how surprised are we that we observed a difference in promotion rates of 0.292? Very surprised? Somewhat surprised?</p>
-<p>The p-value quantifies this probability, or in the case of our 16 differences in sample proportions in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-2">9.7</a>, what proportion had a more “extreme” result? Here, extreme is defined in terms of the alternative hypothesis <span class="math inline">\(H_A\)</span> that “male” applicants are promoted at a higher rate than “female” applicants. In other words, how often was the discrimination in favor of men even more pronounced than 0.875 - 0.583 = 0.292 = 29.2%?</p>
-<p>In this case, 0 times out of 16 did we obtain a difference in proportion greater than or equal to the observed difference of 0.292 = 29.2%. A very rare outcome! Given the rarity of such a pronounced in difference in promotion rates in our hypothesized universe of no gender discrimination, we’re inclined to <em>reject</em>  our hypothesized universe in favor of one stating there is discrimination in favor of the “male” applicants. In other words, we reject <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span>.</p>
+<p>Sixth, the <strong><span class="math inline">\(p\)</span>-value</strong>  is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>. Double ooof! Let’s unpack this slowly as well. You can think of the <span class="math inline">\(p\)</span>-value as a quantification of “surprise”: assuming <span class="math inline">\(H_0\)</span> is true, how surprised are we with what we observed? Or in our case, in our hypothesized universe of no gender discrimination, how surprised are we that we observed a difference in promotion rates of 0.292 from our collected samples assuming <span class="math inline">\(H_0\)</span> is true? Very surprised? Somewhat surprised?</p>
+<p>The <span class="math inline">\(p\)</span>-value quantifies this probability, or in the case of our 16 differences in sample proportions in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-2">9.7</a>, what proportion had a more “extreme” result? Here, extreme is defined in terms of the alternative hypothesis <span class="math inline">\(H_A\)</span> that “male” applicants are promoted at a higher rate than “female” applicants. In other words, how often was the discrimination in favor of men <em>even more</em> pronounced than <span class="math inline">\(0.875 - 0.583 = 0.292 = 29.2\%\)</span>?</p>
+<p>In this case, none of the 16 shuffles (0 times out of 16) produced a difference in proportions greater than or equal to the observed difference of 0.292 = 29.2%. A very rare (in fact, not occurring) outcome! Given the rarity of such a pronounced difference in promotion rates in our hypothesized universe of no gender discrimination, we’re inclined to <em>reject</em> our hypothesized universe. Instead, we favor the hypothesis stating there is discrimination in favor of the “male” applicants. In other words, we reject <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span>.</p>
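+<p>Written out as a formula, and counting only the 16 shuffled differences (not the observed one itself), this proportion is roughly:</p>
+<p><span class="math display">\[
+p\text{-value} \approx \frac{\text{number of shuffled differences} \geq 0.292}{\text{number of shuffles}} = \frac{0}{16} = 0
+\]</span></p>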
 <!--
-TODO: Figure out how to weave this in.
+TODO: Including observed test stat in p-value computation
 
-We'll see later on however, the p-value isn't quite 1/16, but rather (0 + 1)/(16 + 1) = 1/17 = 0.059 as we need to include the observed test statistic in our calculation. 
+As we'll see later on, however, the $p$-value isn't quite 0/16 = 0, but rather (0 + 1)/(16 + 1) = 1/17 = 0.059 as we need to include the observed test statistic in our calculation. 
 -->
-<p>Seventh and lastly, in many hypothesis testing procedures, it is commonly recommended to set the <strong>significance level</strong>  of the test beforehand. It is denoted by the Greek letter <span class="math inline">\(\alpha\)</span> (pronounced “alpha”). This value acts as a cutoff on the p-value, where if the p-value falls below <span class="math inline">\(\alpha\)</span>, we would “reject the null hypothesis <span class="math inline">\(H_0\)</span>.” Alternatively, if the p-value does not fall below <span class="math inline">\(\alpha\)</span>, we would “fail to reject <span class="math inline">\(H_0\)</span>.” Note the latter statement is not quite the same as saying we “accept <span class="math inline">\(H_0\)</span>.” This distinction is rather subtle and not immediately obvious. So we’ll revisit it later in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>.</p>
-<p>While different fields tend to use different values of <span class="math inline">\(\alpha\)</span>, some commonly used values for <span class="math inline">\(\alpha\)</span> are 0.1, 0.01, and 0.05, with 0.05 being the choice people often make without putting much thought into it. We’ll talk more about <span class="math inline">\(\alpha\)</span> significance levels in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>, but first let’s fully conduct the hypothesis test corresponding to our promotions activity using the <code>infer</code> package.</p>
+<p>Seventh and lastly, in many hypothesis testing procedures, it is commonly recommended to set the <strong>significance level</strong>  of the test beforehand. It is denoted by the Greek letter <span class="math inline">\(\alpha\)</span> (pronounced “alpha”). This value acts as a cutoff on the <span class="math inline">\(p\)</span>-value, where if the <span class="math inline">\(p\)</span>-value falls below <span class="math inline">\(\alpha\)</span>, we would “reject the null hypothesis <span class="math inline">\(H_0\)</span>.”</p>
+<p>Alternatively, if the <span class="math inline">\(p\)</span>-value does not fall below <span class="math inline">\(\alpha\)</span>, we would “fail to reject <span class="math inline">\(H_0\)</span>.” Note the latter statement is not quite the same as saying we “accept <span class="math inline">\(H_0\)</span>.” This distinction is rather subtle and not immediately obvious. So we’ll revisit it later in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>.</p>
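+<p>The cutoff rule above can be sketched in a few lines of base R. This is only an illustration: the helper name <code>decide()</code> and the example <span class="math inline">\(p\)</span>-values are hypothetical and do not come from the <code>promotions</code> data.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: compare a p-value to a pre-specified significance level
+decide &lt;- function(p_value, alpha = 0.05) {
+  if (p_value &lt; alpha) {
+    &quot;reject H0&quot;
+  } else {
+    # Note the wording: we &quot;fail to reject&quot;, we do not &quot;accept&quot;
+    &quot;fail to reject H0&quot;
+  }
+}
+
+decide(0.03)  # made-up p-value below alpha = 0.05
+decide(0.20)  # made-up p-value above alpha = 0.05</code></pre>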
+<p>While different fields tend to use different values of <span class="math inline">\(\alpha\)</span>, some commonly used values for <span class="math inline">\(\alpha\)</span> are 0.1, 0.01, and 0.05, with 0.05 being the choice people often make without putting much thought into it. We’ll talk more about <span class="math inline">\(\alpha\)</span> significance levels in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>, but first let’s fully conduct the hypothesis test corresponding to our promotions activity using the <code>infer</code> package.</p>
 </div>
 <div id="ht-infer" class="section level2">
 <h2><span class="header-section-number">9.3</span> Conducting hypothesis tests</h2>
-<p>In Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>, we showed you how to construct confidence intervals. We first illustrated how to do this using raw <code>dplyr</code> data wrangling verbs and the <code>rep_sample_n()</code> function from Section <a href="7-sampling.html#shovel-1000-times">7.2.3</a> which we used as a virtual shovel. In particular, we constructed confidence intervals by resampling with replacement by setting the <code>replace = TRUE</code> argument to the <code>rep_sample_n()</code> function.</p>
+<p>In Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>, we showed you how to construct confidence intervals. We first illustrated how to do this using <code>dplyr</code> data wrangling verbs and the <code>rep_sample_n()</code> function from Subsection <a href="7-sampling.html#shovel-1000-times">7.2.3</a>, which we used as a virtual shovel. In particular, we constructed confidence intervals by resampling with replacement, setting the <code>replace = TRUE</code> argument of the <code>rep_sample_n()</code> function.</p>
 <p>We then showed you how to perform the same task using the <code>infer</code> package workflow. While both workflows resulted in the same bootstrap distribution from which we can construct confidence intervals, the <code>infer</code> package workflow emphasizes each of the steps in the overall process in Figure <a href="9-hypothesis-testing.html#fig:infer-ci">9.8</a>. It does so using function names that are intuitively named with verbs:</p>
 <ol style="list-style-type: decimal">
 <li><code>specify()</code> the variables of interest in your data frame.</li>
@@ -1006,21 +1016,21 @@ <h2><span class="header-section-number">9.3</span> Conducting hypothesis tests</
 <li><code>visualize()</code> the resulting bootstrap distribution and confidence interval.</li>
 </ol>
 <div class="figure" style="text-align: center"><span id="fig:infer-ci"></span>
-<img src="images/flowcharts/infer/visualize.png" alt="Confidence intervals with the infer package." width="80%" />
+<img src="images/flowcharts/infer/visualize.png" alt="Confidence intervals with the infer package." width="90%" height="90%" />
 <p class="caption">
 FIGURE 9.8: Confidence intervals with the infer package.
 </p>
 </div>
 <p>In this section, we’ll now show you how to seamlessly modify the previously seen <code>infer</code> code for constructing confidence intervals to conduct hypothesis tests. You’ll notice that the basic outline of the workflow is almost identical, except for an additional <code>hypothesize()</code> step between the <code>specify()</code> and <code>generate()</code> steps, as can be seen in Figure <a href="9-hypothesis-testing.html#fig:inferht">9.9</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:inferht"></span>
-<img src="images/flowcharts/infer/ht.png" alt="Hypothesis testing with the infer package." width="80%" />
+<img src="images/flowcharts/infer/ht.png" alt="Hypothesis testing with the infer package." width="90%" height="90%" />
 <p class="caption">
 FIGURE 9.9: Hypothesis testing with the infer package.
 </p>
 </div>
-<p>Furthermore, we’ll use a pre-specified significance level <span class="math inline">\(\alpha\)</span> = 0.001 for this hypothesis test. Let’s leave discussion on the choice of this <span class="math inline">\(\alpha\)</span> value until later on in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>.</p>
+<p>Furthermore, we’ll use a pre-specified significance level <span class="math inline">\(\alpha\)</span> = 0.05 for this hypothesis test. Let’s leave discussion on the choice of this <span class="math inline">\(\alpha\)</span> value until later on in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>.</p>
 <div id="infer-workflow-ht" class="section level3">
-<h3><span class="header-section-number">9.3.1</span> infer package workflow</h3>
+<h3><span class="header-section-number">9.3.1</span> <code>infer</code> package workflow</h3>
 <!--
 you were introduced to the framework for inference including the following verbs: `specify()`, `generate()`, and `calculate()`. This was useful when calculating bootstrap distributions in order to develop confidence intervals in both the one-sample and two-sample cases. One of the great powers of the `infer` package is in extending confidence intervals to hypothesis testing by including one more verb: `hypothesize()`. 
 
@@ -1030,10 +1040,10 @@ <h3><span class="header-section-number">9.3.1</span> infer package workflow</h3>
 -->
 <div id="specify-variables-3" class="section level4 unnumbered">
 <h4>1. <code>specify</code> variables</h4>
-<p>Recall that we use the <code>specify()</code>  verb to specify the response variable and, if needed, any explanatory variables for our study. In this case, since we are interested in any potential effects of gender on promotion decisions, we set <code>decision</code> as the response variable and <code>gender</code> as the explanatory variable. We do so using a <code>formula = response ~ explanatory</code> argument where <code>response</code> is the name of the response variable in the data frame and <code>explanatory</code> is the name of the explanatory variable. So in our case it is <code>decision ~ gender</code>.</p>
-<p>Furthermore, since we are interested in the proportion of resumes <code>&quot;promoted&quot;</code>, and not the proportion of resumes <code>not</code> promoted, we set the argument <code>success = &quot;promoted&quot;</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>)</code></pre>
+<p>Recall that we use the <code>specify()</code>  verb to specify the response variable and, if needed, any explanatory variables for our study. In this case, since we are interested in any potential effects of gender on promotion decisions, we set <code>decision</code> as the response variable and <code>gender</code> as the explanatory variable. We do so using <code>formula = response ~ explanatory</code> where <code>response</code> is the name of the response variable in the data frame and <code>explanatory</code> is the name of the explanatory variable. So in our case it is <code>decision ~ gender</code>.</p>
+<p>Furthermore, since we are interested in the proportion of résumés <code>&quot;promoted&quot;</code>, and not the proportion of résumés <code>not</code> promoted, we set the argument <code>success</code> to <code>&quot;promoted&quot;</code>.</p>
+<div class="sourceCode" id="cb360"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb360-1" data-line-number="1">promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb360-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) </a></code></pre></div>
 <pre><code>Response: decision (factor)
 Explanatory: gender (factor)
 # A tibble: 48 x 2
@@ -1058,19 +1068,22 @@ <h4>2. <code>hypothesize</code> the null</h4>
 <p><span class="math display">\[
 \begin{aligned}
 H_0 &amp;: p_{m} - p_{f} = 0\\
-\text{vs } H_A&amp;: p_{m} - p_{f} &gt; 0
+\text{vs. } H_A&amp;: p_{m} - p_{f} &gt; 0
 \end{aligned}
 \]</span></p>
 <p>In other words, the null hypothesis <span class="math inline">\(H_0\)</span> corresponding to our “hypothesized universe” stated that there was no difference in promotion rates between genders. We set this null hypothesis <span class="math inline">\(H_0\)</span> in our <code>infer</code> workflow using the <code>null</code> argument of the <code>hypothesize()</code> function to either:</p>
 <ul>
 <li><code>&quot;point&quot;</code> for hypotheses involving a single sample or</li>
-<li><code>&quot;independence&quot;</code> for hypotheses involving two samples</li>
+<li><code>&quot;independence&quot;</code> for hypotheses involving two samples.</li>
 </ul>
-<p>In our case, since we have two samples (the resumes with “male” and “female” names), we set <code>null = &quot;independence&quot;</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>)</code></pre>
-<pre><code># A tibble: 48 x 2
+<p>In our case, since we have two samples (the résumés with “male” and “female” names), we set <code>null = &quot;independence&quot;</code>.</p>
+<div class="sourceCode" id="cb362"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb362-1" data-line-number="1">promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb362-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb362-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>)</a></code></pre></div>
+<pre><code>Response: decision (factor)
+Explanatory: gender (factor)
+Null Hypothesis: independence
+# A tibble: 48 x 2
    decision gender
    &lt;fct&gt;    &lt;fct&gt; 
  1 promoted male  
@@ -1085,74 +1098,60 @@ <h4>2. <code>hypothesize</code> the null</h4>
 10 promoted male  
 # … with 38 more rows</code></pre>
 <p>Again, the data has not changed yet. This will occur at the upcoming <code>generate()</code> step; we’re merely setting meta-data for now.</p>
-<p>Where do the terms <code>&quot;point&quot;</code> and <code>&quot;independence&quot;</code> come from? These are two technical statistics terms. The term “point” relates from the fact that for a single group of observations, you will test the value of a single point. Going back to the pennies example from Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, say we wanted to test if the mean year of all US pennies was equal to 1993 or not. We would be testing the value of a “point” <span class="math inline">\(\mu\)</span>, the mean year of <em>all</em> US pennies, as follows</p>
+<p>Where do the terms <code>&quot;point&quot;</code> and <code>&quot;independence&quot;</code> come from? These are two technical statistical terms. The term “point” relates to the fact that for a single group of observations, you will test the value of a single point. Going back to the pennies example from Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a>, say we wanted to test if the mean year of all US pennies was equal to 1993 or not. We would be testing the value of a “point” <span class="math inline">\(\mu\)</span>, the mean year of <em>all</em> US pennies, as follows:</p>
 <p><span class="math display">\[
 \begin{aligned}
 H_0 &amp;: \mu = 1993\\
 \text{vs } H_A&amp;: \mu \neq 1993
 \end{aligned}
 \]</span></p>
-<p>The term “independence” relates to the fact that for two groups of observations, you are testing whether or not the response variable is <em>independent</em> of the explanatory variable that assigns the groups. In our case, we are testing whether the <code>decision</code> response variable is “independent” of the explanatory variable <code>gender</code> that assigns each resume to either of the two groups.</p>
+<p>The term “independence” relates to the fact that for two groups of observations, you are testing whether or not the response variable is <em>independent</em> of the explanatory variable that assigns the groups. In our case, we are testing whether the <code>decision</code> response variable is “independent” of the explanatory variable <code>gender</code> that assigns each résumé to either of the two groups.</p>
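+<p>For comparison, a <code>&quot;point&quot;</code> null such as the pennies hypothesis above might be set up as follows. This is a sketch that assumes the <code>pennies_sample</code> data frame from the <code>moderndive</code> package and its <code>year</code> variable; the name <code>pennies_null</code> is only illustrative.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+library(infer)
+
+# One sample and one numerical response: hypothesize the &quot;point&quot; mu = 1993
+pennies_null &lt;- pennies_sample %&gt;% 
+  specify(response = year) %&gt;% 
+  hypothesize(null = &quot;point&quot;, mu = 1993)</code></pre>
+<p>As with the <code>&quot;independence&quot;</code> case, this step only records the null hypothesis as meta-data; the data themselves are unchanged.</p>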
 </div>
 <div id="generate-replicates-3" class="section level4 unnumbered">
 <h4>3. <code>generate</code> replicates</h4>
 <p>After we <code>hypothesize()</code> the null hypothesis, we <code>generate()</code> replicates of “shuffled” datasets assuming the null hypothesis is true. We do this by repeating the shuffling exercise you performed in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a> several times. Instead of merely doing it 16 times as our groups of friends did, let’s use the computer to repeat this 1000 times by setting <code>reps = 1000</code> in the <code>generate()</code>  function. However, unlike for confidence intervals where we generated replicates using <code>type = &quot;bootstrap&quot;</code> resampling with replacement, we’ll now perform shuffles/permutations by setting <code>type = &quot;permute&quot;</code>. Recall that shuffles/permutations are a kind of resampling, but unlike the bootstrap method, they involve resampling <em>without</em> replacement.</p>
-<pre class="sourceCode r"><code class="sourceCode r">promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>)</code></pre>
-<pre><code>Response: decision (factor)
-Explanatory: gender (factor)
-Null Hypothesis: independence
-# A tibble: 48,000 x 3
-# Groups:   replicate [1,000]
-   decision gender replicate
-   &lt;fct&gt;    &lt;fct&gt;      &lt;int&gt;
- 1 promoted male           1
- 2 not      male           1
- 3 promoted male           1
- 4 promoted female         1
- 5 promoted female         1
- 6 promoted female         1
- 7 promoted female         1
- 8 promoted female         1
- 9 promoted female         1
-10 not      female         1
-# … with 47,990 more rows</code></pre>
-<p>Observe that the resulting data frame has 48,000 rows. This is because we performed shuffles/permutations of the 48 values of <code>gender</code> 1000 times and 48,000 = 1000 <span class="math inline">\(\times\)</span> 48. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 48 times, the value <code>2</code> 48 times, all the way through to the value <code>1000</code> 48 times.</p>
+<div class="sourceCode" id="cb364"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb364-1" data-line-number="1">promotions_generate &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb364-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb364-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb364-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>)</a>
+<a class="sourceLine" id="cb364-5" data-line-number="5"><span class="kw">nrow</span>(promotions_generate)</a></code></pre></div>
+<pre><code>[1] 48000</code></pre>
+<!-- infer:::permute_once() shuffles the y variable instead of the x. This ends up being the same thing but isn't as intuitive
+to explain given how we set up the cards tactile example before. I think it's best to avoid displaying `promotions_generate`
+here since it has `decision` shuffled. -->
+<p>Observe that the resulting data frame has 48,000 rows. This is because we performed shuffles/permutations for each of the 48 rows 1000 times and <span class="math inline">\(48{,}000 = 1000 \cdot 48\)</span>. If you explore the <code>promotions_generate</code> data frame with <code>View()</code>, you’ll notice that the variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 48 times, the value <code>2</code> 48 times, all the way through to the value <code>1000</code> 48 times.</p>
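+<p>To see what a single shuffle does, and why it counts as resampling <em>without</em> replacement, consider the following base R sketch. It is a conceptual illustration rather than what <code>generate()</code> does internally; the vector <code>gender</code> is built by hand to mimic the 24 “male” and 24 “female” labels, and the seed value is arbitrary.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(76)  # arbitrary seed, for reproducibility only
+
+# 24 &quot;male&quot; and 24 &quot;female&quot; labels, mimicking the promotions data
+gender &lt;- rep(c(&quot;male&quot;, &quot;female&quot;), each = 24)
+
+# A permutation only reorders the labels: sampling without replacement
+shuffled &lt;- sample(gender)
+table(shuffled)      # still 24 of each label
+
+# A bootstrap resample draws with replacement, so the counts can differ from 24
+bootstrapped &lt;- sample(gender, replace = TRUE)
+table(bootstrapped)</code></pre>
+<p>Because a shuffle keeps exactly 24 labels of each kind, only the pairing between labels and decisions changes from one replicate to the next.</p>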
 </div>
 <div id="calculate-summary-statistics-3" class="section level4 unnumbered">
 <h4>4. <code>calculate</code> summary statistics</h4>
-<p>Now that we have generated 1000 replicates of “shuffles” assuming the null hypothesis is true, let’s <code>calculate()</code>  the appropriate summary statistic for each of our 1000 shuffles. Recall from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> that point estimates/summary statistics related to hypothesis testing have a specific name: <em>test statistics</em>. Since the unknown population parameter of interest is the difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span>, the test statistic of interest here is the difference in sample proportions <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span>.</p>
+<p>Now that we have generated 1000 replicates of “shuffles” assuming the null hypothesis is true, let’s <code>calculate()</code> the appropriate summary statistic for each of our 1000 shuffles. Recall from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> that point estimates related to hypothesis testing have a specific name: <em>test statistics</em>. Since the unknown population parameter of interest is the difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span>, the test statistic here is the difference in sample proportions <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span>.</p>
 <p>For each of our 1000 shuffles, we can calculate this test statistic by setting <code>stat = &quot;diff in props&quot;</code>. Furthermore, since we are interested in <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> we set <code>order = c(&quot;male&quot;, &quot;female&quot;)</code>. As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly.</p>
 <p>Let’s save the result in a data frame called <code>null_distribution</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distribution &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))
-null_distribution</code></pre>
+<div class="sourceCode" id="cb366"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb366-1" data-line-number="1">null_distribution &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb366-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb366-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb366-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb366-5" data-line-number="5"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))</a>
+<a class="sourceLine" id="cb366-6" data-line-number="6">null_distribution</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate       stat
        &lt;int&gt;      &lt;dbl&gt;
- 1         1 -0.208333 
- 2         2  0.291667 
- 3         3  0.125    
- 4         4 -0.208333 
- 5         5 -0.125    
- 6         6  0.0416667
- 7         7 -0.0416667
- 8         8  0.291667 
- 9         9  0.0416667
-10        10  0.125    
+ 1         1 -0.0416667
+ 2         2 -0.125    
+ 3         3 -0.125    
+ 4         4 -0.0416667
+ 5         5 -0.0416667
+ 6         6 -0.125    
+ 7         7 -0.125    
+ 8         8 -0.125    
+ 9         9 -0.0416667
+10        10 -0.0416667
 # … with 990 more rows</code></pre>
-<p>Observe that we have 1000 values of <code>stat</code>, each representing one instance of <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> in a hypothesized world of no gender discrimination. Observe as well we chose the name of this data frame carefully: <code>null_distribution</code>. Recall once again from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> that sampling distributions when the null hypothesis <span class="math inline">\(H_0\)</span> is assumed to be true have a special name: the <em>null distribution</em>.</p>
-<p>But wait! What happened in real-life? What was the <em>observed</em> difference in promotion rates? In other words, what was the <em>observed test statistic</em> <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span>? Recall from Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a> that we computed this observed difference by hand to be 0.875 - 0.583 = 0.292 = 29.2%.</p>
-<p>We can also compute this value using the previous <code>infer</code> code but with the <code>hypothesize()</code> and <code>generate()</code> steps removed. Let’s save this in <code>obs_diff_prop</code></p>
-<pre class="sourceCode r"><code class="sourceCode r">obs_diff_prop &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))
-obs_diff_prop</code></pre>
+<p>Observe that we have 1000 values of <code>stat</code>, each representing one instance of <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> in a hypothesized world of no gender discrimination. Observe as well that we chose the name of this data frame carefully: <code>null_distribution</code>. Recall once again from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> that sampling distributions when the null hypothesis <span class="math inline">\(H_0\)</span> is assumed to be true have a special name: the <em>null distribution</em>.</p>
+<p>What was the <em>observed</em> difference in promotion rates? In other words, what was the <em>observed test statistic</em> <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span>? Recall from Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a> that we computed this observed difference by hand to be 0.875 - 0.583 = 0.292 = 29.2%. We can also compute this value using the previous <code>infer</code> code but with the <code>hypothesize()</code> and <code>generate()</code> steps removed. Let’s save this in <code>obs_diff_prop</code>:</p>
+<div class="sourceCode" id="cb368"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb368-1" data-line-number="1">obs_diff_prop &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb368-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb368-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))</a>
+<a class="sourceLine" id="cb368-4" data-line-number="4">obs_diff_prop</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
       stat
      &lt;dbl&gt;
@@ -1161,93 +1160,90 @@ <h4>4. <code>calculate</code> summary statistics</h4>
 <div id="visualize-the-p-value" class="section level4 unnumbered">
 <h4>5. <code>visualize</code> the p-value</h4>
 <p>The final step is to measure how surprised we are by a promotion difference of 29.2% in a hypothesized universe of no gender discrimination. If the observed difference of 0.292 is highly unlikely, then we would be inclined to reject the validity of our hypothesized universe.</p>
-<p>We start by visualizing the <em>null distribution</em> of our 1000 values of <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> using <code>visualize()</code>  in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer">9.10</a>. Recall that these are values of the difference in promotion rates assuming <span class="math inline">\(H_0\)</span> is true, in other words in our hypothesized universe of no gender discrimination.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distribution, <span class="dt">binwidth =</span> <span class="fl">0.1</span>)</code></pre>
+<p>We start by visualizing the <em>null distribution</em> of our 1000 values of <span class="math inline">\(\widehat{p}_{m} - \widehat{p}_{f}\)</span> using <code>visualize()</code>  in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer">9.10</a>. Recall that these are values of the difference in promotion rates assuming <span class="math inline">\(H_0\)</span> is true. This corresponds to being in our hypothesized universe of no gender discrimination.</p>
+<div class="sourceCode" id="cb370"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb370-1" data-line-number="1"><span class="kw">visualize</span>(null_distribution, <span class="dt">bins =</span> <span class="dv">10</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:null-distribution-infer"></span>
-<img src="moderndive_files/figure-html/null-distribution-infer-1.png" alt="Null distribution" width="\textwidth" />
+<img src="ModernDive_files/figure-html/null-distribution-infer-1.png" alt="Null distribution." width="\textwidth" />
 <p class="caption">
-FIGURE 9.10: Null distribution
+FIGURE 9.10: Null distribution.
 </p>
 </div>
-<p>Let’s now add what happened in real-life to Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer">9.10</a>, the observed difference in promotion rates of 0.875 - 0.583 = 0.292 = 29.2%. However, instead of merely adding a vertical line using <code>geom_vline()</code>, let’s use the  <code>shade_p_value()</code> function with <code>obs_stat</code> set to the observed test statistic value we saved in <code>obs_diff_prop</code>.</p>
-<p>Furthermore, we’ll set the <code>direction = &quot;right&quot;</code> reflecting our alternative hypothesis <span class="math inline">\(H_A: p_{m} - p_{f} &gt; 0\)</span>. Recall our alternative hypothesis <span class="math inline">\(H_A\)</span> is that <span class="math inline">\(p_{m} - p_{f} &gt; 0\)</span>, stating that there is a difference in promotion rates in favor of resumes with male names. “More extreme” here corresponds to differences that are “bigger” or “more positive” or “more to the right.” Hence we set the <code>direction</code> argument of <code>shade_p_value()</code> to be <code>&quot;right&quot;</code>.</p>
-<p>On the other hand, had our alternative hypothesis <span class="math inline">\(H_A\)</span> been the other possible one-sided alternative <span class="math inline">\(p_{m} - p_{f} &lt; 0\)</span>, suggesting discrimination in favor of resumes with female names, we would’ve set <code>direction = &quot;left&quot;</code>. Had our alternative hypothesis <span class="math inline">\(H_A\)</span> been two-sided <span class="math inline">\(p_{m} - p_{f} \neq 0\)</span>, suggesting discrimination in either direction, we would’ve set <code>direction = &quot;both&quot;</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distribution, <span class="dt">bins =</span> <span class="dv">10</span>) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_prop, <span class="dt">direction =</span> <span class="st">&quot;right&quot;</span>)</code></pre>
+<p>Let’s now add what happened in real life to Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer">9.10</a>, the observed difference in promotion rates of 0.875 - 0.583 = 0.292 = 29.2%. However, instead of merely adding a vertical line using <code>geom_vline()</code>, let’s use the  <code>shade_p_value()</code> function with <code>obs_stat</code> set to the observed test statistic value we saved in <code>obs_diff_prop</code>.</p>
+<p>Furthermore, we’ll set <code>direction = &quot;right&quot;</code> to reflect our alternative hypothesis <span class="math inline">\(H_A: p_{m} - p_{f} &gt; 0\)</span>, which states that there is a difference in promotion rates in favor of résumés with male names. “More extreme” here corresponds to differences that are “bigger” or “more positive” or “more to the right.” Hence we set the <code>direction</code> argument of <code>shade_p_value()</code> to be <code>&quot;right&quot;</code>.</p>
+<p>On the other hand, had our alternative hypothesis <span class="math inline">\(H_A\)</span> been the other possible one-sided alternative <span class="math inline">\(p_{m} - p_{f} &lt; 0\)</span>, suggesting discrimination in favor of résumés with female names, we would’ve set <code>direction = &quot;left&quot;</code>. Had our alternative hypothesis <span class="math inline">\(H_A\)</span> been two-sided <span class="math inline">\(p_{m} - p_{f} \neq 0\)</span>, suggesting discrimination in either direction, we would’ve set <code>direction = &quot;both&quot;</code>.</p>
+<div class="sourceCode" id="cb371"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb371-1" data-line-number="1"><span class="kw">visualize</span>(null_distribution, <span class="dt">bins =</span> <span class="dv">10</span>) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb371-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_prop, <span class="dt">direction =</span> <span class="st">&quot;right&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:null-distribution-infer-2"></span>
-<img src="moderndive_files/figure-html/null-distribution-infer-2-1.png" alt="Shaded histogram to show p-value." width="\textwidth" />
+<img src="ModernDive_files/figure-html/null-distribution-infer-2-1.png" alt="Shaded histogram to show p-value." width="\textwidth" />
 <p class="caption">
-FIGURE 9.11: Shaded histogram to show p-value.
+FIGURE 9.11: Shaded histogram to show <span class="math inline">\(p\)</span>-value.
 </p>
 </div>
-<p>In the resulting Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer-2">9.11</a>, the solid red line marks 0.292 = 29.2%. However, what does the shaded-region correspond to? This is the <em>p-value</em>. Recall the definition of the p-value from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>:</p>
+<p>In the resulting Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer-2">9.11</a>, the solid dark line marks 0.292 = 29.2%. However, what does the shaded region correspond to? This is the <em><span class="math inline">\(p\)</span>-value</em>. Recall the definition of the <span class="math inline">\(p\)</span>-value from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>:</p>
 <blockquote>
-<p>A p-value is the probability of obtaining a test statistic just as or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
+<p>A <span class="math inline">\(p\)</span>-value is the probability of obtaining a test statistic just as or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
 </blockquote>
-<p>So judging by the shaded region in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer-2">9.11</a>, it seems we would somewhat rarely observe differences in promotion rates of 0.292 = 29.2% or more in a hypothesized universe of no gender discrimination. In other words, the p-value is somewhat small. Hence, we would be inclined to reject this hypothesized universe, or using statistical language we would “reject <span class="math inline">\(H_0\)</span>.”</p>
-<p>What fraction of the null distribution is shaded? In other words, what is the exact value of the p-value? We can compute it using the <code>get_p_value()</code>  function with the same arguments as the previous <code>visualize()</code> code:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_prop, <span class="dt">direction =</span> <span class="st">&quot;right&quot;</span>)</code></pre>
+<p>So judging by the shaded region in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-infer-2">9.11</a>, it seems we would somewhat rarely observe differences in promotion rates of 0.292 = 29.2% or more in a hypothesized universe of no gender discrimination. In other words, the <span class="math inline">\(p\)</span>-value is somewhat small. Hence, we would be inclined to reject this hypothesized universe or, using statistical language, we would “reject <span class="math inline">\(H_0\)</span>.”</p>
+<p>What fraction of the null distribution is shaded? In other words, what is the exact value of the <span class="math inline">\(p\)</span>-value? We can compute it using the <code>get_p_value()</code>  function with the same arguments as the previous <code>shade_p_value()</code> code:</p>
+<div class="sourceCode" id="cb372"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb372-1" data-line-number="1">null_distribution <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb372-2" data-line-number="2"><span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_prop, <span class="dt">direction =</span> <span class="st">&quot;right&quot;</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
 1   0.027</code></pre>
-<p>Keeping the definition of a p-value in mind, the probability of observing a difference in promotion rates as large as 0.292 = 29.2% due to sampling variation alone is 0.027 = 2.7%.</p>
-<p>Since this p-value is greater than our pre-specified significance level <span class="math inline">\(\alpha\)</span> = 0.001, we fail to reject the null hypothesis <span class="math inline">\(H_0: p_{m} - p_{f} = 0\)</span>. In other words, this p-value wasn’t sufficiently small to reject our hypothesized universe of no gender discrimination.</p>
-<p>Observe that whether we reject the null hypothesis <span class="math inline">\(H_0\)</span> or not depends in large part on our choice of significance level <span class="math inline">\(\alpha\)</span>. We’ll discuss this more in Section <a href="9-hypothesis-testing.html#choosing-alpha">9.4.3</a>.</p>
+<p>Keeping the definition of a <span class="math inline">\(p\)</span>-value in mind, the probability of observing a difference in promotion rates at least as large as 0.292 = 29.2% due to sampling variation alone in the null distribution is 0.027 = 2.7%. Since this <span class="math inline">\(p\)</span>-value is smaller than our pre-specified significance level <span class="math inline">\(\alpha\)</span> = 0.05, we reject the null hypothesis <span class="math inline">\(H_0: p_{m} - p_{f} = 0\)</span>. In other words, this <span class="math inline">\(p\)</span>-value is sufficiently small to reject our hypothesized universe of no gender discrimination. We instead have enough evidence to change our mind in favor of gender discrimination being a likely culprit here. Observe that whether we reject the null hypothesis <span class="math inline">\(H_0\)</span> or not depends in large part on our choice of significance level <span class="math inline">\(\alpha\)</span>. We’ll discuss this more in Subsection <a href="9-hypothesis-testing.html#choosing-alpha">9.4.3</a>.</p>
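+<p>As a rough cross-check of what <code>get_p_value()</code> reports, the right-tailed <span class="math inline">\(p\)</span>-value can be sketched in base R as the proportion of shuffled differences that are at least as large as the observed one. This is only an illustration and not how <code>infer</code> is implemented. It rebuilds a 48-row data frame by hand, assuming 24 male and 24 female résumés with 21 and 14 promotions respectively (counts chosen to match the rates 0.875 and 0.583 above). The names <code>promotions_sketch</code>, <code>diff_in_props()</code>, and <code>null_stats</code> are hypothetical.</p>

```r
# Hand-built stand-in for the promotions data.
# Assumption: 24 male and 24 female resumes, with 21 and 14 promotions.
promotions_sketch <- data.frame(
  gender   = rep(c("male", "female"), each = 24),
  decision = c(rep(c("promoted", "not"), times = c(21, 3)),
               rep(c("promoted", "not"), times = c(14, 10)))
)

# Hypothetical helper: difference in promotion proportions, male minus female.
diff_in_props <- function(decision, gender) {
  mean(decision[gender == "male"] == "promoted") -
    mean(decision[gender == "female"] == "promoted")
}

# Observed test statistic.
obs_diff <- diff_in_props(promotions_sketch$decision, promotions_sketch$gender)

# Null distribution: shuffle the gender labels 1000 times.
set.seed(76)  # arbitrary seed, used only to make this sketch reproducible
null_stats <- replicate(
  1000,
  diff_in_props(promotions_sketch$decision, sample(promotions_sketch$gender))
)

# Right-tailed p-value: share of shuffled statistics at least as large as the
# observed one (a tiny tolerance guards against floating-point ties).
p_value <- mean(null_stats >= obs_diff - 1e-12)
p_value
```

+<p>Because the shuffles are random, the value produced by this sketch should be close to, but will generally not equal, the 0.027 shown above.</p>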
 </div>
 </div>
 <div id="comparing-infer-workflows" class="section level3">
 <h3><span class="header-section-number">9.3.2</span> Comparison with confidence intervals</h3>
-<p>One of the great things about the <code>infer</code> package is that we can jump seamlessly between conducting hypothesis tests and constructing confidence intervals with minimal changes! Recall the code from the previous section that creates the null distribution, which in turn is needed to compute the p-value:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distribution &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))</code></pre>
+<p>One of the great things about the <code>infer</code> package is that we can jump seamlessly between conducting hypothesis tests and constructing confidence intervals with minimal changes! Recall the code from the previous section that creates the null distribution, which in turn is needed to compute the <span class="math inline">\(p\)</span>-value:</p>
+<div class="sourceCode" id="cb374"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb374-1" data-line-number="1">null_distribution &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb374-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb374-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb374-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb374-5" data-line-number="5"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))</a></code></pre></div>
 <p>To create the corresponding bootstrap distribution needed to construct a 95% confidence interval for <span class="math inline">\(p_{m} - p_{f}\)</span>, we only need to make two changes.  First, we remove the <code>hypothesize()</code> step since we are no longer assuming a null hypothesis <span class="math inline">\(H_0\)</span> is true. We can do this by deleting or commenting out the <code>hypothesize()</code> line of code. Second, we switch the <code>type</code> of resampling in the <code>generate()</code> step to be <code>&quot;bootstrap&quot;</code> instead of <code>&quot;permute&quot;</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">bootstrap_distribution &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="co"># Change 1 - Remove hypothesize():</span>
-<span class="st">  </span><span class="co"># hypothesize(null = &quot;independence&quot;) %&gt;% </span>
-<span class="st">  </span><span class="co"># Change 2 - Switch type from &quot;permute&quot; to &quot;bootstrap&quot;:</span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))</code></pre>
+<div class="sourceCode" id="cb375"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb375-1" data-line-number="1">bootstrap_distribution &lt;-<span class="st"> </span>promotions <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb375-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> decision <span class="op">~</span><span class="st"> </span>gender, <span class="dt">success =</span> <span class="st">&quot;promoted&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb375-3" data-line-number="3"><span class="st">  </span><span class="co"># Change 1 - Remove hypothesize():</span></a>
+<a class="sourceLine" id="cb375-4" data-line-number="4"><span class="st">  </span><span class="co"># hypothesize(null = &quot;independence&quot;) %&gt;% </span></a>
+<a class="sourceLine" id="cb375-5" data-line-number="5"><span class="st">  </span><span class="co"># Change 2 - Switch type from &quot;permute&quot; to &quot;bootstrap&quot;:</span></a>
+<a class="sourceLine" id="cb375-6" data-line-number="6"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb375-7" data-line-number="7"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;male&quot;</span>, <span class="st">&quot;female&quot;</span>))</a></code></pre></div>
 <p>Using this <code>bootstrap_distribution</code>, let’s first compute the percentile-based confidence intervals, as we did in Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">percentile_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)
-percentile_ci</code></pre>
+<div class="sourceCode" id="cb376"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb376-1" data-line-number="1">percentile_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb376-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)</a>
+<a class="sourceLine" id="cb376-3" data-line-number="3">percentile_ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
      `2.5%`  `97.5%`
       &lt;dbl&gt;    &lt;dbl&gt;
-1 0.0414187 0.522222</code></pre>
-<p>Using our shorthand interpretation for 95% confidence intervals from Section <a href="8-confidence-intervals.html#shorthand">8.5.2</a>, we are 95% “confident” that the true difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span> is between (0.041, 0.522). Let’s visualize <code>bootstrap_distribution</code> and this percentile-based 95% confidence interval for <span class="math inline">\(p_{m} - p_{f}\)</span> in Figure <a href="9-hypothesis-testing.html#fig:bootstrap-distribution-two-prop-percentile">9.12</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci)</code></pre>
+1 0.0444444 0.538542</code></pre>
+<p>Using our shorthand interpretation for 95% confidence intervals from Subsection <a href="8-confidence-intervals.html#shorthand">8.5.2</a>, we are 95% “confident” that the true difference in population proportions <span class="math inline">\(p_{m} - p_{f}\)</span> is between 0.044 and 0.539. Let’s visualize <code>bootstrap_distribution</code> and this percentile-based 95% confidence interval for <span class="math inline">\(p_{m} - p_{f}\)</span> in Figure <a href="9-hypothesis-testing.html#fig:bootstrap-distribution-two-prop-percentile">9.12</a>.</p>
+<div class="sourceCode" id="cb378"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb378-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb378-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> percentile_ci)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-two-prop-percentile"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-two-prop-percentile-1.png" alt="Percentile-based 95 percent confidence interval." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-two-prop-percentile-1.png" alt="Percentile-based 95% confidence interval." width="\textwidth" />
 <p class="caption">
-FIGURE 9.12: Percentile-based 95 percent confidence interval.
+FIGURE 9.12: Percentile-based 95% confidence interval.
 </p>
 </div>
 <p>Notice a key value that is not included in the 95% confidence interval for <span class="math inline">\(p_{m} - p_{f}\)</span>: the value 0. In other words, a difference of 0 is not included in our net, suggesting that <span class="math inline">\(p_{m}\)</span> and <span class="math inline">\(p_{f}\)</span> are truly different! Furthermore, observe how the entirety of the 95% confidence interval for <span class="math inline">\(p_{m} - p_{f}\)</span> lies above 0, suggesting that this difference is in favor of men.</p>
-<p>Since the bootstrap distribution appears to be roughly normally shaped, we can also use the standard error method as we did in Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>.</p>
-<p>In this case, we must specify the <code>point_estimate</code> argument as the observed difference in promotion rates 0.292 = 29.2% saved in <code>obs_diff_prop</code>. This value acts as the center of the confidence interval.</p>
-<pre class="sourceCode r"><code class="sourceCode r">se_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, 
-                          <span class="dt">point_estimate =</span> obs_diff_prop)
-se_ci</code></pre>
+<p>Since the bootstrap distribution appears to be roughly normally shaped, we can also use the standard error method as we did in Section <a href="8-confidence-intervals.html#bootstrap-process">8.4</a>. In this case, we must specify the <code>point_estimate</code> argument as the observed difference in promotion rates 0.292 = 29.2% saved in <code>obs_diff_prop</code>. This value acts as the center of the confidence interval.</p>
+<div class="sourceCode" id="cb379"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb379-1" data-line-number="1">se_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb379-2" data-line-number="2"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;se&quot;</span>, </a>
+<a class="sourceLine" id="cb379-3" data-line-number="3">                          <span class="dt">point_estimate =</span> obs_diff_prop)</a>
+<a class="sourceLine" id="cb379-4" data-line-number="4">se_ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
       lower    upper
       &lt;dbl&gt;    &lt;dbl&gt;
-1 0.0490607 0.534273</code></pre>
+1 0.0514129 0.531920</code></pre>
 <p>Let’s visualize <code>bootstrap_distribution</code> again, but now the standard error based 95% confidence interval for <span class="math inline">\(p_{m} - p_{f}\)</span> in Figure <a href="9-hypothesis-testing.html#fig:bootstrap-distribution-two-prop-se">9.13</a>. Again, notice how the value 0 is not included in our confidence interval, again suggesting that <span class="math inline">\(p_{m}\)</span> and <span class="math inline">\(p_{f}\)</span> are truly different!</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> se_ci)</code></pre>
+<div class="sourceCode" id="cb381"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb381-1" data-line-number="1"><span class="kw">visualize</span>(bootstrap_distribution) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb381-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_confidence_interval</span>(<span class="dt">endpoints =</span> se_ci)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:bootstrap-distribution-two-prop-se"></span>
-<img src="moderndive_files/figure-html/bootstrap-distribution-two-prop-se-1.png" alt="Standard error-based 95 percent confidence interval." width="\textwidth" />
+<img src="ModernDive_files/figure-html/bootstrap-distribution-two-prop-se-1.png" alt="Standard error-based 95% confidence interval." width="\textwidth" />
 <p class="caption">
-FIGURE 9.13: Standard error-based 95 percent confidence interval.
+FIGURE 9.13: Standard error-based 95% confidence interval.
 </p>
 </div>
 <div class="learncheck">
@@ -1255,46 +1251,53 @@ <h3><span class="header-section-number">9.3.2</span> Comparison with confidence
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC9.1)</strong> Conduct the same analysis comparing male and female promotion rates using the median rating instead of the mean rating? What was different and what was the same?</p>
-<p><strong>(LC9.2)</strong> Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between the promotion rate of males and females using this study.</p>
-<p><strong>(LC9.3)</strong> Why are we relatively confident that the distributions of the sample proportions will be good approximations of the population distributions of promotion proportions for the two genders?</p>
-<p><strong>(LC9.4)</strong> Using the definition of “<span class="math inline">\(p\)</span>-value”, write in words what the <span class="math inline">\(p\)</span>-value represents for the hypothesis test comparing the promotion rates for males and females.</p>
-<p><strong>(LC9.5)</strong> What is the value of the <span class="math inline">\(p\)</span>-value for the hypothesis test comparing the mean rating of romance to action movies? How can it be interpreted in the context of the problem?</p>
+<p><strong>(LC9.1)</strong> Conduct the same hypothesis test and confidence interval analysis comparing male and female promotion rates using the median rating instead of the mean rating. What was different and what was the same?</p>
+<p><strong>(LC9.2)</strong> Why are we relatively confident that the distributions of the sample proportions will be good approximations of the population distributions of promotion proportions for the two genders?</p>
+<p><strong>(LC9.3)</strong> Using the definition of <em>p-value</em>, write in words what the <span class="math inline">\(p\)</span>-value represents for the hypothesis test comparing the promotion rates for males and females.</p>
 <div class="learncheck">
 
 </div>
 </div>
 <div id="only-one-test" class="section level3">
 <h3><span class="header-section-number">9.3.3</span> “There is only one test”</h3>
-<p>Let’s recap the steps necessary to conduct a hypothesis test using the terminology, notation, and definitions related to sampling you saw in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> and the <code>infer</code> workflow from Section <a href="9-hypothesis-testing.html#infer-workflow-ht">9.3.1</a>:</p>
+<p>Let’s recap the steps necessary to conduct a hypothesis test using the terminology, notation, and definitions related to sampling you saw in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> and the <code>infer</code> workflow from Subsection <a href="9-hypothesis-testing.html#infer-workflow-ht">9.3.1</a>:</p>
 <ol style="list-style-type: decimal">
 <li><code>specify()</code> the variables of interest in your data frame.</li>
 <li><code>hypothesize()</code> the null hypothesis <span class="math inline">\(H_0\)</span>. In other words, set a “model for the universe” assuming <span class="math inline">\(H_0\)</span> is true.</li>
-<li><code>generate()</code> shuffles assuming <span class="math inline">\(H_0\)</span> is true. In other words, <em>simulate</em> data assuming <span class="math inline">\(H_0\)</span> in true.</li>
+<li><code>generate()</code> shuffles assuming <span class="math inline">\(H_0\)</span> is true. In other words, <em>simulate</em> data assuming <span class="math inline">\(H_0\)</span> is true.</li>
 <li><code>calculate()</code> the <em>test statistic</em> of interest, both for the observed data and your <em>simulated</em> data.</li>
-<li><code>visualize()</code> the resulting <em>null distribution</em> and compute the <em>p-value</em> by comparing the null distribution to the observed test statistic.</li>
+<li><code>visualize()</code> the resulting <em>null distribution</em> and compute the <em><span class="math inline">\(p\)</span>-value</em> by comparing the null distribution to the observed test statistic.</li>
 </ol>
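+<p>As a rough sketch, the five steps above might be chained together as follows for the promotions example. This assumes the <code>promotions</code> data frame from the <code>moderndive</code> package, with variables <code>decision</code> and <code>gender</code>, and an arbitrary choice of 1000 shuffles; adapt the names and settings to your own data.</p>
+<pre class="sourceCode r"><code class="sourceCode r">library(infer)
+library(moderndive)  # assumed source of the promotions data frame
+
+# Steps 1 and 4 on the observed data: the observed test statistic
+obs_diff_prop &lt;- promotions %&gt;%
+  specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;%
+  calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;))
+
+# Steps 1 through 4 on simulated data: the null distribution
+null_distribution &lt;- promotions %&gt;%
+  specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;%
+  hypothesize(null = &quot;independence&quot;) %&gt;%
+  generate(reps = 1000, type = &quot;permute&quot;) %&gt;%
+  calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;))
+
+# Step 5: visualize the null distribution and compute the p-value
+visualize(null_distribution) +
+  shade_p_value(obs_stat = obs_diff_prop, direction = &quot;right&quot;)
+null_distribution %&gt;%
+  get_p_value(obs_stat = obs_diff_prop, direction = &quot;right&quot;)</code></pre>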
 <p>While this is a lot to digest, especially the first time you encounter hypothesis testing, the nice thing is that once you understand this general framework, then you can understand <em>any</em> hypothesis test. In a famous blog post, computer scientist Allen Downey called this the <a href="http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html">“There is only one test”</a> framework, for which he created the flowchart displayed in Figure <a href="9-hypothesis-testing.html#fig:htdowney">9.14</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:htdowney"></span>
-<img src="images/copyright/there_is_only_one_test.png" alt="Allan Downey's hypothesis testing framework." width="90%" />
+<img src="images/copyright/there_is_only_one_test.png" alt="Allen Downey's hypothesis testing framework." width="110%" />
 <p class="caption">
-FIGURE 9.14: Allan Downey’s hypothesis testing framework.
+FIGURE 9.14: Allen Downey’s hypothesis testing framework.
 </p>
 </div>
-<p>Notice its similarity with the “hypothesis testing via <code>infer</code>” diagram you saw in Figure <a href="9-hypothesis-testing.html#fig:inferht">9.9</a>. That’s because the <code>infer</code> package was explicitly designed to match the “There is only one test” framework. So if you can understand the framework, you can easily generalize these ideas for all hypothesis testing scenarios. Whether for population proportions <span class="math inline">\(p\)</span>, population means <span class="math inline">\(\mu\)</span>, differences in population proportions <span class="math inline">\(p_1 - p_2\)</span>, differences in population means <span class="math inline">\(\mu_1 - \mu_2\)</span>, and as you’ll see in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on inference for regression, population regression intercepts <span class="math inline">\(\beta_0\)</span> and population regression slopes <span class="math inline">\(\beta_1\)</span> as well.</p>
+<p>Notice its similarity with the “hypothesis testing with <code>infer</code>” diagram you saw in Figure <a href="9-hypothesis-testing.html#fig:inferht">9.9</a>. That’s because the <code>infer</code> package was explicitly designed to match the “There is only one test” framework. So if you can understand the framework, you can easily generalize these ideas for all hypothesis testing scenarios, whether for population proportions <span class="math inline">\(p\)</span>, population means <span class="math inline">\(\mu\)</span>, differences in population proportions <span class="math inline">\(p_1 - p_2\)</span>, differences in population means <span class="math inline">\(\mu_1 - \mu_2\)</span>, or, as you’ll see in Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a> on inference for regression, population regression slopes <span class="math inline">\(\beta_1\)</span>. In fact, the framework applies beyond these examples to more complicated hypothesis tests and test statistics as well.</p>
+<div class="learncheck">
+<p>
+<strong><em>Learning check</em></strong>
+</p>
+</div>
+<p><strong>(LC9.4)</strong> Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between the promotion rate of males and females using this study.</p>
+<div class="learncheck">
+
+</div>
 </div>
 </div>
 <div id="ht-interpretation" class="section level2">
 <h2><span class="header-section-number">9.4</span> Interpreting hypothesis tests</h2>
-<p>Interpreting the results of hypothesis tests are one of the more challenging aspects of this method for statistical inference. In this section, we’ll focus on ways to help with deciphering the process and address some common misconceptions.</p>
+<p>Interpreting the results of hypothesis tests is one of the more challenging aspects of this method for statistical inference. In this section, we’ll focus on ways to help with deciphering the process and address some common misconceptions.</p>
 <div id="trial" class="section level3">
 <h3><span class="header-section-number">9.4.1</span> Two possible outcomes</h3>
 <p>In Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>, we mentioned that given a pre-specified significance level <span class="math inline">\(\alpha\)</span> there are two possible outcomes of a hypothesis test:</p>
 <ul>
-<li>If the p-value is less than <span class="math inline">\(\alpha\)</span>, then we <em>reject</em> the null hypothesis <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span>.</li>
-<li>If the p-value is greater than or equal to <span class="math inline">\(\alpha\)</span>, we <em>fail to reject</em> the null hypothesis <span class="math inline">\(H_0\)</span>.</li>
+<li>If the <span class="math inline">\(p\)</span>-value is less than <span class="math inline">\(\alpha\)</span>, then we <em>reject</em> the null hypothesis <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span>.</li>
+<li>If the <span class="math inline">\(p\)</span>-value is greater than or equal to <span class="math inline">\(\alpha\)</span>, we <em>fail to reject</em> the null hypothesis <span class="math inline">\(H_0\)</span>.</li>
 </ul>
-<p>Unfortunately, the latter result is often misinterpreted as “accepting the null hypothesis <span class="math inline">\(H_0\)</span>.” While at first glance it may seem that the statements “failing to reject <span class="math inline">\(H_0\)</span>” and “accepting <span class="math inline">\(H_0\)</span>” are equivalent, there actually is a subtle difference. Saying that we “accept the null hypothesis <span class="math inline">\(H_0\)</span>” is equivalent to stating “we think the null hypothesis <span class="math inline">\(H_0\)</span> is true.” However, saying that we “fail to reject the null hypothesis <span class="math inline">\(H_0\)</span>” is saying something else: “While <span class="math inline">\(H_0\)</span> might still be false, we don’t have enough evidence to say so.” In other words, there is an absence of enough proof. However, the absence of proof is not proof of absence.</p>
+<p>Unfortunately, the latter result is often misinterpreted as “accepting the null hypothesis <span class="math inline">\(H_0\)</span>.” While at first glance it may seem that the statements “failing to reject <span class="math inline">\(H_0\)</span>” and “accepting <span class="math inline">\(H_0\)</span>” are equivalent, there actually is a subtle difference. Saying that we “accept the null hypothesis <span class="math inline">\(H_0\)</span>” is equivalent to stating that “we think the null hypothesis <span class="math inline">\(H_0\)</span> is true.” However, saying that we “fail to reject the null hypothesis <span class="math inline">\(H_0\)</span>” is saying something else: “While <span class="math inline">\(H_0\)</span> might still be false, we don’t have enough evidence to say so.” In other words, there is an absence of enough proof. However, the absence of proof is not proof of absence.</p>
 <p>To further shed light on this distinction,  let’s use the United States criminal justice system as an analogy. A criminal trial in the United States is a similar situation to hypothesis tests whereby a choice between two contradictory claims must be made about a defendant who is on trial:</p>
 <ol style="list-style-type: decimal">
 <li>The defendant is truly either “innocent” or “guilty.”</li>
@@ -1302,23 +1305,22 @@ <h3><span class="header-section-number">9.4.1</span> Two possible outcomes</h3>
 <li>The defendant is found guilty only if there is <em>strong evidence</em> that the defendant is guilty. The phrase “beyond a reasonable doubt” is often used as a guideline for determining a cutoff for when enough evidence exists to find the defendant guilty.</li>
 <li>The defendant is found to be either “not guilty” or “guilty” in the ultimate verdict.</li>
 </ol>
-<p>In other words, “not guilty” verdicts are not suggesting the defendant is “innocent”, but instead that “while the defendant may still actually be guilty, there wasn’t enough evidence to prove this fact.” Now let’s make the connection with hypothesis tests:</p>
+<p>In other words, <em>not guilty</em> verdicts are not suggesting the defendant is <em>innocent</em>, but instead that “while the defendant may still actually be guilty, there wasn’t enough evidence to prove this fact.” Now let’s make the connection with hypothesis tests:</p>
 <ol style="list-style-type: decimal">
 <li>Either the null hypothesis <span class="math inline">\(H_0\)</span> or the alternative hypothesis <span class="math inline">\(H_A\)</span> is true.</li>
-<li>Hypothesis tests are always conducted assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true.</li>
-<li>We reject the null hypothesis <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span> only if the evidence found in the sample suggests that <span class="math inline">\(H_A\)</span> is true. The significance level <span class="math inline">\(\alpha\)</span> is used as a guideline to set the threshold on how strong evidence we require.</li>
+<li>Hypothesis tests are conducted assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true.</li>
+<li>We reject the null hypothesis <span class="math inline">\(H_0\)</span> in favor of <span class="math inline">\(H_A\)</span> only if the evidence found in the sample suggests that <span class="math inline">\(H_A\)</span> is true. The significance level <span class="math inline">\(\alpha\)</span> is used as a guideline to set the threshold for just how strong the evidence must be.</li>
 <li>We ultimately decide to either “fail to reject <span class="math inline">\(H_0\)</span>” or “reject <span class="math inline">\(H_0\)</span>.”</li>
 </ol>
-<p>So while gut instinct may suggest “failing to reject <span class="math inline">\(H_0\)</span>” and “accepting <span class="math inline">\(H_0\)</span>” are equivalent statements, they are not. “Accepting <span class="math inline">\(H_0\)</span>” is equivalent to finding a defendant innocent. However, courts do not defendants “innocent,” but rather they find them “not guilty.” Putting things differently, defense attorneys do not need to prove that their clients are innocent, rather they only need to prove that clients are “not guilty beyond a reasonable doubt”.</p>
-<p>So going back to our resumes activity in Section <a href="9-hypothesis-testing.html#ht-infer">9.3</a>, recall that our hypothesis test was <span class="math inline">\(H_0: p_{m} - p_{f} = 0\)</span> versus <span class="math inline">\(H_A: p_{m} - p_{f} &gt; 0\)</span> and that we used a pre-specified significance level of <span class="math inline">\(\alpha\)</span> = 0.001. We found a p-value of 0.027. Since the p-value was greater than <span class="math inline">\(\alpha\)</span> = 0.001, we failed to reject <span class="math inline">\(H_0\)</span>. In other words, we didn’t find any evidence in this particular sample to say that <span class="math inline">\(H_0\)</span> is false at the <span class="math inline">\(\alpha\)</span> = 0.001 significance level. We also state this conclusion using non-statistical language: we didn’t find enough evidence in this data to suggest that there was no gender discrimination.</p>
+<p>So while gut instinct may suggest “failing to reject <span class="math inline">\(H_0\)</span>” and “accepting <span class="math inline">\(H_0\)</span>” are equivalent statements, they are not. “Accepting <span class="math inline">\(H_0\)</span>” is equivalent to finding a defendant innocent. However, courts do not find defendants “innocent,” but rather they find them “not guilty.” Putting things differently, defense attorneys do not need to prove that their clients are innocent; rather, they only need to prove that clients are not “guilty beyond a reasonable doubt.”</p>
+<p>So going back to our résumés activity in Section <a href="9-hypothesis-testing.html#ht-infer">9.3</a>, recall that our hypothesis test was <span class="math inline">\(H_0: p_{m} - p_{f} = 0\)</span> versus <span class="math inline">\(H_A: p_{m} - p_{f} &gt; 0\)</span> and that we used a pre-specified significance level of <span class="math inline">\(\alpha\)</span> = 0.05. We found a <span class="math inline">\(p\)</span>-value of 0.027. Since the <span class="math inline">\(p\)</span>-value was smaller than <span class="math inline">\(\alpha\)</span> = 0.05, we rejected <span class="math inline">\(H_0\)</span>. In other words, we found sufficient evidence in this particular sample to say that <span class="math inline">\(H_0\)</span> is false at the <span class="math inline">\(\alpha\)</span> = 0.05 significance level. We also state this conclusion using non-statistical language: we found enough evidence in this data to suggest that there was gender discrimination at play.</p>
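+<p>The decision rule itself boils down to a single comparison. Here is a minimal sketch in R, with the <span class="math inline">\(p\)</span>-value and significance level from this example typed in by hand; the variable names are chosen only for illustration.</p>
+<pre class="sourceCode r"><code class="sourceCode r">alpha &lt;- 0.05     # pre-specified significance level
+p_value &lt;- 0.027  # p-value reported above
+
+# Reject H0 only when the p-value falls below alpha
+if (p_value &lt; alpha) {
+  decision &lt;- &quot;reject H0&quot;
+} else {
+  decision &lt;- &quot;fail to reject H0&quot;
+}
+decision</code></pre>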
 </div>
 <div id="types-of-errors" class="section level3">
 <h3><span class="header-section-number">9.4.2</span> Types of errors</h3>
 <p>Unfortunately, there is some chance a jury or a judge can make an incorrect decision in a criminal trial by reaching the wrong verdict. For example, finding a truly innocent defendant “guilty”. Or on the other hand, finding a truly guilty defendant “not guilty.” This can often stem from the fact that prosecutors don’t have access to all the relevant evidence, but instead are limited to whatever evidence the police can find.</p>
 <p>The same holds for hypothesis tests. We can make incorrect decisions about a population parameter because we only have a sample of data from the population and thus sampling variation can lead us to incorrect conclusions.</p>
-<p>There are two possible erroneous conclusions in a criminal trial: either 1) a truly innocent person is found guilty or 2) a truly guilty person is found not guilty. Similarly, there are two possible errors in a hypothesis test: either 1) rejecting <span class="math inline">\(H_0\)</span> when in fact <span class="math inline">\(H_0\)</span> is true, called a <strong>Type I error</strong>  or 2) failing to reject <span class="math inline">\(H_0\)</span> when in fact <span class="math inline">\(H_0\)</span> is false, called a  <strong>Type II error</strong>. Another term used for “Type I error” is “false positive” while another term for “Type II error” include “false negative.”</p>
-<p>This risk of error is the price researchers pay for basing inference on a sample instead of performing a census on the entire population. But as we’ve seen in our numerous examples and activities so far, censuses are often very expensive and other times impossible, and thus researchers have no choice but to use a sample.</p>
-<p>Thus in any hypothesis test based on a sample, we have no choice but to tolerate the chance that a Type I error will be made and some chance that a Type II error will occur.</p>
+<p>There are two possible erroneous conclusions in a criminal trial: either (1) a truly innocent person is found guilty or (2) a truly guilty person is found not guilty. Similarly, there are two possible errors in a hypothesis test: either (1) rejecting <span class="math inline">\(H_0\)</span> when in fact <span class="math inline">\(H_0\)</span> is true, called a <strong>Type I error</strong>  or (2) failing to reject <span class="math inline">\(H_0\)</span> when in fact <span class="math inline">\(H_0\)</span> is false, called a  <strong>Type II error</strong>. Another term used for “Type I error” is “false positive,” while another term for “Type II error” is “false negative.”</p>
+<p>This risk of error is the price researchers pay for basing inference on a sample instead of performing a census on the entire population. But as we’ve seen in our numerous examples and activities so far, censuses are often very expensive and other times impossible, and thus researchers have no choice but to use a sample. Consequently, any hypothesis test based on a sample carries some chance that a Type I error will be made and some chance that a Type II error will occur.</p>
 <p>To help understand the concepts of Type I error and Type II errors, we apply these terms to our criminal justice analogy in Figure <a href="9-hypothesis-testing.html#fig:trial-errors-table">9.15</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:trial-errors-table"></span>
 <img src="images/gt_error_table.png" alt="Type I and Type II errors in criminal trials." width="\textwidth" />
@@ -1326,7 +1328,7 @@ <h3><span class="header-section-number">9.4.2</span> Types of errors</h3>
 FIGURE 9.15: Type I and Type II errors in criminal trials.
 </p>
 </div>
-<p>Thus a Type I error corresponds to incorrectly putting a truly innocent person in jail whereas a Type II error corresponds to letting a truly guilty person go free. Let’s show the corresponding table for hypothesis tests</p>
+<p>Thus a Type I error corresponds to incorrectly putting a truly innocent person in jail, whereas a Type II error corresponds to letting a truly guilty person go free. Let’s show the corresponding table in Figure <a href="9-hypothesis-testing.html#fig:trial-errors-table-ht">9.16</a> for hypothesis tests.</p>
 <div class="figure" style="text-align: center"><span id="fig:trial-errors-table-ht"></span>
 <img src="images/gt_error_table_ht.png" alt="Type I and Type II errors in hypothesis tests." width="\textwidth" />
 <p class="caption">
@@ -1338,24 +1340,24 @@ <h3><span class="header-section-number">9.4.2</span> Types of errors</h3>
 <h3><span class="header-section-number">9.4.3</span> How do we choose alpha?</h3>
 <p>If we are using a sample to make inferences about a population, we run the risk of making errors. For confidence intervals, a corresponding “error” would be constructing a confidence interval that does not contain the true value of the population parameter. For hypothesis tests, this would be making either a Type I or Type II error. Obviously, we want to minimize the probability of either error; we want a small probability of making an incorrect conclusion:</p>
 <ul>
-<li>The probability of a Type I Error occurring is denoted by <span class="math inline">\(\alpha\)</span>. The value of <span class="math inline">\(\alpha\)</span> is called the <em>significance level</em> of the hypothesis test, which we defined in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a></li>
+<li>The probability of a Type I Error occurring is denoted by <span class="math inline">\(\alpha\)</span>. The value of <span class="math inline">\(\alpha\)</span> is called the <em>significance level</em> of the hypothesis test, which we defined in Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>.</li>
 <li>The probability of a Type II Error is denoted by <span class="math inline">\(\beta\)</span>. The value of <span class="math inline">\(1-\beta\)</span> is known as the <em>power</em> of the hypothesis test.</li>
 </ul>
 <p>In other words, <span class="math inline">\(\alpha\)</span> corresponds to the probability of incorrectly rejecting <span class="math inline">\(H_0\)</span> when in fact <span class="math inline">\(H_0\)</span> is true. On the other hand, <span class="math inline">\(\beta\)</span> corresponds to the probability of incorrectly failing to reject <span class="math inline">\(H_0\)</span> when in fact <span class="math inline">\(H_0\)</span> is false.</p>
 <p>Ideally, we want <span class="math inline">\(\alpha = 0\)</span> and <span class="math inline">\(\beta = 0\)</span>, meaning that the chance of making either error is 0. However, this can never be the case in any situation where we are sampling for inference. There will always be the possibility of making either error when we use sample data. Furthermore, these two error probabilities are inversely related. As the probability of a Type I error goes down, the probability of a Type II error goes up.</p>
 <p>What is typically done in practice is to fix the probability of a Type I error by pre-specifying a significance level <span class="math inline">\(\alpha\)</span> and then try to minimize <span class="math inline">\(\beta\)</span>. In other words, we will tolerate a certain fraction of incorrect rejections of the null hypothesis <span class="math inline">\(H_0\)</span>, and then try to minimize the fraction of incorrect non-rejections of <span class="math inline">\(H_0\)</span>.</p>
 <p>So for example if we used <span class="math inline">\(\alpha\)</span> = 0.01, we would be using a hypothesis testing procedure that in the long run would incorrectly reject the null hypothesis <span class="math inline">\(H_0\)</span> one percent of the time. This is analogous to setting the confidence level of a confidence interval.</p>
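+<p>One way to see this long-run behavior is to simulate many hypothesis tests in a setting where <span class="math inline">\(H_0\)</span> is true by construction. The sketch below is only an illustration under assumed settings: normal data with a true mean of 0, samples of size 30, and base R’s <code>t.test()</code> rather than the simulation-based <code>infer</code> workflow. The fraction of tests that reject should land close to <span class="math inline">\(\alpha\)</span> = 0.01.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(76)
+alpha &lt;- 0.01
+
+# 10,000 samples from a population whose true mean is 0, so H0: mu = 0 holds
+p_values &lt;- replicate(10000, t.test(rnorm(n = 30, mean = 0), mu = 0)$p.value)
+
+# Proportion of tests that incorrectly reject H0 (Type I errors)
+mean(p_values &lt; alpha)</code></pre>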
-<p>So what value should you use for <span class="math inline">\(\alpha\)</span>?  Different fields have different conventions, but some commonly used values include 0.10, 0.05, 0.01, and 0.001. However, it is important to keep in mind that if you use a relatively small value of <span class="math inline">\(\alpha\)</span> then all things being equal, p-values will have a harder time being less than <span class="math inline">\(\alpha\)</span>. Thus we would reject the null hypothesis less often. In other words, we would reject the null hypothesis <span class="math inline">\(H_0\)</span> only if we have <em>very strong</em> evidence to do so. This is known as a “conservative” test.</p>
-<p>On the other hand, if we used a relatively large value of <span class="math inline">\(\alpha\)</span> then all things being equal, p-values will have an easier time being less than <span class="math inline">\(\alpha\)</span>. Thus we would reject the null hypothesis more often. In other words, we would reject the null hypothesis <span class="math inline">\(H_0\)</span> even if we only have <em>mild</em> evidence to do so. This is known as a “liberal” test.</p>
+<p>So what value should you use for <span class="math inline">\(\alpha\)</span>?  Different fields have different conventions, but some commonly used values include 0.10, 0.05, 0.01, and 0.001. However, it is important to keep in mind that if you use a relatively small value of <span class="math inline">\(\alpha\)</span>, then all things being equal, <span class="math inline">\(p\)</span>-values will have a harder time being less than <span class="math inline">\(\alpha\)</span>. Thus we would reject the null hypothesis less often. In other words, we would reject the null hypothesis <span class="math inline">\(H_0\)</span> only if we have <em>very strong</em> evidence to do so. This is known as a “conservative” test.</p>
+<p>On the other hand, if we used a relatively large value of <span class="math inline">\(\alpha\)</span>, then all things being equal, <span class="math inline">\(p\)</span>-values will have an easier time being less than <span class="math inline">\(\alpha\)</span>. Thus we would reject the null hypothesis more often. In other words, we would reject the null hypothesis <span class="math inline">\(H_0\)</span> even if we only have <em>mild</em> evidence to do so. This is known as a “liberal” test.</p>
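+<p>To make the conservative/liberal contrast concrete, here is a minimal sketch in base R. The <span class="math inline">\(p\)</span>-value of 0.03 is a hypothetical value chosen only for illustration; the <code>alpha</code> values are the commonly used ones listed above.</p>

```r
# Hypothetical p-value, chosen only for illustration (not from any real test)
p_value <- 0.03

# Commonly used significance levels mentioned in the text
alpha <- c(0.10, 0.05, 0.01, 0.001)

# We reject the null hypothesis whenever the p-value is less than alpha
decisions <- data.frame(
  alpha = alpha,
  reject_null = p_value < alpha
)
decisions
```

+<p>With this hypothetical <span class="math inline">\(p\)</span>-value, the more liberal levels (0.10 and 0.05) lead to rejecting <span class="math inline">\(H_0\)</span>, while the more conservative levels (0.01 and 0.001) do not.</p>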
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC9.6)</strong> What is wrong about saying “The defendant is innocent.” based on the US system of criminal trials?</p>
-<p><strong>(LC9.7)</strong> What is the purpose of hypothesis testing?</p>
-<p><strong>(LC9.8)</strong> What are some flaws with hypothesis testing? How could we alleviate them?</p>
-<p><strong>(LC9.9)</strong> Consider two <span class="math inline">\(\alpha\)</span> significance levels of 0.1 and 0.01. Of the two, which would lead to a more <em>liberal</em> hypothesis testing procedure? In other words, one that will, all things being equal, lead to more rejections of the null hypothesis <span class="math inline">\(H_0\)</span>?</p>
+<p><strong>(LC9.5)</strong> What is wrong about saying, “The defendant is innocent.” based on the US system of criminal trials?</p>
+<p><strong>(LC9.6)</strong> What is the purpose of hypothesis testing?</p>
+<p><strong>(LC9.7)</strong> What are some flaws with hypothesis testing? How could we alleviate them?</p>
+<p><strong>(LC9.8)</strong> Consider two <span class="math inline">\(\alpha\)</span> significance levels of 0.1 and 0.01. Of the two, which would lead to a more <em>liberal</em> hypothesis testing procedure? In other words, one that will, all things being equal, lead to more rejections of the null hypothesis <span class="math inline">\(H_0\)</span>?</p>
 <div class="learncheck">
 
 </div>
@@ -1363,14 +1365,14 @@ <h3><span class="header-section-number">9.4.3</span> How do we choose alpha?</h3
 </div>
 <div id="ht-case-study" class="section level2">
 <h2><span class="header-section-number">9.5</span> Case study: Are action or romance movies rated higher?</h2>
-<p>Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?” <a href="https://www.imdb.com/">IMDb</a> is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.</p>
+<p>Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?” <a href="https://www.imdb.com/">IMDb</a> is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.</p>
 <div id="imdb-data" class="section level3">
 <h3><span class="header-section-number">9.5.1</span> IMDb ratings data</h3>
 <!--
 **Important note:** Remember that we hardly ever have access to the population values as we do here.  This example was used to show how well hypothesis testing procedures using methods like permutation can do at testing hypotheses about population parameters. In nearly all circumstances, we'll be needing to use only a sample of the population to try to infer conclusions about the unknown population parameter values.  This example does show a nice relationship between statistics (where data is usually small and more focused on experimental settings) and data science (where data is frequently large and collected without experimental conditions). 
 -->
-<p>The <code>movies</code> dataset in the <code>ggplot2movies</code> package contains information on 58,788 movies that have been rated by users of IMDB.com.</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies</code></pre>
+<p>The <code>movies</code> dataset in the <code>ggplot2movies</code> package contains information on 58,788 movies that have been rated by users of IMDb.com.</p>
+<div class="sourceCode" id="cb382"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb382-1" data-line-number="1">movies</a></code></pre></div>
 <pre><code># A tibble: 58,788 x 24
    title  year length budget rating votes    r1    r2    r3    r4    r5    r6
    &lt;chr&gt; &lt;int&gt;  &lt;int&gt;  &lt;int&gt;  &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
@@ -1387,8 +1389,8 @@ <h3><span class="header-section-number">9.5.1</span> IMDb ratings data</h3>
 # … with 58,778 more rows, and 12 more variables: r7 &lt;dbl&gt;, r8 &lt;dbl&gt;, r9 &lt;dbl&gt;,
 #   r10 &lt;dbl&gt;, mpaa &lt;chr&gt;, Action &lt;int&gt;, Animation &lt;int&gt;, Comedy &lt;int&gt;,
 #   Drama &lt;int&gt;, Documentary &lt;int&gt;, Romance &lt;int&gt;, Short &lt;int&gt;</code></pre>
-<p>We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original <code>movies</code> dataset was a little messy, we provide a pre-wrangled version of our data in the <code>movies_sample</code> data frame included in the <code>moderndive</code> package. If you’re curious, you can look at the necessary data wrangling code to do this on <a href="https://github.com/moderndive/moderndive/blob/master/data-raw/process_data_sets.R#L14">GitHub</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies_sample</code></pre>
+<p>We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original <code>movies</code> dataset was a little messy, we provide a pre-wrangled version of our data in the <code>movies_sample</code> data frame included in the <code>moderndive</code> package. If you’re curious, you can look at the necessary data wrangling code to do this on <a href="https://github.com/moderndive/moderndive/blob/master/data-raw/process_data_sets.R">GitHub</a>.</p>
+<div class="sourceCode" id="cb384"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb384-1" data-line-number="1">movies_sample</a></code></pre></div>
 <pre><code># A tibble: 68 x 4
    title                     year rating genre  
    &lt;chr&gt;                    &lt;int&gt;  &lt;dbl&gt; &lt;chr&gt;  
@@ -1404,33 +1406,33 @@ <h3><span class="header-section-number">9.5.1</span> IMDb ratings data</h3>
 10 Electric Horseman, The    1979    5.8 Romance
 # … with 58 more rows</code></pre>
 <p>The variables include the <code>title</code> and <code>year</code> the movie was filmed. Furthermore, we have a numerical variable <code>rating</code>, which is the IMDb rating out of 10 stars, and a binary categorical variable <code>genre</code> indicating if the movie was an <code>Action</code> or <code>Romance</code> movie. We are interested in whether <code>Action</code> or <code>Romance</code> movies got a higher <code>rating</code> on average.</p>
-<p>Let’s perform an exploratory data analysis of this data. Recall from Section <a href="2-viz.html#geomboxplot">2.7.1</a> that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section <a href="2-viz.html#facets">2.6</a> would be to use a faceted histogram. However in the interest of brevity, let’s only present the boxplot in Figure <a href="9-hypothesis-testing.html#fig:action-romance-boxplot">9.17</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> movies_sample, <span class="kw">aes</span>(<span class="dt">x =</span> genre, <span class="dt">y =</span> rating)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;IMDb rating&quot;</span>)</code></pre>
+<p>Let’s perform an exploratory data analysis of this data. Recall from Subsection <a href="2-viz.html#geomboxplot">2.7.1</a> that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section <a href="2-viz.html#facets">2.6</a> would be to use a faceted histogram. However, in the interest of brevity, let’s only present the boxplot in Figure <a href="9-hypothesis-testing.html#fig:action-romance-boxplot">9.17</a>.</p>
+<div class="sourceCode" id="cb386"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb386-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> movies_sample, <span class="kw">aes</span>(<span class="dt">x =</span> genre, <span class="dt">y =</span> rating)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb386-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb386-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">&quot;IMDb rating&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:action-romance-boxplot"></span>
-<img src="moderndive_files/figure-html/action-romance-boxplot-1.png" alt="Boxplot of IMDb rating vs genre." width="\textwidth" />
+<img src="ModernDive_files/figure-html/action-romance-boxplot-1.png" alt="Boxplot of IMDb rating vs. genre." width="\textwidth" />
 <p class="caption">
-FIGURE 9.17: Boxplot of IMDb rating vs genre.
+FIGURE 9.17: Boxplot of IMDb rating vs. genre.
 </p>
 </div>
-<p>Eyeballing Figure <a href="9-hypothesis-testing.html#fig:action-romance-boxplot">9.17</a>, it appears that romance movies have a higher median rating. Do we have reason to believe however, that there is a <em>significant</em> difference between the mean <code>rating</code> for action movies compared to romance movies? It’s hard to say just based on the plot. The boxplot does show that the median sample rating is higher for romance movies. However, there is a large amount of overlap between the boxes.</p>
-<p>Let’s calculate some summary statistic split by the binary categorical variable <code>genre</code>: the number of movies, the mean rating, and the standard deviation split. We’ll do this using <code>dplyr</code> data wrangling verbs. Notice in particular how we count the number of each type of movie using the <code>n()</code> summary function.</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>(), <span class="dt">mean_rating =</span> <span class="kw">mean</span>(rating), <span class="dt">std_dev =</span> <span class="kw">sd</span>(rating))</code></pre>
+<p>Eyeballing Figure <a href="9-hypothesis-testing.html#fig:action-romance-boxplot">9.17</a>, romance movies have a higher median rating. Do we have reason to believe, however, that there is a <em>significant</em> difference between the mean <code>rating</code> for action movies compared to romance movies? It’s hard to say just based on this plot. The boxplot does show that the median sample rating is higher for romance movies.</p>
+<p>However, there is a large amount of overlap between the boxes. Recall that the median isn’t necessarily the same as the mean either, depending on whether the distribution is skewed.</p>
+<p>Let’s calculate some summary statistics split by the binary categorical variable <code>genre</code>: the number of movies, the mean rating, and the standard deviation of the ratings. We’ll do this using <code>dplyr</code> data wrangling verbs. Notice in particular how we count the number of each type of movie using the <code>n()</code> summary function.</p>
+<div class="sourceCode" id="cb387"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb387-1" data-line-number="1">movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb387-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb387-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>(), <span class="dt">mean_rating =</span> <span class="kw">mean</span>(rating), <span class="dt">std_dev =</span> <span class="kw">sd</span>(rating))</a></code></pre></div>
 <pre><code># A tibble: 2 x 4
   genre       n mean_rating std_dev
   &lt;chr&gt;   &lt;int&gt;       &lt;dbl&gt;   &lt;dbl&gt;
 1 Action     32     5.275   1.36121
 2 Romance    36     6.32222 1.60963</code></pre>
-<p>Observe that we have 36 movies with an average rating of 6.32 stars and 32 movies with an average rating of 5.28 stars. The difference in these average ratings is thus 6.32 - 5.28 = 1.05. So there appears to be an edge of 1.05 stars in favor of romance movies. The question is however, are these results indicative of a true difference for <em>all</em> romance and action movies? Or could we attribute this difference to chance <em>sampling variation</em>?</p>
+<p>Observe that we have 36 movies with an average rating of 6.322 stars and 32 movies with an average rating of 5.275 stars. The difference in these average ratings is thus 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. The question is, however, are these results indicative of a true difference for <em>all</em> romance and action movies? Or could we attribute this difference to chance <em>sampling variation</em>?</p>
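+<p>The arithmetic above can be checked with a short base R sketch. The numbers are copied from the summary output; the variable names are our own and are only for illustration. Note that the sign of the difference depends on the order of subtraction.</p>

```r
# Group summaries copied from the summary table above
n_action <- 32
n_romance <- 36
mean_action <- 5.275
mean_romance <- 6.32222

# The sign of the difference depends on which group comes first
diff_romance_minus_action <- mean_romance - mean_action
diff_action_minus_romance <- mean_action - mean_romance

round(diff_romance_minus_action, 3)
round(diff_action_minus_romance, 3)
```

+<p>Both differences have the same magnitude, about 1.047 stars; only the sign changes with the order of subtraction.</p>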
 </div>
 <div id="sampling-scenario-1" class="section level3">
 <h3><span class="header-section-number">9.5.2</span> Sampling scenario</h3>
-<p>Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Section <a href="7-sampling.html#terminology-and-notation">7.3.1</a>.</p>
-<p>The <em>study population</em> is all movies in the IMDb database that are either action or romance (but not both). The <em>sample</em> from this population is the 68 movies included in the <code>movies_sample</code> dataset. Since this sample was randomly taken from the population <code>movies</code>, it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on <code>movies_sample</code> can generalize to the entire population.</p>
-<p>What are the relevant <em>population parameter</em> and <em>point estimates</em>? We introduce the fourth sampling scenario in Table <a href="9-hypothesis-testing.html#tab:summarytable-ch10">9.3</a>.</p>
+<p>Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Subsection <a href="7-sampling.html#terminology-and-notation">7.3.1</a>. The <em>study population</em> is all movies in the IMDb database that are either action or romance (but not both). The <em>sample</em> from this population is the 68 movies included in the <code>movies_sample</code> dataset.</p>
+<p>Since this sample was randomly taken from the population <code>movies</code>, it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on <code>movies_sample</code> can generalize to the entire population. What are the relevant <em>population parameter</em> and <em>point estimates</em>? We introduce the fourth sampling scenario in Table <a href="9-hypothesis-testing.html#tab:summarytable-ch10">9.3</a>.</p>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
 <span id="tab:summarytable-ch10">TABLE 9.3: </span>Scenarios of sampling for inference
@@ -1450,7 +1452,7 @@ <h3><span class="header-section-number">9.5.2</span> Sampling scenario</h3>
 Point estimate
 </th>
 <th style="text-align:left;">
-Notation.
+Symbol(s)
 </th>
 </tr>
 </thead>
@@ -1525,9 +1527,9 @@ <h3><span class="header-section-number">9.5.2</span> Sampling scenario</h3>
 </tr>
 </tbody>
 </table>
-<p>So whereas the sampling bowl exercise in Section <a href="7-sampling.html#sampling-activity">7.1</a> concerned <em>proportions</em>, the pennies exercise in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a> concerned <em>means</em>, the case study on whether yawning is contagious in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> and the promotions activity in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a> concerned <em>differences in proportions</em>, we are now concerned with <em>differences in means</em>.</p>
-<p>In other words, the population parameter of interest is the difference in population mean ratings <span class="math inline">\(\mu_a - \mu_r\)</span>, where <span class="math inline">\(\mu_a\)</span> is the mean rating of all action movies on IMDb and similarly <span class="math inline">\(\mu_r\)</span> is the mean rating of all romance movies. Additionally the point estimate/sample statistic of interest is the difference in sample means <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span>, where <span class="math inline">\(\overline{x}_a\)</span> is the mean rating of the <span class="math inline">\(n_a\)</span> = 32 movies in our sample and <span class="math inline">\(\overline{x}_r\)</span> is the mean rating of the <span class="math inline">\(n_r\)</span> = 36 in our sample. Based on our earlier exploratory data analysis, our estimate <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> is 5.28 - 6.32 = -1.05.</p>
-<p>So there appears to be a slight difference of -1.05 in favor of romance movies. The question is however, could this difference of -1.05 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for <em>all</em> romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing.</p>
+<p>So, whereas the sampling bowl exercise in Section <a href="7-sampling.html#sampling-activity">7.1</a> concerned <em>proportions</em>, the pennies exercise in Section <a href="8-confidence-intervals.html#resampling-tactile">8.1</a> concerned <em>means</em>, the case study on whether yawning is contagious in Section <a href="8-confidence-intervals.html#case-study-two-prop-ci">8.6</a> and the promotions activity in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a> concerned <em>differences in proportions</em>, we are now concerned with <em>differences in means</em>.</p>
+<p>In other words, the population parameter of interest is the difference in population mean ratings <span class="math inline">\(\mu_a - \mu_r\)</span>, where <span class="math inline">\(\mu_a\)</span> is the mean rating of all action movies on IMDb and similarly <span class="math inline">\(\mu_r\)</span> is the mean rating of all romance movies. Additionally, the point estimate/sample statistic of interest is the difference in sample means <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span>, where <span class="math inline">\(\overline{x}_a\)</span> is the mean rating of the <span class="math inline">\(n_a\)</span> = 32 action movies in our sample and <span class="math inline">\(\overline{x}_r\)</span> is the mean rating of the <span class="math inline">\(n_r\)</span> = 36 romance movies in our sample. Based on our earlier exploratory data analysis, our estimate <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> is <span class="math inline">\(5.275 - 6.322 = -1.047\)</span>.</p>
+<p>So there appears to be a slight difference of -1.047 in favor of romance movies. The question is, however, could this difference of -1.047 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for <em>all</em> romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing.</p>
 </div>
 <div id="conducting-the-hypothesis-test" class="section level3">
 <h3><span class="header-section-number">9.5.3</span> Conducting the hypothesis test</h3>
@@ -1539,12 +1541,12 @@ <h3><span class="header-section-number">9.5.3</span> Conducting the hypothesis t
 \end{aligned}
 \]</span></p>
 <p>In other words, the null hypothesis <span class="math inline">\(H_0\)</span> suggests that both romance and action movies have the same mean rating. This is the “hypothesized universe” we’ll <em>assume</em> is true. On the other hand, the alternative hypothesis <span class="math inline">\(H_A\)</span> suggests that there is a difference. Unlike the one-sided alternative we used in the promotions exercise <span class="math inline">\(H_A: p_m - p_f &gt; 0\)</span>, we are now considering a two-sided alternative of <span class="math inline">\(H_A: \mu_a - \mu_r \neq 0\)</span>.</p>
-<p>Furthermore, we’ll pre-specify a relatively high significance level of <span class="math inline">\(\alpha\)</span> = 0.2. By setting this value high, all things being equal, there is a higher chance that the p-value will be less than <span class="math inline">\(\alpha\)</span>. Thus there is a higher chance that we’ll reject the null hypothesis <span class="math inline">\(H_0\)</span> in favor of the alternative hypothesis <span class="math inline">\(H_A\)</span>. In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies, even if we only have mild evidence.</p>
+<p>Furthermore, we’ll pre-specify a low significance level of <span class="math inline">\(\alpha\)</span> = 0.001. By setting this value low, all things being equal, there is a lower chance that the <span class="math inline">\(p\)</span>-value will be less than <span class="math inline">\(\alpha\)</span>. Thus, there is a lower chance that we’ll reject the null hypothesis <span class="math inline">\(H_0\)</span> in favor of the alternative hypothesis <span class="math inline">\(H_A\)</span>. In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies only if we have quite strong evidence. This is known as a “conservative” hypothesis testing procedure.</p>
 <div id="specify-variables-4" class="section level4 unnumbered">
 <h4>1. <code>specify</code> variables</h4>
-<p>Let’s now perform all the steps of the <code>infer</code> workflow. We first <code>specify()</code> the variables of interest in the <code>movies_sample</code> data frame using the formula <code>rating ~ genre</code>. This tells <code>infer</code> that the numerical variable <code>rating</code> is the outcome variable while the binary categorical variable <code>genre</code> is the explanatory variable. Note than unlike when we were previously interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the <code>success</code> argument.</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre)</code></pre>
+<p>Let’s now perform all the steps of the <code>infer</code> workflow. We first <code>specify()</code> the variables of interest in the <code>movies_sample</code> data frame using the formula <code>rating ~ genre</code>. This tells <code>infer</code> that the numerical variable <code>rating</code> is the outcome variable, while the binary variable <code>genre</code> is the explanatory variable. Note that unlike previously when we were interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the <code>success</code> argument.</p>
+<div class="sourceCode" id="cb389"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb389-1" data-line-number="1">movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb389-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre)</a></code></pre></div>
 <pre><code>Response: rating (numeric)
 Explanatory: genre (factor)
 # A tibble: 68 x 2
@@ -1565,11 +1567,14 @@ <h4>1. <code>specify</code> variables</h4>
 </div>
 <div id="hypothesize-the-null-1" class="section level4 unnumbered">
 <h4>2. <code>hypothesize</code> the null</h4>
-<p>We set the null hypothesis <span class="math inline">\(H_0: \mu_a - \mu_r = 0\)</span> by using the <code>hypothesize()</code> function. Since we have two samples, action and romance movies, we set <code>null = &quot;independence&quot;</code> as we described in Section <a href="9-hypothesis-testing.html#ht-infer">9.3</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>)</code></pre>
-<pre><code># A tibble: 68 x 2
+<p>We set the null hypothesis <span class="math inline">\(H_0: \mu_a - \mu_r = 0\)</span> by using the <code>hypothesize()</code> function. Since we have two samples, action and romance movies, we set <code>null</code> to be <code>&quot;independence&quot;</code> as we described in Section <a href="9-hypothesis-testing.html#ht-infer">9.3</a>.</p>
+<div class="sourceCode" id="cb391"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb391-1" data-line-number="1">movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb391-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb391-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>)</a></code></pre></div>
+<pre><code>Response: rating (numeric)
+Explanatory: genre (factor)
+Null Hypothesis: independence
+# A tibble: 68 x 2
    rating genre  
     &lt;dbl&gt; &lt;fct&gt;  
  1    3.1 Action 
@@ -1586,62 +1591,43 @@ <h4>2. <code>hypothesize</code> the null</h4>
 </div>
 <div id="generate-replicates-4" class="section level4 unnumbered">
 <h4>3. <code>generate</code> replicates</h4>
-<p>After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a>. We’ll repeat this resampling without replacement of <code>type = &quot;permute&quot;</code> a total of <code>reps = 1000</code> times .</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>)</code></pre>
-<pre><code>Response: rating (numeric)
-Explanatory: genre (factor)
-Null Hypothesis: independence
-# A tibble: 68,000 x 3
-# Groups:   replicate [1,000]
-   rating genre   replicate
-    &lt;dbl&gt; &lt;fct&gt;       &lt;int&gt;
- 1  4.4   Action          1
- 2  5.2   Romance         1
- 3  7.3   Romance         1
- 4  4.9   Romance         1
- 5  4.100 Action          1
- 6  7.4   Romance         1
- 7  5     Romance         1
- 8  5.100 Action          1
- 9  4.4   Romance         1
-10  8     Romance         1
-# … with 67,990 more rows</code></pre>
-<p>Observe that the resulting data frame has 68,000 rows. This is because we performed resampling of 68 movies with replacement 1000 times and 68,000 = 68 <span class="math inline">\(\times\)</span> 1000. The variable <code>replicate</code> indicates which resample each row belongs to. So it has the value <code>1</code> 68 times, the value <code>2</code> 68 times, all the way through to the value <code>1000</code> 68 times.</p>
+<p>After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section <a href="9-hypothesis-testing.html#ht-activity">9.1</a>.</p>
+<p>We’ll repeat this resampling without replacement, specified by <code>type = &quot;permute&quot;</code>, a total of <code>reps = 1000</code> times. Feel free to run the code below to check out what the <code>generate()</code> step produces.</p>
+<div class="sourceCode" id="cb393"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb393-1" data-line-number="1">movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb393-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb393-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb393-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb393-5" data-line-number="5"><span class="st">  </span><span class="kw">View</span>()</a></code></pre></div>
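+<p>If you’d like to see the idea behind permuting without the <code>infer</code> package, here is a minimal base R sketch. It is not how <code>generate()</code> is implemented; the <code>toy</code> data frame, the <code>shuffle_once()</code> helper, and the seed are all hypothetical and used only for illustration. Each replicate shuffles the <code>genre</code> labels without replacement while keeping the ratings fixed.</p>

```r
set.seed(76)  # arbitrary seed so the shuffles are reproducible

# Hypothetical toy data: 6 ratings with genre labels (not the real movies_sample)
toy <- data.frame(
  rating = c(3.1, 6.3, 6.8, 5.0, 4.0, 4.9),
  genre  = c("Action", "Romance", "Romance", "Romance", "Action", "Action")
)

# One permutation: shuffle the genre labels without replacement,
# keeping the ratings in their original order
shuffle_once <- function(data) {
  data$genre <- sample(data$genre, size = nrow(data), replace = FALSE)
  data
}

# Repeat the shuffle many times and record which replicate each row belongs to
reps <- 1000
shuffles <- lapply(seq_len(reps), function(i) {
  shuffled <- shuffle_once(toy)
  shuffled$replicate <- i
  shuffled
})
shuffled_all <- do.call(rbind, shuffles)

# Number of rows = rows in the data times the number of replicates
nrow(shuffled_all)
```

+<p>Because the labels are only shuffled, every replicate keeps the same number of <code>Action</code> and <code>Romance</code> labels as the original data. By the same logic, permuting the 68 movies in <code>movies_sample</code> 1000 times yields 68 <span class="math inline">\(\times\)</span> 1000 = 68,000 rows.</p>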
 </div>
 <div id="calculate-summary-statistics-4" class="section level4 unnumbered">
 <h4>4. <code>calculate</code> summary statistics</h4>
-<p>Now that we have 1000 replicated “shuffles” assuming the null hypothesis <span class="math inline">\(H_0\)</span> that both <code>Action</code> and <code>Romance</code> movies on average have the same ratings on IMDb, let’s <code>calculate()</code> the appropriate summary statistic for these 1000 replicated shuffles. Recall from Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a> that point estimates/summary statistics relating to hypothesis testing have a specific name: <em>test statistics</em>. Since the unknown population parameter of interest is the difference in population means <span class="math inline">\(\mu_{a} - \mu_{r}\)</span>, the test statistic of interest here is the difference in sample means <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>.</p>
+<p>Now that we have 1000 replicated “shuffles” assuming the null hypothesis <span class="math inline">\(H_0\)</span> that both <code>Action</code> and <code>Romance</code> movies on average have the same ratings on IMDb, let’s <code>calculate()</code> the appropriate summary statistic for these 1000 replicated shuffles. From Section <a href="9-hypothesis-testing.html#understanding-ht">9.2</a>, summary statistics relating to hypothesis testing have a specific name: <em>test statistics</em>. Since the unknown population parameter of interest is the difference in population means <span class="math inline">\(\mu_{a} - \mu_{r}\)</span>, the test statistic of interest here is the difference in sample means <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>.</p>
 <p>For each of our 1000 shuffles, we can calculate this test statistic by setting <code>stat = &quot;diff in means&quot;</code>. Furthermore, since we are interested in <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>, we set <code>order = c(&quot;Action&quot;, &quot;Romance&quot;)</code>. Let’s save the results in a data frame called <code>null_distribution_movies</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distribution_movies &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))
-null_distribution_movies</code></pre>
+<div class="sourceCode" id="cb394"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb394-1" data-line-number="1">null_distribution_movies &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb394-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb394-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb394-4" data-line-number="4"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb394-5" data-line-number="5"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))</a>
+<a class="sourceLine" id="cb394-6" data-line-number="6">null_distribution_movies</a></code></pre></div>
 <pre><code># A tibble: 1,000 x 2
    replicate      stat
        &lt;int&gt;     &lt;dbl&gt;
- 1         1 -0.923264
- 2         2  0.363542
- 3         3  0.404861
- 4         4  0.463889
- 5         5 -0.610417
- 6         6 -0.279861
- 7         7 -0.262153
- 8         8 -0.291667
- 9         9 -0.114583
-10        10  0.398958
+ 1         1  0.511111
+ 2         2  0.345833
+ 3         3 -0.327083
+ 4         4 -0.209028
+ 5         5 -0.433333
+ 6         6 -0.102778
+ 7         7  0.387153
+ 8         8  0.16875 
+ 9         9  0.257292
+10        10  0.334028
 # … with 990 more rows</code></pre>
-<p>Observe that we have 1000 values of <code>stat</code>, each representing one instance of <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>. The 1000 values form the <em>null distribution</em>, which is the technical term for the sampling distribution of the difference in sample means <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span> assuming <span class="math inline">\(H_0\)</span> is true.</p>
-<p>But wait! What happened in real-life? What was the observed difference in promotion rates? In other words, what was the <em>observed test statistic</em> <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>? Recall that our earlier data wrangling from earlier, this observed difference in means was 5.28 - 6.32 = -1.05.</p>
-<p>We can also achieve this using the code that constructed the null distribution <code>null_distribution_movies</code> but with the <code>hypothesize()</code> and <code>generate()</code> steps removed. Let’s save this in <code>obs_diff_means</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">obs_diff_means &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))
-obs_diff_means</code></pre>
+<p>Observe that we have 1000 values of <code>stat</code>, each representing one instance of <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>. The 1000 values form the <em>null distribution</em>, which is the technical term for the sampling distribution of the difference in sample means <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span> assuming <span class="math inline">\(H_0\)</span> is true. What happened in real life? What was the observed difference in mean ratings? What was the <em>observed test statistic</em> <span class="math inline">\(\overline{x}_{a} - \overline{x}_{r}\)</span>? Recall from our earlier data wrangling that this observed difference in means was <span class="math inline">\(5.275 - 6.322 = -1.047\)</span>. We can also compute this value using the code that constructed the null distribution <code>null_distribution_movies</code> but with the <code>hypothesize()</code> and <code>generate()</code> steps removed. Let’s save this in <code>obs_diff_means</code>:</p>
+<div class="sourceCode" id="cb396"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb396-1" data-line-number="1">obs_diff_means &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb396-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb396-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))</a>
+<a class="sourceLine" id="cb396-4" data-line-number="4">obs_diff_means</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
       stat
      &lt;dbl&gt;
@@ -1649,38 +1635,42 @@ <h4>4. <code>calculate</code> summary statistics</h4>
 </div>
 <div id="visualize-the-p-value-1" class="section level4 unnumbered">
 <h4>5. <code>visualize</code> the p-value</h4>
-<p>Lastly, in order to compute the p-value, we have to assess how “extreme” the observed difference in means of -1.05 is. We do this by comparing -1.05 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings.</p>
-<p>Let’s visualize both the null distribution and the p-value in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-movies-2">9.18</a>. However, unlike our example in Section <a href="9-hypothesis-testing.html#infer-workflow-ht">9.3.1</a> involving promotions, since we have a two-sided alternative hypothesis <span class="math inline">\(H_A: \mu_a - \mu_r \neq 0\)</span>, we have to allow for both possibilities for “more extreme”, so we set <code>direction = &quot;both&quot;</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distribution_movies, <span class="dt">bins =</span> <span class="dv">10</span>) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_means, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>Lastly, in order to compute the <span class="math inline">\(p\)</span>-value, we have to assess how “extreme” the observed difference in means of -1.047 is. We do this by comparing -1.047 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings. Let’s visualize both the null distribution and the <span class="math inline">\(p\)</span>-value in Figure <a href="9-hypothesis-testing.html#fig:null-distribution-movies-2">9.18</a>. Unlike our example in Subsection <a href="9-hypothesis-testing.html#infer-workflow-ht">9.3.1</a> involving promotions, since we have a two-sided <span class="math inline">\(H_A: \mu_a - \mu_r \neq 0\)</span>, we have to allow for both possibilities for <em>more extreme</em>, so we set <code>direction = &quot;both&quot;</code>.</p>
+<div class="sourceCode" id="cb398"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb398-1" data-line-number="1"><span class="kw">visualize</span>(null_distribution_movies, <span class="dt">bins =</span> <span class="dv">10</span>) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb398-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_means, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:null-distribution-movies-2"></span>
-<img src="moderndive_files/figure-html/null-distribution-movies-2-1.png" alt="Null distribution, observed test statistic, and p-value." width="\textwidth" />
+<img src="ModernDive_files/figure-html/null-distribution-movies-2-1.png" alt="Null distribution, observed test statistic, and p-value." width="\textwidth" />
 <p class="caption">
-FIGURE 9.18: Null distribution, observed test statistic, and p-value.
+FIGURE 9.18: Null distribution, observed test statistic, and <span class="math inline">\(p\)</span>-value.
 </p>
 </div>
-<p>Let’s go over the elements of this plot. First, the histogram is the <em>null distribution</em>. Second, the solid line is the <em>observed test statistic</em>, or the difference in sample means we observed in real-life of 5.28 - 6.32 = -1.05. Third, the two shaded areas of the histogram form the <em>p-value</em>, or the probability of obtaining a test statistic just as or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
-<p>What proportion of the null distribution is shaded? In other words, what is the numerical value of the p-value? We use the <code>get_p_value()</code> function to compute this value:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distribution_movies <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_means, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>Let’s go over the elements of this plot. First, the histogram is the <em>null distribution</em>. Second, the solid line is the <em>observed test statistic</em>, or the difference in sample means we observed in real life of <span class="math inline">\(5.275 - 6.322 = -1.047\)</span>. Third, the two shaded areas of the histogram form the <em><span class="math inline">\(p\)</span>-value</em>, or the probability of obtaining a test statistic just as or more extreme than the observed test statistic <em>assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true</em>.</p>
+<p>What proportion of the null distribution is shaded? In other words, what is the numerical value of the <span class="math inline">\(p\)</span>-value? We use the <code>get_p_value()</code> function to compute this value:</p>
+<div class="sourceCode" id="cb399"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb399-1" data-line-number="1">null_distribution_movies <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb399-2" data-line-number="2"><span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> obs_diff_means, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
-1   0.016</code></pre>
-<p>This p-value of 0.016 is somewhat small. In other words, there is a somewhat small chance that we’d observe a difference of 5.28 - 6.32 = -1.05 in a hypothesized universe where there was truly no difference in ratings.</p>
-<p>This p-value is in fact much smaller than our pre-specified <span class="math inline">\(\alpha\)</span> significance level of 0.2. Thus, we are very inclined to reject the null hypothesis <span class="math inline">\(H_0: \mu_a - \mu_r = 0\)</span>, in favor of the alternative hypothesis <span class="math inline">\(H_A: \mu_a - \mu_r \neq 0\)</span>. In non-statistical language, the conclusion is: the evidence in this sample of data suggests that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies in favor of the hypothesis that there is a difference.</p>
+1   0.004</code></pre>
+<p>This <span class="math inline">\(p\)</span>-value of 0.004 is very small. In other words, there is a very small chance that we’d observe a difference at least as extreme as 5.275 - 6.322 = -1.047 in a hypothesized universe where there was truly no difference in ratings.</p>
+<p>But this <span class="math inline">\(p\)</span>-value is larger than our (even smaller) pre-specified <span class="math inline">\(\alpha\)</span> significance level of 0.001. Thus, we are inclined to fail to reject the null hypothesis <span class="math inline">\(H_0: \mu_a - \mu_r = 0\)</span>. In non-statistical language, the conclusion is: we do not have the evidence needed in this sample of data to suggest that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies. We, thus, cannot say that a difference exists in romance and action movie ratings, on average, for all IMDb movies.</p>
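+<p>To see what this shaded proportion measures, here is a minimal sketch of the same idea in base R. It is not the internal code of <code>get_p_value()</code>, and its result need not match that function exactly: the ratings are simulated (hypothetical) rather than taken from <code>movies_sample</code>, the helper <code>diff_in_means()</code> is a name we made up, and we take “more extreme” to mean at least as far from 0 as the observed statistic, which is one common convention for a two-sided test.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical ratings, simulated for illustration (NOT the movies_sample data)
+set.seed(76)
+ratings &lt;- c(rnorm(30, mean = 5.3, sd = 1.4), rnorm(30, mean = 6.3, sd = 1.4))
+genre &lt;- rep(c(&quot;Action&quot;, &quot;Romance&quot;), each = 30)
+
+# Hypothetical helper: difference in sample means, Action minus Romance
+diff_in_means &lt;- function(x, g) {
+  mean(x[g == &quot;Action&quot;]) - mean(x[g == &quot;Romance&quot;])
+}
+
+# Observed test statistic
+obs_stat &lt;- diff_in_means(ratings, genre)
+
+# Null distribution: shuffle the genre labels 1000 times
+null_stats &lt;- replicate(1000, diff_in_means(ratings, sample(genre)))
+
+# Two-sided p-value: proportion of shuffled statistics at least as far
+# from 0 as the observed statistic
+p_value &lt;- mean(abs(null_stats) &gt;= abs(obs_stat))
+p_value</code></pre>
+<p>The last step is just a proportion of the 1000 shuffled statistics, which is why the smallest non-zero <span class="math inline">\(p\)</span>-value this sketch can report is 1/1000 = 0.001.</p>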
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
 </p>
 </div>
-<p><strong>(LC9.10)</strong> Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same?</p>
-<p><strong>(LC9.11)</strong> What conclusions can you make from viewing the faceted histogram looking at <code>rating</code> versus <code>genre</code> that you couldn’t see when looking at the boxplot?</p>
-<p><strong>(LC9.12)</strong> Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies.</p>
-<p><strong>(LC9.13)</strong> Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres?</p>
-<p><strong>(LC9.14)</strong> Using the definition of <span class="math inline">\(p\)</span>-value, write in words what the <span class="math inline">\(p\)</span>-value represents for the hypothesis test comparing the mean rating of romance to action movies.</p>
-<p><strong>(LC9.15)</strong> What is the value of the <span class="math inline">\(p\)</span>-value for the hypothesis test comparing the mean rating of romance to action movies?</p>
-<p><strong>(LC9.16)</strong> Do the results of the hypothesis test match up with the original plots we made looking at the population of movies? Why or why not?</p>
+<p><strong>(LC9.9)</strong> Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same?</p>
+<p><strong>(LC9.10)</strong> What conclusions can you make from viewing the faceted histogram looking at <code>rating</code> versus <code>genre</code> that you couldn’t see when looking at the boxplot?</p>
+<p><strong>(LC9.11)</strong> Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies.</p>
+<p><strong>(LC9.12)</strong> Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres?</p>
+<p><strong>(LC9.13)</strong> Using the definition of <span class="math inline">\(p\)</span>-value, write in words what the <span class="math inline">\(p\)</span>-value represents for the hypothesis test comparing the mean rating of romance to action movies.</p>
+<p><strong>(LC9.14)</strong> What is the value of the <span class="math inline">\(p\)</span>-value for the hypothesis test comparing the mean rating of romance to action movies?</p>
+<p><strong>(LC9.15)</strong> Test your data wrangling knowledge and EDA skills:</p>
+<ul>
+<li>Use <code>dplyr</code> and <code>tidyr</code> to create the necessary data frame focused on only action and romance movies (but not both) from the <code>movies</code> data frame in the <code>ggplot2movies</code> package.</li>
+<li>Make a boxplot and a faceted histogram of this population data comparing ratings of action and romance movies from IMDb.</li>
+<li>Discuss how these plots compare to the similar plots produced for the <code>movies_sample</code> data.</li>
+</ul>
 <div class="learncheck">
 
 </div>
@@ -1691,49 +1681,44 @@ <h4>5. <code>visualize</code> the p-value</h4>
 <h2><span class="header-section-number">9.6</span> Conclusion</h2>
 <div id="theory-hypo" class="section level3">
 <h3><span class="header-section-number">9.6.1</span> Theory-based hypothesis tests</h3>
-<p>Much as we did in Section <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> when we showed you a theory-based method for constructing confidence intervals that involved mathematical formulas, we now present an example of a traditional theory-based method to conduct hypothesis tests. This method relies on probability models, probability distributions, and a few assumptions to construct the null distribution. This is in contrast to the approach we’ve been using throughout this book where we relied on computer simulations to construct the null distribution.</p>
-<p>These traditional theory-based methods have been used for decades mostly because researchers didn’t have access to computers that could run thousands of calculations quickly and efficiently. Now that computing power is much cheaper and more accessible, simulation-based methods are much more feasible. However researchers in many fields continue to use theory-based methods. Hence we make it a point to include an example here.</p>
+<p>Much as we did in Subsection <a href="8-confidence-intervals.html#theory-ci">8.7.2</a> when we showed you a theory-based method for constructing confidence intervals that involved mathematical formulas, we now present an example of a traditional theory-based method to conduct hypothesis tests. This method relies on probability models, probability distributions, and a few assumptions to construct the null distribution. This is in contrast to the approach we’ve been using throughout this book where we relied on computer simulations to construct the null distribution.</p>
+<p>These traditional theory-based methods have been used for decades mostly because researchers didn’t have access to computers that could run thousands of calculations quickly and efficiently. Now that computing power is much cheaper and more accessible, simulation-based methods are much more feasible. However, researchers in many fields continue to use theory-based methods. Hence, we make it a point to include an example here.</p>
 <p>As we’ll show in this section, any theory-based method is ultimately an approximation to the simulation-based method. The theory-based method we’ll focus on is known as the <em>two-sample <span class="math inline">\(t\)</span>-test</em> for testing differences in sample means. However, the test statistic we’ll use won’t be the difference in sample means <span class="math inline">\(\overline{x}_1 - \overline{x}_2\)</span>, but rather the related <em>two-sample <span class="math inline">\(t\)</span>-statistic</em>. The data we’ll use will once again be the <code>movies_sample</code> data of action and romance movies from Section <a href="9-hypothesis-testing.html#ht-case-study">9.5</a>.</p>
 <div id="two-sample-t-statistic" class="section level4 unnumbered">
 <h4>Two-sample t-statistic</h4>
-<p>A common task in statistics is the process of “standardizing a variable.” By standardizing different variables, we make them more comparable. For example, say you are interested in studying the distribution of temperature recordings from Portland, Oregon, USA with temperature recordings in Montreal, Quebec, Canada. Given that US temperatures are generally recorded in degrees Fahrenheit and Canadian temperatures are generally recorded in degrees Celsius, how can we make them comparable?</p>
-<p>One approach would be to convert degrees Fahrenheit into Celsius, or vice versa. Another approach would be to convert them both to a common “standardized” scale, like degrees Kelvin. One common method for standardizing a variable from probability theory is to compute the  <span class="math inline">\(z\)</span>-score:</p>
+<p>A common task in statistics is the process of “standardizing a variable.” By standardizing different variables, we make them more comparable. For example, say you are interested in studying the distribution of temperature recordings from Portland, Oregon, USA and comparing it to that of the temperature recordings in Montreal, Quebec, Canada. Given that US temperatures are generally recorded in degrees Fahrenheit and Canadian temperatures are generally recorded in degrees Celsius, how can we make them comparable? One approach would be to convert degrees Fahrenheit into Celsius, or vice versa. Another approach would be to convert them both to a common “standardized” scale, like degrees Kelvin.</p>
+<p>One common method for standardizing a variable from probability and statistics theory is to compute the  <span class="math inline">\(z\)</span>-score:</p>
 <p><span class="math display">\[z = \frac{x - \mu}{\sigma}\]</span></p>
-<p>where <span class="math inline">\(x\)</span> represents one value of a variable, <span class="math inline">\(\mu\)</span> represents the mean of that variable, and <span class="math inline">\(\sigma\)</span> represents that standard deviation of the variable.</p>
-<p>You first subtract the mean <span class="math inline">\(\mu\)</span> from each value of <span class="math inline">\(x\)</span> and then divide <span class="math inline">\(x - \mu\)</span> by the standard deviation <span class="math inline">\(\sigma\)</span>. These operations will have the effect of “re-centering” your variable around 0 and “re-scaling” your variable <span class="math inline">\(x\)</span> so that they have what are known as “standard units.”</p>
-<p>Thus for every value that your variable can take, it has a corresponding <span class="math inline">\(z\)</span>-score that gives how many standard units away that value is from the mean <span class="math inline">\(\mu\)</span>. <span class="math inline">\(z\)</span>-scores are normally distributed with mean 0 and standard deviation 1. Such a curve is called a “<span class="math inline">\(z\)</span>-distribution” as well a “standard normal” curve and they have the common, bell-shaped pattern from Figure <a href="9-hypothesis-testing.html#fig:zcurve">9.19</a>. We discuss this further in Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>.</p>
+<p>where <span class="math inline">\(x\)</span> represents one value of a variable, <span class="math inline">\(\mu\)</span> represents the mean of that variable, and <span class="math inline">\(\sigma\)</span> represents the standard deviation of that variable. You first subtract the mean <span class="math inline">\(\mu\)</span> from each value of <span class="math inline">\(x\)</span> and then divide <span class="math inline">\(x - \mu\)</span> by the standard deviation <span class="math inline">\(\sigma\)</span>. These operations will have the effect of <em>re-centering</em> your variable around 0 and <em>re-scaling</em> your variable <span class="math inline">\(x\)</span> so that it has what are known as “standard units.” Thus for every value that your variable can take, it has a corresponding <span class="math inline">\(z\)</span>-score that gives how many standard units away that value is from the mean <span class="math inline">\(\mu\)</span>. When the original variable is itself normally distributed, its <span class="math inline">\(z\)</span>-scores are normally distributed with mean 0 and standard deviation 1. This curve is called a “<span class="math inline">\(z\)</span>-distribution” or “standard normal” curve and has the common, bell-shaped pattern from Figure <a href="9-hypothesis-testing.html#fig:zcurve">9.19</a> discussed in Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:zcurve"></span>
-<img src="moderndive_files/figure-html/zcurve-1.png" alt="Standard normal z curve." width="80%" />
+<img src="ModernDive_files/figure-html/zcurve-1.png" alt="Standard normal z curve." width="100%" />
 <p class="caption">
 FIGURE 9.19: Standard normal z curve.
 </p>
 </div>
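+<p>As a small illustration of standardizing, the sketch below computes <span class="math inline">\(z\)</span>-scores for the same temperatures recorded in both Fahrenheit and Celsius. The temperature values are made up for this example, and since we only have a sample, we use the sample mean and sample standard deviation in place of <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical temperature readings in degrees Fahrenheit (made-up values)
+temps_f &lt;- c(45, 52, 61, 58, 70, 66, 49)
+
+# The same readings converted to degrees Celsius
+temps_c &lt;- (temps_f - 32) * 5 / 9
+
+# Standardize: subtract the mean, then divide by the standard deviation
+z_f &lt;- (temps_f - mean(temps_f)) / sd(temps_f)
+z_c &lt;- (temps_c - mean(temps_c)) / sd(temps_c)
+
+# Compare the two sets of z-scores (up to floating-point error)
+all.equal(z_f, z_c)</code></pre>
+<p>Because converting Fahrenheit to Celsius only shifts and rescales the values, both versions lead to the same <span class="math inline">\(z\)</span>-scores, which is what makes standard units a common scale for comparison.</p>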
-<p>Bringing these back to the difference of sample mean ratings <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> of action versus romance movies, how would we standardize this variable? By once again subtracting its mean and dividing by its standard deviation. Recall two facts from Section <a href="7-sampling.html#moral-of-the-story">7.3.3</a>. First, if the sampling was done in a representative fashion, then the sampling distribution of <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> will be centered at the true population parameter <span class="math inline">\(\mu_a - \mu_r\)</span>. Second, the standard deviation of point estimates like <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> have a special name: the standard error</p>
+<p>Bringing this back to the difference of sample mean ratings <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> of action versus romance movies, how would we standardize this variable? By once again subtracting its mean and dividing by its standard deviation. Recall two facts from Subsection <a href="7-sampling.html#moral-of-the-story">7.3.3</a>. First, if the sampling was done in a representative fashion, then the sampling distribution of <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> will be centered at the true population parameter <span class="math inline">\(\mu_a - \mu_r\)</span>. Second, the standard deviation of point estimates like <span class="math inline">\(\overline{x}_a - \overline{x}_r\)</span> has a special name: the standard error.</p>
 <p>Applying these ideas, we present the <em>two-sample <span class="math inline">\(t\)</span>-statistic</em>:</p>
 <p><span class="math display">\[t = \dfrac{ (\bar{x}_a - \bar{x}_r) - (\mu_a - \mu_r)}{ \text{SE}_{\bar{x}_a - \bar{x}_r} } = \dfrac{ (\bar{x}_a - \bar{x}_r) - (\mu_a - \mu_r)}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}}  }\]</span></p>
-<p>Oofda! There is a lot to try to unpack here! Let’s go slowly. In the numerator <span class="math inline">\(\bar{x}_a-\bar{x}_r\)</span> is the difference in sample means while <span class="math inline">\(\mu_a - \mu_r\)</span> is the difference in population means.</p>
-<p>In the denominator <span class="math inline">\(s_a\)</span> and <span class="math inline">\(s_r\)</span> are the <em>sample standard deviations</em> of the action and romance movies in our sample <code>movies_sample</code>. Lastly, <span class="math inline">\(n_a\)</span> and <span class="math inline">\(n_r\)</span> are the sample sizes of the action and romance movies. Putting this together gives us the standard error <span class="math inline">\(\text{SE}_{\bar{x}_a - \bar{x}_r}\)</span>.</p>
-<p>Observe that the formula for <span class="math inline">\(\text{SE}_{\bar{x}_a - \bar{x}_r}\)</span> has the sample sizes <span class="math inline">\(n_a\)</span> and <span class="math inline">\(n_r\)</span> in them. So as the sample sizes increase, the standard error goes down. We’ve seen this concept numerous times now, in particular in our simulations using the three virtual shovels with <span class="math inline">\(n\)</span> = 25, 50, and 100 slots in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a> and in Section <a href="8-confidence-intervals.html#ci-width">8.5.3</a> where we studied the effect of using larger sample sizes on the widths of confidence intervals.</p>
-<p>So how can we use the two-sample <span class="math inline">\(t\)</span>-statistic as a test statistic in our hypothesis test? First, assuming the null hypothesis <span class="math inline">\(H_0: \mu_a - \mu_r = 0\)</span> is true, the right-hand side of the numerator, <span class="math inline">\(\mu_a - \mu_r\)</span>, becomes 0. Second, similarly to how the Central Limit Theorem from Section <a href="7-sampling.html#sampling-conclusion-central-limit-theorem">7.5.2</a> states that sample means follow a normal distribution, it can be mathematically proven that the two-sample <span class="math inline">\(t\)</span>-statistic follows a <em><span class="math inline">\(t\)</span> distribution with degrees of freedom</em> “roughly equal” to <span class="math inline">\(df = n_a + n_r - 2\)</span>.</p>
-<p>We display three examples of <span class="math inline">\(t\)</span>-distributions in Figure <a href="9-hypothesis-testing.html#fig:t-distributions">9.20</a> along with the standard normal <span class="math inline">\(z\)</span> curve.</p>
-<!--
-TODO: Add legend to ggplot version of images/t-distributions.png
--->
+<p>Oofda! There is a lot to try to unpack here! Let’s go slowly. In the numerator, <span class="math inline">\(\bar{x}_a-\bar{x}_r\)</span> is the difference in sample means, while <span class="math inline">\(\mu_a - \mu_r\)</span> is the difference in population means. In the denominator, <span class="math inline">\(s_a\)</span> and <span class="math inline">\(s_r\)</span> are the <em>sample standard deviations</em> of the action and romance movies in our sample <code>movies_sample</code>. Lastly, <span class="math inline">\(n_a\)</span> and <span class="math inline">\(n_r\)</span> are the sample sizes of the action and romance movies. Putting this together under the square root gives us the standard error <span class="math inline">\(\text{SE}_{\bar{x}_a - \bar{x}_r}\)</span>.</p>
+<p>Observe that the formula for <span class="math inline">\(\text{SE}_{\bar{x}_a - \bar{x}_r}\)</span> has the sample sizes <span class="math inline">\(n_a\)</span> and <span class="math inline">\(n_r\)</span> in it. So as the sample sizes increase, the standard error goes down. We’ve seen this concept numerous times now, in particular in our simulations using the three virtual shovels with <span class="math inline">\(n\)</span> = 25, 50, and 100 slots in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a> and in Subsection <a href="8-confidence-intervals.html#ci-width">8.5.3</a> where we studied the effect of using larger sample sizes on the widths of confidence intervals.</p>
+<p>So how can we use the two-sample <span class="math inline">\(t\)</span>-statistic as a test statistic in our hypothesis test? First, assuming the null hypothesis <span class="math inline">\(H_0: \mu_a - \mu_r = 0\)</span> is true, the right-hand side of the numerator (to the right of the <span class="math inline">\(-\)</span> sign), <span class="math inline">\(\mu_a - \mu_r\)</span>, becomes 0.</p>
+<p>Second, similarly to how the Central Limit Theorem from Subsection <a href="7-sampling.html#sampling-conclusion-central-limit-theorem">7.5.2</a> states that sample means follow a normal distribution, it can be mathematically proven that the two-sample <span class="math inline">\(t\)</span>-statistic follows a <em><span class="math inline">\(t\)</span> distribution with degrees of freedom</em> “roughly equal” to <span class="math inline">\(df = n_a + n_r - 2\)</span>. To better understand this concept of <em>degrees of freedom</em>, we next display three examples of <span class="math inline">\(t\)</span>-distributions in Figure <a href="9-hypothesis-testing.html#fig:t-distributions">9.20</a> along with the standard normal <span class="math inline">\(z\)</span> curve.</p>
 <div class="figure" style="text-align: center"><span id="fig:t-distributions"></span>
-<img src="moderndive_files/figure-html/t-distributions-1.png" alt="Examples of t-distributions and the z curve." width="\textwidth" />
+<img src="ModernDive_files/figure-html/t-distributions-1.png" alt="Examples of t-distributions and the z curve." width="100%" />
 <p class="caption">
 FIGURE 9.20: Examples of t-distributions and the z curve.
 </p>
 </div>
-<p>Begin by looking at the center of the plot at 0 on the horizontal axis. As you move up from the value of 0, follow along with the labels and note that the bottom curve corresponds to 1 degree of freedom, the curve above it is for 3 degrees of freedom, the curve above that is for 10 degrees of freedom, and lastly the dashed curve is the standard normal <span class="math inline">\(z\)</span> curve.</p>
+<p>Begin by looking at the center of the plot at 0 on the horizontal axis. As you move up from the value of 0, follow along with the labels and note that the bottom curve corresponds to 1 degree of freedom, the curve above it is for 3 degrees of freedom, the curve above that is for 10 degrees of freedom, and lastly the dotted curve is the standard normal <span class="math inline">\(z\)</span> curve.</p>
 <p>Observe that all four curves have a bell shape, are centered at 0, and that as the degrees of freedom increase, the <span class="math inline">\(t\)</span>-distribution more and more resembles the standard normal <span class="math inline">\(z\)</span> curve. The “degrees of freedom”  measures how different the <span class="math inline">\(t\)</span> distribution will be from a normal distribution. <span class="math inline">\(t\)</span>-distributions tend to have more values in the tails of their distributions than the standard normal <span class="math inline">\(z\)</span> curve.</p>
-<p>This “roughly equal” statement indicates that the equation <span class="math inline">\(df = n_a + n_r - 2\)</span> is a “good enough” approximation to the true degrees of freedom. The true <a href="https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_unequal_variances">formula</a> is a bit more complicated than this simple expression, but we’ve found the formula to be beyond the reach of those new to statistical inference and it does little to build the intuition of the <span class="math inline">\(t\)</span>-test. The message to retain however is that small sample sizes lead to small degrees of freedom and thus small sample sizes lead to <span class="math inline">\(t\)</span>-distributions that are different than the <span class="math inline">\(z\)</span> curve. On the other hand, large sample sizes lead to large degrees of freedom and thus lead to <span class="math inline">\(t\)</span> distributions that closely align with the standard normal <span class="math inline">\(z\)</span>-curve.</p>
+<p>This “roughly equal” statement indicates that the equation <span class="math inline">\(df = n_a + n_r - 2\)</span> is a “good enough” approximation to the true degrees of freedom. The true <a href="https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_unequal_variances">formula</a> is a bit more complicated than this simple expression, but we’ve found the formula to be beyond the reach of those new to statistical inference and it does little to build the intuition of the <span class="math inline">\(t\)</span>-test.</p>
+<p>The message to retain, however, is that small sample sizes lead to small degrees of freedom and thus small sample sizes lead to <span class="math inline">\(t\)</span>-distributions that are different than the <span class="math inline">\(z\)</span> curve. On the other hand, large sample sizes correspond to large degrees of freedom and thus produce <span class="math inline">\(t\)</span> distributions that closely align with the standard normal <span class="math inline">\(z\)</span>-curve.</p>
 <p>So, assuming the null hypothesis <span class="math inline">\(H_0\)</span> is true, our formula for the test statistic simplifies a bit:</p>
 <p><span class="math display">\[t = \dfrac{ (\bar{x}_a - \bar{x}_r) - 0}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}}  } = \dfrac{ \bar{x}_a - \bar{x}_r}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}}  }\]</span></p>
 <p>Let’s compute the values necessary for this two-sample <span class="math inline">\(t\)</span>-statistic. Recall the summary statistics we computed during our exploratory data analysis in Section <a href="9-hypothesis-testing.html#imdb-data">9.5.1</a>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>(), <span class="dt">mean_rating =</span> <span class="kw">mean</span>(rating), <span class="dt">std_dev =</span> <span class="kw">sd</span>(rating))</code></pre>
+<div class="sourceCode" id="cb401"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb401-1" data-line-number="1">movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb401-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb401-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>(), <span class="dt">mean_rating =</span> <span class="kw">mean</span>(rating), <span class="dt">std_dev =</span> <span class="kw">sd</span>(rating))</a></code></pre></div>
 <pre><code># A tibble: 2 x 4
   genre       n mean_rating std_dev
   &lt;chr&gt;   &lt;int&gt;       &lt;dbl&gt;   &lt;dbl&gt;
@@ -1745,69 +1730,71 @@ <h4>Two-sample t-statistic</h4>
 \dfrac{5.28 - 6.32}{ \sqrt{\dfrac{{1.36}^2}{32} + \dfrac{{1.61}^2}{36}}  } = 
 -2.906
 \]</span></p>
-<p>Great! How can we compute the p-value using this theory-based test statistic? We need to compare it to a null distribution, which we construct next.</p>
+<p>Great! How can we compute the <span class="math inline">\(p\)</span>-value using this theory-based test statistic? We need to compare it to a null distribution, which we construct next.</p>
 </div>
 <div id="null-distribution" class="section level4 unnumbered">
 <h4>Null distribution</h4>
-<p>Let’s revisit the null distribution for the test statistic <span class="math inline">\(\bar{x}_a - \bar{x}_r\)</span> we constructed in Section <a href="9-hypothesis-testing.html#ht-case-study">9.5</a>. Let’s visualize this in the left-hand plot of Figure <a href="9-hypothesis-testing.html#fig:comparing-diff-means-t-stat">9.21</a></p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Construct null distribution of xbar_a - xbar_m:</span>
-null_distribution_movies &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))
-<span class="kw">visualize</span>(null_distribution_movies, <span class="dt">bins =</span> <span class="dv">10</span>)</code></pre>
-<p>The <code>infer</code> package also includes some built-in theory-based test statistics as well. So instead of calculating the test statistic of interest as the <code>&quot;diff in means&quot;</code> <span class="math inline">\(\bar{x}_a - \bar{x}_r\)</span>, we can calculate this defined two-sample <span class="math inline">\(t\)</span>-statistic by setting <code>stat = &quot;t&quot;</code>. Let’s visualize this in the right-hand plot of Figure <a href="9-hypothesis-testing.html#fig:comparing-diff-means-t-stat">9.21</a></p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Construct null distribution of t:</span>
-null_distribution_movies_t &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="co"># Notice we switched stat from &quot;diff in means&quot; to &quot;t&quot;</span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;t&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))
-<span class="kw">visualize</span>(null_distribution_movies_t, <span class="dt">bins =</span> <span class="dv">10</span>)</code></pre>
+<p>Let’s revisit the null distribution for the test statistic <span class="math inline">\(\bar{x}_a - \bar{x}_r\)</span> we constructed in Section <a href="9-hypothesis-testing.html#ht-case-study">9.5</a>. Let’s visualize this in the left-hand plot of Figure <a href="9-hypothesis-testing.html#fig:comparing-diff-means-t-stat">9.21</a>.</p>
+<div class="sourceCode" id="cb403"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb403-1" data-line-number="1"><span class="co"># Construct null distribution of xbar_a - xbar_r:</span></a>
+<a class="sourceLine" id="cb403-2" data-line-number="2">null_distribution_movies &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb403-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb403-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb403-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb403-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))</a>
+<a class="sourceLine" id="cb403-7" data-line-number="7"><span class="kw">visualize</span>(null_distribution_movies, <span class="dt">bins =</span> <span class="dv">10</span>)</a></code></pre></div>
+<p>The <code>infer</code> package also includes some built-in theory-based test statistics. So instead of calculating the test statistic of interest as the <code>&quot;diff in means&quot;</code> <span class="math inline">\(\bar{x}_a - \bar{x}_r\)</span>, we can calculate the two-sample <span class="math inline">\(t\)</span>-statistic defined earlier by setting <code>stat = &quot;t&quot;</code>. Let’s visualize this in the right-hand plot of Figure <a href="9-hypothesis-testing.html#fig:comparing-diff-means-t-stat">9.21</a>.</p>
+<div class="sourceCode" id="cb404"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb404-1" data-line-number="1"><span class="co"># Construct null distribution of t:</span></a>
+<a class="sourceLine" id="cb404-2" data-line-number="2">null_distribution_movies_t &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb404-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb404-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb404-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;permute&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb404-6" data-line-number="6"><span class="st">  </span><span class="co"># Notice we switched stat from &quot;diff in means&quot; to &quot;t&quot;</span></a>
+<a class="sourceLine" id="cb404-7" data-line-number="7"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;t&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))</a>
+<a class="sourceLine" id="cb404-8" data-line-number="8"><span class="kw">visualize</span>(null_distribution_movies_t, <span class="dt">bins =</span> <span class="dv">10</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:comparing-diff-means-t-stat"></span>
-<img src="moderndive_files/figure-html/comparing-diff-means-t-stat-1.png" alt="Comparing the null distributions of two test statistics." width="100%" />
+<img src="ModernDive_files/figure-html/comparing-diff-means-t-stat-1.png" alt="Comparing the null distributions of two test statistics." width="\textwidth" />
 <p class="caption">
 FIGURE 9.21: Comparing the null distributions of two test statistics.
 </p>
 </div>
-<p>Observe that while the shape of the null distributions of both the difference in means <span class="math inline">\(\bar{x}_a - \bar{x}_r\)</span> and the two-sample <span class="math inline">\(t\)</span>-statistic are similar, the scales on the x-axis are different. The two-sample <span class="math inline">\(t\)</span>-statistic are spread out over a larger range.</p>
-<p>However, a traditional theory-based <span class="math inline">\(t\)</span>-test doesn’t look at the simulated histogram in <code>null_distribution_movies_t</code>, but instead it looks at the <span class="math inline">\(t\)</span>-distribution curve with degrees of freedom equal to roughly 65.85. This calculation is based on the complicated formula referenced previously, which we approximated with <span class="math inline">\(df = n_a + n_r - 2\)</span> = 32 + 36 - 2 = 66. Let’s overlay this <span class="math inline">\(t\)</span>-distribution curve over the top of our simulated two-sample <span class="math inline">\(t\)</span>-statistics using the <code>method = &quot;both&quot;</code> argument in <code>visualize()</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distribution_movies_t, <span class="dt">bins =</span> <span class="dv">10</span>, <span class="dt">method =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>Observe that while the shapes of the null distributions of the difference in means <span class="math inline">\(\bar{x}_a - \bar{x}_r\)</span> and of the two-sample <span class="math inline">\(t\)</span>-statistic are similar, the scales on the x-axis are different. The two-sample <span class="math inline">\(t\)</span>-statistic values are spread out over a larger range.</p>
+<p>However, a traditional theory-based <span class="math inline">\(t\)</span>-test doesn’t look at the simulated histogram in <code>null_distribution_movies_t</code>, but instead it looks at the <span class="math inline">\(t\)</span>-distribution curve with degrees of freedom equal to roughly 65.85. This calculation is based on the complicated formula referenced previously, which we approximated with <span class="math inline">\(df = n_a + n_r - 2 = 32 + 36 - 2 = 66\)</span>. Let’s overlay this <span class="math inline">\(t\)</span>-distribution curve over the top of our simulated two-sample <span class="math inline">\(t\)</span>-statistics using the <code>method = &quot;both&quot;</code> argument in <code>visualize()</code>.</p>
+<div class="sourceCode" id="cb405"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb405-1" data-line-number="1"><span class="kw">visualize</span>(null_distribution_movies_t, <span class="dt">bins =</span> <span class="dv">10</span>, <span class="dt">method =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:t-stat-3"></span>
-<img src="moderndive_files/figure-html/t-stat-3-1.png" alt="Null distribution using t-statistic and t-distribution." width="100%" />
+<img src="ModernDive_files/figure-html/t-stat-3-1.png" alt="Null distribution using t-statistic and t-distribution." width="\textwidth" />
 <p class="caption">
 FIGURE 9.22: Null distribution using t-statistic and t-distribution.
 </p>
 </div>
-<p>Observe that the curve does a good job of approximating the histogram here. To calculate the <span class="math inline">\(p\)</span>-value in this case, we need to figure out how much of the total area under the <span class="math inline">\(t\)</span>-distribution curve is equal or “more extreme” our observed two-sample <span class="math inline">\(t\)</span>-statistic. Since our alternative hypothesis <span class="math inline">\(H_A: \mu_a - \mu_r \neq 0\)</span> is a two-sided alternative, we need to add up the areas in both tails.</p>
-<p>We first compute the observed two-sample <span class="math inline">\(t\)</span>-statistic using <code>infer</code> verbs:</p>
-<pre class="sourceCode r"><code class="sourceCode r">obs_two_sample_t &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;t&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))
-obs_two_sample_t</code></pre>
+<p>Observe that the curve does a good job of approximating the histogram here. To calculate the <span class="math inline">\(p\)</span>-value in this case, we need to figure out how much of the total area under the <span class="math inline">\(t\)</span>-distribution curve is at or “more extreme” than our observed two-sample <span class="math inline">\(t\)</span>-statistic. Since <span class="math inline">\(H_A: \mu_a - \mu_r \neq 0\)</span> is a two-sided alternative, we need to add up the areas in both tails.</p>
+<p>We first compute the observed two-sample <span class="math inline">\(t\)</span>-statistic using <code>infer</code> verbs. This shortcut calculation further assumes that the null hypothesis is true: that the populations of action and romance movies have equal average ratings.</p>
+<div class="sourceCode" id="cb406"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb406-1" data-line-number="1">obs_two_sample_t &lt;-<span class="st"> </span>movies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb406-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">formula =</span> rating <span class="op">~</span><span class="st"> </span>genre) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb406-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;t&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Action&quot;</span>, <span class="st">&quot;Romance&quot;</span>))</a>
+<a class="sourceLine" id="cb406-4" data-line-number="4">obs_two_sample_t</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
       stat
      &lt;dbl&gt;
 1 -2.90589</code></pre>
-<p>So we are interested in finding the percentage of values that are at or above <code>obs_two_sample_t</code> = -2.906 or at or below <code>-obs_two_sample_t</code> = 2.906. We do this using the <code>shade_p_value()</code> function with the <code>direction</code> argument set to <code>&quot;both&quot;</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">visualize</span>(null_distribution_movies_t, <span class="dt">method =</span> <span class="st">&quot;both&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> obs_two_sample_t, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>We want to find the percentage of values that are at or below <code>obs_two_sample_t</code> <span class="math inline">\(= -2.906\)</span> or at or above <code>-obs_two_sample_t</code> <span class="math inline">\(= 2.906\)</span>. We use the <code>shade_p_value()</code> function with the <code>direction</code> argument set to <code>&quot;both&quot;</code> to do this:</p>
+<div class="sourceCode" id="cb408"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb408-1" data-line-number="1"><span class="kw">visualize</span>(null_distribution_movies_t, <span class="dt">method =</span> <span class="st">&quot;both&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb408-2" data-line-number="2"><span class="st">  </span><span class="kw">shade_p_value</span>(<span class="dt">obs_stat =</span> obs_two_sample_t, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
+<pre><code>Warning: Check to make sure the conditions have been met for the theoretical
+method. {infer} currently does not check these for you.</code></pre>
 <div class="figure" style="text-align: center"><span id="fig:t-stat-4"></span>
-<img src="moderndive_files/figure-html/t-stat-4-1.png" alt="Null distribution using t-statistic and t-distribution with p-value shaded." width="100%" />
+<img src="ModernDive_files/figure-html/t-stat-4-1.png" alt="Null distribution using t-statistic and t-distribution with $p$-value shaded." width="\textwidth" />
 <p class="caption">
-FIGURE 9.23: Null distribution using t-statistic and t-distribution with p-value shaded.
+FIGURE 9.23: Null distribution using t-statistic and t-distribution with <span class="math inline">\(p\)</span>-value shaded.
 </p>
 </div>
-<p>(We’ll discuss this warning message shortly.) What is the p-value? We apply <code>get_p_value()</code> to our null distribution saved in <code>null_distribution_movies_t</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distribution_movies_t <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> obs_two_sample_t, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
+<p>(We’ll discuss this warning message shortly.) What is the <span class="math inline">\(p\)</span>-value? We apply <code>get_p_value()</code> to our null distribution saved in <code>null_distribution_movies_t</code>:</p>
+<div class="sourceCode" id="cb410"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb410-1" data-line-number="1">null_distribution_movies_t <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb410-2" data-line-number="2"><span class="st">  </span><span class="kw">get_p_value</span>(<span class="dt">obs_stat =</span> obs_two_sample_t, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
-1   0.004</code></pre>
-<p>We have a very small p-value, and thus it is very unlikely that these results are due to <em>sampling variation</em>. Thus, we are inclined to reject <span class="math inline">\(H_0\)</span>.</p>
+1   0.002</code></pre>
+<p>We have a very small <span class="math inline">\(p\)</span>-value, and thus it is very unlikely that these results are due to <em>sampling variation</em>. Thus, we are inclined to reject <span class="math inline">\(H_0\)</span>.</p>
 <p>Let’s come back to that earlier warning message: <code>Check to make sure the conditions have been met for the theoretical method. {infer} currently does not check these for you.</code> To be able to use the <span class="math inline">\(t\)</span>-test and other such theoretical methods, there are always a few conditions to check. The <code>infer</code> package does not automatically check these conditions, hence the warning message we received. These conditions are necessary so that the underlying mathematical theory holds. In order for the results of our two-sample <span class="math inline">\(t\)</span>-test to be valid, three conditions must be met:</p>
 <ol style="list-style-type: decimal">
 <li>Nearly normal populations or large sample sizes. A general rule of thumb that works in many (but not all) situations is that the sample size <span class="math inline">\(n\)</span> should be greater than 30.</li>
@@ -1820,91 +1807,78 @@ <h4>Null distribution</h4>
 <li>This is met since we sampled the action and romance movies at random and in an unbiased fashion from the database of all IMDb movies.</li>
 <li>Unfortunately, we don’t know how IMDb computes the ratings. For example, if the same person rated multiple movies, then those observations would be related and hence not independent.</li>
 </ol>
-<p>Assuming all three conditions are met, we can be reasonably certain that the theory-based <span class="math inline">\(t\)</span>-test results are valid. If any of the conditions were not met, we couldn’t put as much faith into any conclusions.</p>
-<!--
-On the other hand, the only simulation-based assumption that needs to be met in the simulation based method are that the sample is selected at random.  
-
-They are our preferred method as they have fewer assumptions, are conceptually easier to understand, and since computing power has recently become easily accessible, they can be run quickly. That being said since much of the world's research still relies on traditional theory-based methods and thus it is important to understand them. 
--->
+<p>Assuming all three conditions are roughly met, we can be reasonably certain that the theory-based <span class="math inline">\(t\)</span>-test results are valid. If any of the conditions were clearly not met, we couldn’t put as much trust in any conclusions reached. On the other hand, in most scenarios, the only assumption that needs to be met in the simulation-based method is that the sample is selected at random. Thus, we prefer simulation-based methods: they have fewer assumptions, are conceptually easier to understand, and, since computing power has recently become easily accessible, can be run quickly. That being said, since much of the world’s research still relies on traditional theory-based methods, we also believe it is important to understand them.</p>
+<p>You may be wondering why we chose <code>reps = 1000</code> for these simulation-based methods. We’ve noticed that for most problems, after around 1000 replicates for the null distribution and the bootstrap distribution, you can start to get a general sense for how the statistic behaves. You can change <code>reps</code> to something like 10,000 if you would like even finer detail, but this will take more time to compute. Feel free to iterate on this value to get an even better idea about the shape of the null and bootstrap distributions.</p>
 </div>
 </div>
 <div id="when-inference-is-not-needed" class="section level3">
 <h3><span class="header-section-number">9.6.2</span> When inference is not needed</h3>
-<p>We’ve now walked through several different examples of how to use the <code>infer</code> package to perform statistical inference: constructing confidence intervals and conducting hypothesis tests. For each of these examples, we made it a point to always perform an exploratory data analysis (EDA) first. Specifically by looking at the raw data values, by using data visualization via <code>ggplot2</code>, and by data wrangling via <code>dplyr</code> beforehand. We <em>highly</em> encourage you to always do the same. As a beginner to statistics, EDA helps you develop intuition as to what statistical methods like confidence intervals and hypothesis tests can tell us. Even as a seasoned practitioner of statistics, EDA helps guide your statistical investigations. In particular, is statistical inference even needed?</p>
+<p>We’ve now walked through several different examples of how to use the <code>infer</code> package to perform statistical inference: constructing confidence intervals and conducting hypothesis tests. For each of these examples, we made it a point to always perform an exploratory data analysis (EDA) first; specifically, by looking at the raw data values, by using data visualization with <code>ggplot2</code>, and by data wrangling with <code>dplyr</code> beforehand. We <em>highly</em> encourage you to always do the same. As a beginner to statistics, EDA helps you develop intuition as to what statistical methods like confidence intervals and hypothesis tests can tell us. Even as a seasoned practitioner of statistics, EDA helps guide your statistical investigations. In particular, is statistical inference even needed?</p>
 <p>Let’s consider an example. Say we’re interested in the following question: Of <em>all</em> flights leaving a New York City airport, are Hawaiian Airlines flights in the air for longer than Alaska Airlines flights? Furthermore, let’s assume that 2013 flights are a representative sample of all such flights. Then we can use the <code>flights</code> data frame in the <code>nycflights13</code>  package we introduced in Section <a href="1-getting-started.html#nycflights13">1.4</a> to answer our question. Let’s filter this data frame to only include Hawaiian and Alaska Airlines using their <code>carrier</code> codes <code>HA</code> and <code>AS</code>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_sample &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">%in%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;HA&quot;</span>, <span class="st">&quot;AS&quot;</span>))</code></pre>
-<p>There are two possible statistical inference methods we could use to answer such questions. First, we could construct a 95% confidence interval for the difference in population means <span class="math inline">\(\mu_{HA} - \mu_{AS}\)</span>, where <span class="math inline">\(\mu_{HA}\)</span> is the mean air time of all Hawaiian Airlines flights and <span class="math inline">\(\mu_{AS}\)</span> is the mean air time of all Alaska Airlines flights. We could then check if the entirety of the interval is greater than 0, suggesting that <span class="math inline">\(\mu_{HA} - \mu_{AS} &gt; 0\)</span>, or in other words suggesting that <span class="math inline">\(\mu_{HA} &gt; \mu_{AS}\)</span>.</p>
-<p>Second, we could perform a hypothesis test of the null hypothesis <span class="math inline">\(H_0: \mu_{HA} - \mu_{AS} = 0\)</span> versus the alternative hypothesis <span class="math inline">\(H_A: \mu_{HA} - \mu_{AS} &gt; 0\)</span>.</p>
-<p>However, let’s first construct an exploratory visualization as we suggested earlier. Since <code>air_time</code> is numerical and <code>carrier</code> is categorical, a boxplot can display the relationship between these two variables, which we display in Figure <a href="9-hypothesis-testing.html#fig:ha-as-flights-boxplot">9.24</a></p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights_sample, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">y =</span> air_time)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Carrier&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Air Time&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb412"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb412-1" data-line-number="1">flights_sample &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb412-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(carrier <span class="op">%in%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;HA&quot;</span>, <span class="st">&quot;AS&quot;</span>))</a></code></pre></div>
+<p>There are two possible statistical inference methods we could use to answer such questions. First, we could construct a 95% confidence interval for the difference in population means <span class="math inline">\(\mu_{HA} - \mu_{AS}\)</span>, where <span class="math inline">\(\mu_{HA}\)</span> is the mean air time of all Hawaiian Airlines flights and <span class="math inline">\(\mu_{AS}\)</span> is the mean air time of all Alaska Airlines flights. We could then check if the entirety of the interval is greater than 0, suggesting that <span class="math inline">\(\mu_{HA} - \mu_{AS} &gt; 0\)</span>, or, in other words, suggesting that <span class="math inline">\(\mu_{HA} &gt; \mu_{AS}\)</span>. Second, we could perform a hypothesis test of the null hypothesis <span class="math inline">\(H_0: \mu_{HA} - \mu_{AS} = 0\)</span> versus the alternative hypothesis <span class="math inline">\(H_A: \mu_{HA} - \mu_{AS} &gt; 0\)</span>.</p>
+<p>However, let’s first construct an exploratory visualization as we suggested earlier. Since <code>air_time</code> is numerical and <code>carrier</code> is categorical, a boxplot can display the relationship between these two variables, which we display in Figure <a href="9-hypothesis-testing.html#fig:ha-as-flights-boxplot">9.24</a>.</p>
+<div class="sourceCode" id="cb413"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb413-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights_sample, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier, <span class="dt">y =</span> air_time)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb413-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb413-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Carrier&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Air Time&quot;</span>)</a></code></pre></div>
 <div class="figure" style="text-align: center"><span id="fig:ha-as-flights-boxplot"></span>
-<img src="moderndive_files/figure-html/ha-as-flights-boxplot-1.png" alt="Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013." width="\textwidth" />
+<img src="ModernDive_files/figure-html/ha-as-flights-boxplot-1.png" alt="Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013." width="\textwidth" />
 <p class="caption">
 FIGURE 9.24: Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013.
 </p>
 </div>
-<p>This is what we like to call “you don’t need no PhD in statistics” moments. You don’t need to be an expert in statistics to know that Alaska Airlines and Hawaiian Airlines have <em>significantly</em> different air times. The two boxes don’t even overlap! Constructing a confidence interval or conducting a hypothesis test would frankly not provide much more insight than Figure <a href="9-hypothesis-testing.html#fig:ha-as-flights-boxplot">9.24</a>.</p>
-<p>Let’s investigate why we observe such a clear cut difference between these two airlines using data wrangling. Let’s first group by the rows of <code>flights_sample</code> not only by <code>carrier</code> but also by destination <code>dest</code>. Subsequently we’ll compute two summary statistics: the number of observations using <code>n()</code> and the mean airtime:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_sample <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(carrier, dest) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>(), <span class="dt">mean_time =</span> <span class="kw">mean</span>(air_time, <span class="dt">na.rm =</span><span class="ot">TRUE</span>))</code></pre>
+<p>This is what we like to call a “no PhD in Statistics needed” moment. You don’t have to be an expert in statistics to know that Alaska Airlines and Hawaiian Airlines have <em>significantly</em> different air times. The two boxplots don’t even overlap! Constructing a confidence interval or conducting a hypothesis test would frankly not provide much more insight than Figure <a href="9-hypothesis-testing.html#fig:ha-as-flights-boxplot">9.24</a>.</p>
+<p>Let’s investigate why we observe such a clear-cut difference between these two airlines using data wrangling. Let’s first group the rows of <code>flights_sample</code> not only by <code>carrier</code> but also by destination <code>dest</code>. Subsequently, we’ll compute two summary statistics: the number of observations using <code>n()</code> and the mean air time:</p>
+<div class="sourceCode" id="cb414"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb414-1" data-line-number="1">flights_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb414-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(carrier, dest) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb414-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>(), <span class="dt">mean_time =</span> <span class="kw">mean</span>(air_time, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
 <pre><code># A tibble: 2 x 4
 # Groups:   carrier [2]
   carrier dest      n mean_time
   &lt;chr&gt;   &lt;chr&gt; &lt;int&gt;     &lt;dbl&gt;
 1 AS      SEA     714   325.618
 2 HA      HNL     342   623.088</code></pre>
-<p>It turns out that from New York City, Alaska only flies to <code>SEA</code> (Seattle) from New York City (NYC) while Hawaiian only flies to <code>HNL</code> (Honolulu) from NYC. Given the clear difference in distance from New York City to Seattle versus New York City to Honolulu, it is not surprising that we observe such different air times in flights.</p>
+<p>It turns out that in 2013, Alaska only flew to <code>SEA</code> (Seattle) from New York City (NYC) while Hawaiian only flew to <code>HNL</code> (Honolulu) from NYC. Given the clear difference in distance from New York City to Seattle versus New York City to Honolulu, it is not surprising that we observe such different (<em>statistically significantly different</em>, in fact) air times in flights.</p>
 <p>This is a clear example of not needing to do anything more than a simple exploratory data analysis using data visualization and descriptive statistics to get an appropriate conclusion. This is why we highly recommend you perform an EDA of any sample data before running statistical inference methods like confidence intervals and hypothesis tests.</p>
-<div class="learncheck">
-<p>
-<strong><em>Learning check</em></strong>
-</p>
-</div>
-<p><strong>(LC9.17)</strong> Could we make the same type of immediate conclusion that SFO had a statistically greater <code>air_time</code> if, say, its corresponding standard deviation was 200 minutes? What about 100 minutes? Explain.</p>
-<div class="learncheck">
-
-</div>
 </div>
 <div id="problems-with-p-values" class="section level3">
 <h3><span class="header-section-number">9.6.3</span> Problems with p-values</h3>
-<p>On top of the many common misunderstandings about hypothesis testing and p-values we listed in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>, another unfortunate consequence of the expanded use of p-values and hypothesis testing is a phenomenon known as “p-hacking.”  p-hacking is the act of “cherry-picking” only results that are “statistically significant” while dismissing those that aren’t, even if at the expense of the scientific ideas. There are lots of articles written recently about misunderstandings and the problems with p-values. We encourage you to check some of them out:</p>
+<p>On top of the many common misunderstandings about hypothesis testing and <span class="math inline">\(p\)</span>-values we listed in Section <a href="9-hypothesis-testing.html#ht-interpretation">9.4</a>, another unfortunate consequence of the expanded use of <span class="math inline">\(p\)</span>-values and hypothesis testing is a phenomenon known as “p-hacking.” p-hacking is the act of “cherry-picking” only results that are “statistically significant” while dismissing those that aren’t, even at the expense of the scientific ideas. Many recent articles discuss misunderstandings of and problems with <span class="math inline">\(p\)</span>-values. We encourage you to check some of them out:</p>
 <ol style="list-style-type: decimal">
 <li><a href="https://en.wikipedia.org/wiki/Misunderstandings_of_p-values">Misunderstandings of <span class="math inline">\(p\)</span>-values</a></li>
-<li><a href="https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005">What a nerdy debate about p-values shows about science - and how to fix it</a></li>
+<li><a href="https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005">What a nerdy debate about <span class="math inline">\(p\)</span>-values shows about science - and how to fix it</a></li>
 <li><a href="https://www.nature.com/news/statisticians-issue-warning-over-misuse-of-p-values-1.19503">Statisticians issue warning over misuse of <span class="math inline">\(P\)</span> values</a></li>
 <li><a href="https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/">You Can’t Trust What You Read About Nutrition</a></li>
 <li><a href="http://www.fharrell.com/post/pval-litany/">A Litany of Problems with p-values</a></li>
 </ol>
-<p>Such issues were getting so bad that the American Statistical Association (ASA) put out a statement in 2016 titled <a href="https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf">“The ASA Statement on p-Values: Context, Process, and Purpose”</a> with six principles underlying the proper use and interpretation of p-values. The ASA released this guidance on p-values to improve the conduct and interpretation of quantitative science and inform the growing emphasis on reproducibility of science research.</p>
-<p>We as authors much prefer the use of confidence intervals for statistical inference, since in our opinion they are much less prone to large misinterpretation. However, many fields still exclusively use <span class="math inline">\(p\)</span>-values for statistical inference, thus we still included them in our text. We encourage you to learn more about “p-hacking” as well and its implication for science.</p>
+<p>Such issues were getting so problematic that the American Statistical Association (ASA) put out a statement in 2016 titled, <a href="https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf">“The ASA Statement on Statistical Significance and <span class="math inline">\(P\)</span>-Values,”</a> with six principles underlying the proper use and interpretation of <span class="math inline">\(p\)</span>-values. The ASA released this guidance on <span class="math inline">\(p\)</span>-values to improve the conduct and interpretation of quantitative science and to inform the growing emphasis on reproducibility of science research.</p>
+<p>We as authors much prefer the use of confidence intervals for statistical inference, since in our opinion they are much less prone to large misinterpretation. However, many fields still exclusively use <span class="math inline">\(p\)</span>-values for statistical inference, and this is one reason for including them in this text. We encourage you to learn more about “p-hacking” as well as its implications for science.</p>
 </div>
 <div id="additional-resources-7" class="section level3">
 <h3><span class="header-section-number">9.6.4</span> Additional resources</h3>
 <p>An R script file of all R code used in this chapter is available <a href="scripts/09-hypothesis-testing.R">here</a>.</p>
-<p>If you want more examples of the <code>infer</code> workflow to conducting hypothesis tests, we suggest you check out the <code>infer</code> package homepage, in particular, a series of example analyses available at <a href="https://infer.netlify.com/articles/" class="uri">https://infer.netlify.com/articles/</a>.</p>
+<p>If you want more examples of the <code>infer</code> workflow for conducting hypothesis tests, we suggest you check out the <code>infer</code> package homepage, in particular, a series of example analyses available at <a href="https://infer.netlify.com/articles/" class="uri">https://infer.netlify.com/articles/</a>.</p>
 </div>
 <div id="whats-to-come-8" class="section level3">
 <h3><span class="header-section-number">9.6.5</span> What’s to come</h3>
-<p>We conclude by showing the <code>infer</code> pipeline diagram for hypothesis testing.</p>
+<p>We conclude with the <code>infer</code> pipeline for hypothesis testing in Figure <a href="9-hypothesis-testing.html#fig:infer-workflow-ht">9.25</a>.</p>
 <div class="figure" style="text-align: center"><span id="fig:infer-workflow-ht"></span>
-<img src="images/flowcharts/infer/ht_diagram.png" alt="infer package workflow for hypothesis testing." width="100%" />
+<img src="images/flowcharts/infer/ht_diagram_trimmed.png" alt="infer package workflow for hypothesis testing." width="100%" height="100%" />
 <p class="caption">
 FIGURE 9.25: infer package workflow for hypothesis testing.
 </p>
 </div>
 <p>Now that we’ve armed ourselves with an understanding of confidence intervals from Chapter <a href="8-confidence-intervals.html#confidence-intervals">8</a> and hypothesis tests from this chapter, we’ll now study inference for regression in the upcoming Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>.</p>
-<p>We’ll revisit the regression models we studied in Chapters <a href="5-regression.html#regression">5</a> on basic regression and <a href="6-multiple-regression.html#multiple-regression">6</a>. For example, recall Table <a href="5-regression.html#tab:regtable">5.2</a>, where we displayed the regression table corresponding to our regression model for an instructor’s teaching score as a function of their “beauty” score.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit regression model:</span>
-score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals)
-<span class="co"># Get regression table:</span>
-<span class="kw">get_regression_table</span>(score_model)</code></pre>
+<p>We’ll revisit the regression models we studied in Chapter <a href="5-regression.html#regression">5</a> on basic regression and Chapter <a href="6-multiple-regression.html#multiple-regression">6</a> on multiple regression. For example, recall Table <a href="5-regression.html#tab:regtable">5.2</a> (shown again here in Table <a href="9-hypothesis-testing.html#tab:regression-table-inference">9.4</a>), corresponding to our regression model for an instructor’s teaching score as a function of their “beauty” score.</p>
+<div class="sourceCode" id="cb416"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb416-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb416-2" data-line-number="2">score_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>bty_avg, <span class="dt">data =</span> evals)</a>
+<a class="sourceLine" id="cb416-3" data-line-number="3"></a>
+<a class="sourceLine" id="cb416-4" data-line-number="4"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb416-5" data-line-number="5"><span class="kw">get_regression_table</span>(score_model)</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:regression-table-inference">TABLE 9.4: </span>Linear regression table.
+<span id="tab:regression-table-inference">TABLE 9.4: </span>Linear regression table
 </caption>
 <thead>
 <tr>
@@ -1980,7 +1954,7 @@ <h3><span class="header-section-number">9.6.5</span> What’s to come</h3>
 </tr>
 </tbody>
 </table>
-<p>We previously saw in Section <a href="5-regression.html#model1table">5.1.2</a> that the values in the <code>estimate</code> column are the fitted intercept <span class="math inline">\(b_0\)</span> and fitted slope for beauty score <span class="math inline">\(b_1\)</span>. In Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>, we’ll unpack the remaining columns: <code>std_error</code> which is the standard error, <code>statistic</code> which is the observed <em>standardized</em> test statistic to compute the <code>p_value</code>, and the 95% confidence intervals as given by <code>lower_ci</code> and <code>upper_ci</code>.</p>
+<p>We previously saw in Subsection <a href="5-regression.html#model1table">5.1.2</a> that the values in the <code>estimate</code> column are the fitted intercept <span class="math inline">\(b_0\)</span> and fitted slope for beauty score <span class="math inline">\(b_1\)</span>. In Chapter <a href="10-inference-for-regression.html#inference-for-regression">10</a>, we’ll unpack the remaining columns: <code>std_error</code> which is the standard error, <code>statistic</code> which is the observed <em>standardized</em> test statistic to compute the <code>p_value</code>, and the 95% confidence intervals as given by <code>lower_ci</code> and <code>upper_ci</code>.</p>
 
 </div>
 </div>
@@ -1996,11 +1970,13 @@ <h3><span class="header-section-number">9.6.5</span> What’s to come</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -2008,12 +1984,11 @@ <h3><span class="header-section-number">9.6.5</span> What’s to come</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -2028,6 +2003,10 @@ <h3><span class="header-section-number">9.6.5</span> What’s to come</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -2044,8 +2023,9 @@ <h3><span class="header-section-number">9.6.5</span> What’s to come</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/A-appendixA.html b/docs/A-appendixA.html
index 28b5ee31d..a6c884189 100644
--- a/docs/A-appendixA.html
+++ b/docs/A-appendixA.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>A Statistical Background | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="A Statistical Background | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="A Statistical Background | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="11-thinking-with-data.html">
-<link rel="next" href="B-appendixB.html">
+<link rel="prev" href="11-thinking-with-data.html"/>
+<link rel="next" href="B-appendixB.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -584,46 +597,46 @@ <h3><span class="header-section-number">A.1.2</span> Median</h3>
 </div>
 <div id="standard-deviation" class="section level3">
 <h3><span class="header-section-number">A.1.3</span> Standard deviation</h3>
-<p>We will next discuss the <em>standard deviation</em> of a variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far we expect a given data value will be from its mean:</p>
-<p><span class="math display">\[Standard \, deviation = \sqrt{\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \cdots + (x_n - Mean)^2}{n - 1}}\]</span></p>
+<p>We will next discuss the <em>standard deviation</em> (<span class="math inline">\(sd\)</span>) of a variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far we expect a given data value will be from its mean:</p>
+<p><span class="math display">\[sd = \sqrt{\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \cdots + (x_n - Mean)^2}{n - 1}}\]</span></p>
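As a minimal sketch of the formula above, the standard deviation can be computed step by step in R and compared with the built-in `sd()` function. The data values in `x` are hypothetical and chosen only for illustration.

```r
# Hypothetical data values, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
x_bar <- mean(x)

# Sum the squared deviations from the mean, divide by n - 1, take the square root
sd_by_hand <- sqrt(sum((x - x_bar)^2) / (n - 1))
sd_by_hand

# Base R's sd() also uses the n - 1 denominator, so the two should agree
sd_builtin <- sd(x)
all.equal(sd_by_hand, sd_builtin)
```

Computing the value by hand once makes it easier to see that `sd()` measures the typical distance of the values from their mean.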
 </div>
 <div id="five-number-summary" class="section level3">
 <h3><span class="header-section-number">A.1.4</span> Five-number summary</h3>
-<p>The <em>five-number summary</em> consists of five summary statistics: the minimum, the first quantile AKA 25<sup>th</sup> percentile, the second quantile AKA median AKA 50<sup>th</sup> percentile, the third quantile AKA 75<sup>th</sup>, and the maximum. The five-number summary of a variable is used when constructing boxplots, as seen in Section <a href="2-viz.html#boxplots">2.7</a>.</p>
+<p>The <em>five-number summary</em> consists of five summary statistics: the minimum, the first quartile AKA 25th percentile, the second quartile AKA median or 50th percentile, the third quartile AKA 75th percentile, and the maximum. The five-number summary of a variable is used when constructing boxplots, as seen in Section <a href="2-viz.html#boxplots">2.7</a>.</p>
 <p>The quartiles are calculated as</p>
 <ul>
 <li>first quartile (<span class="math inline">\(Q_1\)</span>): the median of the first half of the sorted data</li>
 <li>third quartile (<span class="math inline">\(Q_3\)</span>): the median of the second half of the sorted data</li>
 </ul>
-<p>The <em>interquartile range (IQR)</em> is defined as <span class="math inline">\(Q_3 - Q_1\)</span> and is a measure of how spread out the middle 50% of values are. The IQR corresponds to the length of a box in a boxplot.</p>
-<p>The median and the interquartile range are not influenced by the presence of outliers in the ways that the mean and standard deviation are. It is, thus, recommended for skewed datasets. We say in this case that the median and interquartile range are more <em>robust to outliers</em>.</p>
+<p>The <em>interquartile range (IQR)</em> is defined as <span class="math inline">\(Q_3 - Q_1\)</span> and is a measure of how spread out the middle 50% of values are. The IQR corresponds to the length of the box in a boxplot.</p>
+<p>The median and the IQR are not influenced by the presence of outliers in the ways that the mean and standard deviation are. They are, thus, recommended for skewed datasets. We say in this case that the median and IQR are more <em>robust to outliers</em>.</p>
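The following sketch illustrates the five-number summary and the robustness claim, assuming a small hypothetical dataset with an even number of values. It uses base R's `fivenum()`, whose "hinges" match the median-of-each-half definition given above for this data. Note that base R's `IQR()` relies on a different default quantile method, so it may return a slightly different number.

```r
# Hypothetical data with an even number of values and one large value (50)
x <- c(1, 3, 4, 6, 8, 9, 12, 50)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five

# IQR based on the hinges: Q3 - Q1
iqr_hinges <- five[4] - five[2]
iqr_hinges

# Same data, but with the large value 50 replaced by 13
y <- c(1, 3, 4, 6, 8, 9, 12, 13)

# The mean shifts noticeably, while the median stays the same
c(mean_x = mean(x), mean_y = mean(y))
c(median_x = median(x), median_y = median(y))
```

In this example the median and the hinge-based IQR are the same for `x` and `y`, while the mean changes, which is what "robust to outliers" means in practice.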
 </div>
 <div id="distribution" class="section level3">
 <h3><span class="header-section-number">A.1.5</span> Distribution</h3>
-<p>The <em>distribution</em> of a variable shows how frequently different values of a variable occur. Looking at visualization of a distribution can show where the values are centered, show how the values vary, and give some information about where a typical value might fall. It can also alert you to the presence of outliers.</p>
-<p>Recall from Chapter <a href="2-viz.html#viz">2</a> that we can visualize the distribution of a numerical variable using a histogram and that we can visualize the distribution of a categorical variable using a barplot.</p>
+<p>The <em>distribution</em> of a variable shows how frequently different values of a variable occur. Looking at the visualization of a distribution can show where the values are centered, show how the values vary, and give some information about where a typical value might fall. It can also alert you to the presence of outliers.</p>
+<p>Recall from Chapter <a href="2-viz.html#viz">2</a> that we can visualize the distribution of a numerical variable using binning in a histogram and that we can visualize the distribution of a categorical variable using a barplot.</p>
 </div>
 <div id="outliers" class="section level3">
 <h3><span class="header-section-number">A.1.6</span> Outliers</h3>
-<p><em>Outliers</em> correspond to values in the dataset that fall far outside the range of “ordinary” values. In context of a boxplot, by default they correspond to values below <span class="math inline">\(Q_1 - (1.5 * IQR)\)</span> or above <span class="math inline">\(Q_3 + (1.5 * IQR)\)</span>.</p>
+<p><em>Outliers</em> correspond to values in the dataset that fall far outside the range of “ordinary” values. In the context of a boxplot, by default they correspond to values below <span class="math inline">\(Q_1 - (1.5 \cdot IQR)\)</span> or above <span class="math inline">\(Q_3 + (1.5 \cdot IQR)\)</span>.</p>
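The boxplot rule above can be sketched directly in R. The data in `x` are hypothetical, and the quartiles are taken from `fivenum()`, which is an assumption about how to compute them. Base R's `boxplot.stats()` applies the same default 1.5 multiplier.

```r
# Hypothetical data with one unusually large value
x <- c(1, 3, 4, 6, 8, 9, 12, 50)

# Quartiles taken from the five-number summary
five <- fivenum(x)
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1

# Boundaries beyond which a value is flagged as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values falling outside the boundaries
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# boxplot.stats() flags outliers using the same default rule
boxplot.stats(x)$out
```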
 </div>
 </div>
 <div id="appendix-normal-curve" class="section level2">
 <h2><span class="header-section-number">A.2</span> Normal distribution</h2>
-<p>Let’s discuss one particular kind of distribution: <em>normal distributions</em> . Such bell-shaped distributions are defined by two values: 1) the <em>mean</em> <span class="math inline">\(\mu\)</span> (“mu”) which locates the center of the distribution and 2) the <em>standard deviation</em> <span class="math inline">\(\sigma\)</span> (“sigma”) which determines the variation of the distribution. In Figure <a href="A-appendixA.html#fig:normal-curves">A.1</a>, we plot three normal distributions where:</p>
+<p>Let’s next discuss one particular kind of distribution: <em>normal distributions</em>. Such bell-shaped distributions are defined by two values: (1) the <em>mean</em> <span class="math inline">\(\mu\)</span> (“mu”), which locates the center of the distribution, and (2) the <em>standard deviation</em> <span class="math inline">\(\sigma\)</span> (“sigma”), which determines the variation of the distribution. In Figure <a href="A-appendixA.html#fig:normal-curves">A.1</a>, we plot three normal distributions where:</p>
 <ol style="list-style-type: decimal">
-<li>The solid normal curve has mean <span class="math inline">\(\mu\)</span> = 5 and standard deviation <span class="math inline">\(\sigma\)</span> = 2.</li>
-<li>The dashed normal curve has mean <span class="math inline">\(\mu\)</span> = 5 and standard deviation <span class="math inline">\(\sigma\)</span> = 5.</li>
-<li>The dotted normal curve has mean <span class="math inline">\(\mu\)</span> = 15 and standard deviation <span class="math inline">\(\sigma\)</span> = 2.</li>
+<li>The solid normal curve has mean <span class="math inline">\(\mu = 5\)</span> and standard deviation <span class="math inline">\(\sigma = 2\)</span>.</li>
+<li>The dotted normal curve has mean <span class="math inline">\(\mu = 5\)</span> and standard deviation <span class="math inline">\(\sigma = 5\)</span>.</li>
+<li>The dashed normal curve has mean <span class="math inline">\(\mu = 15\)</span> and standard deviation <span class="math inline">\(\sigma = 2\)</span>.</li>
 </ol>
 <div class="figure" style="text-align: center"><span id="fig:normal-curves"></span>
-<img src="moderndive_files/figure-html/normal-curves-1.png" alt="Three normal distributions." width="80%" />
+<img src="ModernDive_files/figure-html/normal-curves-1.png" alt="Three normal distributions." width="90%" />
 <p class="caption">
 FIGURE A.1: Three normal distributions.
 </p>
 </div>
-<p>Notice how the solid and dashed line normal curves have the same center due to their common mean <span class="math inline">\(\mu\)</span> = 5. However the dashed line normal curve is wider due to its larger standard deviation of <span class="math inline">\(\sigma\)</span> = 5. On the other hand, the solid and dotted line normal curves have the same variation due to their common standard deviation <span class="math inline">\(\sigma\)</span> = 2. However, they are centered at different locations.</p>
-<p>When the mean <span class="math inline">\(\mu\)</span> = 0 and the standard deviation <span class="math inline">\(\sigma\)</span> = 1, the normal distribution has a special name: the <em>standard normal distribution</em> or the <em><span class="math inline">\(z\)</span>-curve</em>.</p>
+<p>Notice how the solid and dotted line normal curves have the same center due to their common mean <span class="math inline">\(\mu\)</span> = 5. However, the dotted line normal curve is wider due to its larger standard deviation of <span class="math inline">\(\sigma\)</span> = 5. On the other hand, the solid and dashed line normal curves have the same variation due to their common standard deviation <span class="math inline">\(\sigma\)</span> = 2. However, they are centered at different locations.</p>
+<p>When the mean <span class="math inline">\(\mu\)</span> = 0 and the standard deviation <span class="math inline">\(\sigma\)</span> = 1, the normal distribution has a special name. It’s called the <em>standard normal distribution</em> or the <em><span class="math inline">\(z\)</span>-curve</em>.</p>
 <p>Furthermore, if a variable follows a normal curve, there are <em>three rules of thumb</em> we can use:</p>
 <ol style="list-style-type: decimal">
 <li>68% of values will lie within <span class="math inline">\(\pm\)</span> 1 standard deviation of the mean.</li>
@@ -637,11 +650,12 @@ <h2><span class="header-section-number">A.2</span> Normal distribution</h2>
 <li>The middle six segments represent the interval -3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values.</li>
 </ol>
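The rules of thumb can be checked numerically with base R's `pnorm()`, which returns the area under the standard normal curve to the left of a given value. This is a small sketch on the standard normal distribution; the variable names are chosen only for illustration.

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Area under the standard normal curve between -k and k
area <- pnorm(k) - pnorm(-k)
names(area) <- paste0("within_", k, "_sd")

# Rounded to four decimal places for readability
round(area, 4)
```

The exact areas are slightly different from the rounded percentages in the rules of thumb, which is why they are called rules of thumb.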
 <div class="figure" style="text-align: center"><span id="fig:normal-rule-of-thumb"></span>
-<img src="moderndive_files/figure-html/normal-rule-of-thumb-1.png" alt="Rules of thumb about areas under normal curves" width="80%" />
+<img src="ModernDive_files/figure-html/normal-rule-of-thumb-1.png" alt="Rules of thumb about areas under normal curves." width="80%" />
 <p class="caption">
-FIGURE A.2: Rules of thumb about areas under normal curves
+FIGURE A.2: Rules of thumb about areas under normal curves.
 </p>
 </div>
+
 <div class="learncheck">
 <p>
 <strong><em>Learning check</em></strong>
@@ -650,23 +664,24 @@ <h2><span class="header-section-number">A.2</span> Normal distribution</h2>
 <!--
 Consider LC using this later on: <https://gallery.shinyapps.io/dist_calc/>
 -->
-<p>Say you have a normal distribution with mean <span class="math inline">\(\mu\)</span> = 6 and standard deviation <span class="math inline">\(\sigma\)</span> = 3.</p>
-<p><strong>(LC11.3)</strong> What proportion of the area under the normal curve is less than 3? Greater than 12? Between 0 and 12?</p>
-<p><strong>(LC11.4)</strong> What is the 2.5th percentile of the area under the normal curve? The 95th percentile? The 100th percentile?</p>
+<p>Say you have a normal distribution with mean <span class="math inline">\(\mu = 6\)</span> and standard deviation <span class="math inline">\(\sigma = 3\)</span>.</p>
+<p><strong>(LCA.1)</strong> What proportion of the area under the normal curve is less than 3? Greater than 12? Between 0 and 12?</p>
+<p><strong>(LCA.2)</strong> What is the 2.5th percentile of the area under the normal curve? The 95th percentile? The 100th percentile?</p>
 <div class="learncheck">
 
 </div>
+
 </div>
 <div id="appendix-log10-transformations" class="section level2">
 <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
-<p>At its simplest, log10 transformations return base 10 <em>logarithms</em>. For example, since <span class="math inline">\(1000 = 10^3\)</span>, running <code>log10(1000)</code> returns <code>3</code> in R. To undo a log10-transformation, we raise 10 to this value. For example, to undo the previous log10-transformation and return the original value of 1000, we raise 10 to this value to the power of 3 by running <code>10^(3) = 1000</code> in R. </p>
-<p>Log-transformations allow us to focus on changes in <em>orders of magnitude</em>. In other words, they allow us to focus on <em>multiplicative changes</em> instead of <em>additive ones</em>. Let’s illustrate this idea in Table <a href="A-appendixA.html#tab:logten">A.1</a> with examples of prices of consumer goods in US dollars.</p>
+<p>At its simplest, log10 transformations return base 10 <em>logarithms</em>. For example, since <span class="math inline">\(1000 = 10^3\)</span>, running <code>log10(1000)</code> returns <code>3</code> in R. To undo a log10 transformation, we raise 10 to this value. For example, to undo the previous log10 transformation and return the original value of 1000, we raise 10 to the power of 3 by running <code>10^(3)</code> in R, which returns <code>1000</code>. </p>
+<p>Log transformations allow us to focus on changes in <em>orders of magnitude</em>. In other words, they allow us to focus on <em>multiplicative changes</em> instead of <em>additive ones</em>. Let’s illustrate this idea in Table <a href="A-appendixA.html#tab:logten">A.1</a> with examples of prices of consumer goods in 2019 US dollars.</p>
 <!--
 We can also frame such changes as being relative percentage increases/decreases instead of absolute increases/decreases. 
 -->
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <caption style="font-size: initial !important;">
-<span id="tab:logten">TABLE A.1: </span>log10-transformed prices, orders of magnitude, and examples
+<span id="tab:logten">TABLE A.1: </span>log10 transformed prices, orders of magnitude, and examples
 </caption>
 <thead>
 <tr>
@@ -738,7 +753,7 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
 Thousands
 </td>
 <td style="text-align:left;">
-High definition TV’s
+High definition TVs
 </td>
 </tr>
 <tr>
@@ -766,7 +781,7 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
 Hundreds of thousands
 </td>
 <td style="text-align:left;">
-Luxury cars &amp; houses
+Luxury cars and houses
 </td>
 </tr>
 <tr>
@@ -785,12 +800,12 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
 </tr>
 </tbody>
 </table>
-<p>Let’s make some remarks about log10-transformations based on Table <a href="A-appendixA.html#tab:logten">A.1</a>:</p>
+<p>Let’s make some remarks about log10 transformations based on Table <a href="A-appendixA.html#tab:logten">A.1</a>:</p>
 <ol style="list-style-type: decimal">
-<li>When purchasing a cup of coffee, we tend to think of prices ranging in single dollars. Ex: $2 or $3. However when purchasing a mobile phone, we don’t tend to think of their prices in units of single dollars such as $313 or $727. Instead, we tend to think of their prices in units of hundreds of dollars. Ex: $300 or $700. Thus cups of coffee and mobile phones are of different <em>orders of magnitude</em> of price.</li>
-<li>Let’s say we want to know the log10-transformed value of $76. This would be hard to compute exactly without a calculator. However, since $76 is between $10 and $100 and since log10(10) = 1 and log10(100) = 2, we know log10(76) will be between 1 and 2. In fact, log10(76) is 1.880814.</li>
-<li>log10-transformations are <em>monotonic</em>, meaning they preserve orders. So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B).</li>
-<li>Most importantly, increments of one in log10-scale correspond to <em>relative multiplicative changes</em> in the original scale and not <em>absolute additive changes</em>. For example, increasing a log10(Price) from 3 to 4 corresponds to a multiplicative increase by a factor of x10: $100 to $1000.</li>
+<li>When purchasing a cup of coffee, we tend to think of prices in units of single dollars, such as $2 or $3. However, when purchasing a mobile phone, we don’t tend to think of its price in units of single dollars such as $313 or $727. Instead, we tend to think of its price in units of hundreds of dollars like $300 or $700. Thus, cups of coffee and mobile phones are of different <em>orders of magnitude</em> in price.</li>
+<li>Let’s say we want to know the log10 transformed value of $76. This would be hard to compute exactly without a calculator. However, since $76 is between $10 and $100 and since log10(10) = 1 and log10(100) = 2, we know log10(76) will be between 1 and 2. In fact, log10(76) is 1.880814.</li>
+<li>log10 transformations are <em>monotonic</em>, meaning they preserve orders. So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B).</li>
+<li>Most importantly, increments of one in log10-scale correspond to <em>relative multiplicative changes</em> in the original scale and not <em>absolute additive changes</em>. For example, increasing a log10(Price) from 2 to 3 corresponds to a multiplicative increase by a factor of 10: $100 to $1000.</li>
 </ol>
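The remarks above can be tried out in R with a short sketch. The vector `prices` is a hypothetical set of prices, one per order of magnitude, chosen only for illustration.

```r
# Hypothetical prices in US dollars, one per order of magnitude
prices <- c(1, 10, 100, 1000, 10000)

# log10 transformation and its inverse
log_prices <- log10(prices)
log_prices
back_transformed <- 10^log_prices
back_transformed

# log10(76) lies between log10(10) = 1 and log10(100) = 2
log10_76 <- log10(76)
log10_76

# Monotonic: sorting the prices and sorting the log10 prices give the same order
same_order <- identical(order(prices), order(log_prices))
same_order

# Each step of one on the log10 scale multiplies the original price by 10
log_steps <- diff(log_prices)
price_ratios <- prices[-1] / prices[-length(prices)]
log_steps
price_ratios
```

Each difference of one in `log_steps` lines up with a ratio of 10 in `price_ratios`, which is the multiplicative interpretation described in the last remark.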
 
 </div>
@@ -806,11 +821,13 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -818,12 +835,11 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -838,6 +854,10 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -854,8 +874,9 @@ <h2><span class="header-section-number">A.3</span> log10 transformations</h2>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/B-appendixB.html b/docs/B-appendixB.html
index ac3380c85..88237a3ab 100644
--- a/docs/B-appendixB.html
+++ b/docs/B-appendixB.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>B Inference Examples | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="B Inference Examples | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="B Inference Examples | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="A-appendixA.html">
-<link rel="next" href="C-appendixC.html">
+<link rel="prev" href="A-appendixA.html"/>
+<link rel="next" href="C-appendixC.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -575,19 +588,16 @@ <h1><span class="header-section-number">B</span> Inference Examples</h1>
 <p>
 <strong>Note: This appendix is still under construction. If you would like to contribute, please check us out on GitHub at <a href="https://github.com/moderndive/moderndive_book" class="uri">https://github.com/moderndive/moderndive_book</a>.</strong>
 </p>
-<p>
-<strong>Please check out our sneak peak of <code>infer</code> below in the meanwhile. For more details on <code>infer</code> visit <a href="https://infer.netlify.com/" class="uri">https://infer.netlify.com/</a></strong>.
-</p>
 </div>
 <div id="needed-packages-10" class="section level2 unnumbered">
 <h2>Needed packages</h2>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(infer)
-<span class="kw">library</span>(knitr)
-<span class="kw">library</span>(kableExtra)
-<span class="kw">library</span>(readr)
-<span class="kw">library</span>(janitor)</code></pre>
+<div class="sourceCode" id="cb466"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb466-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb466-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb466-3" data-line-number="3"><span class="kw">library</span>(infer)</a>
+<a class="sourceLine" id="cb466-4" data-line-number="4"><span class="kw">library</span>(knitr)</a>
+<a class="sourceLine" id="cb466-5" data-line-number="5"><span class="kw">library</span>(kableExtra)</a>
+<a class="sourceLine" id="cb466-6" data-line-number="6"><span class="kw">library</span>(readr)</a>
+<a class="sourceLine" id="cb466-7" data-line-number="7"><span class="kw">library</span>(janitor)</a></code></pre></div>
 </div>
 <div id="inference-mind-map" class="section level2">
 <h2><span class="header-section-number">B.1</span> Inference mind map</h2>
@@ -632,19 +642,19 @@ <h4>Set <span class="math inline">\(\alpha\)</span></h4>
 </div>
 <div id="exploring-the-sample-data" class="section level3">
 <h3><span class="header-section-number">B.2.3</span> Exploring the sample data</h3>
-<pre class="sourceCode r"><code class="sourceCode r">age_at_marriage &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/ageAtMar.csv&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">age_summ &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sample_size =</span> <span class="kw">n</span>(),
-    <span class="dt">mean =</span> <span class="kw">mean</span>(age),
-    <span class="dt">sd =</span> <span class="kw">sd</span>(age),
-    <span class="dt">minimum =</span> <span class="kw">min</span>(age),
-    <span class="dt">lower_quartile =</span> <span class="kw">quantile</span>(age, <span class="fl">0.25</span>),
-    <span class="dt">median =</span> <span class="kw">median</span>(age),
-    <span class="dt">upper_quartile =</span> <span class="kw">quantile</span>(age, <span class="fl">0.75</span>),
-    <span class="dt">max =</span> <span class="kw">max</span>(age))
-<span class="kw">kable</span>(age_summ) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">kable_styling</span>(<span class="dt">font_size =</span> <span class="kw">ifelse</span>(knitr<span class="op">:::</span><span class="kw">is_latex_output</span>(), <span class="dv">10</span>, <span class="dv">16</span>), 
-                <span class="dt">latex_options =</span> <span class="kw">c</span>(<span class="st">&quot;hold_position&quot;</span>))</code></pre>
+<div class="sourceCode" id="cb467"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb467-1" data-line-number="1">age_at_marriage &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/ageAtMar.csv&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb468"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb468-1" data-line-number="1">age_summ &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb468-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sample_size =</span> <span class="kw">n</span>(),</a>
+<a class="sourceLine" id="cb468-3" data-line-number="3">    <span class="dt">mean =</span> <span class="kw">mean</span>(age),</a>
+<a class="sourceLine" id="cb468-4" data-line-number="4">    <span class="dt">sd =</span> <span class="kw">sd</span>(age),</a>
+<a class="sourceLine" id="cb468-5" data-line-number="5">    <span class="dt">minimum =</span> <span class="kw">min</span>(age),</a>
+<a class="sourceLine" id="cb468-6" data-line-number="6">    <span class="dt">lower_quartile =</span> <span class="kw">quantile</span>(age, <span class="fl">0.25</span>),</a>
+<a class="sourceLine" id="cb468-7" data-line-number="7">    <span class="dt">median =</span> <span class="kw">median</span>(age),</a>
+<a class="sourceLine" id="cb468-8" data-line-number="8">    <span class="dt">upper_quartile =</span> <span class="kw">quantile</span>(age, <span class="fl">0.75</span>),</a>
+<a class="sourceLine" id="cb468-9" data-line-number="9">    <span class="dt">max =</span> <span class="kw">max</span>(age))</a>
+<a class="sourceLine" id="cb468-10" data-line-number="10"><span class="kw">kable</span>(age_summ) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb468-11" data-line-number="11"><span class="st">  </span><span class="kw">kable_styling</span>(<span class="dt">font_size =</span> <span class="kw">ifelse</span>(knitr<span class="op">:::</span><span class="kw">is_latex_output</span>(), <span class="dv">10</span>, <span class="dv">16</span>), </a>
+<a class="sourceLine" id="cb468-12" data-line-number="12">                <span class="dt">latex_options =</span> <span class="kw">c</span>(<span class="st">&quot;hold_position&quot;</span>))</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
@@ -704,14 +714,14 @@ <h3><span class="header-section-number">B.2.3</span> Exploring the sample data</
 </tbody>
 </table>
 <p>The histogram below also shows the distribution of <code>age</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> age_at_marriage, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> age)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">3</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/hist1b-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb469"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb469-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> age_at_marriage, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> age)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb469-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="dv">3</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/hist1b-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>The observed statistic of interest here is the sample mean:</p>
-<pre class="sourceCode r"><code class="sourceCode r">x_bar &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> age) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)
-x_bar</code></pre>
+<div class="sourceCode" id="cb470"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb470-1" data-line-number="1">x_bar &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb470-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> age) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb470-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a>
+<a class="sourceLine" id="cb470-4" data-line-number="4">x_bar</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
      stat
     &lt;dbl&gt;
@@ -733,23 +743,23 @@ <h4>Bootstrapping for hypothesis test</h4>
 <li>combine all of these bootstrap statistics calculated in Step 2 into a <code>boot_distn</code> object, and</li>
 <li>shift the center of this distribution over to the null value of 23. (This is needed since it will be centered at 23.44 via the process of bootstrapping.)</li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(<span class="dv">2018</span>)
-null_distn_one_mean &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> age) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">mu =</span> <span class="dv">23</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_one_mean <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-482-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb472"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb472-1" data-line-number="1"><span class="kw">set.seed</span>(<span class="dv">2018</span>)</a>
+<a class="sourceLine" id="cb472-2" data-line-number="2">null_distn_one_mean &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb472-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> age) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb472-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">mu =</span> <span class="dv">23</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb472-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb472-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb473"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb473-1" data-line-number="1">null_distn_one_mean <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-498-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We can next use this distribution to find our <span class="math inline">\(p\)</span>-value. Recall that this is a right-tailed test, so the <span class="math inline">\(p\)</span>-value is the proportion of simulated means that are greater than or equal to 23.44.</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_one_mean <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> x_bar, <span class="dt">direction =</span> <span class="st">&quot;greater&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-483-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb474"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb474-1" data-line-number="1">null_distn_one_mean <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb474-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> x_bar, <span class="dt">direction =</span> <span class="st">&quot;greater&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-499-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <div id="calculate-p-value" class="section level5 unnumbered">
 <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
-<pre class="sourceCode r"><code class="sourceCode r">pvalue &lt;-<span class="st"> </span>null_distn_one_mean <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> x_bar, <span class="dt">direction =</span> <span class="st">&quot;greater&quot;</span>)
-pvalue</code></pre>
+<div class="sourceCode" id="cb475"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb475-1" data-line-number="1">pvalue &lt;-<span class="st"> </span>null_distn_one_mean <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb475-2" data-line-number="2"><span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> x_bar, <span class="dt">direction =</span> <span class="st">&quot;greater&quot;</span>)</a>
+<a class="sourceLine" id="cb475-3" data-line-number="3">pvalue</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
@@ -760,22 +770,22 @@ <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
 <div id="bootstrapping-for-confidence-interval" class="section level4 unnumbered">
 <h4>Bootstrapping for confidence interval</h4>
 <p>We can also use <em>bootstrapping</em> on our sample data to create a confidence interval for the unknown population parameter <span class="math inline">\(\mu\)</span>. Note that we don’t need to shift this distribution since we want the center of our confidence interval to be our point estimate <span class="math inline">\(\bar{x}_{obs} = 23.44\)</span>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_one_mean &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> age) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">ci &lt;-<span class="st"> </span>boot_distn_one_mean <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_ci</span>()
-ci</code></pre>
+<div class="sourceCode" id="cb477"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb477-1" data-line-number="1">boot_distn_one_mean &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb477-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> age) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb477-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb477-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb478"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb478-1" data-line-number="1">ci &lt;-<span class="st"> </span>boot_distn_one_mean <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb478-2" data-line-number="2"><span class="st">  </span><span class="kw">get_ci</span>()</a>
+<a class="sourceLine" id="cb478-3" data-line-number="3">ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
    `2.5%` `97.5%`
     &lt;dbl&gt;   &lt;dbl&gt;
-1 23.3159 23.5651</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_one_mean <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-487-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+1 23.3148 23.5669</code></pre>
+<div class="sourceCode" id="cb480"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb480-1" data-line-number="1">boot_distn_one_mean <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb480-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-503-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We see that 23 is not contained in this confidence interval as a plausible value of <span class="math inline">\(\mu\)</span> (the unknown population mean) and the entire interval lies above 23. This matches our hypothesis test result of rejecting the null hypothesis in favor of the alternative (<span class="math inline">\(\mu &gt; 23\)</span>).</p>
-<p><strong>Interpretation</strong>: We are 95% confident the true mean age of first marriage for all US women from 2006 to 2010 is between 23.316 and 23.565.</p>
+<p><strong>Interpretation</strong>: We are 95% confident the true mean age of first marriage for all US women from 2006 to 2010 is between 23.315 and 23.567.</p>
 </div>
 </div>
 <div id="traditional-methods" class="section level3">
@@ -790,9 +800,9 @@ <h4>Check conditions</h4>
 <p>The histogram for the sample above does show some skew.</p></li>
 </ol>
 <p>The Q-Q plot below also shows some skew.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> age_at_marriage, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">sample =</span> age)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">stat_qq</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/qqplotmean-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb481"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb481-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> age_at_marriage, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">sample =</span> age)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb481-2" data-line-number="2"><span class="st">  </span><span class="kw">stat_qq</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/qqplotmean-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>The sample size here is quite large though (<span class="math inline">\(n = 5534\)</span>) so both conditions are met.</p>
 </div>
 <div id="test-statistic" class="section level4 unnumbered">
@@ -803,11 +813,11 @@ <h4>Test statistic</h4>
 <div id="observed-test-statistic" class="section level5 unnumbered">
 <h5>Observed test statistic</h5>
 <p>While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and on understanding which formula for the test statistic applies. We can use the <code>t_test()</code> function to perform this analysis for us.</p>
-<pre class="sourceCode r"><code class="sourceCode r">t_test_results &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span>infer<span class="op">::</span><span class="kw">t_test</span>(<span class="dt">formula =</span> age <span class="op">~</span><span class="st"> </span><span class="ot">NULL</span>,
-       <span class="dt">alternative =</span> <span class="st">&quot;greater&quot;</span>,
-       <span class="dt">mu =</span> <span class="dv">23</span>)
-t_test_results</code></pre>
+<div class="sourceCode" id="cb482"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb482-1" data-line-number="1">t_test_results &lt;-<span class="st"> </span>age_at_marriage <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb482-2" data-line-number="2"><span class="st">  </span>infer<span class="op">::</span><span class="kw">t_test</span>(<span class="dt">formula =</span> age <span class="op">~</span><span class="st"> </span><span class="ot">NULL</span>,</a>
+<a class="sourceLine" id="cb482-3" data-line-number="3">       <span class="dt">alternative =</span> <span class="st">&quot;greater&quot;</span>,</a>
+<a class="sourceLine" id="cb482-4" data-line-number="4">       <span class="dt">mu =</span> <span class="dv">23</span>)</a>
+<a class="sourceLine" id="cb482-5" data-line-number="5">t_test_results</a></code></pre></div>
 <pre><code># A tibble: 1 x 6
   statistic  t_df     p_value alternative lower_ci upper_ci
       &lt;dbl&gt; &lt;dbl&gt;       &lt;dbl&gt; &lt;chr&gt;          &lt;dbl&gt;    &lt;dbl&gt;
@@ -825,9 +835,9 @@ <h4>State conclusion</h4>
 </div>
 <div id="confidence-interval-1" class="section level4 unnumbered">
 <h4>Confidence interval</h4>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">t.test</span>(<span class="dt">x =</span> age_at_marriage<span class="op">$</span>age, 
-       <span class="dt">alternative =</span> <span class="st">&quot;two.sided&quot;</span>,
-       <span class="dt">mu =</span> <span class="dv">23</span>)<span class="op">$</span>conf</code></pre>
+<div class="sourceCode" id="cb484"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb484-1" data-line-number="1"><span class="kw">t.test</span>(<span class="dt">x =</span> age_at_marriage<span class="op">$</span>age, </a>
+<a class="sourceLine" id="cb484-2" data-line-number="2">       <span class="dt">alternative =</span> <span class="st">&quot;two.sided&quot;</span>,</a>
+<a class="sourceLine" id="cb484-3" data-line-number="3">       <span class="dt">mu =</span> <span class="dv">23</span>)<span class="op">$</span>conf</a></code></pre></div>
 <pre><code>[1] 23.3 23.6
 attr(,&quot;conf.level&quot;)
 [1] 0.95</code></pre>
@@ -867,18 +877,18 @@ <h4>Set <span class="math inline">\(\alpha\)</span></h4>
 </div>
 <div id="exploring-the-sample-data-1" class="section level3">
 <h3><span class="header-section-number">B.3.3</span> Exploring the sample data</h3>
-<pre class="sourceCode r"><code class="sourceCode r">elec &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="kw">rep</span>(<span class="st">&quot;satisfied&quot;</span>, <span class="dv">73</span>), <span class="kw">rep</span>(<span class="st">&quot;unsatisfied&quot;</span>, <span class="dv">27</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">as_data_frame</span>() <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">satisfy =</span> value)</code></pre>
+<div class="sourceCode" id="cb486"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb486-1" data-line-number="1">elec &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="kw">rep</span>(<span class="st">&quot;satisfied&quot;</span>, <span class="dv">73</span>), <span class="kw">rep</span>(<span class="st">&quot;unsatisfied&quot;</span>, <span class="dv">27</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb486-2" data-line-number="2"><span class="st">  </span><span class="kw">as_data_frame</span>() <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb486-3" data-line-number="3"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">satisfy =</span> value)</a></code></pre></div>
 <p>The bar graph below also shows the distribution of <code>satisfy</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> elec, <span class="kw">aes</span>(<span class="dt">x =</span> satisfy)) <span class="op">+</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">geom_bar</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/bar-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb487"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb487-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> elec, <span class="kw">aes</span>(<span class="dt">x =</span> satisfy)) <span class="op">+</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb487-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/bar-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>The observed statistic is computed as</p>
-<pre class="sourceCode r"><code class="sourceCode r">p_hat &lt;-<span class="st"> </span>elec <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)
-p_hat</code></pre>
+<div class="sourceCode" id="cb488"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb488-1" data-line-number="1">p_hat &lt;-<span class="st"> </span>elec <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb488-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb488-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</a>
+<a class="sourceLine" id="cb488-4" data-line-number="4">p_hat</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
    stat
   &lt;dbl&gt;
@@ -893,23 +903,23 @@ <h3><span class="header-section-number">B.3.4</span> Non-traditional methods</h3
 <div id="simulation-for-hypothesis-test" class="section level4 unnumbered">
 <h4>Simulation for hypothesis test</h4>
 <p>To see whether 0.73 is statistically different from 0.8, we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 100 was selected. We can use the idea of an unfair coin to <em>simulate</em> this process. We will simulate flipping an unfair coin (with probability of success 0.8 matching the null hypothesis) 100 times. Then we will keep track of how many heads come up in those 100 flips. Our simulated statistic matches how we calculated the original statistic <span class="math inline">\(\hat{p}\)</span>: the number of heads (satisfied) out of our total sample of 100. We then repeat this process many times (say 10,000) to create the null distribution of the simulated proportions of successes:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(<span class="dv">2018</span>)
-null_distn_one_prop &lt;-<span class="st"> </span>elec <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">p =</span> <span class="fl">0.8</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-490-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb490"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb490-1" data-line-number="1"><span class="kw">set.seed</span>(<span class="dv">2018</span>)</a>
+<a class="sourceLine" id="cb490-2" data-line-number="2">null_distn_one_prop &lt;-<span class="st"> </span>elec <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb490-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb490-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">p =</span> <span class="fl">0.8</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb490-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb490-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb491"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb491-1" data-line-number="1">null_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-506-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We can next use this distribution to find our <span class="math inline">\(p\)</span>-value. Recall this is a two-tailed test, so we will be looking for values that are at least 0.8 - 0.73 = 0.07 away from 0.8 in BOTH directions for our <span class="math inline">\(p\)</span>-value:</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> p_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-491-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb492"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb492-1" data-line-number="1">null_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb492-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> p_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-507-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
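+<p>As a rough illustration of the coin-flipping idea, the same null distribution can be sketched in base R. This is only a sketch under stated assumptions: the names <code>heads</code> and <code>p_two_sided</code> are hypothetical, <code>rbinom</code> stands in for the <code>generate</code> step, and the two-sided <span class="math inline">\(p\)</span>-value is taken as the share of simulated results at least as far from the null value as the observed one, which may differ in detail from what <code>get_pvalue</code> does. Working with counts of heads (80 expected, 73 observed) rather than proportions avoids floating-point surprises in the comparison.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(2018)
+# Each replicate: flip an unfair coin (P(heads) = 0.8) 100 times and count the heads
+heads &lt;- rbinom(n = 10000, size = 100, prob = 0.8)
+# Observed count is 73, i.e. 7 below the 80 expected under the null hypothesis
+# Two-sided p-value: share of replicates at least 7 heads away from 80 in either direction
+p_two_sided &lt;- mean(abs(heads - 80) &gt;= 7)
+p_two_sided</code></pre>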
 <div id="calculate-p-value-1" class="section level5 unnumbered">
 <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
-<pre class="sourceCode r"><code class="sourceCode r">pvalue &lt;-<span class="st"> </span>null_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> p_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)
-pvalue</code></pre>
+<div class="sourceCode" id="cb493"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb493-1" data-line-number="1">pvalue &lt;-<span class="st"> </span>null_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb493-2" data-line-number="2"><span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> p_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a>
+<a class="sourceLine" id="cb493-3" data-line-number="3">pvalue</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
@@ -924,24 +934,24 @@ <h4>Bootstrapping for confidence interval</h4>
 <li>sampling with replacement from our original sample of 100 survey respondents and repeating this process 10,000 times,</li>
 <li>calculating the proportion of successes for each of the 10,000 bootstrap samples created in Step 1,</li>
 <li>combining all of these bootstrap statistics calculated in Step 2 into a <code>boot_distn</code> object,</li>
-<li>identifying the 2.5<sup>th</sup> and 97.5<sup>th</sup> percentiles of this distribution (corresponding to the 5% significance level chosen) to find a 95% confidence interval for <span class="math inline">\(\pi\)</span>, and</li>
+<li>identifying the 2.5th and 97.5th percentiles of this distribution (corresponding to the 5% significance level chosen) to find a 95% confidence interval for <span class="math inline">\(\pi\)</span>, and</li>
 <li>interpreting this confidence interval in the context of the problem.</li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_one_prop &lt;-<span class="st"> </span>elec <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb495"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb495-1" data-line-number="1">boot_distn_one_prop &lt;-<span class="st"> </span>elec <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb495-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb495-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb495-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;prop&quot;</span>)</a></code></pre></div>
 <p>Just as we use the <code>mean</code> function for calculating the mean over a numerical variable, we can also use it to compute the proportion of successes for a categorical variable where we specify what we are calling a “success” after the <code>==</code>. (Think about the formula for calculating a mean and how R handles logical statements such as <code>satisfy == &quot;satisfied&quot;</code> for why this must be true.)</p>
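+<p>A small illustration of why this works, using a hypothetical toy vector (<code>satisfy_toy</code> is made up here and is not part of the <code>elec</code> data): the comparison returns a logical vector, and <code>mean</code> treats <code>TRUE</code> as 1 and <code>FALSE</code> as 0.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy responses standing in for a column such as elec$satisfy
+satisfy_toy &lt;- c(&quot;satisfied&quot;, &quot;unsatisfied&quot;, &quot;satisfied&quot;, &quot;satisfied&quot;)
+# The comparison yields a logical vector: TRUE FALSE TRUE TRUE
+satisfy_toy == &quot;satisfied&quot;
+# TRUE counts as 1 and FALSE as 0, so the mean is 3 / 4 = 0.75
+mean(satisfy_toy == &quot;satisfied&quot;)</code></pre>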
-<pre class="sourceCode r"><code class="sourceCode r">ci &lt;-<span class="st"> </span>boot_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_ci</span>()
-ci</code></pre>
+<div class="sourceCode" id="cb496"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb496-1" data-line-number="1">ci &lt;-<span class="st"> </span>boot_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb496-2" data-line-number="2"><span class="st">  </span><span class="kw">get_ci</span>()</a>
+<a class="sourceLine" id="cb496-3" data-line-number="3">ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
   `2.5%` `97.5%`
    &lt;dbl&gt;   &lt;dbl&gt;
 1   0.64    0.81</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-495-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb498"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb498-1" data-line-number="1">boot_distn_one_prop <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb498-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-511-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We see that 0.80 is contained in this confidence interval as a plausible value of <span class="math inline">\(\pi\)</span> (the unknown population proportion). This matches with our hypothesis test results of failing to reject the null hypothesis.</p>
 <p><strong>Interpretation</strong>: We are 95% confident the true proportion of customers who are satisfied with the service they receive is between 0.64 and 0.81.</p>
 </div>
@@ -965,31 +975,31 @@ <h4>Test statistic</h4>
 <div id="observed-test-statistic-1" class="section level5 unnumbered">
 <h5>Observed test statistic</h5>
 <p>While one could compute this observed test statistic by “hand” by plugging the observed values into the formula, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. The calculation has been done in R below for completeness though:</p>
-<pre class="sourceCode r"><code class="sourceCode r">p_hat &lt;-<span class="st"> </span><span class="fl">0.73</span>
-p0 &lt;-<span class="st"> </span><span class="fl">0.8</span>
-n &lt;-<span class="st"> </span><span class="dv">100</span>
-(z_obs &lt;-<span class="st"> </span>(p_hat <span class="op">-</span><span class="st"> </span>p0) <span class="op">/</span><span class="st"> </span><span class="kw">sqrt</span>( (p0 <span class="op">*</span><span class="st"> </span>(<span class="dv">1</span> <span class="op">-</span><span class="st"> </span>p0)) <span class="op">/</span><span class="st"> </span>n))</code></pre>
+<div class="sourceCode" id="cb499"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb499-1" data-line-number="1">p_hat &lt;-<span class="st"> </span><span class="fl">0.73</span></a>
+<a class="sourceLine" id="cb499-2" data-line-number="2">p0 &lt;-<span class="st"> </span><span class="fl">0.8</span></a>
+<a class="sourceLine" id="cb499-3" data-line-number="3">n &lt;-<span class="st"> </span><span class="dv">100</span></a>
+<a class="sourceLine" id="cb499-4" data-line-number="4">(z_obs &lt;-<span class="st"> </span>(p_hat <span class="op">-</span><span class="st"> </span>p0) <span class="op">/</span><span class="st"> </span><span class="kw">sqrt</span>( (p0 <span class="op">*</span><span class="st"> </span>(<span class="dv">1</span> <span class="op">-</span><span class="st"> </span>p0)) <span class="op">/</span><span class="st"> </span>n))</a></code></pre></div>
 <pre><code>[1] -1.75</code></pre>
 <p>We see here that the <span class="math inline">\(z_{obs}\)</span> value is around -1.75. Our observed sample proportion of 0.73 is 1.75 standard errors below the hypothesized parameter value of 0.8.</p>
 </div>
 </div>
 <div id="visualize-and-compute-p-value" class="section level4 unnumbered">
 <h4>Visualize and compute <span class="math inline">\(p\)</span>-value</h4>
-<pre class="sourceCode r"><code class="sourceCode r">elec <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">p =</span> <span class="fl">0.8</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;z&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">method =</span> <span class="st">&quot;theoretical&quot;</span>, <span class="dt">obs_stat =</span> z_obs, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/pvaloneprop-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pnorm</span>(z_obs)</code></pre>
+<div class="sourceCode" id="cb501"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb501-1" data-line-number="1">elec <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb501-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> satisfy, <span class="dt">success =</span> <span class="st">&quot;satisfied&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb501-3" data-line-number="3"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">p =</span> <span class="fl">0.8</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb501-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;z&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb501-5" data-line-number="5"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">method =</span> <span class="st">&quot;theoretical&quot;</span>, <span class="dt">obs_stat =</span> z_obs, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/pvaloneprop-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb502"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb502-1" data-line-number="1"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pnorm</span>(z_obs)</a></code></pre></div>
 <pre><code>[1] 0.0801</code></pre>
 <p>The <span class="math inline">\(p\)</span>-value, the probability of observing a <span class="math inline">\(z_{obs}\)</span> value of -1.75 or more extreme (in both directions) in our null distribution, is around 8%.</p>
 <p>Note that we could also do this test directly using the <code>prop.test</code> function.</p>
-<pre class="sourceCode r"><code class="sourceCode r">stats<span class="op">::</span><span class="kw">prop.test</span>(<span class="dt">x =</span> <span class="kw">table</span>(elec<span class="op">$</span>satisfy),
-       <span class="dt">n =</span> <span class="kw">length</span>(elec<span class="op">$</span>satisfy),
-       <span class="dt">alternative =</span> <span class="st">&quot;two.sided&quot;</span>,
-       <span class="dt">p =</span> <span class="fl">0.8</span>,
-       <span class="dt">correct =</span> <span class="ot">FALSE</span>)</code></pre>
+<div class="sourceCode" id="cb504"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb504-1" data-line-number="1">stats<span class="op">::</span><span class="kw">prop.test</span>(<span class="dt">x =</span> <span class="kw">table</span>(elec<span class="op">$</span>satisfy),</a>
+<a class="sourceLine" id="cb504-2" data-line-number="2">       <span class="dt">n =</span> <span class="kw">length</span>(elec<span class="op">$</span>satisfy),</a>
+<a class="sourceLine" id="cb504-3" data-line-number="3">       <span class="dt">alternative =</span> <span class="st">&quot;two.sided&quot;</span>,</a>
+<a class="sourceLine" id="cb504-4" data-line-number="4">       <span class="dt">p =</span> <span class="fl">0.8</span>,</a>
+<a class="sourceLine" id="cb504-5" data-line-number="5">       <span class="dt">correct =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
 <pre><code>
     1-sample proportions test without continuity correction
 
@@ -1055,19 +1065,19 @@ <h4>Set <span class="math inline">\(\alpha\)</span></h4>
 </div>
 <div id="exploring-the-sample-data-2" class="section level3">
 <h3><span class="header-section-number">B.4.3</span> Exploring the sample data</h3>
-<pre class="sourceCode r"><code class="sourceCode r">offshore &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/offshore.csv&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">offshore <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">tabyl</span>(college_grad, response)</code></pre>
+<div class="sourceCode" id="cb506"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb506-1" data-line-number="1">offshore &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/offshore.csv&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb507"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb507-1" data-line-number="1">offshore <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">tabyl</span>(college_grad, response)</a></code></pre></div>
 <pre><code> college_grad no opinion opinion
            no        131     258
           yes        104     334</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">off_summ &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(college_grad) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">prop_no_opinion =</span> <span class="kw">mean</span>(response <span class="op">==</span><span class="st"> &quot;no opinion&quot;</span>),
-    <span class="dt">sample_size =</span> <span class="kw">n</span>())</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(offshore, <span class="kw">aes</span>(<span class="dt">x =</span> college_grad, <span class="dt">fill =</span> response)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>(<span class="dt">position =</span> <span class="st">&quot;fill&quot;</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">coord_flip</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/stacked_bar-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb509"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb509-1" data-line-number="1">off_summ &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb509-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(college_grad) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb509-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">prop_no_opinion =</span> <span class="kw">mean</span>(response <span class="op">==</span><span class="st"> &quot;no opinion&quot;</span>),</a>
+<a class="sourceLine" id="cb509-4" data-line-number="4">    <span class="dt">sample_size =</span> <span class="kw">n</span>())</a></code></pre></div>
+<div class="sourceCode" id="cb510"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb510-1" data-line-number="1"><span class="kw">ggplot</span>(offshore, <span class="kw">aes</span>(<span class="dt">x =</span> college_grad, <span class="dt">fill =</span> response)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb510-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>(<span class="dt">position =</span> <span class="st">&quot;fill&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb510-3" data-line-number="3"><span class="st">  </span><span class="kw">coord_flip</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/stacked_bar-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <div id="guess-about-statistical-significance-2" class="section level4 unnumbered">
 <h4>Guess about statistical significance</h4>
 <p>We are looking to see if a difference exists in the size of the bars corresponding to <code>no opinion</code> for the plot. Based solely on the plot, we have little reason to believe that a difference exists since the bars seem to be about the same size, BUT…it’s important to use statistics to see if that difference is actually statistically significant!</p>
@@ -1078,10 +1088,10 @@ <h3><span class="header-section-number">B.4.4</span> Non-traditional methods</h3
 <div id="collecting-summary-info" class="section level4 unnumbered">
 <h4>Collecting summary info</h4>
 <p>The observed statistic is</p>
-<pre class="sourceCode r"><code class="sourceCode r">d_hat &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))
-d_hat</code></pre>
+<div class="sourceCode" id="cb511"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb511-1" data-line-number="1">d_hat &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb511-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb511-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))</a>
+<a class="sourceLine" id="cb511-4" data-line-number="4">d_hat</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
         stat
        &lt;dbl&gt;
@@ -1091,47 +1101,47 @@ <h4>Collecting summary info</h4>
 <h4>Randomization for hypothesis test</h4>
 <p>To see whether the observed sample proportion of no opinion for non-college graduates of 0.337 is statistically different from that for college graduates of 0.237, we need to account for the sample sizes. Note that this is the same as looking to see if <span class="math inline">\(\hat{p}_{grad} - \hat{p}_{nograd}\)</span> is statistically different from 0. We also need to determine a process that replicates how the original group sizes of 389 and 438 were selected.</p>
 <p>We can use the idea of <em>randomization testing</em> (also known as <em>permutation testing</em>) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using <em>shuffling</em> from that simulated population to account for sampling variability.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(<span class="dv">2018</span>)
-null_distn_two_props &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-500-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb513"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb513-1" data-line-number="1"><span class="kw">set.seed</span>(<span class="dv">2018</span>)</a>
+<a class="sourceLine" id="cb513-2" data-line-number="2">null_distn_two_props &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb513-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb513-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb513-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb513-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb514"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb514-1" data-line-number="1">null_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-516-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We can next use this distribution to find our <span class="math inline">\(p\)</span>-value. Recall this is a two-tailed test, so we will be looking for values that are less than or equal to -0.099 or greater than or equal to 0.099 for our <span class="math inline">\(p\)</span>-value.</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;two_sided&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-501-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb515"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb515-1" data-line-number="1">null_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb515-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;two_sided&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-517-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
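+<p>To make the shuffling idea concrete, here is a minimal base R sketch of the same randomization test. It rebuilds the two columns from the counts in the table above rather than reading the CSV, and the helper <code>diff_in_props</code> and the other object names are hypothetical, chosen for illustration only. The two-sided comparison uses absolute values, which may differ in detail from how <code>get_pvalue</code> computes its result.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(2018)
+# Rebuild the data from the table: 389 non-graduates, then 438 graduates
+college_grad &lt;- rep(c(&quot;no&quot;, &quot;yes&quot;), times = c(389, 438))
+response &lt;- c(rep(c(&quot;no opinion&quot;, &quot;opinion&quot;), times = c(131, 258)),
+              rep(c(&quot;no opinion&quot;, &quot;opinion&quot;), times = c(104, 334)))
+# Difference in proportions of &quot;no opinion&quot;, ordered as yes minus no
+diff_in_props &lt;- function(resp, grp) {
+  mean(resp[grp == &quot;yes&quot;] == &quot;no opinion&quot;) - mean(resp[grp == &quot;no&quot;] == &quot;no opinion&quot;)
+}
+obs &lt;- diff_in_props(response, college_grad)
+# Shuffle the group labels to break any association, keeping group sizes fixed
+null_diffs &lt;- replicate(10000, diff_in_props(response, sample(college_grad)))
+# Share of shuffled differences at least as extreme as the observed one
+mean(abs(null_diffs) &gt;= abs(obs))</code></pre>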
 <div id="calculate-p-value-2" class="section level5 unnumbered">
 <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
-<pre class="sourceCode r"><code class="sourceCode r">pvalue &lt;-<span class="st"> </span>null_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;two_sided&quot;</span>)
-pvalue</code></pre>
+<div class="sourceCode" id="cb516"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb516-1" data-line-number="1">pvalue &lt;-<span class="st"> </span>null_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb516-2" data-line-number="2"><span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;two_sided&quot;</span>)</a>
+<a class="sourceLine" id="cb516-3" data-line-number="3">pvalue</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
-  p_value
-    &lt;dbl&gt;
-1   0.003</code></pre>
-<p>So our <span class="math inline">\(p\)</span>-value is 0.003 and we reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are far into the tails of the null distribution.</p>
+     p_value
+       &lt;dbl&gt;
+1 0.00240000</code></pre>
+<p>So our <span class="math inline">\(p\)</span>-value is 0.002 and we reject the null hypothesis at the 5% level. You can also see from the histogram above that we are far into the tails of the null distribution.</p>
 </div>
 </div>
 <div id="bootstrapping-for-confidence-interval-2" class="section level4 unnumbered">
 <h4>Bootstrapping for confidence interval</h4>
 <p>We can also create a confidence interval for the unknown population parameter <span class="math inline">\(\pi_{college} - \pi_{no\_college}\)</span> using our sample data with <em>bootstrapping</em>.</p>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_two_props &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">ci &lt;-<span class="st"> </span>boot_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_ci</span>()
-ci</code></pre>
+<div class="sourceCode" id="cb518"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb518-1" data-line-number="1">boot_distn_two_props &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb518-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb518-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb518-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in props&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb519"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb519-1" data-line-number="1">ci &lt;-<span class="st"> </span>boot_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb519-2" data-line-number="2"><span class="st">  </span><span class="kw">get_ci</span>()</a>
+<a class="sourceLine" id="cb519-3" data-line-number="3">ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
      `2.5%`    `97.5%`
       &lt;dbl&gt;      &lt;dbl&gt;
-1 -0.161207 -0.0378500</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-505-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+1 -0.160030 -0.0379112</code></pre>
+<div class="sourceCode" id="cb521"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb521-1" data-line-number="1">boot_distn_two_props <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb521-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-521-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We see that 0 is not contained in this confidence interval as a plausible value of <span class="math inline">\(\pi_{college} - \pi_{no\_college}\)</span> (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. Since zero is not a plausible value of the population parameter, we have evidence that the proportion of college graduates in California with no opinion on drilling is different than that of non-college graduates.</p>
 <p><strong>Interpretation</strong>: We are 95% confident the true proportion of college graduates with no opinion on offshore drilling in California is between 0.16 and 0.04 smaller than that of non-college graduates.</p>
 </div>
@@ -1160,17 +1170,17 @@ <h3><span class="header-section-number">B.4.7</span> Test statistic</h3>
 <div id="observed-test-statistic-2" class="section level4 unnumbered">
 <h4>Observed test statistic</h4>
 <p>While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the <code>prop.test</code> function to perform this analysis for us.</p>
-<pre class="sourceCode r"><code class="sourceCode r">z_hat &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;z&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))
-z_hat</code></pre>
+<div class="sourceCode" id="cb522"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb522-1" data-line-number="1">z_hat &lt;-<span class="st"> </span>offshore <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb522-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(response <span class="op">~</span><span class="st"> </span>college_grad, <span class="dt">success =</span> <span class="st">&quot;no opinion&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb522-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;z&quot;</span>, <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;yes&quot;</span>, <span class="st">&quot;no&quot;</span>))</a>
+<a class="sourceLine" id="cb522-4" data-line-number="4">z_hat</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
       stat
      &lt;dbl&gt;
 1 -3.16081</code></pre>
 <p>The observed difference in sample proportions is 3.16 standard deviations smaller than 0.</p>
 <p>The <span class="math inline">\(p\)</span>-value—the probability of observing a <span class="math inline">\(Z\)</span> value of -3.16 or more extreme in our null distribution—is 0.0016. This can also be calculated in R directly:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pnorm</span>(<span class="op">-</span><span class="fl">3.16</span>, <span class="dt">lower.tail =</span> <span class="ot">TRUE</span>)</code></pre>
+<div class="sourceCode" id="cb524"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb524-1" data-line-number="1"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pnorm</span>(<span class="op">-</span><span class="fl">3.16</span>, <span class="dt">lower.tail =</span> <span class="ot">TRUE</span>)</a></code></pre></div>
 <pre><code>[1] 0.00158</code></pre>
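+<p>As a worked version of the calculation above, write <span class="math inline">\(\Phi\)</span> for the standard normal cumulative distribution function (notation assumed here for illustration). Because the standard normal distribution is symmetric about 0, the two-sided <span class="math inline">\(p\)</span>-value is twice the lower-tail area:</p>
+<p><span class="math display">\[p\text{-value} = 2 \cdot P(Z \le -3.16) = 2 \cdot \Phi(-3.16) \approx 2 \times 0.00079 = 0.00158\]</span></p>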
 </div>
 </div>
@@ -1223,22 +1233,22 @@ <h4>Set <span class="math inline">\(\alpha\)</span></h4>
 </div>
 <div id="exploring-the-sample-data-3" class="section level3">
 <h3><span class="header-section-number">B.5.3</span> Exploring the sample data</h3>
-<pre class="sourceCode r"><code class="sourceCode r">cle_sac &lt;-<span class="st"> </span><span class="kw">read.delim</span>(<span class="st">&quot;https://moderndive.com/data/cleSac.txt&quot;</span>) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">rename</span>(<span class="dt">metro_area =</span> Metropolitan_area_Detailed,
-         <span class="dt">income =</span> Total_personal_income) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">na.omit</span>()</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">inc_summ &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">group_by</span>(metro_area) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sample_size =</span> <span class="kw">n</span>(),
-    <span class="dt">mean =</span> <span class="kw">mean</span>(income),
-    <span class="dt">sd =</span> <span class="kw">sd</span>(income),
-    <span class="dt">minimum =</span> <span class="kw">min</span>(income),
-    <span class="dt">lower_quartile =</span> <span class="kw">quantile</span>(income, <span class="fl">0.25</span>),
-    <span class="dt">median =</span> <span class="kw">median</span>(income),
-    <span class="dt">upper_quartile =</span> <span class="kw">quantile</span>(income, <span class="fl">0.75</span>),
-    <span class="dt">max =</span> <span class="kw">max</span>(income))
-<span class="kw">kable</span>(inc_summ) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">kable_styling</span>(<span class="dt">font_size =</span> <span class="kw">ifelse</span>(knitr<span class="op">:::</span><span class="kw">is_latex_output</span>(), <span class="dv">10</span>, <span class="dv">16</span>), 
-                <span class="dt">latex_options =</span> <span class="kw">c</span>(<span class="st">&quot;hold_position&quot;</span>))</code></pre>
+<div class="sourceCode" id="cb526"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb526-1" data-line-number="1">cle_sac &lt;-<span class="st"> </span><span class="kw">read.delim</span>(<span class="st">&quot;https://moderndive.com/data/cleSac.txt&quot;</span>) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb526-2" data-line-number="2"><span class="st">  </span><span class="kw">rename</span>(<span class="dt">metro_area =</span> Metropolitan_area_Detailed,</a>
+<a class="sourceLine" id="cb526-3" data-line-number="3">         <span class="dt">income =</span> Total_personal_income) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb526-4" data-line-number="4"><span class="st">  </span><span class="kw">na.omit</span>()</a></code></pre></div>
+<div class="sourceCode" id="cb527"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb527-1" data-line-number="1">inc_summ &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">group_by</span>(metro_area) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb527-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">sample_size =</span> <span class="kw">n</span>(),</a>
+<a class="sourceLine" id="cb527-3" data-line-number="3">    <span class="dt">mean =</span> <span class="kw">mean</span>(income),</a>
+<a class="sourceLine" id="cb527-4" data-line-number="4">    <span class="dt">sd =</span> <span class="kw">sd</span>(income),</a>
+<a class="sourceLine" id="cb527-5" data-line-number="5">    <span class="dt">minimum =</span> <span class="kw">min</span>(income),</a>
+<a class="sourceLine" id="cb527-6" data-line-number="6">    <span class="dt">lower_quartile =</span> <span class="kw">quantile</span>(income, <span class="fl">0.25</span>),</a>
+<a class="sourceLine" id="cb527-7" data-line-number="7">    <span class="dt">median =</span> <span class="kw">median</span>(income),</a>
+<a class="sourceLine" id="cb527-8" data-line-number="8">    <span class="dt">upper_quartile =</span> <span class="kw">quantile</span>(income, <span class="fl">0.75</span>),</a>
+<a class="sourceLine" id="cb527-9" data-line-number="9">    <span class="dt">max =</span> <span class="kw">max</span>(income))</a>
+<a class="sourceLine" id="cb527-10" data-line-number="10"><span class="kw">kable</span>(inc_summ) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb527-11" data-line-number="11"><span class="st">  </span><span class="kw">kable_styling</span>(<span class="dt">font_size =</span> <span class="kw">ifelse</span>(knitr<span class="op">:::</span><span class="kw">is_latex_output</span>(), <span class="dv">10</span>, <span class="dv">16</span>), </a>
+<a class="sourceLine" id="cb527-12" data-line-number="12">                <span class="dt">latex_options =</span> <span class="kw">c</span>(<span class="st">&quot;hold_position&quot;</span>))</a></code></pre></div>
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
@@ -1333,10 +1343,10 @@ <h3><span class="header-section-number">B.5.3</span> Exploring the sample data</
 </tbody>
 </table>
 <p>The boxplot below also shows the mean for each group highlighted by the red dots.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(cle_sac, <span class="kw">aes</span>(<span class="dt">x =</span> metro_area, <span class="dt">y =</span> income)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">stat_summary</span>(<span class="dt">fun.y =</span> <span class="st">&quot;mean&quot;</span>, <span class="dt">geom =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;red&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/boxplot-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb528"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb528-1" data-line-number="1"><span class="kw">ggplot</span>(cle_sac, <span class="kw">aes</span>(<span class="dt">x =</span> metro_area, <span class="dt">y =</span> income)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb528-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb528-3" data-line-number="3"><span class="st">  </span><span class="kw">stat_summary</span>(<span class="dt">fun.y =</span> <span class="st">&quot;mean&quot;</span>, <span class="dt">geom =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">color =</span> <span class="st">&quot;red&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/boxplot-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <div id="guess-about-statistical-significance-3" class="section level4 unnumbered">
 <h4>Guess about statistical significance</h4>
 <p>We are looking to see if a difference exists in the mean income of the two levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that no difference exists. The distributions of income seem similar and the means fall in roughly the same place.</p>
@@ -1347,11 +1357,11 @@ <h3><span class="header-section-number">B.5.4</span> Non-traditional methods</h3
 <div id="collecting-summary-info-1" class="section level4 unnumbered">
 <h4>Collecting summary info</h4>
 <p>We now compute the observed statistic:</p>
-<pre class="sourceCode r"><code class="sourceCode r">d_hat &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, 
-            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Sacramento_ CA&quot;</span>, <span class="st">&quot;Cleveland_ OH&quot;</span>))
-d_hat</code></pre>
+<div class="sourceCode" id="cb529"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb529-1" data-line-number="1">d_hat &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb529-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb529-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>, </a>
+<a class="sourceLine" id="cb529-4" data-line-number="4">            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Sacramento_ CA&quot;</span>, <span class="st">&quot;Cleveland_ OH&quot;</span>))</a>
+<a class="sourceLine" id="cb529-5" data-line-number="5">d_hat</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
      stat
     &lt;dbl&gt;
@@ -1361,29 +1371,29 @@ <h4>Collecting summary info</h4>
 <h4>Randomization for hypothesis test</h4>
 <p>To see whether the observed sample mean for Sacramento of 32427.543 is statistically different from that for Cleveland of 27467.066, we need to account for the sample sizes. Note that this is the same as checking whether <span class="math inline">\(\bar{x}_{sac} - \bar{x}_{cle}\)</span> is statistically different from 0. We also need to determine a process that replicates how the original group sizes of 175 (Sacramento) and 212 (Cleveland) were selected.</p>
 <p>We can use the idea of <em>randomization testing</em> (also known as <em>permutation testing</em>) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using <em>shuffling</em> from that simulated population to account for sampling variability.</p>
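+<p>To make the shuffling idea concrete, here is a minimal base R sketch on made-up data. The <code>toy</code> data frame, the helper <code>diff_in_means</code>, and the choice of 1000 shuffles are assumptions made for illustration only; they are not part of the <code>cle_sac</code> analysis. The two-sided <span class="math inline">\(p\)</span>-value uses one common convention (the share of shuffled statistics at least as far from 0 as the observed one), which may differ slightly from what a given package reports.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(1)
+# Hypothetical toy data: two groups of unequal size
+toy &lt;- data.frame(
+  group = rep(c(&quot;A&quot;, &quot;B&quot;), times = c(8, 12)),
+  value = c(rnorm(8, mean = 10), rnorm(12, mean = 11))
+)
+# Difference in group means, A minus B
+diff_in_means &lt;- function(value, group) {
+  mean(value[group == &quot;A&quot;]) - mean(value[group == &quot;B&quot;])
+}
+obs &lt;- diff_in_means(toy$value, toy$group)
+# Shuffle the group labels; the group sizes (8 and 12) stay fixed
+perm &lt;- replicate(1000, diff_in_means(toy$value, sample(toy$group)))
+# Share of shuffled statistics at least as extreme as the observed one
+p_value &lt;- mean(abs(perm) &gt;= abs(obs))
+p_value</code></pre>
+<p>Shuffling only the labels breaks any association between group and response while keeping the two group sizes unchanged, which is what the null hypothesis of no difference assumes.</p>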
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(<span class="dv">2018</span>)
-null_distn_two_means &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>,
-            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Sacramento_ CA&quot;</span>, <span class="st">&quot;Cleveland_ OH&quot;</span>))</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-509-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb531"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb531-1" data-line-number="1"><span class="kw">set.seed</span>(<span class="dv">2018</span>)</a>
+<a class="sourceLine" id="cb531-2" data-line-number="2">null_distn_two_means &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb531-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb531-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;independence&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb531-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb531-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>,</a>
+<a class="sourceLine" id="cb531-7" data-line-number="7">            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Sacramento_ CA&quot;</span>, <span class="st">&quot;Cleveland_ OH&quot;</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb532"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb532-1" data-line-number="1">null_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-525-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We can next use this distribution to observe our <span class="math inline">\(p\)</span>-value. Recall this is a two-tailed test so we will be looking for values that are greater than or equal to 4960.477 or less than or equal to -4960.477 for our <span class="math inline">\(p\)</span>-value.</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-510-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb533"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb533-1" data-line-number="1">null_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb533-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-526-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <div id="calculate-p-value-3" class="section level5 unnumbered">
 <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
-<pre class="sourceCode r"><code class="sourceCode r">pvalue &lt;-<span class="st"> </span>null_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)
-pvalue</code></pre>
+<div class="sourceCode" id="cb534"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb534-1" data-line-number="1">pvalue &lt;-<span class="st"> </span>null_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb534-2" data-line-number="2"><span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>)</a>
+<a class="sourceLine" id="cb534-3" data-line-number="3">pvalue</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
-1  0.1298</code></pre>
-<p>So our <span class="math inline">\(p\)</span>-value is 0.13 and we fail to reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are not very far into the tail of the null distribution.</p>
+1  0.1262</code></pre>
+<p>So our <span class="math inline">\(p\)</span>-value is 0.126 and we fail to reject the null hypothesis at the 5% level. You can also see from the histogram above that we are not very far into the tail of the null distribution.</p>
 </div>
 </div>
 <div id="bootstrapping-for-confidence-interval-3" class="section level4 unnumbered">
@@ -1391,23 +1401,23 @@ <h4>Bootstrapping for confidence interval</h4>
 <p>We can also create a confidence interval for the unknown population parameter <span class="math inline">\(\mu_{sac} - \mu_{cle}\)</span> using our sample data with <em>bootstrapping</em>. Here we will bootstrap each of the groups with replacement instead of shuffling. This is done using the <code>groups</code>
 argument in the <code>resample</code> function to fix the size of each group to
 be the same as the original group sizes of 175 for Sacramento and 212 for Cleveland.</p>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_two_means &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>,
-            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Sacramento_ CA&quot;</span>, <span class="st">&quot;Cleveland_ OH&quot;</span>))</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">ci &lt;-<span class="st"> </span>boot_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_ci</span>()
-ci</code></pre>
+<div class="sourceCode" id="cb536"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb536-1" data-line-number="1">boot_distn_two_means &lt;-<span class="st"> </span>cle_sac <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb536-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb536-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb536-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;diff in means&quot;</span>,</a>
+<a class="sourceLine" id="cb536-5" data-line-number="5">            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Sacramento_ CA&quot;</span>, <span class="st">&quot;Cleveland_ OH&quot;</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb537"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb537-1" data-line-number="1">ci &lt;-<span class="st"> </span>boot_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb537-2" data-line-number="2"><span class="st">  </span><span class="kw">get_ci</span>()</a>
+<a class="sourceLine" id="cb537-3" data-line-number="3">ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
     `2.5%` `97.5%`
      &lt;dbl&gt;   &lt;dbl&gt;
-1 -1445.53 11307.8</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-514-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+1 -1359.50 11499.7</code></pre>
+<div class="sourceCode" id="cb539"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb539-1" data-line-number="1">boot_distn_two_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb539-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-530-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We see that 0 is contained in this confidence interval as a plausible value of <span class="math inline">\(\mu_{sac} - \mu_{cle}\)</span> (the unknown population parameter). This matches with our hypothesis test results of failing to reject the null hypothesis. Since zero is a plausible value of the population parameter, we do not have evidence that Sacramento incomes are different than Cleveland incomes.</p>
-<p><strong>Interpretation</strong>: We are 95% confident the true mean yearly income for those living in Sacramento is between 1445.53 dollars smaller to 11307.82 dollars higher than for Cleveland.</p>
+<p><strong>Interpretation</strong>: We are 95% confident the true mean yearly income for those living in Sacramento is between 1359.50 dollars lower and 11499.69 dollars higher than that for those living in Cleveland.</p>
 <p><strong>Note</strong>: You could also use the null distribution based on randomization with a shift to have its center at <span class="math inline">\(\bar{x}_{sac} - \bar{x}_{cle} = \$4960.48\)</span> instead of at 0 and calculate its percentiles. The confidence interval produced via this method should be comparable to the one done using bootstrapping above.</p>
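+<p>The shift described in the note above can be sketched in base R on made-up data. The vectors <code>value</code> and <code>group</code>, the helper <code>diff_in_means</code>, and the choice of 2000 shuffles are hypothetical and are used only to illustrate the idea, not to reproduce the <code>cle_sac</code> results.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(2)
+# Hypothetical toy data (not the cle_sac data)
+value &lt;- c(rnorm(15, mean = 10), rnorm(20, mean = 12))
+group &lt;- rep(c(&quot;A&quot;, &quot;B&quot;), times = c(15, 20))
+# Difference in group means, A minus B
+diff_in_means &lt;- function(v, g) mean(v[g == &quot;A&quot;]) - mean(v[g == &quot;B&quot;])
+obs &lt;- diff_in_means(value, group)
+# Null distribution from shuffling labels; it is centered near 0
+null_stats &lt;- replicate(2000, diff_in_means(value, sample(group)))
+# Shift the center to the observed statistic, then take the percentiles
+shifted_ci &lt;- quantile(null_stats + obs, probs = c(0.025, 0.975))
+shifted_ci</code></pre>
+<p>Adding <code>obs</code> to every shuffled statistic moves the distribution without changing its spread, so the 2.5th and 97.5th percentiles of the shifted values give an interval around the observed difference.</p>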
 </div>
 </div>
@@ -1421,10 +1431,10 @@ <h5>Check conditions</h5>
 <p>This condition is met since the cases are randomly selected from each city.</p></li>
 <li><p><em>Approximately normal</em>: The distribution of the response for each group should be normal or the sample sizes should be at least 30.</p></li>
 </ol>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(cle_sac, <span class="kw">aes</span>(<span class="dt">x =</span> income)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">binwidth =</span> <span class="dv">20000</span>) <span class="op">+</span>
-<span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>metro_area)</code></pre>
-<p><img src="moderndive_files/figure-html/hist-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb540"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb540-1" data-line-number="1"><span class="kw">ggplot</span>(cle_sac, <span class="kw">aes</span>(<span class="dt">x =</span> income)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb540-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">color =</span> <span class="st">&quot;white&quot;</span>, <span class="dt">binwidth =</span> <span class="dv">20000</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb540-3" data-line-number="3"><span class="st">  </span><span class="kw">facet_wrap</span>(<span class="op">~</span><span class="st"> </span>metro_area)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/hist-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We have some reason to doubt the normality assumption here, since both histograms deviate noticeably from what a normal model would produce. The sample size for each group is greater than 100, though, so the condition should still be reasonably satisfied.</p>
 <ol start="3" style="list-style-type: decimal">
 <li><p><em>Independent samples</em>: The samples should be collected without any natural pairing.</p>
@@ -1439,10 +1449,10 @@ <h3><span class="header-section-number">B.5.6</span> Test statistic</h3>
 <div id="observed-test-statistic-3" class="section level4 unnumbered">
 <h4>Observed test statistic</h4>
 <p>Note that we could also (almost) carry out this test directly using the <code>t.test</code> function. Its <code>x</code> and <code>y</code> arguments are both expected to be numeric vectors, so we would need to filter our dataset appropriately.</p>
-<pre class="sourceCode r"><code class="sourceCode r">cle_sac <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;t&quot;</span>,
-            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Cleveland_ OH&quot;</span>, <span class="st">&quot;Sacramento_ CA&quot;</span>))</code></pre>
+<div class="sourceCode" id="cb541"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb541-1" data-line-number="1">cle_sac <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb541-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(income <span class="op">~</span><span class="st"> </span>metro_area) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb541-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;t&quot;</span>,</a>
+<a class="sourceLine" id="cb541-4" data-line-number="4">            <span class="dt">order =</span> <span class="kw">c</span>(<span class="st">&quot;Cleveland_ OH&quot;</span>, <span class="st">&quot;Sacramento_ CA&quot;</span>))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
       stat
      &lt;dbl&gt;
@@ -1461,10 +1471,10 @@ <h4>Observed test statistic</h4>
 <div id="compute-p-value-1" class="section level3">
 <h3><span class="header-section-number">B.5.7</span> Compute <span class="math inline">\(p\)</span>-value</h3>
 <p>The <span class="math inline">\(p\)</span>-value, the probability of observing a <span class="math inline">\(t_{174}\)</span> value of -1.501 or more extreme (in both directions) in our null distribution, is 0.13. This can also be calculated in R directly:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pt</span>(<span class="op">-</span><span class="fl">1.501</span>, <span class="dt">df =</span> <span class="kw">min</span>(<span class="dv">212</span> <span class="op">-</span><span class="st"> </span><span class="dv">1</span>, <span class="dv">175</span> <span class="op">-</span><span class="st"> </span><span class="dv">1</span>), <span class="dt">lower.tail =</span> <span class="ot">TRUE</span>)</code></pre>
+<div class="sourceCode" id="cb543"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb543-1" data-line-number="1"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pt</span>(<span class="op">-</span><span class="fl">1.501</span>, <span class="dt">df =</span> <span class="kw">min</span>(<span class="dv">212</span> <span class="op">-</span><span class="st"> </span><span class="dv">1</span>, <span class="dv">175</span> <span class="op">-</span><span class="st"> </span><span class="dv">1</span>), <span class="dt">lower.tail =</span> <span class="ot">TRUE</span>)</a></code></pre></div>
 <pre><code>[1] 0.135</code></pre>
 <p>We can also approximate by using the standard normal curve:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pnorm</span>(<span class="op">-</span><span class="fl">1.501</span>)</code></pre>
+<div class="sourceCode" id="cb545"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb545-1" data-line-number="1"><span class="dv">2</span> <span class="op">*</span><span class="st"> </span><span class="kw">pnorm</span>(<span class="op">-</span><span class="fl">1.501</span>)</a></code></pre></div>
 <pre><code>[1] 0.133</code></pre>
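+<p>As a rough sketch, the two calculations above can be wrapped in a small helper. The function name <code>two_sided_p</code> and the variables <code>n1</code> and <code>n2</code> are our own illustrative choices, not part of any package; the sample sizes are the ones appearing in the <code>pt()</code> call above.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Two-sided p-value for an observed t statistic (hypothetical helper)
+two_sided_p &lt;- function(t_obs, df) {
+  2 * pt(-abs(t_obs), df = df)
+}
+
+# Group sizes taken from the pt() call above
+n1 &lt;- 212
+n2 &lt;- 175
+
+# Conservative degrees of freedom: the smaller of the two n - 1 values
+df_conservative &lt;- min(n1 - 1, n2 - 1)
+
+p_t &lt;- two_sided_p(-1.501, df = df_conservative)  # t-based p-value
+p_z &lt;- 2 * pnorm(-abs(-1.501))                    # normal approximation
+round(c(t = p_t, normal = p_z), 3)</code></pre>
+<p>Using <code>-abs(t_obs)</code> means the helper gives the same answer whichever group is listed first. With 174 degrees of freedom the <span class="math inline">\(t\)</span> distribution is already very close to the standard normal, which is why the two <span class="math inline">\(p\)</span>-values nearly agree.</p>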
 <p>Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping.</p>
 </div>
@@ -1506,25 +1516,25 @@ <h4>Set <span class="math inline">\(\alpha\)</span></h4>
 </div>
 <div id="exploring-the-sample-data-4" class="section level3">
 <h3><span class="header-section-number">B.6.2</span> Exploring the sample data</h3>
-<pre class="sourceCode r"><code class="sourceCode r">zinc_tidy &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/zinc_tidy.csv&quot;</span>)</code></pre>
+<div class="sourceCode" id="cb547"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb547-1" data-line-number="1">zinc_tidy &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/zinc_tidy.csv&quot;</span>)</a></code></pre></div>
 <p>We want to look at the differences in <code>surface - bottom</code> for each location:</p>
-<pre class="sourceCode r"><code class="sourceCode r">zinc_diff &lt;-<span class="st"> </span>zinc_tidy <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(loc_id) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">pair_diff =</span> <span class="kw">diff</span>(concentration)) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">ungroup</span>()</code></pre>
+<div class="sourceCode" id="cb548"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb548-1" data-line-number="1">zinc_diff &lt;-<span class="st"> </span>zinc_tidy <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb548-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(loc_id) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb548-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">pair_diff =</span> <span class="kw">diff</span>(concentration)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb548-4" data-line-number="4"><span class="st">  </span><span class="kw">ungroup</span>()</a></code></pre></div>
 <p>Next we calculate the mean difference as our observed statistic:</p>
-<pre class="sourceCode r"><code class="sourceCode r">d_hat &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> pair_diff) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)
-d_hat</code></pre>
+<div class="sourceCode" id="cb549"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb549-1" data-line-number="1">d_hat &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb549-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> pair_diff) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb549-3" data-line-number="3"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a>
+<a class="sourceLine" id="cb549-4" data-line-number="4">d_hat</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
      stat
     &lt;dbl&gt;
 1 -0.0804</code></pre>
 <p>The histogram below also shows the distribution of <code>pair_diff</code>.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(zinc_diff, <span class="kw">aes</span>(<span class="dt">x =</span> pair_diff)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.04</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/hist1a-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb551"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb551-1" data-line-number="1"><span class="kw">ggplot</span>(zinc_diff, <span class="kw">aes</span>(<span class="dt">x =</span> pair_diff)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb551-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_histogram</span>(<span class="dt">binwidth =</span> <span class="fl">0.04</span>, <span class="dt">color =</span> <span class="st">&quot;white&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/hist1a-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <div id="guess-about-statistical-significance-4" class="section level4 unnumbered">
 <h4>Guess about statistical significance</h4>
 <p>We are looking to see if the sample paired mean difference of -0.08 is statistically less than 0. They seem to be quite close, but we have a small number of pairs here. Let’s guess that we will fail to reject the null hypothesis.</p>
@@ -1537,23 +1547,23 @@ <h4>Bootstrapping for hypothesis test</h4>
 <p>In order to see if the observed sample mean difference <span class="math inline">\(\bar{x}_{diff} = -0.0804\)</span> is statistically less than 0, we need to account for the number of pairs. We also need to determine a process that replicates how the paired data was selected in a way similar to how we calculated our original difference in sample means.</p>
 <p>Treating the differences as our data of interest, we next use the process of <strong>bootstrapping</strong> to build other simulated samples and then calculate the mean of the bootstrap samples. We hypothesize that the mean difference is zero.</p>
 <p>This process is similar to comparing the One Mean example seen above, but using the differences between the two groups as a single sample with a hypothesized mean difference of 0.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(<span class="dv">2018</span>)
-null_distn_paired_means &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> pair_diff) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">mu =</span> <span class="dv">0</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-518-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb552"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb552-1" data-line-number="1"><span class="kw">set.seed</span>(<span class="dv">2018</span>)</a>
+<a class="sourceLine" id="cb552-2" data-line-number="2">null_distn_paired_means &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb552-3" data-line-number="3"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> pair_diff) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb552-4" data-line-number="4"><span class="st">  </span><span class="kw">hypothesize</span>(<span class="dt">null =</span> <span class="st">&quot;point&quot;</span>, <span class="dt">mu =</span> <span class="dv">0</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb552-5" data-line-number="5"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb552-6" data-line-number="6"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb553"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb553-1" data-line-number="1">null_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">visualize</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-534-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
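+<p>To see what this kind of pipeline is doing conceptually, here is a minimal base R sketch. It assumes that a point null for a mean is simulated by shifting the sample so that its mean equals the hypothesized value and then resampling with replacement. The vector <code>pair_diff</code> below holds made-up values for illustration only; it is not the zinc data.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(2018)
+
+# Hypothetical paired differences (illustrative values, NOT the zinc data)
+pair_diff &lt;- c(-0.12, -0.05, -0.09, -0.02, -0.11,
+               -0.07, -0.10, -0.04, -0.13, -0.08)
+obs_mean &lt;- mean(pair_diff)
+
+# Shift the sample so its mean matches the null value mu = 0
+mu_0 &lt;- 0
+shifted &lt;- pair_diff - mean(pair_diff) + mu_0
+
+# Resample the shifted differences with replacement and record each mean
+null_means &lt;- replicate(10000, mean(sample(shifted, replace = TRUE)))
+
+# Left-tailed p-value: share of simulated means at or below the observed mean
+p_value &lt;- mean(null_means &lt;= obs_mean)
+p_value</code></pre>
+<p>The shift keeps the spread of the observed differences but centers them where the null hypothesis says they should be, so the simulated means show how much the sample mean would vary if the true mean difference were 0.</p>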
 <p>We can next use this distribution to find our <span class="math inline">\(p\)</span>-value. Recall this is a left-tailed test so we will be looking for values that are less than or equal to -0.0804 for our <span class="math inline">\(p\)</span>-value.</p>
-<pre class="sourceCode r"><code class="sourceCode r">null_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;less&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-519-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb554"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb554-1" data-line-number="1">null_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb554-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;less&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-535-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <div id="calculate-p-value-4" class="section level5 unnumbered">
 <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
-<pre class="sourceCode r"><code class="sourceCode r">pvalue &lt;-<span class="st"> </span>null_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;less&quot;</span>)
-pvalue</code></pre>
+<div class="sourceCode" id="cb555"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb555-1" data-line-number="1">pvalue &lt;-<span class="st"> </span>null_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb555-2" data-line-number="2"><span class="st">  </span><span class="kw">get_pvalue</span>(<span class="dt">obs_stat =</span> d_hat, <span class="dt">direction =</span> <span class="st">&quot;less&quot;</span>)</a>
+<a class="sourceLine" id="cb555-3" data-line-number="3">pvalue</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   p_value
     &lt;dbl&gt;
@@ -1565,20 +1575,20 @@ <h5>Calculate <span class="math inline">\(p\)</span>-value</h5>
 <h4>Bootstrapping for confidence interval</h4>
 <p>We can also create a confidence interval for the unknown population parameter <span class="math inline">\(\mu_{diff}\)</span> using our sample data (the calculated differences) with <em>bootstrapping</em>. This is similar to the bootstrapping done in a one sample mean case, except now our data is differences instead of raw numerical data.
 Note that this code is identical to the pipeline shown in the hypothesis test above except the <code>hypothesize()</code> function is not called.</p>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_paired_means &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> pair_diff) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">ci &lt;-<span class="st"> </span>boot_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">get_ci</span>()
-ci</code></pre>
+<div class="sourceCode" id="cb557"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb557-1" data-line-number="1">boot_distn_paired_means &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb557-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> pair_diff) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb557-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">10000</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb557-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;mean&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb558"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb558-1" data-line-number="1">ci &lt;-<span class="st"> </span>boot_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb558-2" data-line-number="2"><span class="st">  </span><span class="kw">get_ci</span>()</a>
+<a class="sourceLine" id="cb558-3" data-line-number="3">ci</a></code></pre></div>
 <pre><code># A tibble: 1 x 2
-     `2.5%` `97.5%`
-      &lt;dbl&gt;   &lt;dbl&gt;
-1 -0.112200 -0.0503</code></pre>
-<pre class="sourceCode r"><code class="sourceCode r">boot_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</code></pre>
-<p><img src="moderndive_files/figure-html/unnamed-chunk-523-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+     `2.5%`    `97.5%`
+      &lt;dbl&gt;      &lt;dbl&gt;
+1 -0.111600 -0.0501975</code></pre>
+<div class="sourceCode" id="cb560"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb560-1" data-line-number="1">boot_distn_paired_means <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb560-2" data-line-number="2"><span class="st">  </span><span class="kw">visualize</span>(<span class="dt">endpoints =</span> ci, <span class="dt">direction =</span> <span class="st">&quot;between&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-539-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p>We see that 0 is not contained in this confidence interval as a plausible value of <span class="math inline">\(\mu_{diff}\)</span> (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. Since zero is not a plausible value of the population parameter and since the entire confidence interval falls below zero, we have evidence that surface zinc concentration levels are lower, on average, than bottom level zinc concentrations.</p>
 <p><strong>Interpretation</strong>: We are 95% confident that the true mean zinc concentration on the surface is between 0.05 and 0.11 units lower than on the bottom.</p>
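+<p>For intuition, a percentile-based bootstrap interval of this kind can be sketched in base R as follows. We assume here that the interval endpoints are simply the 2.5th and 97.5th percentiles of the bootstrap means, as the column names in the output above suggest. The vector <code>pair_diff</code> again holds made-up values for illustration, not the zinc data.</p>
+<pre class="sourceCode r"><code class="sourceCode r">set.seed(2018)
+
+# Hypothetical paired differences (illustrative values, NOT the zinc data)
+pair_diff &lt;- c(-0.12, -0.05, -0.09, -0.02, -0.11,
+               -0.07, -0.10, -0.04, -0.13, -0.08)
+
+# Bootstrap: resample the differences with replacement, keep each mean
+boot_means &lt;- replicate(10000, mean(sample(pair_diff, replace = TRUE)))
+
+# Percentile method: middle 95% of the bootstrap means
+ci &lt;- quantile(boot_means, probs = c(0.025, 0.975))
+ci
+
+# Does the interval exclude 0?
+excludes_zero &lt;- ci[[1]] &gt; 0 || ci[[2]] &lt; 0
+excludes_zero</code></pre>
+<p>Unlike the hypothesis test, no shift is applied before resampling: the bootstrap distribution is centered at the observed mean difference, not at the null value.</p>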
 </div>
@@ -1603,11 +1613,11 @@ <h4>Test statistic</h4>
 <div id="observed-test-statistic-4" class="section level5 unnumbered">
 <h5>Observed test statistic</h5>
 <p>While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the <code>t_test</code> function on the differences to perform this analysis for us.</p>
-<pre class="sourceCode r"><code class="sourceCode r">t_test_results &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span>infer<span class="op">::</span><span class="kw">t_test</span>(<span class="dt">formula =</span> pair_diff <span class="op">~</span><span class="st"> </span><span class="ot">NULL</span>, 
-         <span class="dt">alternative =</span> <span class="st">&quot;less&quot;</span>,
-         <span class="dt">mu =</span> <span class="dv">0</span>)
-t_test_results</code></pre>
+<div class="sourceCode" id="cb561"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb561-1" data-line-number="1">t_test_results &lt;-<span class="st"> </span>zinc_diff <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb561-2" data-line-number="2"><span class="st">  </span>infer<span class="op">::</span><span class="kw">t_test</span>(<span class="dt">formula =</span> pair_diff <span class="op">~</span><span class="st"> </span><span class="ot">NULL</span>, </a>
+<a class="sourceLine" id="cb561-3" data-line-number="3">         <span class="dt">alternative =</span> <span class="st">&quot;less&quot;</span>,</a>
+<a class="sourceLine" id="cb561-4" data-line-number="4">         <span class="dt">mu =</span> <span class="dv">0</span>)</a>
+<a class="sourceLine" id="cb561-5" data-line-number="5">t_test_results</a></code></pre></div>
 <pre><code># A tibble: 1 x 6
   statistic  t_df     p_value alternative lower_ci   upper_ci
       &lt;dbl&gt; &lt;dbl&gt;       &lt;dbl&gt; &lt;chr&gt;          &lt;dbl&gt;      &lt;dbl&gt;
@@ -1618,7 +1628,7 @@ <h5>Observed test statistic</h5>
 <div id="compute-p-value-2" class="section level4 unnumbered">
 <h4>Compute <span class="math inline">\(p\)</span>-value</h4>
 <p>The <span class="math inline">\(p\)</span>-value, the probability of observing a <span class="math inline">\(t_{obs}\)</span> value of -4.864 or less in our null distribution of a <span class="math inline">\(t\)</span> with 9 degrees of freedom, is approximately 0.0004 (essentially 0). This can also be calculated in R directly:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">pt</span>(<span class="op">-</span><span class="fl">4.8638</span>, <span class="dt">df =</span> <span class="kw">nrow</span>(zinc_diff) <span class="op">-</span><span class="st"> </span><span class="dv">1</span>, <span class="dt">lower.tail =</span> <span class="ot">TRUE</span>)</code></pre>
+<div class="sourceCode" id="cb563"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb563-1" data-line-number="1"><span class="kw">pt</span>(<span class="op">-</span><span class="fl">4.8638</span>, <span class="dt">df =</span> <span class="kw">nrow</span>(zinc_diff) <span class="op">-</span><span class="st"> </span><span class="dv">1</span>, <span class="dt">lower.tail =</span> <span class="ot">TRUE</span>)</a></code></pre></div>
 <pre><code>[1] 0.000446</code></pre>
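+<p>The "by hand" calculation of a one-sample <span class="math inline">\(t\)</span> statistic on the differences can be sketched in base R as below and checked against <code>stats::t.test()</code>. The vector <code>pair_diff</code> holds made-up values for illustration, not the zinc data, so the numbers will differ from those reported above.</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical paired differences (illustrative values, NOT the zinc data)
+pair_diff &lt;- c(-0.12, -0.05, -0.09, -0.02, -0.11,
+               -0.07, -0.10, -0.04, -0.13, -0.08)
+n &lt;- length(pair_diff)
+
+# t = (sample mean - hypothesized mean) / (sample sd / sqrt(n))
+t_obs &lt;- (mean(pair_diff) - 0) / (sd(pair_diff) / sqrt(n))
+
+# Left-tailed p-value from a t distribution with n - 1 degrees of freedom
+p_value &lt;- pt(t_obs, df = n - 1, lower.tail = TRUE)
+
+# Cross-check against the built-in one-sample t-test
+check &lt;- t.test(pair_diff, mu = 0, alternative = "less")
+all.equal(unname(check$statistic), t_obs)
+all.equal(check$p.value, p_value)</code></pre>
+<p>Both comparisons should return <code>TRUE</code>, confirming that a paired test is a one-sample <span class="math inline">\(t\)</span>-test applied to the differences.</p>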
 </div>
 <div id="state-conclusion-4" class="section level4 unnumbered">
@@ -1636,7 +1646,7 @@ <h3><span class="header-section-number">B.6.5</span> Comparing results</h3>
 <h3>References</h3>
 <div id="refs" class="references">
 <div id="ref-isrs2014">
-<p>Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2014. <em>Introductory Statistics with Randomization and Simulation</em>. First Edition. <a href="https://www.openintro.org/stat/textbook.php?stat_book=isrs">https://www.openintro.org/stat/textbook.php?stat_book=isrs</a>.</p>
+<p>Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2014. <em>Introductory Statistics with Randomization and Simulation</em>. First. Scotts Valley, CA: CreateSpace Independent Publishing Platform. <a href="https://www.openintro.org/stat/textbook.php?stat_book=isrs">https://www.openintro.org/stat/textbook.php?stat_book=isrs</a>.</p>
 </div>
 </div>
             </section>
@@ -1650,11 +1660,13 @@ <h3>References</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1662,12 +1674,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1682,6 +1693,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1698,8 +1713,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/C-appendixC.html b/docs/C-appendixC.html
index b5451bd34..d997e3977 100644
--- a/docs/C-appendixC.html
+++ b/docs/C-appendixC.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>C Reach for the Stars | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="C Reach for the Stars | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="C Reach for the Stars | Statistical Inference via Data Science" />
@@ -21,18 +21,18 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="B-appendixB.html">
-<link rel="next" href="D-appendixD.html">
+<link rel="prev" href="B-appendixB.html"/>
+<link rel="next" href="D-appendixD.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -572,34 +585,34 @@ <h1>
 <h1><span class="header-section-number">C</span> Reach for the Stars</h1>
 <div id="needed-packages-11" class="section level2 unnumbered">
 <h2>Needed packages</h2>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)
-<span class="kw">library</span>(ggplot2)
-<span class="kw">library</span>(knitr)
-<span class="kw">library</span>(dygraphs)
-<span class="kw">library</span>(nycflights13)</code></pre>
+<div class="sourceCode" id="cb565"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb565-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb565-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb565-3" data-line-number="3"><span class="kw">library</span>(knitr)</a>
+<a class="sourceLine" id="cb565-4" data-line-number="4"><span class="kw">library</span>(dygraphs)</a>
+<a class="sourceLine" id="cb565-5" data-line-number="5"><span class="kw">library</span>(nycflights13)</a></code></pre></div>
 </div>
 <div id="sorted-barplots" class="section level2">
 <h2><span class="header-section-number">C.1</span> Sorted barplots</h2>
 <p>Building upon the example in Section <a href="2-viz.html#geombar">2.8</a>:</p>
-<pre class="sourceCode r"><code class="sourceCode r">flights_table &lt;-<span class="st"> </span><span class="kw">table</span>(flights<span class="op">$</span>carrier)
-flights_table</code></pre>
+<div class="sourceCode" id="cb566"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb566-1" data-line-number="1">flights_table &lt;-<span class="st"> </span><span class="kw">table</span>(flights<span class="op">$</span>carrier)</a>
+<a class="sourceLine" id="cb566-2" data-line-number="2">flights_table</a></code></pre></div>
 <pre><code>
    9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO    UA    US 
 18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 20536 
    VX    WN    YV 
  5162 12275   601 </code></pre>
 <p>We can sort this table from highest to lowest counts by using the <code>sort</code> function:</p>
-<pre class="sourceCode r"><code class="sourceCode r">sorted_flights &lt;-<span class="st"> </span><span class="kw">sort</span>(flights_table, <span class="dt">decreasing =</span> <span class="ot">TRUE</span>)
-<span class="kw">names</span>(sorted_flights)</code></pre>
+<div class="sourceCode" id="cb568"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb568-1" data-line-number="1">sorted_flights &lt;-<span class="st"> </span><span class="kw">sort</span>(flights_table, <span class="dt">decreasing =</span> <span class="ot">TRUE</span>)</a>
+<a class="sourceLine" id="cb568-2" data-line-number="2"><span class="kw">names</span>(sorted_flights)</a></code></pre></div>
 <pre><code> [1] &quot;UA&quot; &quot;B6&quot; &quot;EV&quot; &quot;DL&quot; &quot;AA&quot; &quot;MQ&quot; &quot;US&quot; &quot;9E&quot; &quot;WN&quot; &quot;VX&quot; &quot;FL&quot; &quot;AS&quot; &quot;F9&quot; &quot;YV&quot; &quot;HA&quot;
 [16] &quot;OO&quot;</code></pre>
 <p>Barplots are often easier to read when the bars are ordered by height. This ordering lets the reader more easily compare airlines in terms of departed flights <span class="citation">(Robbins <a href="#ref-robbins2013">2013</a>)</span>. It also makes it much easier to answer questions like “How many airlines have more departing flights than Southwest Airlines?”</p>
 <p>We can use the names of the sorted table <code>sorted_flights</code>, which holds the number of flights for each carrier, to <strong>reorder</strong> the <code>carrier</code> categories on the x-axis.</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span>
-<span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span>
-<span class="st">  </span><span class="kw">scale_x_discrete</span>(<span class="dt">limits =</span> <span class="kw">names</span>(sorted_flights))</code></pre>
-<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-529"></span>
-<img src="moderndive_files/figure-html/unnamed-chunk-529-1.png" alt="Number of flights departing NYC in 2013 by airline - Descending numbers." width="\textwidth" />
+<div class="sourceCode" id="cb570"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb570-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> carrier)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb570-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_bar</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb570-3" data-line-number="3"><span class="st">  </span><span class="kw">scale_x_discrete</span>(<span class="dt">limits =</span> <span class="kw">names</span>(sorted_flights))</a></code></pre></div>
+<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-545"></span>
+<img src="ModernDive_files/figure-html/unnamed-chunk-545-1.png" alt="Number of flights departing NYC in 2013 by airline - Descending numbers." width="\textwidth" />
 <p class="caption">
 FIGURE C.1: Number of flights departing NYC in 2013 by airline - Descending numbers.
 </p>
@@ -611,16 +624,16 @@ <h2><span class="header-section-number">C.2</span> Interactive graphics</h2>
 <div id="interactive-linegraphs" class="section level3">
 <h3><span class="header-section-number">C.2.1</span> Interactive linegraphs</h3>
 <p>Another useful tool for viewing linegraphs such as this is the <code>dygraph</code> function in the <code>dygraphs</code> package, used in combination with the <code>dyRangeSelector</code> function. Together they let us zoom in on a selected range and work with the plot interactively:</p>
-<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dygraphs)
-flights_day &lt;-<span class="st"> </span><span class="kw">mutate</span>(flights, <span class="dt">date =</span> <span class="kw">as.Date</span>(time_hour))
-flights_summarized &lt;-<span class="st"> </span>flights_day <span class="op">%&gt;%</span><span class="st"> </span>
-<span class="st">  </span><span class="kw">group_by</span>(date) <span class="op">%&gt;%</span>
-<span class="st">  </span><span class="kw">summarize</span>(<span class="dt">median_arr_delay =</span> <span class="kw">median</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))
-<span class="kw">rownames</span>(flights_summarized) &lt;-<span class="st"> </span>flights_summarized<span class="op">$</span>date
-flights_summarized &lt;-<span class="st"> </span><span class="kw">select</span>(flights_summarized, <span class="op">-</span>date)
-<span class="kw">dyRangeSelector</span>(<span class="kw">dygraph</span>(flights_summarized))</code></pre>
+<div class="sourceCode" id="cb571"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb571-1" data-line-number="1"><span class="kw">library</span>(dygraphs)</a>
+<a class="sourceLine" id="cb571-2" data-line-number="2">flights_day &lt;-<span class="st"> </span><span class="kw">mutate</span>(flights, <span class="dt">date =</span> <span class="kw">as.Date</span>(time_hour))</a>
+<a class="sourceLine" id="cb571-3" data-line-number="3">flights_summarized &lt;-<span class="st"> </span>flights_day <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb571-4" data-line-number="4"><span class="st">  </span><span class="kw">group_by</span>(date) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb571-5" data-line-number="5"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">median_arr_delay =</span> <span class="kw">median</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a>
+<a class="sourceLine" id="cb571-6" data-line-number="6"><span class="kw">rownames</span>(flights_summarized) &lt;-<span class="st"> </span>flights_summarized<span class="op">$</span>date</a>
+<a class="sourceLine" id="cb571-7" data-line-number="7">flights_summarized &lt;-<span class="st"> </span><span class="kw">select</span>(flights_summarized, <span class="op">-</span>date)</a>
+<a class="sourceLine" id="cb571-8" data-line-number="8"><span class="kw">dyRangeSelector</span>(<span class="kw">dygraph</span>(flights_summarized))</a></code></pre></div>
 <div id="htmlwidget-856b09578dcc552385a2" style="width:100%;height:384px;" class="dygraphs html-widget"></div>
-<script type="application/json" data-for="htmlwidget-856b09578dcc552385a2">{"x":{"attrs":{"labels":["day","median_arr_delay"],"legend":"auto","retainDateWindow":false,"axes":{"x":{"pixelsPerLabel":60}},"showRangeSelector":true,"rangeSelectorHeight":40,"rangeSelectorPlotFillColor":" #A7B1C4","rangeSelectorPlotStrokeColor":"#808FAB","interactionModel":"Dygraph.Interaction.defaultModel"},"scale":"daily","annotations":[],"shadings":[],"events":[],"format":"date","data":[["2013-01-01T08:00:00.000Z","2013-01-02T08:00:00.000Z","2013-01-03T08:00:00.000Z","2013-01-04T08:00:00.000Z","2013-01-05T08:00:00.000Z","2013-01-06T08:00:00.000Z","2013-01-07T08:00:00.000Z","2013-01-08T08:00:00.000Z","2013-01-09T08:00:00.000Z","2013-01-10T08:00:00.000Z","2013-01-11T08:00:00.000Z","2013-01-12T08:00:00.000Z","2013-01-13T08:00:00.000Z","2013-01-14T08:00:00.000Z","2013-01-15T08:00:00.000Z","2013-01-16T08:00:00.000Z","2013-01-17T08:00:00.000Z","2013-01-18T08:00:00.000Z","2013-01-19T08:00:00.000Z","2013-01-20T08:00:00.000Z","2013-01-21T08:00:00.000Z","2013-01-22T08:00:00.000Z","2013-01-23T08:00:00.000Z","2013-01-24T08:00:00.000Z","2013-01-25T08:00:00.000Z","2013-01-26T08:00:00.000Z","2013-01-27T08:00:00.000Z","2013-01-28T08:00:00.000Z","2013-01-29T08:00:00.000Z","2013-01-30T08:00:00.000Z","2013-01-31T08:00:00.000Z","2013-02-01T08:00:00.000Z","2013-02-02T08:00:00.000Z","2013-02-03T08:00:00.000Z","2013-02-04T08:00:00.000Z","2013-02-05T08:00:00.000Z","2013-02-06T08:00:00.000Z","2013-02-07T08:00:00.000Z","2013-02-08T08:00:00.000Z","2013-02-09T08:00:00.000Z","2013-02-10T08:00:00.000Z","2013-02-11T08:00:00.000Z","2013-02-12T08:00:00.000Z","2013-02-13T08:00:00.000Z","2013-02-14T08:00:00.000Z","2013-02-15T08:00:00.000Z","2013-02-16T08:00:00.000Z","2013-02-17T08:00:00.000Z","2013-02-18T08:00:00.000Z","2013-02-19T08:00:00.000Z","2013-02-20T08:00:00.000Z","2013-02-21T08:00:00.000Z","2013-02-22T08:00:00.000Z","2013-02-23T08:00:00.000Z","2013-02-24T08:00:00.000Z","2013-02-25T08:00:00.000Z","2013-02-26T08:
00:00.000Z","2013-02-27T08:00:00.000Z","2013-02-28T08:00:00.000Z","2013-03-01T08:00:00.000Z","2013-03-02T08:00:00.000Z","2013-03-03T08:00:00.000Z","2013-03-04T08:00:00.000Z","2013-03-05T08:00:00.000Z","2013-03-06T08:00:00.000Z","2013-03-07T08:00:00.000Z","2013-03-08T08:00:00.000Z","2013-03-09T08:00:00.000Z","2013-03-10T08:00:00.000Z","2013-03-11T07:00:00.000Z","2013-03-12T07:00:00.000Z","2013-03-13T07:00:00.000Z","2013-03-14T07:00:00.000Z","2013-03-15T07:00:00.000Z","2013-03-16T07:00:00.000Z","2013-03-17T07:00:00.000Z","2013-03-18T07:00:00.000Z","2013-03-19T07:00:00.000Z","2013-03-20T07:00:00.000Z","2013-03-21T07:00:00.000Z","2013-03-22T07:00:00.000Z","2013-03-23T07:00:00.000Z","2013-03-24T07:00:00.000Z","2013-03-25T07:00:00.000Z","2013-03-26T07:00:00.000Z","2013-03-27T07:00:00.000Z","2013-03-28T07:00:00.000Z","2013-03-29T07:00:00.000Z","2013-03-30T07:00:00.000Z","2013-03-31T07:00:00.000Z","2013-04-01T07:00:00.000Z","2013-04-02T07:00:00.000Z","2013-04-03T07:00:00.000Z","2013-04-04T07:00:00.000Z","2013-04-05T07:00:00.000Z","2013-04-06T07:00:00.000Z","2013-04-07T07:00:00.000Z","2013-04-08T07:00:00.000Z","2013-04-09T07:00:00.000Z","2013-04-10T07:00:00.000Z","2013-04-11T07:00:00.000Z","2013-04-12T07:00:00.000Z","2013-04-13T07:00:00.000Z","2013-04-14T07:00:00.000Z","2013-04-15T07:00:00.000Z","2013-04-16T07:00:00.000Z","2013-04-17T07:00:00.000Z","2013-04-18T07:00:00.000Z","2013-04-19T07:00:00.000Z","2013-04-20T07:00:00.000Z","2013-04-21T07:00:00.000Z","2013-04-22T07:00:00.000Z","2013-04-23T07:00:00.000Z","2013-04-24T07:00:00.000Z","2013-04-25T07:00:00.000Z","2013-04-26T07:00:00.000Z","2013-04-27T07:00:00.000Z","2013-04-28T07:00:00.000Z","2013-04-29T07:00:00.000Z","2013-04-30T07:00:00.000Z","2013-05-01T07:00:00.000Z","2013-05-02T07:00:00.000Z","2013-05-03T07:00:00.000Z","2013-05-04T07:00:00.000Z","2013-05-05T07:00:00.000Z","2013-05-06T07:00:00.000Z","2013-05-07T07:00:00.000Z","2013-05-08T07:00:00.000Z","2013-05-09T07:00:00.000Z","2013-05-10T07:00:00.000Z","2013-05-11T07:00
:00.000Z","2013-05-12T07:00:00.000Z","2013-05-13T07:00:00.000Z","2013-05-14T07:00:00.000Z","2013-05-15T07:00:00.000Z","2013-05-16T07:00:00.000Z","2013-05-17T07:00:00.000Z","2013-05-18T07:00:00.000Z","2013-05-19T07:00:00.000Z","2013-05-20T07:00:00.000Z","2013-05-21T07:00:00.000Z","2013-05-22T07:00:00.000Z","2013-05-23T07:00:00.000Z","2013-05-24T07:00:00.000Z","2013-05-25T07:00:00.000Z","2013-05-26T07:00:00.000Z","2013-05-27T07:00:00.000Z","2013-05-28T07:00:00.000Z","2013-05-29T07:00:00.000Z","2013-05-30T07:00:00.000Z","2013-05-31T07:00:00.000Z","2013-06-01T07:00:00.000Z","2013-06-02T07:00:00.000Z","2013-06-03T07:00:00.000Z","2013-06-04T07:00:00.000Z","2013-06-05T07:00:00.000Z","2013-06-06T07:00:00.000Z","2013-06-07T07:00:00.000Z","2013-06-08T07:00:00.000Z","2013-06-09T07:00:00.000Z","2013-06-10T07:00:00.000Z","2013-06-11T07:00:00.000Z","2013-06-12T07:00:00.000Z","2013-06-13T07:00:00.000Z","2013-06-14T07:00:00.000Z","2013-06-15T07:00:00.000Z","2013-06-16T07:00:00.000Z","2013-06-17T07:00:00.000Z","2013-06-18T07:00:00.000Z","2013-06-19T07:00:00.000Z","2013-06-20T07:00:00.000Z","2013-06-21T07:00:00.000Z","2013-06-22T07:00:00.000Z","2013-06-23T07:00:00.000Z","2013-06-24T07:00:00.000Z","2013-06-25T07:00:00.000Z","2013-06-26T07:00:00.000Z","2013-06-27T07:00:00.000Z","2013-06-28T07:00:00.000Z","2013-06-29T07:00:00.000Z","2013-06-30T07:00:00.000Z","2013-07-01T07:00:00.000Z","2013-07-02T07:00:00.000Z","2013-07-03T07:00:00.000Z","2013-07-04T07:00:00.000Z","2013-07-05T07:00:00.000Z","2013-07-06T07:00:00.000Z","2013-07-07T07:00:00.000Z","2013-07-08T07:00:00.000Z","2013-07-09T07:00:00.000Z","2013-07-10T07:00:00.000Z","2013-07-11T07:00:00.000Z","2013-07-12T07:00:00.000Z","2013-07-13T07:00:00.000Z","2013-07-14T07:00:00.000Z","2013-07-15T07:00:00.000Z","2013-07-16T07:00:00.000Z","2013-07-17T07:00:00.000Z","2013-07-18T07:00:00.000Z","2013-07-19T07:00:00.000Z","2013-07-20T07:00:00.000Z","2013-07-21T07:00:00.000Z","2013-07-22T07:00:00.000Z","2013-07-23T07:00:00.000Z","2013-07-24T07:00:0
0.000Z","2013-07-25T07:00:00.000Z","2013-07-26T07:00:00.000Z","2013-07-27T07:00:00.000Z","2013-07-28T07:00:00.000Z","2013-07-29T07:00:00.000Z","2013-07-30T07:00:00.000Z","2013-07-31T07:00:00.000Z","2013-08-01T07:00:00.000Z","2013-08-02T07:00:00.000Z","2013-08-03T07:00:00.000Z","2013-08-04T07:00:00.000Z","2013-08-05T07:00:00.000Z","2013-08-06T07:00:00.000Z","2013-08-07T07:00:00.000Z","2013-08-08T07:00:00.000Z","2013-08-09T07:00:00.000Z","2013-08-10T07:00:00.000Z","2013-08-11T07:00:00.000Z","2013-08-12T07:00:00.000Z","2013-08-13T07:00:00.000Z","2013-08-14T07:00:00.000Z","2013-08-15T07:00:00.000Z","2013-08-16T07:00:00.000Z","2013-08-17T07:00:00.000Z","2013-08-18T07:00:00.000Z","2013-08-19T07:00:00.000Z","2013-08-20T07:00:00.000Z","2013-08-21T07:00:00.000Z","2013-08-22T07:00:00.000Z","2013-08-23T07:00:00.000Z","2013-08-24T07:00:00.000Z","2013-08-25T07:00:00.000Z","2013-08-26T07:00:00.000Z","2013-08-27T07:00:00.000Z","2013-08-28T07:00:00.000Z","2013-08-29T07:00:00.000Z","2013-08-30T07:00:00.000Z","2013-08-31T07:00:00.000Z","2013-09-01T07:00:00.000Z","2013-09-02T07:00:00.000Z","2013-09-03T07:00:00.000Z","2013-09-04T07:00:00.000Z","2013-09-05T07:00:00.000Z","2013-09-06T07:00:00.000Z","2013-09-07T07:00:00.000Z","2013-09-08T07:00:00.000Z","2013-09-09T07:00:00.000Z","2013-09-10T07:00:00.000Z","2013-09-11T07:00:00.000Z","2013-09-12T07:00:00.000Z","2013-09-13T07:00:00.000Z","2013-09-14T07:00:00.000Z","2013-09-15T07:00:00.000Z","2013-09-16T07:00:00.000Z","2013-09-17T07:00:00.000Z","2013-09-18T07:00:00.000Z","2013-09-19T07:00:00.000Z","2013-09-20T07:00:00.000Z","2013-09-21T07:00:00.000Z","2013-09-22T07:00:00.000Z","2013-09-23T07:00:00.000Z","2013-09-24T07:00:00.000Z","2013-09-25T07:00:00.000Z","2013-09-26T07:00:00.000Z","2013-09-27T07:00:00.000Z","2013-09-28T07:00:00.000Z","2013-09-29T07:00:00.000Z","2013-09-30T07:00:00.000Z","2013-10-01T07:00:00.000Z","2013-10-02T07:00:00.000Z","2013-10-03T07:00:00.000Z","2013-10-04T07:00:00.000Z","2013-10-05T07:00:00.000Z","2013-10-06T07:00:00.
000Z","2013-10-07T07:00:00.000Z","2013-10-08T07:00:00.000Z","2013-10-09T07:00:00.000Z","2013-10-10T07:00:00.000Z","2013-10-11T07:00:00.000Z","2013-10-12T07:00:00.000Z","2013-10-13T07:00:00.000Z","2013-10-14T07:00:00.000Z","2013-10-15T07:00:00.000Z","2013-10-16T07:00:00.000Z","2013-10-17T07:00:00.000Z","2013-10-18T07:00:00.000Z","2013-10-19T07:00:00.000Z","2013-10-20T07:00:00.000Z","2013-10-21T07:00:00.000Z","2013-10-22T07:00:00.000Z","2013-10-23T07:00:00.000Z","2013-10-24T07:00:00.000Z","2013-10-25T07:00:00.000Z","2013-10-26T07:00:00.000Z","2013-10-27T07:00:00.000Z","2013-10-28T07:00:00.000Z","2013-10-29T07:00:00.000Z","2013-10-30T07:00:00.000Z","2013-10-31T07:00:00.000Z","2013-11-01T07:00:00.000Z","2013-11-02T07:00:00.000Z","2013-11-03T07:00:00.000Z","2013-11-04T08:00:00.000Z","2013-11-05T08:00:00.000Z","2013-11-06T08:00:00.000Z","2013-11-07T08:00:00.000Z","2013-11-08T08:00:00.000Z","2013-11-09T08:00:00.000Z","2013-11-10T08:00:00.000Z","2013-11-11T08:00:00.000Z","2013-11-12T08:00:00.000Z","2013-11-13T08:00:00.000Z","2013-11-14T08:00:00.000Z","2013-11-15T08:00:00.000Z","2013-11-16T08:00:00.000Z","2013-11-17T08:00:00.000Z","2013-11-18T08:00:00.000Z","2013-11-19T08:00:00.000Z","2013-11-20T08:00:00.000Z","2013-11-21T08:00:00.000Z","2013-11-22T08:00:00.000Z","2013-11-23T08:00:00.000Z","2013-11-24T08:00:00.000Z","2013-11-25T08:00:00.000Z","2013-11-26T08:00:00.000Z","2013-11-27T08:00:00.000Z","2013-11-28T08:00:00.000Z","2013-11-29T08:00:00.000Z","2013-11-30T08:00:00.000Z","2013-12-01T08:00:00.000Z","2013-12-02T08:00:00.000Z","2013-12-03T08:00:00.000Z","2013-12-04T08:00:00.000Z","2013-12-05T08:00:00.000Z","2013-12-06T08:00:00.000Z","2013-12-07T08:00:00.000Z","2013-12-08T08:00:00.000Z","2013-12-09T08:00:00.000Z","2013-12-10T08:00:00.000Z","2013-12-11T08:00:00.000Z","2013-12-12T08:00:00.000Z","2013-12-13T08:00:00.000Z","2013-12-14T08:00:00.000Z","2013-12-15T08:00:00.000Z","2013-12-16T08:00:00.000Z","2013-12-17T08:00:00.000Z","2013-12-18T08:00:00.000Z","2013-12-19T08:00:00.00
0Z","2013-12-20T08:00:00.000Z","2013-12-21T08:00:00.000Z","2013-12-22T08:00:00.000Z","2013-12-23T08:00:00.000Z","2013-12-24T08:00:00.000Z","2013-12-25T08:00:00.000Z","2013-12-26T08:00:00.000Z","2013-12-27T08:00:00.000Z","2013-12-28T08:00:00.000Z","2013-12-29T08:00:00.000Z","2013-12-30T08:00:00.000Z","2013-12-31T08:00:00.000Z","2014-01-01T08:00:00.000Z"],[3,4,1,-7,-7,-2,-8,-8,-6,-11,-11,-14,-9,3,-3,16,1,-3,-12,-7,-3,2,-1,-1,3,-1,-9.5,-3,-12,-1,12,0,-9,-6,-3,1,-6,-5,10,-3,-5,7,-3,-6,-2,-3,-4,-12,-9.5,-3,-5,0,3,3,-8,-5,-5,11,-10,-8,-9,-13,-8,-10,-7.5,0,58,-9,-12,-7,3,-7,-7,-7,2.5,0,9,15,-3,-5,-6,3,-1,-1,-11,-11,-13,-14,-17,-10,0,-5,-4,-1,-2,-11,-9,-10,-9,-4,6,19,0,-3,-5,-8,-4,10,14,1,-3,19,13,4,23,11,-14,-11,-10,-13,-14,-13,-7,-15,-15,-12,-16,10,4,-3,2,-10,-12,-15,-12,-5,-6,-15,-3,-3,-6,5,30.5,10,-7,-14,-15,-9,-6,-10,-11,-16,-5,10,-5,-10,-6,5,-8,-9,3,-7,-5,30,4,-11,-9,3,13,4,-11,-11,-12,-8,14.5,15,5,8,14,0,11,44.5,1,2,-13.5,-15,-15,0,9,7,13,4,4,2,-16,-14,-6,-5.5,-6,-3,-1,-1,12,25,5,2,3,-7,7,3,-7,-8,11,2,-2,-4,-5,-8,-2,20,27,-1,-2,2,16,6,-2,-8,-12,-9,-9,-13,-14,10,-6,-16.5,-20,-18,-16,1,-6,-18,-15,-16,-2,-8,-18,-18,-19,-22,-16,-15,-15,-10,16,4,-17,-15,-7,-13,-16,-11,-13,-11,-8,-10,-12,-9,-9,-11,-19,-14,-15,-21,-16,-5,-10,-13,-10,5,-8,-11,-2,2,-13,-9,-7,-10,-7,-4,-2,-4,-5,-5,-4,-1,-4,-3,-5,-13,-7,-7,-9,-5,0,-4,-11,-1,-8,-8,-1,-3,-10,-9,-9,4,-12,-13,-8,-11,-2.5,-4,-6,-9,-5,-1,-1,-6,-8,-3,8,-5,-14,-17,-11.5,-7,-3,-7.5,12,9,1,10,29,35,7,0,-4,16,5.5,1,27,8,0,2,3,5,15,-1,-9,-2,-5,-8,-1,2,1,3]]},"evals":["attrs.interactionModel"],"jsHooks":[]}</script>
+<script type="application/json" data-for="htmlwidget-856b09578dcc552385a2">{"x":{"attrs":{"labels":["day","median_arr_delay"],"legend":"auto","retainDateWindow":false,"axes":{"x":{"pixelsPerLabel":60}},"showRangeSelector":true,"rangeSelectorHeight":40,"rangeSelectorPlotFillColor":" #A7B1C4","rangeSelectorPlotStrokeColor":"#808FAB","interactionModel":"Dygraph.Interaction.defaultModel"},"scale":"daily","annotations":[],"shadings":[],"events":[],"format":"date","data":[["2013-01-01T05:00:00.000Z","2013-01-02T05:00:00.000Z","2013-01-03T05:00:00.000Z","2013-01-04T05:00:00.000Z","2013-01-05T05:00:00.000Z","2013-01-06T05:00:00.000Z","2013-01-07T05:00:00.000Z","2013-01-08T05:00:00.000Z","2013-01-09T05:00:00.000Z","2013-01-10T05:00:00.000Z","2013-01-11T05:00:00.000Z","2013-01-12T05:00:00.000Z","2013-01-13T05:00:00.000Z","2013-01-14T05:00:00.000Z","2013-01-15T05:00:00.000Z","2013-01-16T05:00:00.000Z","2013-01-17T05:00:00.000Z","2013-01-18T05:00:00.000Z","2013-01-19T05:00:00.000Z","2013-01-20T05:00:00.000Z","2013-01-21T05:00:00.000Z","2013-01-22T05:00:00.000Z","2013-01-23T05:00:00.000Z","2013-01-24T05:00:00.000Z","2013-01-25T05:00:00.000Z","2013-01-26T05:00:00.000Z","2013-01-27T05:00:00.000Z","2013-01-28T05:00:00.000Z","2013-01-29T05:00:00.000Z","2013-01-30T05:00:00.000Z","2013-01-31T05:00:00.000Z","2013-02-01T05:00:00.000Z","2013-02-02T05:00:00.000Z","2013-02-03T05:00:00.000Z","2013-02-04T05:00:00.000Z","2013-02-05T05:00:00.000Z","2013-02-06T05:00:00.000Z","2013-02-07T05:00:00.000Z","2013-02-08T05:00:00.000Z","2013-02-09T05:00:00.000Z","2013-02-10T05:00:00.000Z","2013-02-11T05:00:00.000Z","2013-02-12T05:00:00.000Z","2013-02-13T05:00:00.000Z","2013-02-14T05:00:00.000Z","2013-02-15T05:00:00.000Z","2013-02-16T05:00:00.000Z","2013-02-17T05:00:00.000Z","2013-02-18T05:00:00.000Z","2013-02-19T05:00:00.000Z","2013-02-20T05:00:00.000Z","2013-02-21T05:00:00.000Z","2013-02-22T05:00:00.000Z","2013-02-23T05:00:00.000Z","2013-02-24T05:00:00.000Z","2013-02-25T05:00:00.000Z","2013-02-26T05:
00:00.000Z","2013-02-27T05:00:00.000Z","2013-02-28T05:00:00.000Z","2013-03-01T05:00:00.000Z","2013-03-02T05:00:00.000Z","2013-03-03T05:00:00.000Z","2013-03-04T05:00:00.000Z","2013-03-05T05:00:00.000Z","2013-03-06T05:00:00.000Z","2013-03-07T05:00:00.000Z","2013-03-08T05:00:00.000Z","2013-03-09T05:00:00.000Z","2013-03-10T05:00:00.000Z","2013-03-11T04:00:00.000Z","2013-03-12T04:00:00.000Z","2013-03-13T04:00:00.000Z","2013-03-14T04:00:00.000Z","2013-03-15T04:00:00.000Z","2013-03-16T04:00:00.000Z","2013-03-17T04:00:00.000Z","2013-03-18T04:00:00.000Z","2013-03-19T04:00:00.000Z","2013-03-20T04:00:00.000Z","2013-03-21T04:00:00.000Z","2013-03-22T04:00:00.000Z","2013-03-23T04:00:00.000Z","2013-03-24T04:00:00.000Z","2013-03-25T04:00:00.000Z","2013-03-26T04:00:00.000Z","2013-03-27T04:00:00.000Z","2013-03-28T04:00:00.000Z","2013-03-29T04:00:00.000Z","2013-03-30T04:00:00.000Z","2013-03-31T04:00:00.000Z","2013-04-01T04:00:00.000Z","2013-04-02T04:00:00.000Z","2013-04-03T04:00:00.000Z","2013-04-04T04:00:00.000Z","2013-04-05T04:00:00.000Z","2013-04-06T04:00:00.000Z","2013-04-07T04:00:00.000Z","2013-04-08T04:00:00.000Z","2013-04-09T04:00:00.000Z","2013-04-10T04:00:00.000Z","2013-04-11T04:00:00.000Z","2013-04-12T04:00:00.000Z","2013-04-13T04:00:00.000Z","2013-04-14T04:00:00.000Z","2013-04-15T04:00:00.000Z","2013-04-16T04:00:00.000Z","2013-04-17T04:00:00.000Z","2013-04-18T04:00:00.000Z","2013-04-19T04:00:00.000Z","2013-04-20T04:00:00.000Z","2013-04-21T04:00:00.000Z","2013-04-22T04:00:00.000Z","2013-04-23T04:00:00.000Z","2013-04-24T04:00:00.000Z","2013-04-25T04:00:00.000Z","2013-04-26T04:00:00.000Z","2013-04-27T04:00:00.000Z","2013-04-28T04:00:00.000Z","2013-04-29T04:00:00.000Z","2013-04-30T04:00:00.000Z","2013-05-01T04:00:00.000Z","2013-05-02T04:00:00.000Z","2013-05-03T04:00:00.000Z","2013-05-04T04:00:00.000Z","2013-05-05T04:00:00.000Z","2013-05-06T04:00:00.000Z","2013-05-07T04:00:00.000Z","2013-05-08T04:00:00.000Z","2013-05-09T04:00:00.000Z","2013-05-10T04:00:00.000Z","2013-05-11T04:00
:00.000Z","2013-05-12T04:00:00.000Z","2013-05-13T04:00:00.000Z","2013-05-14T04:00:00.000Z","2013-05-15T04:00:00.000Z","2013-05-16T04:00:00.000Z","2013-05-17T04:00:00.000Z","2013-05-18T04:00:00.000Z","2013-05-19T04:00:00.000Z","2013-05-20T04:00:00.000Z","2013-05-21T04:00:00.000Z","2013-05-22T04:00:00.000Z","2013-05-23T04:00:00.000Z","2013-05-24T04:00:00.000Z","2013-05-25T04:00:00.000Z","2013-05-26T04:00:00.000Z","2013-05-27T04:00:00.000Z","2013-05-28T04:00:00.000Z","2013-05-29T04:00:00.000Z","2013-05-30T04:00:00.000Z","2013-05-31T04:00:00.000Z","2013-06-01T04:00:00.000Z","2013-06-02T04:00:00.000Z","2013-06-03T04:00:00.000Z","2013-06-04T04:00:00.000Z","2013-06-05T04:00:00.000Z","2013-06-06T04:00:00.000Z","2013-06-07T04:00:00.000Z","2013-06-08T04:00:00.000Z","2013-06-09T04:00:00.000Z","2013-06-10T04:00:00.000Z","2013-06-11T04:00:00.000Z","2013-06-12T04:00:00.000Z","2013-06-13T04:00:00.000Z","2013-06-14T04:00:00.000Z","2013-06-15T04:00:00.000Z","2013-06-16T04:00:00.000Z","2013-06-17T04:00:00.000Z","2013-06-18T04:00:00.000Z","2013-06-19T04:00:00.000Z","2013-06-20T04:00:00.000Z","2013-06-21T04:00:00.000Z","2013-06-22T04:00:00.000Z","2013-06-23T04:00:00.000Z","2013-06-24T04:00:00.000Z","2013-06-25T04:00:00.000Z","2013-06-26T04:00:00.000Z","2013-06-27T04:00:00.000Z","2013-06-28T04:00:00.000Z","2013-06-29T04:00:00.000Z","2013-06-30T04:00:00.000Z","2013-07-01T04:00:00.000Z","2013-07-02T04:00:00.000Z","2013-07-03T04:00:00.000Z","2013-07-04T04:00:00.000Z","2013-07-05T04:00:00.000Z","2013-07-06T04:00:00.000Z","2013-07-07T04:00:00.000Z","2013-07-08T04:00:00.000Z","2013-07-09T04:00:00.000Z","2013-07-10T04:00:00.000Z","2013-07-11T04:00:00.000Z","2013-07-12T04:00:00.000Z","2013-07-13T04:00:00.000Z","2013-07-14T04:00:00.000Z","2013-07-15T04:00:00.000Z","2013-07-16T04:00:00.000Z","2013-07-17T04:00:00.000Z","2013-07-18T04:00:00.000Z","2013-07-19T04:00:00.000Z","2013-07-20T04:00:00.000Z","2013-07-21T04:00:00.000Z","2013-07-22T04:00:00.000Z","2013-07-23T04:00:00.000Z","2013-07-24T04:00:0
0.000Z","2013-07-25T04:00:00.000Z","2013-07-26T04:00:00.000Z","2013-07-27T04:00:00.000Z","2013-07-28T04:00:00.000Z","2013-07-29T04:00:00.000Z","2013-07-30T04:00:00.000Z","2013-07-31T04:00:00.000Z","2013-08-01T04:00:00.000Z","2013-08-02T04:00:00.000Z","2013-08-03T04:00:00.000Z","2013-08-04T04:00:00.000Z","2013-08-05T04:00:00.000Z","2013-08-06T04:00:00.000Z","2013-08-07T04:00:00.000Z","2013-08-08T04:00:00.000Z","2013-08-09T04:00:00.000Z","2013-08-10T04:00:00.000Z","2013-08-11T04:00:00.000Z","2013-08-12T04:00:00.000Z","2013-08-13T04:00:00.000Z","2013-08-14T04:00:00.000Z","2013-08-15T04:00:00.000Z","2013-08-16T04:00:00.000Z","2013-08-17T04:00:00.000Z","2013-08-18T04:00:00.000Z","2013-08-19T04:00:00.000Z","2013-08-20T04:00:00.000Z","2013-08-21T04:00:00.000Z","2013-08-22T04:00:00.000Z","2013-08-23T04:00:00.000Z","2013-08-24T04:00:00.000Z","2013-08-25T04:00:00.000Z","2013-08-26T04:00:00.000Z","2013-08-27T04:00:00.000Z","2013-08-28T04:00:00.000Z","2013-08-29T04:00:00.000Z","2013-08-30T04:00:00.000Z","2013-08-31T04:00:00.000Z","2013-09-01T04:00:00.000Z","2013-09-02T04:00:00.000Z","2013-09-03T04:00:00.000Z","2013-09-04T04:00:00.000Z","2013-09-05T04:00:00.000Z","2013-09-06T04:00:00.000Z","2013-09-07T04:00:00.000Z","2013-09-08T04:00:00.000Z","2013-09-09T04:00:00.000Z","2013-09-10T04:00:00.000Z","2013-09-11T04:00:00.000Z","2013-09-12T04:00:00.000Z","2013-09-13T04:00:00.000Z","2013-09-14T04:00:00.000Z","2013-09-15T04:00:00.000Z","2013-09-16T04:00:00.000Z","2013-09-17T04:00:00.000Z","2013-09-18T04:00:00.000Z","2013-09-19T04:00:00.000Z","2013-09-20T04:00:00.000Z","2013-09-21T04:00:00.000Z","2013-09-22T04:00:00.000Z","2013-09-23T04:00:00.000Z","2013-09-24T04:00:00.000Z","2013-09-25T04:00:00.000Z","2013-09-26T04:00:00.000Z","2013-09-27T04:00:00.000Z","2013-09-28T04:00:00.000Z","2013-09-29T04:00:00.000Z","2013-09-30T04:00:00.000Z","2013-10-01T04:00:00.000Z","2013-10-02T04:00:00.000Z","2013-10-03T04:00:00.000Z","2013-10-04T04:00:00.000Z","2013-10-05T04:00:00.000Z","2013-10-06T04:00:00.
000Z","2013-10-07T04:00:00.000Z","2013-10-08T04:00:00.000Z","2013-10-09T04:00:00.000Z","2013-10-10T04:00:00.000Z","2013-10-11T04:00:00.000Z","2013-10-12T04:00:00.000Z","2013-10-13T04:00:00.000Z","2013-10-14T04:00:00.000Z","2013-10-15T04:00:00.000Z","2013-10-16T04:00:00.000Z","2013-10-17T04:00:00.000Z","2013-10-18T04:00:00.000Z","2013-10-19T04:00:00.000Z","2013-10-20T04:00:00.000Z","2013-10-21T04:00:00.000Z","2013-10-22T04:00:00.000Z","2013-10-23T04:00:00.000Z","2013-10-24T04:00:00.000Z","2013-10-25T04:00:00.000Z","2013-10-26T04:00:00.000Z","2013-10-27T04:00:00.000Z","2013-10-28T04:00:00.000Z","2013-10-29T04:00:00.000Z","2013-10-30T04:00:00.000Z","2013-10-31T04:00:00.000Z","2013-11-01T04:00:00.000Z","2013-11-02T04:00:00.000Z","2013-11-03T04:00:00.000Z","2013-11-04T05:00:00.000Z","2013-11-05T05:00:00.000Z","2013-11-06T05:00:00.000Z","2013-11-07T05:00:00.000Z","2013-11-08T05:00:00.000Z","2013-11-09T05:00:00.000Z","2013-11-10T05:00:00.000Z","2013-11-11T05:00:00.000Z","2013-11-12T05:00:00.000Z","2013-11-13T05:00:00.000Z","2013-11-14T05:00:00.000Z","2013-11-15T05:00:00.000Z","2013-11-16T05:00:00.000Z","2013-11-17T05:00:00.000Z","2013-11-18T05:00:00.000Z","2013-11-19T05:00:00.000Z","2013-11-20T05:00:00.000Z","2013-11-21T05:00:00.000Z","2013-11-22T05:00:00.000Z","2013-11-23T05:00:00.000Z","2013-11-24T05:00:00.000Z","2013-11-25T05:00:00.000Z","2013-11-26T05:00:00.000Z","2013-11-27T05:00:00.000Z","2013-11-28T05:00:00.000Z","2013-11-29T05:00:00.000Z","2013-11-30T05:00:00.000Z","2013-12-01T05:00:00.000Z","2013-12-02T05:00:00.000Z","2013-12-03T05:00:00.000Z","2013-12-04T05:00:00.000Z","2013-12-05T05:00:00.000Z","2013-12-06T05:00:00.000Z","2013-12-07T05:00:00.000Z","2013-12-08T05:00:00.000Z","2013-12-09T05:00:00.000Z","2013-12-10T05:00:00.000Z","2013-12-11T05:00:00.000Z","2013-12-12T05:00:00.000Z","2013-12-13T05:00:00.000Z","2013-12-14T05:00:00.000Z","2013-12-15T05:00:00.000Z","2013-12-16T05:00:00.000Z","2013-12-17T05:00:00.000Z","2013-12-18T05:00:00.000Z","2013-12-19T05:00:00.00
0Z","2013-12-20T05:00:00.000Z","2013-12-21T05:00:00.000Z","2013-12-22T05:00:00.000Z","2013-12-23T05:00:00.000Z","2013-12-24T05:00:00.000Z","2013-12-25T05:00:00.000Z","2013-12-26T05:00:00.000Z","2013-12-27T05:00:00.000Z","2013-12-28T05:00:00.000Z","2013-12-29T05:00:00.000Z","2013-12-30T05:00:00.000Z","2013-12-31T05:00:00.000Z","2014-01-01T05:00:00.000Z"],[3,4,1,-7,-7,-2,-8,-8,-6,-11,-11,-14,-9,3,-3,16,1,-3,-12,-7,-3,2,-1,-1,3,-1,-9.5,-3,-12,-1,12,0,-9,-6,-3,1,-6,-5,10,-3,-5,7,-3,-6,-2,-3,-4,-12,-9.5,-3,-5,0,3,3,-8,-5,-5,11,-10,-8,-9,-13,-8,-10,-7.5,0,58,-9,-12,-7,3,-7,-7,-7,2.5,0,9,15,-3,-5,-6,3,-1,-1,-11,-11,-13,-14,-17,-10,0,-5,-4,-1,-2,-11,-9,-10,-9,-4,6,19,0,-3,-5,-8,-4,10,14,1,-3,19,13,4,23,11,-14,-11,-10,-13,-14,-13,-7,-15,-15,-12,-16,10,4,-3,2,-10,-12,-15,-12,-5,-6,-15,-3,-3,-6,5,30.5,10,-7,-14,-15,-9,-6,-10,-11,-16,-5,10,-5,-10,-6,5,-8,-9,3,-7,-5,30,4,-11,-9,3,13,4,-11,-11,-12,-8,14.5,15,5,8,14,0,11,44.5,1,2,-13.5,-15,-15,0,9,7,13,4,4,2,-16,-14,-6,-5.5,-6,-3,-1,-1,12,25,5,2,3,-7,7,3,-7,-8,11,2,-2,-4,-5,-8,-2,20,27,-1,-2,2,16,6,-2,-8,-12,-9,-9,-13,-14,10,-6,-16.5,-20,-18,-16,1,-6,-18,-15,-16,-2,-8,-18,-18,-19,-22,-16,-15,-15,-10,16,4,-17,-15,-7,-13,-16,-11,-13,-11,-8,-10,-12,-9,-9,-11,-19,-14,-15,-21,-16,-5,-10,-13,-10,5,-8,-11,-2,2,-13,-9,-7,-10,-7,-4,-2,-4,-5,-5,-4,-1,-4,-3,-5,-13,-7,-7,-9,-5,0,-4,-11,-1,-8,-8,-1,-3,-10,-9,-9,4,-12,-13,-8,-11,-2.5,-4,-6,-9,-5,-1,-1,-6,-8,-3,8,-5,-14,-17,-11.5,-7,-3,-7.5,12,9,1,10,29,35,7,0,-4,16,5.5,1,27,8,0,2,3,5,15,-1,-9,-2,-5,-8,-1,2,1,3]]},"evals":["attrs.interactionModel"],"jsHooks":[]}</script>
 <p><br></p>
 <p>The syntax here is a little different from what we have covered so far. The <code>dygraph</code> function expects the dates to be given as the <code>rownames</code> of the object. We then remove the <code>date</code> variable from the <code>flights_summarized</code> data frame since it is already accounted for in the <code>rownames</code>. Lastly, we run the <code>dygraph</code> function on the new data frame, which contains only the median arrival delay as a column, and add a selector for zooming in on the interactive plot via <code>dyRangeSelector</code>. (Note that this plot will only be interactive in the HTML version of this book.)</p>
 <!--
@@ -645,7 +658,7 @@ <h3><span class="header-section-number">C.2.1</span> Interactive linegraphs</h3>
 <h3>References</h3>
 <div id="refs" class="references">
 <div id="ref-robbins2013">
-<p>Robbins, Naomi. 2013. <em>Creating More Effective Graphs</em>. Chart House.</p>
+<p>Robbins, Naomi. 2013. <em>Creating More Effective Graphs</em>. First. New York, NY: Chart House.</p>
 </div>
 </div>
             </section>
@@ -659,11 +672,13 @@ <h3>References</h3>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -671,12 +686,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -691,6 +705,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -707,8 +725,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/D-appendixD.html b/docs/D-appendixD.html
index 515a489b7..28aa46601 100644
--- a/docs/D-appendixD.html
+++ b/docs/D-appendixD.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>D Learning Check Solutions | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.13 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="D Learning Check Solutions | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="D Learning Check Solutions | Statistical Inference via Data Science" />
@@ -24,7 +24,7 @@
 <meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-09-29" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -161,8 +164,8 @@
 <li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
 <li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
 <li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
 </ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
 <li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
@@ -182,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -190,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -259,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -277,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -302,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -323,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -339,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -351,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -370,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -402,7 +405,7 @@
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -458,17 +461,19 @@
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
 <li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Scripts of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -547,8 +552,14 @@
 <li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
 <li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
 <li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -576,15 +587,15 @@ <h1><span class="header-section-number">D</span> Learning Check Solutions</h1>
 reordering of the book. -->
 <div id="chapter-1-solutions" class="section level2">
 <h2><span class="header-section-number">D.1</span> Chapter 1 Solutions</h2>
-<div class="sourceCode" id="cb579"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb579-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
-<a class="sourceLine" id="cb579-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
-<a class="sourceLine" id="cb579-3" data-line-number="3"><span class="kw">library</span>(nycflights13)</a></code></pre></div>
+<div class="sourceCode" id="cb572"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb572-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb572-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb572-3" data-line-number="3"><span class="kw">library</span>(nycflights13)</a></code></pre></div>
 <p><strong>(LC1.1)</strong> Repeat the above installing steps, but for the <code>dplyr</code>, <code>nycflights13</code>, and <code>knitr</code> packages. This will install the earlier mentioned <code>dplyr</code> package, the <code>nycflights13</code> package containing data on all domestic flights leaving a NYC airport in 2013, and the <code>knitr</code> package for writing reports in R.</p>
 <p><strong>(LC1.2)</strong> “Load” the <code>dplyr</code>, <code>nycflights13</code>, and <code>knitr</code> packages as well by repeating the above steps.</p>
 <p><strong>Solution</strong>: If the following code runs with no errors, you’ve succeeded!</p>
-<div class="sourceCode" id="cb580"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb580-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
-<a class="sourceLine" id="cb580-2" data-line-number="2"><span class="kw">library</span>(nycflights13)</a>
-<a class="sourceLine" id="cb580-3" data-line-number="3"><span class="kw">library</span>(knitr)</a></code></pre></div>
+<div class="sourceCode" id="cb573"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb573-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb573-2" data-line-number="2"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb573-3" data-line-number="3"><span class="kw">library</span>(knitr)</a></code></pre></div>
 <p><strong>(LC1.3)</strong> What does any <em>ONE</em> row in this <code>flights</code> dataset refer to?</p>
 <ul>
 <li>A. Data on an airline</li>
@@ -624,9 +635,9 @@ <h2><span class="header-section-number">D.1</span> Chapter 1 Solutions</h2>
 </div>
 <div id="chapter-2-solutions" class="section level2">
 <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
-<div class="sourceCode" id="cb581"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb581-1" data-line-number="1"><span class="kw">library</span>(nycflights13)</a>
-<a class="sourceLine" id="cb581-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
-<a class="sourceLine" id="cb581-3" data-line-number="3"><span class="kw">library</span>(dplyr)</a></code></pre></div>
+<div class="sourceCode" id="cb574"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb574-1" data-line-number="1"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb574-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb574-3" data-line-number="3"><span class="kw">library</span>(dplyr)</a></code></pre></div>
 <p><strong>(LC2.1)</strong> Take a look at both the <code>flights</code> and <code>alaska_flights</code> data frames by running <code>View(flights)</code> and <code>View(alaska_flights)</code> in the console. In what respect do these data frames differ? For example, think about the number of rows in each dataset.</p>
 <p><strong>Solution</strong>: <code>flights</code> contains all flight data, while <code>alaska_flights</code> contains only data from Alaskan carrier “AS”. We can see that <code>flights</code> has 336776 rows while <code>alaska_flights</code> has only 714.</p>
 <p><strong>(LC2.2)</strong> What are some practical reasons why <code>dep_delay</code> and <code>arr_delay</code> have a positive relationship?</p>
@@ -639,6 +650,9 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 <p><strong>Solution</strong>: Different people will answer this one differently. One answer is most flights depart and arrive less than an hour late.</p>
 <p><strong>(LC2.6)</strong> Create a new scatterplot using different variables in the <code>alaska_flights</code> data frame by modifying the example above.</p>
 <p><strong>Solution</strong>: There are many possibilities for this one; see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seem to be only two blocks of time where flights depart.</p>
+<div class="sourceCode" id="cb575"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb575-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> alaska_flights, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> dep_time, <span class="dt">y =</span> dep_delay)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb575-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-553-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p><strong>(LC2.7)</strong> Why is setting the <code>alpha</code> argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?</p>
 <p><strong>Solution</strong>: It thins out the points so we address overplotting. But more importantly it hints at the (statistical) <strong>density</strong> and <strong>distribution</strong> of the points: where are the points concentrated, where do they occur.</p>
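+<p>As a minimal sketch of this idea, the following code builds the same scatterplot with and without transparency so the two can be compared. It assumes, as described in (LC2.1), that <code>alaska_flights</code> holds only the rows of <code>flights</code> with carrier “AS”; the object names <code>plot_opaque</code> and <code>plot_transparent</code> are ours, used only for illustration.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(ggplot2)
+library(nycflights13)
+
+# Assumed definition: keep only flights operated by carrier "AS"
+alaska_flights &lt;- flights %&gt;%
+  filter(carrier == "AS")
+
+# Opaque points (the default): overlapping points hide each other
+plot_opaque &lt;- ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
+  geom_point()
+
+# Semi-transparent points: darker regions mark where many points overlap
+plot_transparent &lt;- ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
+  geom_point(alpha = 0.2)
+
+# Print each plot in turn to compare them
+plot_opaque
+plot_transparent</code></pre></div>
+<p>In the second plot, the darkest regions show where the points are most concentrated, which the opaque version cannot reveal.</p>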
 <p><strong>(LC2.8)</strong> After viewing the Figure <a href="2-viz.html#fig:alpha">2.4</a> above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the <code>alpha = 0.2</code> set in Figure <a href="2-viz.html#fig:noalpha">2.2</a>?</p>
@@ -653,8 +667,11 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 <p><strong>Solution</strong>: Because time is sequential: subsequent observations are closely related to each other.</p>
 <p><strong>(LC2.13)</strong> Plot a time series of a variable other than <code>temp</code> for Newark Airport in the first 15 days of January 2013.</p>
 <p><strong>Solution</strong>: Humidity is a good one to look at, since it is very closely related to the cycles of a day.</p>
+<div class="sourceCode" id="cb576"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb576-1" data-line-number="1"><span class="kw">ggplot</span>(<span class="dt">data =</span> early_january_weather, <span class="dt">mapping =</span> <span class="kw">aes</span>(<span class="dt">x =</span> time_hour, <span class="dt">y =</span> humid)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb576-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_line</span>()</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-554-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
 <p><strong>(LC2.14)</strong> What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?</p>
-<p><strong>Solution</strong>: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the <code>temp</code> variabile by <code>View(weather)</code>, we see that the precision of each temperature recording is 2 decimal places.</p>
+<p><strong>Solution</strong>: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the <code>temp</code> variable by <code>View(weather)</code>, we see that the precision of each temperature recording is 2 decimal places.</p>
 <p><strong>(LC2.15)</strong> Would you classify the distribution of temperatures as symmetric or skewed?</p>
 <p><strong>Solution</strong>: It is rather symmetric, i.e. there are no <strong>long tails</strong> on only one side of the distribution.</p>
 <p><strong>(LC2.16)</strong> What would you guess is the “center” value in this distribution? Why did you make that choice?</p>
@@ -677,9 +694,9 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 <p><strong>Solution</strong>:</p>
 <ul>
 <li>Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.</li>
-<li>Because we see <code>temp</code> recordings split by <code>month</code>, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher.</li>
+<li>Because we see <code>temp</code> recordings split by <code>month</code>, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher.
+<strong>(LC2.19)</strong> What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?</li>
 </ul>
-<p><strong>(LC2.19)</strong> What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?</p>
 <p><strong>Solution</strong>:</p>
 <ul>
 <li>They correspond to the month of the flight. While month is technically a number between 1-12, we’re viewing it as a categorical variable here. Specifically, this is an <strong>ordinal categorical</strong> variable since there is an ordering to the categories.</li>
@@ -688,12 +705,20 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 <p><strong>(LC2.20)</strong> For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.</p>
 <p><strong>Solution</strong>:</p>
 <ul>
-<li>It would not work if we had a very large number of facets. For example, if we facetted by individual days rather than months, as we would have 365 facets to look at. When considering all days in 2013, it could be argued that we shouldn’t care about day-to-day fluctuation in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends.</li>
+<li>It would not work if we had a very large number of facets. For example, if we faceted by individual days rather than months, we would have 365 facets to look at. When considering all days in 2013, it could be argued that we shouldn’t care about day-to-day fluctuation in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends.</li>
 </ul>
 <p><strong>(LC2.21)</strong> Does the <code>temp</code> variable in the <code>weather</code> dataset have a lot of variability? Why do you say that?</p>
 <p><strong>Solution</strong>: Again, as in (LC2.17), this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain!</p>
 <p><strong>(LC2.22)</strong> What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.</p>
 <p><strong>Solution</strong>: It appears to be an outlier. Let’s revisit the use of the <code>filter</code> command to home in on it. We want all data points where the <code>month</code> is 5 and <code>temp &lt; 25</code>.</p>
+<div class="sourceCode" id="cb577"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb577-1" data-line-number="1">weather <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb577-2" data-line-number="2"><span class="st">  </span><span class="kw">filter</span>(month <span class="op">==</span><span class="st"> </span><span class="dv">5</span> <span class="op">&amp;</span><span class="st"> </span>temp <span class="op">&lt;</span><span class="st"> </span><span class="dv">25</span>)</a></code></pre></div>
+<pre><code># A tibble: 1 x 16
+  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
+  &lt;chr&gt;  &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt;      &lt;dbl&gt;     &lt;dbl&gt;
+1 JFK     2013     5     8    22  13.1 12.02 95.34       80    8.05546        NA
+# … with 5 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;,
+#   time_hour &lt;dttm&gt;, temp_in_C &lt;dbl&gt;</code></pre>
 <p>There appears to be only one hour and only at JFK that recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (LaGuardia)?</p>
 <p><strong>(LC2.23)</strong> Which months have the highest variability in temperature? What reasons do you think this is?</p>
 <p><strong>Solution</strong>: We are now interested in the <strong>spread</strong> of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):</p>
@@ -712,10 +737,120 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 <li>for each <code>group</code>, i.e. <code>month</code>, <code>summarize</code> it by applying the summary statistic function <code>IQR()</code>, while making sure to skip over missing data via <code>na.rm=TRUE</code> then</li>
 <li><code>arrange</code> the table in <code>desc</code>ending order of <code>IQR</code></li>
 </ol>
-<div class="sourceCode" id="cb582"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb582-1" data-line-number="1">weather <span class="op">%&gt;%</span></a>
-<a class="sourceLine" id="cb582-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span></a>
-<a class="sourceLine" id="cb582-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">IQR =</span> <span class="kw">IQR</span>(temp, <span class="dt">na.rm=</span><span class="ot">TRUE</span>)) <span class="op">%&gt;%</span></a>
-<a class="sourceLine" id="cb582-4" data-line-number="4"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(IQR))</a></code></pre></div>
+<div class="sourceCode" id="cb579"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb579-1" data-line-number="1">weather <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb579-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb579-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">IQR =</span> <span class="kw">IQR</span>(temp, <span class="dt">na.rm=</span><span class="ot">TRUE</span>)) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb579-4" data-line-number="4"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(IQR))</a></code></pre></div>
+<table>
+<thead>
+<tr>
+<th style="text-align:right;">
+month
+</th>
+<th style="text-align:right;">
+IQR
+</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:right;">
+11
+</td>
+<td style="text-align:right;">
+16.02
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+12
+</td>
+<td style="text-align:right;">
+14.04
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+1
+</td>
+<td style="text-align:right;">
+13.77
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+9
+</td>
+<td style="text-align:right;">
+12.06
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+4
+</td>
+<td style="text-align:right;">
+12.06
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+5
+</td>
+<td style="text-align:right;">
+11.88
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+6
+</td>
+<td style="text-align:right;">
+10.98
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+10
+</td>
+<td style="text-align:right;">
+10.98
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+2
+</td>
+<td style="text-align:right;">
+10.08
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+7
+</td>
+<td style="text-align:right;">
+9.18
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+3
+</td>
+<td style="text-align:right;">
+9.00
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+8
+</td>
+<td style="text-align:right;">
+7.02
+</td>
+</tr>
+</tbody>
+</table>
 <p><strong>(LC2.24)</strong> We looked at the distribution of the numerical variable <code>temp</code> split by the numerical variable <code>month</code> that we converted to a categorical variable using the <code>factor()</code> function. Why would a boxplot of <code>temp</code> split by the numerical variable <code>pressure</code> similarly converted to a categorical variable using the <code>factor()</code> not be informative?</p>
 <p><strong>Solution</strong>: Because there are 12 unique values of <code>month</code> yielding only 12 boxes in our boxplot. There are many more unique values of <code>pressure</code> (469 unique values in fact), because values are to the first decimal place. This would lead to 469 boxes, which is too many for people to digest.</p>
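+<p>One way to check such counts yourself is sketched below. It assumes the <code>weather</code> data frame from the <code>nycflights13</code> package and uses <code>n_distinct()</code> from <code>dplyr</code>; the column names <code>n_month</code> and <code>n_pressure</code> are ours, used only for illustration.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(nycflights13)
+
+# Count the unique values of each variable.
+# n_distinct() counts NA as its own value unless na.rm = TRUE is set.
+weather %&gt;%
+  summarize(
+    n_month = n_distinct(month),
+    n_pressure = n_distinct(pressure, na.rm = TRUE)
+  )</code></pre></div>
+<p>The larger the count, the more boxes a boxplot split by that variable would contain.</p>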
 <p><strong>(LC2.25)</strong> Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?</p>
@@ -731,13 +866,13 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 <p><strong>(LC2.30)</strong> Why should pie charts be avoided and replaced by barplots?</p>
 <p><strong>Solution</strong>: In our <strong>opinion</strong>, comparisons using horizontal lines are easier than comparing angles and areas of circles.</p>
 <p><strong>(LC2.31)</strong> What is your opinion as to why pie charts continue to be used?</p>
-<p><strong>Solution</strong>: (Only an opinion) Pie charts are generally considered as poorer at communicating data than bar charts. People’s brains are not as good at comparing the size of angles because there is no scale, and in comparison, it is much easier to compare the heights of bars in a bar charts. However, in some circumstances, for example, when reprensenting 25% and 75% of a sample size, if we have 2 bars, in which the higher one is three times in height of the other one, it is difficult to tell the scale of their comparison without labels. But in a bar chart, it would be easy to compare if a circle is devided by 75% and 25%. (Read more at: <a href="https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/" class="uri">https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/</a>)</p>
+<p><strong>Solution</strong>: In our <strong>opinion</strong>, pie charts are generally a poorer method for communicating data than bar charts. People’s brains are not as good at comparing the size of angles because there is no scale; in comparison, it is much easier to compare the heights of bars in a bar chart. However, in some circumstances a pie chart can be easier to read. For example, when representing 25% and 75% of a sample with 2 bars, where the taller bar is three times the height of the other, it is difficult to tell the scale of the comparison without labels. But in a pie chart, it is easy to see that a circle is divided into 75% and 25%. (Read more at: <a href="https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/" class="uri">https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/</a>)</p>
 <p><strong>(LC2.32)</strong> What kinds of questions are not easily answered by looking at the above figure?</p>
 <p><strong>Solution</strong>: Because the red, green, and blue bars don’t all start at 0 (only red does), it makes comparing counts hard.</p>
 <p><strong>(LC2.33)</strong> What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?</p>
 <p><strong>Solution</strong>: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.</p>
 <p><strong>(LC2.34)</strong> Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case?</p>
-<p><strong>Solution</strong>: We can easily compare the different aiports for a given carrier using a single comparison line i.e. things are lined up</p>
+<p><strong>Solution</strong>: We can easily compare the different airports for a given carrier using a single comparison line, i.e. things are lined up.</p>
 <p><strong>(LC2.35)</strong> What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general?</p>
 <p><strong>Solution</strong>: It is hard to get totals for each airline.</p>
 <p><strong>(LC2.36)</strong> Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?</p>
@@ -748,22 +883,22 @@ <h2><span class="header-section-number">D.2</span> Chapter 2 Solutions</h2>
 </div>
 <div id="chapter-3-solutions" class="section level2">
 <h2><span class="header-section-number">D.3</span> Chapter 3 Solutions</h2>
-<div class="sourceCode" id="cb583"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb583-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
-<a class="sourceLine" id="cb583-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
-<a class="sourceLine" id="cb583-3" data-line-number="3"><span class="kw">library</span>(nycflights13)</a></code></pre></div>
+<div class="sourceCode" id="cb580"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb580-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb580-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb580-3" data-line-number="3"><span class="kw">library</span>(nycflights13)</a></code></pre></div>
 <p><strong>(LC3.1)</strong> What’s another way using the “not” operator <code>!</code> to filter only the rows that are not going to Burlington, VT nor Seattle, WA in the <code>flights</code> data frame? Test this out using the code above.</p>
 <p><strong>Solution</strong>:</p>
-<div class="sourceCode" id="cb584"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb584-1" data-line-number="1"><span class="co"># Original in book</span></a>
-<a class="sourceLine" id="cb584-2" data-line-number="2">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb584-3" data-line-number="3"><span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>(dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>))</a>
-<a class="sourceLine" id="cb584-4" data-line-number="4"></a>
-<a class="sourceLine" id="cb584-5" data-line-number="5"><span class="co"># Alternative way</span></a>
-<a class="sourceLine" id="cb584-6" data-line-number="6">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb584-7" data-line-number="7"><span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">&amp;</span><span class="st"> </span><span class="op">!</span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>)</a>
-<a class="sourceLine" id="cb584-8" data-line-number="8"></a>
-<a class="sourceLine" id="cb584-9" data-line-number="9"><span class="co"># Yet another way</span></a>
-<a class="sourceLine" id="cb584-10" data-line-number="10">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb584-11" data-line-number="11"><span class="st">  </span><span class="kw">filter</span>(dest <span class="op">!=</span><span class="st"> &quot;BTV&quot;</span> <span class="op">&amp;</span><span class="st"> </span>dest <span class="op">!=</span><span class="st"> &quot;SEA&quot;</span>)</a></code></pre></div>
+<div class="sourceCode" id="cb581"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb581-1" data-line-number="1"><span class="co"># Original in book</span></a>
+<a class="sourceLine" id="cb581-2" data-line-number="2">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb581-3" data-line-number="3"><span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>(dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">|</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>))</a>
+<a class="sourceLine" id="cb581-4" data-line-number="4"></a>
+<a class="sourceLine" id="cb581-5" data-line-number="5"><span class="co"># Alternative way</span></a>
+<a class="sourceLine" id="cb581-6" data-line-number="6">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb581-7" data-line-number="7"><span class="st">  </span><span class="kw">filter</span>(<span class="op">!</span>dest <span class="op">==</span><span class="st"> &quot;BTV&quot;</span> <span class="op">&amp;</span><span class="st"> </span><span class="op">!</span>dest <span class="op">==</span><span class="st"> &quot;SEA&quot;</span>)</a>
+<a class="sourceLine" id="cb581-8" data-line-number="8"></a>
+<a class="sourceLine" id="cb581-9" data-line-number="9"><span class="co"># Yet another way</span></a>
+<a class="sourceLine" id="cb581-10" data-line-number="10">not_BTV_SEA &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb581-11" data-line-number="11"><span class="st">  </span><span class="kw">filter</span>(dest <span class="op">!=</span><span class="st"> &quot;BTV&quot;</span> <span class="op">&amp;</span><span class="st"> </span>dest <span class="op">!=</span><span class="st"> &quot;SEA&quot;</span>)</a></code></pre></div>
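+<p>The three filters above are equivalent by De Morgan’s laws. As a minimal sketch, the check below runs them on a small made-up data frame, <code>toy_flights</code> (a hypothetical stand-in, not part of <code>nycflights13</code>), and adds a fourth spelling using <code>%in%</code>. The names <code>way_1</code> to <code>way_4</code> are also only illustrative.</p>

```r
library(dplyr)

# Hypothetical toy data standing in for the flights data frame
toy_flights <- tibble(dest = c("BTV", "SEA", "IAH", "MIA", "SEA", "ATL"))

# !(A | B), as in the book
way_1 <- toy_flights %>%
  filter(!(dest == "BTV" | dest == "SEA"))

# !A & !B: in R, ! binds more loosely than ==,
# so !dest == "BTV" is read as !(dest == "BTV")
way_2 <- toy_flights %>%
  filter(!dest == "BTV" & !dest == "SEA")

# The same condition written with !=
way_3 <- toy_flights %>%
  filter(dest != "BTV" & dest != "SEA")

# Negating a membership test
way_4 <- toy_flights %>%
  filter(!(dest %in% c("BTV", "SEA")))

# All four keep the same rows
all.equal(as.data.frame(way_1), as.data.frame(way_2))
all.equal(as.data.frame(way_1), as.data.frame(way_3))
all.equal(as.data.frame(way_1), as.data.frame(way_4))
```

+<p>The <code>%in%</code> form tends to scale better when the list of excluded destinations grows.</p>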
 <p><strong>(LC3.2)</strong> Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?</p>
 <p><strong>Solution</strong>: The missing patients may have died of lung cancer! So to ignore them might seriously <strong>bias</strong> your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself:</p>
 <ul>
@@ -772,19 +907,19 @@ <h2><span class="header-section-number">D.3</span> Chapter 3 Solutions</h2>
 </ul>
 <p><strong>(LC3.3)</strong> Modify the above <code>summarize</code> function to create <code>summary_temp</code> to also use the <code>n()</code> summary function: <code>summarize(count = n())</code>. What does the returned value correspond to?</p>
 <p><strong>Solution</strong>: It corresponds to a count of the number of observations/rows:</p>
-<div class="sourceCode" id="cb585"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb585-1" data-line-number="1">weather <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb585-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a></code></pre></div>
+<div class="sourceCode" id="cb582"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb582-1" data-line-number="1">weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb582-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
   count
   &lt;int&gt;
 1 26115</code></pre>
 <p><strong>(LC3.4)</strong> Why doesn’t the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run <code>summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE))</code> first.</p>
-<div class="sourceCode" id="cb587"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb587-1" data-line-number="1">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st">   </span></a>
-<a class="sourceLine" id="cb587-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb587-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb584"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb584-1" data-line-number="1">summary_temp &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st">   </span></a>
+<a class="sourceLine" id="cb584-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb584-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
 <p><strong>Solution</strong>: Consider the output of only running the first two lines:</p>
-<div class="sourceCode" id="cb588"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb588-1" data-line-number="1">weather <span class="op">%&gt;%</span><span class="st">   </span></a>
-<a class="sourceLine" id="cb588-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb585"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb585-1" data-line-number="1">weather <span class="op">%&gt;%</span><span class="st">   </span></a>
+<a class="sourceLine" id="cb585-2" data-line-number="2"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
 <pre><code># A tibble: 1 x 1
      mean
     &lt;dbl&gt;
@@ -792,28 +927,1021 @@ <h2><span class="header-section-number">D.3</span> Chapter 3 Solutions</h2>
 <p>Because after the first <code>summarize()</code>, the variable <code>temp</code> disappears as it has been collapsed to the value <code>mean</code>. So when we try to run the second <code>summarize()</code>, it can’t find the variable <code>temp</code> to compute the standard deviation of.</p>
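+<p>One way to avoid the problem is to compute both summaries inside a single <code>summarize()</code> call, while <code>temp</code> still exists. The sketch below uses a small made-up data frame, <code>toy_weather</code> (hypothetical, chosen so the arithmetic is easy to follow), rather than the real <code>weather</code> data.</p>

```r
library(dplyr)

# Hypothetical toy data standing in for the weather data frame
toy_weather <- tibble(temp = c(30, 40, NA, 50))

# Both summaries are computed in one step, so temp is still available
summary_temp <- toy_weather %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp
```

+<p>The same pattern applies to <code>weather</code>: replace <code>toy_weather</code> with <code>weather</code> and the two chained <code>summarize()</code> calls collapse into one.</p>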
 <p><strong>(LC3.5)</strong> Recall from Chapter <a href="2-viz.html#viz">2</a> when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the <code>summary_monthly_temp</code> data frame tell us about temperatures in New York City throughout the year?</p>
 <p><strong>Solution</strong>:</p>
+<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
+<thead>
+<tr>
+<th style="text-align:right;">
+month
+</th>
+<th style="text-align:right;">
+mean
+</th>
+<th style="text-align:right;">
+std_dev
+</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:right;">
+1
+</td>
+<td style="text-align:right;">
+35.6
+</td>
+<td style="text-align:right;">
+10.22
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+2
+</td>
+<td style="text-align:right;">
+34.3
+</td>
+<td style="text-align:right;">
+6.98
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+3
+</td>
+<td style="text-align:right;">
+39.9
+</td>
+<td style="text-align:right;">
+6.25
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+4
+</td>
+<td style="text-align:right;">
+51.7
+</td>
+<td style="text-align:right;">
+8.79
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+5
+</td>
+<td style="text-align:right;">
+61.8
+</td>
+<td style="text-align:right;">
+9.68
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+6
+</td>
+<td style="text-align:right;">
+72.2
+</td>
+<td style="text-align:right;">
+7.55
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+7
+</td>
+<td style="text-align:right;">
+80.1
+</td>
+<td style="text-align:right;">
+7.12
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+8
+</td>
+<td style="text-align:right;">
+74.5
+</td>
+<td style="text-align:right;">
+5.19
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+9
+</td>
+<td style="text-align:right;">
+67.4
+</td>
+<td style="text-align:right;">
+8.47
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+10
+</td>
+<td style="text-align:right;">
+60.1
+</td>
+<td style="text-align:right;">
+8.85
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+11
+</td>
+<td style="text-align:right;">
+45.0
+</td>
+<td style="text-align:right;">
+10.44
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+12
+</td>
+<td style="text-align:right;">
+38.4
+</td>
+<td style="text-align:right;">
+9.98
+</td>
+</tr>
+</tbody>
+</table>
 <p>The standard deviation is a quantification of <strong>spread</strong> and <strong>variability</strong>. We
 see that the period in November, December, and January has the most variation in
 temperature, so you can expect very different temperatures on different days.</p>
 <p><strong>(LC3.6)</strong> What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?</p>
 <p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb587"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb587-1" data-line-number="1">summary_temp_by_day &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb587-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(year, month, day) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb587-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(</a>
+<a class="sourceLine" id="cb587-4" data-line-number="4">          <span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb587-5" data-line-number="5">          <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)</a>
+<a class="sourceLine" id="cb587-6" data-line-number="6">          )</a>
+<a class="sourceLine" id="cb587-7" data-line-number="7">summary_temp_by_day</a></code></pre></div>
+<pre><code># A tibble: 364 x 5
+# Groups:   year, month [12]
+    year month   day    mean std_dev
+   &lt;int&gt; &lt;int&gt; &lt;int&gt;   &lt;dbl&gt;   &lt;dbl&gt;
+ 1  2013     1     1 36.9997 4.00117
+ 2  2013     1     2 28.7025 3.45205
+ 3  2013     1     3 29.9725 2.58472
+ 4  2013     1     4 34.94   2.45283
+ 5  2013     1     5 37.205  4.00500
+ 6  2013     1     6 40.0518 4.39562
+ 7  2013     1     7 40.5825 3.68319
+ 8  2013     1     8 40.1175 5.77457
+ 9  2013     1     9 43.225  5.39724
+10  2013     1    10 43.85   2.95214
+# … with 354 more rows</code></pre>
 <p>Note: <code>group_by(day)</code> is not enough, because <code>day</code> is a value between 1 and 31, so it would pool the same day number across all months. We need to <code>group_by(year, month, day)</code>.</p>
-<div class="sourceCode" id="cb590"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb590-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
-<a class="sourceLine" id="cb590-2" data-line-number="2"><span class="kw">library</span>(nycflights13)</a>
-<a class="sourceLine" id="cb590-3" data-line-number="3"></a>
-<a class="sourceLine" id="cb590-4" data-line-number="4">summary_temp_by_month &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb590-5" data-line-number="5"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb590-6" data-line-number="6"><span class="st">  </span><span class="kw">summarize</span>(</a>
-<a class="sourceLine" id="cb590-7" data-line-number="7">          <span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
-<a class="sourceLine" id="cb590-8" data-line-number="8">          <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)</a>
-<a class="sourceLine" id="cb590-9" data-line-number="9">          )</a></code></pre></div>
+<div class="sourceCode" id="cb589"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb589-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb589-2" data-line-number="2"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb589-3" data-line-number="3"></a>
+<a class="sourceLine" id="cb589-4" data-line-number="4">summary_temp_by_month &lt;-<span class="st"> </span>weather <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb589-5" data-line-number="5"><span class="st">  </span><span class="kw">group_by</span>(month) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb589-6" data-line-number="6"><span class="st">  </span><span class="kw">summarize</span>(</a>
+<a class="sourceLine" id="cb589-7" data-line-number="7">          <span class="dt">mean =</span> <span class="kw">mean</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),</a>
+<a class="sourceLine" id="cb589-8" data-line-number="8">          <span class="dt">std_dev =</span> <span class="kw">sd</span>(temp, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)</a>
+<a class="sourceLine" id="cb589-9" data-line-number="9">          )</a></code></pre></div>
 <p><strong>(LC3.7)</strong> Recreate <code>by_monthly_origin</code>, but instead of grouping via <code>group_by(origin, month)</code>, group variables in a different order <code>group_by(month, origin)</code>. What differs in the resulting dataset?</p>
 <p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb590"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb590-1" data-line-number="1">by_monthly_origin &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb590-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(month, origin) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb590-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count =</span> <span class="kw">n</span>())</a></code></pre></div>
 <div class="sourceCode" id="cb591"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb591-1" data-line-number="1">by_monthly_origin</a></code></pre></div>
+<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
+<thead>
+<tr>
+<th style="text-align:right;">
+month
+</th>
+<th style="text-align:left;">
+origin
+</th>
+<th style="text-align:right;">
+count
+</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:right;">
+1
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+9893
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+1
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9161
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+1
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+7950
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+2
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+9107
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+2
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+8421
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+2
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+7423
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+3
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10420
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+3
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9697
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+3
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8717
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+4
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10531
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+4
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9218
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+4
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8581
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+5
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10592
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+5
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9397
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+5
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8807
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+6
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10175
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+6
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9472
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+6
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8596
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+7
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10475
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+7
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+10023
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+7
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8927
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+8
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10359
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+8
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9983
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+8
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8985
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+9
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+9550
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+9
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+8908
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+9
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+9116
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+10
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+10104
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+10
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9143
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+10
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+9642
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+11
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+9707
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+11
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+8710
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+11
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+8851
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+12
+</td>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:right;">
+9922
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+12
+</td>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:right;">
+9146
+</td>
+</tr>
+<tr>
+<td style="text-align:right;">
+12
+</td>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:right;">
+9067
+</td>
+</tr>
+</tbody>
+</table>
 <p>In <code>by_monthly_origin</code> the <code>month</code> column is now first and the rows are sorted by <code>month</code> instead of origin. If you compare the values of <code>count</code> in <code>by_origin_monthly</code> and <code>by_monthly_origin</code> using the <code>View()</code> function, you’ll see that the values are actually the same, just presented in a different order.</p>
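+<p>Instead of comparing the two results by eye in <code>View()</code>, the claim can also be checked in code. This is a minimal sketch on a made-up data frame, <code>toy_flights</code> (hypothetical); it puts both results into the same column and row order before comparing them.</p>

```r
library(dplyr)

# Hypothetical toy data standing in for the flights data frame
toy_flights <- tibble(
  month  = c(1, 1, 2, 2, 2, 1),
  origin = c("JFK", "EWR", "JFK", "EWR", "JFK", "JFK")
)

by_origin_monthly <- toy_flights %>%
  group_by(origin, month) %>%
  summarize(count = n())

by_monthly_origin <- toy_flights %>%
  group_by(month, origin) %>%
  summarize(count = n())

# Drop the grouping, then align column order and row order
aligned_origin_monthly <- by_origin_monthly %>%
  ungroup() %>%
  select(month, origin, count) %>%
  arrange(month, origin)

aligned_monthly_origin <- by_monthly_origin %>%
  ungroup() %>%
  arrange(month, origin)

# TRUE when the counts agree once the ordering is the same
same_counts <- isTRUE(all.equal(
  as.data.frame(aligned_origin_monthly),
  as.data.frame(aligned_monthly_origin)
))
same_counts
```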
 <p><strong>(LC3.8)</strong> How could we identify how many flights left each of the three airports for each <code>carrier</code>?</p>
 <p><strong>Solution</strong>: We could summarize the count from each airport using the <code>n()</code> function, which <em>counts rows</em>.</p>
-<p>All remarkably similar! Note: the <code>n()</code> function counts rows, whereas the <code>sum(VARIABLE_NAME)</code> funciton sums all values of a certain numerical variable <code>VARIABLE_NAME</code>.</p>
+<div class="sourceCode" id="cb592"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb592-1" data-line-number="1">count_flights_by_airport &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb592-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(origin, carrier) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb592-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">count=</span><span class="kw">n</span>())</a></code></pre></div>
+<div class="sourceCode" id="cb593"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb593-1" data-line-number="1">count_flights_by_airport</a></code></pre></div>
+<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
+<thead>
+<tr>
+<th style="text-align:left;">
+origin
+</th>
+<th style="text-align:left;">
+carrier
+</th>
+<th style="text-align:right;">
+count
+</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+9E
+</td>
+<td style="text-align:right;">
+1268
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+AA
+</td>
+<td style="text-align:right;">
+3487
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+AS
+</td>
+<td style="text-align:right;">
+714
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+B6
+</td>
+<td style="text-align:right;">
+6557
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+DL
+</td>
+<td style="text-align:right;">
+4342
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+EV
+</td>
+<td style="text-align:right;">
+43939
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+MQ
+</td>
+<td style="text-align:right;">
+2276
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+OO
+</td>
+<td style="text-align:right;">
+6
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+UA
+</td>
+<td style="text-align:right;">
+46087
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+US
+</td>
+<td style="text-align:right;">
+4405
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+VX
+</td>
+<td style="text-align:right;">
+1566
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+EWR
+</td>
+<td style="text-align:left;">
+WN
+</td>
+<td style="text-align:right;">
+6188
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+9E
+</td>
+<td style="text-align:right;">
+14651
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+AA
+</td>
+<td style="text-align:right;">
+13783
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+B6
+</td>
+<td style="text-align:right;">
+42076
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+DL
+</td>
+<td style="text-align:right;">
+20701
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+EV
+</td>
+<td style="text-align:right;">
+1408
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+HA
+</td>
+<td style="text-align:right;">
+342
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+MQ
+</td>
+<td style="text-align:right;">
+7193
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+UA
+</td>
+<td style="text-align:right;">
+4534
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+US
+</td>
+<td style="text-align:right;">
+2995
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+JFK
+</td>
+<td style="text-align:left;">
+VX
+</td>
+<td style="text-align:right;">
+3596
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+9E
+</td>
+<td style="text-align:right;">
+2541
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+AA
+</td>
+<td style="text-align:right;">
+15459
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+B6
+</td>
+<td style="text-align:right;">
+6002
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+DL
+</td>
+<td style="text-align:right;">
+23067
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+EV
+</td>
+<td style="text-align:right;">
+8826
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+F9
+</td>
+<td style="text-align:right;">
+685
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+FL
+</td>
+<td style="text-align:right;">
+3260
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+MQ
+</td>
+<td style="text-align:right;">
+16928
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+OO
+</td>
+<td style="text-align:right;">
+26
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+UA
+</td>
+<td style="text-align:right;">
+8044
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+US
+</td>
+<td style="text-align:right;">
+13136
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+WN
+</td>
+<td style="text-align:right;">
+6087
+</td>
+</tr>
+<tr>
+<td style="text-align:left;">
+LGA
+</td>
+<td style="text-align:left;">
+YV
+</td>
+<td style="text-align:right;">
+601
+</td>
+</tr>
+</tbody>
+</table>
+<p>All remarkably similar! Note: the <code>n()</code> function counts rows, whereas the <code>sum(VARIABLE_NAME)</code> function sums all values of a certain numerical variable <code>VARIABLE_NAME</code>.</p>
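+<p>As a minimal sketch of this distinction, the toy example below uses a small made-up data frame (the name <code>toy_flights</code> and its values are hypothetical and not part of <code>nycflights13</code>) to show <code>n()</code> and <code>sum()</code> side by side:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+
+# Hypothetical toy data: three flights with made-up seat counts
+toy_flights &lt;- data.frame(
+  carrier = c(&quot;AA&quot;, &quot;AA&quot;, &quot;UA&quot;),
+  seats = c(100, 150, 200)
+)
+
+counts &lt;- toy_flights %&gt;% 
+  group_by(carrier) %&gt;% 
+  summarize(
+    num_rows = n(),           # number of rows in each group
+    total_seats = sum(seats)  # sum of the seats values in each group
+  )
+counts</code></pre></div>
+<p>Here <code>AA</code> has two rows whose <code>seats</code> values add up to 250, so <code>num_rows</code> is 2 while <code>total_seats</code> is 250.</p>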
 <p><strong>(LC3.9)</strong> How does the <code>filter</code> operation differ from a <code>group_by</code> followed by a <code>summarize</code>?</p>
 <p><strong>Solution</strong>:</p>
 <ul>
@@ -844,12 +1972,130 @@ <h2><span class="header-section-number">D.3</span> Chapter 3 Solutions</h2>
 <p><strong>Solution</strong>: When datasets are in normal form, we can easily <code>_join</code> them with other datasets! For example, we can join the <code>flights</code> data with the <code>planes</code> data.</p>
 <p><strong>(LC3.16)</strong> What are some ways to select all three of the <code>dest</code>, <code>air_time</code>, and <code>distance</code> variables from <code>flights</code>? Give the code showing how to do this in at least three different ways.</p>
 <p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb594"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb594-1" data-line-number="1"><span class="co"># The regular way:</span></a>
+<a class="sourceLine" id="cb594-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb594-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(dest, air_time, distance)</a></code></pre></div>
+<pre><code># A tibble: 336,776 x 3
+   dest  air_time distance
+   &lt;chr&gt;    &lt;dbl&gt;    &lt;dbl&gt;
+ 1 IAH        227     1400
+ 2 IAH        227     1416
+ 3 MIA        160     1089
+ 4 BQN        183     1576
+ 5 ATL        116      762
+ 6 ORD        150      719
+ 7 FLL        158     1065
+ 8 IAD         53      229
+ 9 MCO        140      944
+10 ORD        138      733
+# … with 336,766 more rows</code></pre>
+<div class="sourceCode" id="cb596"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb596-1" data-line-number="1"><span class="co"># Since they are sequential columns in the dataset</span></a>
+<a class="sourceLine" id="cb596-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb596-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(dest<span class="op">:</span>distance)</a></code></pre></div>
+<pre><code># A tibble: 336,776 x 3
+   dest  air_time distance
+   &lt;chr&gt;    &lt;dbl&gt;    &lt;dbl&gt;
+ 1 IAH        227     1400
+ 2 IAH        227     1416
+ 3 MIA        160     1089
+ 4 BQN        183     1576
+ 5 ATL        116      762
+ 6 ORD        150      719
+ 7 FLL        158     1065
+ 8 IAD         53      229
+ 9 MCO        140      944
+10 ORD        138      733
+# … with 336,766 more rows</code></pre>
+<div class="sourceCode" id="cb598"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb598-1" data-line-number="1"><span class="co"># Not as effective, by removing everything else</span></a>
+<a class="sourceLine" id="cb598-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb598-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(<span class="op">-</span>year, <span class="op">-</span>month, <span class="op">-</span>day, <span class="op">-</span>dep_time, <span class="op">-</span>sched_dep_time, <span class="op">-</span>dep_delay, <span class="op">-</span>arr_time,</a>
+<a class="sourceLine" id="cb598-4" data-line-number="4">         <span class="op">-</span>sched_arr_time, <span class="op">-</span>arr_delay, <span class="op">-</span>carrier, <span class="op">-</span>flight, <span class="op">-</span>tailnum, <span class="op">-</span>origin, </a>
+<a class="sourceLine" id="cb598-5" data-line-number="5">         <span class="op">-</span>hour, <span class="op">-</span>minute, <span class="op">-</span>time_hour)</a></code></pre></div>
+<pre><code># A tibble: 336,776 x 6
+   dest  air_time distance  gain    hours gain_per_hour
+   &lt;chr&gt;    &lt;dbl&gt;    &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt;         &lt;dbl&gt;
+ 1 IAH        227     1400    -9 3.78333       -2.37885
+ 2 IAH        227     1416   -16 3.78333       -4.22907
+ 3 MIA        160     1089   -31 2.66667      -11.625  
+ 4 BQN        183     1576    17 3.05           5.57377
+ 5 ATL        116      762    19 1.93333        9.82759
+ 6 ORD        150      719   -16 2.5           -6.4    
+ 7 FLL        158     1065   -24 2.63333       -9.11392
+ 8 IAD         53      229    11 0.883333      12.4528 
+ 9 MCO        140      944     5 2.33333        2.14286
+10 ORD        138      733   -10 2.300         -4.34783
+# … with 336,766 more rows</code></pre>
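+<p>As a further option, base R bracket indexing with a character vector of column names should return the same three variables. This is only a sketch that steps outside the <code>dplyr</code> verbs covered in the chapter, and the name <code>flights_subset</code> is our own:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(nycflights13)
+
+# Select columns by name using base R indexing
+flights_subset &lt;- flights[, c(&quot;dest&quot;, &quot;air_time&quot;, &quot;distance&quot;)]
+flights_subset</code></pre></div>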
 <p><strong>(LC3.17)</strong> How could one use <code>starts_with</code>, <code>ends_with</code>, and <code>contains</code> to select columns from the <code>flights</code> data frame? Provide three different examples in total: one for <code>starts_with</code>, one for <code>ends_with</code>, and one for <code>contains</code>.</p>
 <p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb600"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb600-1" data-line-number="1"><span class="co"># Anything that starts with &quot;d&quot;</span></a>
+<a class="sourceLine" id="cb600-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb600-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(<span class="kw">starts_with</span>(<span class="st">&quot;d&quot;</span>))</a></code></pre></div>
+<pre><code># A tibble: 336,776 x 5
+     day dep_time dep_delay dest  distance
+   &lt;int&gt;    &lt;int&gt;     &lt;dbl&gt; &lt;chr&gt;    &lt;dbl&gt;
+ 1     1      517         2 IAH       1400
+ 2     1      533         4 IAH       1416
+ 3     1      542         2 MIA       1089
+ 4     1      544        -1 BQN       1576
+ 5     1      554        -6 ATL        762
+ 6     1      554        -4 ORD        719
+ 7     1      555        -5 FLL       1065
+ 8     1      557        -3 IAD        229
+ 9     1      557        -3 MCO        944
+10     1      558        -2 ORD        733
+# … with 336,766 more rows</code></pre>
+<div class="sourceCode" id="cb602"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb602-1" data-line-number="1"><span class="co"># Anything related to delays:</span></a>
+<a class="sourceLine" id="cb602-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb602-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(<span class="kw">ends_with</span>(<span class="st">&quot;delay&quot;</span>))</a></code></pre></div>
+<pre><code># A tibble: 336,776 x 2
+   dep_delay arr_delay
+       &lt;dbl&gt;     &lt;dbl&gt;
+ 1         2        11
+ 2         4        20
+ 3         2        33
+ 4        -1       -18
+ 5        -6       -25
+ 6        -4        12
+ 7        -5        19
+ 8        -3       -14
+ 9        -3        -8
+10        -2         8
+# … with 336,766 more rows</code></pre>
+<div class="sourceCode" id="cb604"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb604-1" data-line-number="1"><span class="co"># Anything related to departures:</span></a>
+<a class="sourceLine" id="cb604-2" data-line-number="2">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb604-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(<span class="kw">contains</span>(<span class="st">&quot;dep&quot;</span>))</a></code></pre></div>
+<pre><code># A tibble: 336,776 x 3
+   dep_time sched_dep_time dep_delay
+      &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;
+ 1      517            515         2
+ 2      533            529         4
+ 3      542            540         2
+ 4      544            545        -1
+ 5      554            600        -6
+ 6      554            558        -4
+ 7      555            600        -5
+ 8      557            600        -3
+ 9      557            600        -3
+10      558            600        -2
+# … with 336,766 more rows</code></pre>
 <p><strong>(LC3.18)</strong> Why might we want to use the <code>select()</code> function on a data frame?</p>
 <p><strong>Solution</strong>: To narrow the data frame down to the variables of interest, making it easier to look at, for example when using <code>View()</code>.</p>
 <p><strong>(LC3.19)</strong> Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.</p>
 <p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb606"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb606-1" data-line-number="1">top_five &lt;-<span class="st"> </span>flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb606-2" data-line-number="2"><span class="st">  </span><span class="kw">group_by</span>(dest) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb606-3" data-line-number="3"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">avg_delay =</span> <span class="kw">mean</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb606-4" data-line-number="4"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(avg_delay)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb606-5" data-line-number="5"><span class="st">  </span><span class="kw">top_n</span>(<span class="dt">n =</span> <span class="dv">5</span>)</a>
+<a class="sourceLine" id="cb606-6" data-line-number="6">top_five</a></code></pre></div>
+<pre><code># A tibble: 5 x 2
+  dest  avg_delay
+  &lt;chr&gt;     &lt;dbl&gt;
+1 CAE     41.7642
+2 TUL     33.6599
+3 OKC     30.6190
+4 JAC     28.0952
+5 TYS     24.0692</code></pre>
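+<p>Note that when <code>top_n()</code> is called without a <code>wt</code> argument, it ranks by the last variable in the data frame, which here happens to be <code>avg_delay</code>. A sketch that makes the ranking variable explicit, using a made-up data frame (<code>toy_delays</code> and its values are hypothetical):</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+
+# Hypothetical average delays for four made-up destinations
+toy_delays &lt;- data.frame(
+  dest = c(&quot;AAA&quot;, &quot;BBB&quot;, &quot;CCC&quot;, &quot;DDD&quot;),
+  avg_delay = c(10, 40, 25, 5)
+)
+
+top_two &lt;- toy_delays %&gt;% 
+  arrange(desc(avg_delay)) %&gt;% 
+  # State the ranking variable explicitly via wt
+  top_n(n = 2, wt = avg_delay)
+top_two</code></pre></div>
+<p>This keeps the rows for <code>BBB</code> and <code>CCC</code>, the two largest values of <code>avg_delay</code>.</p>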
 <p><strong>(LC3.20)</strong> Using the datasets included in the <code>nycflights13</code> package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:</p>
 <ol style="list-style-type: decimal">
 <li><strong>Crucial</strong>: Unless you are very confident in what you are doing, it is worthwhile to not start coding right away, but rather first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level <em>pseudocode</em> that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse <em>what</em> you are trying to do (the algorithm) with <em>how</em> you are going to do it (writing <code>dplyr</code> code).</li>
@@ -858,31 +2104,201 @@ <h2><span class="header-section-number">D.3</span> Chapter 3 Solutions</h2>
 <li>Consider the data wrangling verbs in Table <a href="3-wrangling.html#tab:wrangle-summary-table">3.2</a> as your toolbox!</li>
 </ol>
 <p><strong>Solution</strong>: Here are some examples of student-written <a href="https://twitter.com/rudeboybert/status/964181298691629056">pseudocode</a>. Based on our own pseudocode, let’s first display the entire solution.</p>
+<div class="sourceCode" id="cb608"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb608-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb608-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb608-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb608-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">ASM =</span> seats <span class="op">*</span><span class="st"> </span>distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb608-5" data-line-number="5"><span class="st">  </span><span class="kw">group_by</span>(carrier) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb608-6" data-line-number="6"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">ASM =</span> <span class="kw">sum</span>(ASM, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb608-7" data-line-number="7"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(ASM))</a></code></pre></div>
+<pre><code># A tibble: 16 x 2
+   carrier         ASM
+   &lt;chr&gt;         &lt;dbl&gt;
+ 1 UA      15516377526
+ 2 DL      10532885801
+ 3 B6       9618222135
+ 4 AA       3677292231
+ 5 US       2533505829
+ 6 VX       2296680778
+ 7 EV       1817236275
+ 8 WN       1718116857
+ 9 9E        776970310
+10 HA        642478122
+11 AS        314104736
+12 FL        219628520
+13 F9        184832280
+14 YV         20163632
+15 MQ          7162420
+16 OO          1299835</code></pre>
 <p>Let’s now break this down step-by-step. To compute the available seat miles for a given flight, we need the <code>distance</code> variable from the <code>flights</code> data frame and the <code>seats</code> variable from the <code>planes</code> data frame, necessitating a join by the key variable <code>tailnum</code> as illustrated in Figure <a href="3-wrangling.html#fig:reldiagram">3.7</a>. To keep the resulting data frame easy to view, we’ll <code>select()</code> only these two variables and <code>carrier</code>:</p>
+<div class="sourceCode" id="cb610"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb610-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb610-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb610-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance)</a></code></pre></div>
+<pre><code># A tibble: 284,170 x 3
+   carrier seats distance
+   &lt;chr&gt;   &lt;int&gt;    &lt;dbl&gt;
+ 1 UA        149     1400
+ 2 UA        149     1416
+ 3 AA        178     1089
+ 4 B6        200     1576
+ 5 DL        178      762
+ 6 UA        191      719
+ 7 B6        200     1065
+ 8 EV         55      229
+ 9 B6        200      944
+10 B6        200     1028
+# … with 284,160 more rows</code></pre>
 <p>Now for each flight we can compute the available seat miles <code>ASM</code> by multiplying the number of seats by the distance via a <code>mutate()</code>:</p>
+<div class="sourceCode" id="cb612"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb612-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb612-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb612-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb612-4" data-line-number="4"><span class="st">  </span><span class="co"># Added:</span></a>
+<a class="sourceLine" id="cb612-5" data-line-number="5"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">ASM =</span> seats <span class="op">*</span><span class="st"> </span>distance)</a></code></pre></div>
+<pre><code># A tibble: 284,170 x 4
+   carrier seats distance    ASM
+   &lt;chr&gt;   &lt;int&gt;    &lt;dbl&gt;  &lt;dbl&gt;
+ 1 UA        149     1400 208600
+ 2 UA        149     1416 210984
+ 3 AA        178     1089 193842
+ 4 B6        200     1576 315200
+ 5 DL        178      762 135636
+ 6 UA        191      719 137329
+ 7 B6        200     1065 213000
+ 8 EV         55      229  12595
+ 9 B6        200      944 188800
+10 B6        200     1028 205600
+# … with 284,160 more rows</code></pre>
 <p>Next we want to sum the <code>ASM</code> for each carrier. We achieve this by first grouping by <code>carrier</code> and then summarizing using the <code>sum()</code> function:</p>
+<div class="sourceCode" id="cb614"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb614-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb614-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb614-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb614-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">ASM =</span> seats <span class="op">*</span><span class="st"> </span>distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb614-5" data-line-number="5"><span class="st">  </span><span class="co"># Added:</span></a>
+<a class="sourceLine" id="cb614-6" data-line-number="6"><span class="st">  </span><span class="kw">group_by</span>(carrier) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb614-7" data-line-number="7"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">ASM =</span> <span class="kw">sum</span>(ASM))</a></code></pre></div>
+<pre><code># A tibble: 16 x 2
+   carrier         ASM
+   &lt;chr&gt;         &lt;dbl&gt;
+ 1 9E        776970310
+ 2 AA       3677292231
+ 3 AS        314104736
+ 4 B6       9618222135
+ 5 DL      10532885801
+ 6 EV       1817236275
+ 7 F9        184832280
+ 8 FL        219628520
+ 9 HA        642478122
+10 MQ          7162420
+11 OO          1299835
+12 UA      15516377526
+13 US       2533505829
+14 VX       2296680778
+15 WN       1718116857
+16 YV         20163632</code></pre>
 <p>However, if certain flights had missing (<code>NA</code>) values for <code>seats</code> or <code>distance</code>, the sums for the corresponding carriers would also be <code>NA</code>. We can guard against this by adding a <code>na.rm = TRUE</code> argument to <code>sum()</code>, telling R that we want to remove the <code>NA</code>’s in the sum. We saw this in Section <a href="3-wrangling.html#summarize">3.3</a>:</p>
+<div class="sourceCode" id="cb616"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb616-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb616-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb616-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb616-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">ASM =</span> seats <span class="op">*</span><span class="st"> </span>distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb616-5" data-line-number="5"><span class="st">  </span><span class="kw">group_by</span>(carrier) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb616-6" data-line-number="6"><span class="st">  </span><span class="co"># Modified:</span></a>
+<a class="sourceLine" id="cb616-7" data-line-number="7"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">ASM =</span> <span class="kw">sum</span>(ASM, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</a></code></pre></div>
+<pre><code># A tibble: 16 x 2
+   carrier         ASM
+   &lt;chr&gt;         &lt;dbl&gt;
+ 1 9E        776970310
+ 2 AA       3677292231
+ 3 AS        314104736
+ 4 B6       9618222135
+ 5 DL      10532885801
+ 6 EV       1817236275
+ 7 F9        184832280
+ 8 FL        219628520
+ 9 HA        642478122
+10 MQ          7162420
+11 OO          1299835
+12 UA      15516377526
+13 US       2533505829
+14 VX       2296680778
+15 WN       1718116857
+16 YV         20163632</code></pre>
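+<p>To see the effect of <code>na.rm = TRUE</code> in isolation, here is a small sketch on a made-up vector (<code>asm_values</code> is hypothetical, with one value deliberately set to missing):</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical available seat miles for three flights, one of them missing
+asm_values &lt;- c(208600, NA, 193842)
+
+# A single NA makes the whole sum NA
+sum_with_na &lt;- sum(asm_values)
+
+# With na.rm = TRUE the NA is dropped before summing
+sum_without_na &lt;- sum(asm_values, na.rm = TRUE)
+
+sum_with_na
+sum_without_na</code></pre></div>
+<p>The first sum is <code>NA</code>, while the second is 402442, the total of the two non-missing values.</p>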
 <p>Finally, we <code>arrange()</code> the data in <code>desc()</code>ending order of <code>ASM</code>.</p>
+<div class="sourceCode" id="cb618"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb618-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb618-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb618-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb618-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">ASM =</span> seats <span class="op">*</span><span class="st"> </span>distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb618-5" data-line-number="5"><span class="st">  </span><span class="kw">group_by</span>(carrier) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb618-6" data-line-number="6"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">ASM =</span> <span class="kw">sum</span>(ASM, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb618-7" data-line-number="7"><span class="st">  </span><span class="co"># Added:</span></a>
+<a class="sourceLine" id="cb618-8" data-line-number="8"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(ASM))</a></code></pre></div>
+<pre><code># A tibble: 16 x 2
+   carrier         ASM
+   &lt;chr&gt;         &lt;dbl&gt;
+ 1 UA      15516377526
+ 2 DL      10532885801
+ 3 B6       9618222135
+ 4 AA       3677292231
+ 5 US       2533505829
+ 6 VX       2296680778
+ 7 EV       1817236275
+ 8 WN       1718116857
+ 9 9E        776970310
+10 HA        642478122
+11 AS        314104736
+12 FL        219628520
+13 F9        184832280
+14 YV         20163632
+15 MQ          7162420
+16 OO          1299835</code></pre>
 <p>While the above data frame is correct, the IATA <code>carrier</code> code is not always useful. For example, what carrier is <code>WN</code>? We can address this by joining with the <code>airlines</code> dataset using <code>carrier</code> as the key variable. While this step is not absolutely required, it goes a long way to making the table easier to make sense of. It is important to be empathetic with the ultimate consumers of your presented data!</p>
+<div class="sourceCode" id="cb620"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb620-1" data-line-number="1">flights <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-2" data-line-number="2"><span class="st">  </span><span class="kw">inner_join</span>(planes, <span class="dt">by =</span> <span class="st">&quot;tailnum&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-3" data-line-number="3"><span class="st">  </span><span class="kw">select</span>(carrier, seats, distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-4" data-line-number="4"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">ASM =</span> seats <span class="op">*</span><span class="st"> </span>distance) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-5" data-line-number="5"><span class="st">  </span><span class="kw">group_by</span>(carrier) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-6" data-line-number="6"><span class="st">  </span><span class="kw">summarize</span>(<span class="dt">ASM =</span> <span class="kw">sum</span>(ASM, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-7" data-line-number="7"><span class="st">  </span><span class="kw">arrange</span>(<span class="kw">desc</span>(ASM)) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb620-8" data-line-number="8"><span class="st">  </span><span class="co"># Added:</span></a>
+<a class="sourceLine" id="cb620-9" data-line-number="9"><span class="st">  </span><span class="kw">inner_join</span>(airlines, <span class="dt">by =</span> <span class="st">&quot;carrier&quot;</span>)</a></code></pre></div>
+<pre><code># A tibble: 16 x 3
+   carrier         ASM name                       
+   &lt;chr&gt;         &lt;dbl&gt; &lt;chr&gt;                      
+ 1 UA      15516377526 United Air Lines Inc.      
+ 2 DL      10532885801 Delta Air Lines Inc.       
+ 3 B6       9618222135 JetBlue Airways            
+ 4 AA       3677292231 American Airlines Inc.     
+ 5 US       2533505829 US Airways Inc.            
+ 6 VX       2296680778 Virgin America             
+ 7 EV       1817236275 ExpressJet Airlines Inc.   
+ 8 WN       1718116857 Southwest Airlines Co.     
+ 9 9E        776970310 Endeavor Air Inc.          
+10 HA        642478122 Hawaiian Airlines Inc.     
+11 AS        314104736 Alaska Airlines Inc.       
+12 FL        219628520 AirTran Airways Corporation
+13 F9        184832280 Frontier Airlines Inc.     
+14 YV         20163632 Mesa Airlines Inc.         
+15 MQ          7162420 Envoy Air                  
+16 OO          1299835 SkyWest Airlines Inc.      </code></pre>
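+<p>The exercise asks for a data frame with 2 columns (airline name and available seat miles), whereas the table above still carries the <code>carrier</code> code. As one possible final step, assuming the same packages are loaded, a closing <code>select()</code> keeps only those two variables. The name <code>asm_by_airline</code> is our own choice for illustration:</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(nycflights13)
+
+asm_by_airline &lt;- flights %&gt;% 
+  inner_join(planes, by = &quot;tailnum&quot;) %&gt;% 
+  select(carrier, seats, distance) %&gt;% 
+  mutate(ASM = seats * distance) %&gt;% 
+  group_by(carrier) %&gt;% 
+  summarize(ASM = sum(ASM, na.rm = TRUE)) %&gt;% 
+  arrange(desc(ASM)) %&gt;% 
+  inner_join(airlines, by = &quot;carrier&quot;) %&gt;% 
+  # Added: keep only the airline name and the available seat miles
+  select(name, ASM)
+asm_by_airline</code></pre></div>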
 <hr />
 </div>
 <div id="chapter-4-solutions" class="section level2">
 <h2><span class="header-section-number">D.4</span> Chapter 4 Solutions</h2>
-<div class="sourceCode" id="cb592"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb592-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
-<a class="sourceLine" id="cb592-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
-<a class="sourceLine" id="cb592-3" data-line-number="3"><span class="kw">library</span>(nycflights13)</a>
-<a class="sourceLine" id="cb592-4" data-line-number="4"><span class="kw">library</span>(tidyr)</a>
-<a class="sourceLine" id="cb592-5" data-line-number="5"><span class="kw">library</span>(readr)</a></code></pre></div>
+<div class="sourceCode" id="cb622"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb622-1" data-line-number="1"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb622-2" data-line-number="2"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb622-3" data-line-number="3"><span class="kw">library</span>(readr)</a>
+<a class="sourceLine" id="cb622-4" data-line-number="4"><span class="kw">library</span>(tidyr)</a>
+<a class="sourceLine" id="cb622-5" data-line-number="5"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb622-6" data-line-number="6"><span class="kw">library</span>(fivethirtyeight)</a></code></pre></div>
 <p><strong>(LC4.1)</strong> What are common characteristics of “tidy” datasets?</p>
 <p><strong>Solution</strong>: Rows correspond to observations, while columns correspond to variables.</p>
 <p><strong>(LC4.2)</strong> What makes “tidy” datasets useful for organizing data?</p>
 <p><strong>Solution</strong>: Tidy datasets are an organized way of viewing data. This format is required for the <code>ggplot2</code> and <code>dplyr</code> packages for data visualization and wrangling.</p>
 <p><strong>(LC4.3)</strong> Take a look at the <code>airline_safety</code> data frame included in the <code>fivethirtyeight</code> package. Run the following:</p>
-<div class="sourceCode" id="cb593"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb593-1" data-line-number="1">airline_safety</a></code></pre></div>
+<div class="sourceCode" id="cb623"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb623-1" data-line-number="1">airline_safety</a></code></pre></div>
 <p>After reading the help file by running <code>?airline_safety</code>, we see that <code>airline_safety</code> is a data frame containing information on different airline companies’ safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver’s article <a href="https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/">“Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”</a>. Let’s ignore the <code>incl_reg_subsidiaries</code> and <code>avail_seat_km_per_week</code> variables for simplicity:</p>
-<div class="sourceCode" id="cb594"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb594-1" data-line-number="1">airline_safety_smaller &lt;-<span class="st"> </span>airline_safety <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb594-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(<span class="op">-</span><span class="kw">c</span>(incl_reg_subsidiaries, avail_seat_km_per_week))</a>
-<a class="sourceLine" id="cb594-3" data-line-number="3">airline_safety_smaller</a></code></pre></div>
+<div class="sourceCode" id="cb624"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb624-1" data-line-number="1">airline_safety_smaller &lt;-<span class="st"> </span>airline_safety <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb624-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(<span class="op">-</span><span class="kw">c</span>(incl_reg_subsidiaries, avail_seat_km_per_week))</a>
+<a class="sourceLine" id="cb624-3" data-line-number="3">airline_safety_smaller</a></code></pre></div>
 <pre><code># A tibble: 56 x 7
    airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14
    &lt;chr&gt;             &lt;int&gt;            &lt;int&gt;            &lt;int&gt;           &lt;int&gt;
@@ -898,11 +2314,33 @@ <h2><span class="header-section-number">D.4</span> Chapter 4 Solutions</h2>
 10 Alital…               7                2               50               4
 # … with 46 more rows, and 2 more variables: fatal_accidents_00_14 &lt;int&gt;,
 #   fatalities_00_14 &lt;int&gt;</code></pre>
-<p>This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable <code>incident_type_years</code> indicating the indicent type/year and a variable <code>count</code> of the counts?</p>
-<p><strong>Solution</strong>: Using the <code>gather()</code> function from the <code>tidyr</code> package:</p>
-<div class="sourceCode" id="cb596"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb596-1" data-line-number="1">airline_safety_smaller_tidy &lt;-<span class="st"> </span>airline_safety_smaller <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb596-2" data-line-number="2"><span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> incident_type_years, <span class="dt">value =</span> count, <span class="op">-</span>airline)</a>
-<a class="sourceLine" id="cb596-3" data-line-number="3">airline_safety_smaller_tidy</a></code></pre></div>
+<p>This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable <code>incident_type_years</code> indicating the incident type/year and a variable <code>count</code> of the counts?</p>
+<p><strong>Solution</strong>:</p>
+<p>This can be done using the <code>pivot_longer()</code> function from the <code>tidyr</code> package:</p>
+<div class="sourceCode" id="cb626"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb626-1" data-line-number="1">airline_safety_smaller_tidy &lt;-<span class="st"> </span>airline_safety_smaller <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb626-2" data-line-number="2"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;incident_type_years&quot;</span>, </a>
+<a class="sourceLine" id="cb626-3" data-line-number="3">               <span class="dt">values_to =</span> <span class="st">&quot;count&quot;</span>, </a>
+<a class="sourceLine" id="cb626-4" data-line-number="4">               <span class="dt">cols =</span> <span class="op">-</span>airline)</a>
+<a class="sourceLine" id="cb626-5" data-line-number="5">airline_safety_smaller_tidy</a></code></pre></div>
+<pre><code># A tibble: 336 x 3
+   airline    incident_type_years   count
+   &lt;chr&gt;      &lt;chr&gt;                 &lt;int&gt;
+ 1 Aer Lingus incidents_85_99           2
+ 2 Aer Lingus fatal_accidents_85_99     0
+ 3 Aer Lingus fatalities_85_99          0
+ 4 Aer Lingus incidents_00_14           0
+ 5 Aer Lingus fatal_accidents_00_14     0
+ 6 Aer Lingus fatalities_00_14          0
+ 7 Aeroflot   incidents_85_99          76
+ 8 Aeroflot   fatal_accidents_85_99    14
+ 9 Aeroflot   fatalities_85_99        128
+10 Aeroflot   incidents_00_14           6
+# … with 326 more rows</code></pre>
+<p>If you look at the resulting <code>airline_safety_smaller_tidy</code> data frame in the spreadsheet viewer, you’ll see that the variable <code>incident_type_years</code> has 6 possible values: <code>&quot;incidents_85_99&quot;, &quot;fatal_accidents_85_99&quot;, &quot;fatalities_85_99&quot;,  &quot;incidents_00_14&quot;, &quot;fatal_accidents_00_14&quot;, &quot;fatalities_00_14&quot;</code> corresponding to the 6 columns of <code>airline_safety_smaller</code> we tidied.</p>
+<p>Note that prior to <code>tidyr</code> version 1.0.0 released to CRAN in September 2019, this could also have been done using the <code>gather()</code> function from the <code>tidyr</code> package:</p>
+<div class="sourceCode" id="cb628"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb628-1" data-line-number="1">airline_safety_smaller_tidy &lt;-<span class="st"> </span>airline_safety_smaller <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb628-2" data-line-number="2"><span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> incident_type_years, <span class="dt">value =</span> count, <span class="op">-</span>airline)</a>
+<a class="sourceLine" id="cb628-3" data-line-number="3">airline_safety_smaller_tidy</a></code></pre></div>
 <pre><code># A tibble: 336 x 3
    airline               incident_type_years count
    &lt;chr&gt;                 &lt;chr&gt;               &lt;int&gt;
@@ -917,14 +2355,15 @@ <h2><span class="header-section-number">D.4</span> Chapter 4 Solutions</h2>
  9 Alaska Airlines       incidents_85_99         5
 10 Alitalia              incidents_85_99         7
 # … with 326 more rows</code></pre>
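+<p>The two outputs above contain the same rows in a different order: <code>pivot_longer()</code> keeps each airline's values together, while <code>gather()</code> stacks one original column after another. A minimal sketch of this on a two-airline toy data frame follows; <code>toy_wide</code> and the other object names are hypothetical, and the values are copied from the printed output above.</p>

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same "wide" layout as airline_safety_smaller
toy_wide <- tibble(
  airline = c("Aer Lingus", "Aeroflot"),
  incidents_85_99 = c(2L, 76L),
  incidents_00_14 = c(0L, 6L)
)

# Newer interface: rows are grouped by airline
toy_longer <- toy_wide %>%
  pivot_longer(cols = -airline,
               names_to = "incident_type_years",
               values_to = "count")

# Older interface: rows are grouped by original column
toy_gathered <- toy_wide %>%
  gather(key = incident_type_years, value = count, -airline)

# After sorting both the same way, the contents can be compared directly
toy_longer_sorted <- toy_longer %>% arrange(airline, incident_type_years)
toy_gathered_sorted <- toy_gathered %>% arrange(airline, incident_type_years)

toy_longer
toy_gathered
```

+<p>Sorting both results by <code>airline</code> and <code>incident_type_years</code> is one way to check that only the row order differs.</p>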
-<p>If you look at the resulting <code>airline_safety_smaller_tidy</code> data frame in the spreadsheet viewer, you’ll see that the variable <code>incident_type_years</code> has 6 possible values: <code>&quot;incidents_85_99&quot;, &quot;fatal_accidents_85_99&quot;, &quot;fatalities_85_99&quot;,  &quot;incidents_00_14&quot;, &quot;fatal_accidents_00_14&quot;, &quot;fatalities_00_14&quot;</code> corresponding to the 6 columns of <code>airline_safety_smaller</code> we tidied.</p>
 <p><strong>(LC4.4)</strong> Convert the <code>dem_score</code> data frame into
 a tidy data frame and assign the name of <code>dem_score_tidy</code> to the resulting long-formatted data frame.</p>
 <p><strong>Solution</strong>: Running the following in the console:</p>
-<div class="sourceCode" id="cb598"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb598-1" data-line-number="1">dem_score_tidy &lt;-<span class="st"> </span>dem_score <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb598-2" data-line-number="2"><span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> year, <span class="dt">value =</span> democracy_score, <span class="op">-</span><span class="st"> </span>country)</a></code></pre></div>
+<div class="sourceCode" id="cb630"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb630-1" data-line-number="1">dem_score_tidy &lt;-<span class="st"> </span>dem_score <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb630-2" data-line-number="2"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;year&quot;</span>, <span class="dt">values_to =</span> <span class="st">&quot;democracy_score&quot;</span>, </a>
+<a class="sourceLine" id="cb630-3" data-line-number="3">               <span class="dt">cols =</span> <span class="op">-</span>country)</a>
+<a class="sourceLine" id="cb630-4" data-line-number="4"><span class="co">#  gather(key = year, value = democracy_score, - country)</span></a></code></pre></div>
 <p>Let’s now compare the <code>dem_score</code> and <code>dem_score_tidy</code>. <code>dem_score</code> has democracy score information for each year in columns, whereas in <code>dem_score_tidy</code> there are explicit variables <code>year</code> and <code>democracy_score</code>. While both representations of the data contain the same information, we can only use <code>ggplot()</code> to create plots using the <code>dem_score_tidy</code> data frame.</p>
-<div class="sourceCode" id="cb599"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb599-1" data-line-number="1">dem_score</a></code></pre></div>
+<div class="sourceCode" id="cb631"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb631-1" data-line-number="1">dem_score</a></code></pre></div>
 <pre><code># A tibble: 96 x 10
    country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
    &lt;chr&gt;       &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;
@@ -939,28 +2378,31 @@ <h2><span class="header-section-number">D.4</span> Chapter 4 Solutions</h2>
  9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
 10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
 # … with 86 more rows</code></pre>
-<div class="sourceCode" id="cb601"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb601-1" data-line-number="1">dem_score_tidy</a></code></pre></div>
+<div class="sourceCode" id="cb633"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb633-1" data-line-number="1">dem_score_tidy</a></code></pre></div>
 <pre><code># A tibble: 864 x 3
-   country    year  democracy_score
-   &lt;chr&gt;      &lt;chr&gt;           &lt;dbl&gt;
- 1 Albania    1952               -9
- 2 Argentina  1952               -9
- 3 Armenia    1952               -9
- 4 Australia  1952               10
- 5 Austria    1952               10
- 6 Azerbaijan 1952               -9
- 7 Belarus    1952               -9
- 8 Belgium    1952               10
- 9 Bhutan     1952              -10
-10 Bolivia    1952               -4
+   country   year  democracy_score
+   &lt;chr&gt;     &lt;chr&gt;           &lt;dbl&gt;
+ 1 Albania   1952               -9
+ 2 Albania   1957               -9
+ 3 Albania   1962               -9
+ 4 Albania   1967               -9
+ 5 Albania   1972               -9
+ 6 Albania   1977               -9
+ 7 Albania   1982               -9
+ 8 Albania   1987               -9
+ 9 Albania   1992                5
+10 Argentina 1952               -9
 # … with 854 more rows</code></pre>
 <p><strong>(LC4.5)</strong> Read in the life expectancy data stored at <a href="https://moderndive.com/data/le_mess.csv" class="uri">https://moderndive.com/data/le_mess.csv</a> and convert it to a tidy data frame.</p>
 <p><strong>Solution</strong>: The code is similar:</p>
-<div class="sourceCode" id="cb603"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb603-1" data-line-number="1">life_expectancy &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/le_mess.csv&quot;</span>)</a>
-<a class="sourceLine" id="cb603-2" data-line-number="2">life_expectancy_tidy &lt;-<span class="st"> </span>life_expectancy <span class="op">%&gt;%</span><span class="st"> </span></a>
-<a class="sourceLine" id="cb603-3" data-line-number="3"><span class="st">  </span><span class="kw">gather</span>(<span class="dt">key =</span> year, <span class="dt">value =</span> life_expectancy, <span class="op">-</span>country)</a></code></pre></div>
+<div class="sourceCode" id="cb635"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb635-1" data-line-number="1">life_expectancy &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&quot;https://moderndive.com/data/le_mess.csv&quot;</span>)</a>
+<a class="sourceLine" id="cb635-2" data-line-number="2">life_expectancy_tidy &lt;-<span class="st"> </span>life_expectancy <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb635-3" data-line-number="3"><span class="st">  </span><span class="kw">pivot_longer</span>(<span class="dt">names_to =</span> <span class="st">&quot;year&quot;</span>, </a>
+<a class="sourceLine" id="cb635-4" data-line-number="4">               <span class="dt">values_to =</span> <span class="st">&quot;life_expectancy&quot;</span>,</a>
+<a class="sourceLine" id="cb635-5" data-line-number="5">               <span class="dt">cols =</span> <span class="op">-</span>country)</a>
+<a class="sourceLine" id="cb635-6" data-line-number="6"><span class="co">#  gather(key = year, value = life_expectancy, -country)</span></a></code></pre></div>
 <p>We observe the same structure with respect to <code>year</code> in <code>life_expectancy</code> vs <code>life_expectancy_tidy</code> as we did in <code>dem_score</code> vs <code>dem_score_tidy</code>:</p>
-<div class="sourceCode" id="cb604"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb604-1" data-line-number="1">life_expectancy</a></code></pre></div>
+<div class="sourceCode" id="cb636"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb636-1" data-line-number="1">life_expectancy</a></code></pre></div>
 <pre><code># A tibble: 202 x 67
    country  `1951` `1952` `1953`  `1954` `1955` `1956` `1957` `1958` `1959`
    &lt;chr&gt;     &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;   &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;
@@ -986,31 +2428,520 @@ <h2><span class="header-section-number">D.4</span> Chapter 4 Solutions</h2>
 #   `2002` &lt;dbl&gt;, `2003` &lt;dbl&gt;, `2004` &lt;dbl&gt;, `2005` &lt;dbl&gt;, `2006` &lt;dbl&gt;,
 #   `2007` &lt;dbl&gt;, `2008` &lt;dbl&gt;, `2009` &lt;dbl&gt;, `2010` &lt;dbl&gt;, `2011` &lt;dbl&gt;,
 #   `2012` &lt;dbl&gt;, `2013` &lt;dbl&gt;, `2014` &lt;dbl&gt;, `2015` &lt;dbl&gt;, `2016` &lt;dbl&gt;</code></pre>
-<div class="sourceCode" id="cb606"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb606-1" data-line-number="1">life_expectancy_tidy</a></code></pre></div>
+<div class="sourceCode" id="cb638"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb638-1" data-line-number="1">life_expectancy_tidy</a></code></pre></div>
 <pre><code># A tibble: 13,332 x 3
-   country             year  life_expectancy
-   &lt;chr&gt;               &lt;chr&gt;           &lt;dbl&gt;
- 1 Afghanistan         1951          27.13  
- 2 Albania             1951          54.72  
- 3 Algeria             1951          43.03  
- 4 Angola              1951          31.05  
- 5 Antigua and Barbuda 1951          58.26  
- 6 Argentina           1951          61.93  
- 7 Armenia             1951          62.67  
- 8 Aruba               1951          58.96  
- 9 Australia           1951          68.710 
-10 Austria             1951          65.2400
+   country     year  life_expectancy
+   &lt;chr&gt;       &lt;chr&gt;           &lt;dbl&gt;
+ 1 Afghanistan 1951            27.13
+ 2 Afghanistan 1952            27.67
+ 3 Afghanistan 1953            28.19
+ 4 Afghanistan 1954            28.73
+ 5 Afghanistan 1955            29.27
+ 6 Afghanistan 1956            29.8 
+ 7 Afghanistan 1957            30.34
+ 8 Afghanistan 1958            30.86
+ 9 Afghanistan 1959            31.4 
+10 Afghanistan 1960            31.94
 # … with 13,322 more rows</code></pre>
 <hr />
 </div>
 <div id="chapter-5-solutions" class="section level2">
 <h2><span class="header-section-number">D.5</span> Chapter 5 Solutions</h2>
-<p>To come!</p>
-<div class="sourceCode" id="cb608"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb608-1" data-line-number="1"><span class="kw">library</span>(ggplot2)</a>
-<a class="sourceLine" id="cb608-2" data-line-number="2"><span class="kw">library</span>(dplyr)</a>
-<a class="sourceLine" id="cb608-3" data-line-number="3"><span class="kw">library</span>(moderndive)</a>
-<a class="sourceLine" id="cb608-4" data-line-number="4"><span class="kw">library</span>(gapminder)</a>
-<a class="sourceLine" id="cb608-5" data-line-number="5"><span class="co">#library(skimr)</span></a></code></pre></div>
+<div class="sourceCode" id="cb640"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb640-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb640-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb640-3" data-line-number="3"><span class="kw">library</span>(skimr)</a>
+<a class="sourceLine" id="cb640-4" data-line-number="4"><span class="kw">library</span>(gapminder)</a></code></pre></div>
+<p><strong>(LC5.1)</strong> Conduct a new exploratory data analysis with the same outcome variable <span class="math inline">\(y\)</span> being <code>score</code> but with <code>age</code> as the new explanatory variable <span class="math inline">\(x\)</span>. Remember, this involves three things:</p>
+<ol style="list-style-type: lower-alpha">
+<li>Looking at the raw data values.</li>
+<li>Computing summary statistics.</li>
+<li>Creating data visualizations.</li>
+</ol>
+<p>What can you say about the relationship between age and teaching scores based on this exploration?</p>
+<p><strong>Solution</strong>:</p>
+<ul>
+<li>Looking at the raw data values:</li>
+</ul>
+<div class="sourceCode" id="cb641"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb641-1" data-line-number="1"><span class="kw">glimpse</span>(evals_ch5)</a></code></pre></div>
+<pre><code>Observations: 463
+Variables: 4
+$ ID      &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
+$ score   &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
+$ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3…
+$ age     &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…</code></pre>
+<ul>
+<li>Computing summary statistics:</li>
+</ul>
+<div class="sourceCode" id="cb643"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb643-1" data-line-number="1"><span class="kw">skim_with</span>(<span class="dt">numeric =</span> <span class="kw">list</span>(<span class="dt">hist =</span> <span class="ot">NULL</span>), <span class="dt">integer =</span> <span class="kw">list</span>(<span class="dt">hist =</span> <span class="ot">NULL</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb644"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb644-1" data-line-number="1">evals_ch5 <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb644-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(score, age) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb644-3" data-line-number="3"><span class="st">  </span><span class="kw">skim</span>()</a></code></pre></div>
+<pre><code>Skim summary statistics
+ n obs: 463 
+ n variables: 2 
+
+── Variable type:integer ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ variable missing complete   n  mean  sd p0 p25 p50 p75 p100
+      age       0      463 463 48.37 9.8 29  42  48  57   73
+
+── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ variable missing complete   n mean   sd  p0 p25 p50 p75 p100
+    score       0      463 463 4.17 0.54 2.3 3.8 4.3 4.6    5</code></pre>
+<p>(Note that for formatting purposes, the inline histogram that is usually printed with <code>skim()</code> has been removed. You can do the same by running <code>skim_with(numeric = list(hist = NULL), integer = list(hist = NULL))</code> prior to using the <code>skim()</code> function.)</p>
+<ul>
+<li>Creating data visualizations:</li>
+</ul>
+<div class="sourceCode" id="cb646"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb646-1" data-line-number="1"><span class="kw">ggplot</span>(evals_ch5, <span class="kw">aes</span>(<span class="dt">x =</span> age, <span class="dt">y =</span> score)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb646-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb646-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Age&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Teaching Score&quot;</span>,</a>
+<a class="sourceLine" id="cb646-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Scatterplot of relationship of teaching score and age&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-600-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<!--
+TODO: Albert needs to double check interpretation:
+-->
+<p>Based on the scatterplot visualization, there seems to be a weak negative relationship between age and teaching score. As age increases, the teaching score seems to decrease slightly.</p>
+<p><strong>(LC5.2)</strong> Fit a new simple linear regression using <code>lm(score ~ age, data = evals_ch5)</code> where <code>age</code> is the new explanatory variable <span class="math inline">\(x\)</span>. Get information about the “best-fitting” line from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your earlier exploratory data analysis?</p>
+<p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb647"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb647-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb647-2" data-line-number="2">score_age_model &lt;-<span class="st"> </span><span class="kw">lm</span>(score <span class="op">~</span><span class="st"> </span>age, <span class="dt">data =</span> evals_ch5)</a>
+<a class="sourceLine" id="cb647-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb647-4" data-line-number="4"><span class="kw">get_regression_table</span>(score_age_model)</a></code></pre></div>
+<pre><code># A tibble: 2 x 7
+  term      estimate std_error statistic p_value lower_ci upper_ci
+  &lt;chr&gt;        &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt;
+1 intercept    4.462     0.127    35.195   0        4.213    4.711
+2 age         -0.006     0.003    -2.311   0.021   -0.011   -0.001</code></pre>
+<p><span class="math display">\[
+\begin{aligned}
+\widehat{y} &amp;= b_0 + b_1 \cdot x\\
+\widehat{\text{score}} &amp;= b_0 + b_{\text{age}} \cdot\text{age}\\
+&amp;= 4.462 - 0.006\cdot\text{age}
+\end{aligned}
+\]</span></p>
+<!--
+TODO: Albert will verify Starry's interpretation:
+-->
+<p>For every increase of 1 unit in <code>age</code>, there is an <em>associated</em> decrease of, <em>on average</em>, 0.006 units of <code>score</code>. This matches the weak negative relationship we saw in our earlier exploratory data analysis.</p>
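+<p>To make the fitted equation concrete, here is a small sketch that plugs a few ages into it. The coefficients are the rounded values from the regression table above, and the object names are hypothetical. Fitted values reported by functions that use the unrounded coefficients may differ slightly in the third decimal place.</p>

```r
# Rounded coefficients copied from the regression table
b0 <- 4.462
b_age <- -0.006

# A few ages that appear in evals_ch5
ages <- c(36, 51, 59)

# Fitted teaching scores from the equation: score_hat = b0 + b_age * age
predicted_score <- b0 + b_age * ages
predicted_score

# One extra year of age shifts the prediction by exactly the slope
one_year_change <- (b0 + b_age * 37) - (b0 + b_age * 36)
one_year_change
```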
+<p><strong>(LC5.3)</strong> Generate a data frame of the residuals of the model where you used <code>age</code> as the explanatory <span class="math inline">\(x\)</span> variable.</p>
+<p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb649"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb649-1" data-line-number="1">score_age_regression_points &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_age_model)</a>
+<a class="sourceLine" id="cb649-2" data-line-number="2">score_age_regression_points</a></code></pre></div>
+<pre><code># A tibble: 463 x 5
+      ID score   age score_hat residual
+   &lt;int&gt; &lt;dbl&gt; &lt;int&gt;     &lt;dbl&gt;    &lt;dbl&gt;
+ 1     1 4.7      36     4.248  0.452  
+ 2     2 4.100    36     4.248 -0.148  
+ 3     3 3.9      36     4.248 -0.34800
+ 4     4 4.8      36     4.248  0.552  
+ 5     5 4.600    59     4.112  0.488  
+ 6     6 4.3      59     4.112  0.188  
+ 7     7 2.8      59     4.112 -1.312  
+ 8     8 4.100    51     4.159 -0.059  
+ 9     9 3.4      51     4.159 -0.759  
+10    10 4.5      40     4.224  0.276  
+# … with 453 more rows</code></pre>
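+<p>Each value in the <code>residual</code> column is the observed <code>score</code> minus the fitted value <code>score_hat</code>. A quick check by hand for the first four rows, with values copied from the output above (the vector names are hypothetical):</p>

```r
# Observed and fitted values for the first four rows of the output
score <- c(4.7, 4.1, 3.9, 4.8)
score_hat <- c(4.248, 4.248, 4.248, 4.248)

# Residual = observed - fitted, rounded to 3 digits like the table
residual <- round(score - score_hat, 3)
residual
```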
+<p><strong>(LC5.4)</strong> Conduct a new exploratory data analysis with the same explanatory variable <span class="math inline">\(x\)</span> being <code>continent</code> but with <code>gdpPercap</code> as the new outcome variable <span class="math inline">\(y\)</span>. Remember, this involves three things:</p>
+<ol style="list-style-type: decimal">
+<li>Most crucially: Looking at the raw data values.</li>
+<li>Computing summary statistics, such as means, medians, and interquartile ranges.</li>
+<li>Creating data visualizations.</li>
+</ol>
+<p>What can you say about the differences in GDP per capita between continents based on this exploration?</p>
+<p><strong>Solution</strong>:</p>
+<ul>
+<li>Looking at the raw data values:</li>
+</ul>
+<div class="sourceCode" id="cb651"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb651-1" data-line-number="1"><span class="kw">glimpse</span>(gapminder2007)</a></code></pre></div>
+<pre><code>Observations: 142
+Variables: 4
+$ country   &lt;fct&gt; Afghanistan, Albania, Algeria, Angola, Argentina, Australia…
+$ lifeExp   &lt;dbl&gt; 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,…
+$ continent &lt;fct&gt; Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As…
+$ gdpPercap &lt;dbl&gt; 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33…</code></pre>
+<ul>
+<li>Computing summary statistics, such as means, medians, and interquartile ranges:</li>
+</ul>
+<div class="sourceCode" id="cb653"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb653-1" data-line-number="1">gapminder2007 <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb653-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(gdpPercap, continent) <span class="op">%&gt;%</span></a>
+<a class="sourceLine" id="cb653-3" data-line-number="3"><span class="st">  </span><span class="kw">skim</span>()</a></code></pre></div>
+<pre><code>Skim summary statistics
+ n obs: 142 
+ n variables: 2 
+
+── Variable type:factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+  variable missing complete   n n_unique                         top_counts
+ continent       0      142 142        5 Afr: 52, Asi: 33, Eur: 30, Ame: 25
+ ordered
+   FALSE
+
+── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+  variable missing complete   n     mean       sd     p0     p25     p50
+ gdpPercap       0      142 142 11680.07 12859.94 277.55 1624.84 6124.37
+      p75     p100
+ 18008.84 49357.19</code></pre>
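+<p>The <code>skim()</code> output above summarizes <code>gdpPercap</code> over all 142 countries at once. Since the question is about differences between continents, it can also help to compute the summary statistics separately for each continent. The following is a sketch under the assumption that <code>gapminder2007</code> is built from the <code>gapminder</code> data by keeping only the year 2007; the name <code>gdp_by_continent</code> is hypothetical.</p>

```r
library(dplyr)
library(gapminder)

# Assumed construction of gapminder2007: the 2007 rows of gapminder
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent, gdpPercap)

# Mean, median, and interquartile range of GDP per capita by continent
gdp_by_continent <- gapminder2007 %>%
  group_by(continent) %>%
  summarize(
    n = n(),
    mean_gdp = mean(gdpPercap),
    median_gdp = median(gdpPercap),
    iqr_gdp = IQR(gdpPercap)
  )
gdp_by_continent
```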
+<ul>
+<li>Creating data visualizations:</li>
+</ul>
+<div class="sourceCode" id="cb655"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb655-1" data-line-number="1"><span class="kw">ggplot</span>(gapminder2007, <span class="kw">aes</span>(<span class="dt">x =</span> continent, <span class="dt">y =</span> gdpPercap)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb655-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_boxplot</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb655-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Continent&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;GDP per capita&quot;</span>,</a>
+<a class="sourceLine" id="cb655-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;GDP by continent&quot;</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-605-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<!--
+TODO: Albert needs to double check interpretation:
+-->
+<p>Based on this exploration, GDP per capita differs substantially across continents, which suggests that continent might be a statistically significant predictor of a country’s GDP per capita.</p>
+<p><strong>(LC5.5)</strong> Fit a new linear regression using <code>lm(gdpPercap ~ continent, data = gapminder2007)</code> where <code>gdpPercap</code> is the new outcome variable <span class="math inline">\(y\)</span>. Get information about the “best-fitting” line from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your previous exploratory data analysis?</p>
+<p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb656"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb656-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb656-2" data-line-number="2">gdp_model &lt;-<span class="st"> </span><span class="kw">lm</span>(gdpPercap <span class="op">~</span><span class="st"> </span>continent, <span class="dt">data =</span> gapminder2007)</a>
+<a class="sourceLine" id="cb656-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb656-4" data-line-number="4"><span class="kw">get_regression_table</span>(gdp_model)</a></code></pre></div>
+<pre><code># A tibble: 5 x 7
+  term              estimate std_error statistic p_value  lower_ci upper_ci
+  &lt;chr&gt;                &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;     &lt;dbl&gt;    &lt;dbl&gt;
+1 intercept          3089.03   1372.74     2.25    0.026   374.538  5803.53
+2 continentAmericas  7914.00   2409.14     3.285   0.001  3150.08  12677.9 
+3 continentAsia      9383.99   2203.13     4.259   0      5027.46  13740.5 
+4 continentEurope   21965.4    2269.52     9.678   0     17477.6   26453.3 
+5 continentOceania  26721.2    7132.96     3.746   0     12616.2   40826.1 </code></pre>
+<p><span class="math display">\[
+\begin{aligned}
+\widehat{y} = \widehat{\text{gdpPercap}} &amp;= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\mbox{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\mbox{Asia}}(x) + \\
+&amp; \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\mbox{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\mbox{Ocean}}(x)\\
+&amp;= 3089 + 7914\cdot\mathbb{1}_{\mbox{Amer}}(x) + 9384\cdot\mathbb{1}_{\mbox{Asia}}(x) + \\
+&amp; \qquad 21965\cdot\mathbb{1}_{\mbox{Euro}}(x) + 26721\cdot\mathbb{1}_{\mbox{Ocean}}(x)
+\end{aligned}
+\]</span></p>
+<!--
+TODO: Albert will double check Starry's interpretation:
+-->
+<p>Our previous exploratory data analysis suggested that continent might be a statistically significant predictor of a country’s GDP per capita. Here, by fitting a new linear regression using <code>lm(gdpPercap ~ continent, data = gapminder2007)</code> where <code>gdpPercap</code> is the new outcome variable <span class="math inline">\(y\)</span>, we are able to write an equation that predicts <code>gdpPercap</code> from continent, and every continent term in the regression table has a small p-value. Therefore, the regression results match the results from our previous exploratory data analysis.</p>
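+<p>One way to connect the regression table to the exploratory analysis is to note that, with a single categorical explanatory variable, the intercept is the mean of the baseline group (Africa) and every other coefficient is an offset from that mean. The sketch below checks this numerically. It assumes <code>gapminder2007</code> is the 2007 subset of the <code>gapminder</code> data; <code>continent_means</code> and <code>implied_means</code> are hypothetical names.</p>

```r
library(dplyr)
library(gapminder)

# Assumed construction of gapminder2007: the 2007 rows of gapminder
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent, gdpPercap)

gdp_model <- lm(gdpPercap ~ continent, data = gapminder2007)

# Mean GDP per capita for each continent, computed directly
continent_means <- gapminder2007 %>%
  group_by(continent) %>%
  summarize(mean_gdp = mean(gdpPercap))

# Means implied by the model: intercept alone for the baseline group,
# intercept plus offset for every other continent
b <- coef(gdp_model)
implied_means <- unname(c(b[1], b[1] + b[-1]))

continent_means
implied_means
```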
+<p><strong>(LC5.6)</strong> Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter <a href="3-wrangling.html#wrangling">3</a>, identify the five countries with the five smallest (most negative) residuals. What do these negative residuals say about their life expectancy relative to their continents?</p>
+<p><strong>Solution</strong>:
+Using the sorting functionality of RStudio’s spreadsheet viewer, we can identify that the five countries with the five smallest (most negative) residuals are: Afghanistan, Swaziland, Mozambique, Haiti, and Zambia.</p>
+<p>These negative residuals indicate that these data points have the biggest negative deviations from their group means. This means that these five countries’ average life expectancies are the lowest compared to their respective continents’ average life expectancies. For example, the residual for Afghanistan is <span class="math inline">\(-26.900\)</span> and it is the smallest residual. This means that the average life expectancy of Afghanistan is <span class="math inline">\(26.900\)</span> years lower than the average life expectancy of its continent, Asia.</p>
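+<p>The same five countries can also be found with data wrangling tools instead of the spreadsheet viewer. The following is a sketch that assumes <code>gapminder2007</code> is the 2007 subset of the <code>gapminder</code> data and that the model is <code>lm(lifeExp ~ continent)</code>; the names <code>lifeExp_model</code> and <code>smallest_five</code> are hypothetical.</p>

```r
library(dplyr)
library(gapminder)

# Assumed construction of gapminder2007: the 2007 rows of gapminder
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent, gdpPercap)

# Life expectancy modeled by continent
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)

# Attach the residuals, sort in ascending order, keep the first five rows
smallest_five <- gapminder2007 %>%
  mutate(residual = as.numeric(residuals(lifeExp_model))) %>%
  arrange(residual) %>%
  slice(1:5)
smallest_five
```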
+<p><strong>(LC5.7)</strong> Repeat this process, but identify the five countries with the five largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents?</p>
+<p><strong>Solution</strong>:
+Using the sorting functionality of RStudio’s spreadsheet viewer, we can identify that the five countries with the five largest (most positive) residuals are: Reunion, Libya, Tunisia, Mauritius, and Algeria.</p>
+<p>These positive residuals indicate that these countries have the largest positive deviations from their group means. In other words, their life expectancies are the highest compared to their respective continents’ mean life expectancies. For example, the residual for Reunion is <span class="math inline">\(21.636\)</span>, the largest residual. This means that the life expectancy of Reunion is <span class="math inline">\(21.636\)</span> years higher than the mean life expectancy of its continent, Africa.</p>
+<p><strong>(LC5.8)</strong> Note in the following plot there are 3 points marked with dots along with:</p>
+<ul>
+<li>The “best” fitting solid regression line in blue</li>
+<li>An arbitrarily chosen dotted red line</li>
+<li>Another arbitrarily chosen dashed green line</li>
+</ul>
+<div class="figure" style="text-align: center"><span id="fig:unnamed-chunk-607"></span>
+<img src="ModernDive_files/figure-html/unnamed-chunk-607-1.png" alt="Regression line and two others." width="80%" />
+<p class="caption">
+FIGURE D.2: Regression line and two others.
+</p>
+</div>
+<p>Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value.</p>
+<p><strong>Solution</strong>:</p>
+<ul>
+<li>The “best” fitting solid regression line in blue:</li>
+</ul>
+<p><span class="math display">\[
+\sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = (2.0-1.5)^2+(0.50-2.0)^2+(3.0-2.5)^2=2.75
+\]</span></p>
+<ul>
+<li>An arbitrarily chosen dotted red line:</li>
+</ul>
+<p><span class="math display">\[
+\sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = (2.0-2.5)^2+(0.50-2.5)^2+(3.0-2.5)^2=4.5
+\]</span></p>
+<ul>
+<li>Another arbitrarily chosen dashed green line:</li>
+</ul>
+<p><span class="math display">\[
+\sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = (2.0-2.0)^2+(0.50-1.5)^2+(3.0-1.0)^2=5
+\]</span></p>
+<p>As calculated, <span class="math inline">\(2.75&lt;4.5&lt;5\)</span>. Therefore, we show that the regression line in blue has the smallest value of the residual sum of squares.</p>
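+<p>The hand calculation can be double-checked in R. This is a small sketch in which the observed and fitted values are simply copied from the three sums above; the helper name <code>sum_sq_resid()</code> is made up for illustration.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Observed y values of the 3 points
+y &lt;- c(2.0, 0.5, 3.0)
+
+# Fitted values read off each line at the 3 points
+fitted_blue  &lt;- c(1.5, 2.0, 2.5)  # solid regression line
+fitted_red   &lt;- c(2.5, 2.5, 2.5)  # dotted line
+fitted_green &lt;- c(2.0, 1.5, 1.0)  # dashed line
+
+# Sum of squared residuals: sum of (observed - fitted)^2
+sum_sq_resid &lt;- function(observed, fitted) sum((observed - fitted)^2)
+
+sums_of_squares &lt;- c(
+  blue  = sum_sq_resid(y, fitted_blue),
+  red   = sum_sq_resid(y, fitted_red),
+  green = sum_sq_resid(y, fitted_green)
+)
+sums_of_squares
+
+# Name of the line with the smallest sum of squared residuals
+names(which.min(sums_of_squares))</code></pre></div>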
+<hr />
+</div>
+<div id="chapter-6-solutions" class="section level2">
+<h2><span class="header-section-number">D.6</span> Chapter 6 Solutions</h2>
+<div class="sourceCode" id="cb658"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb658-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb658-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb658-3" data-line-number="3"><span class="kw">library</span>(skimr)</a>
+<a class="sourceLine" id="cb658-4" data-line-number="4"><span class="kw">library</span>(ISLR)</a></code></pre></div>
+<p><strong>(LC6.1)</strong> Compute the observed values, fitted values, and residuals not for the interaction model as we just did, but rather for the parallel slopes model we saved in <code>score_model_parallel_slopes</code>.</p>
+<p><strong>Solution</strong>:</p>
+<div class="sourceCode" id="cb659"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb659-1" data-line-number="1">regression_points_parallel &lt;-<span class="st"> </span><span class="kw">get_regression_points</span>(score_model_parallel_slopes)</a>
+<a class="sourceLine" id="cb659-2" data-line-number="2">regression_points_parallel</a></code></pre></div>
+<pre><code># A tibble: 463 x 6
+      ID score   age gender score_hat  residual
+   &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;fct&gt;      &lt;dbl&gt;     &lt;dbl&gt;
+ 1     1 4.7      36 female   4.172    0.528   
+ 2     2 4.100    36 female   4.172   -0.072000
+ 3     3 3.9      36 female   4.172   -0.272   
+ 4     4 4.8      36 female   4.172    0.628   
+ 5     5 4.600    59 male     4.163    0.437   
+ 6     6 4.3      59 male     4.163    0.137   
+ 7     7 2.8      59 male     4.163   -1.363   
+ 8     8 4.100    51 male     4.232   -0.132   
+ 9     9 3.4      51 male     4.232   -0.832   
+10    10 4.5      40 female   4.13700  0.363   
+# … with 453 more rows</code></pre>
+<p><strong>(LC6.2)</strong> Conduct a new exploratory data analysis with the same outcome variable <span class="math inline">\(y\)</span> being <code>debt</code> but with <code>credit_rating</code> and <code>age</code> as the new explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. Remember, this involves three things:</p>
+<ol style="list-style-type: decimal">
+<li>Most crucially: Looking at the raw data values.</li>
+<li>Computing summary statistics, such as means, medians, and interquartile ranges.</li>
+<li>Creating data visualizations.</li>
+</ol>
+<p>What can you say about the relationship between a credit card holder’s debt and their credit rating and age?</p>
+<p><strong>Solution</strong>:</p>
+<ul>
+<li>Most crucially: Looking at the raw data values.</li>
+</ul>
+<div class="sourceCode" id="cb661"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb661-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb661-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(debt, credit_rating, age) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb661-3" data-line-number="3"><span class="st">  </span><span class="kw">head</span>()</a></code></pre></div>
+<pre><code># A tibble: 6 x 3
+   debt credit_rating   age
+  &lt;int&gt;         &lt;int&gt; &lt;int&gt;
+1   333           283    34
+2   903           483    82
+3   580           514    71
+4   964           681    36
+5   331           357    68
+6  1151           569    77</code></pre>
+<ul>
+<li>Computing summary statistics, such as means, medians, and interquartile ranges.</li>
+</ul>
+<div class="sourceCode" id="cb663"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb663-1" data-line-number="1"><span class="kw">skim_with</span>(<span class="dt">numeric =</span> <span class="kw">list</span>(<span class="dt">hist =</span> <span class="ot">NULL</span>), <span class="dt">integer =</span> <span class="kw">list</span>(<span class="dt">hist =</span> <span class="ot">NULL</span>))</a></code></pre></div>
+<div class="sourceCode" id="cb664"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb664-1" data-line-number="1">credit_ch6 <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb664-2" data-line-number="2"><span class="st">  </span><span class="kw">select</span>(debt, credit_rating, age) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb664-3" data-line-number="3"><span class="st">  </span><span class="kw">skim</span>()</a></code></pre></div>
+<pre><code>Skim summary statistics
+ n obs: 400 
+ n variables: 3 
+
+── Variable type:integer ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+      variable missing complete   n   mean     sd p0    p25   p50    p75 p100
+           age       0      400 400  55.67  17.25 23  41.75  56    70      98
+ credit_rating       0      400 400 354.94 154.72 93 247.25 344   437.25  982
+          debt       0      400 400 520.01 459.76  0  68.75 459.5 863    1999</code></pre>
+<ul>
+<li>Creating data visualizations.</li>
+</ul>
+<div class="sourceCode" id="cb666"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb666-1" data-line-number="1"><span class="kw">ggplot</span>(credit_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> credit_rating, <span class="dt">y =</span> debt)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb666-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb666-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Credit rating&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Credit card debt (in $)&quot;</span>, </a>
+<a class="sourceLine" id="cb666-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Debt and credit rating&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb666-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-615-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<div class="sourceCode" id="cb667"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb667-1" data-line-number="1"><span class="kw">ggplot</span>(credit_ch6, <span class="kw">aes</span>(<span class="dt">x =</span> age, <span class="dt">y =</span> debt)) <span class="op">+</span></a>
+<a class="sourceLine" id="cb667-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
+<a class="sourceLine" id="cb667-3" data-line-number="3"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">x =</span> <span class="st">&quot;Age (in year)&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Credit card debt (in $)&quot;</span>, </a>
+<a class="sourceLine" id="cb667-4" data-line-number="4">       <span class="dt">title =</span> <span class="st">&quot;Debt and age&quot;</span>) <span class="op">+</span></a>
+<a class="sourceLine" id="cb667-5" data-line-number="5"><span class="st">  </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">&quot;lm&quot;</span>, <span class="dt">se =</span> <span class="ot">FALSE</span>)</a></code></pre></div>
+<p><img src="ModernDive_files/figure-html/unnamed-chunk-615-2.png" width="\textwidth" style="display: block; margin: auto;" />
+It seems that there is a positive relationship between one’s credit rating and their debt, and a slight negative relationship between one’s age and their debt.</p>
+<p><strong>(LC6.3)</strong> Fit a new multiple regression model using <code>lm(debt ~ credit_rating + age, data = credit_ch6)</code> where <code>credit_rating</code> and <code>age</code> are the new numerical explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>. Get information about the “best-fitting” regression plane from the regression table by applying the <code>get_regression_table()</code> function. How do the regression results match up with the results from your previous exploratory data analysis?</p>
+<div class="sourceCode" id="cb668"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb668-1" data-line-number="1"><span class="co"># Fit regression model:</span></a>
+<a class="sourceLine" id="cb668-2" data-line-number="2">debt_model_<span class="dv">2</span> &lt;-<span class="st"> </span><span class="kw">lm</span>(debt <span class="op">~</span><span class="st"> </span>credit_rating <span class="op">+</span><span class="st"> </span>age, <span class="dt">data =</span> credit_ch6)</a>
+<a class="sourceLine" id="cb668-3" data-line-number="3"><span class="co"># Get regression table:</span></a>
+<a class="sourceLine" id="cb668-4" data-line-number="4"><span class="kw">get_regression_table</span>(debt_model_<span class="dv">2</span>)</a></code></pre></div>
+<pre><code># A tibble: 3 x 7
+  term          estimate std_error statistic p_value lower_ci upper_ci
+  &lt;chr&gt;            &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt;
+1 intercept     -269.581    44.806    -6.017       0 -357.668 -181.494
+2 credit_rating    2.593     0.074    34.84        0    2.447    2.74 
+3 age             -2.351     0.668    -3.521       0   -3.663   -1.038</code></pre>
+<p>The coefficients for the two numerical explanatory variables <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>, <code>credit_rating</code> and <code>age</code>, are <span class="math inline">\(2.59\)</span> and <span class="math inline">\(-2.35\)</span> respectively. Taking the other variable into account, <code>debt</code> is therefore positively associated with <code>credit_rating</code> and negatively associated with <code>age</code>. This matches up with the results from the previous exploratory data analysis.</p>
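+<p>Written out as an equation, with the coefficients copied from the <code>estimate</code> column of the regression table, the fitted regression plane is:</p>
+<p><span class="math display">\[
+\widehat{\mbox{debt}} = -269.581 + 2.593\cdot\mbox{credit rating} - 2.351\cdot\mbox{age}
+\]</span></p>
+<p>As an illustrative check, plugging in the median credit rating (344) and the median age (56) from the summary statistics gives a fitted debt of <span class="math inline">\(-269.581 + 2.593\cdot 344 - 2.351\cdot 56 \approx 490.76\)</span> dollars.</p>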
+<hr />
+</div>
+<div id="chapter-7-solutions" class="section level2">
+<h2><span class="header-section-number">D.7</span> Chapter 7 Solutions</h2>
+<div class="sourceCode" id="cb670"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb670-1" data-line-number="1"><span class="kw">library</span>(ggplot2)</a>
+<a class="sourceLine" id="cb670-2" data-line-number="2"><span class="kw">library</span>(dplyr)</a>
+<a class="sourceLine" id="cb670-3" data-line-number="3"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb670-4" data-line-number="4"><span class="kw">library</span>(gapminder)</a>
+<a class="sourceLine" id="cb670-5" data-line-number="5"><span class="kw">library</span>(skimr)</a></code></pre></div>
+<p><strong>(LC7.1)</strong> Why was it important to mix the bowl before we sampled the balls?</p>
+<p><strong>Solution</strong>:</p>
+<p>Mixing ensures that every ball has an equal chance of being scooped up, so that each sample is random.</p>
+<p><strong>(LC7.2)</strong> Why is it that our 33 groups of friends did not all have the same numbers of balls that were red out of 50, and hence different proportions red?</p>
+<p><strong>Solution</strong>:</p>
+<p>Because each group scooped up a different random set of 50 balls from the bowl, the number of red balls differed from sample to sample. This is the effect of sampling variation.</p>
+<p><strong>(LC7.3)</strong> Why couldn’t we study the effects of sampling variation when we used the virtual shovel only once? Why did we need to take more than one virtual sample (in our case 33 virtual samples)?</p>
+<p><strong>Solution</strong>:</p>
+<p>If we use the virtual shovel only once, we only get one sample of the population. We need to take more than one virtual sample to get a range of proportions.</p>
+<p><strong>(LC7.4)</strong> Why did we not take 1000 “tactile” samples of 50 balls by hand?</p>
+<p><strong>Solution</strong>:</p>
+<p>That would be way too much repeated work.</p>
+<p><strong>(LC7.5)</strong> Looking at Figure <a href="7-sampling.html#fig:samplingdistribution-virtual-1000">7.10</a>, would you say that sampling 50 balls where 30% of them were red is likely or not? What about sampling 50 balls where 10% of them were red?</p>
+<p><strong>Solution</strong>:</p>
+<p>According to the Figure, less than 150 out of the 1000 counts were 30% red. So I would say that sampling 50 balls where 30% of them were red is not very likely. Almost no count was only 10% red, so sampling 50 balls where 10% of them were red is extremely unlikely.</p>
+<p><strong>(LC7.6)</strong> In Figure <a href="7-sampling.html#fig:comparing-sampling-distributions">7.12</a>, we used shovels to take 1000 samples each, computed the resulting 1000 proportions of the shovel’s balls that were red, and then visualized the distribution of these 1000 proportions in a histogram. We did this for shovels with 25, 50, and 100 slots in them. As the size of the shovels increased, the histograms got narrower. In other words, as the size of the shovels increased from 25 to 50 to 100, did the 1000 proportions</p>
+<ul>
+<li>A. vary less,</li>
+<li>B. vary by the same amount, or</li>
+<li><p>C. vary more?</p>
+<p><strong>Solution</strong>:</p>
+<p>A. As the histograms got narrower, the 1000 proportions varied less.</p></li>
+</ul>
+<p><strong>(LC7.7)</strong> What summary statistic did we use to quantify how much the 1000 proportions red varied?</p>
+<ul>
+<li>A. The inter-quartile range</li>
+<li>B. The standard deviation</li>
+<li>C. The range: the largest value minus the smallest.</li>
+</ul>
+<p><strong>Solution</strong>:</p>
+<p>B. The standard deviation is used to quantify how much a set of data varies.</p>
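+<p>The answers to (LC7.6) and (LC7.7) can be explored with virtual shovels. The following is a rough sketch, assuming the <code>bowl</code> data frame and <code>rep_sample_n()</code> from the <code>moderndive</code> package; the helper name <code>sd_of_prop_red()</code> and the seed are arbitrary choices made for illustration.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(moderndive)
+
+# Illustrative helper: take `reps` virtual samples of a given shovel size,
+# compute the proportion red in each, and return the standard deviation
+# of those proportions
+sd_of_prop_red &lt;- function(shovel_size, reps = 1000) {
+  bowl %&gt;%
+    rep_sample_n(size = shovel_size, reps = reps) %&gt;%
+    group_by(replicate) %&gt;%
+    summarize(prop_red = mean(color == "red")) %&gt;%
+    summarize(sd_prop_red = sd(prop_red)) %&gt;%
+    pull(sd_prop_red)
+}
+
+set.seed(76)  # arbitrary seed, only for reproducibility
+sd_by_size &lt;- sapply(c(25, 50, 100), sd_of_prop_red)
+sd_by_size</code></pre></div>
+<p>The three standard deviations should decrease as the shovel size increases, which is the numerical counterpart of the histograms getting narrower.</p>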
+<p><strong>(LC7.8)</strong> In the case of our bowl activity, what is the <em>population parameter</em>? Do we know its value?</p>
+<p><strong>Solution</strong>:</p>
+<!--
+  TODO: Albert needs to double check this ans:
+  -->
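+<p>The <em>population parameter</em> in the case of our bowl activity is the population proportion <span class="math inline">\(p\)</span>: the proportion of the bowl’s balls that are red. We cannot know its value from a sample alone; we could only learn it exactly by performing a census, that is, by counting all of the bowl’s balls.</p>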
+<p><strong>(LC7.9)</strong> What would performing a census in our bowl activity correspond to? Why did we not perform a census?</p>
+<p><strong>Solution</strong>:</p>
+<p>Performing a census in our bowl activity corresponds to going through all of the bowl’s balls and counting how many are red. We did not perform a census because it would be too much repetitive work, and a random sample already gives a good estimate.</p>
+<p><strong>(LC7.10)</strong> What purpose do <em>point estimates</em> serve in general? What is the name of the point estimate specific to our bowl activity? What is its mathematical notation?</p>
+<p><strong>Solution</strong>:</p>
+<p><em>Point estimates</em> serve to <em>estimate</em> an unknown population parameter using a sample. In our bowl activity, our point estimate is the <em>sample proportion</em>: the proportion of the shovel’s balls that are red. We mathematically denote the sample proportion using <span class="math inline">\(\widehat{p}\)</span>.</p>
+<p><strong>(LC7.11)</strong> How did we ensure that our tactile samples using the shovel were random?</p>
+<p><strong>Solution</strong>:</p>
+<p>We mixed the bowl’s balls before each use of the shovel.</p>
+<p><strong>(LC7.12)</strong> Why is it important that sampling be done <em>at random</em>?</p>
+<p><strong>Solution</strong>:</p>
+<p>So that every ball has an equal chance of being selected. This makes each sample representative of the population and the resulting estimates unbiased.</p>
+<p><strong>(LC7.13)</strong> What are we <em>inferring</em> about the bowl based on the samples using the shovel?</p>
+<p><strong>Solution</strong>:</p>
+<p>We are <em>inferring</em> the proportion of all the balls in the bowl that are red, based on the proportion red in the samples.</p>
+<p><strong>(LC7.14)</strong> What purpose did the <em>sampling distributions</em> serve?</p>
+<p><strong>Solution</strong>:</p>
+<p>Using the sampling distributions, for a given sample size <span class="math inline">\(n\)</span>, we can make statements about what values we can typically expect.</p>
+<p><strong>(LC7.15)</strong> What does the <em>standard error</em> of the sample proportion <span class="math inline">\(\widehat{p}\)</span> quantify?</p>
+<p><strong>Solution</strong>:</p>
+<p>Standard errors quantify the effect of sampling variation induced on our estimates.</p>
+<p><strong>(LC7.16)</strong> The table that follows is a version of Table <a href="7-sampling.html#tab:comparing-n-2">7.3</a> matching sample sizes <span class="math inline">\(n\)</span> to different <em>standard errors</em> of the sample proportion <span class="math inline">\(\widehat{p}\)</span>, but with the rows randomly re-ordered and the sample sizes removed. Fill in the table by matching the correct sample sizes to the correct standard errors.</p>
+<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
+<thead>
+<tr>
+<th>
+Sample size
+</th>
+<th>
+Standard error of <span class="math inline">\(\widehat{p}\)</span>
+</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>
+n =
+</td>
+<td>
+0.094
+</td>
+</tr>
+<tr>
+<td>
+n =
+</td>
+<td>
+0.045
+</td>
+</tr>
+<tr>
+<td>
+n =
+</td>
+<td>
+0.069
+</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Solution</strong>:</p>
+<p><span class="math inline">\(n\)</span> = <span class="math inline">\(25\)</span>, <span class="math inline">\(100\)</span>, <span class="math inline">\(50\)</span> respectively.</p>
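+<p>The matching relies on the fact that the standard error shrinks as the sample size grows. As a point of reference (a standard theoretical approximation, not a formula derived in this solution), the standard error of a sample proportion is approximately:</p>
+<p><span class="math display">\[
+\mbox{SE}_{\widehat{p}} \approx \sqrt{\frac{p(1-p)}{n}}
+\]</span></p>
+<p>Under this approximation, quadrupling the sample size from <span class="math inline">\(n = 25\)</span> to <span class="math inline">\(n = 100\)</span> should roughly halve the standard error, which agrees with the table: <span class="math inline">\(0.094 / 0.045 \approx 2.1\)</span>.</p>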
+<p>For the following four learning checks, let the <em>estimate</em> be the sample proportion <span class="math inline">\(\widehat{p}\)</span>: the proportion of a shovel’s balls that were red. It estimates the population proportion <span class="math inline">\(p\)</span>: the proportion of the bowl’s balls that were red.</p>
+<p><strong>(LC7.17)</strong> What is the difference between an <em>accurate</em> estimate and a <em>precise</em> estimate?</p>
+<p><strong>Solution</strong>:</p>
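+<p>An <em>accurate</em> estimate is “on target”: across repeated samples, the estimates are centered on the true value of the population parameter. A <em>precise</em> estimate is “consistent”: across repeated samples, the estimates are close to one another, whether or not they are close to the true value.</p>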
+<p><strong>(LC7.18)</strong> How do we ensure that an estimate is <em>accurate</em>? How do we ensure that an estimate is <em>precise</em>?</p>
+<p><strong>Solution</strong>:</p>
+<p>To ensure that an estimate is <em>accurate</em>, we need to sample at random, so that the estimates are unbiased and centered on the true value. To ensure that an estimate is <em>precise</em>, we need a large sample size, so that the estimates vary less from sample to sample.</p>
+<p><strong>(LC7.19)</strong> In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then, what was the purpose of our exercises where we took 1000 different samples?</p>
+<p><strong>Solution</strong>:</p>
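+<p>To study the effect of sampling variation: taking 1000 samples lets us see how much the estimates vary from sample to sample, and how that variation depends on the sample size.</p>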
+<p><strong>(LC7.20)</strong> Figure <a href="7-sampling.html#fig:accuracy-vs-precision">7.16</a> with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding <em>sampling distributions</em> of the sample proportion <span class="math inline">\(\widehat{p}\)</span>, like the one in the left-most plot in Figure <a href="7-sampling.html#fig:comparing-sampling-distributions-3">7.15</a>.</p>
+<p><strong>Solution</strong>:
+<img src="ModernDive_files/figure-html/unnamed-chunk-619-1.png" width="\textwidth" style="display: block; margin: auto;" /></p>
+<p>Comment on the representativeness of the following <em>sampling methodologies</em>:</p>
+<p><strong>(LC7.21)</strong> The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force).</p>
+<p><strong>Solution</strong>:</p>
+<p>The airplanes on the tarmac after an air battle against the Luftwaffe are not a good representation of all airplanes, because the airplanes that were hit in less resistant areas did not make it back to the tarmac. This is called <em>survivorship bias</em> (also known as survival bias): the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to false conclusions in several different ways. It is a form of selection bias.</p>
+<p><strong>(LC7.22)</strong> Imagine it is 1993, a time when almost all households had landlines. You want to know the average number of people in each household in your city. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey.</p>
+<p><strong>Solution</strong>:</p>
+<p>This is not a good representation, because: (1) adults are more likely to pick up phone calls; (2) households with more people are more likely to have someone available to pick up phone calls; (3) we are not certain whether all households are in the phone book.</p>
+<p><strong>(LC7.23)</strong> You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them, “How many times did you download a pirated TV show last week?”.</p>
+<p><strong>Solution</strong>:</p>
+<p>This is not a good representation, because it is very likely that students will lie in this survey to stay out of trouble, so we may not get honest data. In addition, the students who choose to reply may differ from those who do not. The latter is called <em>volunteer bias</em>: systematic error due to differences between those who choose to participate in studies and those who do not.</p>
+<p><strong>(LC7.24)</strong> A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of five randomly chosen graduates, contact them, and obtain their answers.</p>
+<p><strong>Solution</strong>:</p>
+<p>Because the five graduates are randomly chosen, the sample is representative of all graduates. However, the sample size is too small, so the resulting estimate of the average income is not precise.</p>
+<hr />
+</div>
+<div id="chapter-8-solutions" class="section level2">
+<h2><span class="header-section-number">D.8</span> Chapter 8 Solutions</h2>
+<div class="sourceCode" id="cb671"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb671-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb671-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb671-3" data-line-number="3"><span class="kw">library</span>(infer)</a></code></pre></div>
+<p><strong>(LC8.1)</strong> What is the chief difference between a bootstrap distribution and a sampling distribution?</p>
+<p><strong>Solution</strong>:</p>
+<p>A sampling distribution is built from many samples, each drawn from the population. A bootstrap distribution is built from a single original sample: bootstrapping is a type of resampling where large numbers of resamples of the same size as the original sample are repeatedly drawn, with replacement, from that one sample.</p>
+<p><strong>(LC8.2)</strong> Looking at the bootstrap distribution for the sample mean in Figure <a href="8-confidence-intervals.html#fig:one-thousand-sample-means">8.14</a>, between what two values would you say <em>most</em> values lie?</p>
+<p><strong>Solution</strong>:</p>
+<p><em>Most</em> values lie between 1990 and 2000.</p>
+<p><strong>(LC8.3)</strong> What condition about the bootstrap distribution must be met for us to be able to construct confidence intervals using the standard error method?</p>
+<p><strong>Solution</strong>:</p>
+<p>We can only use the standard error rule when the bootstrap distribution is roughly normally distributed.</p>
+<p><strong>(LC8.4)</strong> Say we wanted to construct a 68% confidence interval instead of a 95% confidence interval for <span class="math inline">\(\mu\)</span>. Describe what changes are needed to make this happen. Hint: we suggest you look at Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a> on the normal distribution.</p>
+<p><strong>Solution</strong>:</p>
+<p>Using our 68% rule of thumb about normal distributions from Appendix <a href="A-appendixA.html#appendix-normal-curve">A.2</a>, the only change needed is the multiplier on the standard error, which becomes 1. We can use the following formula to determine the lower and upper endpoints of a 68% confidence interval for <span class="math inline">\(\mu\)</span>:</p>
+<p><span class="math display">\[\overline{x} \pm 1 \cdot SE = (\overline{x} - 1 \cdot SE, \overline{x} + 1 \cdot SE)\]</span></p>
+<p><strong>(LC8.5)</strong> Construct a 95% confidence interval for the <em>median</em> year of minting of <em>all</em> US pennies. Use the percentile method and, if appropriate, then use the standard-error method.</p>
+<p><strong>Solution</strong>:</p>
+<p>Using the percentile method:</p>
+<div class="sourceCode" id="cb672"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb672-1" data-line-number="1">bootstrap_distribution &lt;-<span class="st"> </span>pennies_sample <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb672-2" data-line-number="2"><span class="st">  </span><span class="kw">specify</span>(<span class="dt">response =</span> year) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb672-3" data-line-number="3"><span class="st">  </span><span class="kw">generate</span>(<span class="dt">reps =</span> <span class="dv">1000</span>, <span class="dt">type =</span> <span class="st">&quot;bootstrap&quot;</span>) <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb672-4" data-line-number="4"><span class="st">  </span><span class="kw">calculate</span>(<span class="dt">stat =</span> <span class="st">&quot;median&quot;</span>)</a>
+<a class="sourceLine" id="cb672-5" data-line-number="5">percentile_ci &lt;-<span class="st"> </span>bootstrap_distribution <span class="op">%&gt;%</span><span class="st"> </span></a>
+<a class="sourceLine" id="cb672-6" data-line-number="6"><span class="st">  </span><span class="kw">get_confidence_interval</span>(<span class="dt">level =</span> <span class="fl">0.95</span>, <span class="dt">type =</span> <span class="st">&quot;percentile&quot;</span>)</a>
+<a class="sourceLine" id="cb672-7" data-line-number="7">percentile_ci</a></code></pre></div>
+<pre><code># A tibble: 1 x 2
+  `2.5%` `97.5%`
+   &lt;dbl&gt;   &lt;dbl&gt;
+1   1988    2000</code></pre>
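+<p>The question also asks for the standard-error method, if appropriate. The following is only a sketch of how that could look with the <code>infer</code> package; the seed and the object names <code>x_median</code> and <code>standard_error_ci</code> are illustrative. The standard-error method is appropriate only when the bootstrap distribution is roughly bell-shaped, which may not hold for a median, so the shape should be inspected first.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
+library(infer)
+library(moderndive)
+
+set.seed(2019)  # arbitrary seed, only for reproducibility
+
+# Bootstrap distribution of the sample median
+bootstrap_distribution &lt;- pennies_sample %&gt;%
+  specify(response = year) %&gt;%
+  generate(reps = 1000, type = "bootstrap") %&gt;%
+  calculate(stat = "median")
+
+# Inspect the shape before trusting the standard-error method
+visualize(bootstrap_distribution)
+
+# Point estimate: the median year in the original sample
+x_median &lt;- pennies_sample %&gt;%
+  summarize(stat = median(year)) %&gt;%
+  pull(stat)
+
+# Standard-error interval centered at the point estimate
+standard_error_ci &lt;- bootstrap_distribution %&gt;%
+  get_confidence_interval(level = 0.95, type = "se", point_estimate = x_median)
+standard_error_ci</code></pre></div>
+<p>If the histogram is clearly not bell-shaped, the percentile interval above is the safer choice.</p>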
+<hr />
+</div>
+<div id="chapter-9-solutions" class="section level2">
+<h2><span class="header-section-number">D.9</span> Chapter 9 Solutions</h2>
+<div class="sourceCode" id="cb674"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb674-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb674-2" data-line-number="2"><span class="kw">library</span>(infer)</a>
+<a class="sourceLine" id="cb674-3" data-line-number="3"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb674-4" data-line-number="4"><span class="kw">library</span>(nycflights13)</a>
+<a class="sourceLine" id="cb674-5" data-line-number="5"><span class="kw">library</span>(ggplot2movies)</a></code></pre></div>
+<hr />
+</div>
+<div id="chapter-10-solutions" class="section level2">
+<h2><span class="header-section-number">D.10</span> Chapter 10 Solutions</h2>
+<div class="sourceCode" id="cb675"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb675-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb675-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb675-3" data-line-number="3"><span class="kw">library</span>(infer)</a></code></pre></div>
+<hr />
+</div>
+<div id="chapter-11-solutions" class="section level2">
+<h2><span class="header-section-number">D.11</span> Chapter 11 Solutions</h2>
+<div class="sourceCode" id="cb676"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb676-1" data-line-number="1"><span class="kw">library</span>(tidyverse)</a>
+<a class="sourceLine" id="cb676-2" data-line-number="2"><span class="kw">library</span>(moderndive)</a>
+<a class="sourceLine" id="cb676-3" data-line-number="3"><span class="kw">library</span>(skimr)</a>
+<a class="sourceLine" id="cb676-4" data-line-number="4"><span class="kw">library</span>(fivethirtyeight)</a></code></pre></div>
 
 </div>
 </div>
@@ -1025,11 +2956,13 @@ <h2><span class="header-section-number">D.5</span> Chapter 5 Solutions</h2>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1037,12 +2970,11 @@ <h2><span class="header-section-number">D.5</span> Chapter 5 Solutions</h2>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1057,6 +2989,10 @@ <h2><span class="header-section-number">D.5</span> Chapter 5 Solutions</h2>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
diff --git a/docs/E-appendixE.html b/docs/E-appendixE.html
index 1442ffb45..4eafbae9e 100644
--- a/docs/E-appendixE.html
+++ b/docs/E-appendixE.html
@@ -4,35 +4,35 @@
 
   <meta charset="utf-8" />
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
-  <title>E Information about R Packages Used | Statistical Inference via Data Science</title>
+  <title>E Versions of R Packages Used | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
-  <meta property="og:title" content="E Information about R Packages Used | Statistical Inference via Data Science" />
+  <meta property="og:title" content="E Versions of R Packages Used | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
-  <meta name="twitter:title" content="E Information about R Packages Used | Statistical Inference via Data Science" />
+  <meta name="twitter:title" content="E Versions of R Packages Used | Statistical Inference via Data Science" />
   <meta name="twitter:site" content="@ModernDive" />
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="D-appendixD.html">
-<link rel="next" href="references.html">
+<link rel="prev" href="D-appendixD.html"/>
+<link rel="next" href="references.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -569,9 +582,14 @@ <h1>
 <img src='https://moderndive.com/wide_format.png' alt="ModernDive">
 </html>
 <div id="appendixE" class="section level1">
-<h1><span class="header-section-number">E</span> Information about R Packages Used</h1>
-<p>This book uses the following versions of R packages (and their dependent packages). If you are seeing results slightly different than what is shown in the book and you want to get a closer match, we recommend you install the particular version of the package we used. This can be done by first installing the <code>remotes</code> package via <code>install.packages(&quot;remotes&quot;)</code> and then the particular version of a package using syntax similar to the following replacing the <code>package</code> argument with the name of the package in quotes and the <code>version</code> argument with the particular number of the version to install.</p>
-<pre class="sourceCode r"><code class="sourceCode r">remotes<span class="op">::</span><span class="kw">install_version</span>(<span class="dt">package =</span> <span class="st">&quot;moderndive&quot;</span>, <span class="dt">version =</span> <span class="st">&quot;0.3.0&quot;</span>)</code></pre>
+<h1><span class="header-section-number">E</span> Versions of R Packages Used</h1>
+<p>If your results differ from those shown in the book, we recommend installing the exact versions of the packages we used. To do so, first install the <code>remotes</code> package via <code>install.packages(&quot;remotes&quot;)</code>. Then call <code>install_version()</code>, setting the <code>package</code> argument to the package name in quotes and the <code>version</code> argument to the version number to install.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a></p>
+<div class="sourceCode" id="cb677"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb677-1" data-line-number="1">remotes<span class="op">::</span><span class="kw">install_version</span>(<span class="dt">package =</span> <span class="st">&quot;skimr&quot;</span>, <span class="dt">version =</span> <span class="st">&quot;1.0.6&quot;</span>)</a></code></pre></div>
+<!--
+\begin{multicols}{2}
+\setbox\ltmcbox\vbox{
+\makeatletter\col@number\@ne
+-->
 <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
@@ -586,50 +604,10 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 <tbody>
 <tr>
 <td style="text-align:left;">
-askpass
-</td>
-<td style="text-align:left;">
-1.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-assertthat
-</td>
-<td style="text-align:left;">
-0.2.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-backports
-</td>
-<td style="text-align:left;">
-1.1.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-base64enc
-</td>
-<td style="text-align:left;">
-0.1-3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-BH
-</td>
-<td style="text-align:left;">
-1.69.0-1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-brew
+bookdown
 </td>
 <td style="text-align:left;">
-1.0-6
+0.16
 </td>
 </tr>
 <tr>
@@ -642,118 +620,6 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 </tr>
 <tr>
 <td style="text-align:left;">
-callr
-</td>
-<td style="text-align:left;">
-3.3.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-cellranger
-</td>
-<td style="text-align:left;">
-1.1.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-cli
-</td>
-<td style="text-align:left;">
-1.1.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-clipr
-</td>
-<td style="text-align:left;">
-0.6.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-clisymbols
-</td>
-<td style="text-align:left;">
-1.2.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-colorspace
-</td>
-<td style="text-align:left;">
-1.4-1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-commonmark
-</td>
-<td style="text-align:left;">
-1.7
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-crayon
-</td>
-<td style="text-align:left;">
-1.3.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-curl
-</td>
-<td style="text-align:left;">
-4.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-DBI
-</td>
-<td style="text-align:left;">
-1.0.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-dbplyr
-</td>
-<td style="text-align:left;">
-1.4.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-desc
-</td>
-<td style="text-align:left;">
-1.2.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-devtools
-</td>
-<td style="text-align:left;">
-2.1.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-digest
-</td>
-<td style="text-align:left;">
-0.6.20
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
 dplyr
 </td>
 <td style="text-align:left;">
@@ -770,34 +636,10 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 </tr>
 <tr>
 <td style="text-align:left;">
-ellipsis
-</td>
-<td style="text-align:left;">
-0.2.0.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-evaluate
-</td>
-<td style="text-align:left;">
-0.14
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-fansi
-</td>
-<td style="text-align:left;">
-0.4.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
 fivethirtyeight
 </td>
 <td style="text-align:left;">
-0.4.0
+0.5.0
 </td>
 </tr>
 <tr>
@@ -810,22 +652,6 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 </tr>
 <tr>
 <td style="text-align:left;">
-formula.tools
-</td>
-<td style="text-align:left;">
-1.7.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-fs
-</td>
-<td style="text-align:left;">
-1.3.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
 gapminder
 </td>
 <td style="text-align:left;">
@@ -834,14 +660,6 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 </tr>
 <tr>
 <td style="text-align:left;">
-generics
-</td>
-<td style="text-align:left;">
-0.0.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
 ggplot2
 </td>
 <td style="text-align:left;">
@@ -858,143 +676,95 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 </tr>
 <tr>
 <td style="text-align:left;">
-ggrepel
-</td>
-<td style="text-align:left;">
-0.8.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-gh
-</td>
-<td style="text-align:left;">
-1.0.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-git2r
-</td>
-<td style="text-align:left;">
-0.26.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-glue
-</td>
-<td style="text-align:left;">
-1.3.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-gridExtra
-</td>
-<td style="text-align:left;">
-2.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-gtable
-</td>
-<td style="text-align:left;">
-0.3.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-haven
+infer
 </td>
 <td style="text-align:left;">
-2.1.0
+0.5.1
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-highr
+ISLR
 </td>
 <td style="text-align:left;">
-0.8
+1.2
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-hms
+janitor
 </td>
 <td style="text-align:left;">
-0.4.2
+1.2.0
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-htmltools
+kableExtra
 </td>
 <td style="text-align:left;">
-0.3.6
+1.1.0
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-htmlwidgets
+knitr
 </td>
 <td style="text-align:left;">
-1.3
+1.26
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-httr
+moderndive
 </td>
 <td style="text-align:left;">
-1.4.0
+0.4.0
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-infer
+mvtnorm
 </td>
 <td style="text-align:left;">
-0.4.1
+1.0-11
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-ini
+nycflights13
 </td>
 <td style="text-align:left;">
-0.3.1
+1.0.1
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-ISLR
+patchwork
 </td>
 <td style="text-align:left;">
-1.2
+0.0.1
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-janitor
+purrr
 </td>
 <td style="text-align:left;">
-1.2.0
+0.3.3
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-jsonlite
+readr
 </td>
 <td style="text-align:left;">
-1.6
+1.3.1
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-kableExtra
+scales
 </td>
 <td style="text-align:left;">
 1.1.0
@@ -1002,640 +772,80 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 </tr>
 <tr>
 <td style="text-align:left;">
-knitr
-</td>
-<td style="text-align:left;">
-1.23
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-labeling
+skimr
 </td>
 <td style="text-align:left;">
-0.3
+1.0.6
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-lattice
+stringr
 </td>
 <td style="text-align:left;">
-0.20-38
+1.4.0
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-lazyeval
+tibble
 </td>
 <td style="text-align:left;">
-0.2.2
+2.1.3
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-lubridate
+tidyr
 </td>
 <td style="text-align:left;">
-1.7.4
+1.0.0
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-magrittr
+tidyverse
 </td>
 <td style="text-align:left;">
-1.5
+1.3.0
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-markdown
+viridis
 </td>
 <td style="text-align:left;">
-1.0
+0.5.1
 </td>
 </tr>
 <tr>
 <td style="text-align:left;">
-MASS
+viridisLite
 </td>
 <td style="text-align:left;">
-7.3-51.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-Matrix
-</td>
-<td style="text-align:left;">
-1.2-17
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-memoise
-</td>
-<td style="text-align:left;">
-1.1.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-mgcv
-</td>
-<td style="text-align:left;">
-1.8-28
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-mime
-</td>
-<td style="text-align:left;">
-0.7
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-modelr
-</td>
-<td style="text-align:left;">
-0.1.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-moderndive
-</td>
-<td style="text-align:left;">
-0.3.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-munsell
-</td>
-<td style="text-align:left;">
-0.5.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-mvtnorm
-</td>
-<td style="text-align:left;">
-1.0-11
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-nlme
-</td>
-<td style="text-align:left;">
-3.1-139
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-nycflights13
-</td>
-<td style="text-align:left;">
-1.0.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-openssl
-</td>
-<td style="text-align:left;">
-1.4.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-operator.tools
-</td>
-<td style="text-align:left;">
-1.6.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-pander
-</td>
-<td style="text-align:left;">
-0.6.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-patchwork
-</td>
-<td style="text-align:left;">
-0.0.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-pillar
-</td>
-<td style="text-align:left;">
-1.4.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-pkgbuild
-</td>
-<td style="text-align:left;">
-1.0.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-pkgconfig
-</td>
-<td style="text-align:left;">
-2.0.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-pkgload
-</td>
-<td style="text-align:left;">
-1.0.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-plogr
-</td>
-<td style="text-align:left;">
-0.2.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-plyr
-</td>
-<td style="text-align:left;">
-1.8.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-praise
-</td>
-<td style="text-align:left;">
-1.0.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-prettyunits
-</td>
-<td style="text-align:left;">
-1.0.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-processx
-</td>
-<td style="text-align:left;">
-3.4.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-progress
-</td>
-<td style="text-align:left;">
-1.2.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-ps
-</td>
-<td style="text-align:left;">
-1.3.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-purrr
-</td>
-<td style="text-align:left;">
-0.3.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-R6
-</td>
-<td style="text-align:left;">
-2.4.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rcmdcheck
-</td>
-<td style="text-align:left;">
-1.3.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-RColorBrewer
-</td>
-<td style="text-align:left;">
-1.1-2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-Rcpp
-</td>
-<td style="text-align:left;">
-1.0.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-readr
-</td>
-<td style="text-align:left;">
-1.3.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-readxl
-</td>
-<td style="text-align:left;">
-1.3.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rematch
-</td>
-<td style="text-align:left;">
-1.0.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-remotes
-</td>
-<td style="text-align:left;">
-2.1.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-reprex
-</td>
-<td style="text-align:left;">
-0.3.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-reshape2
-</td>
-<td style="text-align:left;">
-1.4.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rlang
-</td>
-<td style="text-align:left;">
-0.4.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rmarkdown
-</td>
-<td style="text-align:left;">
-1.14
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-roxygen2
-</td>
-<td style="text-align:left;">
-6.1.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rprojroot
-</td>
-<td style="text-align:left;">
-1.3-2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rstudioapi
-</td>
-<td style="text-align:left;">
-0.10
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-rvest
-</td>
-<td style="text-align:left;">
-0.3.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-scales
-</td>
-<td style="text-align:left;">
-1.0.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-selectr
-</td>
-<td style="text-align:left;">
-0.4-1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-sessioninfo
-</td>
-<td style="text-align:left;">
-1.1.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-skimr
-</td>
-<td style="text-align:left;">
-1.0.7
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-snakecase
-</td>
-<td style="text-align:left;">
-0.11.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-stringi
-</td>
-<td style="text-align:left;">
-1.4.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-stringr
-</td>
-<td style="text-align:left;">
-1.4.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-sys
-</td>
-<td style="text-align:left;">
-3.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-testthat
-</td>
-<td style="text-align:left;">
-2.1.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-tibble
-</td>
-<td style="text-align:left;">
-2.1.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-tidyr
-</td>
-<td style="text-align:left;">
-0.8.3
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-tidyselect
-</td>
-<td style="text-align:left;">
-0.2.5
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-tidyverse
-</td>
-<td style="text-align:left;">
-1.2.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-tinytex
-</td>
-<td style="text-align:left;">
-0.14
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-usethis
-</td>
-<td style="text-align:left;">
-1.5.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-utf8
-</td>
-<td style="text-align:left;">
-1.1.4
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-vctrs
-</td>
-<td style="text-align:left;">
-0.2.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-viridis
-</td>
-<td style="text-align:left;">
-0.5.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-viridisLite
-</td>
-<td style="text-align:left;">
-0.3.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-webshot
-</td>
-<td style="text-align:left;">
-0.5.1
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-whisker
-</td>
-<td style="text-align:left;">
-0.3-2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-withr
-</td>
-<td style="text-align:left;">
-2.1.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-xfun
-</td>
-<td style="text-align:left;">
-0.8
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-xml2
-</td>
-<td style="text-align:left;">
-1.2.2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-xopen
-</td>
-<td style="text-align:left;">
-1.0.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-xts
-</td>
-<td style="text-align:left;">
-0.11-2
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-yaml
-</td>
-<td style="text-align:left;">
-2.2.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-zeallot
-</td>
-<td style="text-align:left;">
-0.1.0
-</td>
-</tr>
-<tr>
-<td style="text-align:left;">
-zoo
-</td>
-<td style="text-align:left;">
-1.8-6
+0.3.0
 </td>
 </tr>
 </tbody>
 </table>
 
+<!--
+% Wrap this after the table to get into multiple columns
+\unskip
+\unpenalty
+\unpenalty}
+\unvbox\ltmcbox
+
+\end{multicols}
+-->
 
+
+</div>
+<div class="footnotes">
+<hr />
+<ol start="2">
+<li id="fn2"><p>As of November 2019, the <code>patchwork</code> package is not on CRAN and needs to be installed via <code>remotes::install_github(&quot;thomasp85/patchwork&quot;)</code> instead of using <code>install_version()</code>.<a href="E-appendixE.html#fnref2" class="footnote-back">↩</a></p></li>
+</ol>
 </div>
             </section>
 
@@ -1648,11 +858,13 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -1660,12 +872,11 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -1680,6 +891,10 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -1696,8 +911,9 @@ <h1><span class="header-section-number">E</span> Information about R Packages Us
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/images/logos/Rlogo.png b/docs/images/logos/Rlogo.png
index 60f3b70a1..42b2e610b 100644
Binary files a/docs/images/logos/Rlogo.png and b/docs/images/logos/Rlogo.png differ
diff --git a/docs/images/logos/book_cover.png b/docs/images/logos/book_cover.png
index 075b4239e..e64686fc7 100644
Binary files a/docs/images/logos/book_cover.png and b/docs/images/logos/book_cover.png differ
diff --git a/docs/images/logos/book_cover_old.png b/docs/images/logos/book_cover_old.png
deleted file mode 100644
index f20fd9ef6..000000000
Binary files a/docs/images/logos/book_cover_old.png and /dev/null differ
diff --git a/docs/index.html b/docs/index.html
index f9911e30b..75f4bffee 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="Statistical Inference via Data Science" />
@@ -21,10 +21,10 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
@@ -32,7 +32,7 @@
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
 
-<link rel="next" href="1-getting-started.html">
+<link rel="next" href="foreword.html"/>
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -570,19 +583,16 @@ <h1>
 </html>
 <div id="header">
 <h1 class="title">Statistical Inference via Data Science</h1>
-<h2 class="subtitle"><em>A moderndive into R and the tidyverse</em></h2>
-<p class="author"><em>Chester Ismay and Albert Y. Kim</em></p>
-<p class="date"><em>August 28, 2019</em></p>
+<h2 class="subtitle"><em>A ModernDive into R and the tidyverse</em></h2>
+<p class="author"><em>Chester Ismay and Albert Y. Kim <br> Foreword by Kelly S. McConville</em></p>
+<p class="date"><em>November 25, 2019</em></p>
 </div>
-<div id="preface" class="section level1 unnumbered">
-<h1>Preface</h1>
-<h1>
-<br>Special Announcement</br>
-</h1>
+<div id="special-announcement" class="section level1 unnumbered">
+<h1>Special Announcement</h1>
 <!-- include=FALSE for PDF sending to CRC -->
 <div class="announcement">
 <p>
-<strong>We’re excited to announce that we’ve signed a book deal with CRC Press! We will be publishing our first fully complete online version of ModernDive in Summer 2019, with a corresponding print edition to follow in Fall 2019. Don’t worry though, our content will remain freely available on <a href="https://moderndive.com/">ModernDive.com</a>.</strong>
+<strong>We’re excited to announce that we’ve signed a book deal with CRC Press! We will be publishing our first fully complete online version of ModernDive in November 2019, with a corresponding print edition to follow in December 2019. Don’t worry though, our content will remain freely available on <a href="https://moderndive.com/">ModernDive.com</a>.</strong>
 </p>
 </div>
 <center>
@@ -590,292 +600,15 @@ <h1>
 </center>
 <!--
 <div class="announcement">
-<p><strong>This is a previous version (v<code>r version</code>) of ModernDive and may be out of date. For the current version of ModernDive, please go to <a href="https://moderndive.com/">ModernDive.com</a>.</strong></p>
+<p><strong>This is a previous version (v<code>r version</code>) of <em>ModernDive</em> and may be out of date. For the current version of <em>ModernDive</em>, please go to <a href="https://moderndive.com/">ModernDive.com</a>.</strong></p>
 </div>
 -->
-<div class="learncheck">
-<p>
-<strong>Please note that you are currently looking at the “development version” of ModernDive, which is a work in progress currently being edited and thus subject to frequent change. For the latest “released version” of ModernDive, which is updated around twice a year, please visit <a href="https://moderndive.com/">ModernDive.com</a>.</strong>
-</p>
-</div>
-
-<center>
-<img src="images/logos/Rlogo.png" height="100" />       <img src="images/logos/RStudio-Logo-Blue-Gradient.png" height="100" />
-</center>
-<p><strong>Help! I’m new to R and RStudio and I need to learn about them! However, I’m completely new to coding! What do I do?</strong></p>
-
-<!--
-<img src="images/logos/Rlogo.svg" style="height: 150px;"/>
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
-<img src="images/logos/RStudio-Logo-Blue-Gradient.png" style="height: 150px;"/>
--->
-<p>If you’re asking yourself this question, then you’ve come to the right place! Start with the “Introduction for students” section.</p>
-<ul>
-<li><em>Are you an instructor hoping to use this book in your courses? Then read the “Introduction for instructors” section for more information on how to teach with this book.</em></li>
-<li><em>Are you looking to connect with and contribute to ModernDive? Then read the “Connect and contribute” section for information on how.</em></li>
-<li><em>Are you curious about the publishing of this book? Then read the “About this book” section for more information on the open-source technology, in particular R Markdown and the bookdown package.</em></li>
-</ul>
-<p>This is version 0.6.1 of ModernDive published on August 28, 2019. For previous versions of ModernDive, see the “About this book” section below.</p>
-<div id="introduction-for-students" class="section level2 unnumbered">
-<h2>Introduction for students</h2>
-<p>This book assumes no prerequisites: no algebra, no calculus, and no prior programming/coding experience. This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.</p>
-<p>We present a map of your upcoming journey in Figure <a href="index.html#fig:moderndive-figure">0.1</a>.</p>
-<div class="figure" style="text-align: center"><span id="fig:moderndive-figure"></span>
-<img src="images/flowcharts/flowchart/flowchart.002.png" alt="ModernDive Flowchart." width="\textwidth" />
-<p class="caption">
-FIGURE 0.1: ModernDive Flowchart.
-</p>
-</div>
-<p>You’ll first get started with data in Chapter <a href="1-getting-started.html#getting-started">1</a> where you’ll learn about the difference between R and RStudio, start coding in R, install and load your first R packages, and explore your first dataset: all domestic departure <code>flights</code> from a New York City airport in 2013. Then you’ll cover the following three portions of this book:</p>
-<ol style="list-style-type: decimal">
-<li>Data science with <code>tidyverse</code>. You’ll assemble your data science toolbox using <code>tidyverse</code> packages. In particular you’ll
-<ul>
-<li>Ch.<a href="2-viz.html#viz">2</a>: Visualize data using the <code>ggplot2</code> package.</li>
-<li>Ch.<a href="3-wrangling.html#wrangling">3</a>: Wrangle data using the <code>dplyr</code> package.</li>
-<li>Ch.<a href="4-tidy.html#tidy">4</a>: Learn about the concept of “tidy” data as a standardized data frame input and output format for all packages in the <code>tidyverse</code>. Furthermore, you’ll learn how to import spreadsheet files into R using the <code>readr</code> package.</li>
-</ul></li>
-<li>Data modeling with <code>moderndive</code>. Using these data science tools and helper functions from the <code>moderndive</code> package, you’ll fit your first data models. In particular:
-<ul>
-<li>Ch.<a href="5-regression.html#regression">5</a>: Basic regression models with only one explanatory variable.</li>
-<li>Ch.<a href="6-multiple-regression.html#multiple-regression">6</a>: Multiple regression models with more than one explanatory variable.</li>
-</ul></li>
-<li>Statistical inference with <code>infer</code>. Once again using your newly acquired data science tools, you’ll unpack statistical inference using the <code>infer</code> package. In particular you’ll:
-<ul>
-<li>Ch.<a href="7-sampling.html#sampling">7</a>: Learn about the role that sampling variability plays in statistical inference and the role that sample size plays in sampling variability.</li>
-<li>Ch.<a href="8-confidence-intervals.html#confidence-intervals">8</a>: Construct confidence intervals.</li>
-<li>Ch.<a href="9-hypothesis-testing.html#hypothesis-testing">9</a>: Conduct hypothesis tests.</li>
-</ul></li>
-<li>Data modeling with <code>moderndive</code> (revisited): Armed with your understanding of statistical inference, you’ll revisit and review the models you’ll construct in Ch.<a href="5-regression.html#regression">5</a> &amp; Ch.<a href="6-multiple-regression.html#multiple-regression">6</a>. In particular you’ll:
-<ul>
-<li>Ch.<a href="10-inference-for-regression.html#inference-for-regression">10</a>: Interpret confidence intervals and hypothesis tests in a regression setting.</li>
-</ul></li>
-</ol>
-<p>We’ll end with a discussion on what it means to “tell the story with data” in Chapter <a href="11-thinking-with-data.html#thinking-with-data">11</a> by presenting example case studies.</p>
-<div id="what-we-hope-you-will-learn-from-this-book" class="section level3 unnumbered">
-<h3>What we hope you will learn from this book</h3>
-<p>We hope that by the end of this book, you’ll have learned how to</p>
-<ol style="list-style-type: decimal">
-<li>Use R and the <code>tidyverse</code> suite of R <em>packages</em> for data science.</li>
-<li>Fit your first <em>models</em> to data, using a method known as <em>linear regression</em>.</li>
-<li>Perform <em>statistical inference</em> using <em>confidence intervals</em> and <em>hypothesis tests</em>.</li>
-<li><em>Tell your story with data</em> using these tools.</li>
-</ol>
-<p>What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion. Further discussions on data stories can be found in the blogpost <a href="https://www.thinkwithgoogle.com/marketing-resources/data-measurement/tell-meaningful-stories-with-data/">“Tell a Meaningful Story With Data.”</a></p>
-<p>Over the course of this book, you will develop your “data science toolbox,” equipping yourself with tools such as data visualization, data formatting, data wrangling, and data modeling using regression.</p>
 <!--
-With these tools, you'll be able to perform the entirety of the "data/science pipeline" while building data communication skills (see the "Data/science pipeline" subsection below for more details). 
--->
-<p>In particular, this book will lean heavily on data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways of conveying relationships within data are. In general, we’ll use visualization as a way of building almost all of the ideas in this book.</p>
-<p>To impart the statistical lessons of this book, we have intentionally minimized the number of mathematical formulas used. Instead, you’ll develop a conceptual understanding of statistics using data visualization and computer simulations. We hope this is a more intuitive experience than the way statistics has traditionally been taught in the past and how it is commonly perceived.</p>
-<p>Finally, you’ll learn the importance of literate programming.  By this we mean you’ll learn how to write code that is useful not just for a computer to execute but also for readers to understand exactly what your analysis is doing and how you did it. This is part of a greater effort to encourage reproducible research (see the “Reproducible research” subsection for more details). Hal Abelson  coined the phrase that we will follow throughout this book:</p>
-<blockquote>
-<p>“Programs must be written for people to read, and only incidentally for machines to execute.”</p>
-</blockquote>
-<p>We understand that there may be challenging moments as you learn to program. Both of us continue to struggle and often find ourselves searching the web for answers and reaching out to colleagues for help. In the long run though, we all can solve problems faster and more elegantly via programming. We wrote this book as our way to help you get started and you should know that there is a huge community of R users that are always happy to help everyone along as well. This community exists in particular on the internet on various forums and websites such as <a href="https://stackoverflow.com/">stackoverflow.com</a>.</p>
-</div>
-<div id="datascience-pipeline" class="section level3 unnumbered">
-<h3>Data/science pipeline</h3>
-<p>You may think of statistics as just being a bunch of numbers. We commonly hear the phrase “statistician” when listening to broadcasts of sporting events. Statistics (in particular, data analysis), in addition to describing numbers like with baseball batting averages, plays a vital role in all of the sciences.  You’ll commonly hear the phrase “statistically significant” thrown around in the media. You’ll see articles that say “Science now shows that chocolate is good for you.” Underpinning these claims is data analysis.  By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be wary. Inside data analysis are many sub-fields that we will discuss throughout this book (though not necessarily in this order):</p>
-<ul>
-<li>data collection</li>
-<li>data wrangling</li>
-<li>data visualization</li>
-<li>data modeling</li>
-<li>inference</li>
-<li>correlation and regression</li>
-<li>interpretation of results</li>
-<li>data communication/storytelling</li>
-</ul>
-<p>These sub-fields are summarized in what Grolemund  and Wickham  term the <a href="http://r4ds.had.co.nz/explore-intro.html">“Data/Science Pipeline”</a> in Figure <a href="index.html#fig:pipeline-figure">0.2</a>.</p>
-<div class="figure" style="text-align: center"><span id="fig:pipeline-figure"></span>
-<img src="images/r4ds/data_science_pipeline.png" alt="Data/Science Pipeline." width="\textwidth" />
-<p class="caption">
-FIGURE 0.2: Data/Science Pipeline.
-</p>
-</div>
-<p>We will begin by digging into the gray <strong>Understand</strong> portion of the cycle with data visualization, then with a discussion on what is meant by tidy data and data wrangling, and then conclude by talking about interpreting and discussing the results of our models via <strong>Communication</strong>. These steps are vital to any statistical analysis. But why should you care about statistics? “Why did they make me take this class?”</p>
-<p>There’s a reason so many fields require a statistics course. Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn’t be intimidated by statistics. It’s not the beast that it used to be and, paired with computation, you’ll see how reproducible research in the sciences particularly increases scientific knowledge.</p>
+<div class="learncheck">
+<p><strong>Please note that you are currently looking at the “development version” of <em>ModernDive</em>, which is a work in progress currently being edited and thus subject to frequent change. For the latest “released version” of <em>ModernDive</em>, which is updated around twice a year, please visit <a href="https://moderndive.com/">ModernDive.com</a>.</strong></p>
 </div>
-<div id="reproducible-research" class="section level3 unnumbered">
-<h3>Reproducible research</h3>
-<blockquote>
-<p>“The most important tool is the <em>mindset</em>, when starting, that the end product will be reproducible.” – Keith Baggerly</p>
-</blockquote>
-<p></p>
-<p>Another goal of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we’ll be trying to help you build new habits. This will take practice and be difficult at times. You’ll see just why it is so important for you to keep track of your code and document it well, both to help yourself later and to help any potential collaborators.</p>
-<p>Copying and pasting results from one program into a word processor is not the way that efficient and effective scientific research is conducted. It’s much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a variety of programs.</p>
-<p>In traditional analyses, if an error was made with the original data, we’d need to step through the entire process again: recreate the plots and copy and paste all of the new plots and our statistical analysis into our document. This is error-prone and a frustrating use of time. We’ll see how to use R Markdown to get away from this tedious activity so that we can spend more time doing science.</p>
-<blockquote>
-<p>“We are talking about <em>computational</em> reproducibility.” - Yihui Xie</p>
-</blockquote>
-<p></p>
-<p>Reproducibility means different things in different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as <strong>computational reproducibility</strong>. This refers to being able to pass all of one’s data analysis, data-sets, and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent interpreting results and considering assumptions instead of the more error-prone approach of starting from scratch or following a list of steps that may be different from machine to machine.</p>
-<!--
-Additionally, this book will focus on computational thinking, data thinking, and inferential thinking. We'll see throughout the book how these three modes of thinking can build effective ways to work with, to describe, and to convey statistical knowledge.  
 -->
-</div>
-<div id="final-note-for-students" class="section level3 unnumbered">
-<h3>Final note for students</h3>
-<p>At this point, if you are interested in instructor perspectives on this book, ways to contribute and collaborate, or the technical details of this book’s construction and publishing, then continue with the rest of the chapter. Otherwise, let’s get started with R and RStudio in Chapter <a href="1-getting-started.html#getting-started">1</a>!</p>
-</div>
-</div>
-<div id="introduction-for-instructors" class="section level2 unnumbered">
-<h2>Introduction for instructors</h2>
-<div id="resources" class="section level3 unnumbered">
-<h3>Resources</h3>
-<p>Here are some resources to help you use ModernDive:</p>
-<ol style="list-style-type: decimal">
-<li>We’ve included review questions posed as <em>Learning Checks</em>. You can find all the solutions to all Learning Checks in Appendix D of the online version of the book at <a href="https://moderndive.com/D-appendixD.html" class="uri">https://moderndive.com/D-appendixD.html</a>.</li>
-<li>Dr. Jenny Smetzer and Albert Y. Kim have written a series of labs and problem sets. You can find them at <a href="https://moderndive.com/labs" class="uri">https://moderndive.com/labs</a>.</li>
-<li>You can see the webpages for two courses that use ModernDive:
-<ul>
-<li>Smith College “SDS192 Introduction to Data Science”: <a href="https://rudeboybert.github.io/SDS192/" class="uri">https://rudeboybert.github.io/SDS192/</a>.</li>
-<li>Smith College “SDS220 Introduction to Probability and Statistics” <a href="https://rudeboybert.github.io/SDS220/" class="uri">https://rudeboybert.github.io/SDS220/</a>.</li>
-</ul></li>
-</ol>
-</div>
-<div id="why-did-we-write-this-book" class="section level3 unnumbered">
-<h3>Why did we write this book?</h3>
-<p>This book is inspired by the following books:</p>
-<ul>
-<li>“Mathematical Statistics with Resampling and R” <span class="citation">(Chihara and Hesterberg <a href="#ref-hester2011">2011</a>)</span>,</li>
-<li>“OpenIntro: Intro Stat with Randomization and Simulation” <span class="citation">(Diez, Barr, and Çetinkaya-Rundel <a href="#ref-isrs2014">2014</a>)</span>, and</li>
-<li>“R for Data Science” <span class="citation">(Grolemund and Wickham <a href="#ref-rds2016">2016</a>)</span>.</li>
-</ul>
-<p>The first book, while designed for upper-level undergraduates and graduate students, provides an excellent resource on how to use resampling to impart statistical concepts like sampling distributions using computation instead of large-sample approximations and other mathematical formulas. The last two books are free options to learning introductory statistics and data science, providing an alternative to the many traditionally expensive introductory statistics textbooks.</p>
-<p>When looking over the large number of introductory statistics textbooks that currently exist, we found that there wasn’t one that incorporated many newly developed R packages directly into the text, in particular the many packages included in the <a href="http://tidyverse.org/"><code>tidyverse</code></a> collection of packages, such as <code>ggplot2</code>, <code>dplyr</code>, <code>tidyr</code>, and <code>broom</code>. Additionally, there wasn’t an open-source and easily reproducible textbook available that exposed new learners to all three of the learning goals we listed.</p>
-</div>
-<div id="who-is-this-book-for" class="section level3 unnumbered">
-<h3>Who is this book for?</h3>
-<p>This book is intended for instructors of traditional introductory statistics classes using RStudio, either the desktop or server version, who would like to inject more data science topics into their syllabus. We assume that students taking the class will have no prior algebra, calculus, or programming/coding experience.</p>
-<p>Here are some principles and beliefs we kept in mind while writing this text. If you agree with them, this might be the book for you.</p>
-<ol style="list-style-type: decimal">
-<li><strong>Blur the lines between lecture and lab</strong>
-<ul>
-<li>With increased availability and accessibility of laptops and open-source non-proprietary statistical software, the strict dichotomy between lab and lecture can be loosened.</li>
-<li>It’s much harder for students to understand the importance of using software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the rules. Frequent reinforcement is key.</li>
-</ul></li>
-<li><strong>Focus on the entire data/science research pipeline</strong>
-<ul>
-<li>We believe that the entirety of Grolemund and Wickham’s <a href="http://r4ds.had.co.nz/introduction.html">data/science pipeline</a>  should be taught.</li>
-<li>We believe in George Cobb’s <a href="https://arxiv.org/abs/1507.05346">“minimizing prerequisites to research”</a>:  students should be answering questions with data as soon as possible.</li>
-</ul></li>
-<li><strong>It’s all about the data</strong>
-<ul>
-<li>We leverage R packages for rich, real, and realistic data-sets that at the same time are easy-to-load into R, such as the <code>nycflights13</code> and <code>fivethirtyeight</code> packages.</li>
-<li>We believe that <a href="http://escholarship.org/uc/item/84v3774z">data visualization is a gateway drug for statistics</a> and that the Grammar of Graphics as implemented in the <code>ggplot2</code> package is the best way to impart such lessons. However, we often hear: “You can’t teach <code>ggplot2</code> for data visualization in intro stats!” We, like  <a href="http://varianceexplained.org/r/teach_ggplot2_to_beginners/">David Robinson</a>, are much more optimistic.</li>
-<li><code>dplyr</code> has made data wrangling much more <a href="http://chance.amstat.org/2015/04/setting-the-stage/">accessible</a> to novices, and hence much more interesting data-sets can be explored.</li>
-</ul></li>
-<li><strong>Use simulation/resampling to introduce statistical inference, not probability/mathematical formulas</strong>
-<ul>
-<li>Instead of using formulas, large-sample approximations, and probability tables, we teach statistical concepts using resampling-based inference.</li>
-<li>This allows for a de-emphasis of traditional probability topics, freeing up room in the syllabus for other topics. Bridges to these mathematical concepts are given as well to help with relation of these traditional topics with more modern approaches.</li>
-</ul></li>
-<li><strong>Don’t fence off students from the computation pool, throw them in!</strong>
-<ul>
-<li>Computing skills are essential to working with data in the 21st century. Given this fact, we feel that to shield students from computing is to ultimately do them a disservice.</li>
-<li>We are not teaching a course on coding/programming per se, but rather just enough of the computational and algorithmic thinking necessary for data analysis.</li>
-</ul></li>
-<li><strong>Complete reproducibility and customizability</strong>
-<ul>
-<li>We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the whole book!</li>
-<li>Ultimately the best textbook is one you’ve written yourself. You know best your audience, their background, and their priorities. You know best your own style and the types of examples and problems you like best. Customization is the ultimate end. For more about how to make this book your own, see <a href="about-book">About this Book</a>.</li>
-</ul></li>
-</ol>
-</div>
-</div>
-<div id="connect-and-contribute" class="section level2 unnumbered">
-<h2>Connect and contribute</h2>
-<p>If you would like to connect with ModernDive, check out the following links:</p>
-<ul>
-<li>If you would like to receive periodic updates about ModernDive (roughly every 6 months), please sign up for our <a href="http://eepurl.com/cBkItf">mailing list</a>.</li>
-<li>Contact Albert at <a href="mailto:albert.ys.kim@gmail.com">albert.ys.kim@gmail.com</a> and Chester at <a href="mailto:chester.ismay@gmail.com">chester.ismay@gmail.com</a>.</li>
-<li>We’re on Twitter at <a href="https://twitter.com/moderndive">moderndive</a>.</li>
-</ul>
-<p>If you would like to contribute to ModernDive, there are many ways! We would love your help and feedback to make this book as great as possible! For example, if you find any errors, typos, or areas for improvement, then please email us or post an issue on our <a href="https://github.com/moderndive/moderndive_book/issues">GitHub issues</a>  page. If you are familiar with GitHub and would like to contribute more, please see the “About this book” section.</p>
-<p>The authors would like to thank <a href="https://github.com/nsonneborn">Nina Sonneborn</a>, <a href="https://twitter.com/rhobott?lang=en">Kristin Bott</a>, <a href="https://www.smith.edu/academics/faculty/jennifer-smetzer">Dr. Jenny Smetzer</a>, and the participants of our <a href="https://www.causeweb.org/cause/uscots/uscots17/workshop/3">2017</a> and <a href="https://www.causeweb.org/cause/uscots/uscots19/workshop/4">2019</a> USCOTS workshops for their feedback and suggestions. We’d also like to thank <a href="https://twitter.com/andrewheiss">Dr. Andrew Heiss</a> for contributing Subsection <a href="1-getting-started.html#tips-code">1.2.3</a> on “Errors, warnings, and messages.” and <a href="https://github.com/Starryz">Starry Zhou</a> for her many edits to the book. A special thanks goes to Dr. Yana Weinstein, cognitive psychological scientist and co-founder of <a href="http://www.learningscientists.org/yana-weinstein/">The Learning Scientists</a>, for her extensive feedback.</p>
-</div>
-<div id="about-this-book" class="section level2 unnumbered">
-<h2>About this book</h2>
-<p>This book was written using RStudio’s <a href="https://bookdown.org/">bookdown</a>  package by Yihui Xie  <span class="citation">(Xie <a href="#ref-R-bookdown">2019</a>)</span>. This package simplifies the publishing of books by having all content written in  <a href="http://rmarkdown.rstudio.com/html_document_format.html">R Markdown</a>. The bookdown/R Markdown source code for all versions of ModernDive is available on GitHub:</p>
-<ul>
-<li><strong>Latest published version</strong> The most up-to-date release:
-<ul>
-<li>Version 0.6.1 released on August 28, 2019 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.6.1">source code</a>).</li>
-<li>Available at <a href="https://moderndive.com/">ModernDive.com</a></li>
-</ul></li>
-<li><strong>Development version</strong> The working copy of the next version which is currently being edited:
-<ul>
-<li>Preview of development version is available at <a href="https://moderndive.netlify.com/">https://moderndive.netlify.com/</a></li>
-<li>Source code: Available on ModernDive’s <a href="https://github.com/moderndive/moderndive_book">GitHub repository page</a></li>
-</ul></li>
-<li><strong>Previous versions</strong> Older versions that may be out of date:
-<ul>
-<li><a href="previous_versions/v0.6.0/index.html">Version 0.6.0</a> released on August 7, 2019 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.6.0">source code</a>))</li>
-<li><a href="previous_versions/v0.5.0/index.html">Version 0.5.0</a> released on February 24, 2019 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.5.0">source code</a>)</li>
-<li><a href="previous_versions/v0.4.0/index.html">Version 0.4.0</a> released on July 21, 2018 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.4.0">source code</a>)</li>
-<li><a href="previous_versions/v0.3.0/index.html">Version 0.3.0</a> released on February 3, 2018 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.3.0">source code</a>)</li>
-<li><a href="previous_versions/v0.2.0/index.html">Version 0.2.0</a> released on August 2, 2017 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.2.0">source code</a>)</li>
-<li><a href="previous_versions/v0.1.3/index.html">Version 0.1.3</a> released on February 9, 2017 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.1.3">source code</a>)</li>
-<li><a href="previous_versions/v0.1.2/index.html">Version 0.1.2</a> released on January 22, 2017 (<a href="https://github.com/moderndive/moderndive_book/releases/tag/v0.1.2">source code</a>)</li>
-</ul></li>
-</ul>
-<p>Could this be a new paradigm for textbooks? Instead of the traditional model of textbook companies publishing updated <em>editions</em> of the textbook every few years, we apply a software design influenced model of publishing more easily updated <em>versions</em>. We can then leverage open-source communities of instructors and developers for ideas, tools, resources, and feedback. As such, we welcome your pull requests.</p>
-<p>Finally, feel free to modify the book as you wish for your own needs, but please list the authors at the top of <code>index.Rmd</code> as “Chester Ismay, Albert Y. Kim, and YOU!”</p>
-</div>
-<div id="about-the-authors" class="section level2 unnumbered">
-<h2>About the authors</h2>
-<p>Who we are!</p>
-<!-- <img src="images/ismay.jpeg" alt="Drawing" style="height: 200px;"/>  |  <img src="images/kim.jpeg" alt="Drawing" style="height: 200px;"/> -->
-<table>
-<thead>
-<tr class="header">
-<th align="center">Chester Ismay</th>
-<th align="center">Albert Y. Kim</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td align="center"><img src="images/ismay.png" height="200" /></td>
-<td align="center"><img src="images/kim.png" height="200" /></td>
-</tr>
-</tbody>
-</table>
-<ul>
-<li>Chester Ismay: Data Science Evangelist - DataRobot, Portland, OR, USA.
-<ul>
-<li>Email: <a href="mailto:chester.ismay@gmail.com">chester.ismay@gmail.com</a></li>
-<li>Webpage: <a href="http://chester.rbind.io/" class="uri">http://chester.rbind.io/</a></li>
-<li>Twitter: <a href="https://twitter.com/old_man_chester">old_man_chester</a></li>
-<li>GitHub: <a href="https://github.com/ismayc" class="uri">https://github.com/ismayc</a></li>
-</ul></li>
-<li>Albert Y. Kim: Assistant Professor of Statistical &amp; Data Sciences - Smith College, Northampton, MA, USA.
-<ul>
-<li>Email: <a href="mailto:albert.ys.kim@gmail.com">albert.ys.kim@gmail.com</a></li>
-<li>Webpage: <a href="http://rudeboybert.rbind.io/" class="uri">http://rudeboybert.rbind.io/</a></li>
-<li>Twitter: <a href="https://twitter.com/rudeboybert">rudeboybert</a></li>
-<li>GitHub: <a href="https://github.com/rudeboybert" class="uri">https://github.com/rudeboybert</a></li>
-</ul></li>
-</ul>
-<!-- For use only in PDF, is skipped in HTML -->
-
-
-</div>
-</div>
-<h3>References</h3>
-<div id="refs" class="references">
-<div id="ref-hester2011">
-<p>Chihara, Laura M., and Tim C. Hesterberg. 2011. <em>Mathematical Statistics with Resampling and R</em>. Hoboken, NJ: John Wiley; Sons. <a href="https://sites.google.com/site/chiharahesterberg/home">https://sites.google.com/site/chiharahesterberg/home</a>.</p>
-</div>
-<div id="ref-isrs2014">
-<p>Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2014. <em>Introductory Statistics with Randomization and Simulation</em>. First Edition. <a href="https://www.openintro.org/stat/textbook.php?stat_book=isrs">https://www.openintro.org/stat/textbook.php?stat_book=isrs</a>.</p>
-</div>
-<div id="ref-rds2016">
-<p>Grolemund, Garrett, and Hadley Wickham. 2016. <em>R for Data Science</em>. <a href="http://r4ds.had.co.nz/">http://r4ds.had.co.nz/</a>.</p>
-</div>
-<div id="ref-R-bookdown">
-<p>Xie, Yihui. 2019. <em>Bookdown: Authoring Books and Technical Documents with R Markdown</em>. <a href="https://CRAN.R-project.org/package=bookdown">https://CRAN.R-project.org/package=bookdown</a>.</p>
-</div>
+<!-- index.Rmd has to have some content in it or it won't create an index.html file. When we remove the Special Announcement, make sure to keep this in so that index.html is included. -->
 </div>
             </section>
 
@@ -883,16 +616,18 @@ <h3>References</h3>
         </div>
       </div>
 
-<a href="1-getting-started.html" class="navigation navigation-next navigation-unique" aria-label="Next page"><i class="fa fa-angle-right"></i></a>
+<a href="foreword.html" class="navigation navigation-next navigation-unique" aria-label="Next page"><i class="fa fa-angle-right"></i></a>
     </div>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -900,12 +635,11 @@ <h3>References</h3>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -920,6 +654,10 @@ <h3>References</h3>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -936,8 +674,9 @@ <h3>References</h3>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/labs.html b/docs/labs.html
index 55b65ba23..680e47532 100644
--- a/docs/labs.html
+++ b/docs/labs.html
@@ -1 +1 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html lang="en"> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8"> <title>ModernDive Labs</title> </head> <body> <p>This is placeholder for student labs and other resources corresponding to the textbook at <a href="https://moderndive.com" target="_blank">moderndive.com</a> </p> </body> </html>
+<meta http-equiv="Refresh" content="0; url=https://moderndive.github.io/moderndive_labs/index.html" />
diff --git a/docs/libs/gitbook-2.6.7/js/plugin-bookdown.js b/docs/libs/gitbook-2.6.7/js/plugin-bookdown.js
index 04d56dd0e..0ef72b6b5 100644
--- a/docs/libs/gitbook-2.6.7/js/plugin-bookdown.js
+++ b/docs/libs/gitbook-2.6.7/js/plugin-bookdown.js
@@ -28,6 +28,18 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
       }
     });
 
+    // add the View button (file view on Github)
+    var view = config.view;
+    if (view && view.link) gitbook.toolbar.createButton({
+      icon: 'fa fa-eye',
+      label: view.text || 'View Source',
+      position: 'left',
+      onClick: function(e) {
+        e.preventDefault();
+        window.open(view.link);
+      }
+    });
+
     // add the Download button
     var down = config.download;
     var normalizeDownload = function() {
@@ -72,7 +84,7 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
     if (config.search !== false) info.push('f: Toggle search input ' +
       '(use <up>/<down>/Enter in the search input to navigate through search matches; ' +
       'press Esc to cancel search)');
-    gitbook.toolbar.createButton({
+    if (config.info !== false) gitbook.toolbar.createButton({
       icon: 'fa fa-info',
       label: 'Information about the toolbar',
       position: 'left',
@@ -85,6 +97,8 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
     // highlight the current section in TOC
     var href = window.location.pathname;
     href = href.substr(href.lastIndexOf('/') + 1);
+    // accentuated characters need to be decoded (#819)
+    href = decodeURIComponent(href);
     if (href === '') href = 'index.html';
     var li = $('a[href^="' + href + location.hash + '"]').parent('li.chapter').first();
     var summary = $('ul.summary'), chaps = summary.find('li.chapter');
diff --git a/docs/libs/gitbook-2.6.7/js/plugin-fontsettings.js b/docs/libs/gitbook-2.6.7/js/plugin-fontsettings.js
index b39eca27e..a70f0fb37 100644
--- a/docs/libs/gitbook-2.6.7/js/plugin-fontsettings.js
+++ b/docs/libs/gitbook-2.6.7/js/plugin-fontsettings.js
@@ -96,7 +96,8 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
 
     gitbook.events.bind("start", function(e, config) {
         var opts = config.fontsettings;
-
+        if (!opts) return;
+        
         // Create buttons in toolbar
         gitbook.toolbar.createButton({
             icon: 'fa fa-font',
diff --git a/docs/libs/gitbook-2.6.7/js/plugin-sharing.js b/docs/libs/gitbook-2.6.7/js/plugin-sharing.js
index bc271149c..8d279518d 100644
--- a/docs/libs/gitbook-2.6.7/js/plugin-sharing.js
+++ b/docs/libs/gitbook-2.6.7/js/plugin-sharing.js
@@ -15,7 +15,7 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
             'icon': 'fa fa-facebook',
             'onClick': function(e) {
                 e.preventDefault();
-                window.open("http://www.facebook.com/sharer/sharer.php?s=100&p[url]="+encodeURIComponent(location.href));
+                window.open("http://www.facebook.com/sharer/sharer.php?u="+encodeURIComponent(location.href));
             }
         },
         'twitter': {
@@ -23,15 +23,7 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
             'icon': 'fa fa-twitter',
             'onClick': function(e) {
                 e.preventDefault();
-                window.open("http://twitter.com/home?status="+encodeURIComponent(document.title+" "+location.href));
-            }
-        },
-        'google': {
-            'label': 'Google+',
-            'icon': 'fa fa-google-plus',
-            'onClick': function(e) {
-                e.preventDefault();
-                window.open("https://plus.google.com/share?url="+encodeURIComponent(location.href));
+                window.open("http://twitter.com/intent/tweet?text="+document.title+"&url="+encodeURIComponent(location.href)+"&hashtags=rmarkdown,bookdown");
             }
         },
         'linkedin': {
@@ -52,7 +44,7 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
         },
         'instapaper': {
             'label': 'Instapaper',
-            'icon': 'fa fa-instapaper',
+            'icon': 'fa fa-italic',
             'onClick': function(e) {
                 e.preventDefault();
                 window.open("http://www.instapaper.com/text?u="+encodeURIComponent(location.href));
@@ -78,7 +70,7 @@ gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) {
         var menu = _.chain(opts.all)
             .map(function(id) {
                 var site = SITES[id];
-
+                if (!site) return;
                 return {
                     text: site.label,
                     onClick: site.onClick
diff --git a/docs/libs/htmlwidgets-1.3/htmlwidgets.js b/docs/libs/htmlwidgets-1.3/htmlwidgets.js
deleted file mode 100644
index ed9837d9c..000000000
--- a/docs/libs/htmlwidgets-1.3/htmlwidgets.js
+++ /dev/null
@@ -1,839 +0,0 @@
-(function() {
-  // If window.HTMLWidgets is already defined, then use it; otherwise create a
-  // new object. This allows preceding code to set options that affect the
-  // initialization process (though none currently exist).
-  window.HTMLWidgets = window.HTMLWidgets || {};
-
-  // See if we're running in a viewer pane. If not, we're in a web browser.
-  var viewerMode = window.HTMLWidgets.viewerMode =
-      /\bviewer_pane=1\b/.test(window.location);
-
-  // See if we're running in Shiny mode. If not, it's a static document.
-  // Note that static widgets can appear in both Shiny and static modes, but
-  // obviously, Shiny widgets can only appear in Shiny apps/documents.
-  var shinyMode = window.HTMLWidgets.shinyMode =
-      typeof(window.Shiny) !== "undefined" && !!window.Shiny.outputBindings;
-
-  // We can't count on jQuery being available, so we implement our own
-  // version if necessary.
-  function querySelectorAll(scope, selector) {
-    if (typeof(jQuery) !== "undefined" && scope instanceof jQuery) {
-      return scope.find(selector);
-    }
-    if (scope.querySelectorAll) {
-      return scope.querySelectorAll(selector);
-    }
-  }
-
-  function asArray(value) {
-    if (value === null)
-      return [];
-    if ($.isArray(value))
-      return value;
-    return [value];
-  }
-
-  // Implement jQuery's extend
-  function extend(target /*, ... */) {
-    if (arguments.length == 1) {
-      return target;
-    }
-    for (var i = 1; i < arguments.length; i++) {
-      var source = arguments[i];
-      for (var prop in source) {
-        if (source.hasOwnProperty(prop)) {
-          target[prop] = source[prop];
-        }
-      }
-    }
-    return target;
-  }
-
-  // IE8 doesn't support Array.forEach.
-  function forEach(values, callback, thisArg) {
-    if (values.forEach) {
-      values.forEach(callback, thisArg);
-    } else {
-      for (var i = 0; i < values.length; i++) {
-        callback.call(thisArg, values[i], i, values);
-      }
-    }
-  }
-
-  // Replaces the specified method with the return value of funcSource.
-  //
-  // Note that funcSource should not BE the new method, it should be a function
-  // that RETURNS the new method. funcSource receives a single argument that is
-  // the overridden method, it can be called from the new method. The overridden
-  // method can be called like a regular function, it has the target permanently
-  // bound to it so "this" will work correctly.
-  function overrideMethod(target, methodName, funcSource) {
-    var superFunc = target[methodName] || function() {};
-    var superFuncBound = function() {
-      return superFunc.apply(target, arguments);
-    };
-    target[methodName] = funcSource(superFuncBound);
-  }
-
-  // Add a method to delegator that, when invoked, calls
-  // delegatee.methodName. If there is no such method on
-  // the delegatee, but there was one on delegator before
-  // delegateMethod was called, then the original version
-  // is invoked instead.
-  // For example:
-  //
-  // var a = {
-  //   method1: function() { console.log('a1'); }
-  //   method2: function() { console.log('a2'); }
-  // };
-  // var b = {
-  //   method1: function() { console.log('b1'); }
-  // };
-  // delegateMethod(a, b, "method1");
-  // delegateMethod(a, b, "method2");
-  // a.method1();
-  // a.method2();
-  //
-  // The output would be "b1", "a2".
-  function delegateMethod(delegator, delegatee, methodName) {
-    var inherited = delegator[methodName];
-    delegator[methodName] = function() {
-      var target = delegatee;
-      var method = delegatee[methodName];
-
-      // The method doesn't exist on the delegatee. Instead,
-      // call the method on the delegator, if it exists.
-      if (!method) {
-        target = delegator;
-        method = inherited;
-      }
-
-      if (method) {
-        return method.apply(target, arguments);
-      }
-    };
-  }
-
-  // Implement a vague facsimilie of jQuery's data method
-  function elementData(el, name, value) {
-    if (arguments.length == 2) {
-      return el["htmlwidget_data_" + name];
-    } else if (arguments.length == 3) {
-      el["htmlwidget_data_" + name] = value;
-      return el;
-    } else {
-      throw new Error("Wrong number of arguments for elementData: " +
-        arguments.length);
-    }
-  }
-
-  // http://stackoverflow.com/questions/3446170/escape-string-for-use-in-javascript-regex
-  function escapeRegExp(str) {
-    return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&");
-  }
-
-  function hasClass(el, className) {
-    var re = new RegExp("\\b" + escapeRegExp(className) + "\\b");
-    return re.test(el.className);
-  }
-
-  // elements - array (or array-like object) of HTML elements
-  // className - class name to test for
-  // include - if true, only return elements with given className;
-  //   if false, only return elements *without* given className
-  function filterByClass(elements, className, include) {
-    var results = [];
-    for (var i = 0; i < elements.length; i++) {
-      if (hasClass(elements[i], className) == include)
-        results.push(elements[i]);
-    }
-    return results;
-  }
-
-  function on(obj, eventName, func) {
-    if (obj.addEventListener) {
-      obj.addEventListener(eventName, func, false);
-    } else if (obj.attachEvent) {
-      obj.attachEvent(eventName, func);
-    }
-  }
-
-  function off(obj, eventName, func) {
-    if (obj.removeEventListener)
-      obj.removeEventListener(eventName, func, false);
-    else if (obj.detachEvent) {
-      obj.detachEvent(eventName, func);
-    }
-  }
-
-  // Translate array of values to top/right/bottom/left, as usual with
-  // the "padding" CSS property
-  // https://developer.mozilla.org/en-US/docs/Web/CSS/padding
-  function unpackPadding(value) {
-    if (typeof(value) === "number")
-      value = [value];
-    if (value.length === 1) {
-      return {top: value[0], right: value[0], bottom: value[0], left: value[0]};
-    }
-    if (value.length === 2) {
-      return {top: value[0], right: value[1], bottom: value[0], left: value[1]};
-    }
-    if (value.length === 3) {
-      return {top: value[0], right: value[1], bottom: value[2], left: value[1]};
-    }
-    if (value.length === 4) {
-      return {top: value[0], right: value[1], bottom: value[2], left: value[3]};
-    }
-  }
-
-  // Convert an unpacked padding object to a CSS value
-  function paddingToCss(paddingObj) {
-    return paddingObj.top + "px " + paddingObj.right + "px " + paddingObj.bottom + "px " + paddingObj.left + "px";
-  }
-
-  // Makes a number suitable for CSS
-  function px(x) {
-    if (typeof(x) === "number")
-      return x + "px";
-    else
-      return x;
-  }
-
-  // Retrieves runtime widget sizing information for an element.
-  // The return value is either null, or an object with fill, padding,
-  // defaultWidth, defaultHeight fields.
-  function sizingPolicy(el) {
-    var sizingEl = document.querySelector("script[data-for='" + el.id + "'][type='application/htmlwidget-sizing']");
-    if (!sizingEl)
-      return null;
-    var sp = JSON.parse(sizingEl.textContent || sizingEl.text || "{}");
-    if (viewerMode) {
-      return sp.viewer;
-    } else {
-      return sp.browser;
-    }
-  }
-
-  // @param tasks Array of strings (or falsy value, in which case no-op).
-  //   Each element must be a valid JavaScript expression that yields a
-  //   function. Or, can be an array of objects with "code" and "data"
-  //   properties; in this case, the "code" property should be a string
-  //   of JS that's an expr that yields a function, and "data" should be
-  //   an object that will be added as an additional argument when that
-  //   function is called.
-  // @param target The object that will be "this" for each function
-  //   execution.
-  // @param args Array of arguments to be passed to the functions. (The
-  //   same arguments will be passed to all functions.)
-  function evalAndRun(tasks, target, args) {
-    if (tasks) {
-      forEach(tasks, function(task) {
-        var theseArgs = args;
-        if (typeof(task) === "object") {
-          theseArgs = theseArgs.concat([task.data]);
-          task = task.code;
-        }
-        var taskFunc = eval("(" + task + ")");
-        if (typeof(taskFunc) !== "function") {
-          throw new Error("Task must be a function! Source:\n" + task);
-        }
-        taskFunc.apply(target, theseArgs);
-      });
-    }
-  }
-
-  function initSizing(el) {
-    var sizing = sizingPolicy(el);
-    if (!sizing)
-      return;
-
-    var cel = document.getElementById("htmlwidget_container");
-    if (!cel)
-      return;
-
-    if (typeof(sizing.padding) !== "undefined") {
-      document.body.style.margin = "0";
-      document.body.style.padding = paddingToCss(unpackPadding(sizing.padding));
-    }
-
-    if (sizing.fill) {
-      document.body.style.overflow = "hidden";
-      document.body.style.width = "100%";
-      document.body.style.height = "100%";
-      document.documentElement.style.width = "100%";
-      document.documentElement.style.height = "100%";
-      if (cel) {
-        cel.style.position = "absolute";
-        var pad = unpackPadding(sizing.padding);
-        cel.style.top = pad.top + "px";
-        cel.style.right = pad.right + "px";
-        cel.style.bottom = pad.bottom + "px";
-        cel.style.left = pad.left + "px";
-        el.style.width = "100%";
-        el.style.height = "100%";
-      }
-
-      return {
-        getWidth: function() { return cel.offsetWidth; },
-        getHeight: function() { return cel.offsetHeight; }
-      };
-
-    } else {
-      el.style.width = px(sizing.width);
-      el.style.height = px(sizing.height);
-
-      return {
-        getWidth: function() { return el.offsetWidth; },
-        getHeight: function() { return el.offsetHeight; }
-      };
-    }
-  }
-
-  // Default implementations for methods
-  var defaults = {
-    find: function(scope) {
-      return querySelectorAll(scope, "." + this.name);
-    },
-    renderError: function(el, err) {
-      var $el = $(el);
-
-      this.clearError(el);
-
-      // Add all these error classes, as Shiny does
-      var errClass = "shiny-output-error";
-      if (err.type !== null) {
-        // use the classes of the error condition as CSS class names
-        errClass = errClass + " " + $.map(asArray(err.type), function(type) {
-          return errClass + "-" + type;
-        }).join(" ");
-      }
-      errClass = errClass + " htmlwidgets-error";
-
-      // Is el inline or block? If inline or inline-block, just display:none it
-      // and add an inline error.
-      var display = $el.css("display");
-      $el.data("restore-display-mode", display);
-
-      if (display === "inline" || display === "inline-block") {
-        $el.hide();
-        if (err.message !== "") {
-          var errorSpan = $("<span>").addClass(errClass);
-          errorSpan.text(err.message);
-          $el.after(errorSpan);
-        }
-      } else if (display === "block") {
-        // If block, add an error just after the el, set visibility:none on the
-        // el, and position the error to be on top of the el.
-        // Mark it with a unique ID and CSS class so we can remove it later.
-        $el.css("visibility", "hidden");
-        if (err.message !== "") {
-          var errorDiv = $("<div>").addClass(errClass).css("position", "absolute")
-            .css("top", el.offsetTop)
-            .css("left", el.offsetLeft)
-            // setting width can push out the page size, forcing otherwise
-            // unnecessary scrollbars to appear and making it impossible for
-            // the element to shrink; so use max-width instead
-            .css("maxWidth", el.offsetWidth)
-            .css("height", el.offsetHeight);
-          errorDiv.text(err.message);
-          $el.after(errorDiv);
-
-          // Really dumb way to keep the size/position of the error in sync with
-          // the parent element as the window is resized or whatever.
-          var intId = setInterval(function() {
-            if (!errorDiv[0].parentElement) {
-              clearInterval(intId);
-              return;
-            }
-            errorDiv
-              .css("top", el.offsetTop)
-              .css("left", el.offsetLeft)
-              .css("maxWidth", el.offsetWidth)
-              .css("height", el.offsetHeight);
-          }, 500);
-        }
-      }
-    },
-    clearError: function(el) {
-      var $el = $(el);
-      var display = $el.data("restore-display-mode");
-      $el.data("restore-display-mode", null);
-
-      if (display === "inline" || display === "inline-block") {
-        if (display)
-          $el.css("display", display);
-        $(el.nextSibling).filter(".htmlwidgets-error").remove();
-      } else if (display === "block"){
-        $el.css("visibility", "inherit");
-        $(el.nextSibling).filter(".htmlwidgets-error").remove();
-      }
-    },
-    sizing: {}
-  };
-
-  // Called by widget bindings to register a new type of widget. The definition
-  // object can contain the following properties:
-  // - name (required) - A string indicating the binding name, which will be
-  //   used by default as the CSS classname to look for.
-  // - initialize (optional) - A function(el) that will be called once per
-  //   widget element; if a value is returned, it will be passed as the third
-  //   value to renderValue.
-  // - renderValue (required) - A function(el, data, initValue) that will be
-  //   called with data. Static contexts will cause this to be called once per
-  //   element; Shiny apps will cause this to be called multiple times per
-  //   element, as the data changes.
-  window.HTMLWidgets.widget = function(definition) {
-    if (!definition.name) {
-      throw new Error("Widget must have a name");
-    }
-    if (!definition.type) {
-      throw new Error("Widget must have a type");
-    }
-    // Currently we only support output widgets
-    if (definition.type !== "output") {
-      throw new Error("Unrecognized widget type '" + definition.type + "'");
-    }
-    // TODO: Verify that .name is a valid CSS classname
-
-    // Support new-style instance-bound definitions. Old-style class-bound
-    // definitions have one widget "object" per widget per type/class of
-    // widget; the renderValue and resize methods on such widget objects
-    // take el and instance arguments, because the widget object can't
-    // store them. New-style instance-bound definitions have one widget
-    // object per widget instance; the definition that's passed in doesn't
-    // provide renderValue or resize methods at all, just the single method
-    //   factory(el, width, height)
-    // which returns an object that has renderValue(x) and resize(w, h).
-    // This enables a far more natural programming style for the widget
-    // author, who can store per-instance state using either OO-style
-    // instance fields or functional-style closure variables (I guess this
-    // is in contrast to what can only be called C-style pseudo-OO which is
-    // what we required before).
-    if (definition.factory) {
-      definition = createLegacyDefinitionAdapter(definition);
-    }
-
-    if (!definition.renderValue) {
-      throw new Error("Widget must have a renderValue function");
-    }
-
-    // For static rendering (non-Shiny), use a simple widget registration
-    // scheme. We also use this scheme for Shiny apps/documents that also
-    // contain static widgets.
-    window.HTMLWidgets.widgets = window.HTMLWidgets.widgets || [];
-    // Merge defaults into the definition; don't mutate the original definition.
-    var staticBinding = extend({}, defaults, definition);
-    overrideMethod(staticBinding, "find", function(superfunc) {
-      return function(scope) {
-        var results = superfunc(scope);
-        // Filter out Shiny outputs, we only want the static kind
-        return filterByClass(results, "html-widget-output", false);
-      };
-    });
-    window.HTMLWidgets.widgets.push(staticBinding);
-
-    if (shinyMode) {
-      // Shiny is running. Register the definition with an output binding.
-      // The definition itself will not be the output binding, instead
-      // we will make an output binding object that delegates to the
-      // definition. This is because we foolishly used the same method
-      // name (renderValue) for htmlwidgets definition and Shiny bindings
-      // but they actually have quite different semantics (the Shiny
-      // bindings receive data that includes lots of metadata that it
-      // strips off before calling htmlwidgets renderValue). We can't
-      // just ignore the difference because in some widgets it's helpful
-      // to call this.renderValue() from inside of resize(), and if
-      // we're not delegating, then that call will go to the Shiny
-      // version instead of the htmlwidgets version.
-
-      // Merge defaults with definition, without mutating either.
-      var bindingDef = extend({}, defaults, definition);
-
-      // This object will be our actual Shiny binding.
-      var shinyBinding = new Shiny.OutputBinding();
-
-      // With a few exceptions, we'll want to simply use the bindingDef's
-      // version of methods if they are available, otherwise fall back to
-      // Shiny's defaults. NOTE: If Shiny's output bindings gain additional
-      // methods in the future, and we want them to be overrideable by
-      // HTMLWidget binding definitions, then we'll need to add them to this
-      // list.
-      delegateMethod(shinyBinding, bindingDef, "getId");
-      delegateMethod(shinyBinding, bindingDef, "onValueChange");
-      delegateMethod(shinyBinding, bindingDef, "onValueError");
-      delegateMethod(shinyBinding, bindingDef, "renderError");
-      delegateMethod(shinyBinding, bindingDef, "clearError");
-      delegateMethod(shinyBinding, bindingDef, "showProgress");
-
-      // The find, renderValue, and resize are handled differently, because we
-      // want to actually decorate the behavior of the bindingDef methods.
-
-      shinyBinding.find = function(scope) {
-        var results = bindingDef.find(scope);
-
-        // Only return elements that are Shiny outputs, not static ones
-        var dynamicResults = results.filter(".html-widget-output");
-
-        // It's possible that whatever caused Shiny to think there might be
-        // new dynamic outputs, also caused there to be new static outputs.
-        // Since there might be lots of different htmlwidgets bindings, we
-        // schedule execution for later--no need to staticRender multiple
-        // times.
-        if (results.length !== dynamicResults.length)
-          scheduleStaticRender();
-
-        return dynamicResults;
-      };
-
-      // Wrap renderValue to handle initialization, which unfortunately isn't
-      // supported natively by Shiny at the time of this writing.
-
-      shinyBinding.renderValue = function(el, data) {
-        Shiny.renderDependencies(data.deps);
-        // Resolve strings marked as javascript literals to objects
-        if (!(data.evals instanceof Array)) data.evals = [data.evals];
-        for (var i = 0; data.evals && i < data.evals.length; i++) {
-          window.HTMLWidgets.evaluateStringMember(data.x, data.evals[i]);
-        }
-        if (!bindingDef.renderOnNullValue) {
-          if (data.x === null) {
-            el.style.visibility = "hidden";
-            return;
-          } else {
-            el.style.visibility = "inherit";
-          }
-        }
-        if (!elementData(el, "initialized")) {
-          initSizing(el);
-
-          elementData(el, "initialized", true);
-          if (bindingDef.initialize) {
-            var result = bindingDef.initialize(el, el.offsetWidth,
-              el.offsetHeight);
-            elementData(el, "init_result", result);
-          }
-        }
-        bindingDef.renderValue(el, data.x, elementData(el, "init_result"));
-        evalAndRun(data.jsHooks.render, elementData(el, "init_result"), [el, data.x]);
-      };
-
-      // Only override resize if bindingDef implements it
-      if (bindingDef.resize) {
-        shinyBinding.resize = function(el, width, height) {
-          // Shiny can call resize before initialize/renderValue have been
-          // called, which doesn't make sense for widgets.
-          if (elementData(el, "initialized")) {
-            bindingDef.resize(el, width, height, elementData(el, "init_result"));
-          }
-        };
-      }
-
-      Shiny.outputBindings.register(shinyBinding, bindingDef.name);
-    }
-  };
-
-  var scheduleStaticRenderTimerId = null;
-  function scheduleStaticRender() {
-    if (!scheduleStaticRenderTimerId) {
-      scheduleStaticRenderTimerId = setTimeout(function() {
-        scheduleStaticRenderTimerId = null;
-        window.HTMLWidgets.staticRender();
-      }, 1);
-    }
-  }
-
-  // Render static widgets after the document finishes loading
-  // Statically render all elements that are of this widget's class
-  window.HTMLWidgets.staticRender = function() {
-    var bindings = window.HTMLWidgets.widgets || [];
-    forEach(bindings, function(binding) {
-      var matches = binding.find(document.documentElement);
-      forEach(matches, function(el) {
-        var sizeObj = initSizing(el, binding);
-
-        if (hasClass(el, "html-widget-static-bound"))
-          return;
-        el.className = el.className + " html-widget-static-bound";
-
-        var initResult;
-        if (binding.initialize) {
-          initResult = binding.initialize(el,
-            sizeObj ? sizeObj.getWidth() : el.offsetWidth,
-            sizeObj ? sizeObj.getHeight() : el.offsetHeight
-          );
-          elementData(el, "init_result", initResult);
-        }
-
-        if (binding.resize) {
-          var lastSize = {
-            w: sizeObj ? sizeObj.getWidth() : el.offsetWidth,
-            h: sizeObj ? sizeObj.getHeight() : el.offsetHeight
-          };
-          var resizeHandler = function(e) {
-            var size = {
-              w: sizeObj ? sizeObj.getWidth() : el.offsetWidth,
-              h: sizeObj ? sizeObj.getHeight() : el.offsetHeight
-            };
-            if (size.w === 0 && size.h === 0)
-              return;
-            if (size.w === lastSize.w && size.h === lastSize.h)
-              return;
-            lastSize = size;
-            binding.resize(el, size.w, size.h, initResult);
-          };
-
-          on(window, "resize", resizeHandler);
-
-          // This is needed for cases where we're running in a Shiny
-          // app, but the widget itself is not a Shiny output, but
-          // rather a simple static widget. One example of this is
-          // an rmarkdown document that has runtime:shiny and widget
-          // that isn't in a render function. Shiny only knows to
-          // call resize handlers for Shiny outputs, not for static
-          // widgets, so we do it ourselves.
-          if (window.jQuery) {
-            window.jQuery(document).on(
-              "shown.htmlwidgets shown.bs.tab.htmlwidgets shown.bs.collapse.htmlwidgets",
-              resizeHandler
-            );
-            window.jQuery(document).on(
-              "hidden.htmlwidgets hidden.bs.tab.htmlwidgets hidden.bs.collapse.htmlwidgets",
-              resizeHandler
-            );
-          }
-
-          // This is needed for the specific case of ioslides, which
-          // flips slides between display:none and display:block.
-          // Ideally we would not have to have ioslide-specific code
-          // here, but rather have ioslides raise a generic event,
-          // but the rmarkdown package just went to CRAN so the
-          // window to getting that fixed may be long.
-          if (window.addEventListener) {
-            // It's OK to limit this to window.addEventListener
-            // browsers because ioslides itself only supports
-            // such browsers.
-            on(document, "slideenter", resizeHandler);
-            on(document, "slideleave", resizeHandler);
-          }
-        }
-
-        var scriptData = document.querySelector("script[data-for='" + el.id + "'][type='application/json']");
-        if (scriptData) {
-          var data = JSON.parse(scriptData.textContent || scriptData.text);
-          // Resolve strings marked as javascript literals to objects
-          if (!(data.evals instanceof Array)) data.evals = [data.evals];
-          for (var k = 0; data.evals && k < data.evals.length; k++) {
-            window.HTMLWidgets.evaluateStringMember(data.x, data.evals[k]);
-          }
-          binding.renderValue(el, data.x, initResult);
-          evalAndRun(data.jsHooks.render, initResult, [el, data.x]);
-        }
-      });
-    });
-
-    invokePostRenderHandlers();
-  }
-
-  // Wait until after the document has loaded to render the widgets.
-  if (document.addEventListener) {
-    document.addEventListener("DOMContentLoaded", function() {
-      document.removeEventListener("DOMContentLoaded", arguments.callee, false);
-      window.HTMLWidgets.staticRender();
-    }, false);
-  } else if (document.attachEvent) {
-    document.attachEvent("onreadystatechange", function() {
-      if (document.readyState === "complete") {
-        document.detachEvent("onreadystatechange", arguments.callee);
-        window.HTMLWidgets.staticRender();
-      }
-    });
-  }
-
-
-  window.HTMLWidgets.getAttachmentUrl = function(depname, key) {
-    // If no key, default to the first item
-    if (typeof(key) === "undefined")
-      key = 1;
-
-    var link = document.getElementById(depname + "-" + key + "-attachment");
-    if (!link) {
-      throw new Error("Attachment " + depname + "/" + key + " not found in document");
-    }
-    return link.getAttribute("href");
-  };
-
-  window.HTMLWidgets.dataframeToD3 = function(df) {
-    var names = [];
-    var length;
-    for (var name in df) {
-        if (df.hasOwnProperty(name))
-            names.push(name);
-        if (typeof(df[name]) !== "object" || typeof(df[name].length) === "undefined") {
-            throw new Error("All fields must be arrays");
-        } else if (typeof(length) !== "undefined" && length !== df[name].length) {
-            throw new Error("All fields must be arrays of the same length");
-        }
-        length = df[name].length;
-    }
-    var results = [];
-    var item;
-    for (var row = 0; row < length; row++) {
-        item = {};
-        for (var col = 0; col < names.length; col++) {
-            item[names[col]] = df[names[col]][row];
-        }
-        results.push(item);
-    }
-    return results;
-  };
-
-  window.HTMLWidgets.transposeArray2D = function(array) {
-      if (array.length === 0) return array;
-      var newArray = array[0].map(function(col, i) {
-          return array.map(function(row) {
-              return row[i]
-          })
-      });
-      return newArray;
-  };
-  // Split value at splitChar, but allow splitChar to be escaped
-  // using escapeChar. Any other characters escaped by escapeChar
-  // will be included as usual (including escapeChar itself).
-  function splitWithEscape(value, splitChar, escapeChar) {
-    var results = [];
-    var escapeMode = false;
-    var currentResult = "";
-    for (var pos = 0; pos < value.length; pos++) {
-      if (!escapeMode) {
-        if (value[pos] === splitChar) {
-          results.push(currentResult);
-          currentResult = "";
-        } else if (value[pos] === escapeChar) {
-          escapeMode = true;
-        } else {
-          currentResult += value[pos];
-        }
-      } else {
-        currentResult += value[pos];
-        escapeMode = false;
-      }
-    }
-    if (currentResult !== "") {
-      results.push(currentResult);
-    }
-    return results;
-  }
-  // Function authored by Yihui/JJ Allaire
-  window.HTMLWidgets.evaluateStringMember = function(o, member) {
-    var parts = splitWithEscape(member, '.', '\\');
-    for (var i = 0, l = parts.length; i < l; i++) {
-      var part = parts[i];
-      // part may be a character or 'numeric' member name
-      if (o !== null && typeof o === "object" && part in o) {
-        if (i == (l - 1)) { // if we are at the end of the line then evalulate
-          if (typeof o[part] === "string")
-            o[part] = eval("(" + o[part] + ")");
-        } else { // otherwise continue to next embedded object
-          o = o[part];
-        }
-      }
-    }
-  };
-
-  // Retrieve the HTMLWidget instance (i.e. the return value of an
-  // HTMLWidget binding's initialize() or factory() function)
-  // associated with an element, or null if none.
-  window.HTMLWidgets.getInstance = function(el) {
-    return elementData(el, "init_result");
-  };
-
-  // Finds the first element in the scope that matches the selector,
-  // and returns the HTMLWidget instance (i.e. the return value of
-  // an HTMLWidget binding's initialize() or factory() function)
-  // associated with that element, if any. If no element matches the
-  // selector, or the first matching element has no HTMLWidget
-  // instance associated with it, then null is returned.
-  //
-  // The scope argument is optional, and defaults to window.document.
-  window.HTMLWidgets.find = function(scope, selector) {
-    if (arguments.length == 1) {
-      selector = scope;
-      scope = document;
-    }
-
-    var el = scope.querySelector(selector);
-    if (el === null) {
-      return null;
-    } else {
-      return window.HTMLWidgets.getInstance(el);
-    }
-  };
-
-  // Finds all elements in the scope that match the selector, and
-  // returns the HTMLWidget instances (i.e. the return values of
-  // an HTMLWidget binding's initialize() or factory() function)
-  // associated with the elements, in an array. If elements that
-  // match the selector don't have an associated HTMLWidget
-  // instance, the returned array will contain nulls.
-  //
-  // The scope argument is optional, and defaults to window.document.
-  window.HTMLWidgets.findAll = function(scope, selector) {
-    if (arguments.length == 1) {
-      selector = scope;
-      scope = document;
-    }
-
-    var nodes = scope.querySelectorAll(selector);
-    var results = [];
-    for (var i = 0; i < nodes.length; i++) {
-      results.push(window.HTMLWidgets.getInstance(nodes[i]));
-    }
-    return results;
-  };
-
-  var postRenderHandlers = [];
-  function invokePostRenderHandlers() {
-    while (postRenderHandlers.length) {
-      var handler = postRenderHandlers.shift();
-      if (handler) {
-        handler();
-      }
-    }
-  }
-
-  // Register the given callback function to be invoked after the
-  // next time static widgets are rendered.
-  window.HTMLWidgets.addPostRenderHandler = function(callback) {
-    postRenderHandlers.push(callback);
-  };
-
-  // Takes a new-style instance-bound definition, and returns an
-  // old-style class-bound definition. This saves us from having
-  // to rewrite all the logic in this file to accomodate both
-  // types of definitions.
-  function createLegacyDefinitionAdapter(defn) {
-    var result = {
-      name: defn.name,
-      type: defn.type,
-      initialize: function(el, width, height) {
-        return defn.factory(el, width, height);
-      },
-      renderValue: function(el, x, instance) {
-        return instance.renderValue(x);
-      },
-      resize: function(el, width, height, instance) {
-        return instance.resize(width, height);
-      }
-    };
-
-    if (defn.find)
-      result.find = defn.find;
-    if (defn.renderError)
-      result.renderError = defn.renderError;
-    if (defn.clearError)
-      result.clearError = defn.clearError;
-
-    return result;
-  }
-})();
-
diff --git a/docs/moderndive_files/figure-html/2numxplot1-1.png b/docs/moderndive_files/figure-html/2numxplot1-1.png
index 44dbd280d..89a608b11 100644
Binary files a/docs/moderndive_files/figure-html/2numxplot1-1.png and b/docs/moderndive_files/figure-html/2numxplot1-1.png differ
diff --git a/docs/moderndive_files/figure-html/2numxplot1-repeat-1.png b/docs/moderndive_files/figure-html/2numxplot1-repeat-1.png
index 5ab588d75..a8200c9db 100644
Binary files a/docs/moderndive_files/figure-html/2numxplot1-repeat-1.png and b/docs/moderndive_files/figure-html/2numxplot1-repeat-1.png differ
diff --git a/docs/moderndive_files/figure-html/2numxplot4-1.png b/docs/moderndive_files/figure-html/2numxplot4-1.png
index 0d7609b7d..6ced39a45 100644
Binary files a/docs/moderndive_files/figure-html/2numxplot4-1.png and b/docs/moderndive_files/figure-html/2numxplot4-1.png differ
diff --git a/docs/moderndive_files/figure-html/action-romance-boxplot-1.png b/docs/moderndive_files/figure-html/action-romance-boxplot-1.png
index 81fa17717..799c83bf6 100644
Binary files a/docs/moderndive_files/figure-html/action-romance-boxplot-1.png and b/docs/moderndive_files/figure-html/action-romance-boxplot-1.png differ
diff --git a/docs/moderndive_files/figure-html/alpha-1.png b/docs/moderndive_files/figure-html/alpha-1.png
index 573955b81..d445f0076 100644
Binary files a/docs/moderndive_files/figure-html/alpha-1.png and b/docs/moderndive_files/figure-html/alpha-1.png differ
diff --git a/docs/moderndive_files/figure-html/badbox-1.png b/docs/moderndive_files/figure-html/badbox-1.png
index 3d02603fc..0e38046ca 100644
Binary files a/docs/moderndive_files/figure-html/badbox-1.png and b/docs/moderndive_files/figure-html/badbox-1.png differ
diff --git a/docs/moderndive_files/figure-html/bar-1.png b/docs/moderndive_files/figure-html/bar-1.png
index f15ded4f9..50d22340e 100644
Binary files a/docs/moderndive_files/figure-html/bar-1.png and b/docs/moderndive_files/figure-html/bar-1.png differ
diff --git a/docs/moderndive_files/figure-html/best-fitting-line-1.png b/docs/moderndive_files/figure-html/best-fitting-line-1.png
index cd7ff2e7f..ff9bca20f 100644
Binary files a/docs/moderndive_files/figure-html/best-fitting-line-1.png and b/docs/moderndive_files/figure-html/best-fitting-line-1.png differ
diff --git a/docs/moderndive_files/figure-html/boostrap-distribution-infer-1.png b/docs/moderndive_files/figure-html/boostrap-distribution-infer-1.png
index c310303bb..5d2ebda20 100644
Binary files a/docs/moderndive_files/figure-html/boostrap-distribution-infer-1.png and b/docs/moderndive_files/figure-html/boostrap-distribution-infer-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-1.png
index f49c0c25a..9b157dead 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-CI-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-CI-1.png
index 5a0195d4a..c0e39a77b 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-CI-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-mythbusters-CI-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-part-deux-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-part-deux-1.png
index c8e81f1b8..e25c6122f 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-part-deux-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-part-deux-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-slope-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-slope-1.png
index 297395d21..4c04fd2fd 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-slope-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-slope-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-slope-CI-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-slope-CI-1.png
index 13da30bab..58a8b402f 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-slope-CI-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-slope-CI-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-percentile-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-percentile-1.png
index 5f546b420..8da4caef8 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-percentile-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-percentile-1.png differ
diff --git a/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-se-1.png b/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-se-1.png
index 9abea29e2..648e7acda 100644
Binary files a/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-se-1.png and b/docs/moderndive_files/figure-html/bootstrap-distribution-two-prop-se-1.png differ
diff --git a/docs/moderndive_files/figure-html/boxplot-1.png b/docs/moderndive_files/figure-html/boxplot-1.png
index bda759484..4b47e34b2 100644
Binary files a/docs/moderndive_files/figure-html/boxplot-1.png and b/docs/moderndive_files/figure-html/boxplot-1.png differ
diff --git a/docs/moderndive_files/figure-html/carrierpie-1.png b/docs/moderndive_files/figure-html/carrierpie-1.png
index bdc17c01b..7439ad95f 100644
Binary files a/docs/moderndive_files/figure-html/carrierpie-1.png and b/docs/moderndive_files/figure-html/carrierpie-1.png differ
diff --git a/docs/moderndive_files/figure-html/catxplot0b-1.png b/docs/moderndive_files/figure-html/catxplot0b-1.png
index 86d988fa0..5b514d6a6 100644
Binary files a/docs/moderndive_files/figure-html/catxplot0b-1.png and b/docs/moderndive_files/figure-html/catxplot0b-1.png differ
diff --git a/docs/moderndive_files/figure-html/catxplot1-1.png b/docs/moderndive_files/figure-html/catxplot1-1.png
index e0e252e1d..048bf884b 100644
Binary files a/docs/moderndive_files/figure-html/catxplot1-1.png and b/docs/moderndive_files/figure-html/catxplot1-1.png differ
diff --git a/docs/moderndive_files/figure-html/comparing-diff-means-t-stat-1.png b/docs/moderndive_files/figure-html/comparing-diff-means-t-stat-1.png
index 0d129f135..7e02b1c4f 100644
Binary files a/docs/moderndive_files/figure-html/comparing-diff-means-t-stat-1.png and b/docs/moderndive_files/figure-html/comparing-diff-means-t-stat-1.png differ
diff --git a/docs/moderndive_files/figure-html/comparing-sampling-distributions-1.png b/docs/moderndive_files/figure-html/comparing-sampling-distributions-1.png
index 603034fb2..4cdf43bf8 100644
Binary files a/docs/moderndive_files/figure-html/comparing-sampling-distributions-1.png and b/docs/moderndive_files/figure-html/comparing-sampling-distributions-1.png differ
diff --git a/docs/moderndive_files/figure-html/comparing-sampling-distributions-1b-1.png b/docs/moderndive_files/figure-html/comparing-sampling-distributions-1b-1.png
index 603034fb2..616de412c 100644
Binary files a/docs/moderndive_files/figure-html/comparing-sampling-distributions-1b-1.png and b/docs/moderndive_files/figure-html/comparing-sampling-distributions-1b-1.png differ
diff --git a/docs/moderndive_files/figure-html/comparing-sampling-distributions-2-1.png b/docs/moderndive_files/figure-html/comparing-sampling-distributions-2-1.png
index 91aa07980..fd22cca61 100644
Binary files a/docs/moderndive_files/figure-html/comparing-sampling-distributions-2-1.png and b/docs/moderndive_files/figure-html/comparing-sampling-distributions-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/comparing-sampling-distributions-3-1.png b/docs/moderndive_files/figure-html/comparing-sampling-distributions-3-1.png
index 818ed5186..f2d799ba2 100644
Binary files a/docs/moderndive_files/figure-html/comparing-sampling-distributions-3-1.png and b/docs/moderndive_files/figure-html/comparing-sampling-distributions-3-1.png differ
diff --git a/docs/moderndive_files/figure-html/correlation1-1.png b/docs/moderndive_files/figure-html/correlation1-1.png
index c36effecb..442936080 100644
Binary files a/docs/moderndive_files/figure-html/correlation1-1.png and b/docs/moderndive_files/figure-html/correlation1-1.png differ
diff --git a/docs/moderndive_files/figure-html/credit-limit-quartiles-1.png b/docs/moderndive_files/figure-html/credit-limit-quartiles-1.png
index 066d38b14..83b20adc8 100644
Binary files a/docs/moderndive_files/figure-html/credit-limit-quartiles-1.png and b/docs/moderndive_files/figure-html/credit-limit-quartiles-1.png differ
diff --git a/docs/moderndive_files/figure-html/drinks-smaller-1.png b/docs/moderndive_files/figure-html/drinks-smaller-1.png
index 8135dd307..affd3fd7a 100644
Binary files a/docs/moderndive_files/figure-html/drinks-smaller-1.png and b/docs/moderndive_files/figure-html/drinks-smaller-1.png differ
diff --git a/docs/moderndive_files/figure-html/drinks-smaller-tidy-barplot-1.png b/docs/moderndive_files/figure-html/drinks-smaller-tidy-barplot-1.png
index 8135dd307..21c3c927e 100644
Binary files a/docs/moderndive_files/figure-html/drinks-smaller-tidy-barplot-1.png and b/docs/moderndive_files/figure-html/drinks-smaller-tidy-barplot-1.png differ
diff --git a/docs/moderndive_files/figure-html/equal-variance-residuals-1.png b/docs/moderndive_files/figure-html/equal-variance-residuals-1.png
index 313592fd9..c665c3605 100644
Binary files a/docs/moderndive_files/figure-html/equal-variance-residuals-1.png and b/docs/moderndive_files/figure-html/equal-variance-residuals-1.png differ
diff --git a/docs/moderndive_files/figure-html/facet-bar-vert-1.png b/docs/moderndive_files/figure-html/facet-bar-vert-1.png
index 7613a4ed9..20ce30ffa 100644
Binary files a/docs/moderndive_files/figure-html/facet-bar-vert-1.png and b/docs/moderndive_files/figure-html/facet-bar-vert-1.png differ
diff --git a/docs/moderndive_files/figure-html/facethistogram-1.png b/docs/moderndive_files/figure-html/facethistogram-1.png
index 412fed8bd..eb0679612 100644
Binary files a/docs/moderndive_files/figure-html/facethistogram-1.png and b/docs/moderndive_files/figure-html/facethistogram-1.png differ
diff --git a/docs/moderndive_files/figure-html/facethistogram2-1.png b/docs/moderndive_files/figure-html/facethistogram2-1.png
index 6de6f4a85..835ed997b 100644
Binary files a/docs/moderndive_files/figure-html/facethistogram2-1.png and b/docs/moderndive_files/figure-html/facethistogram2-1.png differ
diff --git a/docs/moderndive_files/figure-html/fitted-values-1.png b/docs/moderndive_files/figure-html/fitted-values-1.png
index 527935aa4..c40cc36d1 100644
Binary files a/docs/moderndive_files/figure-html/fitted-values-1.png and b/docs/moderndive_files/figure-html/fitted-values-1.png differ
diff --git a/docs/moderndive_files/figure-html/flights-dodged-bar-color-1.png b/docs/moderndive_files/figure-html/flights-dodged-bar-color-1.png
index 416f0f951..2b56636cc 100644
Binary files a/docs/moderndive_files/figure-html/flights-dodged-bar-color-1.png and b/docs/moderndive_files/figure-html/flights-dodged-bar-color-1.png differ
diff --git a/docs/moderndive_files/figure-html/flights-stacked-bar-1.png b/docs/moderndive_files/figure-html/flights-stacked-bar-1.png
index 1a5921895..266ced683 100644
Binary files a/docs/moderndive_files/figure-html/flights-stacked-bar-1.png and b/docs/moderndive_files/figure-html/flights-stacked-bar-1.png differ
diff --git a/docs/moderndive_files/figure-html/flights-stacked-bar-color-1.png b/docs/moderndive_files/figure-html/flights-stacked-bar-color-1.png
index ad43c9d38..355025481 100644
Binary files a/docs/moderndive_files/figure-html/flights-stacked-bar-color-1.png and b/docs/moderndive_files/figure-html/flights-stacked-bar-color-1.png differ
diff --git a/docs/moderndive_files/figure-html/flightsbar-1.png b/docs/moderndive_files/figure-html/flightsbar-1.png
index bae030a6e..493403fd3 100644
Binary files a/docs/moderndive_files/figure-html/flightsbar-1.png and b/docs/moderndive_files/figure-html/flightsbar-1.png differ
diff --git a/docs/moderndive_files/figure-html/gain-hist-1.png b/docs/moderndive_files/figure-html/gain-hist-1.png
index 60a9b933d..58a7e4af8 100644
Binary files a/docs/moderndive_files/figure-html/gain-hist-1.png and b/docs/moderndive_files/figure-html/gain-hist-1.png differ
diff --git a/docs/moderndive_files/figure-html/gapminder-1.png b/docs/moderndive_files/figure-html/gapminder-1.png
index e8f59a280..ca3d59da7 100644
Binary files a/docs/moderndive_files/figure-html/gapminder-1.png and b/docs/moderndive_files/figure-html/gapminder-1.png differ
diff --git a/docs/moderndive_files/figure-html/geombar-1.png b/docs/moderndive_files/figure-html/geombar-1.png
index 28611a7ac..6589f6382 100644
Binary files a/docs/moderndive_files/figure-html/geombar-1.png and b/docs/moderndive_files/figure-html/geombar-1.png differ
diff --git a/docs/moderndive_files/figure-html/geomcol-1.png b/docs/moderndive_files/figure-html/geomcol-1.png
index 6e9ffee5c..a4eb7800c 100644
Binary files a/docs/moderndive_files/figure-html/geomcol-1.png and b/docs/moderndive_files/figure-html/geomcol-1.png differ
diff --git a/docs/moderndive_files/figure-html/guat-dem-tidy-1.png b/docs/moderndive_files/figure-html/guat-dem-tidy-1.png
index 7f8212939..63614371e 100644
Binary files a/docs/moderndive_files/figure-html/guat-dem-tidy-1.png and b/docs/moderndive_files/figure-html/guat-dem-tidy-1.png differ
diff --git a/docs/moderndive_files/figure-html/ha-as-flights-boxplot-1.png b/docs/moderndive_files/figure-html/ha-as-flights-boxplot-1.png
index 74a8d64a2..48959e68b 100644
Binary files a/docs/moderndive_files/figure-html/ha-as-flights-boxplot-1.png and b/docs/moderndive_files/figure-html/ha-as-flights-boxplot-1.png differ
diff --git a/docs/moderndive_files/figure-html/hist-1.png b/docs/moderndive_files/figure-html/hist-1.png
index af7cb2bfd..d2f2946d0 100644
Binary files a/docs/moderndive_files/figure-html/hist-1.png and b/docs/moderndive_files/figure-html/hist-1.png differ
diff --git a/docs/moderndive_files/figure-html/hist-bins-1.png b/docs/moderndive_files/figure-html/hist-bins-1.png
index 5a38b9a45..5edf31e2e 100644
Binary files a/docs/moderndive_files/figure-html/hist-bins-1.png and b/docs/moderndive_files/figure-html/hist-bins-1.png differ
diff --git a/docs/moderndive_files/figure-html/hist1a-1.png b/docs/moderndive_files/figure-html/hist1a-1.png
index ad0455d1e..c0ec400b0 100644
Binary files a/docs/moderndive_files/figure-html/hist1a-1.png and b/docs/moderndive_files/figure-html/hist1a-1.png differ
diff --git a/docs/moderndive_files/figure-html/hist1b-1.png b/docs/moderndive_files/figure-html/hist1b-1.png
index 6132a0b52..54fa9f5b0 100644
Binary files a/docs/moderndive_files/figure-html/hist1b-1.png and b/docs/moderndive_files/figure-html/hist1b-1.png differ
diff --git a/docs/moderndive_files/figure-html/histogramexample-1.png b/docs/moderndive_files/figure-html/histogramexample-1.png
index 848609236..863093841 100644
Binary files a/docs/moderndive_files/figure-html/histogramexample-1.png and b/docs/moderndive_files/figure-html/histogramexample-1.png differ
diff --git a/docs/moderndive_files/figure-html/hourlytemp-1.png b/docs/moderndive_files/figure-html/hourlytemp-1.png
index eec509610..ac75350e1 100644
Binary files a/docs/moderndive_files/figure-html/hourlytemp-1.png and b/docs/moderndive_files/figure-html/hourlytemp-1.png differ
diff --git a/docs/moderndive_files/figure-html/house-price-interaction-2-1.png b/docs/moderndive_files/figure-html/house-price-interaction-2-1.png
index 5f825d772..bdd285b48 100644
Binary files a/docs/moderndive_files/figure-html/house-price-interaction-2-1.png and b/docs/moderndive_files/figure-html/house-price-interaction-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/house-price-interaction-3-1.png b/docs/moderndive_files/figure-html/house-price-interaction-3-1.png
index 1b946471b..8a22d429d 100644
Binary files a/docs/moderndive_files/figure-html/house-price-interaction-3-1.png and b/docs/moderndive_files/figure-html/house-price-interaction-3-1.png differ
diff --git a/docs/moderndive_files/figure-html/house-price-parallel-slopes-1.png b/docs/moderndive_files/figure-html/house-price-parallel-slopes-1.png
index 6dd4a40b9..a9543baf9 100644
Binary files a/docs/moderndive_files/figure-html/house-price-parallel-slopes-1.png and b/docs/moderndive_files/figure-html/house-price-parallel-slopes-1.png differ
diff --git a/docs/moderndive_files/figure-html/house-prices-viz-1.png b/docs/moderndive_files/figure-html/house-prices-viz-1.png
index 65c055725..a464a6687 100644
Binary files a/docs/moderndive_files/figure-html/house-prices-viz-1.png and b/docs/moderndive_files/figure-html/house-prices-viz-1.png differ
diff --git a/docs/moderndive_files/figure-html/jitter-1.png b/docs/moderndive_files/figure-html/jitter-1.png
index a12502022..aabb2413b 100644
Binary files a/docs/moderndive_files/figure-html/jitter-1.png and b/docs/moderndive_files/figure-html/jitter-1.png differ
diff --git a/docs/moderndive_files/figure-html/jitter-example-plot-1-1.png b/docs/moderndive_files/figure-html/jitter-example-plot-1-1.png
index 25658f640..a259e61e3 100644
Binary files a/docs/moderndive_files/figure-html/jitter-example-plot-1-1.png and b/docs/moderndive_files/figure-html/jitter-example-plot-1-1.png differ
diff --git a/docs/moderndive_files/figure-html/lifeExp2007hist-1.png b/docs/moderndive_files/figure-html/lifeExp2007hist-1.png
index ed27a3909..daef7c6d9 100644
Binary files a/docs/moderndive_files/figure-html/lifeExp2007hist-1.png and b/docs/moderndive_files/figure-html/lifeExp2007hist-1.png differ
diff --git a/docs/moderndive_files/figure-html/log10-price-viz-1.png b/docs/moderndive_files/figure-html/log10-price-viz-1.png
index 1b36e1085..f1a26d1e1 100644
Binary files a/docs/moderndive_files/figure-html/log10-price-viz-1.png and b/docs/moderndive_files/figure-html/log10-price-viz-1.png differ
diff --git a/docs/moderndive_files/figure-html/log10-size-viz-1.png b/docs/moderndive_files/figure-html/log10-size-viz-1.png
index a88d677b9..2455092a1 100644
Binary files a/docs/moderndive_files/figure-html/log10-size-viz-1.png and b/docs/moderndive_files/figure-html/log10-size-viz-1.png differ
diff --git a/docs/moderndive_files/figure-html/model1residualshist-1.png b/docs/moderndive_files/figure-html/model1residualshist-1.png
index 50da29365..f99e04828 100644
Binary files a/docs/moderndive_files/figure-html/model1residualshist-1.png and b/docs/moderndive_files/figure-html/model1residualshist-1.png differ
diff --git a/docs/moderndive_files/figure-html/monthtempbox-1.png b/docs/moderndive_files/figure-html/monthtempbox-1.png
index 5239b5d76..80707c436 100644
Binary files a/docs/moderndive_files/figure-html/monthtempbox-1.png and b/docs/moderndive_files/figure-html/monthtempbox-1.png differ
diff --git a/docs/moderndive_files/figure-html/noalpha-1.png b/docs/moderndive_files/figure-html/noalpha-1.png
index 42acbc5e3..6aea75bcc 100644
Binary files a/docs/moderndive_files/figure-html/noalpha-1.png and b/docs/moderndive_files/figure-html/noalpha-1.png differ
diff --git a/docs/moderndive_files/figure-html/nolayers-1.png b/docs/moderndive_files/figure-html/nolayers-1.png
index 96e1c79b6..996fd9e9b 100644
Binary files a/docs/moderndive_files/figure-html/nolayers-1.png and b/docs/moderndive_files/figure-html/nolayers-1.png differ
diff --git a/docs/moderndive_files/figure-html/non-linear-1.png b/docs/moderndive_files/figure-html/non-linear-1.png
index 752c324c3..b1ad31d68 100644
Binary files a/docs/moderndive_files/figure-html/non-linear-1.png and b/docs/moderndive_files/figure-html/non-linear-1.png differ
diff --git a/docs/moderndive_files/figure-html/normal-curves-1.png b/docs/moderndive_files/figure-html/normal-curves-1.png
index 0cd7b177e..66682fae9 100644
Binary files a/docs/moderndive_files/figure-html/normal-curves-1.png and b/docs/moderndive_files/figure-html/normal-curves-1.png differ
diff --git a/docs/moderndive_files/figure-html/normal-residuals-1.png b/docs/moderndive_files/figure-html/normal-residuals-1.png
index ae83309c9..127ee3f1f 100644
Binary files a/docs/moderndive_files/figure-html/normal-residuals-1.png and b/docs/moderndive_files/figure-html/normal-residuals-1.png differ
diff --git a/docs/moderndive_files/figure-html/normal-rule-of-thumb-1.png b/docs/moderndive_files/figure-html/normal-rule-of-thumb-1.png
index 5007d5fed..8b7d721fd 100644
Binary files a/docs/moderndive_files/figure-html/normal-rule-of-thumb-1.png and b/docs/moderndive_files/figure-html/normal-rule-of-thumb-1.png differ
diff --git a/docs/moderndive_files/figure-html/nov1-1.png b/docs/moderndive_files/figure-html/nov1-1.png
index 601ee09c0..9e1e80255 100644
Binary files a/docs/moderndive_files/figure-html/nov1-1.png and b/docs/moderndive_files/figure-html/nov1-1.png differ
diff --git a/docs/moderndive_files/figure-html/nov2-1.png b/docs/moderndive_files/figure-html/nov2-1.png
index 70d49a25c..e6fe939e2 100644
Binary files a/docs/moderndive_files/figure-html/nov2-1.png and b/docs/moderndive_files/figure-html/nov2-1.png differ
diff --git a/docs/moderndive_files/figure-html/null-distribution-1-1.png b/docs/moderndive_files/figure-html/null-distribution-1-1.png
index cc203050e..62d959f53 100644
Binary files a/docs/moderndive_files/figure-html/null-distribution-1-1.png and b/docs/moderndive_files/figure-html/null-distribution-1-1.png differ
diff --git a/docs/moderndive_files/figure-html/null-distribution-2-1.png b/docs/moderndive_files/figure-html/null-distribution-2-1.png
index 3a50c7503..0ca73e118 100644
Binary files a/docs/moderndive_files/figure-html/null-distribution-2-1.png and b/docs/moderndive_files/figure-html/null-distribution-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/null-distribution-infer-1.png b/docs/moderndive_files/figure-html/null-distribution-infer-1.png
index d0cb1580b..22f0a80fa 100644
Binary files a/docs/moderndive_files/figure-html/null-distribution-infer-1.png and b/docs/moderndive_files/figure-html/null-distribution-infer-1.png differ
diff --git a/docs/moderndive_files/figure-html/null-distribution-infer-2-1.png b/docs/moderndive_files/figure-html/null-distribution-infer-2-1.png
index 53f870af4..5bc9f97f7 100644
Binary files a/docs/moderndive_files/figure-html/null-distribution-infer-2-1.png and b/docs/moderndive_files/figure-html/null-distribution-infer-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/null-distribution-movies-2-1.png b/docs/moderndive_files/figure-html/null-distribution-movies-2-1.png
index 150d4f650..05690f174 100644
Binary files a/docs/moderndive_files/figure-html/null-distribution-movies-2-1.png and b/docs/moderndive_files/figure-html/null-distribution-movies-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/null-distribution-slope-1.png b/docs/moderndive_files/figure-html/null-distribution-slope-1.png
index 5d1571c5e..c1cd74580 100644
Binary files a/docs/moderndive_files/figure-html/null-distribution-slope-1.png and b/docs/moderndive_files/figure-html/null-distribution-slope-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxcatx-comparison-1.png b/docs/moderndive_files/figure-html/numxcatx-comparison-1.png
index a6900e9e6..cb8c28408 100644
Binary files a/docs/moderndive_files/figure-html/numxcatx-comparison-1.png and b/docs/moderndive_files/figure-html/numxcatx-comparison-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxcatx-comparison-2-1.png b/docs/moderndive_files/figure-html/numxcatx-comparison-2-1.png
index a10ddda14..e051f09ec 100644
Binary files a/docs/moderndive_files/figure-html/numxcatx-comparison-2-1.png and b/docs/moderndive_files/figure-html/numxcatx-comparison-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxcatx-parallel-1.png b/docs/moderndive_files/figure-html/numxcatx-parallel-1.png
index c0fce52dc..a5cb75a5f 100644
Binary files a/docs/moderndive_files/figure-html/numxcatx-parallel-1.png and b/docs/moderndive_files/figure-html/numxcatx-parallel-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxcatxplot1-1.png b/docs/moderndive_files/figure-html/numxcatxplot1-1.png
index 7e27fe22d..6659fc2ec 100644
Binary files a/docs/moderndive_files/figure-html/numxcatxplot1-1.png and b/docs/moderndive_files/figure-html/numxcatxplot1-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxplot1-1.png b/docs/moderndive_files/figure-html/numxplot1-1.png
index 5696dd6de..72aa0f8f8 100644
Binary files a/docs/moderndive_files/figure-html/numxplot1-1.png and b/docs/moderndive_files/figure-html/numxplot1-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxplot2-1.png b/docs/moderndive_files/figure-html/numxplot2-1.png
index ab3210fa0..ab6548f6f 100644
Binary files a/docs/moderndive_files/figure-html/numxplot2-1.png and b/docs/moderndive_files/figure-html/numxplot2-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxplot3-1.png b/docs/moderndive_files/figure-html/numxplot3-1.png
index 35f7f3778..1fdd68cd4 100644
Binary files a/docs/moderndive_files/figure-html/numxplot3-1.png and b/docs/moderndive_files/figure-html/numxplot3-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxplot4-1.png b/docs/moderndive_files/figure-html/numxplot4-1.png
index 457f26f26..2d4cffe80 100644
Binary files a/docs/moderndive_files/figure-html/numxplot4-1.png and b/docs/moderndive_files/figure-html/numxplot4-1.png differ
diff --git a/docs/moderndive_files/figure-html/numxplot6-1.png b/docs/moderndive_files/figure-html/numxplot6-1.png
index 54d1ddd5d..24e08dd2d 100644
Binary files a/docs/moderndive_files/figure-html/numxplot6-1.png and b/docs/moderndive_files/figure-html/numxplot6-1.png differ
diff --git a/docs/moderndive_files/figure-html/one-thousand-sample-means-1.png b/docs/moderndive_files/figure-html/one-thousand-sample-means-1.png
index 2cc94cce5..8ea81cbe4 100644
Binary files a/docs/moderndive_files/figure-html/one-thousand-sample-means-1.png and b/docs/moderndive_files/figure-html/one-thousand-sample-means-1.png differ
diff --git a/docs/moderndive_files/figure-html/orig-and-resample-means-1.png b/docs/moderndive_files/figure-html/orig-and-resample-means-1.png
index a5e0befde..3d74692d5 100644
Binary files a/docs/moderndive_files/figure-html/orig-and-resample-means-1.png and b/docs/moderndive_files/figure-html/orig-and-resample-means-1.png differ
diff --git a/docs/moderndive_files/figure-html/origandresample-1.png b/docs/moderndive_files/figure-html/origandresample-1.png
index da1beb135..c02d746bd 100644
Binary files a/docs/moderndive_files/figure-html/origandresample-1.png and b/docs/moderndive_files/figure-html/origandresample-1.png differ
diff --git a/docs/moderndive_files/figure-html/p-value-slope-1.png b/docs/moderndive_files/figure-html/p-value-slope-1.png
index 354940e53..e14d55cd4 100644
Binary files a/docs/moderndive_files/figure-html/p-value-slope-1.png and b/docs/moderndive_files/figure-html/p-value-slope-1.png differ
diff --git a/docs/moderndive_files/figure-html/pennies-sample-histogram-1.png b/docs/moderndive_files/figure-html/pennies-sample-histogram-1.png
index 8d81df2de..989f42cdd 100644
Binary files a/docs/moderndive_files/figure-html/pennies-sample-histogram-1.png and b/docs/moderndive_files/figure-html/pennies-sample-histogram-1.png differ
diff --git a/docs/moderndive_files/figure-html/percentile-and-se-method-1.png b/docs/moderndive_files/figure-html/percentile-and-se-method-1.png
index a1f789761..3da73b7b1 100644
Binary files a/docs/moderndive_files/figure-html/percentile-and-se-method-1.png and b/docs/moderndive_files/figure-html/percentile-and-se-method-1.png differ
diff --git a/docs/moderndive_files/figure-html/percentile-ci-viz-1.png b/docs/moderndive_files/figure-html/percentile-ci-viz-1.png
index b2b70bf65..c67988f9f 100644
Binary files a/docs/moderndive_files/figure-html/percentile-ci-viz-1.png and b/docs/moderndive_files/figure-html/percentile-ci-viz-1.png differ
diff --git a/docs/moderndive_files/figure-html/percentile-method-1.png b/docs/moderndive_files/figure-html/percentile-method-1.png
index c891a9e36..c31ee8acd 100644
Binary files a/docs/moderndive_files/figure-html/percentile-method-1.png and b/docs/moderndive_files/figure-html/percentile-method-1.png differ
diff --git a/docs/moderndive_files/figure-html/promotions-barplot-1.png b/docs/moderndive_files/figure-html/promotions-barplot-1.png
index f5d62f6f7..764be760a 100644
Binary files a/docs/moderndive_files/figure-html/promotions-barplot-1.png and b/docs/moderndive_files/figure-html/promotions-barplot-1.png differ
diff --git a/docs/moderndive_files/figure-html/promotions-barplot-permuted-1.png b/docs/moderndive_files/figure-html/promotions-barplot-permuted-1.png
index 877e97713..fd0bdcb41 100644
Binary files a/docs/moderndive_files/figure-html/promotions-barplot-permuted-1.png and b/docs/moderndive_files/figure-html/promotions-barplot-permuted-1.png differ
diff --git a/docs/moderndive_files/figure-html/pvaloneprop-1.png b/docs/moderndive_files/figure-html/pvaloneprop-1.png
index d60fbdde1..dc8cb6215 100644
Binary files a/docs/moderndive_files/figure-html/pvaloneprop-1.png and b/docs/moderndive_files/figure-html/pvaloneprop-1.png differ
diff --git a/docs/moderndive_files/figure-html/qqplotmean-1.png b/docs/moderndive_files/figure-html/qqplotmean-1.png
index 35ce01037..27137f48f 100644
Binary files a/docs/moderndive_files/figure-html/qqplotmean-1.png and b/docs/moderndive_files/figure-html/qqplotmean-1.png differ
diff --git a/docs/moderndive_files/figure-html/recall-parallel-vs-interaction-1.png b/docs/moderndive_files/figure-html/recall-parallel-vs-interaction-1.png
index a6900e9e6..ea01e3d08 100644
Binary files a/docs/moderndive_files/figure-html/recall-parallel-vs-interaction-1.png and b/docs/moderndive_files/figure-html/recall-parallel-vs-interaction-1.png differ
diff --git a/docs/moderndive_files/figure-html/regline-1.png b/docs/moderndive_files/figure-html/regline-1.png
index 35f7f3778..09e913202 100644
Binary files a/docs/moderndive_files/figure-html/regline-1.png and b/docs/moderndive_files/figure-html/regline-1.png differ
diff --git a/docs/moderndive_files/figure-html/reliable-percentile-1.png b/docs/moderndive_files/figure-html/reliable-percentile-1.png
index 8bf358c06..92020c5cf 100644
Binary files a/docs/moderndive_files/figure-html/reliable-percentile-1.png and b/docs/moderndive_files/figure-html/reliable-percentile-1.png differ
diff --git a/docs/moderndive_files/figure-html/reliable-percentile-80-95-99-1.png b/docs/moderndive_files/figure-html/reliable-percentile-80-95-99-1.png
index e233e8188..da6643d7c 100644
Binary files a/docs/moderndive_files/figure-html/reliable-percentile-80-95-99-1.png and b/docs/moderndive_files/figure-html/reliable-percentile-80-95-99-1.png differ
diff --git a/docs/moderndive_files/figure-html/reliable-percentile-n-25-50-100-1.png b/docs/moderndive_files/figure-html/reliable-percentile-n-25-50-100-1.png
index 3ecc0c5e5..2b81542b9 100644
Binary files a/docs/moderndive_files/figure-html/reliable-percentile-n-25-50-100-1.png and b/docs/moderndive_files/figure-html/reliable-percentile-n-25-50-100-1.png differ
diff --git a/docs/moderndive_files/figure-html/reliable-se-1.png b/docs/moderndive_files/figure-html/reliable-se-1.png
index 5f4ab9860..c6adae61f 100644
Binary files a/docs/moderndive_files/figure-html/reliable-se-1.png and b/docs/moderndive_files/figure-html/reliable-se-1.png differ
diff --git a/docs/moderndive_files/figure-html/residual-example-1.png b/docs/moderndive_files/figure-html/residual-example-1.png
index d0f0f3ab8..f7436aecf 100644
Binary files a/docs/moderndive_files/figure-html/residual-example-1.png and b/docs/moderndive_files/figure-html/residual-example-1.png differ
diff --git a/docs/moderndive_files/figure-html/sampling-distribution-part-deux-1.png b/docs/moderndive_files/figure-html/sampling-distribution-part-deux-1.png
index 08e7d8f28..14644e55e 100644
Binary files a/docs/moderndive_files/figure-html/sampling-distribution-part-deux-1.png and b/docs/moderndive_files/figure-html/sampling-distribution-part-deux-1.png differ
diff --git a/docs/moderndive_files/figure-html/samplingdistribution-tactile-1.png b/docs/moderndive_files/figure-html/samplingdistribution-tactile-1.png
index a96b2c95c..1f253a341 100644
Binary files a/docs/moderndive_files/figure-html/samplingdistribution-tactile-1.png and b/docs/moderndive_files/figure-html/samplingdistribution-tactile-1.png differ
diff --git a/docs/moderndive_files/figure-html/samplingdistribution-virtual-1.png b/docs/moderndive_files/figure-html/samplingdistribution-virtual-1.png
index d70213459..4896382c5 100644
Binary files a/docs/moderndive_files/figure-html/samplingdistribution-virtual-1.png and b/docs/moderndive_files/figure-html/samplingdistribution-virtual-1.png differ
diff --git a/docs/moderndive_files/figure-html/samplingdistribution-virtual-1000-1.png b/docs/moderndive_files/figure-html/samplingdistribution-virtual-1000-1.png
index ca9731446..36b676052 100644
Binary files a/docs/moderndive_files/figure-html/samplingdistribution-virtual-1000-1.png and b/docs/moderndive_files/figure-html/samplingdistribution-virtual-1000-1.png differ
diff --git a/docs/moderndive_files/figure-html/se-ci-viz-1.png b/docs/moderndive_files/figure-html/se-ci-viz-1.png
index 88b2d13dc..9cb3d404a 100644
Binary files a/docs/moderndive_files/figure-html/se-ci-viz-1.png and b/docs/moderndive_files/figure-html/se-ci-viz-1.png differ
diff --git a/docs/moderndive_files/figure-html/shovel-bootstrap-1-infer-1.png b/docs/moderndive_files/figure-html/shovel-bootstrap-1-infer-1.png
index 28b3eab51..221266c4a 100644
Binary files a/docs/moderndive_files/figure-html/shovel-bootstrap-1-infer-1.png and b/docs/moderndive_files/figure-html/shovel-bootstrap-1-infer-1.png differ
diff --git a/docs/moderndive_files/figure-html/side-by-side-1.png b/docs/moderndive_files/figure-html/side-by-side-1.png
index a49eb83b3..3ee6c5f88 100644
Binary files a/docs/moderndive_files/figure-html/side-by-side-1.png and b/docs/moderndive_files/figure-html/side-by-side-1.png differ
diff --git a/docs/moderndive_files/figure-html/stacked_bar-1.png b/docs/moderndive_files/figure-html/stacked_bar-1.png
index 2d582c529..8819cf5d5 100644
Binary files a/docs/moderndive_files/figure-html/stacked_bar-1.png and b/docs/moderndive_files/figure-html/stacked_bar-1.png differ
diff --git a/docs/moderndive_files/figure-html/t-distributions-1.png b/docs/moderndive_files/figure-html/t-distributions-1.png
index 197a5c4e3..291503a06 100644
Binary files a/docs/moderndive_files/figure-html/t-distributions-1.png and b/docs/moderndive_files/figure-html/t-distributions-1.png differ
diff --git a/docs/moderndive_files/figure-html/t-stat-3-1.png b/docs/moderndive_files/figure-html/t-stat-3-1.png
index 0ec9a18fe..651ab04fe 100644
Binary files a/docs/moderndive_files/figure-html/t-stat-3-1.png and b/docs/moderndive_files/figure-html/t-stat-3-1.png differ
diff --git a/docs/moderndive_files/figure-html/t-stat-4-1.png b/docs/moderndive_files/figure-html/t-stat-4-1.png
index e43b8f82d..a41bcaf11 100644
Binary files a/docs/moderndive_files/figure-html/t-stat-4-1.png and b/docs/moderndive_files/figure-html/t-stat-4-1.png differ
diff --git a/docs/moderndive_files/figure-html/tactile-conf-int-1.png b/docs/moderndive_files/figure-html/tactile-conf-int-1.png
index 67f118709..a92442862 100644
Binary files a/docs/moderndive_files/figure-html/tactile-conf-int-1.png and b/docs/moderndive_files/figure-html/tactile-conf-int-1.png differ
diff --git a/docs/moderndive_files/figure-html/tactile-resampling-6-1.png b/docs/moderndive_files/figure-html/tactile-resampling-6-1.png
index e34a67e61..f91893ecb 100644
Binary files a/docs/moderndive_files/figure-html/tactile-resampling-6-1.png and b/docs/moderndive_files/figure-html/tactile-resampling-6-1.png differ
diff --git a/docs/moderndive_files/figure-html/tactile-resampling-7-1.png b/docs/moderndive_files/figure-html/tactile-resampling-7-1.png
index 1df51974e..e8d47a586 100644
Binary files a/docs/moderndive_files/figure-html/tactile-resampling-7-1.png and b/docs/moderndive_files/figure-html/tactile-resampling-7-1.png differ
diff --git a/docs/moderndive_files/figure-html/tactile-vs-virtual-1.png b/docs/moderndive_files/figure-html/tactile-vs-virtual-1.png
index af833df99..33eec4717 100644
Binary files a/docs/moderndive_files/figure-html/tactile-vs-virtual-1.png and b/docs/moderndive_files/figure-html/tactile-vs-virtual-1.png differ
diff --git a/docs/moderndive_files/figure-html/temp-on-line-1.png b/docs/moderndive_files/figure-html/temp-on-line-1.png
index 9c005fc85..50c527590 100644
Binary files a/docs/moderndive_files/figure-html/temp-on-line-1.png and b/docs/moderndive_files/figure-html/temp-on-line-1.png differ
diff --git a/docs/moderndive_files/figure-html/three-lines-1.png b/docs/moderndive_files/figure-html/three-lines-1.png
index ec175265a..3ae897f6e 100644
Binary files a/docs/moderndive_files/figure-html/three-lines-1.png and b/docs/moderndive_files/figure-html/three-lines-1.png differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-482-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-482-1.png
deleted file mode 100644
index 29467bd4d..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-482-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-483-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-483-1.png
deleted file mode 100644
index 1aec660c1..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-483-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-487-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-487-1.png
deleted file mode 100644
index fcbedb828..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-487-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-490-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-490-1.png
deleted file mode 100644
index 499f0f636..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-490-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-491-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-491-1.png
deleted file mode 100644
index f59cce967..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-491-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-495-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-495-1.png
deleted file mode 100644
index 4dae6c3c4..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-495-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-500-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-500-1.png
deleted file mode 100644
index 4091a9736..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-500-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-501-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-501-1.png
deleted file mode 100644
index 3a3707bec..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-501-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-505-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-505-1.png
deleted file mode 100644
index 1c1f5fc57..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-505-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-509-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-509-1.png
deleted file mode 100644
index 21bddda68..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-509-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-510-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-510-1.png
deleted file mode 100644
index 7de79b15a..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-510-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-514-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-514-1.png
deleted file mode 100644
index 98c9845dc..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-514-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-518-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-518-1.png
deleted file mode 100644
index b84726e4b..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-518-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-519-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-519-1.png
deleted file mode 100644
index 3b01141bd..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-519-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-523-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-523-1.png
deleted file mode 100644
index 4d0fea1a7..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-523-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-529-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-529-1.png
deleted file mode 100644
index f6603f5cf..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-529-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-537-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-537-1.png
deleted file mode 100644
index b47a7536e..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-537-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/unnamed-chunk-538-1.png b/docs/moderndive_files/figure-html/unnamed-chunk-538-1.png
deleted file mode 100644
index 8fb331cba..000000000
Binary files a/docs/moderndive_files/figure-html/unnamed-chunk-538-1.png and /dev/null differ
diff --git a/docs/moderndive_files/figure-html/us-births-1.png b/docs/moderndive_files/figure-html/us-births-1.png
index 61bb14963..a2f45b91f 100644
Binary files a/docs/moderndive_files/figure-html/us-births-1.png and b/docs/moderndive_files/figure-html/us-births-1.png differ
diff --git a/docs/moderndive_files/figure-html/weather-histogram-1.png b/docs/moderndive_files/figure-html/weather-histogram-1.png
index 454a31c9c..62a4311a8 100644
Binary files a/docs/moderndive_files/figure-html/weather-histogram-1.png and b/docs/moderndive_files/figure-html/weather-histogram-1.png differ
diff --git a/docs/moderndive_files/figure-html/weather-histogram-2-1.png b/docs/moderndive_files/figure-html/weather-histogram-2-1.png
index 2e97735e9..8dc94ad04 100644
Binary files a/docs/moderndive_files/figure-html/weather-histogram-2-1.png and b/docs/moderndive_files/figure-html/weather-histogram-2-1.png differ
diff --git a/docs/moderndive_files/figure-html/zcurve-1.png b/docs/moderndive_files/figure-html/zcurve-1.png
index 09b1ab57e..28727faec 100644
Binary files a/docs/moderndive_files/figure-html/zcurve-1.png and b/docs/moderndive_files/figure-html/zcurve-1.png differ
diff --git a/docs/references.html b/docs/references.html
index 3d7d7b27e..a648cc161 100644
--- a/docs/references.html
+++ b/docs/references.html
@@ -6,14 +6,14 @@
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <title>References | Statistical Inference via Data Science</title>
   <meta name="description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="generator" content="bookdown 0.11 and GitBook 2.6.7" />
+  <meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
 
   <meta property="og:title" content="References | Statistical Inference via Data Science" />
   <meta property="og:type" content="book" />
   <meta property="og:url" content="https://moderndive.com/" />
   <meta property="og:image" content="https://moderndive.com/images/logos/book_cover.png" />
   <meta property="og:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
-  <meta name="github-repo" content="moderndive/moderndive_book" />
+  <meta name="github-repo" content="moderndive/ModernDive_book" />
 
   <meta name="twitter:card" content="summary" />
   <meta name="twitter:title" content="References | Statistical Inference via Data Science" />
@@ -21,17 +21,17 @@
   <meta name="twitter:description" content="An open-source and fully-reproducible electronic textbook for teaching statistical inference using tidyverse data science tools." />
   <meta name="twitter:image" content="https://moderndive.com/images/logos/book_cover.png" />
 
-<meta name="author" content="Chester Ismay and Albert Y. Kim" />
+<meta name="author" content="Chester Ismay and Albert Y. Kim   Foreword by Kelly S. McConville" />
 
 
-<meta name="date" content="2019-08-28" />
+<meta name="date" content="2019-11-25" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="apple-mobile-web-app-capable" content="yes" />
   <meta name="apple-mobile-web-app-status-bar-style" content="black" />
   <link rel="apple-touch-icon-precomposed" sizes="152x152" href="images/logos/favicons/apple-touch-icon.png" />
   <link rel="shortcut icon" href="images/logos/favicons/favicon.ico" type="image/x-icon" />
-<link rel="prev" href="E-appendixE.html">
+<link rel="prev" href="E-appendixE.html"/>
 
 <script src="libs/jquery-2.2.3/jquery.min.js"></script>
 <link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
@@ -40,6 +40,9 @@
 <link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
 <link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
+<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
+
+
 
 
 
@@ -48,7 +51,7 @@
 
 
 <script src="libs/kePrint-0.0.1/kePrint.js"></script>
-<script src="libs/htmlwidgets-1.3/htmlwidgets.js"></script>
+<script src="libs/htmlwidgets-1.5.1/htmlwidgets.js"></script>
 <link href="libs/dygraphs-1.1.1/dygraph.css" rel="stylesheet" />
 <script src="libs/dygraphs-1.1.1/dygraph-combined.js"></script>
 <script src="libs/dygraphs-1.1.1/shapes.js"></script>
@@ -74,7 +77,6 @@
 a.sourceLine:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
 pre.sourceCode { margin: 0; }
 @media screen {
 div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@
       <nav role="navigation">
 
 <ul class="summary">
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Preface</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#resources"><i class="fa fa-check"></i>Resources</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
-</ul></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
-<li class="chapter" data-level="" data-path="index.html"><a href="index.html#about-the-authors"><i class="fa fa-check"></i>About the authors</a></li>
-</ul></li>
+<li class="chapter" data-level="" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i>Special Announcement</a></li>
+<li class="chapter" data-level="" data-path="foreword.html"><a href="foreword.html"><i class="fa fa-check"></i>Foreword</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-students"><i class="fa fa-check"></i>Introduction for students</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-we-hope-you-will-learn-from-this-book"><i class="fa fa-check"></i>What we hope you will learn from this book</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#datascience-pipeline"><i class="fa fa-check"></i>Data/science pipeline</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#reproducible-research"><i class="fa fa-check"></i>Reproducible research</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#final-note-for-students"><i class="fa fa-check"></i>Final note for students</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#introduction-for-instructors"><i class="fa fa-check"></i>Introduction for instructors</a><ul>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#resources"><i class="fa fa-check"></i>Resources</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-did-we-write-this-book"><i class="fa fa-check"></i>Why did we write this book?</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#who-is-this-book-for"><i class="fa fa-check"></i>Who is this book for?</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#connect-and-contribute"><i class="fa fa-check"></i>Connect and contribute</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgements"><i class="fa fa-check"></i>Acknowledgements</a></li>
+<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#about-this-book"><i class="fa fa-check"></i>About this book</a></li>
+</ul></li>
+<li class="chapter" data-level="" data-path="about-the-authors.html"><a href="about-the-authors.html"><i class="fa fa-check"></i>About the authors</a></li>
 <li class="chapter" data-level="1" data-path="1-getting-started.html"><a href="1-getting-started.html"><i class="fa fa-check"></i><b>1</b> Getting Started with Data in R</a><ul>
 <li class="chapter" data-level="1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#r-rstudio"><i class="fa fa-check"></i><b>1.1</b> What are R and RStudio?</a><ul>
-<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing-r-and-rstudio"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
+<li class="chapter" data-level="1.1.1" data-path="1-getting-started.html"><a href="1-getting-started.html#installing"><i class="fa fa-check"></i><b>1.1.1</b> Installing R and RStudio</a></li>
 <li class="chapter" data-level="1.1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#using-r-via-rstudio"><i class="fa fa-check"></i><b>1.1.2</b> Using R via RStudio</a></li>
 </ul></li>
 <li class="chapter" data-level="1.2" data-path="1-getting-started.html"><a href="1-getting-started.html#code"><i class="fa fa-check"></i><b>1.2</b> How do I code in R?</a><ul>
@@ -180,7 +185,7 @@
 <li class="chapter" data-level="1.4.1" data-path="1-getting-started.html"><a href="1-getting-started.html#nycflights13-package"><i class="fa fa-check"></i><b>1.4.1</b> <code>nycflights13</code> package</a></li>
 <li class="chapter" data-level="1.4.2" data-path="1-getting-started.html"><a href="1-getting-started.html#flights-data-frame"><i class="fa fa-check"></i><b>1.4.2</b> <code>flights</code> data frame</a></li>
 <li class="chapter" data-level="1.4.3" data-path="1-getting-started.html"><a href="1-getting-started.html#exploredataframes"><i class="fa fa-check"></i><b>1.4.3</b> Exploring data frames</a></li>
-<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification &amp; measurement variables</a></li>
+<li class="chapter" data-level="1.4.4" data-path="1-getting-started.html"><a href="1-getting-started.html#identification-vs-measurement-variables"><i class="fa fa-check"></i><b>1.4.4</b> Identification and measurement variables</a></li>
 <li class="chapter" data-level="1.4.5" data-path="1-getting-started.html"><a href="1-getting-started.html#help-files"><i class="fa fa-check"></i><b>1.4.5</b> Help files</a></li>
 </ul></li>
 <li class="chapter" data-level="1.5" data-path="1-getting-started.html"><a href="1-getting-started.html#conclusion"><i class="fa fa-check"></i><b>1.5</b> Conclusion</a><ul>
@@ -188,37 +193,37 @@
 <li class="chapter" data-level="1.5.2" data-path="1-getting-started.html"><a href="1-getting-started.html#whats-to-come"><i class="fa fa-check"></i><b>1.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>I Data Science via the tidyverse</b></span></li>
+<li class="part"><span><b>I Data Science with tidyverse</b></span></li>
 <li class="chapter" data-level="2" data-path="2-viz.html"><a href="2-viz.html"><i class="fa fa-check"></i><b>2</b> Data Visualization</a><ul>
 <li class="chapter" data-level="" data-path="2-viz.html"><a href="2-viz.html#needed-packages"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The Grammar of Graphics</a><ul>
-<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the Grammar</a></li>
+<li class="chapter" data-level="2.1" data-path="2-viz.html"><a href="2-viz.html#grammarofgraphics"><i class="fa fa-check"></i><b>2.1</b> The grammar of graphics</a><ul>
+<li class="chapter" data-level="2.1.1" data-path="2-viz.html"><a href="2-viz.html#components-of-the-grammar"><i class="fa fa-check"></i><b>2.1.1</b> Components of the grammar</a></li>
 <li class="chapter" data-level="2.1.2" data-path="2-viz.html"><a href="2-viz.html#gapminder"><i class="fa fa-check"></i><b>2.1.2</b> Gapminder data</a></li>
 <li class="chapter" data-level="2.1.3" data-path="2-viz.html"><a href="2-viz.html#other-components"><i class="fa fa-check"></i><b>2.1.3</b> Other components</a></li>
 <li class="chapter" data-level="2.1.4" data-path="2-viz.html"><a href="2-viz.html#ggplot2-package"><i class="fa fa-check"></i><b>2.1.4</b> ggplot2 package</a></li>
 </ul></li>
-<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five Named Graphs - The 5NG</a></li>
+<li class="chapter" data-level="2.2" data-path="2-viz.html"><a href="2-viz.html#FiveNG"><i class="fa fa-check"></i><b>2.2</b> Five named graphs - the 5NG</a></li>
 <li class="chapter" data-level="2.3" data-path="2-viz.html"><a href="2-viz.html#scatterplots"><i class="fa fa-check"></i><b>2.3</b> 5NG#1: Scatterplots</a><ul>
-<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via geom_point</a></li>
-<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Over-plotting</a></li>
+<li class="chapter" data-level="2.3.1" data-path="2-viz.html"><a href="2-viz.html#geompoint"><i class="fa fa-check"></i><b>2.3.1</b> Scatterplots via <code>geom_point</code></a></li>
+<li class="chapter" data-level="2.3.2" data-path="2-viz.html"><a href="2-viz.html#overplotting"><i class="fa fa-check"></i><b>2.3.2</b> Overplotting</a></li>
 <li class="chapter" data-level="2.3.3" data-path="2-viz.html"><a href="2-viz.html#summary"><i class="fa fa-check"></i><b>2.3.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.4" data-path="2-viz.html"><a href="2-viz.html#linegraphs"><i class="fa fa-check"></i><b>2.4</b> 5NG#2: Linegraphs</a><ul>
-<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via geom_line</a></li>
+<li class="chapter" data-level="2.4.1" data-path="2-viz.html"><a href="2-viz.html#geomline"><i class="fa fa-check"></i><b>2.4.1</b> Linegraphs via <code>geom_line</code></a></li>
 <li class="chapter" data-level="2.4.2" data-path="2-viz.html"><a href="2-viz.html#summary-1"><i class="fa fa-check"></i><b>2.4.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.5" data-path="2-viz.html"><a href="2-viz.html#histograms"><i class="fa fa-check"></i><b>2.5</b> 5NG#3: Histograms</a><ul>
-<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via geom_histogram</a></li>
+<li class="chapter" data-level="2.5.1" data-path="2-viz.html"><a href="2-viz.html#geomhistogram"><i class="fa fa-check"></i><b>2.5.1</b> Histograms via <code>geom_histogram</code></a></li>
 <li class="chapter" data-level="2.5.2" data-path="2-viz.html"><a href="2-viz.html#adjustbins"><i class="fa fa-check"></i><b>2.5.2</b> Adjusting the bins</a></li>
 <li class="chapter" data-level="2.5.3" data-path="2-viz.html"><a href="2-viz.html#summary-2"><i class="fa fa-check"></i><b>2.5.3</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.6" data-path="2-viz.html"><a href="2-viz.html#facets"><i class="fa fa-check"></i><b>2.6</b> Facets</a></li>
 <li class="chapter" data-level="2.7" data-path="2-viz.html"><a href="2-viz.html#boxplots"><i class="fa fa-check"></i><b>2.7</b> 5NG#4: Boxplots</a><ul>
-<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via geom_boxplot</a></li>
+<li class="chapter" data-level="2.7.1" data-path="2-viz.html"><a href="2-viz.html#geomboxplot"><i class="fa fa-check"></i><b>2.7.1</b> Boxplots via <code>geom_boxplot</code></a></li>
 <li class="chapter" data-level="2.7.2" data-path="2-viz.html"><a href="2-viz.html#summary-3"><i class="fa fa-check"></i><b>2.7.2</b> Summary</a></li>
 </ul></li>
 <li class="chapter" data-level="2.8" data-path="2-viz.html"><a href="2-viz.html#geombar"><i class="fa fa-check"></i><b>2.8</b> 5NG#5: Barplots</a><ul>
-<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via geom_bar or geom_col</a></li>
+<li class="chapter" data-level="2.8.1" data-path="2-viz.html"><a href="2-viz.html#barplots-via-geom_bar-or-geom_col"><i class="fa fa-check"></i><b>2.8.1</b> Barplots via <code>geom_bar</code> or <code>geom_col</code></a></li>
 <li class="chapter" data-level="2.8.2" data-path="2-viz.html"><a href="2-viz.html#must-avoid-pie-charts"><i class="fa fa-check"></i><b>2.8.2</b> Must avoid pie charts!</a></li>
 <li class="chapter" data-level="2.8.3" data-path="2-viz.html"><a href="2-viz.html#two-categ-barplot"><i class="fa fa-check"></i><b>2.8.3</b> Two categorical variables</a></li>
 <li class="chapter" data-level="2.8.4" data-path="2-viz.html"><a href="2-viz.html#summary-4"><i class="fa fa-check"></i><b>2.8.4</b> Summary</a></li>
@@ -257,13 +262,13 @@
 <li class="chapter" data-level="3.9.3" data-path="3-wrangling.html"><a href="3-wrangling.html#whats-to-come-1"><i class="fa fa-check"></i><b>3.9.3</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing &amp; “Tidy” Data</a><ul>
+<li class="chapter" data-level="4" data-path="4-tidy.html"><a href="4-tidy.html"><i class="fa fa-check"></i><b>4</b> Data Importing and “Tidy” Data</a><ul>
 <li class="chapter" data-level="" data-path="4-tidy.html"><a href="4-tidy.html#needed-packages-2"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="4.1" data-path="4-tidy.html"><a href="4-tidy.html#csv"><i class="fa fa-check"></i><b>4.1</b> Importing data</a><ul>
 <li class="chapter" data-level="4.1.1" data-path="4-tidy.html"><a href="4-tidy.html#using-the-console"><i class="fa fa-check"></i><b>4.1.1</b> Using the console</a></li>
 <li class="chapter" data-level="4.1.2" data-path="4-tidy.html"><a href="4-tidy.html#using-rstudios-interface"><i class="fa fa-check"></i><b>4.1.2</b> Using RStudio’s interface</a></li>
 </ul></li>
-<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> Tidy data</a><ul>
+<li class="chapter" data-level="4.2" data-path="4-tidy.html"><a href="4-tidy.html#tidy-data-ex"><i class="fa fa-check"></i><b>4.2</b> “Tidy” data</a><ul>
 <li class="chapter" data-level="4.2.1" data-path="4-tidy.html"><a href="4-tidy.html#tidy-definition"><i class="fa fa-check"></i><b>4.2.1</b> Definition of “tidy” data</a></li>
 <li class="chapter" data-level="4.2.2" data-path="4-tidy.html"><a href="4-tidy.html#converting-to-tidy-data"><i class="fa fa-check"></i><b>4.2.2</b> Converting to “tidy” data</a></li>
 <li class="chapter" data-level="4.2.3" data-path="4-tidy.html"><a href="4-tidy.html#nycflights13-package-1"><i class="fa fa-check"></i><b>4.2.3</b> <code>nycflights13</code> package</a></li>
@@ -275,7 +280,7 @@
 <li class="chapter" data-level="4.5.2" data-path="4-tidy.html"><a href="4-tidy.html#whats-to-come-2"><i class="fa fa-check"></i><b>4.5.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>II Data Modeling via moderndive</b></span></li>
+<li class="part"><span><b>II Data Modeling with moderndive</b></span></li>
 <li class="chapter" data-level="5" data-path="5-regression.html"><a href="5-regression.html"><i class="fa fa-check"></i><b>5</b> Basic Regression</a><ul>
 <li class="chapter" data-level="" data-path="5-regression.html"><a href="5-regression.html#needed-packages-3"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="5.1" data-path="5-regression.html"><a href="5-regression.html#model1"><i class="fa fa-check"></i><b>5.1</b> One numerical explanatory variable</a><ul>
@@ -300,7 +305,7 @@
 </ul></li>
 <li class="chapter" data-level="6" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html"><i class="fa fa-check"></i><b>6</b> Multiple Regression</a><ul>
 <li class="chapter" data-level="" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#needed-packages-4"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical &amp; one categorical explanatory variable</a><ul>
+<li class="chapter" data-level="6.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4"><i class="fa fa-check"></i><b>6.1</b> One numerical and one categorical explanatory variable</a><ul>
 <li class="chapter" data-level="6.1.1" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4EDA"><i class="fa fa-check"></i><b>6.1.1</b> Exploratory data analysis</a></li>
 <li class="chapter" data-level="6.1.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4interactiontable"><i class="fa fa-check"></i><b>6.1.2</b> Interaction model</a></li>
 <li class="chapter" data-level="6.1.3" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#model4table"><i class="fa fa-check"></i><b>6.1.3</b> Parallel slopes model</a></li>
@@ -321,7 +326,7 @@
 <li class="chapter" data-level="6.4.2" data-path="6-multiple-regression.html"><a href="6-multiple-regression.html#whats-to-come-5"><i class="fa fa-check"></i><b>6.4.2</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="part"><span><b>III Statistical Inference via infer</b></span></li>
+<li class="part"><span><b>III Statistical Inference with infer</b></span></li>
 <li class="chapter" data-level="7" data-path="7-sampling.html"><a href="7-sampling.html"><i class="fa fa-check"></i><b>7</b> Sampling</a><ul>
 <li class="chapter" data-level="" data-path="7-sampling.html"><a href="7-sampling.html#needed-packages-5"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="7.1" data-path="7-sampling.html"><a href="7-sampling.html#sampling-activity"><i class="fa fa-check"></i><b>7.1</b> Sampling bowl activity</a><ul>
@@ -337,7 +342,7 @@
 <li class="chapter" data-level="7.2.4" data-path="7-sampling.html"><a href="7-sampling.html#different-shovels"><i class="fa fa-check"></i><b>7.2.4</b> Using different shovels</a></li>
 </ul></li>
 <li class="chapter" data-level="7.3" data-path="7-sampling.html"><a href="7-sampling.html#sampling-framework"><i class="fa fa-check"></i><b>7.3</b> Sampling framework</a><ul>
-<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology &amp; notation</a></li>
+<li class="chapter" data-level="7.3.1" data-path="7-sampling.html"><a href="7-sampling.html#terminology-and-notation"><i class="fa fa-check"></i><b>7.3.1</b> Terminology and notation</a></li>
 <li class="chapter" data-level="7.3.2" data-path="7-sampling.html"><a href="7-sampling.html#sampling-definitions"><i class="fa fa-check"></i><b>7.3.2</b> Statistical definitions</a></li>
 <li class="chapter" data-level="7.3.3" data-path="7-sampling.html"><a href="7-sampling.html#moral-of-the-story"><i class="fa fa-check"></i><b>7.3.3</b> The moral of the story</a></li>
 </ul></li>
@@ -349,7 +354,7 @@
 <li class="chapter" data-level="7.5.4" data-path="7-sampling.html"><a href="7-sampling.html#whats-to-come-6"><i class="fa fa-check"></i><b>7.5.4</b> What’s to come?</a></li>
 </ul></li>
 </ul></li>
-<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping &amp; Confidence Intervals</a><ul>
+<li class="chapter" data-level="8" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html"><i class="fa fa-check"></i><b>8</b> Bootstrapping and Confidence Intervals</a><ul>
 <li class="chapter" data-level="" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#needed-packages-6"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="8.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#resampling-tactile"><i class="fa fa-check"></i><b>8.1</b> Pennies activity</a><ul>
 <li class="chapter" data-level="8.1.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#what-is-the-average-year-on-us-pennies-in-2019"><i class="fa fa-check"></i><b>8.1.1</b> What is the average year on US pennies in 2019?</a></li>
@@ -368,17 +373,17 @@
 </ul></li>
 <li class="chapter" data-level="8.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#bootstrap-process"><i class="fa fa-check"></i><b>8.4</b> Constructing confidence intervals</a><ul>
 <li class="chapter" data-level="8.4.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#original-workflow"><i class="fa fa-check"></i><b>8.4.1</b> Original workflow</a></li>
-<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> infer package workflow</a></li>
-<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with infer</a></li>
-<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with infer</a></li>
+<li class="chapter" data-level="8.4.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-workflow"><i class="fa fa-check"></i><b>8.4.2</b> <code>infer</code> package workflow</a></li>
+<li class="chapter" data-level="8.4.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#percentile-method-infer"><i class="fa fa-check"></i><b>8.4.3</b> Percentile method with <code>infer</code></a></li>
+<li class="chapter" data-level="8.4.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#infer-se"><i class="fa fa-check"></i><b>8.4.4</b> Standard error method with <code>infer</code></a></li>
 </ul></li>
 <li class="chapter" data-level="8.5" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#one-prop-ci"><i class="fa fa-check"></i><b>8.5</b> Interpreting confidence intervals</a><ul>
 <li class="chapter" data-level="8.5.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ilyas-yohan"><i class="fa fa-check"></i><b>8.5.1</b> Did the net capture the fish?</a></li>
-<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise &amp; shorthand interpretation</a></li>
+<li class="chapter" data-level="8.5.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#shorthand"><i class="fa fa-check"></i><b>8.5.2</b> Precise and shorthand interpretation</a></li>
 <li class="chapter" data-level="8.5.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-width"><i class="fa fa-check"></i><b>8.5.3</b> Width of confidence intervals</a></li>
 </ul></li>
 <li class="chapter" data-level="8.6" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#case-study-two-prop-ci"><i class="fa fa-check"></i><b>8.6</b> Case study: Is yawning contagious?</a><ul>
-<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> Mythbusters study data</a></li>
+<li class="chapter" data-level="8.6.1" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#mythbusters-study-data"><i class="fa fa-check"></i><b>8.6.1</b> <em>Mythbusters</em> study data</a></li>
 <li class="chapter" data-level="8.6.2" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#sampling-scenario"><i class="fa fa-check"></i><b>8.6.2</b> Sampling scenario</a></li>
 <li class="chapter" data-level="8.6.3" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#ci-build"><i class="fa fa-check"></i><b>8.6.3</b> Constructing the confidence interval</a></li>
 <li class="chapter" data-level="8.6.4" data-path="8-confidence-intervals.html"><a href="8-confidence-intervals.html#interpreting-the-confidence-interval"><i class="fa fa-check"></i><b>8.6.4</b> Interpreting the confidence interval</a></li>
@@ -393,14 +398,14 @@
 <li class="chapter" data-level="9" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html"><i class="fa fa-check"></i><b>9</b> Hypothesis Testing</a><ul>
 <li class="chapter" data-level="" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#needed-packages-7"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="9.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-activity"><i class="fa fa-check"></i><b>9.1</b> Promotions activity</a><ul>
-<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at bank?</a></li>
+<li class="chapter" data-level="9.1.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#does-gender-affect-promotions-at-a-bank"><i class="fa fa-check"></i><b>9.1.1</b> Does gender affect promotions at a bank?</a></li>
 <li class="chapter" data-level="9.1.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-once"><i class="fa fa-check"></i><b>9.1.2</b> Shuffling once</a></li>
 <li class="chapter" data-level="9.1.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#shuffling-16-times"><i class="fa fa-check"></i><b>9.1.3</b> Shuffling 16 times</a></li>
 <li class="chapter" data-level="9.1.4" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#what-did-we-just-do-2"><i class="fa fa-check"></i><b>9.1.4</b> What did we just do?</a></li>
 </ul></li>
 <li class="chapter" data-level="9.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#understanding-ht"><i class="fa fa-check"></i><b>9.2</b> Understanding hypothesis tests</a></li>
 <li class="chapter" data-level="9.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#ht-infer"><i class="fa fa-check"></i><b>9.3</b> Conducting hypothesis tests</a><ul>
-<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> infer package workflow</a></li>
+<li class="chapter" data-level="9.3.1" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#infer-workflow-ht"><i class="fa fa-check"></i><b>9.3.1</b> <code>infer</code> package workflow</a></li>
 <li class="chapter" data-level="9.3.2" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#comparing-infer-workflows"><i class="fa fa-check"></i><b>9.3.2</b> Comparison with confidence intervals</a></li>
 <li class="chapter" data-level="9.3.3" data-path="9-hypothesis-testing.html"><a href="9-hypothesis-testing.html#only-one-test"><i class="fa fa-check"></i><b>9.3.3</b> “There is only one test”</a></li>
 </ul></li>
@@ -425,7 +430,7 @@
 <li class="chapter" data-level="10" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html"><i class="fa fa-check"></i><b>10</b> Inference for Regression</a><ul>
 <li class="chapter" data-level="" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#needed-packages-8"><i class="fa fa-check"></i>Needed packages</a></li>
 <li class="chapter" data-level="10.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-refresher"><i class="fa fa-check"></i><b>10.1</b> Regression refresher</a><ul>
-<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evals-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evals analysis</a></li>
+<li class="chapter" data-level="10.1.1" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#teaching-evaluations-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Teaching evaluations analysis</a></li>
 <li class="chapter" data-level="10.1.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#sampling-scenario-2"><i class="fa fa-check"></i><b>10.1.2</b> Sampling scenario</a></li>
 </ul></li>
 <li class="chapter" data-level="10.2" data-path="10-inference-for-regression.html"><a href="10-inference-for-regression.html#regression-interp"><i class="fa fa-check"></i><b>10.2</b> Interpreting regression tables</a><ul>
@@ -455,18 +460,20 @@
 </ul></li>
 </ul></li>
 <li class="part"><span><b>IV Conclusion</b></span></li>
-<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell the Story with Data</a><ul>
+<li class="chapter" data-level="11" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html"><i class="fa fa-check"></i><b>11</b> Tell Your Story with Data</a><ul>
+<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#review"><i class="fa fa-check"></i><b>11.1</b> Review</a><ul>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#needed-packages-9"><i class="fa fa-check"></i>Needed packages</a></li>
-<li class="chapter" data-level="11.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.1</b> Case study: Seattle house prices</a><ul>
-<li class="chapter" data-level="11.1.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.1.1</b> Exploratory data analysis: Part I</a></li>
-<li class="chapter" data-level="11.1.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.1.2</b> Exploratory data analysis: Part II</a></li>
-<li class="chapter" data-level="11.1.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.1.3</b> Regression modeling</a></li>
-<li class="chapter" data-level="11.1.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.1.4</b> Making predictions</a></li>
 </ul></li>
-<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.2</b> Case study: Effective data storytelling</a><ul>
-<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.2.1</b> Bechdel test for Hollywood gender representation</a></li>
-<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.2.2</b> US Births in 1999</a></li>
-<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#script-of-r-code"><i class="fa fa-check"></i><b>11.2.3</b> Script of R code</a></li>
+<li class="chapter" data-level="11.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#seattle-house-prices"><i class="fa fa-check"></i><b>11.2</b> Case study: Seattle house prices</a><ul>
+<li class="chapter" data-level="11.2.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-I"><i class="fa fa-check"></i><b>11.2.1</b> Exploratory data analysis: Part I</a></li>
+<li class="chapter" data-level="11.2.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-EDA-II"><i class="fa fa-check"></i><b>11.2.2</b> Exploratory data analysis: Part II</a></li>
+<li class="chapter" data-level="11.2.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-regression"><i class="fa fa-check"></i><b>11.2.3</b> Regression modeling</a></li>
+<li class="chapter" data-level="11.2.4" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#house-prices-making-predictions"><i class="fa fa-check"></i><b>11.2.4</b> Making predictions</a></li>
+</ul></li>
+<li class="chapter" data-level="11.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#data-journalism"><i class="fa fa-check"></i><b>11.3</b> Case study: Effective data storytelling</a><ul>
+<li class="chapter" data-level="11.3.1" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#bechdel-test-for-hollywood-gender-representation"><i class="fa fa-check"></i><b>11.3.1</b> Bechdel test for Hollywood gender representation</a></li>
+<li class="chapter" data-level="11.3.2" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#us-births-in-1999"><i class="fa fa-check"></i><b>11.3.2</b> US Births in 1999</a></li>
+<li class="chapter" data-level="11.3.3" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#scripts-of-r-code"><i class="fa fa-check"></i><b>11.3.3</b> Scripts of R code</a></li>
 </ul></li>
 <li class="chapter" data-level="" data-path="11-thinking-with-data.html"><a href="11-thinking-with-data.html#concluding-remarks"><i class="fa fa-check"></i>Concluding remarks</a></li>
 </ul></li>
@@ -540,13 +547,19 @@
 </ul></li>
 </ul></li>
 <li class="chapter" data-level="D" data-path="D-appendixD.html"><a href="D-appendixD.html"><i class="fa fa-check"></i><b>D</b> Learning Check Solutions</a><ul>
-<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 2 Solutions</a></li>
-<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 3 Solutions</a></li>
-<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 4 Solutions</a></li>
-<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 5 Solutions</a></li>
-<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 6 Solutions</a></li>
-</ul></li>
-<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Information about R Packages Used</a></li>
+<li class="chapter" data-level="D.1" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-1-solutions"><i class="fa fa-check"></i><b>D.1</b> Chapter 1 Solutions</a></li>
+<li class="chapter" data-level="D.2" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-2-solutions"><i class="fa fa-check"></i><b>D.2</b> Chapter 2 Solutions</a></li>
+<li class="chapter" data-level="D.3" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-3-solutions"><i class="fa fa-check"></i><b>D.3</b> Chapter 3 Solutions</a></li>
+<li class="chapter" data-level="D.4" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-4-solutions"><i class="fa fa-check"></i><b>D.4</b> Chapter 4 Solutions</a></li>
+<li class="chapter" data-level="D.5" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-5-solutions"><i class="fa fa-check"></i><b>D.5</b> Chapter 5 Solutions</a></li>
+<li class="chapter" data-level="D.6" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-6-solutions"><i class="fa fa-check"></i><b>D.6</b> Chapter 6 Solutions</a></li>
+<li class="chapter" data-level="D.7" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-7-solutions"><i class="fa fa-check"></i><b>D.7</b> Chapter 7 Solutions</a></li>
+<li class="chapter" data-level="D.8" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-8-solutions"><i class="fa fa-check"></i><b>D.8</b> Chapter 8 Solutions</a></li>
+<li class="chapter" data-level="D.9" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-9-solutions"><i class="fa fa-check"></i><b>D.9</b> Chapter 9 Solutions</a></li>
+<li class="chapter" data-level="D.10" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-10-solutions"><i class="fa fa-check"></i><b>D.10</b> Chapter 10 Solutions</a></li>
+<li class="chapter" data-level="D.11" data-path="D-appendixD.html"><a href="D-appendixD.html#chapter-11-solutions"><i class="fa fa-check"></i><b>D.11</b> Chapter 11 Solutions</a></li>
+</ul></li>
+<li class="chapter" data-level="E" data-path="E-appendixE.html"><a href="E-appendixE.html"><i class="fa fa-check"></i><b>E</b> Versions of R Packages Used</a></li>
 <li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
 </ul>
 
@@ -573,34 +586,37 @@ <h1>References</h1>
 
 <div id="refs" class="references">
 <div>
-<p>Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Ben Baumer, and Mine Cetinkaya-Rundel. 2019. <em>Infer: Tidy Statistical Inference</em>. <a href="https://github.com/tidymodels/infer">https://github.com/tidymodels/infer</a>.</p>
+<p>Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Ben Baumer, and Mine Cetinkaya-Rundel. 2019. <em>Infer: Tidy Statistical Inference</em>.</p>
 </div>
 <div>
-<p>Chihara, Laura M., and Tim C. Hesterberg. 2011. <em>Mathematical Statistics with Resampling and R</em>. Hoboken, NJ: John Wiley; Sons. <a href="https://sites.google.com/site/chiharahesterberg/home">https://sites.google.com/site/chiharahesterberg/home</a>.</p>
+<p>Chihara, Laura M., and Tim C. Hesterberg. 2011. <em>Mathematical Statistics with Resampling and R</em>. First. Hoboken, NJ: John Wiley &amp; Sons. <a href="https://sites.google.com/site/chiharahesterberg/home">https://sites.google.com/site/chiharahesterberg/home</a>.</p>
 </div>
 <div>
-<p>Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2014. <em>Introductory Statistics with Randomization and Simulation</em>. First Edition. <a href="https://www.openintro.org/stat/textbook.php?stat_book=isrs">https://www.openintro.org/stat/textbook.php?stat_book=isrs</a>.</p>
+<p>Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2014. <em>Introductory Statistics with Randomization and Simulation</em>. First. Scotts Valley, CA: CreateSpace Independent Publishing Platform. <a href="https://www.openintro.org/stat/textbook.php?stat_book=isrs">https://www.openintro.org/stat/textbook.php?stat_book=isrs</a>.</p>
 </div>
 <div>
 <p>Firke, Sam. 2019. <em>Janitor: Simple Tools for Examining and Cleaning Dirty Data</em>. <a href="https://CRAN.R-project.org/package=janitor">https://CRAN.R-project.org/package=janitor</a>.</p>
 </div>
 <div>
-<p>Grolemund, Garrett, and Hadley Wickham. 2016. <em>R for Data Science</em>. <a href="http://r4ds.had.co.nz/">http://r4ds.had.co.nz/</a>.</p>
+<p>Grolemund, Garrett, and Hadley Wickham. 2017. <em>R for Data Science</em>. First. Sebastopol, CA: O’Reilly Media. <a href="https://r4ds.had.co.nz/">https://r4ds.had.co.nz/</a>.</p>
 </div>
 <div>
-<p>Ismay, Chester. 2016. <em>Getting Used to R, RStudio, and R Markdown</em>. <a href="http://ismayc.github.io/rbasics-book">http://ismayc.github.io/rbasics-book</a>.</p>
+<p>Ismay, Chester, and Patrick C. Kennedy. 2016. <em>Getting Used to R, RStudio, and R Markdown</em>. <a href="https://rbasics.netlify.com">https://rbasics.netlify.com</a>.</p>
 </div>
 <div>
-<p>———. 2019. <em>Moderndive: Tidyverse-Friendly Introductory Linear Regression</em>. <a href="https://CRAN.R-project.org/package=moderndive">https://CRAN.R-project.org/package=moderndive</a>.</p>
+<p>James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. <em>An Introduction to Statistical Learning: With Applications in R</em>. First. New York, NY: Springer.</p>
 </div>
 <div>
-<p>Kim, Albert Y., Chester Ismay, and Jennifer Chunn. 2018. <em>Fivethirtyeight: Data and Code Behind the Stories and Interactives at ’Fivethirtyeight’</em>. <a href="https://CRAN.R-project.org/package=fivethirtyeight">https://CRAN.R-project.org/package=fivethirtyeight</a>.</p>
+<p>Kim, Albert Y., and Chester Ismay. 2019. <em>Moderndive: Tidyverse-Friendly Introductory Linear Regression</em>. <a href="https://CRAN.R-project.org/package=moderndive">https://CRAN.R-project.org/package=moderndive</a>.</p>
 </div>
 <div>
-<p>Quinn, Michael, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2019. <em>Skimr: Compact and Flexible Summaries of Data</em>. <a href="https://CRAN.R-project.org/package=skimr">https://CRAN.R-project.org/package=skimr</a>.</p>
+<p>Kim, Albert Y., Chester Ismay, and Jennifer Chunn. 2019. <em>Fivethirtyeight: Data and Code Behind the Stories and Interactives at ’Fivethirtyeight’</em>. <a href="https://CRAN.R-project.org/package=fivethirtyeight">https://CRAN.R-project.org/package=fivethirtyeight</a>.</p>
 </div>
 <div>
-<p>Robbins, Naomi. 2013. <em>Creating More Effective Graphs</em>. Chart House.</p>
+<p>Quinn, Michael, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2019. <em>Skimr: Compact and Flexible Summaries of Data</em>. <a href="https://github.com/ropenscilabs/skimr">https://github.com/ropenscilabs/skimr</a>.</p>
+</div>
+<div>
+<p>Robbins, Naomi. 2013. <em>Creating More Effective Graphs</em>. First. New York, NY: Chart House.</p>
 </div>
 <div>
 <p>Robinson, David, and Alex Hayes. 2019. <em>Broom: Convert Statistical Analysis Objects into Tidy Tibbles</em>. <a href="https://CRAN.R-project.org/package=broom">https://CRAN.R-project.org/package=broom</a>.</p>
@@ -609,10 +625,10 @@ <h1>References</h1>
 <p>Wickham, Hadley. 2014. “Tidy Data.” <em>Journal of Statistical Software</em> Volume 59 (Issue 10). <a href="https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf">https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf</a>.</p>
 </div>
 <div>
-<p>———. 2017. <em>Tidyverse: Easily Install and Load the ’Tidyverse’</em>. <a href="https://CRAN.R-project.org/package=tidyverse">https://CRAN.R-project.org/package=tidyverse</a>.</p>
+<p>———. 2019a. <em>Nycflights13: Flights That Departed NYC in 2013</em>. <a href="https://CRAN.R-project.org/package=nycflights13">https://CRAN.R-project.org/package=nycflights13</a>.</p>
 </div>
 <div>
-<p>———. 2018. <em>Nycflights13: Flights That Departed Nyc in 2013</em>. <a href="https://CRAN.R-project.org/package=nycflights13">https://CRAN.R-project.org/package=nycflights13</a>.</p>
+<p>———. 2019b. <em>Tidyverse: Easily Install and Load the ’Tidyverse’</em>. <a href="https://CRAN.R-project.org/package=tidyverse">https://CRAN.R-project.org/package=tidyverse</a>.</p>
 </div>
 <div>
 <p>Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. <em>Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics</em>. <a href="https://CRAN.R-project.org/package=ggplot2">https://CRAN.R-project.org/package=ggplot2</a>.</p>
@@ -621,19 +637,29 @@ <h1>References</h1>
 <p>Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. <em>Dplyr: A Grammar of Data Manipulation</em>. <a href="https://CRAN.R-project.org/package=dplyr">https://CRAN.R-project.org/package=dplyr</a>.</p>
 </div>
 <div>
-<p>Wickham, Hadley, and Lionel Henry. 2019. <em>Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions</em>. <a href="https://CRAN.R-project.org/package=tidyr">https://CRAN.R-project.org/package=tidyr</a>.</p>
+<p>Wickham, Hadley, and Lionel Henry. 2019. <em>Tidyr: Tidy Messy Data</em>. <a href="https://CRAN.R-project.org/package=tidyr">https://CRAN.R-project.org/package=tidyr</a>.</p>
 </div>
 <div>
 <p>Wickham, Hadley, Jim Hester, and Romain Francois. 2018. <em>Readr: Read Rectangular Text Data</em>. <a href="https://CRAN.R-project.org/package=readr">https://CRAN.R-project.org/package=readr</a>.</p>
 </div>
 <div>
-<p>Wilkinson, Leland. 2005. <em>The Grammar of Graphics (Statistics and Computing)</em>. Secaucus, NJ, USA: Springer-Verlag New York, Inc.</p>
+<p>Wilkinson, Leland. 2005. <em>The Grammar of Graphics (Statistics and Computing)</em>. First. Secaucus, NJ: Springer-Verlag.</p>
 </div>
 <div>
 <p>Xie, Yihui. 2019. <em>Bookdown: Authoring Books and Technical Documents with R Markdown</em>. <a href="https://CRAN.R-project.org/package=bookdown">https://CRAN.R-project.org/package=bookdown</a>.</p>
 </div>
 </div>
 </div>
+
+
+
+
+
+
+
+
+
+
             </section>
 
           </div>
@@ -645,11 +671,13 @@ <h1>References</h1>
   </div>
 <script src="libs/gitbook-2.6.7/js/app.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/lunr.js"></script>
+<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
 <script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
 <script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
+<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
 <script>
 gitbook.require(["gitbook"], function(gitbook) {
 gitbook.start({
@@ -657,12 +685,11 @@ <h1>References</h1>
 "github": false,
 "facebook": true,
 "twitter": true,
-"google": false,
 "linkedin": false,
 "weibo": false,
 "instapaper": false,
 "vk": false,
-"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
+"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
 },
 "fontsettings": {
 "theme": "white",
@@ -677,6 +704,10 @@ <h1>References</h1>
 "link": null,
 "text": null
 },
+"view": {
+"link": null,
+"text": null
+},
 "download": null,
 "toc": {
 "collapse": "section",
@@ -693,8 +724,9 @@ <h1>References</h1>
     script.type = "text/javascript";
     var src = "true";
     if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
-    if (location.protocol !== "file:" && /^https?:/.test(src))
-      src = src.replace(/^https?:/, '');
+    if (location.protocol !== "file:")
+      if (/^https?:/.test(src))
+        src = src.replace(/^https?:/, '');
     script.src = src;
     document.getElementsByTagName("head")[0].appendChild(script);
   })();
diff --git a/docs/scripts/01-getting-started.R b/docs/scripts/01-getting-started.R
index 6840cbda9..4229b72b5 100644
--- a/docs/scripts/01-getting-started.R
+++ b/docs/scripts/01-getting-started.R
@@ -1,4 +1,4 @@
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in text.
 library(scales)
 
@@ -13,11 +13,11 @@ library(scales)
 
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 ## \vspace{-0.25in}
@@ -25,15 +25,15 @@ library(scales)
 ## \vspace{-0.25in}
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(ggplot2)
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 ## \vspace{-0.25in}
@@ -41,21 +41,21 @@ library(scales)
 ## \vspace{-0.25in}
 
 
-## ----message=FALSE-------------------------------------------------------
+## ----message=FALSE------------------------------------------------------------
 library(nycflights13)
 library(dplyr)
 library(knitr)
 
 
-## ----load_flights--------------------------------------------------------
+## ----load_flights-------------------------------------------------------------
 flights
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 ## \vspace{-0.25in}
@@ -63,15 +63,15 @@ flights
 ## \vspace{-0.25in}
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 glimpse(flights)
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 ## \vspace{-0.25in}
@@ -79,35 +79,43 @@ glimpse(flights)
 ## \vspace{-0.25in}
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## airlines
 ## kable(airlines)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## airlines$name
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 glimpse(airports)
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ?flights
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
+
+
+
+
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
 
diff --git a/docs/scripts/02-visualization.R b/docs/scripts/02-visualization.R
index 4dd9feb10..93ab00c65 100644
--- a/docs/scripts/02-visualization.R
+++ b/docs/scripts/02-visualization.R
@@ -1,10 +1,20 @@
-## ----message=FALSE-------------------------------------------------------
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("# (PART) (ref:tidyversepart) {-} ")
+} else {
+  cat("# (PART) Data Science with tidyverse {-} ")
+}
+
+
+
+
+## ----message=FALSE------------------------------------------------------------
 library(nycflights13)
 library(ggplot2)
 library(dplyr)
 
 
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in book.
 library(gapminder)
 library(knitr)
@@ -15,7 +25,7 @@ library(scales)
 library(stringr)
 
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 gapminder_2007 <- gapminder %>% 
   filter(year == 2007) %>% 
   select(-year) %>% 
@@ -28,54 +38,50 @@ gapminder_2007 <- gapminder %>%
   )
 
 
-## ----gapminder-2007, echo=FALSE------------------------------------------
+## ----gapminder-2007, echo=FALSE-----------------------------------------------
 gapminder_2007 %>% 
-  head() %>% 
+  head(3) %>% 
   kable(
-    digits=2,
-    caption = "Gapminder 2007 Data: First 6 of 142 countries"#, 
+    digits = 2,
+    caption = "Gapminder 2007 Data: First 3 of 142 countries"#, 
 #    booktabs = TRUE
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----gapminder, echo=FALSE, fig.cap="Life expectancy over GDP per capita in 2007."----
+## ----gapminder, echo=FALSE, fig.cap="Life expectancy over GDP per capita in 2007.", fig.height=2.95----
+gapminder_plot <- ggplot(data = gapminder_2007, 
+                         mapping = aes(x = `GDP per Capita`, 
+                                       y = `Life Expectancy`, 
+                                       size = Population, 
+                                       color = Continent)) +
+  geom_point() +
+  labs(x = "GDP per capita", y = "Life expectancy")
+
 if(knitr::is_html_output()){
-  ggplot(data = gapminder_2007, 
-         mapping = aes(x = `GDP per Capita`, 
-                       y = `Life Expectancy`, 
-                       size = Population, 
-                       color = Continent)) +
-    geom_point() +
-    labs(x = "GDP per capita", y = "Life expectancy")
+  gapminder_plot
 } else {
-    ggplot(data = gapminder_2007, 
-         mapping = aes(x = `GDP per Capita`, 
-                       y = `Life Expectancy`, 
-                       size = Population, 
-                       color = Continent)) +
-    geom_point() +
-    labs(x = "GDP per capita", y = "Life expectancy") +
-    scale_color_grey()
+  gapminder_plot + scale_color_grey()
 }
 
 
-## ----summary-table-gapminder, echo=FALSE---------------------------------
+## ----summary-table-gapminder, echo=FALSE--------------------------------------
 tibble(
   `data variable` = c("GDP per Capita", "Life Expectancy", "Population", "Continent"),
   aes = c("x", "y", "size", "color"),
   geom = c("point", "point", "point", "point")
 ) %>% 
   kable(
-    caption = "Summary of Grammar of Graphics for this plot", 
-    booktabs = TRUE
+    caption = "Summary of the grammar of graphics for this plot", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 alaska_flights <- flights %>% 
   filter(carrier == "AS")
 
@@ -86,17 +92,17 @@ alaska_flights <- flights %>%
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
 ##   geom_point()
 
 
-## ----noalpha, fig.cap="Arrival delays vs departure delays for Alaska Airlines flights from NYC in 2013.", warning=TRUE, echo=FALSE----
+## ----noalpha, fig.cap="Arrival delays versus departure delays for Alaska Airlines flights from NYC in 2013.", fig.height=1.8, warning=TRUE, echo=FALSE----
 ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_point()
 
 
-## ----nolayers, fig.cap="A plot with no layers."--------------------------
+## ----nolayers, fig.cap="A plot with no layers.", fig.height=2.5---------------
 ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))
 
 
@@ -104,12 +110,12 @@ ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))
 
 
 
-## ----alpha, fig.cap="Arrival vs departure delays scatterplot with alpha = 0.2."----
+## ----alpha, fig.cap="Arrival versus departure delays scatterplot with alpha = 0.2.", fig.height=4.9, warning=FALSE----
 ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_point(alpha = 0.2)
 
 
-## ----jitter-example-plot-1, fig.cap="Regular and jittered scatterplot.", echo=FALSE----
+## ----jitter-example-plot-1, fig.cap="Regular and jittered scatterplot.", echo=FALSE, fig.height=5, warning=FALSE----
 jitter_example <- tibble(
   x = rep(0, 4),
   y = rep(0, 4)
@@ -125,7 +131,7 @@ jittered_plot_2 <- ggplot(data = jitter_example, mapping = aes(x = x, y = y)) +
 jittered_plot_1 + jittered_plot_2
 
 
-## ----jitter, fig.cap="Arrival vs departure delays jittered scatterplot."----
+## ----jitter, fig.cap="Arrival versus departure delays jittered scatterplot.", fig.height=4.7, warning=FALSE----
 ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_jitter(width = 30, height = 30)
 
@@ -134,7 +140,7 @@ ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 early_january_weather <- weather %>% 
   filter(origin == "EWR" & month == 1 & day <= 15)
 
@@ -144,7 +150,8 @@ early_january_weather <- weather %>%
 
 
 ## ----hourlytemp, fig.cap="Hourly temperature in Newark for January 1-15, 2013."----
-ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) +
+ggplot(data = early_january_weather, 
+       mapping = aes(x = time_hour, y = temp)) +
   geom_line()
 
 
@@ -161,32 +168,32 @@ ggplot(data = weather, mapping = aes(x = temp, y = factor("A"))) +
 hist_title <- "Histogram of Hourly Temperature Recordings from NYC in 2013"
 
 
-## ----histogramexample, warning=FALSE, echo=FALSE, fig.cap="Example histogram."----
+## ----histogramexample, warning=FALSE, echo=FALSE, fig.cap="Example histogram.", fig.height=2----
 ggplot(data = weather, mapping = aes(x = temp)) +
   geom_histogram(binwidth = 10, boundary = 70, color = "white")
 
 
-## ----weather-histogram, warning=TRUE, fig.cap="Histogram of hourly temperatures at three NYC airports."----
+## ----weather-histogram, warning=TRUE, fig.cap="Histogram of hourly temperatures at three NYC airports.", fig.height=2.3----
 ggplot(data = weather, mapping = aes(x = temp)) +
   geom_histogram()
 
 
-## ----weather-histogram-2, warning=FALSE, message=FALSE, fig.cap="Histogram of hourly temperatures at three NYC airports with white borders."----
+## ----weather-histogram-2, warning=FALSE, message=FALSE, fig.cap="Histogram of hourly temperatures at three NYC airports with white borders.", fig.height=3----
 ggplot(data = weather, mapping = aes(x = temp)) +
   geom_histogram(color = "white")
 
 
-## ---- eval = FALSE-------------------------------------------------------
+## ---- eval = FALSE------------------------------------------------------------
 ## ggplot(data = weather, mapping = aes(x = temp)) +
 ##   geom_histogram(color = "white", fill = "steelblue")
 
 
-## ---- eval = FALSE-------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## ggplot(data = weather, mapping = aes(x = temp)) +
 ##   geom_histogram(bins = 40, color = "white")
 
 
-## ---- eval = FALSE-------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## ggplot(data = weather, mapping = aes(x = temp)) +
 ##   geom_histogram(binwidth = 10, color = "white")
 
@@ -197,7 +204,7 @@ hist_1 <- ggplot(data = weather, mapping = aes(x = temp)) +
   labs(title = "With 40 bins")
 hist_2 <- ggplot(data = weather, mapping = aes(x = temp)) +
   geom_histogram(binwidth = 10, color = "white") +
-  labs(title = "With binwidth = 10 deg F")
+  labs(title = "With binwidth = 10 degrees F")
 hist_1 + hist_2
 
 
@@ -205,23 +212,55 @@ hist_1 + hist_2
 
 
 
-## ----facethistogram, fig.cap="Faceted histogram of hourly temperatures by month."----
-ggplot(data = weather, mapping = aes(x = temp)) +
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(data = weather, mapping = aes(x = temp)) +
+##   geom_histogram(binwidth = 5, color = "white") +
+##   facet_wrap(~ month)
+
+
+## ----facethistogram, fig.cap="Faceted histogram of hourly temperatures by month.", echo=FALSE, fig.height=3.3----
+month_facet <- ggplot(data = weather, mapping = aes(x = temp)) +
   geom_histogram(binwidth = 5, color = "white") +
   facet_wrap(~ month)
 
+if(knitr::is_latex_output()){
+  month_facet + 
+  theme(
+    strip.text = element_text(color = "black"),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  month_facet
+}
+
 
-## ----facethistogram2, fig.cap="Faceted histogram with 4 instead of 3 rows."----
-ggplot(data = weather, mapping = aes(x = temp)) +
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(data = weather, mapping = aes(x = temp)) +
+##   geom_histogram(binwidth = 5, color = "white") +
+##   facet_wrap(~ month, nrow = 4)
+
+
+## ----facethistogram2, fig.cap="Faceted histogram with 4 instead of 3 rows.", echo=FALSE, fig.height=3.3----
+month_facet_4 <- ggplot(data = weather, mapping = aes(x = temp)) +
   geom_histogram(binwidth = 5, color = "white") +
   facet_wrap(~ month, nrow = 4)
 
+if(knitr::is_latex_output()){
+  month_facet_4 + 
+  theme(
+    strip.text = element_text(color = "black"),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  month_facet_4
+}
+
 
 
 
 
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 n_nov <- weather %>% 
   filter(month == 11) %>% 
   nrow()
@@ -238,14 +277,14 @@ max_nov <- weather %>%
 quartiles <- weather %>% 
   filter(month == 11) %>% 
   pull(temp) %>% 
-  quantile(prob=c(0.25, 0.5, 0.75)) %>% 
+  quantile(prob = c(0.25, 0.5, 0.75)) %>% 
   round(0)
 five_number_summary <- tibble(
   temp = c(min_nov, quartiles, max_nov)
 )
 
 
-## ----nov1, echo=FALSE, fig.cap="November temperatures represented as points."----
+## ----nov1, echo=FALSE, fig.cap="November temperatures represented as jittered points.", fig.height=1.7----
 base_plot <- weather %>% 
   filter(month %in% c(11)) %>% 
   ggplot(mapping = aes(x = factor(month), y = temp)) +
@@ -254,7 +293,7 @@ base_plot +
   geom_jitter(width = 0.075, height = 0.5, alpha = 0.1)
 
 
-## ----nov2, echo=FALSE, fig.cap="Building up a boxplot of November temperatures."----
+## ----nov2, echo=FALSE, fig.cap="Building up a boxplot of November temperatures.", fig.height=3----
 boxplot_1 <- base_plot +
   geom_hline(data = five_number_summary, aes(yintercept=temp), linetype = "dashed") +
   geom_jitter(width = 0.075, height = 0.5, alpha = 0.1)
@@ -267,12 +306,12 @@ boxplot_3 <- base_plot +
 boxplot_1 + boxplot_2 + boxplot_3
 
 
-## ----badbox, fig.cap="Invalid boxplot specification.", fig.height=3.5----
+## ----badbox, fig.cap="Invalid boxplot specification.", fig.height=2.4---------
 ggplot(data = weather, mapping = aes(x = month, y = temp)) +
   geom_boxplot()
 
 
-## ----monthtempbox, fig.cap="Side-by-side boxplot of temperature split by month.", fig.height=3.7----
+## ----monthtempbox, fig.cap="Side-by-side boxplot of temperature split by month.", fig.height=4.2----
 ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
   geom_boxplot()
 
@@ -281,7 +320,7 @@ ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 fruits <- tibble(
   fruit = c("apple", "apple", "orange", "apple", "orange")
 )
@@ -291,15 +330,15 @@ fruits_counted <- tibble(
 )
 
 
-## ----fruits, echo=FALSE--------------------------------------------------
+## ----fruits, echo=FALSE-------------------------------------------------------
 fruits
 
 
-## ----fruitscounted, echo=FALSE-------------------------------------------
+## ----fruitscounted, echo=FALSE------------------------------------------------
 fruits_counted
 
 
-## ----geombar, fig.cap="Barplot when counts are not pre-counted.", fig.height=2.5----
+## ----geombar, fig.cap="Barplot when counts are not pre-counted.", fig.height=1.8----
 ggplot(data = fruits, mapping = aes(x = fruit)) +
   geom_bar()
 
@@ -309,35 +348,31 @@ ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
   geom_col()
 
 
-## ----flightsbar, fig.cap='(ref:geombar)', fig.height=2.5-----------------
+## ----flightsbar, fig.cap='(ref:geombar)', fig.height=2.8----------------------
 ggplot(data = flights, mapping = aes(x = carrier)) +
   geom_bar()
 
 
-## ----flights-counted, message=FALSE, echo=FALSE--------------------------
-flights_table <- flights %>% 
+## ----flights-counted, message=FALSE, echo=FALSE-------------------------------
+flights_counted <- flights %>% 
   group_by(carrier) %>% 
   summarize(number = n())
-kable(flights_table,
+kable(flights_counted,
       digits = 3,
-      caption = "Number of flights pre-counted for each carrier.", 
+      caption = "Number of flights pre-counted for each carrier", 
       booktabs = TRUE,
-      longtable = TRUE
+      longtable = TRUE,
+      linesep = ""
 ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## ggplot(data = flights_table, mapping = aes(x = carrier, y = number)) +
-##   geom_col()
 
 
 
 
-
-
-## ----carrierpie, echo=FALSE, fig.cap="The dreaded pie chart.", out.width="75%"----
+## ----carrierpie, echo=FALSE, fig.cap="The dreaded pie chart.", fig.height=4.8----
 if(knitr::is_html_output()){
   ggplot(flights, mapping = aes(x = factor(1), fill = carrier)) +
     geom_bar(width = 1) +
@@ -371,17 +406,17 @@ if(knitr::is_html_output()){
 
 
 
-## ---- fig.height=2.5, eval=FALSE-----------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## ggplot(data = flights, mapping = aes(x = carrier)) +
 ##   geom_bar()
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
 ##   geom_bar()
 
 
-## ----flights-stacked-bar, echo=FALSE, fig.cap="Stacked barplot comparing the number of flights by carrier and origin.", fig.height=3.5----
+## ----flights-stacked-bar, echo=FALSE, fig.cap="Stacked barplot of flight counts by carrier and origin.", fig.height=2.8----
 if(knitr::is_html_output()) {
   ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
     geom_bar()
@@ -392,17 +427,12 @@ if(knitr::is_html_output()) {
 }
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## ggplot(data = flights, mapping = aes(x = carrier), fill = origin) +
-##   geom_bar()
-
-
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(data = flights, mapping = aes(x = carrier, color = origin)) +
 ##   geom_bar()
 
 
-## ----flights-stacked-bar-color, echo=FALSE, fig.cap="Stacked barplot with color aesthetic used instead of fill.", fig.height=3.5----
+## ----flights-stacked-bar-color, echo=FALSE, fig.cap="Stacked barplot with color aesthetic used instead of fill.", fig.height=2.2----
 if(knitr::is_html_output()){
   ggplot(data = flights, mapping = aes(x = carrier, color = origin)) +
     geom_bar()
@@ -413,7 +443,12 @@ if(knitr::is_html_output()){
 }
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
+## ggplot(data = flights, mapping = aes(x = carrier), fill = origin) +
+##   geom_bar()
+
+
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
 ##   geom_bar(position = "dodge")
 
@@ -429,11 +464,42 @@ if(knitr::is_html_output()){
 }
 
 
-## ----facet-bar-vert, fig.cap="Faceted barplot comparing the number of flights by carrier and origin.", fig.height=7.5----
-ggplot(data = flights, mapping = aes(x = carrier)) +
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
+##   geom_bar(position = position_dodge(preserve = "single"))
+
+
+## ----flights-dodged-bar-color-tweak, echo=FALSE, fig.cap="Side-by-side barplot comparing number of flights by carrier and origin (with formatting tweak).", fig.height=2.5----
+if(knitr::is_html_output()){
+  ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
+    geom_bar(position = position_dodge(preserve = "single"))
+} else {
+  ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
+    geom_bar(position = position_dodge(preserve = "single")) +
+    scale_fill_grey()
+}
+
+
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(data = flights, mapping = aes(x = carrier)) +
+##   geom_bar() +
+##   facet_wrap(~ origin, ncol = 1)
+
+
+## ----facet-bar-vert, fig.cap="Faceted barplot comparing the number of flights by carrier and origin.", fig.height=6, echo=FALSE----
+origin_facet_ncol <- ggplot(data = flights, mapping = aes(x = carrier)) +
   geom_bar() +
   facet_wrap(~ origin, ncol = 1)
 
+if(knitr::is_latex_output()){
+  origin_facet_ncol + 
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  origin_facet_ncol
+}
 
 
 
@@ -441,7 +507,8 @@ ggplot(data = flights, mapping = aes(x = carrier)) +
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Segment 1:
 ## ggplot(data = flights, mapping = aes(x = carrier)) +
 ##   geom_bar()
@@ -451,6 +518,12 @@ ggplot(data = flights, mapping = aes(x = carrier)) +
 ##   geom_bar()
 
 
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
+
+
 
 
 ## ----ggplot-cheatsheet, echo=FALSE, fig.cap="Data Visualization with ggplot2 cheatsheet."----
@@ -468,7 +541,7 @@ if(knitr:::is_html_output()){
 }
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## alaska_flights <- flights %>%
 ##   filter(carrier == "AS")
 ## 
@@ -476,7 +549,7 @@ if(knitr:::is_html_output()){
 ##   geom_point()
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## early_january_weather <- weather %>%
 ##   filter(origin == "EWR" & month == 1 & day <= 15)
 ## 
diff --git a/docs/scripts/03-wrangling.R b/docs/scripts/03-wrangling.R
index 1915bfbac..ba833c4d4 100644
--- a/docs/scripts/03-wrangling.R
+++ b/docs/scripts/03-wrangling.R
@@ -1,9 +1,9 @@
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## alaska_flights <- flights %>%
 ##   filter(carrier == "AS")
 
 
-## ---- message=FALSE------------------------------------------------------
+## ---- message=FALSE-----------------------------------------------------------
 library(dplyr)
 library(ggplot2)
 library(nycflights13)
@@ -11,61 +11,59 @@ library(nycflights13)
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## h(g(f(x)))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## x %>%
 ##   f() %>%
 ##   g() %>%
 ##   h()
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## alaska_flights <- flights %>%
 ##   filter(carrier == "AS")
 
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## portland_flights <- flights %>%
 ##   filter(dest == "PDX")
 ## View(portland_flights)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## btv_sea_flights_fall <- flights %>%
 ##   filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)
 ## View(btv_sea_flights_fall)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## btv_sea_flights_fall <- flights %>%
 ##   filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
 ## View(btv_sea_flights_fall)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## not_BTV_SEA <- flights %>%
 ##   filter(!(dest == "BTV" | dest == "SEA"))
 ## View(not_BTV_SEA)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## flights %>%
-##   filter(!dest == "BTV" | dest == "SEA")
+## ---- eval=FALSE--------------------------------------------------------------
+## flights %>% filter(!dest == "BTV" | dest == "SEA")
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## many_airports <- flights %>%
 ##   filter(dest == "SEA" | dest == "SFO" | dest == "PDX" |
 ##          dest == "BTV" | dest == "BDL")
-## View(many_airports)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## many_airports <- flights %>%
 ##   filter(dest %in% c("SEA", "SFO", "PDX", "BTV", "BDL"))
 ## View(many_airports)
@@ -79,13 +77,13 @@ library(nycflights13)
 
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 summary_temp <- weather %>% 
   summarize(mean = mean(temp), std_dev = sd(temp))
 summary_temp
 
 
-## ---- eval = TRUE--------------------------------------------------------
+## -----------------------------------------------------------------------------
 summary_temp <- weather %>% 
   summarize(mean = mean(temp, na.rm = TRUE), 
             std_dev = sd(temp, na.rm = TRUE))
@@ -94,7 +92,7 @@ summary_temp
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## summary_temp <- weather %>%
 ##   summarize(mean = mean(temp, na.rm = TRUE)) %>%
 ##   summarize(std_dev = sd(temp, na.rm = TRUE))
@@ -104,7 +102,7 @@ summary_temp
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 summary_monthly_temp <- weather %>% 
   group_by(month) %>% 
   summarize(mean = mean(temp, na.rm = TRUE), 
@@ -112,42 +110,42 @@ summary_monthly_temp <- weather %>%
 summary_monthly_temp
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 diamonds
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 diamonds %>% 
   group_by(cut)
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 diamonds %>% 
   group_by(cut) %>% 
   summarize(avg_price = mean(price))
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 diamonds %>% 
   group_by(cut) %>% 
   ungroup()
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 by_origin <- flights %>% 
   group_by(origin) %>% 
   summarize(count = n())
 by_origin
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 by_origin_monthly <- flights %>% 
   group_by(origin, month) %>% 
   summarize(count = n())
 by_origin_monthly
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 by_origin_monthly_incorrect <- flights %>% 
   group_by(origin) %>% 
   group_by(month) %>% 
@@ -164,12 +162,12 @@ by_origin_monthly_incorrect
 
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 weather <- weather %>% 
   mutate(temp_in_C = (temp - 32) / 1.8)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 summary_monthly_temp <- weather %>% 
   group_by(month) %>% 
   summarize(mean_temp_in_F = mean(temp, na.rm = TRUE), 
@@ -177,22 +175,22 @@ summary_monthly_temp <- weather %>%
 summary_monthly_temp
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 flights <- flights %>% 
   mutate(gain = dep_delay - arr_delay)
 
 
-## ----first-five-flights, echo=FALSE--------------------------------------
+## ----first-five-flights, echo=FALSE-------------------------------------------
 flights %>% 
   select(dep_delay, arr_delay, gain) %>% 
   slice(1:5) %>% 
   kable(
-    caption = "First five rows of departure/arrival delay and gain variables."
+    caption = "First five rows of departure/arrival delay and gain variables"
     ) %>% 
   kable_styling(position = "center", latex_options = "hold_position")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 gain_summary <- flights %>% 
   summarize(
     min = min(gain, na.rm = TRUE),
@@ -207,12 +205,12 @@ gain_summary <- flights %>%
 gain_summary
 
 
-## ----gain-hist, message=FALSE, fig.cap="Histogram of gain variable."-----
+## ----gain-hist, message=FALSE, fig.cap="Histogram of gain variable.", fig.height=3----
 ggplot(data = flights, mapping = aes(x = gain)) +
   geom_histogram(color = "white", bins = 20)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 flights <- flights %>% 
   mutate(
     gain = dep_delay - arr_delay,
@@ -225,30 +223,30 @@ flights <- flights %>%
 
 
 
-## ---- eval---------------------------------------------------------------
+## ---- eval--------------------------------------------------------------------
 freq_dest <- flights %>% 
   group_by(dest) %>% 
   summarize(num_flights = n())
 freq_dest
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 freq_dest %>% 
   arrange(num_flights)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 freq_dest %>% 
   arrange(desc(num_flights))
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## View(airlines)
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## flights_joined <- flights %>%
 ##   inner_join(airlines, by = "carrier")
 ## View(flights)
@@ -257,17 +255,17 @@ freq_dest %>%
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## View(airports)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## flights_with_airport_names <- flights %>%
 ##   inner_join(airports, by = c("dest" = "faa"))
 ## View(flights_with_airport_names)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 named_dests <- flights %>%
   group_by(dest) %>%
   summarize(num_flights = n()) %>%
@@ -277,7 +275,7 @@ named_dests <- flights %>%
 named_dests
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## flights_weather_joined <- flights %>%
 ##   inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
 ## View(flights_weather_joined)
@@ -287,81 +285,65 @@ named_dests
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## joined_flights <- flights %>%
 ##   inner_join(airlines, by = "carrier")
 ## View(joined_flights)
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## glimpse(flights)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## flights %>%
 ##   select(carrier, flight)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## flights_no_year <- flights %>%
-##   select(-year)
+## ---- eval=FALSE--------------------------------------------------------------
+## flights_no_year <- flights %>% select(-year)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## flight_arr_times <- flights %>%
-##   select(month:day, arr_time:sched_arr_time)
+## ---- eval=FALSE--------------------------------------------------------------
+## flight_arr_times <- flights %>% select(month:day, arr_time:sched_arr_time)
 ## flight_arr_times
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## flights_reorder <- flights %>%
 ##   select(year, month, day, hour, minute, time_hour, everything())
 ## glimpse(flights_reorder)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## flights_begin_a <- flights %>%
-##   select(starts_with("a"))
-## flights_begin_a
-
-
-## ---- eval=FALSE---------------------------------------------------------
-## flights_delays <- flights %>%
-##   select(ends_with("delay"))
-## flights_delays
+## ---- eval=FALSE--------------------------------------------------------------
+## flights %>% select(starts_with("a"))
+## flights %>% select(ends_with("delay"))
+## flights %>% select(contains("time"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## flights_time <- flights %>%
-##   select(contains("time"))
-## flights_time
-
-
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## flights_time_new <- flights %>%
 ##   select(dep_time, arr_time) %>%
-##   rename(departure_time = dep_time,
-##          arrival_time = arr_time)
+##   rename(departure_time = dep_time, arrival_time = arr_time)
 ## glimpse(flights_time_new)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## named_dests %>%
-##   top_n(n = 10, wt = num_flights)
+## ---- eval=FALSE--------------------------------------------------------------
+## named_dests %>% top_n(n = 10, wt = num_flights)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## named_dests  %>%
 ##   top_n(n = 10, wt = num_flights) %>%
 ##   arrange(desc(num_flights))
@@ -371,7 +353,7 @@ named_dests
 
 
 
-## ----wrangle-summary-table, echo=FALSE, message=FALSE--------------------
+## ----wrangle-summary-table, echo=FALSE, message=FALSE-------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1nRkXfYMQiTj79c08xQPY0zkoJSpde3NC1w6DRhsWCss/edit#gid=0
 
@@ -392,8 +374,9 @@ if(knitr:::is_latex_output()){
       `Data wrangling operation` = str_replace_all(`Data wrangling operation`, "`", ""),
     ) %>% 
     kable(
-      caption = "Summary of data wrangling verbs.", 
+      caption = "Summary of data wrangling verbs", 
       booktabs = TRUE,
+      linesep = "",
       format = "latex"
     ) %>% 
     kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
@@ -403,7 +386,7 @@ if(knitr:::is_latex_output()){
 } else {
   ch4_scenarios %>% 
     kable(
-      caption = "Summary of data wrangling verbs.", 
+      caption = "Summary of data wrangling verbs", 
       booktabs = TRUE,
       format = "html"
     )
@@ -416,6 +399,14 @@ if(knitr:::is_latex_output()){
 
 
 
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
+
+
+
+
 ## ----dplyr-cheatsheet, echo=FALSE, fig.cap="Data Transformation with dplyr cheatsheet."----
 if(knitr::is_html_output())
   include_graphics("images/cheatsheets/dplyr_cheatsheet-1.png")
diff --git a/docs/scripts/04-tidy.R b/docs/scripts/04-tidy.R
index 91d5fb7d0..0376f3907 100644
--- a/docs/scripts/04-tidy.R
+++ b/docs/scripts/04-tidy.R
@@ -1,4 +1,4 @@
-## ----setup_tidy, include=FALSE-------------------------------------------
+## ----setup_tidy, include=FALSE------------------------------------------------
 chap <- 4
 lc <- 0
 rq <- 0
@@ -22,7 +22,7 @@ options(knitr.kable.NA = '')
 set.seed(76)
 
 
-## ----warning=FALSE, message=FALSE----------------------------------------
+## ----warning=FALSE, message=FALSE---------------------------------------------
 library(dplyr)
 library(ggplot2)
 library(readr)
@@ -31,7 +31,7 @@ library(nycflights13)
 library(fivethirtyeight)
 
 
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in text.
 library(knitr)
 library(kableExtra)
@@ -39,25 +39,26 @@ library(stringr)
 library(scales)
 
 
-## ----message=FALSE, eval=FALSE-------------------------------------------
+## ----message=FALSE, eval=FALSE------------------------------------------------
 ## library(readr)
 ## dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
 ## dem_score
 
-## ----message=FALSE, echo=FALSE-------------------------------------------
+## ----message=FALSE, echo=FALSE------------------------------------------------
 dem_score <- read_csv("data/dem_score.csv")
 dem_score
 
 
-## ----read-excel, echo=FALSE, fig.cap="Importing an Excel file to R."-----
+## ----read-excel, echo=FALSE, fig.cap="Importing an Excel file to R."----------
 include_graphics("images/rstudio_screenshots/read_excel.png")
 
 
-## ------------------------------------------------------------------------
-drinks
+## ---- echo=FALSE--------------------------------------------------------------
+drinks %>% 
+  head(5)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 drinks_smaller <- drinks %>% 
   filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>% 
   select(-total_litres_of_pure_alcohol) %>% 
@@ -65,7 +66,7 @@ drinks_smaller <- drinks %>%
 drinks_smaller
 
 
-## ----drinks-smaller, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5, echo=FALSE----
+## ----drinks-smaller, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.9, echo=FALSE----
 drinks_smaller_tidy <- drinks_smaller %>% 
   gather(type, servings, -country)
 drinks_smaller_tidy_plot <- ggplot(
@@ -81,37 +82,37 @@ if(knitr::is_html_output()){
 }
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 drinks_smaller_tidy
 
 
-## ------------------------------------------------------------------------
-drinks_smaller
-
 
 
 
 
-
-## ----tidy-stocks, echo=FALSE---------------------------------------------
+## ----tidy-stocks, echo=FALSE--------------------------------------------------
 stocks_tidy <- stocks %>% 
   rename(
     Boeing = `Boeing stock price`,
     Amazon = `Amazon stock price`,
     Google = `Google stock price`
   ) %>% 
-  gather(`Stock name`, `Stock price`, -Date)
+#  gather(`Stock name`, `Stock price`, -Date)
+  pivot_longer(cols = -Date, 
+               names_to = "Stock Name", 
+               values_to = "Stock Price")
 stocks_tidy %>% 
   kable(
     digits = 2,
     caption = "Stock prices (tidy format)", 
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----tidy-stocks-2, echo=FALSE-------------------------------------------
+## ----tidy-stocks-2, echo=FALSE------------------------------------------------
 stocks <- tibble(
   Date = as.Date('2009-01-01') + 0:4,
   `Boeing Price` = paste("$", c("173.55", "172.61", "173.86", "170.77", "174.29"), sep = ""),
@@ -121,45 +122,54 @@ stocks <- tibble(
 stocks %>% 
   kable(
     digits = 2,
-    caption = "Example of tidy data.", 
-    booktabs = TRUE
+    caption = "Example of tidy data"#, 
+#    booktabs = TRUE
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), 
                 latex_options = c("hold_position"))
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 drinks_smaller
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 drinks_smaller_tidy <- drinks_smaller %>% 
-  gather(key = type, value = servings, -country)
+  pivot_longer(names_to = "type", 
+               values_to = "servings", 
+               cols = -country)
 drinks_smaller_tidy
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## drinks_smaller_tidy <- drinks_smaller %>%
-##   gather(key = type, value = servings, c(beer, spirit, wine))
-## drinks_smaller_tidy
+## ---- eval=FALSE--------------------------------------------------------------
+## drinks_smaller %>%
+##   pivot_longer(names_to = "type",
+##                values_to = "servings",
+##                cols = c(beer, spirit, wine))
+
 
+## ---- eval=FALSE--------------------------------------------------------------
+## drinks_smaller %>%
+##   pivot_longer(names_to = "type",
+##                values_to = "servings",
+##                cols = beer:wine)
 
-## ----eval=FALSE----------------------------------------------------------
-## ggplot(drinks_smaller_tidy,
-##        aes(x = country, y = servings, fill = type)) +
+
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) +
 ##   geom_col(position = "dodge")
 
 
-## ----drinks-smaller-tidy-barplot, echo=FALSE, fig.cap="Comparing alcohol consumption in 4 countries.", fig.height=3.5----
+## ----drinks-smaller-tidy-barplot, echo=FALSE, fig.cap='(ref:drinks-col)', fig.height=2.5----
 if(knitr::is_html_output()){
   drinks_smaller_tidy_plot
 } else {
@@ -167,43 +177,41 @@ if(knitr::is_html_output()){
 }
 
 
-## \vspace{-0.25in}
+## \vspace{-0.15in}
 
 ## **_Learning check_**
 
-## \vspace{-0.25in}
+## \vspace{-0.1in}
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## airline_safety
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 airline_safety_smaller <- airline_safety %>% 
-  select(-c(incl_reg_subsidiaries, avail_seat_km_per_week))
+  select(airline, starts_with("fatalities"))
 airline_safety_smaller
 
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 guat_dem <- dem_score %>% 
   filter(country == "Guatemala")
 guat_dem
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 guat_dem_tidy <- guat_dem %>% 
-  gather(key = year, value = democracy_score, -country) 
+  pivot_longer(names_to = "year", 
+               values_to = "democracy_score", 
+               cols = -country,
+               names_ptypes = list(year = integer())) 
 guat_dem_tidy
 
 
-## ------------------------------------------------------------------------
-guat_dem_tidy <- guat_dem_tidy %>% 
-  mutate(year = as.numeric(year))
-
-
-## ----guat-dem-tidy, fig.cap="Democracy scores in Guatemala 1952-1992.", fig.height=3.5----
+## ----guat-dem-tidy, fig.cap="Democracy scores in Guatemala 1952-1992.", fig.height=3----
 ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) +
   geom_line() +
   labs(x = "Year", y = "Democracy Score")
@@ -213,28 +221,34 @@ ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) +
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## library(dplyr)
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(ggplot2)
+## library(dplyr)
 ## library(readr)
 ## library(tidyr)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(tidyverse)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(ggplot2)
 ## library(dplyr)
-## library(tidyr)
 ## library(readr)
+## library(tidyr)
 ## library(purrr)
 ## library(tibble)
 ## library(stringr)
 ## library(forcats)
 
 
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
+
+
 
 
 ## ----import-cheatsheet, echo=FALSE, fig.cap="Data Import cheatsheet (first page): readr package.", out.width="66%"----
diff --git a/docs/scripts/05-regression.R b/docs/scripts/05-regression.R
index ca321e8da..e53671a26 100644
--- a/docs/scripts/05-regression.R
+++ b/docs/scripts/05-regression.R
@@ -1,10 +1,20 @@
-## ---- eval=FALSE---------------------------------------------------------
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("# (PART) (ref:moderndivepart) {-}")
+} else {
+  cat("# (PART) Data Modeling with moderndive {-} ")
+}
+
+
+
+
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(tidyverse)
 ## library(moderndive)
 ## library(skimr)
 ## library(gapminder)
 
-## ---- echo=FALSE, message=FALSE, warning=FALSE---------------------------
+## ---- echo=FALSE, message=FALSE, warning=FALSE--------------------------------
 library(tidyverse)
 library(moderndive)
 # DO NOT load the skimr package as a whole as it will break all kable() code for 
@@ -14,7 +24,7 @@ library(moderndive)
 library(gapminder)
 
 
-## ---- message=FALSE, warning=FALSE, echo=FALSE---------------------------
+## ---- message=FALSE, warning=FALSE, echo=FALSE--------------------------------
 # Packages needed internally, but not in text.
 library(mvtnorm)
 library(broom)
@@ -22,48 +32,47 @@ library(kableExtra)
 library(patchwork)
 
 
-## ------------------------------------------------------------------------
-evals_ch6 <- evals %>%
+## -----------------------------------------------------------------------------
+evals_ch5 <- evals %>%
   select(ID, score, bty_avg, age)
 
 
-## ------------------------------------------------------------------------
-glimpse(evals_ch6)
+## -----------------------------------------------------------------------------
+glimpse(evals_ch5)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## evals_ch6 %>%
+## ---- eval=FALSE--------------------------------------------------------------
+## evals_ch5 %>%
 ##   sample_n(size = 5)
 
-## ----five-random-courses, echo=FALSE-------------------------------------
-evals_ch6 %>%
+## ----five-random-courses, echo=FALSE------------------------------------------
+evals_ch5 %>%
   sample_n(5) %>%
   knitr::kable(
     digits = 3,
     caption = "A random sample of 5 out of the 463 courses at UT Austin",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----eval=TRUE-----------------------------------------------------------
-evals_ch6 %>%
+## ----eval=TRUE----------------------------------------------------------------
+evals_ch5 %>%
   summarize(mean_bty_avg = mean(bty_avg), mean_score = mean(score),
             median_bty_avg = median(bty_avg), median_score = median(score))
 
 
-## ----eval=FALSE----------------------------------------------------------
-## evals_ch6 %>%
-##   select(score, bty_avg) %>%
-##   skim()
+## ----eval=FALSE---------------------------------------------------------------
+## evals_ch5 %>% select(score, bty_avg) %>% skim()
 
 
-## ----correlation1, echo=FALSE, fig.cap="Different correlation coefficients."----
+## ----correlation1, echo=FALSE, fig.cap="Nine different correlation coefficients.", fig.height=2.6----
 correlation <- c(-0.9999, -0.9, -0.75, -0.3, 0, 0.3, 0.75, 0.9, 0.9999)
 n_sim <- 100
 values <- NULL
-for(i in seq_len(length(correlation))){
+for(i in seq_along(correlation)){
   rho <- correlation[i]
   sigma <- matrix(c(5, rho * sqrt(50), rho * sqrt(50), 10), 2, 2)
   sim <- rmvnorm(
@@ -78,41 +87,52 @@ for(i in seq_len(length(correlation))){
   values <- bind_rows(values, sim)
 }
 
-ggplot(data = values, mapping = aes(V1, V2)) +
+corr_plot <- ggplot(data = values, mapping = aes(V1, V2)) +
   geom_point() +
   facet_wrap(~ correlation, ncol = 3) +
   labs(x = "x", y = "y") +
   theme(
     axis.text.x = element_blank(),
     axis.text.y = element_blank(),
-    axis.ticks = element_blank()
+    axis.ticks = element_blank())
+
+if(knitr::is_latex_output()){
+  corr_plot +
+  theme(
+    strip.text = element_text(colour = "black"),
+    strip.background = element_rect(fill = "grey93")
   )
+} else {
+  corr_plot
+}
 
 
-## ------------------------------------------------------------------------
-evals_ch6 %>%
+## -----------------------------------------------------------------------------
+evals_ch5 %>% 
   get_correlation(formula = score ~ bty_avg)
 
 
-## ------------------------------------------------------------------------
-evals_ch6 %>%
-  summarize(correlation = cor(score, bty_avg))
+## ---- eval=FALSE--------------------------------------------------------------
+## evals_ch5 %>%
+##   summarize(correlation = cor(score, bty_avg))
 
 
-## ---- echo=FALSE---------------------------------------------------------
-cor_ch6 <- evals_ch6 %>%
-  summarize(correlation = cor(score, bty_avg)) %>%
-  pull(correlation) %>%
-  round(3)
+## ----echo=FALSE---------------------------------------------------------------
+cor_ch5 <- evals_ch5 %>%
+  summarize(correlation = cor(score, bty_avg)) %>% 
+  round(3) %>% 
+  pull()
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+## ---- eval=FALSE--------------------------------------------------------------
+## ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
 ##   geom_point() +
-##   labs(x = "Beauty Score", y = "Teaching Score",
+##   labs(x = "Beauty Score",
+##        y = "Teaching Score",
 ##        title = "Scatterplot of relationship of teaching and beauty scores")
 
-## ----numxplot1, warning=FALSE, echo=FALSE, fig.cap="Instructor evaluation scores at UT Austin."----
+
+## ----numxplot1, warning=FALSE, echo=FALSE, fig.cap="Instructor evaluation scores at UT Austin.", fig.height=4.5----
 # Define orange box
 margin_x <- 0.15
 margin_y <- 0.075
@@ -121,29 +141,30 @@ box <- tibble(
   y = c(4.6, 4.6, 5, 5, 4.6) + c(-1, -1, 1, 1, -1) * margin_y
   )
 
-ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
   geom_point() +
-  labs(x = "Beauty Score", y = "Teaching Score",
+  labs(x = "Beauty Score", 
+       y = "Teaching Score",
        title = "Scatterplot of relationship of teaching and beauty scores") +
   geom_path(data = box, aes(x=x, y=y), col = "orange", size = 1)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+## ---- eval=FALSE--------------------------------------------------------------
+## ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
 ##   geom_jitter() +
 ##   labs(x = "Beauty Score", y = "Teaching Score",
 ##        title = "Scatterplot of relationship of teaching and beauty scores")
 
-## ----numxplot2, warning=FALSE, echo=FALSE, fig.cap="Instructor evaluation scores at UT Austin."----
-ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+## ----numxplot2, warning=FALSE, echo=FALSE, fig.cap="Instructor evaluation scores at UT Austin.", fig.height=4.2----
+ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
   geom_jitter() +
   labs(x = "Beauty Score", y = "Teaching Score",
        title = "(Jittered) Scatterplot of relationship of teaching and beauty scores") +
-  geom_path(data = box, aes(x=x, y=y), col = "orange", size = 1)
+  geom_path(data = box, aes(x = x, y = y), col = "orange", size = 1)
 
 
-## ----numxplot3, warning=FALSE, fig.cap="Regression line."----------------
-ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+## ----numxplot3, warning=FALSE, fig.cap="Regression line."---------------------
+ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
   geom_point() +
   labs(x = "Beauty Score", y = "Teaching Score",
        title = "Relationship between teaching and beauty scores") +  
@@ -154,37 +175,38 @@ ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## score_model <- lm(score ~ bty_avg, data = evals_ch6)
+## score_model <- lm(score ~ bty_avg, data = evals_ch5)
 ## # Get regression table:
 ## get_regression_table(score_model)
 
-## ---- echo=FALSE---------------------------------------------------------
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+## ---- echo=FALSE--------------------------------------------------------------
+score_model <- lm(score ~ bty_avg, data = evals_ch5)
 evals_line <- score_model %>%
   get_regression_table() %>%
   pull(estimate)
 
-## ----regtable, echo=FALSE------------------------------------------------
+## ----regtable, echo=FALSE-----------------------------------------------------
 get_regression_table(score_model) %>%
   knitr::kable(
     digits = 3,
     caption = "Linear regression table",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## score_model <- lm(score ~ bty_avg, data = evals_ch6)
+## score_model <- lm(score ~ bty_avg, data = evals_ch5)
 ## # Get regression table:
 ## get_regression_table(score_model)
 
 
-## ----moderndive-figure-wrapper, echo=FALSE, fig.align='center', fig.cap="The concept of a wrapper function."----
+## ----moderndive-figure-wrapper, echo=FALSE, fig.align='center', fig.cap="The concept of a wrapper function.", out.height="60%", out.width="60%"----
 knitr::include_graphics("images/shutterstock/wrapper_function.png")
 
 
@@ -192,8 +214,8 @@ knitr::include_graphics("images/shutterstock/wrapper_function.png")
 
 
 
-## ----instructor-21, echo=FALSE-------------------------------------------
-index <- which(evals_ch6$bty_avg == 7.333 & evals_ch6$score == 4.9)
+## ----instructor-21, echo=FALSE------------------------------------------------
+index <- which(evals_ch5$bty_avg == 7.333 & evals_ch5$score == 4.9)
 target_point <- score_model %>%
   get_regression_points() %>%
   slice(index)
@@ -201,19 +223,20 @@ x <- target_point$bty_avg
 y <- target_point$score
 y_hat <- target_point$score_hat
 resid <- target_point$residual
-evals_ch6 %>%
+evals_ch5 %>%
   slice(index) %>%
   knitr::kable(
     digits = 4,
     caption = "Data for the 21st course out of 463",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----numxplot4, echo=FALSE, warning=FALSE, fig.cap="Example of observed value, fitted value, and residual."----
-best_fit_plot <- ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+## ----numxplot4, echo=FALSE, warning=FALSE, fig.cap="Example of observed value, fitted value, and residual.", fig.height=2.8----
+best_fit_plot <- ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
   geom_point(color = "grey") +
   labs(x = "Beauty Score", y = "Teaching Score",
        title = "Relationship of teaching and beauty scores") +
@@ -225,12 +248,12 @@ best_fit_plot <- ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
 best_fit_plot
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## regression_points <- get_regression_points(score_model)
 ## regression_points
 
 
-## ----regression-points-1, echo=FALSE-------------------------------------
+## ----regression-points-1, echo=FALSE------------------------------------------
 set.seed(76)
 regression_points <- get_regression_points(score_model)
 regression_points %>%
@@ -238,7 +261,8 @@ regression_points %>%
   knitr::kable(
     digits = 3,
     caption = "Regression points (for only the 21st through 24th courses)",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   )
 
 
@@ -246,14 +270,14 @@ regression_points %>%
 
 
 
-## ---- warning=FALSE, message=FALSE---------------------------------------
+## ---- warning=FALSE, message=FALSE--------------------------------------------
 library(gapminder)
 gapminder2007 <- gapminder %>%
   filter(year == 2007) %>%
   select(country, lifeExp, continent, gdpPercap)
 
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 # Hidden: internally compute mean and median life expectancy
 lifeExp_worldwide <- gapminder2007 %>%
   summarize(median = median(lifeExp), mean = mean(lifeExp))
@@ -263,77 +287,99 @@ mean_africa <- gapminder2007 %>%
   pull(mean_africa)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 glimpse(gapminder2007)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## gapminder2007 %>%
-##   sample_n(size = 5)
+## ---- eval=FALSE--------------------------------------------------------------
+## gapminder2007 %>% sample_n(size = 5)
 
-## ----model2-data-preview, echo=FALSE-------------------------------------
+## ----model2-data-preview, echo=FALSE------------------------------------------
 gapminder2007 %>%
   sample_n(5) %>%
   knitr::kable(
     digits = 3,
     caption = "Random sample of 5 out of 142 countries",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## gapminder2007 %>%
 ##   select(lifeExp, continent) %>%
 ##   skim()
 
 
-## ----lifeExp2007hist, echo=TRUE, warning=FALSE, fig.cap="Histogram of Life Expectancy in 2007."----
+## ----lifeExp2007hist, echo=TRUE, warning=FALSE, fig.cap="Histogram of life expectancy in 2007.", fig.height=5.2----
 ggplot(gapminder2007, aes(x = lifeExp)) +
   geom_histogram(binwidth = 5, color = "white") +
   labs(x = "Life expectancy", y = "Number of countries",
        title = "Histogram of distribution of worldwide life expectancies")
 
 
-## ----catxplot0b, warning=FALSE, fig.cap="Life expectancy in 2007."-------
-ggplot(gapminder2007, aes(x = lifeExp)) +
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(gapminder2007, aes(x = lifeExp)) +
+##   geom_histogram(binwidth = 5, color = "white") +
+##   labs(x = "Life expectancy",
+##        y = "Number of countries",
+##        title = "Histogram of distribution of worldwide life expectancies") +
+##   facet_wrap(~ continent, nrow = 2)
+
+
+## ----catxplot0b, echo=FALSE, warning=FALSE, fig.cap="Life expectancy in 2007.", fig.height=4.3----
+faceted_life_exp <- ggplot(gapminder2007, aes(x = lifeExp)) +
   geom_histogram(binwidth = 5, color = "white") +
   labs(x = "Life expectancy", y = "Number of countries",
        title = "Histogram of distribution of worldwide life expectancies") +
   facet_wrap(~ continent, nrow = 2)
 
+# Make the text black and reduce darkness of the grey in the facet labels
+if(knitr::is_latex_output()) {
+  faceted_life_exp + 
+    theme(strip.text = element_text(colour = "black"),
+          strip.background = element_rect(fill = "grey93")
+    )
+} else {
+  faceted_life_exp
+}
+
 
-## ----catxplot1, warning=FALSE, fig.cap="Life expectancy in 2007."--------
+## ----catxplot1, warning=FALSE, fig.cap="Life expectancy in 2007.", fig.height=3.4----
 ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
   geom_boxplot() +
-  labs(x = "Continent", y = "Life expectancy (years)",
+  labs(x = "Continent", y = "Life expectancy",
        title = "Life expectancy by continent")
 
 
-## ---- eval=TRUE----------------------------------------------------------
+## ---- eval=TRUE---------------------------------------------------------------
 lifeExp_by_continent <- gapminder2007 %>%
   group_by(continent) %>%
-  summarize(median = median(lifeExp), mean = mean(lifeExp))
+  summarize(median = median(lifeExp), 
+            mean = mean(lifeExp))
 
-## ----catxplot0, echo=FALSE-----------------------------------------------
+## ----catxplot0, echo=FALSE----------------------------------------------------
 lifeExp_by_continent %>%
   knitr::kable(
     digits = 3,
     caption = "Life expectancy by continent",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   )
 
 
-## ----continent-mean-life-expectancies, echo=FALSE------------------------
+## ----continent-mean-life-expectancies, echo=FALSE-----------------------------
 gapminder2007 %>%
   group_by(continent) %>%
   summarize(mean = mean(lifeExp)) %>%
   mutate(`Difference versus Africa` = mean - mean_africa) %>%
   knitr::kable(
     digits = 3,
-    caption = "Mean life expectancy by continent and relative differences from mean for Africa.",
-    booktabs = TRUE
+    caption = "Mean life expectancy by continent and relative differences from mean for Africa",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -343,21 +389,20 @@ gapminder2007 %>%
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## # Fit regression model:
+## ---- eval=FALSE--------------------------------------------------------------
 ## lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
-## # Get regression table:
 ## get_regression_table(lifeExp_model)
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
 
-## ----catxplot4b, echo=FALSE----------------------------------------------
+## ----catxplot4b, echo=FALSE---------------------------------------------------
 get_regression_table(lifeExp_model) %>%
   knitr::kable(
     digits = 3,
     caption = "Linear regression table",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -367,18 +412,19 @@ get_regression_table(lifeExp_model) %>%
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## regression_points <- get_regression_points(lifeExp_model, ID = "country")
 ## regression_points
 
-## ----model2-residuals, echo=FALSE----------------------------------------
+## ----model2-residuals, echo=FALSE---------------------------------------------
 regression_points <- get_regression_points(lifeExp_model, ID = "country")
 regression_points %>%
   slice(1:10) %>%
   knitr::kable(
     digits = 3,
     caption = "Regression points (First 10 out of 142 countries)",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -388,17 +434,17 @@ regression_points %>%
 
 
 
-## ----moderndive-figure-causal-graph-2, echo=FALSE, fig.align='center', fig.cap="Does sleeping with shoes on cause headaches?"----
+## ----moderndive-figure-causal-graph-2, echo=FALSE, fig.align='center', fig.cap="Does sleeping with shoes on cause headaches?", out.width="60%", out.height="60%"----
 knitr::include_graphics("images/shutterstock/shoes_headache.png")
 
 
-## ----moderndive-figure-causal-graph, echo=FALSE, fig.align='center', fig.cap="Causal graph."----
+## ----moderndive-figure-causal-graph, echo=FALSE, fig.align='center', out.width="50%", fig.cap="Causal graph."----
 knitr::include_graphics("images/flowcharts/flowchart.009-cropped.png")
 
 
-## ----best-fitting-line, fig.height = 8, fig.width = 8, echo=FALSE, warning=FALSE, fig.cap="Example of observed value, fitted value, and residual."----
+## ----best-fitting-line, fig.height=5.5, echo=FALSE, warning=FALSE, fig.cap="Example of observed value, fitted value, and residual."----
 # First residual
-best_fit_plot <- ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+best_fit_plot <- ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
   geom_point(size = 0.8, color = "grey") +
   labs(x = "Beauty Score", y = "Teaching Score") +
   geom_smooth(method = "lm", se = FALSE) +
@@ -409,7 +455,7 @@ best_fit_plot <- ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
 p1 <- best_fit_plot + labs(title = "First instructor's residual")
 
 # Second residual
-index <- which(evals_ch6$bty_avg == 2.333 & evals_ch6$score == 2.7)
+index <- which(evals_ch5$bty_avg == 2.333 & evals_ch5$score == 2.7)
 target_point <- get_regression_points(score_model) %>%
   slice(index)
 x <- target_point$bty_avg
@@ -425,8 +471,8 @@ best_fit_plot <- best_fit_plot +
 p2 <- best_fit_plot + labs(title = "Adding second instructor's residual")
 
 # Third residual
-index <- which(evals_ch6$bty_avg == 3.667 & evals_ch6$score == 4.4)
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+index <- which(evals_ch5$bty_avg == 3.667 & evals_ch5$score == 4.4)
+score_model <- lm(score ~ bty_avg, data = evals_ch5)
 target_point <- get_regression_points(score_model) %>%
   slice(index)
 x <- target_point$bty_avg
@@ -442,8 +488,8 @@ best_fit_plot <- best_fit_plot +
            arrow = arrow(type = "closed", length = unit(0.02, "npc")))
 p3 <- best_fit_plot + labs(title = "Adding third instructor's residual")
 
-index <- which(evals_ch6$bty_avg == 6 & evals_ch6$score == 3.8)
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+index <- which(evals_ch5$bty_avg == 6 & evals_ch5$score == 3.8)
+score_model <- lm(score ~ bty_avg, data = evals_ch5)
 target_point <- get_regression_points(score_model) %>%
   slice(index)
 x <- target_point$bty_avg
@@ -461,14 +507,14 @@ p4 <- best_fit_plot + labs(title = "Adding fourth instructor's residual")
 p1 + p2 + p3 + p4 + plot_layout(nrow = 2)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 # Fit regression model:
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+score_model <- lm(score ~ bty_avg, 
+                  data = evals_ch5)
 
 # Get regression points:
 regression_points <- get_regression_points(score_model)
 regression_points
-
 # Compute sum of squared residuals
 regression_points %>%
   mutate(squared_residuals = residual^2) %>%
@@ -477,51 +523,52 @@ regression_points %>%
 
 
 
-## ----three-lines, fig.cap="Regression line and two others.", out.width="80%", echo=FALSE----
+## ----three-lines, fig.cap="Regression line and two others.", out.width="85%", echo=FALSE----
 example <- tibble(
   x = c(0, 0.5, 1),
   y = c(2, 1, 3)
 )
+
 ggplot(example, aes(x = x, y = y)) +
   geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
-  geom_hline(yintercept = 2.5, col = "red", linetype = "dashed", size = 1) +
-  geom_abline(intercept = 2, slope = -1, col = "forestgreen", linetype = "dashed", size = 1) +
+  geom_hline(yintercept = 2.5, col = "red", linetype = "dotted", size = 1) +
+  geom_abline(intercept = 2, slope = -1, col = "forestgreen", 
+              linetype = "dashed", size = 1) +
   geom_point(size = 4)
-# model_example <- lm(y ~ x, data = example)
-# get_regression_table(model_example)
 
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## score_model <- lm(score ~ bty_avg, data = evals_ch6)
+## score_model <- lm(formula = score ~ bty_avg, data = evals_ch5)
 ## # Get regression table:
 ## get_regression_table(score_model)
 
-## ----recall-table, echo=FALSE--------------------------------------------
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+
+## ----recall-table, echo=FALSE-------------------------------------------------
+score_model <- lm(score ~ bty_avg, data = evals_ch5)
 get_regression_table(score_model) %>%
   knitr::kable(
     digits = 3,
-    caption = "Regression table.",
-    booktabs = TRUE
+    caption = "Regression table",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(broom)
 ## library(janitor)
 ## score_model %>%
 ##   tidy(conf.int = TRUE) %>%
 ##   mutate_if(is.numeric, round, digits = 3) %>%
 ##   clean_names() %>%
-##   rename(lower_ci = conf_low,
-##          upper_ci = conf_high)
+##   rename(lower_ci = conf_low, upper_ci = conf_high)
 
-## ----regtable-broom, echo=FALSE, message=FALSE, warning=FALSE------------
+## ----regtable-broom, echo=FALSE, message=FALSE, warning=FALSE-----------------
 library(broom)
 library(janitor)
 score_model %>%
@@ -532,14 +579,15 @@ score_model %>%
          upper_ci = conf_high) %>%
   knitr::kable(
     digits = 3,
-    caption = "Regression table using tidy() from broom package.",
-    booktabs = TRUE
+    caption = "Regression table using tidy() from broom package",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(broom)
 ## library(janitor)
 ## score_model %>%
@@ -548,7 +596,7 @@ score_model %>%
 ##   clean_names() %>%
 ##   select(-c("se_fit", "hat", "sigma", "cooksd", "std_resid"))
 
-## ----regpoints-augment, echo=FALSE---------------------------------------
+## ----regpoints-augment, echo=FALSE--------------------------------------------
 library(broom)
 library(janitor)
 score_model %>%
@@ -559,9 +607,16 @@ score_model %>%
   slice(1:10) %>%
   knitr::kable(
     digits = 3,
-    caption = "Regression points using augment() from broom package.",
-    booktabs = TRUE
+    caption = "Regression points using augment() from broom package",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
+
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
+
diff --git a/docs/scripts/06-multiple-regression.R b/docs/scripts/06-multiple-regression.R
index 1ed1e703f..b212f034f 100644
--- a/docs/scripts/06-multiple-regression.R
+++ b/docs/scripts/06-multiple-regression.R
@@ -1,10 +1,10 @@
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## library(tidyverse)
 ## library(moderndive)
 ## library(skimr)
 ## library(ISLR)
 
-## ---- echo=FALSE, message=FALSE, warning=FALSE---------------------------
+## ---- echo=FALSE, message=FALSE, warning=FALSE--------------------------------
 library(tidyverse)
 library(moderndive)
 # DO NOT load the skimr package as a whole as it will break all kable() code for 
@@ -14,64 +14,62 @@ library(moderndive)
 library(ISLR)
 
 
-## ---- message=FALSE, warning=FALSE, echo=FALSE---------------------------
+## ---- message=FALSE, warning=FALSE, echo=FALSE--------------------------------
 # Packages needed internally, but not in text:
 library(kableExtra)
 library(patchwork)
 library(gapminder)
 
 
-## ------------------------------------------------------------------------
-evals_ch7 <- evals %>%
+## -----------------------------------------------------------------------------
+evals_ch6 <- evals %>%
   select(ID, score, age, gender)
 
 
-## ------------------------------------------------------------------------
-glimpse(evals_ch7)
+## -----------------------------------------------------------------------------
+glimpse(evals_ch6)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## evals_ch7 %>%
-##   sample_n(size = 5)
+## ---- eval=FALSE--------------------------------------------------------------
+## evals_ch6 %>% sample_n(size = 5)
 
-## ----model4-data-preview, echo=FALSE-------------------------------------
-evals_ch7 %>%
+## ----model4-data-preview, echo=FALSE------------------------------------------
+evals_ch6 %>%
   sample_n(5) %>%
   knitr::kable(
     digits = 3,
     caption = "A random sample of 5 out of the 463 courses at UT Austin",
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval =FALSE--------------------------------------------------------
-## evals_ch7 %>%
-##   select(score, age, gender) %>%
-##   skim()
+## ---- eval =FALSE-------------------------------------------------------------
+## evals_ch6 %>% select(score, age, gender) %>% skim()
 
 
-## ------------------------------------------------------------------------
-evals_ch7 %>% 
+## -----------------------------------------------------------------------------
+evals_ch6 %>% 
   get_correlation(formula = score ~ age)
 
 
-## ----eval=FALSE----------------------------------------------------------
-## ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
 ##   geom_point() +
 ##   labs(x = "Age", y = "Teaching Score", color = "Gender") +
 ##   geom_smooth(method = "lm", se = FALSE)
 
 
-## ----numxcatxplot1, echo=FALSE, warning=FALSE, fig.cap="Colored scatterplot of relationship of teaching and beauty scores."----
+## ----numxcatxplot1, echo=FALSE, warning=FALSE, fig.cap="Colored scatterplot of relationship of teaching score and age.", fig.height=3.2----
 if(knitr::is_html_output()){
-  ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
+  ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
     geom_point() +
     labs(x = "Age", y = "Teaching Score", color = "Gender") +
     geom_smooth(method = "lm", se = FALSE)
 } else {
-    ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
+    ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
     geom_point() +
     labs(x = "Age", y = "Teaching Score", color = "Gender") +
     geom_smooth(method = "lm", se = FALSE) +
@@ -79,7 +77,7 @@ if(knitr::is_html_output()){
 }
 
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 # Wrangle data
 gapminder2007 <- gapminder %>%
   filter(year == 2007) %>%
@@ -92,32 +90,35 @@ lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
 get_regression_table(lifeExp_model) %>%
   knitr::kable(
     digits = 3,
-    caption = "Regression table for life expectancy as a function of continent.",
-    booktabs = TRUE
+    caption = "Regression table for life expectancy as a function of continent",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## score_model_interaction <- lm(score ~ age * gender, data = evals_ch7)
+## score_model_interaction <- lm(score ~ age * gender, data = evals_ch6)
+## 
 ## # Get regression table:
 ## get_regression_table(score_model_interaction)
 
-## ----regtable-interaction, echo=FALSE------------------------------------
-score_model_interaction <- lm(score ~ age * gender, data = evals_ch7)
+## ----regtable-interaction, echo=FALSE-----------------------------------------
+score_model_interaction <- lm(score ~ age * gender, data = evals_ch6)
 get_regression_table(score_model_interaction) %>% 
   knitr::kable(
     digits = 3,
-    caption = "Regression table for interaction model.", 
-    booktabs = TRUE
+    caption = "Regression table for interaction model", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----interaction-summary, echo=FALSE-------------------------------------
+## ----interaction-summary, echo=FALSE------------------------------------------
 options(digits = 4)
 tibble(
   Gender = c("Female instructors", "Male instructors"),
@@ -126,22 +127,27 @@ tibble(
 ) %>% 
   knitr::kable(
     digits = 4,
-    caption = "Comparison of intercepts and slopes for interaction model.", 
-    booktabs = TRUE
+    caption = "Comparison of intercepts and slopes for interaction model", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 options(digits = 3)
 
 
-## ----eval=FALSE----------------------------------------------------------
-## gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender",
-##                    data = evals_ch7)
+## ----eval=FALSE---------------------------------------------------------------
+## ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
+##   geom_point() +
+##   labs(x = "Age", y = "Teaching Score", color = "Gender") +
+##   geom_parallel_slopes(se = FALSE)
 
 
-## ----numxcatx-parallel, echo=FALSE, warning=FALSE, fig.cap="Parallel slopes model of relationship of score with age and gender."----
-par_slopes <- gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender", 
-                   data = evals_ch7)
+## ----numxcatx-parallel, echo=FALSE, warning=FALSE, fig.cap="Parallel slopes model of score with age and gender.", fig.height=3.5----
+par_slopes <- ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
+  geom_point() +
+  labs(x = "Age", y = "Teaching Score", color = "Gender") +
+  geom_parallel_slopes(se = FALSE)
 if(knitr::is_html_output()){
   par_slopes
 } else {
@@ -150,25 +156,32 @@ if(knitr::is_html_output()){
 }
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)
+## score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch6)
 ## # Get regression table:
 ## get_regression_table(score_model_parallel_slopes)
 
-## ----regtable-parallel-slopes, echo=FALSE--------------------------------
-score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)
+## ----regtable-parallel-slopes, echo=FALSE-------------------------------------
+score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch6)
 get_regression_table(score_model_parallel_slopes) %>% 
   knitr::kable(
     digits = 3,
-    caption = "Regression table for parallel slopes model.", 
-    booktabs = TRUE
+    caption = "Regression table for parallel slopes model", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----parallel-slopes-summary, echo=FALSE---------------------------------
+## ----echo=FALSE---------------------------------------------------------------
+age_coef <- get_regression_table(score_model_parallel_slopes) %>% 
+  filter(term == "age") %>% 
+  pull(estimate)
+
+
+## ----parallel-slopes-summary, echo=FALSE--------------------------------------
 options(digits = 4)
 tibble(
   Gender = c("Female instructors", "Male instructors"),
@@ -177,8 +190,9 @@ tibble(
 ) %>% 
   knitr::kable(
     digits = 4,
-    caption = "Comparison of intercepts and slope for parallel slopes model.", 
-    booktabs = TRUE
+    caption = "Comparison of intercepts and slope for parallel slopes model", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -186,15 +200,15 @@ options(digits = 3)
 
 
 ## ----numxcatx-comparison, fig.width=8, echo=FALSE, warning=FALSE, fig.cap="Comparison of interaction and parallel slopes models."----
-interaction_plot <- ggplot(evals_ch7, aes(x = age, y = score, color = gender), show.legend = FALSE) +
+interaction_plot <- ggplot(evals_ch6, aes(x = age, y = score, color = gender), show.legend = FALSE) +
   geom_point() +
   labs(x = "Age", y = "Teaching Score", title = "Interaction model") +
   geom_smooth(method = "lm", se = FALSE) +
   theme(legend.position = "none")
-parallel_slopes_plot <- gg_parallel_slopes(y = "score", 
-                                           num_x = "age", 
-                                           cat_x = "gender", 
-                                           data = evals_ch7) +
+parallel_slopes_plot <- ggplot(evals_ch6, aes(x = age, y = score, color = gender), show.legend = FALSE) +
+  geom_point() +
+  labs(x = "Age", y = "Teaching Score", title = "Parallel slopes model") +
+  geom_parallel_slopes(se = FALSE) +
   labs(x = "Age", y = "Teaching Score", title = "Parallel slopes model") +
   theme(axis.title.y = element_blank())
 
@@ -209,12 +223,12 @@ if(knitr::is_html_output()){
 }
 
 
-## ----fitted-values, echo=FALSE, warning=FALSE, fig.cap="Fitted values for two new professors."----
-newpoints <- evals_ch7 %>% 
+## ----fitted-values, echo=FALSE, warning=FALSE, fig.cap="Fitted values for two new professors.", fig.height=4.7----
+newpoints <- evals_ch6 %>% 
   slice(c(1, 5)) %>% 
   get_regression_points(score_model_interaction, newdata = .)
 
-fitted_plot <- ggplot(evals_ch7, aes(x = age, y = score, color = gender), show.legend = FALSE) +
+fitted_plot <- ggplot(evals_ch6, aes(x = age, y = score, color = gender), show.legend = FALSE) +
   geom_point() +
   labs(x = "Age", y = "Teaching Score", title = "Interaction model") +
   geom_smooth(method = "lm", se = FALSE) +
@@ -228,18 +242,18 @@ if(knitr::is_html_output()){
 }
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## regression_points <- get_regression_points(score_model_interaction)
 ## regression_points
 
-## ----model4-points-table, echo=FALSE-------------------------------------
+## ----model4-points-table, echo=FALSE------------------------------------------
 regression_points <- get_regression_points(score_model_interaction)
 regression_points %>%
   slice(1:10) %>%
   knitr::kable(
     digits = 3,
-    caption = "Regression points (First 10 out of 463 courses)",
-    booktabs = TRUE
+    caption = "Regression points (First 10 out of 463 courses)"#,
+#    booktabs = TRUE
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -249,88 +263,84 @@ regression_points %>%
 
 
 
-## ---- warning=FALSE, message=FALSE---------------------------------------
+## ---- warning=FALSE, message=FALSE--------------------------------------------
 library(ISLR)
-credit_ch7 <- Credit %>%
-  as_tibble() %>% 
+credit_ch6 <- Credit %>% as_tibble() %>% 
   select(ID, debt = Balance, credit_limit = Limit, 
          income = Income, credit_rating = Rating, age = Age)
 
 
-## ------------------------------------------------------------------------
-glimpse(credit_ch7)
+## -----------------------------------------------------------------------------
+glimpse(credit_ch6)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## set.seed(9)
-## credit_ch7 %>%
-##   sample_n(size = 5)
+## ---- eval=FALSE--------------------------------------------------------------
+## credit_ch6 %>% sample_n(size = 5)
 
-## ----model3-data-preview, echo=FALSE-------------------------------------
-credit_ch7 %>%
+## ----model3-data-preview, echo=FALSE------------------------------------------
+credit_ch6 %>%
   sample_n(5) %>%
   knitr::kable(
     digits = 3,
-    caption = "Random sample of 5 credit card holders.",
-    booktabs = TRUE
+    caption = "Random sample of 5 credit card holders",
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## credit_ch7 %>%
-##   select(debt, credit_limit, income) %>%
-##   skim()
+## ---- eval=FALSE--------------------------------------------------------------
+## credit_ch6 %>% select(debt, credit_limit, income) %>% skim()
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## credit_ch7 %>%
-##   get_correlation(debt ~ credit_limit)
-## credit_ch7 %>%
-##   get_correlation(debt ~ income)
+## ---- eval=FALSE--------------------------------------------------------------
+## credit_ch6 %>% get_correlation(debt ~ credit_limit)
+## credit_ch6 %>% get_correlation(debt ~ income)
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## credit_ch7 %>%
+## ---- eval=FALSE--------------------------------------------------------------
+## credit_ch6 %>%
 ##   select(debt, credit_limit, income) %>%
 ##   cor()
 
-## ----model3-correlation, echo=FALSE--------------------------------------
-credit_ch7 %>% 
+## ----model3-correlation, echo=FALSE-------------------------------------------
+credit_ch6 %>% 
   select(debt, credit_limit, income) %>% 
   cor() %>% 
   knitr::kable(
     digits = 3,
-    caption = "Correlation coefficients between credit card debt, credit limit, and income.", 
-    booktabs = TRUE
+    caption = "Correlation coefficients between credit card debt, credit limit, and income", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## ggplot(credit_ch7, aes(x = credit_limit, y = debt)) +
+## ---- eval=FALSE--------------------------------------------------------------
+## ggplot(credit_ch6, aes(x = credit_limit, y = debt)) +
 ##   geom_point() +
 ##   labs(x = "Credit limit (in $)", y = "Credit card debt (in $)",
 ##        title = "Debt and credit limit") +
 ##   geom_smooth(method = "lm", se = FALSE)
 ## 
-## ggplot(credit_ch7, aes(x = income, y = debt)) +
+## ggplot(credit_ch6, aes(x = income, y = debt)) +
 ##   geom_point() +
 ##   labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
 ##        title = "Debt and income") +
 ##   geom_smooth(method = "lm", se = FALSE)
 
-## ----2numxplot1, echo=FALSE, fig.cap="Relationship between credit card debt and credit limit/income."----
-model3_balance_vs_limit_plot <- ggplot(credit_ch7, aes(x = credit_limit, y = debt)) +
+
+## ----2numxplot1, echo=FALSE, fig.cap="Relationship between credit card debt and credit limit/income.", fig.height=3.2----
+model3_balance_vs_limit_plot <- ggplot(credit_ch6, aes(x = credit_limit, y = debt)) +
   geom_point() +
   labs(x = "Credit limit (in $)", y = "Credit card debt (in $)", 
        title = "Debt and credit limit") +
   geom_smooth(method = "lm", se = FALSE) +
   scale_y_continuous(limits = c(0, 2000))
 
-model3_balance_vs_income_plot <- ggplot(credit_ch7, aes(x = income, y = debt)) +
+model3_balance_vs_income_plot <- ggplot(credit_ch6, aes(x = income, y = debt)) +
   geom_point() +
   labs(x = "Income (in $1000)", y = "Credit card debt (in $)", 
        title = "Debt and income") +
@@ -343,16 +353,16 @@ model3_balance_vs_limit_plot + model3_balance_vs_income_plot
 
 
 
-## ---- eval=FALSE, echo=FALSE---------------------------------------------
+## ---- eval=FALSE, echo=FALSE--------------------------------------------------
 ## # Source code for above 3D scatterplot & regression plane.
 ## library(ISLR)
 ## library(plotly)
 ## library(tidyverse)
 ## 
 ## # setup hideous grid required by plotly
-## model_lm <- lm(debt ~ income + credit_limit, data=credit_ch7)
-## x_grid <- seq(from = min(credit_ch7$income), to = max(credit_ch7$income), length = 100)
-## y_grid <- seq(from = min(credit_ch7$credit_limit), to = max(credit_ch7$credit_limit), length = 200)
+## model_lm <- lm(debt ~ income + credit_limit, data = credit_ch6)
+## x_grid <- seq(from = min(credit_ch6$income), to = max(credit_ch6$income), length = 100)
+## y_grid <- seq(from = min(credit_ch6$credit_limit), to = max(credit_ch6$credit_limit), length = 200)
 ## z_grid <- expand.grid(x_grid, y_grid) %>%
 ##   tbl_df() %>%
 ##   rename(x_grid = Var1, y_grid = Var2) %>%
@@ -364,16 +374,16 @@ model3_balance_vs_limit_plot + model3_balance_vs_income_plot
 ## # Plot points
 ## plot_ly() %>%
 ##   add_markers(
-##     x = credit_ch7$income,
-##     y = credit_ch7$credit_limit,
-##     z = credit_ch7$debt,
+##     x = credit_ch6$income,
+##     y = credit_ch6$credit_limit,
+##     z = credit_ch6$debt,
 ##     hoverinfo = 'text',
 ##     text = ~paste("x1 - Income: ",
-##                   credit_ch7$income,
+##                   credit_ch6$income,
 ##                   "</br> x2 - Credit Limit: ",
-##                   credit_ch7$credit_limit,
+##                   credit_ch6$credit_limit,
 ##                   "</br> y - Debt: ",
-##                   credit_ch7$debt)
+##                   credit_ch6$debt)
 ##   ) %>%
 ##   # Label axes
 ##   layout(
@@ -395,21 +405,22 @@ model3_balance_vs_limit_plot + model3_balance_vs_income_plot
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## debt_model <- lm(debt ~ credit_limit + income, data = credit_ch7)
+## debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)
 ## # Get regression table:
 ## get_regression_table(debt_model)
 
-## ----model3-table-output, echo=FALSE-------------------------------------
-debt_model <- lm(debt ~ credit_limit + income, data = credit_ch7)
+## ----model3-table-output, echo=FALSE------------------------------------------
+debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)
 credit_line <- get_regression_table(debt_model) %>%
   pull(estimate)
 get_regression_table(debt_model) %>% 
   knitr::kable(
     digits = 3,
     caption = "Multiple regression table", 
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -419,27 +430,27 @@ get_regression_table(debt_model) %>%
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## regression_points <- get_regression_points(debt_model)
-## regression_points
+## ---- eval=FALSE--------------------------------------------------------------
+## get_regression_points(debt_model)
 
-## ----model3-points-table, echo=FALSE-------------------------------------
+## ----model3-points-table, echo=FALSE------------------------------------------
 set.seed(76)
 regression_points <- get_regression_points(debt_model)
 regression_points %>%
   slice(1:10) %>%
   knitr::kable(
     digits = 3,
-    caption = "Regression points (First 10 credit card holders out of 400).",
-    booktabs = TRUE
+    caption = "Regression points (First 10 credit card holders out of 400)",
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----recall-parallel-vs-interaction, fig.width=8, echo=FALSE, fig.cap="Previously seen comparison of interaction and parallel slopes models."----
+## ----recall-parallel-vs-interaction, fig.height=3.5, echo=FALSE, fig.cap="Previously seen comparison of interaction and parallel slopes models."----
 if(knitr::is_html_output()){
-  interaction_plot + parallel_slopes_plot
+  interaction_plot + (parallel_slopes_plot + labs(color = "gender\n(recorded\nas binary)"))
 } else {
   grey_interaction_plot <- interaction_plot +
     scale_color_grey()
@@ -449,33 +460,37 @@ if(knitr::is_html_output()){
 }
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Interaction model
 ## ggplot(MA_schools,
 ##        aes(x = perc_disadvan, y = average_sat_math, color = size)) +
 ##   geom_point(alpha = 0.25) +
-##   geom_smooth(method = "lm", se = FALSE ) +
+##   geom_smooth(method = "lm", se = FALSE) +
 ##   labs(x = "Percent economically disadvantaged", y = "Math SAT Score",
 ##        color = "School size", title = "Interaction model")
-## 
+
+
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Parallel slopes model
-## gg_parallel_slopes(y = "average_sat_math", num_x = "perc_disadvan",
-##                    cat_x = "size", data = MA_schools, alpha = 0.25) +
-##   labs(x = "Percent economically disadvantaged",
-##        y = "Math SAT Score",
-##        color = "School size",
-##        title = "Parallel slopes model")
-
-## ----numxcatx-comparison-2, fig.width=8, echo=FALSE, warning=FALSE, fig.cap="Comparison of interaction and parallel slopes models for MA schools."----
+## ggplot(MA_schools,
+##        aes(x = perc_disadvan, y = average_sat_math, color = size)) +
+##   geom_point(alpha = 0.25) +
+##   geom_parallel_slopes(se = FALSE) +
+##   labs(x = "Percent economically disadvantaged", y = "Math SAT Score",
+##        color = "School size", title = "Parallel slopes model")
+
+## ----numxcatx-comparison-2, fig.height=3.4, echo=FALSE, warning=FALSE, fig.cap="Comparison of interaction and parallel slopes models for Massachusetts schools."----
 p1 <- ggplot(MA_schools, 
              aes(x = perc_disadvan, y = average_sat_math, color = size)) +
   geom_point(alpha = 0.25) +
-  geom_smooth(method = "lm", se = FALSE ) +
+  geom_smooth(method = "lm", se = FALSE) +
   labs(x = "Percent economically disadvantaged", y = "Math SAT Score", 
        color = "School size", title = "Interaction model") + 
   theme(legend.position = "none")
-p2 <- gg_parallel_slopes(y = "average_sat_math", num_x = "perc_disadvan", 
-                         cat_x = "size", data = MA_schools, alpha = 0.25) + 
+p2 <- ggplot(MA_schools, 
+       aes(x = perc_disadvan, y = average_sat_math, color = size)) +
+  geom_point(alpha = 0.25) +
+  geom_parallel_slopes(se = FALSE) + 
   labs(x = "Percent economically disadvantaged", y = "Math SAT Score", 
        color = "School size", title = "Parallel slopes model")  +
   theme(axis.title.y = element_blank())
@@ -488,88 +503,91 @@ if(knitr::is_html_output()){
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## model_2_interaction <- lm(average_sat_math ~ perc_disadvan * size,
 ##                           data = MA_schools)
 ## get_regression_table(model_2_interaction)
 
-## ----model2-interaction, echo=FALSE--------------------------------------
+## ----model2-interaction, echo=FALSE-------------------------------------------
 model_2_interaction <- lm(average_sat_math ~ perc_disadvan * size, 
                           data = MA_schools)
 get_regression_table(model_2_interaction) %>% 
   knitr::kable(
     digits = 3,
     caption = "Interaction model regression table", 
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## model_2_parallel_slopes <- lm(average_sat_math ~ perc_disadvan + size,
 ##                               data = MA_schools)
 ## get_regression_table(model_2_parallel_slopes)
 
-## ----model2-parallel-slopes, echo=FALSE----------------------------------
+## ----model2-parallel-slopes, echo=FALSE---------------------------------------
 model_2_parallel_slopes <- lm(average_sat_math ~ perc_disadvan + size, 
                               data = MA_schools)
 get_regression_table(model_2_parallel_slopes) %>% 
   knitr::kable(
     digits = 3,
     caption = "Parallel slopes regression table", 
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## credit_ch7 %>%
-##   select(debt, income) %>%
+## ---- eval=FALSE--------------------------------------------------------------
+## credit_ch6 %>% select(debt, income) %>%
 ##   mutate(income = income * 1000) %>%
 ##   cor()
 
-## ----cor-credit-2, echo=FALSE--------------------------------------------
-credit_ch7 %>% 
+## ----cor-credit-2, echo=FALSE-------------------------------------------------
+credit_ch6 %>% 
   select(debt, income) %>% 
   mutate(income = income * 1000) %>% 
   cor() %>% 
   knitr::kable(
     digits = 3,
     caption = "Correlation between income (in dollars) and credit card debt", 
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----2numxplot1-repeat, echo=FALSE, fig.cap="Relationship between credit card debt and income."----
+## ----2numxplot1-repeat, echo=FALSE, fig.cap="Relationship between credit card debt and income.", fig.height=1.8----
 model3_balance_vs_income_plot
 
 
-## ----model3-table-output-repeat, echo=FALSE------------------------------
-debt_model <- lm(debt ~ credit_limit + income, data = credit_ch7)
+## ----model3-table-output-repeat, echo=FALSE-----------------------------------
+debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)
 credit_line <- get_regression_table(debt_model) %>%
   pull(estimate)
 get_regression_table(debt_model) %>% 
   knitr::kable(
     digits = 3,
-    caption = "Multiple regression table", 
-    booktabs = TRUE
+    caption = "Multiple regression results", 
+    booktabs = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
-## ----credit-limit-quartiles, echo=FALSE, fig.height=4, fig.cap="Histogram of credit limits and brackets.", message=FALSE----
-ggplot(credit_ch7, aes(x = credit_limit)) +
+## ----credit-limit-quartiles, echo=FALSE, fig.height=2.5, fig.cap="Histogram of credit limits and brackets.", message=FALSE----
+ggplot(credit_ch6, aes(x = credit_limit)) +
   geom_histogram(color = "white") +
-  geom_vline(xintercept = quantile(credit_ch7$credit_limit, probs = c(0.25, 0.5, 0.75)), linetype = "dashed", size = 1) +
+  geom_vline(xintercept = quantile(credit_ch6$credit_limit, probs = c(0.25, 0.5, 0.75)), linetype = "dashed", size = 1) +
   labs(x = "Credit limit", title = "Credit limit and 4 credit limit brackets.")
 
 
-## ----2numxplot4, echo=FALSE, fig.cap="Relationship between credit card debt and income by credit limit bracket."----
-credit_ch7 <- credit_ch7 %>% 
+## ----2numxplot4, echo=FALSE, fig.cap="Relationship between credit card debt and income by credit limit bracket.", fig.height=3----
+credit_ch6 <- credit_ch6 %>% 
   mutate(limit_bracket = cut_number(credit_limit, 4)) %>% 
   mutate(limit_bracket = fct_recode(limit_bracket,
     "low" =  "[855,3.09e+03]",
@@ -578,14 +596,14 @@ credit_ch7 <- credit_ch7 %>%
     "high" = "(5.87e+03,1.39e+04]"
   ))
 
-model3_balance_vs_income_plot <- ggplot(credit_ch7, aes(x = income, y = debt)) +
+model3_balance_vs_income_plot <- ggplot(credit_ch6, aes(x = income, y = debt)) +
   geom_point() +
   labs(x = "Income (in $1000)", y = "Credit card debt (in $)", 
        title = "Two scatterplots of credit card debt vs income") +
   geom_smooth(method = "lm", se = FALSE) +
   scale_y_continuous(limits = c(0, NA))
 
-model3_balance_vs_income_plot_colored <- ggplot(credit_ch7, 
+model3_balance_vs_income_plot_colored <- ggplot(credit_ch6, 
                                                 aes(x = income, y = debt, 
                                                     col = limit_bracket)) +
   geom_point() +
@@ -602,3 +620,9 @@ if(knitr::is_html_output()){
     (model3_balance_vs_income_plot_colored + scale_color_grey())
 }
 
+
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
+
diff --git a/docs/scripts/07-sampling.R b/docs/scripts/07-sampling.R
index 2800755f1..ce9dbef33 100644
--- a/docs/scripts/07-sampling.R
+++ b/docs/scripts/07-sampling.R
@@ -1,9 +1,19 @@
-## ----message=FALSE, warning=FALSE----------------------------------------
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("# (PART) (ref:inferpart) {-}")
+} else {
+  cat("# (PART) Statistical Inference with infer {-} ")
+}
+
+
+
+
+## ----message=FALSE, warning=FALSE---------------------------------------------
 library(tidyverse)
 library(moderndive)
 
 
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in text.
 library(knitr)
 library(kableExtra)
@@ -22,17 +32,17 @@ library(patchwork)
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 tactile_prop_red
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(tactile_prop_red, aes(x = prop_red)) +
 ##   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
 ##   labs(x = "Proportion of 50 balls that were red",
 ##        title = "Distribution of 33 proportions red")
 
-## ----samplingdistribution-tactile, echo=FALSE, fig.cap="Distribution of 33 proportions based on 33 samples of size 50."----
+## ----samplingdistribution-tactile, echo=FALSE, fig.cap="Distribution of 33 proportions based on 33 samples of size 50.", fig.height=3.1----
 tactile_histogram <- ggplot(tactile_prop_red, aes(x = prop_red)) +
   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white")
 tactile_histogram + 
@@ -44,69 +54,77 @@ tactile_histogram +
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 bowl
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 fruit_basket <- tibble(
   fruit = c("Mango", "Tangerine", "Apricot", "Pamplemousse", "Lime")
 )
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 fruit_basket %>% 
   rep_sample_n(size = 3)
 
 
-## ---- eval = FALSE-------------------------------------------------------
+## ---- eval = FALSE------------------------------------------------------------
 ## fruit_basket %>%
 ##   rep_sample_n(size = 6)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_shovel <- bowl %>% 
   rep_sample_n(size = 50)
 virtual_shovel
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_shovel %>% 
   mutate(is_red = (color == "red"))
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_shovel %>% 
   mutate(is_red = (color == "red")) %>% 
   summarize(num_red = sum(is_red))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 n_red_virtual_shovel <- virtual_shovel %>% 
   mutate(is_red = (color == "red")) %>% 
   summarize(num_red = sum(is_red)) %>% 
   pull(num_red)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_shovel %>% 
   mutate(is_red = color == "red") %>% 
   summarize(num_red = sum(is_red)) %>% 
   mutate(prop_red = num_red / 50)
 
+## ---- echo=FALSE--------------------------------------------------------------
+virtual_shovel_prop_red <- virtual_shovel %>% 
+  mutate(is_red = color == "red") %>% 
+  summarize(num_red = sum(is_red)) %>% 
+  mutate(prop_red = num_red / 50) %>% 
+  pull(prop_red) 
+virtual_shovel_perc_red <- virtual_shovel_prop_red * 100
+
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_shovel %>% 
   summarize(num_red = sum(color == "red")) %>% 
   mutate(prop_red = num_red / 50)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_samples <- bowl %>% 
   rep_sample_n(size = 50, reps = 33)
 virtual_samples
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_prop_red <- virtual_samples %>% 
   group_by(replicate) %>% 
   summarize(red = sum(color == "red")) %>% 
@@ -114,13 +132,13 @@ virtual_prop_red <- virtual_samples %>%
 virtual_prop_red
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(virtual_prop_red, aes(x = prop_red)) +
 ##   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
 ##   labs(x = "Proportion of 50 balls that were red",
 ##        title = "Distribution of 33 proportions red")
 
-## ----samplingdistribution-virtual, echo=FALSE, fig.cap="Distribution of 33 proportions based on 33 samples of size 50."----
+## ----samplingdistribution-virtual, echo=FALSE, fig.cap="Distribution of 33 proportions based on 33 samples of size 50.", fig.height=3.2----
 virtual_histogram <- ggplot(virtual_prop_red, aes(x = prop_red)) +
   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white")
 virtual_histogram + 
@@ -128,8 +146,8 @@ virtual_histogram +
        title = "Distribution of 33 proportions red")
 
 
-## ----tactile-vs-virtual, echo=FALSE, fig.cap="Comparing 33 virtual and 33 tactile proportions red."----
-bind_rows(
+## ----tactile-vs-virtual, echo=FALSE, fig.cap="Comparing 33 virtual and 33 tactile proportions red.", fig.height=2.9----
+facet_compare <- bind_rows(
   virtual_prop_red %>% 
     mutate(type = "Virtual sampling"), 
   tactile_prop_red %>% 
@@ -139,22 +157,33 @@ bind_rows(
   mutate(type = factor(type, levels = c("Virtual sampling", "Tactile sampling"))) %>% 
   ggplot(aes(x = prop_red)) +
   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
-  facet_wrap(~type) +
+  facet_wrap(~ type) +
   labs(x = "Proportion of 50 balls that were red", 
-         title = "Comparing distributions")
+         title = "Comparing distributions") 
+
+if(knitr::is_latex_output()){
+  facet_compare +
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  facet_compare
+}
 
 
 
 
 
 
-## ------------------------------------------------------------------------
+
+## -----------------------------------------------------------------------------
 virtual_samples <- bowl %>% 
   rep_sample_n(size = 50, reps = 1000)
 virtual_samples
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_prop_red <- virtual_samples %>% 
   group_by(replicate) %>% 
   summarize(red = sum(color == "red")) %>% 
@@ -162,13 +191,13 @@ virtual_prop_red <- virtual_samples %>%
 virtual_prop_red
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(virtual_prop_red, aes(x = prop_red)) +
 ##   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
 ##   labs(x = "Proportion of 50 balls that were red",
 ##        title = "Distribution of 1000 proportions red")
 
-## ----samplingdistribution-virtual-1000, echo=FALSE, fig.cap="Distribution of 1000 proportions based on 33 samples of size 50."----
+## ----samplingdistribution-virtual-1000, echo=FALSE, fig.cap="Distribution of 1000 proportions based on 1000 samples of size 50."----
 virtual_prop_red <- virtual_samples %>% 
   group_by(replicate) %>% 
   summarize(red = sum(color == "red")) %>% 
@@ -186,7 +215,7 @@ virtual_histogram +
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Segment 1: sample size = 25 ------------------------------
 ## # 1.a) Virtually use shovel 1000 times
 ## virtual_samples_25 <- bowl %>%
@@ -238,7 +267,7 @@ virtual_histogram +
 ##   labs(x = "Proportion of 100 balls that were red", title = "100")
 
 
-## ----comparing-sampling-distributions, echo=FALSE, fig.cap="Comparing the distributions of proportion red for different sample sizes."----
+## ----comparing-sampling-distributions, echo=FALSE, fig.height=3, fig.cap="Comparing the distributions of proportion red for different sample sizes."----
 # n = 25
 if(!file.exists("rds/virtual_samples_25.rds")){
   virtual_samples_25 <- bowl %>% 
@@ -281,16 +310,29 @@ virtual_prop_red_100 <- virtual_samples_100 %>%
   mutate(prop_red = red / 100) %>% 
   mutate(n = 100)
 
-virtual_prop <- bind_rows(virtual_prop_red_25, virtual_prop_red_50, virtual_prop_red_100)
+virtual_prop <- bind_rows(virtual_prop_red_25, 
+                          virtual_prop_red_50, 
+                          virtual_prop_red_100)
 
 comparing_sampling_distributions <- ggplot(virtual_prop, aes(x = prop_red)) +
   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
-  labs(x = "Proportion of shovel's balls that are red", title = "Comparing distributions of proportions red for 3 different shovels.") +
-  facet_wrap(~n)
-comparing_sampling_distributions
+  labs(x = "Proportion of shovel's balls that are red", 
+       title = "Comparing distributions of proportions red for three different shovel sizes.") +
+  facet_wrap(~ n) 
+
+if(knitr::is_latex_output()){
+  comparing_sampling_distributions +
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  comparing_sampling_distributions
+}
+
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # n = 25
 ## virtual_prop_red_25 %>%
 ##   summarize(sd = sd(prop_red))
@@ -304,7 +346,7 @@ comparing_sampling_distributions
 ##   summarize(sd = sd(prop_red))
 
 
-## ----comparing-n, eval=TRUE, echo=FALSE----------------------------------
+## ----comparing-n, eval=TRUE, echo=FALSE---------------------------------------
 comparing_n_table <- virtual_prop %>% 
   group_by(n) %>% 
   summarize(sd = sd(prop_red)) %>% 
@@ -313,9 +355,10 @@ comparing_n_table <- virtual_prop %>%
 comparing_n_table  %>% 
   kable(
     digits = 3,
-      caption = "Comparing standard deviations of proportions red for 3 different shovels.", 
-      booktabs = TRUE
-) %>% 
+    caption = "Comparing standard deviations of proportions red for three different shovels", 
+    booktabs = TRUE,
+    linesep = ""
+  ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
@@ -328,11 +371,11 @@ comparing_n_table  %>%
 
 
 
-## ----comparing-sampling-distributions-1b, echo=FALSE, fig.cap="Previously seen three sampling distributions of the sample proportion $\\widehat{p}$."----
+## ----comparing-sampling-distributions-1b, echo=FALSE, fig.cap="Three previously seen distributions of the sample proportion $\\widehat{p}$.", fig.height=3.1----
 comparing_sampling_distributions
 
 
-## ----comparing-n-repeat, eval=TRUE, echo=FALSE---------------------------
+## ----comparing-n-repeat, eval=TRUE, echo=FALSE--------------------------------
 comparing_n_table <- virtual_prop %>% 
   group_by(n) %>% 
   summarize(sd = sd(prop_red)) %>% 
@@ -341,15 +384,16 @@ comparing_n_table <- virtual_prop %>%
 comparing_n_table  %>% 
   kable(
     digits = 3,
-      caption = "Previously seen comparing standard deviations of proportions red for 3 different shovels.", 
-      booktabs = TRUE
+    caption = "Previously seen comparison of standard deviations of proportions red for three different shovels", 
+    booktabs = TRUE,
+    linesep = ""
 ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
 
 ## ----comparing-sampling-distributions-2, echo=FALSE, fig.cap="Three sampling distributions of the sample proportion $\\widehat{p}$."----
-virtual_prop %>% 
+p_hat_compare <- virtual_prop %>% 
   mutate(
     n = str_c("n = ", n),
     n = factor(n, levels = c("n = 25", "n = 50", "n = 100"))
@@ -357,11 +401,22 @@ virtual_prop %>%
   ggplot( aes(x = prop_red)) +
   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
   labs(x = expression(paste("Sample proportion ", hat(p))), 
-       title = expression(paste("Sampling distributions of the sample proportion ", hat(p), " based on n = 25, 50, 100.")) ) +
-  facet_wrap(~n)
+       title = expression(paste("Sampling distributions of ", hat(p), " based on n = 25, 50, 100.")) ) +
+  facet_wrap(~ n)
+
+if(knitr::is_latex_output()){
+  p_hat_compare  +
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  p_hat_compare
+}
+
 
 
-## ----comparing-n-2, eval=TRUE, echo=FALSE--------------------------------
+## ----comparing-n-2, eval=TRUE, echo=FALSE-------------------------------------
 comparing_n_table <- virtual_prop %>% 
   group_by(n) %>% 
   summarize(sd = sd(prop_red)) %>% 
@@ -374,9 +429,10 @@ comparing_n_table <- virtual_prop %>%
 comparing_n_table  %>% 
   kable(
     digits = 3,
-    caption = "Three standard errors of the sample proportion based on n = 25, 50, 100.", 
-    booktabs = TRUE#,
-#    escape = TRUE
+    caption = "Standard errors of the sample proportion based on sample sizes of 25, 50, and 100", 
+    booktabs = TRUE,
+    escape = FALSE,
+    linesep = ""
 ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -386,50 +442,64 @@ comparing_n_table  %>%
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 bowl %>% 
   summarize(sum_red = sum(color == "red"), 
             sum_not_red = sum(color != "red"))
 
 
-## ----comparing-sampling-distributions-3, echo=FALSE, fig.cap="Three sampling distributions with population proportion $p$ marked in red."----
+## ----comparing-sampling-distributions-3, echo=FALSE, fig.cap="Three sampling distributions with population proportion $p$ marked by a vertical line."----
 p <- bowl %>% 
-  summarize(p = mean(color == "red")) %>% 
-  pull(p)
-virtual_prop %>% 
+  summarize(mean(color == "red")) %>% 
+  pull()
+samp_distn_compare <- virtual_prop %>% 
   mutate(
     n = str_c("n = ", n),
     n = factor(n, levels = c("n = 25", "n = 50", "n = 100"))
     ) %>% 
-  ggplot( aes(x = prop_red)) +
-  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
+  ggplot(aes(x = prop_red)) +
+  geom_histogram(binwidth = 0.05, boundary = 0.4, 
+                 color = "black", fill = "white") +
   labs(x = expression(paste("Sample proportion ", hat(p))), 
-       title = expression(paste("Sampling distributions of ", hat(p), " based on n = 25, 50, 100.")) ) +
-  facet_wrap(~n) +
+       title = expression(paste("Sampling distributions of ", hat(p), 
+                                " based on n = 25, 50, 100.")) ) +
+  facet_wrap(~ n) +
   geom_vline(xintercept = p, col = "red", size = 1)
 
+if(knitr::is_latex_output()){
+  samp_distn_compare  +
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  samp_distn_compare
+}
+
 
 
 
 
 
-## ----comparing-n-3, eval=TRUE, echo=FALSE--------------------------------
+## ----comparing-n-3, eval=TRUE, echo=FALSE-------------------------------------
 set.seed(76)
 comparing_n_table <- virtual_prop %>% 
   group_by(n) %>% 
   summarize(sd = sd(prop_red)) %>% 
   mutate(
     n = str_c("n = ")
-    ) %>% 
-  rename(`Sample size` = n, `Standard error of p-hat` = sd) %>% 
- sample_frac(1)
-  
+  ) %>% 
+  rename(`Sample size` = n, `Standard error of $\\widehat{p}$` = sd) %>% 
+  sample_frac(1)
+
 comparing_n_table  %>% 
   kable(
     digits = 3,
-      caption = "Three standard errors of the sample proportion based on n = 25, 50, 100. ", 
-      booktabs = TRUE
-) %>% 
+    caption = "Standard errors of $\\widehat{p}$ based on n = 25, 50, 100", 
+    booktabs = TRUE,
+    escape = FALSE,
+    linesep = ""
+  ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
 
@@ -440,13 +510,14 @@ comparing_n_table  %>%
 
 
 
-## ----table-ch8, echo=FALSE, message=FALSE--------------------------------
+## ----table-ch8, echo=FALSE, message=FALSE-------------------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
+    read_csv(na = "") %>% 
+    slice(1:5)
     write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
@@ -456,13 +527,22 @@ sampling_scenarios %>%
   kable(
     caption = "\\label{tab:summarytable-ch8}Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
   column_spec(1, width = "0.5in") %>% 
-  column_spec(2, width = "0.7in") %>%
-  column_spec(3, width = "1in") %>%
-  column_spec(4, width = "1.1in") %>% 
-  column_spec(5, width = "1in")
+  column_spec(2, width = "1.2in") %>%
+  column_spec(3, width = "0.8in") %>%
+  column_spec(4, width = "1.5in") %>% 
+  column_spec(5, width = "0.6in")
+
+
+
+
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
 
diff --git a/docs/scripts/08-confidence-intervals.R b/docs/scripts/08-confidence-intervals.R
index 5ca4a39cd..166e8c875 100644
--- a/docs/scripts/08-confidence-intervals.R
+++ b/docs/scripts/08-confidence-intervals.R
@@ -1,10 +1,10 @@
-## ----message=FALSE, warning=FALSE----------------------------------------
+## ----message=FALSE, warning=FALSE---------------------------------------------
 library(tidyverse)
 library(moderndive)
 library(infer)
 
 
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in the text
 library(knitr)
 library(kableExtra)
@@ -16,7 +16,7 @@ library(purrr)
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 pennies_sample
 
 
@@ -25,23 +25,24 @@ ggplot(pennies_sample, aes(x = year)) +
   geom_histogram(binwidth = 10, color = "white")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 pennies_sample %>% 
   summarize(mean_year = mean(year))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 x_bar <- pennies_sample %>% 
   summarize(mean_year = mean(year))
 
 
-## ----table-ch8-b, echo=FALSE, message=FALSE------------------------------
+## ----table-ch8-b, echo=FALSE, message=FALSE-----------------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
-    write_rds(table_ch3, "rds/sampling_scenarios.rds")
+    read_csv(na = "") %>% 
+    slice(1:5)
+  write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
 }
@@ -52,7 +53,8 @@ sampling_scenarios %>%
   kable(
     caption = "Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
@@ -71,7 +73,7 @@ sampling_scenarios %>%
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 pennies_resample <- tibble(
   year = c(1976, 1962, 1976, 1983, 2017, 2015, 2015, 1962, 2016, 1976, 
            2006, 1997, 1988, 2015, 2015, 1988, 2016, 1978, 1979, 1997, 
@@ -83,7 +85,7 @@ pennies_resample <- tibble(
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(pennies_resample, aes(x = year)) +
 ##   geom_histogram(binwidth = 10, color = "white") +
 ##   labs(title = "Resample of 50 pennies")
@@ -94,22 +96,22 @@ pennies_resample <- tibble(
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 pennies_resample %>% 
   summarize(mean_year = mean(year))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 resample_mean <- pennies_resample %>% 
   summarize(mean_year = mean(year))
 
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 pennies_resamples
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 resampled_means <- pennies_resamples %>% 
   group_by(name) %>% 
   summarize(mean_year = mean(year))
@@ -120,27 +122,27 @@ resampled_means
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resample <- pennies_sample %>% 
   rep_sample_n(size = 50, replace = TRUE)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resample
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resample %>% 
   summarize(resample_mean = mean(year))
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resamples <- pennies_sample %>% 
   rep_sample_n(size = 50, replace = TRUE, reps = 35)
 virtual_resamples
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resampled_means <- virtual_resamples %>% 
   group_by(replicate) %>% 
   summarize(mean_year = mean(year))
@@ -155,7 +157,7 @@ virtual_resampled_means
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 # Repeat resampling 1000 times
 virtual_resamples <- pennies_sample %>% 
   rep_sample_n(size = 50, replace = TRUE, reps = 1000)
@@ -166,7 +168,7 @@ virtual_resampled_means <- virtual_resamples %>%
   summarize(mean_year = mean(year))
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resampled_means <- pennies_sample %>% 
   rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>% 
   group_by(replicate) %>% 
@@ -180,11 +182,11 @@ ggplot(virtual_resampled_means, aes(x = mean_year)) +
   labs(x = "sample mean")
 
 
-## ----eval=TRUE-----------------------------------------------------------
+## ----eval=TRUE----------------------------------------------------------------
 virtual_resampled_means %>% 
   summarize(mean_of_means = mean(mean_year))
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 mean_of_means <- virtual_resampled_means %>% 
   summarize(mean(mean_year)) %>% 
   pull() %>% 
@@ -197,7 +199,7 @@ mean_of_means <- virtual_resampled_means %>%
 
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 # Can also use conf_int() and get_confidence_interval() instead of get_ci(),
 # as they are aliases that work the exact same way.
 percentile_ci <- virtual_resampled_means %>% 
@@ -205,7 +207,7 @@ percentile_ci <- virtual_resampled_means %>%
   get_ci(level = 0.95, type = "percentile")
 
 
-## ----percentile-method, echo=FALSE, message=FALSE, fig.cap="Percentile method 95 percent confidence interval. Interval marked by vertical lines."----
+## ----percentile-method, echo=FALSE, message=FALSE, fig.cap='(ref:perc-method)', fig.height=3.4----
 ggplot(virtual_resampled_means, aes(x = mean_year)) +
   geom_histogram(binwidth = 1, color = "white", boundary = 1988) +
   labs(x = "Resample sample mean") +
@@ -214,7 +216,7 @@ ggplot(virtual_resampled_means, aes(x = mean_year)) +
   geom_vline(xintercept = percentile_ci[[1, 2]], size = 1)
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 # Can also use get_confidence_interval() instead of get_ci(),
 # as it is an alias that works the exact same way.
 standard_error_ci <- virtual_resampled_means %>% 
@@ -227,12 +229,12 @@ bootstrap_se <- virtual_resampled_means %>%
   pull(se)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 virtual_resampled_means %>% 
   summarize(SE = sd(mean_year))
 
 
-## ----percentile-and-se-method, echo=FALSE, message=FALSE, fig.cap="Comparing two 95 percent confidence interval methods."----
+## ----percentile-and-se-method, echo=FALSE, message=FALSE, fig.cap='(ref:both-methods)', fig.height=5.2----
 both_CI <- bind_rows(
   percentile_ci %>% gather(endpoint, value) %>% mutate(type = "percentile"),
   standard_error_ci %>% gather(endpoint, value) %>% mutate(type = "SE")
@@ -247,7 +249,7 @@ ggplot(virtual_resampled_means, aes(x = mean_year)) +
   geom_vline(xintercept = standard_error_ci[[1, 2]], linetype = "dashed", size = 1)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## standard_error_ci <- bootstrap_distribution %>%
 ##   get_ci(type = "se", point_estimate = x_bar)
 ## standard_error_ci
@@ -257,30 +259,30 @@ ggplot(virtual_resampled_means, aes(x = mean_year)) +
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## pennies_sample %>%
 ##   rep_sample_n(size = 50, replace = TRUE, reps = 1000)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## pennies_sample %>%
 ##   rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
 ##   group_by(replicate)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## pennies_sample %>%
 ##   rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
 ##   group_by(replicate) %>%
 ##   summarize(mean_year = mean(year))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## pennies_sample %>%
 ##   summarize(stat = mean(year))
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## pennies_sample %>%
 ##   specify(response = year) %>%
 ##   calculate(stat = "mean")
@@ -288,25 +290,25 @@ ggplot(virtual_resampled_means, aes(x = mean_year)) +
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 pennies_sample %>% 
   specify(response = year)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## pennies_sample %>%
 ##   specify(formula = year ~ NULL)
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## pennies_sample %>%
 ##   specify(response = year) %>%
 ##   generate(reps = 1000, type = "bootstrap")
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/pennies_sample_generate.rds")){
   pennies_sample_generate <- pennies_sample %>% 
     specify(response = year) %>% 
@@ -318,7 +320,7 @@ if(!file.exists("rds/pennies_sample_generate.rds")){
 pennies_sample_generate
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## # infer workflow:                   # Original workflow:
 ## pennies_sample %>%                  pennies_sample %>%
 ##   specify(response = year) %>%        rep_sample_n(size = 50, replace = TRUE,
@@ -328,7 +330,7 @@ pennies_sample_generate
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## bootstrap_distribution <- pennies_sample %>%
 ##   specify(response = year) %>%
 ##   generate(reps = 1000) %>%
@@ -336,7 +338,7 @@ pennies_sample_generate
 ## bootstrap_distribution
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/bootstrap_distribution_pennies.rds")){
   bootstrap_distribution <- pennies_sample %>% 
     specify(response = year) %>% 
@@ -349,24 +351,24 @@ if(!file.exists("rds/bootstrap_distribution_pennies.rds")){
 bootstrap_distribution
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## # infer workflow:                   # Original workflow:
 ## pennies_sample %>%                  pennies_sample %>%
 ##   specify(response = year) %>%        rep_sample_n(size = 50, replace = TRUE,
 ##   generate(reps = 1000) %>%                        reps = 1000) %>%
 ##   calculate(stat = "mean")            group_by(replicate) %>%
-##                                       summarize(mean_year = mean(year))
+##                                       summarize(stat = mean(year))
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution)
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## # infer workflow:                    # Original workflow:
 ## visualize(bootstrap_distribution)    ggplot(bootstrap_distribution,
 ##                                             aes(x = stat)) +
@@ -375,31 +377,32 @@ bootstrap_distribution
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 percentile_ci <- bootstrap_distribution %>% 
   get_confidence_interval(level = 0.95, type = "percentile")
 percentile_ci
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution) +
 ##   shade_confidence_interval(endpoints = percentile_ci)
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution) +
 ##   shade_ci(endpoints = percentile_ci, color = "hotpink", fill = "khaki")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
+x_bar
 standard_error_ci <- bootstrap_distribution %>% 
-  get_confidence_interval(type = "se", point_estimate = 1995.44)
+  get_confidence_interval(type = "se", point_estimate = x_bar)
 standard_error_ci
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution) +
 ##   shade_confidence_interval(endpoints = standard_error_ci)
 
@@ -409,36 +412,37 @@ standard_error_ci
 
 
 
-## ------------------------------------------------------------------------
+
+## -----------------------------------------------------------------------------
 bowl %>% 
   summarize(p_red = mean(color == "red"))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 p_red <- bowl %>% 
   summarize(prop_red = mean(color == "red")) %>% 
   pull(prop_red)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 bowl_sample_1
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## bowl_sample_1 %>%
 ##   specify(response = color)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 bowl_sample_1 %>% 
   specify(response = color, success = "red")
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## bowl_sample_1 %>%
 ##   specify(response = color, success = "red") %>%
 ##   generate(reps = 1000, type = "bootstrap")
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/bowl_sample_1_generate.rds")){
    bowl_sample_1_generate <- bowl_sample_1 %>% 
     specify(response = color, success = "red") %>% 
@@ -451,7 +455,7 @@ if(!file.exists("rds/bowl_sample_1_generate.rds")){
 bowl_sample_1_generate
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## sample_1_bootstrap <- bowl_sample_1 %>%
 ##   specify(response = color, success = "red") %>%
 ##   generate(reps = 1000, type = "bootstrap") %>%
@@ -459,7 +463,7 @@ bowl_sample_1_generate
 ## sample_1_bootstrap
 
 
-## ----calculate_prop, echo=FALSE------------------------------------------
+## ----calculate_prop, echo=FALSE-----------------------------------------------
 # Note this takes a few minutes to run
 if(!file.exists("rds/sample_1_bootstrap.rds")){
   sample_1_bootstrap <- bowl_sample_1_generate %>% 
@@ -471,13 +475,13 @@ if(!file.exists("rds/sample_1_bootstrap.rds")){
 sample_1_bootstrap
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 percentile_ci_1 <- sample_1_bootstrap %>% 
   get_confidence_interval(level = 0.95, type = "percentile")
 percentile_ci_1
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## sample_1_bootstrap %>%
 ##   visualize(bins = 15) +
 ##   shade_confidence_interval(endpoints = percentile_ci_1) +
@@ -485,25 +489,29 @@ percentile_ci_1
 
 
 
-## ------------------------------------------------------------------------
-bowl_sample_2 <- bowl %>% 
-  rep_sample_n(size = 50)
+
+## -----------------------------------------------------------------------------
+bowl_sample_2 <- bowl %>% rep_sample_n(size = 50)
 bowl_sample_2
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## sample_2_bootstrap <- bowl_sample_2 %>%
-##   specify(response = color, success = "red") %>%
-##   generate(reps = 1000, type = "bootstrap") %>%
+##   specify(response = color,
+##           success = "red") %>%
+##   generate(reps = 1000,
+##            type = "bootstrap") %>%
 ##   calculate(stat = "prop")
 ## sample_2_bootstrap
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/sample_2_bootstrap.rds")){
   sample_2_bootstrap <- bowl_sample_2 %>% 
-    specify(response = color, success = "red") %>% 
-    generate(reps = 1000, type = "bootstrap") %>% 
+    specify(response = color, 
+            success = "red") %>% 
+    generate(reps = 1000, 
+             type = "bootstrap") %>% 
     calculate(stat = "prop")
   write_rds(sample_2_bootstrap, "rds/sample_2_bootstrap.rds")
 } else {
@@ -512,13 +520,13 @@ if(!file.exists("rds/sample_2_bootstrap.rds")){
 sample_2_bootstrap
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 percentile_ci_2 <- sample_2_bootstrap %>% 
   get_confidence_interval(level = 0.95, type = "percentile")
 percentile_ci_2
 
 
-## ----reliable-percentile, fig.cap="100 percentile-based 95 percent confidence intervals for $p$.",echo=FALSE----
+## ----reliable-percentile, fig.cap='(ref:reliable-perc)', echo=FALSE, fig.height=4.2----
 if(!file.exists("rds/balls_percentile_cis.rds")){
   set.seed(4)
 
@@ -563,16 +571,18 @@ ggplot(percentile_cis) +
   # Removed point estimates since it doesn't necessarily act as center for 
   # percentile-based CI's
   # geom_point(aes(x = sample_prop, y = replicate, color = captured)) +
-  labs(x = expression("Proportion of red balls"), y = "Confidence interval number", 
+  labs(x = expression("Proportion of red balls"), 
+       y = "Confidence interval number", 
        alpha = "Captured") +
   geom_vline(xintercept = p_red, color = "red") + 
   coord_cartesian(xlim = c(0.1, 0.7)) + 
   theme_light() + 
-  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(),
+  theme(panel.grid.major.y = element_blank(), 
+        panel.grid.minor.y = element_blank(),
         panel.grid.minor.x = element_blank())
 
 
-## ----reliable-se, fig.cap="100 SE-based 80 percent confidence intervals for $p$ with point estimate center marked with dots.",echo=FALSE----
+## ----reliable-se, fig.cap='(ref:rel-se)', echo=FALSE, fig.height=6.6----------
 if(!file.exists("rds/balls_se_cis.rds")){
   # Set random number generator seed value.
   set.seed(9)
@@ -631,7 +641,7 @@ ggplot(se_cis) +
         panel.grid.minor.x = element_blank())
 
 
-## ----perc-sizes, echo=FALSE----------------------------------------------
+## ----perc-sizes, echo=FALSE---------------------------------------------------
 if(!file.exists("rds/balls_perc_cis_80_95_99.rds")){
   set.seed(9)
   
@@ -696,36 +706,47 @@ if(!file.exists("rds/balls_perc_cis_80_95_99.rds")){
 }
 
 
-## ----perc-cis-level-print, eval=FALSE, echo=FALSE------------------------
+## ----perc-cis-level-print, eval=FALSE, echo=FALSE-----------------------------
 ## percentile_cis_by_level %>%
 ##   sample_n(10) %>%
 ##   kable(
 ##     digits = 3,
 ##     caption = "10 randomly sampled confidence intervals for p for varying confidence levels",
-##     booktabs = TRUE,
+##     booktabs = TRUE,
+##     linesep = "",
 ##     longtable = TRUE
 ##   ) %>%
 ##   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
 ##                 latex_options = c("hold_position", "repeat_header"))
 
 
-## ----reliable-percentile-80-95-99, fig.cap="Ten 80, 95, and 99 percent confidence intervals for $p$ based on $n = 50$.", echo=FALSE----
+## ----reliable-percentile-80-95-99, fig.cap='(ref:many-percs)', echo=FALSE, fig.height=3----
 sample_of_cis <- percentile_cis_by_level %>% 
   group_by(confidence_level) %>% 
   mutate(sample_row = 1:10)
 
-ggplot(sample_of_cis) +
+perc_interval_plot <- ggplot(sample_of_cis) +
   # Doesn't make sense to show point_estimate center for percentile confidence 
   # intervals:
   # geom_point(aes(x = point_estimate, y = sample_row)) +
   geom_segment(aes(y = sample_row, yend = sample_row, x = lower, xend = upper)) +
   labs(x = expression("Proportion of red balls"), y = "") +
   scale_y_continuous(breaks = 1:10) +
-  facet_wrap(~confidence_level) + 
+  facet_wrap(~ confidence_level) + 
   geom_vline(xintercept = p_red, color = "red")
 
+if(knitr::is_latex_output()){
+  perc_interval_plot  +
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  perc_interval_plot
+}
+
 
-## ----perc-cis-average-width, echo=FALSE----------------------------------
+## ----perc-cis-average-width, echo=FALSE---------------------------------------
 percentile_cis_by_level %>% 
   mutate(width = upper - lower) %>% 
   group_by(confidence_level) %>% 
@@ -733,15 +754,16 @@ percentile_cis_by_level %>%
   rename(`Confidence level` = confidence_level) %>% 
   kable(
     digits = 3,
-    caption = "Average width of 80, 95, and 99 percent confidence intervals.", 
+    caption = "Average width of 80, 95, and 99\\% confidence intervals", 
     booktabs = TRUE,
-    longtable = TRUE
+    longtable = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position", "repeat_header"))
 
 
-## ----perc-sizes-2, echo=FALSE--------------------------------------------
+## ----perc-sizes-2, echo=FALSE-------------------------------------------------
 if(!file.exists("rds/balls_perc_cis_n_25_50_100.rds")){
   set.seed(9)
   
@@ -807,12 +829,12 @@ if(!file.exists("rds/balls_perc_cis_n_25_50_100.rds")){
 }
 
 
-## ----reliable-percentile-n-25-50-100, fig.cap="Ten 95 percent confidence intervals for $p$ based on n = 25, 50, and 100.", echo=FALSE----
+## ----reliable-percentile-n-25-50-100, fig.cap='(ref:rel-perc-n)', echo=FALSE, fig.height=2.5----
 sample_of_cis <- percentile_cis_by_n %>% 
   group_by(sample_size) %>% 
   mutate(sample_row = 1:10)
 
-ggplot(sample_of_cis) +
+cis_plot <- ggplot(sample_of_cis) +
   # Doesn't make sense to show point_estimate center for percentile confidence 
   # intervals:
   # geom_point(aes(x = point_estimate, y = sample_row)) +
@@ -822,8 +844,18 @@ ggplot(sample_of_cis) +
   facet_wrap(~sample_size) + 
   geom_vline(xintercept = p_red, color = "red")
 
+if(knitr::is_latex_output()){
+  cis_plot  +
+  theme(
+    strip.text = element_text(colour = 'black'),
+    strip.background = element_rect(fill = "grey93")
+  )
+} else {
+  cis_plot
+}
+
 
-## ----perc-cis-average-width-2, echo=FALSE--------------------------------
+## ----perc-cis-average-width-2, echo=FALSE-------------------------------------
 percentile_cis_by_n %>% 
   mutate(width = upper - lower) %>% 
   group_by(sample_size) %>% 
@@ -831,32 +863,35 @@ percentile_cis_by_n %>%
   rename(`Sample size` = sample_size) %>% 
   kable(
     digits = 3,
-    caption = "Average width of 95 percent confidence intervals based on n = 25, 50, and 100.", 
+    caption = "Average width of 95\\% confidence intervals based on $n = 25$, $50$, and $100$", 
     booktabs = TRUE,
-    longtable = TRUE
+    longtable = TRUE,
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position", "repeat_header"))
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 mythbusters_yawn
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 mythbusters_yawn %>% 
   group_by(group, yawn) %>% 
   summarize(count = n())
 
 
-## ----table-ch8-c, echo=FALSE, message=FALSE------------------------------
+## ----table-ch8-c, echo=FALSE, message=FALSE-----------------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
-  write_rds(table_ch3, "rds/sampling_scenarios.rds")
+    read_csv(na = "") %>% 
+    slice(1:5)
+  write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
 }
@@ -867,37 +902,39 @@ sampling_scenarios %>%
   kable(
     caption = "Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
   column_spec(1, width = "0.5in") %>% 
-  column_spec(2, width = "0.7in") %>%
-  column_spec(3, width = "1in") %>%
-  column_spec(4, width = "1.1in") %>% 
-  column_spec(5, width = "1in")
+  column_spec(2, width = "1.5in") %>%
+  column_spec(3, width = "0.65in") %>%
+  column_spec(4, width = "1.6in") %>% 
+  column_spec(5, width = "0.65in")
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## mythbusters_yawn %>%
 ##   specify(formula = yawn ~ group)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 mythbusters_yawn %>% 
   specify(formula = yawn ~ group, success = "yes")
 
 
-## ------------------------------------------------------------------------
-head(mythbusters_yawn)
+## -----------------------------------------------------------------------------
+first_six_rows <- head(mythbusters_yawn)
+first_six_rows
 
 
-## ------------------------------------------------------------------------
-head(mythbusters_yawn) %>% 
+## -----------------------------------------------------------------------------
+first_six_rows %>% 
   sample_n(size = 6, replace = TRUE)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## mythbusters_yawn %>%
 ##   specify(formula = yawn ~ group, success = "yes") %>%
 ##   generate(reps = 1000, type = "bootstrap")
@@ -905,14 +942,14 @@ head(mythbusters_yawn) %>%
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## mythbusters_yawn %>%
 ##   specify(formula = yawn ~ group, success = "yes") %>%
 ##   generate(reps = 1000, type = "bootstrap") %>%
 ##   calculate(stat = "diff in props")
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## bootstrap_distribution_yawning <- mythbusters_yawn %>%
 ##   specify(formula = yawn ~ group, success = "yes") %>%
 ##   generate(reps = 1000, type = "bootstrap") %>%
@@ -920,7 +957,7 @@ head(mythbusters_yawn) %>%
 ## bootstrap_distribution_yawning
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/bootstrap_distribution_yawning.rds")){
   bootstrap_distribution_yawning <- mythbusters_yawn %>% 
     specify(formula = yawn ~ group, success = "yes") %>% 
@@ -935,67 +972,78 @@ if(!file.exists("rds/bootstrap_distribution_yawning.rds")){
 bootstrap_distribution_yawning
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution_yawning) +
 ##   geom_vline(xintercept = 0)
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 bootstrap_distribution_yawning %>% 
   get_confidence_interval(type = "percentile", level = 0.95)
 
-## ----include=FALSE-------------------------------------------------------
+## ----include=FALSE------------------------------------------------------------
 myth_ci_percentile <- bootstrap_distribution_yawning %>% 
   get_confidence_interval(type = "percentile", level = 0.95)
 
 
-## ------------------------------------------------------------------------
-mythbusters_yawn %>% 
+## -----------------------------------------------------------------------------
+obs_diff_in_props <- mythbusters_yawn %>% 
   specify(formula = yawn ~ group, success = "yes") %>% 
   # generate(reps = 1000, type = "bootstrap") %>% 
   calculate(stat = "diff in props", order = c("seed", "control"))
+obs_diff_in_props
 
 
-## ------------------------------------------------------------------------
-bootstrap_distribution_yawning %>% 
-  get_confidence_interval(type = "se", point_estimate = 0.0441176)
-
-## ----include=FALSE-------------------------------------------------------
+## -----------------------------------------------------------------------------
 myth_ci_se <- bootstrap_distribution_yawning %>% 
-  get_confidence_interval(type = "se", point_estimate = 0.0441176)
+  get_confidence_interval(type = "se", point_estimate = obs_diff_in_props)
+myth_ci_se
 
 
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 set.seed(76)
 
 
+## ----sampling-distribution-part-deux, fig.show='hold', fig.cap="Previously seen sampling distribution of sample proportion red for $n = 50$.", echo=TRUE, fig.height=2----
+# Take 1000 virtual samples of size 50 from the bowl:
+virtual_samples <- bowl %>% 
+  rep_sample_n(size = 50, reps = 1000)
+# Compute the sampling distribution of 1000 values of p-hat
+sampling_distribution <- virtual_samples %>% 
+  group_by(replicate) %>% 
+  summarize(red = sum(color == "red")) %>% 
+  mutate(prop_red = red / 50)
+# Visualize sampling distribution of p-hat
+ggplot(sampling_distribution, aes(x = prop_red)) +
+  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
+  labs(x = "Proportion of 50 balls that were red", 
+       title = "Sampling distribution")
+
 
-## ------------------------------------------------------------------------
-sampling_distribution %>% 
-  summarize(se = sd(prop_red))
+## -----------------------------------------------------------------------------
+sampling_distribution %>% summarize(se = sd(prop_red))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 se_samp <- sampling_distribution %>% 
   summarize(se = sd(prop_red)) %>% 
   pull(se)
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 set.seed(76)
 
 
-## ----eval=FALSE----------------------------------------------------------
-## # Compute the bootstrap distribution using infer workflow:
+## ----eval=FALSE---------------------------------------------------------------
 ## bootstrap_distribution <- bowl_sample_1 %>%
 ##   specify(response = color, success = "red") %>%
 ##   generate(reps = 1000, type = "bootstrap") %>%
 ##   calculate(stat = "prop")
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/bootstrap_distribution_balls.rds")){
   bootstrap_distribution <- bowl_sample_1 %>% 
     specify(response = color, success = "red") %>% 
@@ -1010,17 +1058,16 @@ if(!file.exists("rds/bootstrap_distribution_balls.rds")){
 
 
 
-## ------------------------------------------------------------------------
-bootstrap_distribution %>% 
-  summarize(se = sd(stat))
+## -----------------------------------------------------------------------------
+bootstrap_distribution %>% summarize(se = sd(stat))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 se_boot <- bootstrap_distribution %>% 
   summarize(se = sd(stat)) %>% 
   pull(se)
 
 
-## ----side-by-side, fig.height=7.5, fig.cap="Comparing the sampling and bootstrap distributions of $\\widehat{p}$", echo=FALSE----
+## ----side-by-side, fig.height=4.5, fig.cap="Comparing the sampling and bootstrap distributions of $\\widehat{p}$.", echo=FALSE----
 p_samp <- ggplot(sampling_distribution, aes(x = prop_red)) +
   geom_histogram(binwidth = 0.05, boundary = 0.4, fill = "salmon", 
                  color = "white") +
@@ -1035,7 +1082,7 @@ p_boot <- ggplot(bootstrap_distribution, aes(x = stat)) +
                  color = "white") + 
   labs(x = "Proportion of 50 balls that were red", 
        title = 
-         "Bootstrap distribution: similar shape & spread but different center"
+         "Bootstrap distribution: similar shape and spread but different center"
        ) +
   geom_vline(xintercept = 0.42, size = 1, linetype = "dashed") +
   scale_x_continuous(limits = c(0.15, 0.65), 
@@ -1045,7 +1092,7 @@ p_boot <- ggplot(bootstrap_distribution, aes(x = stat)) +
 p_samp + p_boot + plot_layout(ncol = 1, heights = c(1, 1))
 
 
-## ----comparing-se, echo=FALSE, message=FALSE-----------------------------
+## ----comparing-se, echo=FALSE, message=FALSE----------------------------------
 tibble(
   `Distribution type` = c("Sampling distribution", "Bootstrap distribution"),
   `Standard error` = c(se_samp, se_boot)
@@ -1054,13 +1101,14 @@ tibble(
     caption = "Comparing standard errors", 
     digits = 3, 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position", "repeat_header"))
 
 
-## ----comparing-se-2, echo=FALSE, message=FALSE---------------------------
+## ----comparing-se-2, echo=FALSE, message=FALSE--------------------------------
 tibble(
   `Distribution type` = c("Sampling distribution", "Bootstrap distribution", 
                           "Formula approximation"),
@@ -1070,13 +1118,14 @@ tibble(
     caption = "Comparing standard errors", 
     digits = 3, 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position", "repeat_header"))
 
 
-## ---- message=FALSE, warning=FALSE---------------------------------------
+## ---- message=FALSE, warning=FALSE--------------------------------------------
 conf_ints <- tactile_prop_red %>% 
   rename(p_hat = prop_red) %>% 
   mutate(
@@ -1086,10 +1135,14 @@ conf_ints <- tactile_prop_red %>%
     lower_ci = p_hat - MoE,
     upper_ci = p_hat + MoE
   )
-conf_ints
 
 
-## ----tactile-conf-int, echo=FALSE, message=FALSE, warning=FALSE, fig.cap= "33 95 percent confidence intervals based on 33 tactile samples of size n = 50.", fig.height=6----
+## ----echo=FALSE---------------------------------------------------------------
+if(!knitr::is_latex_output())
+  conf_ints
+
+
+## ----tactile-conf-int, echo=FALSE, message=FALSE, warning=FALSE, fig.cap= "33 confidence intervals at the 95\\% level based on 33 tactile samples of size $n = 50$.", fig.height=6----
 conf_ints <- conf_ints %>% 
   mutate(
     y = 1:n(),
@@ -1112,14 +1165,14 @@ ggplot(conf_ints) +
     alpha = factor(captured, levels = c("TRUE", "FALSE"))
   )) +
   scale_y_continuous(breaks = 1:33, labels = groups) +
-  labs(x = expression("Proportion of red balls"), y = "Confidence interval number", 
+  labs(x = expression("Proportion of red balls"), y = "", 
        alpha = "Captured") + 
   theme_light() + 
   theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(),
         panel.grid.minor.x = element_blank())
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## # First: Take 100 virtual samples of n=50 balls
 ## virtual_samples <- bowl %>%
 ##   rep_sample_n(size = 50, reps = 100)
@@ -1181,3 +1234,9 @@ ggplot(conf_ints) +
 ##   ) +
 ##   geom_vline(xintercept = 900 / 2400, color = "red")
 
+
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
+
diff --git a/docs/scripts/09-hypothesis-testing.R b/docs/scripts/09-hypothesis-testing.R
index db320a786..23859bdc0 100644
--- a/docs/scripts/09-hypothesis-testing.R
+++ b/docs/scripts/09-hypothesis-testing.R
@@ -1,11 +1,11 @@
-## ----appendixb, echo=FALSE, results="asis"-------------------------------
+## ----appendixb, echo=FALSE, results="asis"------------------------------------
 if(!knitr::is_latex_output()){
   cat("If you'd like more practice or you're curious to see how this framework applies to different scenarios, you can find fully worked-out examples for many common hypothesis tests and their corresponding confidence intervals in Appendix B. ")
   cat("We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional theory-based methods like the $t$-test and normal-theory confidence intervals. You'll see there that these traditional methods are just approximations for the computer-based methods we've been focusing on. However, they also require conditions to be met for their results to be valid. Computer-based methods using randomization, simulation, and bootstrapping have far fewer restrictions. Furthermore, they help develop your computational thinking, which is one big reason they are emphasized throughout this book.")
 }
 
 
-## ----message=FALSE, warning=FALSE----------------------------------------
+## ----message=FALSE, warning=FALSE---------------------------------------------
 library(tidyverse)
 library(infer)
 library(moderndive)
@@ -13,7 +13,7 @@ library(nycflights13)
 library(ggplot2movies)
 
 
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in text.
 library(knitr)
 library(kableExtra)
@@ -22,20 +22,26 @@ library(scales)
 library(viridis)
 
 
-## ------------------------------------------------------------------------
-promotions
+## ----echo=FALSE---------------------------------------------------------------
+set.seed(2102)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## -----------------------------------------------------------------------------
+promotions %>% 
+  sample_n(size = 6) %>% 
+  arrange(id)
+
+
+## ----eval=FALSE---------------------------------------------------------------
 ## ggplot(promotions, aes(x = gender, fill = decision)) +
 ##   geom_bar() +
-##   labs(x = "Gender of name on resume")
+##   labs(x = "Gender of name on résumé")
 
 
-## ----promotions-barplot, echo=FALSE, fig.cap="Barplot of relationship between gender and promotion decision."----
+## ----promotions-barplot, echo=FALSE, fig.cap="Barplot relating gender to promotion decision.", fig.height=1.6----
 promotions_barplot <- ggplot(promotions, aes(x = gender, fill = decision)) +
   geom_bar() +
-  labs(x = "Gender of name on resume")
+  labs(x = "Gender of name on résumé")
 if(knitr::is_html_output()){
   promotions_barplot
 } else {
@@ -43,12 +49,13 @@ if(knitr::is_html_output()){
 }
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 promotions %>% 
   group_by(gender, decision) %>% 
-  summarize(n = n())
+  tally()
+
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 observed_test_statistic <- promotions %>% 
   specify(decision ~ gender, success = "promoted") %>% 
   calculate(stat = "diff in props", order = c("male", "female")) %>% 
@@ -56,21 +63,22 @@ observed_test_statistic <- promotions %>%
   round(3)
 
 
-## ----compare-six, echo=FALSE---------------------------------------------
+## ----compare-six, echo=FALSE--------------------------------------------------
 set.seed(2019)
 # Pick out 6 rows
 promotions_sample <- promotions %>%
   slice(c(36, 39, 40, 1, 2, 22)) %>% 
   mutate(`shuffled gender` = sample(gender)) %>% 
   select(-id) %>% 
-  mutate(`resume number` = 1:n()) %>% 
-  select(`resume number`, everything())
+  mutate(`résumé number` = 1:n()) %>% 
+  select(`résumé number`, everything())
 
 promotions_sample  %>% 
   kable(
-    caption = "One example of shuffling gender variable.", 
+    caption = "One example of shuffling gender variable", 
     booktabs = TRUE,
-    longtable = TRUE
+    longtable = TRUE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position", "repeat_header"))
@@ -80,16 +88,17 @@ promotions_sample  %>%
 
 
 
-## ------------------------------------------------------------------------
-promotions_shuffled
+## ---- eval=FALSE--------------------------------------------------------------
+## promotions_shuffled %>% slice(c(11, 26, 28, 36, 37, 46))
 
 
-## ---- eval=FALSE---------------------------------------------------------
-## ggplot(promotions_shuffled, aes(x = gender, fill = decision)) +
+## ---- eval=FALSE--------------------------------------------------------------
+## ggplot(promotions_shuffled,
+##        aes(x = gender, fill = decision)) +
 ##   geom_bar() +
-##   labs(x = "Gender of resume name")
+##   labs(x = "Gender of résumé name")
 
-## ----promotions-barplot-permuted, fig.cap="Barplots of relationship of promotion with gender (left) and shuffled gender (right).", echo=FALSE----
+## ----promotions-barplot-permuted, fig.cap="Barplots of relationship of promotion with gender (left) and shuffled gender (right).", fig.height=4.7, echo=FALSE----
 height1 <- promotions %>% 
   group_by(gender, decision) %>% 
   summarize(n = n()) %>% 
@@ -104,12 +113,12 @@ height <- max(height1, height2)
 
 plot1 <- ggplot(promotions, aes(x = gender, fill = decision)) +
   geom_bar() +
-  labs(x = "Gender of resume name", title = "Original") +
+  labs(x = "Gender of résumé name", title = "Original") +
   theme(legend.position = "none") +
   coord_cartesian(ylim= c(0, height))
 plot2 <- ggplot(promotions_shuffled, aes(x = gender, fill = decision)) +
   geom_bar() +
-  labs(x = "Gender of resume name", y ="", title = "Shuffled") +
+  labs(x = "Gender of résumé name", y ="", title = "Shuffled") +
   coord_cartesian(ylim= c(0, height))
 if(knitr::is_html_output()){
   plot1 + plot2
@@ -118,12 +127,12 @@ if(knitr::is_html_output()){
 }
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 promotions_shuffled %>% 
   group_by(gender, decision) %>% 
-  summarize(n = n())
+  tally() # Same as summarize(n = n())
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 # male stats
 n_men_promoted <- promotions_shuffled %>% 
   filter(decision == "promoted", gender == "male") %>% 
@@ -152,7 +161,7 @@ prop_women_promoted <- round(prop_women_promoted, 3)
 
 
 
-## ---- eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE----------------
+## ---- eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE---------------------
 # https://docs.google.com/spreadsheets/d/1Q-ENy3o5IrpJshJ7gn3hJ5A0TOWV2AZrKNHMsshQtiE/edit#gid=0
 if(!file.exists("rds/shuffled_data.rds")){
   shuffled_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQXLJxwSp1ALEJ1JRNn3o8K3jVdqRG_5yxpoOhIFYflbFIkb2ttH73w8mljptn12CsDyIvjr5p0IGUe/pub?gid=0&single=true&output=csv")
@@ -181,14 +190,14 @@ shuffled_data_tidy <- shuffled_data_tidy %>%
 
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## obs_diff_prop <- promotions %>%
 ##   specify(decision ~ gender, success = "promoted") %>%
 ##   calculate(stat = "diff in props", order = c("male", "female"))
 ## obs_diff_prop
 
 
-## ----echo=FALSE, eval=FALSE----------------------------------------------
+## ----echo=FALSE, eval=FALSE---------------------------------------------------
 ## set.seed(2019)
 ## tactile_permutes <- promotions %>%
 ##   specify(decision ~ gender, success = "promoted") %>%
@@ -201,14 +210,15 @@ shuffled_data_tidy <- shuffled_data_tidy %>%
 ##   scale_y_continuous(breaks = 0:10)
 
 
-## ----table-diff-prop, echo=FALSE, message=FALSE--------------------------
+## ----table-diff-prop, echo=FALSE, message=FALSE-------------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
-  write_rds(table_ch3, "rds/sampling_scenarios.rds")
+    read_csv(na = "") %>% 
+    slice(1:5)
+  write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
 }
@@ -219,7 +229,8 @@ sampling_scenarios %>%
   kable(
     caption = "Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
@@ -232,7 +243,7 @@ sampling_scenarios %>%
 
 
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 num <- sum(shuffled_data_tidy$stat >= observed_test_statistic)
 denom <- nrow(shuffled_data_tidy)
 p_val <- round((num + 1)/(denom + 1),3)
@@ -242,31 +253,32 @@ p_val <- round((num + 1)/(denom + 1),3)
 
 
 
-## ---- echo=FALSE---------------------------------------------------------
-alpha <- 0.001
+## ---- echo=FALSE--------------------------------------------------------------
+alpha <- 0.05
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 promotions %>% 
-  specify(formula = decision ~ gender, success = "promoted")
+  specify(formula = decision ~ gender, success = "promoted") 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 promotions %>% 
   specify(formula = decision ~ gender, success = "promoted") %>% 
   hypothesize(null = "independence")
 
 
-## ----eval=FALSE----------------------------------------------------------
-## promotions %>%
+## ----eval=FALSE---------------------------------------------------------------
+## promotions_generate <- promotions %>%
 ##   specify(formula = decision ~ gender, success = "promoted") %>%
 ##   hypothesize(null = "independence") %>%
 ##   generate(reps = 1000, type = "permute")
+## nrow(promotions_generate)
 
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## null_distribution <- promotions %>%
 ##   specify(formula = decision ~ gender, success = "promoted") %>%
 ##   hypothesize(null = "independence") %>%
@@ -277,31 +289,33 @@ promotions %>%
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 obs_diff_prop <- promotions %>% 
   specify(decision ~ gender, success = "promoted") %>% 
   calculate(stat = "diff in props", order = c("male", "female"))
 obs_diff_prop
 
 
+## ----null-distribution-infer, fig.show='hold', fig.cap="Null distribution.", fig.height=1.8----
+visualize(null_distribution, bins = 10)
 
 
-## ----null-distribution-infer-2, fig.cap="Shaded histogram to show p-value."----
+## ----null-distribution-infer-2, fig.cap="Shaded histogram to show $p$-value."----
 visualize(null_distribution, bins = 10) + 
   shade_p_value(obs_stat = obs_diff_prop, direction = "right")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 null_distribution %>% 
   get_p_value(obs_stat = obs_diff_prop, direction = "right")
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 p_value <- null_distribution %>% 
   get_p_value(obs_stat = obs_diff_prop, direction = "right") %>% 
   mutate(p_value = round(p_value, 3))
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## null_distribution <- promotions %>%
 ##   specify(formula = decision ~ gender, success = "promoted") %>%
 ##   hypothesize(null = "independence") %>%
@@ -309,7 +323,7 @@ p_value <- null_distribution %>%
 ##   calculate(stat = "diff in props", order = c("male", "female"))
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## bootstrap_distribution <- promotions %>%
 ##   specify(formula = decision ~ gender, success = "promoted") %>%
 ##   # Change 1 - Remove hypothesize():
@@ -321,26 +335,26 @@ p_value <- null_distribution %>%
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 percentile_ci <- bootstrap_distribution %>% 
   get_confidence_interval(level = 0.95, type = "percentile")
 percentile_ci
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution) +
 ##   shade_confidence_interval(endpoints = percentile_ci)
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 se_ci <- bootstrap_distribution %>% 
   get_confidence_interval(level = 0.95, type = "se", 
                           point_estimate = obs_diff_prop)
 se_ci
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distribution) +
 ##   shade_confidence_interval(endpoints = se_ci)
 
@@ -352,7 +366,11 @@ se_ci
 
 
 
-## ----eval=FALSE, echo=FALSE----------------------------------------------
+
+
+
+
+## ----eval=FALSE, echo=FALSE---------------------------------------------------
 ## tibble(
 ##   verdict = c("Not guilty verdict", "Guilty verdict"),
 ##   `Truly not guilty` = c("Correct", "Type I error"),
@@ -374,7 +392,7 @@ se_ci
 knitr::include_graphics("images/gt_error_table.png")
 
 
-## ----eval=FALSE, echo=FALSE----------------------------------------------
+## ----hypo-test-errors, eval=FALSE, echo=FALSE---------------------------------
 ## tibble(
 ##   Decision = c("Fail to reject H0", "Reject H0"),
 ##   `H0 true` = c("Correct", "Type I error"),
@@ -400,26 +418,26 @@ knitr::include_graphics("images/gt_error_table_ht.png")
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 movies
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 movies_sample
 
 
-## ----action-romance-boxplot, fig.cap="Boxplot of IMDb rating vs genre."----
+## ----action-romance-boxplot, fig.cap="Boxplot of IMDb rating vs. genre.", fig.height=2.7----
 ggplot(data = movies_sample, aes(x = genre, y = rating)) +
   geom_boxplot() +
   labs(y = "IMDb rating")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 movies_sample %>% 
   group_by(genre) %>% 
   summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 movies_genre_summaries <- movies_sample %>% 
   group_by(genre) %>% 
   summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))
@@ -438,13 +456,14 @@ n_romance <- movies_genre_summaries %>%
   pull(n)
 
 
-## ----summarytable-ch10, echo=FALSE, message=FALSE------------------------
+## ----summarytable-ch10, echo=FALSE, message=FALSE-----------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
+    read_csv(na = "") %>% 
+    slice(1:5)
   write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
@@ -455,7 +474,8 @@ sampling_scenarios %>%
   kable(
     caption = "Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
@@ -466,25 +486,27 @@ sampling_scenarios %>%
   column_spec(5, width = "1in")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 movies_sample %>% 
   specify(formula = rating ~ genre)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 movies_sample %>% 
   specify(formula = rating ~ genre) %>% 
   hypothesize(null = "independence")
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## movies_sample %>%
 ##   specify(formula = rating ~ genre) %>%
 ##   hypothesize(null = "independence") %>%
-##   generate(reps = 1000, type = "permute")
+##   generate(reps = 1000, type = "permute") %>%
+##   View()
 
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
+set.seed(76)
 if(!file.exists("rds/movies_sample_generate.rds")){
   movies_sample_generate <- movies_sample %>% 
     specify(formula = rating ~ genre) %>% 
@@ -494,10 +516,9 @@ if(!file.exists("rds/movies_sample_generate.rds")){
 } else {
   movies_sample_generate <- read_rds("rds/movies_sample_generate.rds")
 }
-movies_sample_generate
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## null_distribution_movies <- movies_sample %>%
 ##   specify(formula = rating ~ genre) %>%
 ##   hypothesize(null = "independence") %>%
@@ -508,19 +529,19 @@ movies_sample_generate
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 obs_diff_means <- movies_sample %>% 
   specify(formula = rating ~ genre) %>% 
   calculate(stat = "diff in means", order = c("Action", "Romance"))
 obs_diff_means
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(null_distribution_movies, bins = 10) +
 ##   shade_p_value(obs_stat = obs_diff_means, direction = "both")
 
 
-## ----null-distribution-movies-2, echo=FALSE, fig.cap="Null distribution, observed test statistic, and p-value."----
+## ----null-distribution-movies-2, echo=FALSE, fig.cap="Null distribution, observed test statistic, and $p$-value.", fig.height=1.8----
 if(knitr::is_html_output()){
   visualize(null_distribution_movies, bins = 10) + 
     shade_p_value(obs_stat = obs_diff_means, direction = "both")
@@ -531,11 +552,11 @@ if(knitr::is_html_output()){
 }
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 null_distribution_movies %>% 
   get_p_value(obs_stat = obs_diff_means, direction = "both")
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 p_value_movies <- null_distribution_movies %>% 
   get_p_value(obs_stat = obs_diff_means, direction = "both") %>% 
   mutate(p_value = round(p_value, 3))
@@ -545,7 +566,7 @@ p_value_movies <- null_distribution_movies %>%
 
 
 
-## ----zcurve, echo=FALSE, out.width="80%", fig.cap="Standard normal z curve."----
+## ----zcurve, echo=FALSE, out.width="100%", fig.cap="Standard normal z curve.", fig.height=1.3----
 ggplot(data.frame(x = c(-4, 4)), aes(x)) + stat_function(fun = dnorm) +
   labs(x = "z", y = "") + 
   theme_light() +
@@ -559,12 +580,12 @@ ggplot(data.frame(x = c(-4, 4)), aes(x)) + stat_function(fun = dnorm) +
 
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 movies_sample %>% 
   group_by(genre) %>% 
   summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))
 
-## ---- echo=FALSE---------------------------------------------------------
+## ---- echo=FALSE--------------------------------------------------------------
 t_stat <- movies_sample %>% 
   specify(formula = rating ~ genre) %>% 
   calculate(stat = "t", order = c("Action", "Romance")) %>% 
@@ -572,7 +593,7 @@ t_stat <- movies_sample %>%
   round(3)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Construct null distribution of xbar_a - xbar_r:
 ## null_distribution_movies <- movies_sample %>%
 ##   specify(formula = rating ~ genre) %>%
@@ -582,7 +603,7 @@ t_stat <- movies_sample %>%
 ## visualize(null_distribution_movies, bins = 10)
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## # Construct null distribution of t:
 ## null_distribution_movies_t <- movies_sample %>%
 ##   specify(formula = rating ~ genre) %>%
@@ -595,7 +616,7 @@ t_stat <- movies_sample %>%
 
 
 
-## ----comparing-diff-means-t-stat, fig.align='center', out.width='100%', fig.cap="Comparing the null distributions of two test statistics.", echo=FALSE----
+## ----comparing-diff-means-t-stat, fig.align='center', fig.height=3, fig.cap="Comparing the null distributions of two test statistics.", echo=FALSE----
 # Visualize:
 null_dist_1 <- visualize(null_distribution_movies, bins = 10) +
   labs(title = "Difference in means")
@@ -604,68 +625,72 @@ null_dist_2 <- visualize(null_distribution_movies_t, bins = 10) +
 null_dist_1 + null_dist_2
 
 
-## ----t-stat-3, fig.align='center', out.width='100%', fig.cap="Null distribution using t-statistic and t-distribution."----
+## ----t-stat-3, fig.align='center', fig.cap="Null distribution using t-statistic and t-distribution.", fig.height=2.2----
 visualize(null_distribution_movies_t, bins = 10, method = "both")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 obs_two_sample_t <- movies_sample %>% 
   specify(formula = rating ~ genre) %>% 
   calculate(stat = "t", order = c("Action", "Romance"))
 obs_two_sample_t
 
 
-## ----t-stat-4, fig.align='center', out.width='100%', fig.cap="Null distribution using t-statistic and t-distribution with p-value shaded."----
+## ----t-stat-4, fig.align='center', fig.cap="Null distribution using t-statistic and t-distribution with $p$-value shaded.", warning=TRUE, fig.height=1.7----
 visualize(null_distribution_movies_t, method = "both") +
   shade_p_value(obs_stat = obs_two_sample_t, direction = "both")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 null_distribution_movies_t %>% 
   get_p_value(obs_stat = obs_two_sample_t, direction = "both")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 flights_sample <- flights %>% 
   filter(carrier %in% c("HA", "AS"))
 
 
-## ----ha-as-flights-boxplot, fig.cap="Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013."----
+## ----ha-as-flights-boxplot, fig.cap="Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013.", fig.height=2.8----
 ggplot(data = flights_sample, mapping = aes(x = carrier, y = air_time)) +
   geom_boxplot() +
   labs(x = "Carrier", y = "Air Time")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 flights_sample %>% 
   group_by(carrier, dest) %>% 
-  summarize(n = n(), mean_time = mean(air_time, na.rm =TRUE))
+  summarize(n = n(), mean_time = mean(air_time, na.rm = TRUE))
 
 
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
 
 
 
 
 
 
-
-
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
 ## score_model <- lm(score ~ bty_avg, data = evals)
+## 
 ## # Get regression table:
 ## get_regression_table(score_model)
 
 
-## ----regression-table-inference, echo=FALSE------------------------------
+## ----regression-table-inference, echo=FALSE-----------------------------------
 # Fit regression model:
 score_model <- lm(score ~ bty_avg, data = evals)
 # Get regression table:
 get_regression_table(score_model) %>%
   knitr::kable(
     digits = 3,
-    caption = "Linear regression table.",
-    booktabs = TRUE
+    caption = "Linear regression table",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
diff --git a/docs/scripts/10-inference-for-regression.R b/docs/scripts/10-inference-for-regression.R
index 0ea2947dd..c790b093a 100644
--- a/docs/scripts/10-inference-for-regression.R
+++ b/docs/scripts/10-inference-for-regression.R
@@ -1,4 +1,4 @@
-## ----setup_inference_regression, include=FALSE---------------------------
+## ----setup_inference_regression, include=FALSE--------------------------------
 chap <- 10
 lc <- 0
 rq <- 0
@@ -18,13 +18,13 @@ options(scipen = 99, digits = 3)
 set.seed(76)
 
 
-## ----message=FALSE, warning=FALSE----------------------------------------
+## ----message=FALSE, warning=FALSE---------------------------------------------
 library(tidyverse)
 library(moderndive)
 library(infer)
 
 
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
+## ----message=FALSE, warning=FALSE, echo=FALSE---------------------------------
 # Packages needed internally, but not in text.
 library(knitr)
 library(tidyr)
@@ -32,40 +32,43 @@ library(kableExtra)
 library(patchwork)
 
 
-## ------------------------------------------------------------------------
-evals_ch6 <- evals %>%
+## -----------------------------------------------------------------------------
+evals_ch5 <- evals %>%
   select(ID, score, bty_avg, age)
-glimpse(evals_ch6)
+glimpse(evals_ch5)
 
-## ---- echo=FALSE---------------------------------------------------------
-cor_ch6 <- evals_ch6 %>%
+## ---- echo=FALSE--------------------------------------------------------------
+cor_ch6 <- evals_ch5 %>%
   summarize(correlation = cor(score, bty_avg)) %>%
   pull(correlation) %>%
   round(3)
 
 
-## ----regline, fig.cap="Relationship with regression line."---------------
-ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+## ----regline, fig.cap="Relationship with regression line.", fig.height=3.2----
+ggplot(evals_ch5, 
+       aes(x = bty_avg, y = score)) +
   geom_point() +
-  labs(x = "Beauty Score", y = "Teaching Score",
+  labs(x = "Beauty Score", 
+       y = "Teaching Score",
        title = "Relationship between teaching and beauty scores") +  
   geom_smooth(method = "lm", se = FALSE)
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## # Fit regression model:
-## score_model <- lm(score ~ bty_avg, data = evals_ch6)
+## score_model <- lm(score ~ bty_avg, data = evals_ch5)
 ## # Get regression table:
 ## get_regression_table(score_model)
 
-## ----regtable-11, echo=FALSE---------------------------------------------
+## ----regtable-11, echo=FALSE--------------------------------------------------
 # Fit regression model:
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+score_model <- lm(score ~ bty_avg, data = evals_ch5)
 get_regression_table(score_model) %>%
   knitr::kable(
     digits = 3,
-    caption = "Previously seen linear regression table.",
-    booktabs = TRUE
+    caption = "Previously seen linear regression table",
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -88,24 +91,26 @@ lower0 <- intercept_row %>% pull(lower_ci)
 upper0 <- intercept_row %>% pull(upper_ci)
 
 
-## ----summarytable-ch11, echo=FALSE, message=FALSE------------------------
+## ----summarytable-ch11, echo=FALSE, message=FALSE-----------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
-  write_rds(table_ch3, "rds/sampling_scenarios.rds")
+    read_csv(na = "") %>% 
+    slice(1:5)
+  write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
 }
 
 sampling_scenarios %>%  
-  filter(Scenario %in% 1:6) %>% 
+#  filter(Scenario %in% 1:5) %>% 
   kable(
     caption = "Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
@@ -116,12 +121,13 @@ sampling_scenarios %>%
   column_spec(5, width = "1in")
 
 
-## ----score-model-part-deux, echo=FALSE-----------------------------------
+## ----score-model-part-deux, echo=FALSE----------------------------------------
 get_regression_table(score_model) %>%
   knitr::kable(
-    caption = "Previously seen regression table.", 
+    caption = "Previously seen regression table", 
     digits = 3,
-    booktabs = TRUE
+    booktabs = TRUE,
+    linesep = ""
   ) %>%
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position"))
@@ -129,7 +135,7 @@ get_regression_table(score_model) %>%
 
 ## ----residual-example, echo=FALSE, warning=FALSE, fig.cap="Example of observed value, fitted value, and residual."----
 # Pick out one particular point to drill down on
-index <- which(evals_ch6$bty_avg == 7.333 & evals_ch6$score == 4.9)
+index <- which(evals_ch5$bty_avg == 7.333 & evals_ch5$score == 4.9)
 target_point <- score_model %>%
   get_regression_points() %>%
   slice(index)
@@ -139,7 +145,7 @@ y_hat <- target_point$score_hat
 resid <- target_point$residual
 
 # Plot residual
-best_fit_plot <- ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
+best_fit_plot <- ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
   geom_point() +
   labs(x = "Beauty Score", y = "Teaching Score",
        title = "Relationship of teaching and beauty scores") +
@@ -151,17 +157,17 @@ best_fit_plot <- ggplot(evals_ch6, aes(x = bty_avg, y = score)) +
 best_fit_plot
 
 
-## ---- eval=TRUE, echo=TRUE-----------------------------------------------
+## ---- eval=TRUE, echo=TRUE----------------------------------------------------
 # Fit regression model:
-score_model <- lm(score ~ bty_avg, data = evals_ch6)
+score_model <- lm(score ~ bty_avg, data = evals_ch5)
 # Get regression points:
 regression_points <- get_regression_points(score_model)
 regression_points
 
 
-## ----non-linear, fig.cap="Example of clearly non-linear relationship.", echo=FALSE----
+## ----non-linear, fig.cap="Example of a clearly non-linear relationship.", echo=FALSE, fig.height=3.3----
 set.seed(76)
-evals_ch6 %>% 
+evals_ch5 %>% 
   mutate(
     x = bty_avg,
     y = (x-3)*(x-6) + rnorm(n(), 0, 0.75)
@@ -173,12 +179,12 @@ evals_ch6 %>%
   expand_limits(y = 10)
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 evals %>% 
   select(ID, prof_ID, score, bty_avg)
 
 
-## ---- eval=FALSE, echo=TRUE----------------------------------------------
+## ---- eval=FALSE, echo=TRUE---------------------------------------------------
 ## ggplot(regression_points, aes(x = residual)) +
 ##   geom_histogram(binwidth = 0.25, color = "white") +
 ##   labs(x = "Residual")
@@ -189,9 +195,9 @@ ggplot(regression_points, aes(x = residual)) +
   labs(x = "Residual")
 
 
-## ----normal-residuals, echo=FALSE, warning=FALSE, fig.cap="Example of clearly normal and clearly non-normal residuals."----
+## ----normal-residuals, echo=FALSE, warning=FALSE, fig.cap="Example of clearly normal and clearly not normal residuals."----
 sigma <- sd(regression_points$residual)
-evals_ch6 %>% 
+normal_and_not_examples <- evals_ch5 %>% 
   mutate(
     `Clearly normal` = rnorm(n = n(), 0, sd = sigma),
     `Clearly not normal` = rnorm(n = n(), mean = 0, sd = sigma)^2,
@@ -202,10 +208,19 @@ evals_ch6 %>%
   ggplot(aes(x = eps)) +
   geom_histogram(binwidth = 0.25, color = "white") +
   labs(x = "Residual") +
-  facet_wrap( ~ type, scales = "free")
+  facet_wrap(~ type, scales = "free")
+
+if(knitr::is_latex_output()){
+  normal_and_not_examples +
+    theme(strip.text = element_text(color = "black"),
+          strip.background = element_rect(fill = "grey93"))
+} else {
+  normal_and_not_examples
+}
+  
 
 
-## ---- eval=FALSE, echo=TRUE----------------------------------------------
+## ---- eval=FALSE, echo=TRUE---------------------------------------------------
 ## ggplot(regression_points, aes(x = bty_avg, y = residual)) +
 ##   geom_point() +
 ##   labs(x = "Beauty Score", y = "Residual") +
@@ -219,7 +234,7 @@ ggplot(regression_points, aes(x = bty_avg, y = residual)) +
 
 
 ## ----equal-variance-residuals, echo=FALSE, warning=FALSE, fig.cap="Example of clearly non-equal variance."----
-evals_ch6 %>% 
+evals_ch5 %>% 
   mutate(eps = (rnorm(n(), 0, 0.075 * bty_avg ^ 2)) * 0.4) %>% 
   ggplot(aes(x = bty_avg, y = eps)) +
   geom_point() +
@@ -231,13 +246,14 @@ evals_ch6 %>%
 
 
 
-## ----eval=FALSE----------------------------------------------------------
-## bootstrap_distn_slope <- evals_ch6 %>%
+## ----eval=FALSE---------------------------------------------------------------
+## bootstrap_distn_slope <- evals_ch5 %>%
 ##   specify(formula = score ~ bty_avg) %>%
 ##   generate(reps = 1000, type = "bootstrap") %>%
 ##   calculate(stat = "slope")
+## bootstrap_distn_slope
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/bootstrap_distn_slope.rds")){
   set.seed(76)
   bootstrap_distn_slope <- evals %>% 
@@ -249,54 +265,54 @@ if(!file.exists("rds/bootstrap_distn_slope.rds")){
 } else {
   bootstrap_distn_slope <- readRDS("rds/bootstrap_distn_slope.rds")
 }
-
-## ------------------------------------------------------------------------
 bootstrap_distn_slope
 
 
-## ----eval=FALSE----------------------------------------------------------
+## ----eval=FALSE---------------------------------------------------------------
 ## visualize(bootstrap_distn_slope)
 
 
 
-## ------------------------------------------------------------------------
+
+## -----------------------------------------------------------------------------
 percentile_ci <- bootstrap_distn_slope %>% 
   get_confidence_interval(type = "percentile", level = 0.95)
 percentile_ci
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 observed_slope <- evals %>% 
   specify(score ~ bty_avg) %>% 
   calculate(stat = "slope")
 observed_slope
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 se_ci <- bootstrap_distn_slope %>% 
   get_ci(level = 0.95, type = "se", point_estimate = observed_slope)
 se_ci
 
 
-## ---- eval=FALSE---------------------------------------------------------
+## ---- eval=FALSE--------------------------------------------------------------
 ## visualize(bootstrap_distn_slope) +
 ##   shade_confidence_interval(endpoints = percentile_ci, fill = NULL,
-##                             linetype = "solid", color = "black") +
+##                             linetype = "solid", color = "grey90") +
 ##   shade_confidence_interval(endpoints = se_ci, fill = NULL,
-##                             linetype = "dashed", color = "black") +
+##                             linetype = "dashed", color = "grey60") +
 ##   shade_confidence_interval(endpoints = c(0.035, 0.099), fill = NULL,
 ##                             linetype = "dotted", color = "black")
 
 
 
-## ----eval=FALSE----------------------------------------------------------
+
+## ----eval=FALSE---------------------------------------------------------------
 ## null_distn_slope <- evals %>%
 ##   specify(score ~ bty_avg) %>%
 ##   hypothesize(null = "independence") %>%
 ##   generate(reps = 1000, type = "permute") %>%
 ##   calculate(stat = "slope")
 
-## ----echo=FALSE----------------------------------------------------------
+## ----echo=FALSE---------------------------------------------------------------
 if(!file.exists("rds/null_distn_slope.rds")){
   set.seed(76)
   null_distn_slope <- evals %>% 
@@ -311,18 +327,16 @@ if(!file.exists("rds/null_distn_slope.rds")){
 }
 
 
-## ----eval=FALSE----------------------------------------------------------
-## visualize(null_distn_slope)
-
-
+## ----null-distribution-slope, echo=FALSE, fig.show='hold', fig.cap="Null distribution of slopes.", fig.height=2.5----
+visualize(null_distn_slope)
 
-## ---- eval=FALSE---------------------------------------------------------
-## visualize(null_distn_slope) +
-##   shade_p_value(obs_stat = observed_slope, direction = "both")
 
+## ----p-value-slope, echo=FALSE, fig.show='hold', fig.cap="Null distribution and $p$-value.", fig.height=3----
+visualize(null_distn_slope) + 
+  shade_p_value(obs_stat = observed_slope, direction = "both")
 
 
-## ------------------------------------------------------------------------
+## -----------------------------------------------------------------------------
 null_distn_slope %>% 
   get_p_value(obs_stat = observed_slope, direction = "both")
 
@@ -331,14 +345,15 @@ null_distn_slope %>%
 
 
 
-## ----table-ch11, echo=FALSE, message=FALSE-------------------------------
+## ----table-ch11, echo=FALSE, message=FALSE------------------------------------
 # The following Google Doc is published to CSV and loaded using read_csv():
 # https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
 
 if(!file.exists("rds/sampling_scenarios.rds")){
   sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>% 
-    read_csv(na = "")
-  write_rds(table_ch3, "rds/sampling_scenarios.rds")
+    read_csv(na = "") %>% 
+    slice(1:5)
+  write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
 } else {
   sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
 }
@@ -348,13 +363,20 @@ sampling_scenarios %>%
   kable(
     caption = "\\label{tab:summarytable-ch9}Scenarios of sampling for inference", 
     booktabs = TRUE,
-    escape = FALSE
+    escape = FALSE,
+    linesep = ""
   ) %>% 
   kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
                 latex_options = c("hold_position")) %>%
   column_spec(1, width = "0.5in") %>% 
-  column_spec(2, width = "0.7in") %>%
-  column_spec(3, width = "1in") %>%
-  column_spec(4, width = "1.1in") %>% 
-  column_spec(5, width = "1in")
+  column_spec(2, width = "1.5in") %>%
+  column_spec(3, width = "0.65in") %>%
+  column_spec(4, width = "1.6in") %>% 
+  column_spec(5, width = "0.65in")
+
+
+## ----echo=FALSE, results="asis"-----------------------------------------------
+if(knitr::is_latex_output()){
+  cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
+} 
 
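Reviewer note on the `table-ch11` hunk above: the old line wrote `table_ch3`, a name the chunk shown here never assigns, while the new line writes the `sampling_scenarios` object that was just read. The compute-once, cache-to-`.rds` pattern used by that chunk can be sketched as below. This is only an illustration under stated assumptions: it uses base R's `saveRDS()`/`readRDS()` in place of the readr `write_rds()`/`read_rds()` calls in the book, and `cache_rds()` is a hypothetical helper that does not exist in the repository.

```r
# Hypothetical helper (not in the repo): compute an object once, cache it
# as an .rds file, and reuse the cached copy on later runs.
cache_rds <- function(path, compute) {
  if (!file.exists(path)) {
    result <- compute()
    # Save the object that was just computed, under the same name,
    # which is the point of the fix in the hunk above.
    saveRDS(result, path)
  } else {
    result <- readRDS(path)
  }
  result
}

# Usage: the first call computes and writes, the second call only reads.
path <- tempfile(fileext = ".rds")
first <- cache_rds(path, function() data.frame(scenario = 1:5))
second <- cache_rds(path, function() stop("should not be recomputed"))
```

Wrapping the pattern in a function keeps the computed object and the saved object tied to one name, so a mismatch like the one fixed above is harder to introduce.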
diff --git a/docs/scripts/11-tell-the-story-with-data.R b/docs/scripts/11-tell-the-story-with-data.R
deleted file mode 100644
index 18091e34d..000000000
--- a/docs/scripts/11-tell-the-story-with-data.R
+++ /dev/null
@@ -1,295 +0,0 @@
-## ----setup_thinking_with_data, include=FALSE-----------------------------
-chap <- 11
-lc <- 0
-rq <- 0
-# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
-# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
-
-knitr::opts_chunk$set(
-  tidy = FALSE, 
-  out.width = '\\textwidth', 
-  fig.height = 4,
-  warning = FALSE
-  )
-
-options(scipen = 99, digits = 3)
-
-# Set random number generator seed value for replicable pseudorandomness.
-set.seed(76)
-
-
-
-
-## ----pipeline-figure-conclusion, echo=FALSE, fig.align='center', fig.cap="Data/Science Pipeline."----
-knitr::include_graphics("images/r4ds/data_science_pipeline.png")
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## library(tidyverse)
-## library(moderndive)
-## library(skimr)
-## library(fivethirtyeight)
-
-## ---- message=FALSE, warning=FALSE, echo=FALSE---------------------------
-library(tidyverse)
-library(moderndive)
-# DO NOT load the skimr package as a whole as it will break all kable() code for 
-# the remaining chapters in the book.
-# Furthermore all skimr::skim() output in this Chapter has been hard coded. 
-# library(skimr)
-library(fivethirtyeight)
-
-
-## ----message=FALSE, warning=FALSE, echo=FALSE----------------------------
-# Packages needed internally, but not in text.
-library(knitr)
-library(kableExtra)
-library(patchwork)
-library(scales)
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## View(house_prices)
-## glimpse(house_prices)
-
-## ---- echo=FALSE---------------------------------------------------------
-glimpse(house_prices)
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## gain_summary <- flights %>%
-##   summarize(
-##     min = min(gain, na.rm = TRUE),
-##     q1 = quantile(gain, 0.25, na.rm = TRUE),
-##     median = quantile(gain, 0.5, na.rm = TRUE),
-##     q3 = quantile(gain, 0.75, na.rm = TRUE),
-##     max = max(gain, na.rm = TRUE),
-##     mean = mean(gain, na.rm = TRUE),
-##     sd = sd(gain, na.rm = TRUE),
-##     missing = sum(is.na(gain))
-##   )
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## house_prices %>%
-##   select(price, sqft_living, condition) %>%
-##   skim()
-
-
-## ---- eval = FALSE, message=FALSE, warning=FALSE-------------------------
-## # Histogram of house price:
-## ggplot(house_prices, aes(x = price)) +
-##   geom_histogram(color = "white") +
-##   labs(x = "price (USD)", title = "House price")
-## 
-## # Histogram of sqft_living:
-## ggplot(house_prices, aes(x = sqft_living)) +
-##   geom_histogram(color = "white") +
-##   labs(x = "living space (square feet)", title = "House size")
-## 
-## # Barplot of condition:
-## ggplot(house_prices, aes(x = condition)) +
-##   geom_bar() +
-##   labs(x = "condition", title = "House condition")
-
-
-## ----house-prices-viz, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Exploratory visualizations of Seattle house prices data.", fig.width=16/2, fig.height=9*2/3----
-p1 <- ggplot(house_prices, aes(x = price)) +
-  geom_histogram(color = "white") +
-  labs(x = "price (USD)", title = "House price") 
-p2 <- ggplot(house_prices, aes(x = sqft_living)) +
-  geom_histogram(color = "white") +
-  labs(x = "living space (square feet)", title = "House size")
-p3 <- ggplot(house_prices, aes(x = condition)) +
-  geom_bar() +
-  labs(x = "condition", title = "House condition")
-p1 + p2 + p3 + plot_layout(ncol = 2)
-
-
-## ------------------------------------------------------------------------
-house_prices <- house_prices %>%
-  mutate(
-    log10_price = log10(price),
-    log10_size = log10(sqft_living)
-    )
-
-
-## ------------------------------------------------------------------------
-house_prices %>% 
-  select(price, log10_price, sqft_living, log10_size)
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## # Before log10-transformation:
-## ggplot(house_prices, aes(x = price)) +
-##   geom_histogram(color = "white") +
-##   labs(x = "price (USD)", title = "House price: Before")
-## 
-## # After log10-transformation:
-## ggplot(house_prices, aes(x = log10_price)) +
-##   geom_histogram(color = "white") +
-##   labs(x = "log10 price (USD)", title = "House price: After")
-
-## ----log10-price-viz, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="House price before and after log10-transformation.", fig.width=16/2, fig.height=9/2----
-p1 <- ggplot(house_prices, aes(x = price)) +
-  geom_histogram(color = "white") +
-  labs(x = "price (USD)", title = "House price: Before")
-p2 <- ggplot(house_prices, aes(x = log10_price)) +
-  geom_histogram(color = "white") +
-  labs(x = "log10 price (USD)", title = "House price: After")
-p1 + p2
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## # Before log10-transformation:
-## ggplot(house_prices, aes(x = sqft_living)) +
-##   geom_histogram(color = "white") +
-##   labs(x = "living space (square feet)",
-##        title = "House size: Before")
-## 
-## # After log10-transformation:
-## ggplot(house_prices, aes(x = log10_size)) +
-##   geom_histogram(color = "white") +
-##   labs(x = "log10 living space (square feet)",
-##        title = "House size: After")
-
-## ----log10-size-viz, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="House size before and after log10-transformation.", fig.width=16/2, fig.height=9/2----
-p1 <- ggplot(house_prices, aes(x = sqft_living)) +
-  geom_histogram(color = "white") +
-  labs(x = "living space (square feet)", 
-       title = "House size: Before")
-p2 <- ggplot(house_prices, aes(x = log10_size)) +
-  geom_histogram(color = "white") +
-  labs(x = "log10 living space (square feet)", 
-       title = "House size: After")
-p1 + p2
-
-
-## ---- eval = FALSE-------------------------------------------------------
-## # Plot interaction model
-## ggplot(house_prices,
-##        aes(x = log10_size, y = log10_price, col = condition)) +
-##   geom_point(alpha = 0.05) +
-##   geom_smooth(method = "lm", se = FALSE) +
-##   labs(y = "log10 price", x = "log10 size",
-##        title = "House prices in Seattle")
-## 
-## # Plot parallel slopes model
-## gg_parallel_slopes(y = "log10_price", num_x = "log10_size",
-##                    cat_x = "condition", data = house_prices,
-##                    alpha = 0.05)
-
-## ----house-price-parallel-slopes, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Interaction and parallel slopes models."----
-interaction <- ggplot(house_prices, 
-                      aes(x = log10_size, y = log10_price, col = condition)) +
-  geom_point(alpha = 0.05) +
-  labs(y = "log10 price", x = "log10 size") +
-  geom_smooth(method = "lm", se = FALSE) +
-  guides(color=FALSE) +
-  labs(title = "House prices in Seattle", x = "log10 size", y = "log10 price")
-parallel_slopes <- 
-  gg_parallel_slopes(y = "log10_price", num_x = "log10_size", 
-                     cat_x = "condition", data = house_prices, alpha = 0.05) +
-  labs(y = NULL, x = "log10 size")
-if(knitr::is_html_output()){
-  interaction + parallel_slopes
-} else {
-  (interaction + scale_color_grey()) + 
-    (parallel_slopes + scale_color_grey())
-}
-
-
-## ----eval=FALSE----------------------------------------------------------
-## ggplot(house_prices,
-##        aes(x = log10_size, y = log10_price, col = condition)) +
-##   geom_point(alpha = 0.4) +
-##   geom_smooth(method = "lm", se = FALSE) +
-##   labs(y = "log10 price", x = "log10 size",
-##        title = "House prices in Seattle") +
-##   facet_wrap(~condition)
-
-
-## ----house-price-interaction-2, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Facetted plot of interaction model."----
-interaction_2_plot <- ggplot(house_prices, 
-                             aes(x = log10_size, y = log10_price, 
-                                 col = condition)) +
-  geom_point(alpha = 0.4) +
-  geom_smooth(method = "lm", se = FALSE) +
-  labs(y = "log10 price", x = "log10 size", 
-       title = "House prices in Seattle") +
-  facet_wrap(~condition)
-if(knitr::is_html_output()){
-  interaction_2_plot
-} else {
-  interaction_2_plot + scale_color_grey()
-}
-
-
-## ---- eval=FALSE---------------------------------------------------------
-## # Fit regression model:
-## price_interaction <- lm(log10_price ~ log10_size * condition,
-##                         data = house_prices)
-## # Get regression table:
-## get_regression_table(price_interaction)
-
-## ----seattle-interaction, echo=FALSE-------------------------------------
-price_interaction <- lm(log10_price ~ log10_size * condition, 
-                        data = house_prices)
-get_regression_table(price_interaction) %>% 
-  knitr::kable(
-    digits = 3,
-    caption = "Regression table for interaction model.", 
-    booktabs = TRUE
-  ) %>% 
-  kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
-                latex_options = c("hold_position"))
-
-
-## ----house-price-interaction-3, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Interaction model with prediction.", fig.width=16/2, fig.height=9/2----
-new_house <- data_frame(log10_size = log10(1900), condition = factor(5)) %>% 
-  get_regression_points(price_interaction, newdata = .)
-
-with_prediction_plot <- ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
-  geom_point(alpha = 0.05) +
-  labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") +
-  geom_smooth(method = "lm", se = FALSE) +
-  geom_vline(xintercept = log10(1900), linetype = "dashed", size = 1) +
-  geom_point(data = new_house, aes(y = log10_price_hat), col ="black", size = 3)
-if(knitr::is_html_output()){
-  with_prediction_plot
-} else {
-  with_prediction_plot + scale_color_grey()  
-}
-
-
-## ------------------------------------------------------------------------
-2.45 + 1 * log10(1900)
-
-
-## ------------------------------------------------------------------------
-10^(2.45 + 1 * log10(1900))
-
-
-
-
-
-
-## ------------------------------------------------------------------------
-glimpse(US_births_1994_2003)
-
-
-## ------------------------------------------------------------------------
-US_births_1999 <- US_births_1994_2003 %>%
-  filter(year == 1999)
-
-
-## ----us-births, fig.cap="Number of births in US in 1999.", fig.align='center'----
-ggplot(US_births_1999, aes(x = date, y = births)) +
-  geom_line() +
-  labs(x = "Date", y = "Number of births", title = "US Births in 1999")
-
-
-## ------------------------------------------------------------------------
-US_births_1999 %>% 
-  arrange(desc(births))
-
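Note on the removed prediction chunks (`2.45 + 1 * log10(1900)` and its `10^` back-transform): because the model is fit on a log10 scale, raising 10 to the fitted value returns the prediction on the original price scale. The same number results from multiplying `10^2.45` by `1900^1`. A small sketch of that arithmetic follows; the variable names are chosen here for illustration only, and the values 2.45 and 1 are taken as written in the removed code.

```r
# Fitted value on the log10 scale, using the numbers from the removed chunk.
intercept <- 2.45
slope <- 1
size_sqft <- 1900

log10_price_hat <- intercept + slope * log10(size_sqft)

# Back-transform to the original (USD) scale.
price_hat <- 10^log10_price_hat

# Equivalent multiplicative form: 10^intercept * size^slope.
price_hat_alt <- 10^intercept * size_sqft^slope

# Check that the two forms agree up to floating-point tolerance.
forms_agree <- isTRUE(all.equal(price_hat, price_hat_alt))
```

The multiplicative form makes the interpretation of a log-log slope visible: with a slope of 1, the predicted price scales in proportion to living space.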
diff --git a/docs/search_index.json b/docs/search_index.json
index eb981ae04..cf608db6b 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -1,20 +1,23 @@
 [
-["index.html", "Statistical Inference via Data Science A moderndive into R and the tidyverse Preface Introduction for students Introduction for instructors Connect and contribute About this book About the authors", " Statistical Inference via Data Science A moderndive into R and the tidyverse Chester Ismay and Albert Y. Kim August 28, 2019 Preface Special Announcement We’re excited to announce that we’ve signed a book deal with CRC Press! We will be publishing our first fully complete online version of ModernDive in Summer 2019, with a corresponding print edition to follow in Fall 2019. Don’t worry though, our content will remain freely available on ModernDive.com. Please note that you are currently looking at the “development version” of ModernDive, which is a work in progress currently being edited and thus subject to frequent change. For the latest “released version” of ModernDive, which is updated around twice a year, please visit ModernDive.com. Help! I’m new to R and RStudio and I need to learn about them! However, I’m completely new to coding! What do I do? If you’re asking yourself this question, then you’ve come to the right place! Start with the “Introduction for students” section. Are you an instructor hoping to use this book in your courses? Then read the “Introduction for instructors” section for more information on how to teach with this book. Are you looking to connect with and contribute to ModernDive? Then read the “Connect and contribute” section for information on how. Are you curious about the publishing of this book? Then read the “About this book” section for more information on the open-source technology, in particular R Markdown and the bookdown package. This is version 0.6.1 of ModernDive published on August 28, 2019. For previous versions of ModernDive, see the “About this book” section below. Introduction for students This book assumes no prerequisites: no algebra, no calculus, and no prior programming/coding experience. 
This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would. We present a map of your upcoming journey in Figure 0.1. FIGURE 0.1: ModernDive Flowchart. You’ll first get started with data in Chapter 1 where you’ll learn about the difference between R and RStudio, start coding in R, install and load your first R packages, and explore your first dataset: all domestic departure flights from a New York City airport in 2013. Then you’ll cover the following three portions of this book: Data science with tidyverse. You’ll assemble your data science toolbox using tidyverse packages. In particular you’ll Ch.2: Visualize data using the ggplot2 package. Ch.3: Wrangle data using the dplyr package. Ch.4: Learn about the concept of “tidy” data as a standardized data frame input and output format for all packages in the tidyverse. Furthermore, you’ll learn how to import spreadsheet files into R using the readr package. Data modeling with moderndive. Using these data science tools and helper functions from the moderndive package, you’ll fit your first data models. In particular: Ch.5: Basic regression models with only one explanatory variable. Ch.6: Multiple regression models with more than one explanatory variable. Statistical inference with infer. Once again using your newly acquired data science tools, you’ll unpack statistical inference using the infer package. In particular you’ll: Ch.7: Learn about the role that sampling variability plays in statistical inference and the role that sample size plays in sampling variability. Ch.8: Construct confidence intervals. Ch.9: Conduct hypothesis tests. Data modeling with moderndive (revisited): Armed with your understanding of statistical inference, you’ll revisit and review the models you’ll construct in Ch.5 &amp; Ch.6. 
In particular you’ll: Ch.10: Interpret confidence intervals and hypothesis tests in a regression setting. We’ll end with a discussion on what it means to “tell the story with data” in Chapter 11 by presenting example case studies. What we hope you will learn from this book We hope that by the end of this book, you’ll have learned how to Use R and the tidyverse suite of R packages for data science. Fit your first models to data, using a method known as linear regression. Perform statistical inference using confidence intervals and hypothesis tests. Tell your story with data using these tools. What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion. Further discussions on data stories can be found in the blogpost “Tell a Meaningful Story With Data.” Over the course of this book, you will develop your “data science toolbox,” equipping yourself with tools such as data visualization, data formatting, data wrangling, and data modeling using regression. In particular, this book will lean heavily on data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways are used to convey relationships within data. In general, we’ll use visualization as a way of building almost all of the ideas in this book. To impart the statistical lessons of this book, we have intentionally minimized the number of mathematical formulas used. Instead, you’ll develop a conceptual understanding of statistics using data visualization and computer simulations. We hope this is a more intuitive experience than the way statistics has traditionally been taught in the past and how it is commonly perceived. Finally, you’ll learn the importance of literate programming. 
By this we mean you’ll learn how to write code that is useful not just for a computer to execute but also for readers to understand exactly what your analysis is doing and how you did it. This is part of a greater effort to encourage reproducible research (see the “Reproducible research” subsection for more details). Hal Abelson coined the phrase that we will follow throughout this book: “Programs must be written for people to read, and only incidentally for machines to execute.” We understand that there may be challenging moments as you learn to program. Both of us continue to struggle and find ourselves often using web searches to find answers and reach out to colleagues for help. In the long run though, we all can solve problems faster and more elegantly via programming. We wrote this book as our way to help you get started and you should know that there is a huge community of R users that are always happy to help everyone along as well. This community exists in particular on the internet on various forums and websites such as stackoverflow.com. Data/science pipeline You may think of statistics as just being a bunch of numbers. We commonly hear the phrase “statistician” when listening to broadcasts of sporting events. Statistics (in particular, data analysis), in addition to describing numbers like with baseball batting averages, plays a vital role in all of the sciences. You’ll commonly hear the phrase “statistically significant” thrown around in the media. You’ll see articles that say “Science now shows that chocolate is good for you.” Underpinning these claims is data analysis. By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be wary. 
Inside data analysis are many sub-fields that we will discuss throughout this book (though not necessarily in this order): data collection data wrangling data visualization data modeling inference correlation and regression interpretation of results data communication/storytelling These sub-fields are summarized in what Grolemund and Wickham term the “Data/Science Pipeline” in Figure 0.2. FIGURE 0.2: Data/Science Pipeline. We will begin by digging into the gray Understand portion of the cycle with data visualization, then with a discussion on what is meant by tidy data and data wrangling, and then conclude by talking about interpreting and discussing the results of our models via Communication. These steps are vital to any statistical analysis. But why should you care about statistics? “Why did they make me take this class?” There’s a reason so many fields require a statistics course. Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn’t be intimidated by statistics. It’s not the beast that it used to be and, paired with computation, you’ll see how reproducible research in the sciences particularly increases scientific knowledge. Reproducible research “The most important tool is the mindset, when starting, that the end product will be reproducible.” – Keith Baggerly Another goal of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we’ll be trying to help you build new habits. This will take practice and be difficult at times. You’ll see just why it is so important for you to keep track of your code and well-document it to help yourself later and any potential collaborators as well. Copying and pasting results from one program into a word processor is not the way that efficient and effective scientific research is conducted. 
It’s much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a variety of programs. In traditional analyses if an error was made with the original data, we’d need to step through the entire process again: recreate the plots and copy-and-paste all of the new plots and our statistical analysis into your document. This is error prone and a frustrating use of time. We’ll see how to use R Markdown to get away from this tedious activity so that we can spend more time doing science. “We are talking about computational reproducibility.” - Yihui Xie Reproducibility means a lot of things in terms of different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as computational reproducibility. This refers to being able to pass all of one’s data analysis, data-sets, and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent interpreting results and considering assumptions instead of the more error prone way of starting from scratch or following a list of steps that may be different from machine to machine. Final note for students At this point, if you are interested in instructor perspectives on this book, ways to contribute and collaborate, or the technical details of this book’s construction and publishing, then continue with the rest of the chapter. Otherwise, let’s get started with R and RStudio in Chapter 1! Introduction for instructors Resources Here are some resources to help you use ModernDive: We’ve included review questions posed as Learning Checks. You can find all the solutions to all Learning Checks in Appendix D of the online version of the book at https://moderndive.com/D-appendixD.html. Dr. Jenny Smetzer and Albert Y. Kim have written a series of labs and problem sets. 
You can find them at https://moderndive.com/labs. You can see the webpages for two courses that use ModernDive: Smith College “SDS192 Introduction to Data Science”: https://rudeboybert.github.io/SDS192/. Smith College “SDS220 Introduction to Probability and Statistics” https://rudeboybert.github.io/SDS220/. Why did we write this book? This book is inspired by the following books: “Mathematical Statistics with Resampling and R” (Chihara and Hesterberg 2011), “OpenIntro: Intro Stat with Randomization and Simulation” (Diez, Barr, and Çetinkaya-Rundel 2014), and “R for Data Science” (Grolemund and Wickham 2016). The first book, while designed for upper-level undergraduates and graduate students, provides an excellent resource on how to use resampling to impart statistical concepts like sampling distributions using computation instead of large-sample approximations and other mathematical formulas. The last two books are free options to learning introductory statistics and data science, providing an alternative to the many traditionally expensive introductory statistics textbooks. When looking over the large number of introductory statistics textbooks that currently exist, we found that there wasn’t one that incorporated many newly developed R packages directly into the text, in particular the many packages included in the tidyverse collection of packages, such as ggplot2, dplyr, tidyr, and broom. Additionally, there wasn’t an open-source and easily reproducible textbook available that exposed new learners all of three of the learning goals we listed. Who is this book for? This book is intended for instructors of traditional introductory statistics classes using RStudio, either the desktop or server version, who would like to inject more data science topics into their syllabus. We assume that students taking the class will have no prior algebra, calculus, nor programming/coding experience. Here are some principles and beliefs we kept in mind while writing this text. 
If you agree with them, this might be the book for you. Blur the lines between lecture and lab With increased availability and accessibility of laptops and open-source non-proprietary statistical software, the strict dichotomy between lab and lecture can be loosened. It’s much harder for students to understand the importance of using software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the rules. Frequent reinforcement is key. Focus on the entire data/science research pipeline We believe that the entirety of Grolemund and Wickham’s data/science pipeline should be taught. We believe in George Cobb’s “minimizing prerequisites to research”: students should be answering questions with data as soon as possible. It’s all about the data We leverage R packages for rich, real, and realistic data-sets that at the same time are easy-to-load into R, such as the nycflights13 and fivethirtyeight packages. We believe that data visualization is a gateway drug for statistics and that the Grammar of Graphics as implemented in the ggplot2 package is the best way to impart such lessons. However, we often hear: “You can’t teach ggplot2 for data visualization in intro stats!” We, like David Robinson, are much more optimistic. dplyr has made data wrangling much more accessible to novices, and hence much more interesting data-sets can be explored. Use simulation/resampling to introduce statistical inference, not probability/mathematical formulas Instead of using formulas, large-sample approximations, and probability tables, we teach statistical concepts using resampling-based inference. This allows for a de-emphasis of traditional probability topics, freeing up room in the syllabus for other topics. Bridges to these mathematical concepts are given as well to help with relation of these traditional topics with more modern approaches. Don’t fence off students from the computation pool, throw them in! 
Computing skills are essential to working with data in the 21st century. Given this fact, we feel that to shield students from computing is to ultimately do them a disservice. We are not teaching a course on coding/programming per se, but rather just enough of the computational and algorithmic thinking necessary for data analysis. Complete reproducibility and customizability We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the whole book! Ultimately the best textbook is one you’ve written yourself. You know best your audience, their background, and their priorities. You know best your own style and the types of examples and problems you like best. Customization is the ultimate end. For more about how to make this book your own, see About this Book. Connect and contribute If you would like to connect with ModernDive, check out the following links: If you would like to receive periodic updates about ModernDive (roughly every 6 months), please sign up for our mailing list. Contact Albert at albert.ys.kim@gmail.com and Chester at chester.ismay@gmail.com. We’re on Twitter at moderndive. If you would like to contribute to ModernDive, there are many ways! We would love your help and feedback to make this book as great as possible! For example, if you find any errors, typos, or areas for improvement, then please email us or post an issue on our GitHub issues page. If you are familiar with GitHub and would like to contribute more, please see the “About this book” section. The authors would like to thank Nina Sonneborn, Kristin Bott, Dr. Jenny Smetzer, and the participants of our 2017 and 2019 USCOTS workshops for their feedback and suggestions. We’d also like to thank Dr. Andrew Heiss for contributing Subsection 1.2.3 on “Errors, warnings, and messages.” and Starry Zhou for her many edits to the book. A special thanks goes to Dr. 
Yana Weinstein, cognitive psychological scientist and co-founder of The Learning Scientists, for her extensive feedback. About this book This book was written using RStudio’s bookdown package by Yihui Xie (Xie 2019). This package simplifies the publishing of books by having all content written in R Markdown. The bookdown/R Markdown source code for all versions of ModernDive is available on GitHub: Latest published version The most up-to-date release: Version 0.6.1 released on August 28, 2019 (source code). Available at ModernDive.com Development version The working copy of the next version which is currently being edited: Preview of development version is available at https://moderndive.netlify.com/ Source code: Available on ModernDive’s GitHub repository page Previous versions Older versions that may be out of date: Version 0.6.0 released on August 7, 2019 (source code)) Version 0.5.0 released on February 24, 2019 (source code) Version 0.4.0 released on July 21, 2018 (source code) Version 0.3.0 released on February 3, 2018 (source code) Version 0.2.0 released on August 2, 2017 (source code) Version 0.1.3 released on February 9, 2017 (source code) Version 0.1.2 released on January 22, 2017 (source code) Could this be a new paradigm for textbooks? Instead of the traditional model of textbook companies publishing updated editions of the textbook every few years, we apply a software design influenced model of publishing more easily updated versions. We can then leverage open-source communities of instructors and developers for ideas, tools, resources, and feedback. As such, we welcome your pull requests. Finally, feel free to modify the book as you wish for your own needs, but please list the authors at the top of index.Rmd as “Chester Ismay, Albert Y. Kim, and YOU!” About the authors Who we are! Chester Ismay Albert Y. Kim Chester Ismay: Data Science Evangelist - DataRobot, Portland, OR, USA. 
Email: chester.ismay@gmail.com Webpage: http://chester.rbind.io/ Twitter: old_man_chester GitHub: https://github.com/ismayc Albert Y. Kim: Assistant Professor of Statistical &amp; Data Sciences - Smith College, Northampton, MA, USA. Email: albert.ys.kim@gmail.com Webpage: http://rudeboybert.rbind.io/ Twitter: rudeboybert GitHub: https://github.com/rudeboybert References "],
-["1-getting-started.html", "Chapter 1 Getting Started with Data in R 1.1 What are R and RStudio? 1.2 How do I code in R? 1.3 What are R packages? 1.4 Explore your first datasets 1.5 Conclusion", " Chapter 1 Getting Started with Data in R Before we can start exploring data in R, there are some key concepts to understand first: What are R and RStudio? How do I code in R? What are R packages? We’ll introduce these concepts in the upcoming Sections 1.1-1.3. If you are already somewhat familiar with these concepts, feel free to skip to Section 1.4 where we’ll introduce our first dataset: all domestic flights departing a New York City airport in 2013. This is a dataset we will explore in depth for the rest of this book. 1.1 What are R and RStudio? For much of this book, we will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard. FIGURE 1.1: Analogy of difference between R and RStudio. More precisely, R is a programming language that runs computations while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. So just as having access to a speedometer, rear-view mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well. 1.1.1 Installing R and RStudio Note about RStudio Server: If your instructor has provided you with a link and access to RStudio Server, then you can skip this section. We do recommend, though, that after a few months of working on RStudio Server you return to these instructions to install this software on your own computer. You will first need to download and install both R and RStudio (Desktop version) on your computer. It is important that you install R first and then install RStudio second. You must do this first: Download and install R. 
If you are a Windows user: Click on “Download R for Windows”, then click on “base”, then click on the Download link. If you are a macOS user: Click on “Download R for (Mac) OS X”, then under “Latest release:” click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of August 10, 2019 was R-3.6.1. You must do this second: Download and install RStudio. Scroll down to “Installers for Supported Platforms” near the bottom of the page. Click on the download link corresponding to your computer’s operating system. 1.1.2 Using R via RStudio Recall our car analogy from earlier. Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new programs (also called applications) you can open. We’ll always work in RStudio and not R. Figure 1.2 shows what icon you should be clicking on your computer. FIGURE 1.2: Icons of R versus RStudio on your computer. After you open RStudio, you should see the following in Figure 1.3. FIGURE 1.3: RStudio interface to R. Note the three panes, which are three panels dividing the screen: the console pane, the files pane, and the environment pane. Over the course of this chapter, you’ll come to learn what purpose each of these panes serves. 1.2 How do I code in R? Now that you’re set up with R and RStudio, you are probably asking yourself “OK. Now how do I use R?” The first thing to note is that unlike other statistical software programs like Excel, Stata, or SAS that provide point-and-click interfaces, R is an interpreted language. This means you have to type in commands written in R code. In other words, you have to code/program in R. Note that we’ll use the terms “coding” and “programming” interchangeably in this book. 
While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively. 1.2.1 Basic programming concepts and terminology We now introduce some basic programming concepts and terminology. Instead of asking you to learn all these concepts and terminology right now, we’ll guide you so that you’ll “learn by doing.” Note that in this book we will always use a different font to distinguish regular text from computer_code. The best way to master these topics is, in our opinions, “learning by doing” and lots of repetition. Basics: Console: Where you enter in commands. Running code: The act of telling R to perform an act by giving it commands in the console. Objects: Where values are saved in R. We’ll show you how to assign values to objects and how to display the contents of objects. Data types: Integers, doubles/numerics, logicals, and characters. Vectors: A series of values. These are created using the c() function, where c() stands for “combine” or “concatenate.” For example: c(6, 11, 13, 31, 90, 92). Factors: Categorical data are represented in R as factors. Data frames: Data frames are like rectangular spreadsheets: they are representations of datasets in R where the rows correspond to observations and the columns correspond to variables that describe the observations. We’ll cover data frames later in Section 1.4. Conditionals: Testing for equality in R using == (and not = which is typically used for assignment). Ex: 2 + 1 == 3 compares 2 + 1 to 3 and is correct R code, while 2 + 1 = 3 will return an error. Boolean algebra: TRUE/FALSE statements and mathematical operators such as &lt; (less than), &lt;= (less than or equal), and != (not equal to). 
Logical operators: &amp; representing “and” as well as | representing “or.” Ex: (2 + 1 == 3) &amp; (2 + 1 == 4) returns FALSE since not both clauses are TRUE (only the first clause is TRUE). On the other hand, (2 + 1 == 3) | (2 + 1 == 4) returns TRUE since at least one of the two clauses is TRUE. Functions, also called commands: Functions perform tasks in R. They take in inputs called arguments and return outputs. You can either manually specify a function’s arguments or use the function’s default values. This list is by no means an exhaustive list of all the programming concepts and terminology needed to become a savvy R user; such a list would be so large it wouldn’t be very useful, especially for novices. Rather, we feel this is a minimally viable list of programming concepts and terminology you need to know before getting started. We feel that you can learn the rest as you go. Remember that your mastery of all of these concepts and terminology will build as you practice more and more. 1.2.2 Errors, warnings, and messages One thing that intimidates new R and RStudio users is how R reports errors, warnings, and messages. R reports errors, warnings, and messages in a glaring red font, which makes it seem like it is scolding you. However, seeing red text in the console is not always bad. R will show red text in the console pane in three different situations: Errors: When the red text is a legitimate error, it will be prefaced with “Error in…” and try to explain what went wrong. Generally when there’s an error, the code will not run. For example, as we’ll see in Subsection 1.3.3, if you see Error in ggplot(...) : could not find function &quot;ggplot&quot;, it means that the ggplot() function is not accessible because the package that contains the function (ggplot2) was not loaded with library(ggplot2). Thus you cannot use the ggplot() function without the ggplot2 package being loaded first. 
Warnings: When the red text is a warning, it will be prefaced with “Warning:” and R will try to explain why there’s a warning. Generally your code will still work, but with some caveats. For example, you will see in Chapter 2 if you create a scatterplot based on a dataset where one of the values is missing, you will see this warning: Warning: Removed 1 rows containing missing values (geom_point). R will still produce the scatterplot with all the remaining values, but it is warning you that one of the points isn’t there. Messages: When the red text doesn’t start with either “Error” or “Warning”, it’s just a friendly message. You’ll see these messages when you load R packages in the upcoming Subsection 1.3.2 or when you read data saved in spreadsheet files with the read_csv() function as you’ll see in Chapter 4. These are helpful diagnostic messages and they don’t stop your code from working. Additionally, you’ll see these messages when you install packages too using install.packages(). Remember, when you see red text in the console, don’t panic. It doesn’t necessarily mean anything is wrong. Rather: If the text starts with “Error”, figure out what’s causing it. Think of errors as a red traffic light: something is wrong! If the text starts with “Warning”, figure out if it’s something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you’re fine. If that’s surprising, look at your data and see what’s missing. Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention. Otherwise the text is just a message. Read it, wave back at R, and thank it for talking to you. Think of messages as a green traffic light: everything is working fine. 1.2.3 Tips on learning to code Learning to code/program is very much like learning a foreign language. It can be very daunting and frustrating at first. 
Such frustrations are very common and it is very normal to feel discouraged as you learn. However, just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn. Here are a few useful tips to keep in mind as you learn to program: Remember that computers are not actually that smart: You may think your computer or smartphone is “smart,” but really people spent a lot of time and energy designing them to appear “smart.” In reality, you have to tell a computer everything it needs to do. Furthermore, the instructions you give your computer can’t have any mistakes in them nor can they be ambiguous in any way. Take the “copy, paste, and tweak” approach: Especially when you learn your first programming language or you need to understand particularly complicated code, it is often much easier to take existing code that you know works and modify it to suit your ends. This is opposed to trying to type out the code from scratch. We call this the “copy, paste, and tweak” approach. So early on, we suggest not trying to write code from memory, but rather take existing examples we have provided you, then copy, paste, and tweak them to suit your goals. After you start feeling more confident, you can slowly move away from this approach. Think of the “copy, paste, and tweak” approach as training wheels for a child learning to ride a bike. After getting comfortable, they won’t need them anymore. The best way to learn to code is by doing: Rather than learning to code for its own sake, we feel that learning to code goes much smoother when you have a goal in mind or when you are working on a particular project, like analyzing data that you are interested in. Practice is key: Just as the only method to improve your foreign language skills is through lots of practice, the only method to improve your coding skills is through lots of practice. Don’t worry, however; we’ll give you plenty of opportunities to do so! 
1.3 What are R packages? Another point of confusion with many new R users is the idea of an R package. R packages extend the functionality of R by providing additional functions, data, and documentation. They are written by a world-wide community of R users and can be downloaded for free from the internet. For example, among the many packages we will use in this book are the ggplot2 package for data visualization in Chapter 2, the dplyr package (Wickham, François, et al. 2019) for data wrangling in Chapter 3, the moderndive package (Ismay 2019) that accompanies this book, and the infer package (Bray et al. 2019) for “tidy” and transparent statistical inference in Chapters 8, 9, and 10. A good analogy for R packages is that they are like apps you can download onto a mobile phone: FIGURE 1.4: Analogy of R versus R packages. So R is like a new mobile phone: while it has a certain number of features when you use it for the first time, it doesn’t have everything. R packages are like the apps you can download onto your phone from Apple’s App Store or Android’s Google Play. Let’s continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a photo you have just taken with friends and family on Instagram. You need to: Install the app: Since your phone is new and does not include the Instagram app, you need to download the app from either the App Store or Google Play. You do this once and you’re set for the time being. You might need to do this again in the future when there is an update to the app. Open the app: After you’ve installed Instagram, you need to open the app. Once Instagram is open on your phone, you can then proceed to share your photo with your friends and family. The process is very similar for using an R package. You need to: Install the package: This is like installing an app on your phone. Most packages are not installed by default when you install R and RStudio. 
Thus if you want to use a package for the first time, you need to install it first. Once you’ve installed a package, you likely won’t install it again unless you want to update it to a newer version. “Load” the package: “Loading” a package is like opening an app on your phone. Packages are not “loaded” by default when you start RStudio on your computer; you need to “load” each package you want to use every time you start RStudio. Let’s now show you how to perform these two steps for the ggplot2 package for data visualization. 1.3.1 Package installation Note about RStudio Server: If your instructor has provided you with a link and access to RStudio Server, you probably will not need to install packages, as they have likely been pre-installed for you by your instructor. That being said, it is still a good idea to know this process for later on when you are not using RStudio Server, but rather RStudio Desktop on your own computer. There are two ways to install an R package: an easy way and a more advanced way. Let’s install the ggplot2 package the easy way first as shown in Figure 1.5. In the Files pane of RStudio: Click on the “Packages” tab. Click on “Install” next to Update. Type the name of the package under “Packages (separate multiple with space or comma):” In this case, type ggplot2. Click “Install.” FIGURE 1.5: Installing packages in R the easy way. An alternative but slightly less convenient way to install a package is by typing install.packages(&quot;ggplot2&quot;) in the console pane of RStudio and pressing Return/Enter on your keyboard. Note you must include the quotation marks around the name of the package. Much like an app on your phone, you only have to install a package once. However, if you want to update a previously installed package to a newer version, you need to reinstall it by repeating the earlier steps. Learning check (LC1.1) Repeat the earlier installing steps, but for the dplyr, nycflights13, and knitr packages. 
This will install the earlier mentioned dplyr package for data wrangling, the nycflights13 package containing data on all domestic flights leaving a NYC airport in 2013, and the knitr package for writing reports in R. We’ll use these packages in the next section. Note that if you’d like to match up exactly with what the output looks like throughout the book, you may want to use the exact versions of the packages that we used. You can find a full listing of these packages and their versions in Appendix E. This likely won’t be relevant for novices, but we included it for reproducibility reasons. 1.3.2 Package loading Recall that after you’ve installed a package, you need to “load it.” In other words, you need to “open it.” We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy &amp; paste the following code into the console pane and then hit the Enter key. library(ggplot2) If after running the earlier code, a blinking cursor returns next to the &gt; “prompt” sign, it means you were successful and the ggplot2 package is now loaded and ready to use. If however, you get a red “error message” that reads… Error in library(ggplot2) : there is no package called ‘ggplot2’ … it means that you didn’t successfully install it. This is an example of an “error message” we discussed in Subsection 1.2.2. If you get this error message, go back to Subsection 1.3.1 on R package installation and make sure to install it. Learning check (LC1.2) “Load” the dplyr, nycflights13, and knitr packages as well by repeating the earlier steps. 1.3.3 Package use One very common mistake new R users make when wanting to use particular packages is they forget to “load” them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. 
If you don’t first “load” a package, but attempt to use one of its features, you’ll see an error message similar to: Error: could not find function This is a different error message than the one you just saw on a package not having been installed yet. R is telling you that you are trying to use a function in a package that has not yet been “loaded.” R doesn’t know where to find the function you are using. Almost all new users forget to do this when starting out, and it is a little annoying to get used to doing it. However, you’ll remember with practice. 1.4 Explore your first datasets Let’s put everything we’ve learned so far into practice and start exploring some real data! Data comes to us in a variety of formats, from pictures to text to numbers. Throughout this book, we’ll focus on datasets that are saved in “spreadsheet”-type format. This is probably the most common way data are collected and saved in many fields. Remember from Subsection 1.2.1 that these “spreadsheet”-type datasets are called data frames in R. We’ll focus on working with data saved as data frames throughout this book. Let’s first load all the packages needed for this chapter, assuming you’ve already installed them. Read Section 1.3 for information on how to install and load R packages if you haven’t already. library(nycflights13) library(dplyr) library(knitr) At the beginning of all subsequent chapters in this book, we’ll always have a list of packages that you should have installed and loaded in order to work with that chapter’s R code. 1.4.1 nycflights13 package Many of us have flown on airplanes or know someone who has. Air travel has become an ever-present aspect in many people’s lives. If you look at the Departures flight information board at an airport, you will frequently see that some flights are delayed for a variety of reasons. Are there ways that we can understand the reasons that cause flight delays? We’d all like to arrive at our destinations on time whenever possible. 
(Unless you secretly love hanging out at airports. If you are one of these people, pretend for a moment that you are very much anticipating being at your final destination.) Throughout this book, we’re going to analyze data related to all 2013 domestic flights departing from one of New York City’s three airports: Newark Liberty International (EWR), John F. Kennedy International (JFK), and La Guardia (LGA). We’ll access this data using the nycflights13 R package, which contains five datasets saved in five data frames: flights: Information on all 336,776 flights. airlines: A table matching airline names and their two-letter IATA airline codes (also known as carrier codes) for 16 airline companies. Ex: DL is the two-letter code for Delta Air Lines. planes: Information about each of the 3,322 physical aircraft used. weather: Hourly meteorological data for each of the three NYC airports. This data frame has 26,115 rows, roughly corresponding to the 365 \\(\\times\\) 24 \\(\\times\\) 3 = 26,280 possible hourly measurements one can observe at three locations over the course of a year. airports: Airport names, codes, and locations for the 1,458 domestic destination airports. 1.4.2 flights data frame We’ll begin by exploring the flights data frame and get an idea of its structure. Run the following code in your console, either by typing it or by copying &amp; pasting it. It displays the contents of the flights data frame in your console. Note that depending on the size of your monitor, the output may vary slightly. 
flights # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 7 2013 1 1 555 600 -5 913 854 8 2013 1 1 557 600 -3 709 723 9 2013 1 1 557 600 -3 838 846 10 2013 1 1 558 600 -2 753 745 # … with 336,766 more rows, and 11 more variables: arr_delay &lt;dbl&gt;, # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt; Let’s unpack this output: A tibble: 336,776 x 19: A tibble is a specific kind of data frame and is short for “tidy table” (we’ll discuss what it means for a data frame to be “tidy” later on in Section 4.2). This particular data frame has 336,776 rows corresponding to different observations. Here, each observation is a flight. 19 columns corresponding to 19 variables describing each observation. year, month, day, dep_time, sched_dep_time, dep_delay, and arr_time are the different columns, in other words, the different variables of this data set. We then have a preview of the first 10 rows of observations corresponding to the first 10 flights. R is only showing the first 10 rows, because if it showed all 336,776 rows it would overwhelm your screen. ... with 336,766 more rows, and 11 more variables: indicating to us that 336,766 more rows of data and 11 more variables could not fit in this screen. Unfortunately, this output does not allow us to explore the data very well. Let’s look at some different ways to explore data frames. 1.4.3 Exploring data frames There are many ways to get a feel for the data contained in a data frame such as flights. 
We present three functions that take as their “argument” (their input) the data frame in question. We also include a fourth method for exploring one particular column of a data frame: Using the View() function, which brings up RStudio’s built-in spreadsheet viewer. Using the glimpse() function, which is included in the dplyr package. Using the kable() function, which is included in the knitr package. Using the $ “extraction operator”, which is used to view a single variable/column in a data frame. 1. View(): Run View(flights) in your console in RStudio, either by typing it or copying &amp; pasting it into the console pane, and explore this data frame in the resulting pop-up viewer. You should get into the habit of always viewing any data frames you encounter. Note the uppercase V in View. R is case-sensitive, so you’ll get an error message if you run view(flights) instead of View(flights). Learning check (LC1.3) What does any ONE row in this flights dataset refer to? A. Data on an airline B. Data on a flight C. Data on an airport D. Data on multiple flights By running View(flights), we can explore the different variables listed in the columns. Observe that there are many different types of variables. Some of the variables like distance, day, and arr_delay are what we will call quantitative variables. These variables are numerical in nature. Other variables here are categorical. Note that if you look in the leftmost column of the View(flights) output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row is representing. In other words, this will allow you to identify what object is being described in a given row. This is often called the observational unit. The observational unit in this example is an individual flight departing from New York City in 2013. 
You can identify the observational unit by determining what “thing” is being measured or described by each of the variables. We’ll talk more about observational units in Section 1.4.4 on identification and measurement variables. 2. glimpse(): The second way to explore a data frame is using the glimpse() function included in the dplyr package. Thus, you can only use the glimpse() function after you’ve loaded the dplyr package by running library(dplyr). This function provides us with an alternative perspective for exploring a data frame than the View() function: glimpse(flights) Observations: 336,776 Variables: 19 $ year &lt;int&gt; 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, … $ month &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ day &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ dep_time &lt;int&gt; 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,… $ sched_dep_time &lt;int&gt; 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,… $ dep_delay &lt;dbl&gt; 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -… $ arr_time &lt;int&gt; 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849… $ sched_arr_time &lt;int&gt; 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851… $ arr_delay &lt;dbl&gt; 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -… $ carrier &lt;chr&gt; &quot;UA&quot;, &quot;UA&quot;, &quot;AA&quot;, &quot;B6&quot;, &quot;DL&quot;, &quot;UA&quot;, &quot;B6&quot;, &quot;EV&quot;, &quot;B6&quot;, … $ flight &lt;int&gt; 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, … $ tailnum &lt;chr&gt; &quot;N14228&quot;, &quot;N24211&quot;, &quot;N619AA&quot;, &quot;N804JB&quot;, &quot;N668DN&quot;, &quot;N39… $ origin &lt;chr&gt; &quot;EWR&quot;, &quot;LGA&quot;, &quot;JFK&quot;, &quot;JFK&quot;, &quot;LGA&quot;, &quot;EWR&quot;, &quot;EWR&quot;, &quot;LGA&quot;… $ dest &lt;chr&gt; &quot;IAH&quot;, &quot;IAH&quot;, &quot;MIA&quot;, &quot;BQN&quot;, &quot;ATL&quot;, &quot;ORD&quot;, 
&quot;FLL&quot;, &quot;IAD&quot;… $ air_time &lt;dbl&gt; 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, … $ distance &lt;dbl&gt; 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,… $ hour &lt;dbl&gt; 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, … $ minute &lt;dbl&gt; 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, … $ time_hour &lt;dttm&gt; 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 … Observe that glimpse() will give you the first few entries of each variable in a row after the variable name. In addition, the data type (see Subsection 1.2.1) of the variable is given immediately after each variable’s name inside &lt; &gt;. Here, int and dbl refer to “integer” and “double”, which are computer coding terminology for quantitative/numerical variables. In contrast, chr refers to “character”, which is computer terminology for text data. Text data, such as the carrier or origin of a flight, are categorical variables. The time_hour variable is another data type: dttm. These types of variables represent date and time combinations. However, we won’t work with dates and times in this book; we leave this topic for a more advanced data science book. Learning check (LC1.4) What are some other examples in this dataset of categorical variables? What makes them different from quantitative variables? 3. kable(): The final way to explore the entirety of a data frame is using the kable() function from the knitr package. Let’s explore the different carrier codes for all the airlines in our dataset two ways. Run both of these lines of code in the console: airlines kable(airlines) At first glance, it may not appear that there is much difference in the outputs. However, when using tools for producing reproducible reports such as R Markdown, the latter code produces output that is much more legible and reader-friendly. 4. $ operator Lastly, the $ operator allows us to extract and then explore a single variable within a data frame. 
For example, run the following in your console airlines$name We used the $ operator to extract only the name variable and return it as a vector of length 16. We’ll only be occasionally exploring data frames using the $ operator, instead favoring the View() and glimpse() functions. 1.4.4 Identification &amp; measurement variables There is a subtle difference between the kinds of variables that you will encounter in data frames: identification variables and measurement variables. For example, let’s explore the airports data frame by showing the output of glimpse(airports): glimpse(airports) Observations: 1,458 Variables: 8 $ faa &lt;chr&gt; &quot;04G&quot;, &quot;06A&quot;, &quot;06C&quot;, &quot;06N&quot;, &quot;09J&quot;, &quot;0A9&quot;, &quot;0G6&quot;, &quot;0G7&quot;, &quot;0P2&quot;, … $ name &lt;chr&gt; &quot;Lansdowne Airport&quot;, &quot;Moton Field Municipal Airport&quot;, &quot;Schaumbu… $ lat &lt;dbl&gt; 41.1, 32.5, 42.0, 41.4, 31.1, 36.4, 41.5, 42.9, 39.8, 48.1, 39.… $ lon &lt;dbl&gt; -80.6, -85.7, -88.1, -74.4, -81.4, -82.2, -84.5, -76.8, -76.6, … $ alt &lt;int&gt; 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1… $ tz &lt;dbl&gt; -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5,… $ dst &lt;chr&gt; &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;U&quot;, &quot;A&quot;, &quot;A&quot;, &quot;U&quot;, &quot;A&quot;… $ tzone &lt;chr&gt; &quot;America/New_York&quot;, &quot;America/Chicago&quot;, &quot;America/Chicago&quot;, &quot;Amer… The variables faa and name are what we will call identification variables, variables that uniquely identify each observational unit. In this case, the identification variables uniquely identify airports. Such variables are mainly used in practice to uniquely identify each row in a data frame. 
faa gives the unique code provided by the FAA for that airport, while the name variable gives the longer official name of the airport. The remaining variables (lat, lon, alt, tz, dst, tzone) are often called measurement or characteristic variables: variables that describe properties of each observational unit. For example, lat and lon describe the latitude and longitude of each airport. Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the left-most columns of your data frame. Learning check (LC1.5) What properties of each airport do the variables lat, lon, alt, tz, dst, and tzone describe in the airports data frame? Take your best guess. (LC1.6) Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy data frame that matches these conditions. 1.4.5 Help files Another nice feature of R is its help files, which provide documentation for various functions and datasets. You can bring up help files by adding a ? before the name of a function or data frame and then running this in the console. You will then be presented with a page showing the corresponding documentation if it exists. For example, let’s look at the help file for the flights data frame. ?flights The help file should pop up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away. Learning check (LC1.7) Look at the help file for the airports data frame. Revise your earlier guesses about what the variables lat, lon, alt, tz, dst, and tzone each describe. How good were your guesses? 
1.5 Conclusion We’ve given you what we feel is a minimally viable set of tools to explore data in R. Does this chapter contain everything you need to know? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn’t be useful! As we said earlier, the best way to further add to your toolbox is to learn by doing. 1.5.1 Additional resources If you are completely new to the world of coding, R, and RStudio and feel you could benefit from a more detailed introduction, we suggest you check out ModernDive co-author Chester Ismay’s short book “Getting used to R, RStudio, and R Markdown” (Ismay 2016), which includes screencast recordings that you can follow along and pause as you learn. Furthermore, this book contains an introduction to R Markdown, a tool used for reproducible research in R. FIGURE 1.6: Preview of Getting used to R, RStudio, and R Markdown book. 1.5.2 What’s to come? As we stated earlier, however, the best way to learn R is to learn by doing. We’re now going to start the “Data Science with tidyverse” portion of this book in Chapter 2 with what we feel is the most important tool in a data scientist’s toolbox: data visualization. We’ll continue to explore the data included in the nycflights13 package using the ggplot2 package for data visualization. You’ll see that data visualization is a powerful tool to add to your toolbox for data exploring that provides additional insight to what the View() and glimpse() functions can provide. FIGURE 1.7: ModernDive flowchart - On to Part I! References "],
-["2-viz.html", "Chapter 2 Data Visualization 2.1 The Grammar of Graphics 2.2 Five Named Graphs - The 5NG 2.3 5NG#1: Scatterplots 2.4 5NG#2: Linegraphs 2.5 5NG#3: Histograms 2.6 Facets 2.7 5NG#4: Boxplots 2.8 5NG#5: Barplots 2.9 Conclusion", " Chapter 2 Data Visualization We begin the development of your data science toolbox with data visualization. By visualizing data, we gain valuable insights that we couldn’t initially obtain from just looking at the raw data values. We’ll use the ggplot2 package as it provides an easy way to customize your plots. ggplot2 is rooted in the data visualization theory known as The Grammar of Graphics (Wilkinson 2005), developed by Leland Wilkinson. At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the presence of outliers, distributions of individual variables, and relationships between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does however require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms your audience. As we will see, plots also help us to identify patterns and outliers in our data. We’ll see that a common extension of these ideas is to compare the distribution of one quantitative variable (i.e., what the spread of a variable looks like or how the variable is distributed in terms of its values) as we go across the levels of a different categorical variable. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 1.3 for information on how to install and load R packages. 
library(nycflights13) library(ggplot2) library(dplyr) 2.1 The Grammar of Graphics We begin with a discussion of a theoretical framework for data visualization known as “The Grammar of Graphics.” This framework serves as the foundation for the ggplot2 package which we’ll use extensively in this chapter. Think of how we construct sentences in English by combining different elements, like nouns, verbs, particles, subjects, objects, etc. We can’t just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, “The Grammar of Graphics” defines a set of rules for constructing statistical graphics by combining different types of layers. This grammar was created by Leland Wilkinson (Wilkinson 2005) and has been implemented in a variety of data visualization software platforms like R, but also Plotly and Tableau. 2.1.1 Components of the Grammar In short, the grammar tells us that: A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects. Specifically, we can break a graphic into the following three essential components: data: the data set containing the variables of interest. geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars. aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the data set. You might be wondering why we wrote the terms data, geom, and aes in a computer code type font. We’ll see very shortly that we’ll specify the elements of the grammar in R using these terms. However, let’s first break down the grammar with an example. 
2.1.2 Gapminder data In February 2006, a statistician named Hans Rosling gave a TED talk titled “The best stats you’ve ever seen” where he presented global economic, health, and development data from the website gapminder.org. For example, for data on 142 countries in 2007, let’s consider only 6 countries in Table 2.1. TABLE 2.1: Gapminder 2007 Data: First 6 of 142 countries Country Continent Life Expectancy Population GDP per Capita Afghanistan Asia 43.8 31889923 975 Albania Europe 76.4 3600523 5937 Algeria Africa 72.3 33333216 6223 Angola Africa 42.7 12420476 4797 Argentina Americas 75.3 40301927 12779 Australia Oceania 81.2 20434176 34435 Each row in this table corresponds to a country in 2007. For each row, we have 5 columns: Country: Name of country. Continent: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America and that Antarctica is excluded. Life Expectancy: Life expectancy in years. Population: Number of people living in the country. GDP per Capita: Gross domestic product per capita (in US dollars). Now consider Figure 2.1, which plots this data for all 142 countries. FIGURE 2.1: Life expectancy over GDP per capita in 2007. Let’s view this plot through the grammar of graphics: The data variable GDP per Capita gets mapped to the x-position aesthetic of the points. The data variable Life Expectancy gets mapped to the y-position aesthetic of the points. The data variable Population gets mapped to the size aesthetic of the points. The data variable Continent gets mapped to the color aesthetic of the points. We’ll see shortly that data corresponds to the particular data frame where our data is saved and that “data variables” correspond to particular columns in the data frame. Furthermore, the type of geometric object considered in this plot is points. That being said, while in this example we are considering points, graphics are not limited to just points. 
We can also use lines, bars, and other geometric objects. Let’s summarize the three essential components of the Grammar in Table 2.2. TABLE 2.2: Summary of Grammar of Graphics for this plot data variable aes geom GDP per Capita x point Life Expectancy y point Population size point Continent color point 2.1.3 Other components There are other components of the Grammar of Graphics we can control as well. As you start to delve deeper into the Grammar of Graphics, you’ll start to encounter these topics more frequently. In this book, we’ll keep things simple and only work with these two additional components: faceting breaks up a plot into several plots split by the values of another variable (Section 2.6) position adjustments for barplots (Section 2.8) Other more complex components like scales and coordinate systems are left for a more advanced text such as R for Data Science (Grolemund and Wickham 2016). Generally speaking, the Grammar of Graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them. 2.1.4 ggplot2 package In this book, we will use the ggplot2 package for data visualization, which is an implementation of the Grammar of Graphics for R (Wickham, Chang, et al. 2019). As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the Grammar of Graphics are specified in the ggplot() function included in the ggplot2 package. The ggplot() function expects the following arguments (i.e. inputs) at a minimum: The data frame where the variables exist: the data argument. The mapping of the variables to aesthetic attributes: the mapping argument which specifies the aesthetic attributes involved. After we’ve specified these components, we then add layers to the plot using the + sign. 
The most essential layer to add to a plot is the layer that specifies which type of geometric object we want the plot to involve: points, lines, bars, and others. Other layers we can add to a plot include the plot title, axes labels, visual themes for the plots, and facets (which we’ll see in Section 2.6). Let’s now put the theory of the Grammar of Graphics into practice. 2.2 Five Named Graphs - The 5NG In order to keep things simple, we will only focus on five different types of graphics in this book, each with a commonly given name. We term these “five named graphs” the 5NG: scatterplots linegraphs histograms boxplots barplots We’ll also present some variations of these plots, but with this basic repertoire of five graphics in your toolbox, you can visualize a wide array of different variable types. Note that certain plots are only appropriate for categorical variables while others are only appropriate for quantitative variables. 2.3 5NG#1: Scatterplots The simplest of the 5NG are scatterplots, also called bivariate plots. They allow you to visualize the relationship between two numerical variables. While you may already be familiar with scatterplots, let’s view them through the lens of the Grammar of Graphics we presented in Section 2.1. Specifically, we will visualize the relationship between the following two numerical variables in the flights data frame included in the nycflights13 package: dep_delay: departure delay on the horizontal “x” axis and arr_delay: arrival delay on the vertical “y” axis for Alaska Airlines flights leaving NYC in 2013. This requires paring down the data from all 336,776 flights that left NYC in 2013, to only the 714 Alaska Airlines flights that left NYC in 2013. We do this so our scatterplot will involve a manageable 714 points, and not an overwhelmingly large number like 336,776. 
To achieve this, we’ll take the flights data frame, filter the rows so that only the 714 rows corresponding to Alaska Airlines flights are kept, and save this in a new data frame called alaska_flights using the &lt;- assignment operator: alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) For now we suggest you don’t worry if you don’t fully understand this code. We’ll see later in Chapter 3 on data wrangling that this code uses the dplyr package to achieve our goal: it takes the flights data frame and filters it to only return the rows where carrier is equal to &quot;AS&quot;, Alaska Airlines’ carrier code. Recall from Section 1.2 that testing for equality is specified with == and not =. For now however, convince yourself that this code achieves what it is supposed to by exploring the resulting data frame by running View(alaska_flights). You’ll see that it has 714 rows, consisting of only Alaska Airlines flights. Learning check (LC2.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights). In what respect do these data frames differ? For example, think about the number of rows in each dataset. 2.3.1 Scatterplots via geom_point Let’s now go over the code that will create the desired scatterplot, while keeping in mind the Grammar of Graphics we introduced in Section 2.1. Let’s take a look at the code and break it down piece-by-piece. ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() Within the ggplot() function, we specify two of the components of the Grammar of Graphics as arguments (i.e. inputs): The data to be the alaska_flights data frame by setting data = alaska_flights. The aesthetic mapping by setting mapping = aes(x = dep_delay, y = arr_delay). Specifically, the variable dep_delay maps to the x position aesthetic while the variable arr_delay maps to the y position aesthetic. 
We then add a layer to the ggplot() function call using the + sign. The added layer in question specifies the third component of the grammar: the geometric object. In this case the geometric object is set to be points by specifying geom_point(). After running these two lines of code in your console, you’ll notice two outputs: the graphic shown in Figure 2.2 and a warning message. Warning: Removed 5 rows containing missing values (geom_point). FIGURE 2.2: Arrival delays vs departure delays for Alaska Airlines flights from NYC in 2013. Let’s first unpack the graphic in Figure 2.2. Observe that a positive relationship exists between dep_delay and arr_delay: as departure delays increase, arrival delays tend to also increase. Observe also the large mass of points clustered near (0, 0), the point indicating flights that neither departed nor arrived late. Let’s turn our attention to the warning message. R is alerting us to the fact that 5 rows were ignored because they contained missing values. For these 5 rows, either the value for dep_delay or arr_delay or both were missing (recorded in R as NA), and thus these rows were ignored in our plot. Before we continue, let’s make a few more observations about this code that created the scatterplot. Note that the + sign comes at the end of lines, and not at the beginning. You’ll get an error in R if you put it at the beginning of a line. When adding layers to a plot, you are encouraged to start a new line after the + (by pressing the Return/Enter button on your keyboard) so that the code for each layer is on a new line. As we add more and more layers to plots, you’ll see this will greatly improve the legibility of your code. To stress the importance of adding the layer specifying the geometric object, consider Figure 2.3 where no layers are added. Because the geometric object was not specified, we have a blank plot which is not very useful! 
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) FIGURE 2.3: A plot with no layers. Learning check (LC2.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship? (LC2.3) What variables in the weather data frame would you expect to have a negative correlation (i.e. a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here. Hint: Explore the weather dataset by using the View() function. (LC2.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights? (LC2.5) What are some other features of the plot that stand out to you? (LC2.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example given. 2.3.2 Over-plotting The large mass of points near (0, 0) in Figure 2.2 can cause some confusion since it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called overplotting. As one may guess, this corresponds to points being plotted on top of each other over and over again. When overplotting occurs, it is difficult to know the number of points being plotted. There are two methods to address the issue of overplotting. Either by Adjusting the transparency of the points or Adding a little random “jitter”, or random “nudges”, to each of the points. Method 1: Changing the transparency The first way of addressing overplotting is to change the transparency/opacity of the points by setting the alpha argument in geom_point(). We can change the alpha argument to be any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque. By default, alpha is set to 1. In other words, if we don’t explicitly set an alpha value, R will use alpha = 1. 
Note how the following code is identical to the code in Section 2.3 that created the scatterplot with overplotting, but with alpha = 0.2 added to the geom_point(): ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) FIGURE 2.4: Arrival vs departure delays scatterplot with alpha = 0.2. The key feature to note in Figure 2.4 is that the transparency of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no aes() surrounding alpha = 0.2. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of alpha. If you instead changed the second line to read geom_point(aes(alpha = 0.2)), ggplot2 would treat 0.2 as a value to be mapped through a scale rather than as a fixed setting, so the points would most likely not be drawn with the transparency you intended. Method 2: Jittering the points The second way of addressing overplotting is by jittering all the points, in other words giving each point a small “nudge” in a random direction. You can think of “jittering” as shaking the points around a bit on the plot. Let’s illustrate using a simple example first. Say we have a data frame with 4 identical rows of x &amp; y values: (0,0), (0,0), (0,0), and (0,0). In Figure 2.5, we present both the regular scatterplot of these 4 points (on the left) and its jittered counterpart (on the right). FIGURE 2.5: Regular and jittered scatterplot. In the left-hand regular scatterplot, observe that the 4 points are superimposed on top of each other. While we know there are 4 values being plotted, this fact might not be apparent to others. In the right-hand jittered scatterplot, observe that since each point is given a random “nudge”, it is now plainly evident that this plot involves four points. Keep in mind however that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged. 
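The four-point example can be reproduced with a short sketch. The data frame name four_points and the jitter amounts are our own illustrative choices, not the code behind Figure 2.5, and the sketch uses geom_jitter(), the layer this section goes on to discuss.

```r
library(ggplot2)

# Four identical rows, all at (0, 0)
four_points <- data.frame(x = rep(0, 4), y = rep(0, 4))

# Regular scatterplot: the four points are drawn on top of each other
regular_plot <- ggplot(data = four_points, mapping = aes(x = x, y = y)) +
  geom_point()

# Jittered scatterplot: each point gets a small random nudge when drawn
# (the width and height values here are arbitrary choices)
jittered_plot <- ggplot(data = four_points, mapping = aes(x = x, y = y)) +
  geom_jitter(width = 0.1, height = 0.1)

regular_plot
jittered_plot

# The data frame itself is untouched by the jittering
four_points
```

Printing four_points at the end is a reminder that the nudges exist only in the drawn plot; the stored values are still all zero.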
To create a jittered scatterplot, instead of using geom_point(), we use geom_jitter(). Observe how the following code is very similar to the code that created the scatterplot with overplotting in Subsection 2.3.1, but with geom_point() replaced with geom_jitter(). ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_jitter(width = 30, height = 30) FIGURE 2.6: Arrival vs departure delays jittered scatterplot. In order to specify how much jitter to add, we adjusted the width and height arguments to geom_jitter(). This corresponds to how hard you’d like to shake the plot in horizontal x-axis units and vertical y-axis units respectively. In this case, both axes are in minutes. How much jitter should we add using the width and height arguments? On the one hand, it is important to add just enough jitter to break any overlap in points, but on the other hand, not so much that we completely alter the original pattern in points. As can be seen in the resulting Figure 2.6, in this case jittering doesn’t really provide much new insight. In this particular case, it can be argued that changing the transparency of the points by setting alpha proved more effective. When would it be better to use a jittered scatterplot? When would it be better to alter the points’ transparency? There is no single right answer that applies to all situations. You need to make a subjective choice and own that choice. At the very least when confronted with overplotting however, we suggest you make both types of plots and see which one better emphasizes the point you are trying to make. Learning check (LC2.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot? (LC2.8) After viewing Figure 2.4, give an approximate range of arrival delays and departure delays that occur the most frequently. 
How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 2.2? 2.3.3 Summary Scatterplots display the relationship between two numerical variables. They are among the most commonly used plots because they can provide an immediate way to see the trend in one numerical variable versus another. However, if you try to create a scatterplot where either one of the two variables is not numerical, you might get strange results. Be careful! With medium to large datasets, you may need to play around with the different modifications to scatterplots we saw such as changing the transparency/opacity of the points or by jittering the points. This tweaking is often a fun part of data visualization, since you’ll have the chance to see different relationships emerge as you tinker with your plots. 2.4 5NG#2: Linegraphs The next of the five named graphs are linegraphs. Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the explanatory variable, is of a sequential nature. In other words there is an inherent ordering to the variable. The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called time series plots. Let’s illustrate linegraphs using another data set in the nycflights13 package: the weather data frame. Let’s explore the weather data frame by running View(weather) and glimpse(weather), and let’s also read the associated help file by running ?weather. Observe that there is a variable called temp of hourly temperature recordings in Fahrenheit at weather stations near all three airports in New York City: Newark (origin code EWR), John F. Kennedy International (JFK), and La Guardia (LGA). 
However, instead of considering hourly temperatures for all days in 2013 for all three airports, for simplicity let’s only consider hourly temperatures at Newark airport for the first 15 days in January. Recall in Section 2.3 we used the filter() function to only choose the subset of rows of flights corresponding to Alaska Airlines flights. We similarly use filter() here, but by using the &amp; operator we only choose the subset of rows of weather where the origin is &quot;EWR&quot;, the month is January, and the day is between 1 and 15. Recall we performed a similar task in Section 2.3 when creating the alaska_flights data frame of only Alaska Airlines flights, a topic we’ll explore more in Chapter 3 on data wrangling. early_january_weather &lt;- weather %&gt;% filter(origin == &quot;EWR&quot; &amp; month == 1 &amp; day &lt;= 15) Learning check (LC2.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather). In what respect do these data frames differ? (LC2.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement whereas the hour variable does not? 2.4.1 Linegraphs via geom_line Let’s create a time series plot of the hourly temperatures saved in the early_january_weather data frame by using geom_line() to create a linegraph, instead of using geom_point() like we used previously to create scatterplots: ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) + geom_line() FIGURE 2.7: Hourly temperature in Newark for January 1-15, 2013. 
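As a side note on the filter() call that created early_january_weather, here is a minimal sketch of an equivalent way to write it. It relies on the documented dplyr behavior that comma-separated conditions inside filter() are combined with "and"; the names with_ampersand and with_commas are ours, chosen for illustration.

```r
library(nycflights13)
library(dplyr)

# Version from the text: conditions joined explicitly with &
with_ampersand <- weather %>%
  filter(origin == 'EWR' & month == 1 & day <= 15)

# Alternative form: conditions separated by commas
with_commas <- weather %>%
  filter(origin == 'EWR', month == 1, day <= 15)

# Check that both approaches keep exactly the same rows
identical(with_ampersand, with_commas)
```

Either form can be used; the comma version is sometimes easier to read when there are many conditions.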
Much as with the ggplot() code that created the scatterplot of departure and arrival delays for Alaska Airlines flights in Figure 2.2, let’s break down this code piece-by-piece in terms of the Grammar of Graphics: Within the ggplot() function call, we specify two of the components of the Grammar of Graphics as arguments: The data to be the early_january_weather data frame by setting data = early_january_weather. The aesthetic mapping by setting mapping = aes(x = time_hour, y = temp). Specifically, the variable time_hour maps to the x position aesthetic while the variable temp maps to the y position aesthetic. We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object in question. In this case the geometric object is a line, set by specifying geom_line(). Learning check (LC2.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis? (LC2.12) Why are linegraphs frequently used when time is the explanatory variable on the x-axis? (LC2.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013. 2.4.2 Summary Linegraphs, just like scatterplots, display the relationship between two numerical variables. However it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e. the explanatory variable) has an inherent ordering, such as some notion of time. 2.5 5NG#3: Histograms Let’s consider the temp variable in the weather data frame once again, but unlike with the linegraphs in Section 2.4, let’s say we don’t care about its relationship with time, but rather we only care about how the values of temp distribute. In other words: What are the smallest and largest values? What is the “center” or “most typical” value? How do the values spread out? What are frequent and infrequent values? 
One way to visualize the distribution of this single variable temp is to plot its values on a horizontal line as we do in Figure 2.8: FIGURE 2.8: Plot of hourly temperature recordings from NYC in 2013. This gives us a general idea of how the values of temp distribute: observe that temperatures vary from around 11°F (-11°C) up to 100°F (38°C). Furthermore, there appear to be more recorded temperatures between 40°F and 60°F than outside this range. However, because of the high degree of overplotting in the points, it’s hard to get a sense of exactly how many values are between say 50°F and 55°F. What is commonly produced instead of Figure 2.8 is known as a histogram. A histogram is a plot that visualizes the distribution of a numerical variable as follows: We first cut up the x-axis into a series of bins, where each bin represents a range of values. For each bin, we count the number of observations that fall in the range corresponding to that bin. Then for each bin, we draw a bar whose height marks the corresponding count. Let’s drill down on an example of a histogram, shown in Figure 2.9. FIGURE 2.9: Example histogram. Let’s focus only on temperatures between 30°F (-1°C) and 60°F (15°C) for now. Observe that there are three bins of equal width between 30°F and 60°F. Thus we have three bins of width 10°F each: one bin for the 30-40°F range, another bin for the 40-50°F range, and another bin for the 50-60°F range. Reading off the heights of these bins: The bin for the 30-40°F range has a height of around 5000. In other words, around 5000 of the hourly temperature recordings are between 30°F and 40°F. The bin for the 40-50°F range has a height of around 4300. In other words, around 4300 of the hourly temperature recordings are between 40°F and 50°F. The bin for the 50-60°F range has a height of around 3500. In other words, around 3500 of the hourly temperature recordings are between 50°F and 60°F. All nine bins spanning 10°F to 100°F on the x-axis have this interpretation. 
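The three steps behind a histogram (cut the axis into bins, count, draw bars) can be sketched by hand in base R. This is only an illustration under assumed bin edges running from 0°F to 110°F in steps of 10°F; how values that sit exactly on an edge are assigned may differ from the histogram in Figure 2.9, so the counts need not match it exactly.

```r
library(nycflights13)

# Drop the missing temperature recording(s) before counting
temps <- weather$temp[!is.na(weather$temp)]

# Step 1: cut the axis into bins of width 10 (edges assumed for illustration)
bin_edges <- seq(0, 110, by = 10)
bins <- cut(temps, breaks = bin_edges, include.lowest = TRUE)

# Step 2: count the observations falling in each bin
bin_counts <- table(bins)
bin_counts

# Step 3: draw one bar per bin, with height equal to the count
barplot(bin_counts, xlab = 'Temperature bin (°F)', ylab = 'Count')
```

In practice you would not do this by hand; the point is only to make concrete what the bar heights in a histogram represent.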
2.5.1 Histograms via geom_histogram Let’s now present the ggplot() code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in aes(): the single numerical variable temp. The y-aesthetic of a histogram, the count of the observations in each bin, gets computed for you automatically. Furthermore, the geometric object layer is now a geom_histogram(). After running the following code, you’ll see the histogram in Figure 2.10 as well as warning messages. We’ll discuss the warning messages first. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram() `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. Warning: Removed 1 rows containing non-finite values (stat_bin). FIGURE 2.10: Histogram of hourly temperatures at three NYC airports. The first message is telling us that the histogram was constructed using bins = 30, in other words 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We’ll see in the next section how to change the number of bins away from this default value. The second message is telling us something similar to the warning message we received when we ran the code to create a scatterplot of departure and arrival delays for Alaska Airlines flights in Figure 2.2: that because one row has a missing NA value for temp, it was omitted from the histogram. R is just giving us a friendly heads up that this was the case. Now let’s unpack the resulting histogram in Figure 2.10. Observe that values less than 25°F as well as values above 80°F are rather rare. However, because of the large number of bins, it’s hard to get a sense for which range of temperatures is spanned by each bin; everything is one giant amorphous blob. 
So let’s add white vertical borders demarcating the bins by adding a color = &quot;white&quot; argument to geom_histogram(): ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = &quot;white&quot;) FIGURE 2.11: Histogram of hourly temperatures at three NYC airports with white borders. We now have an easier time associating ranges of temperatures to each of the bins in Figure 2.11. We can also vary the color of the bars by setting the fill argument. For example, you can set the bin colors to be “blue steel” by setting fill = &quot;steelblue&quot;: ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = &quot;white&quot;, fill = &quot;steelblue&quot;) If you’re curious, run colors() to see all 657 possible choices of colors in R! 2.5.2 Adjusting the bins Observe in Figure 2.11 that in the 50-75°F range there appear to be roughly 8 bins. Thus each bin has width 25 divided by 8, or roughly 3.12°F, which is not a very easily interpretable range to work with. Let’s improve this by adjusting the number of bins in our histogram in one of two ways: By adjusting the number of bins via the bins argument to geom_histogram(). By adjusting the width of the bins via the binwidth argument to geom_histogram(). Using the first method, we have the power to specify how many bins we would like to cut the x-axis up into. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 40 bins, as follows: ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 40, color = &quot;white&quot;) Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the binwidth argument in the geom_histogram() layer. For example, let’s set the width of each bin to be 10°F. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) We compare both resulting histograms side-by-side in Figure 2.12. 
FIGURE 2.12: Setting histogram bins in two ways. Learning check (LC2.14) What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures? (LC2.15) Would you classify the distribution of temperatures as symmetric or skewed? (LC2.16) What would you guess is the “center” value in this distribution? Why did you make that choice? (LC2.17) Is this data spread out greatly from the center or is it close? Why? 2.5.3 Summary Histograms, unlike scatterplots and linegraphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question. 2.6 Facets Before continuing with the next of the 5NG, let’s briefly introduce a new concept called faceting. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ. For example, suppose we were interested in looking at how the histogram of hourly temperature recordings at the three NYC airports we saw in Figure 2.9 differed in each month. We could “split” this histogram by the 12 possible months in a given year. In other words, we would plot histograms of temp for each month separately. We do this by adding a facet_wrap(~ month) layer. Note the ~ is a “tilde” and can generally be found on the key next to the “1” key on US keyboards. The tilde is required and you’ll receive the error Error in as.quoted(facets) : object 'month' not found if you don’t include it here. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + facet_wrap(~ month) FIGURE 2.13: Faceted histogram of hourly temperatures by month. We can also specify the number of rows and columns in the grid by using the nrow and ncol arguments inside of facet_wrap(). For example, say we would like our faceted histogram to have 4 rows instead of 3. 
We simply add a nrow = 4 argument to facet_wrap(~ month) ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + facet_wrap(~ month, nrow = 4) FIGURE 2.14: Faceted histogram with 4 instead of 3 rows. Observe in both Figures 2.13 and 2.14 that as we might expect in the Northern Hemisphere, temperatures tend to be higher in the summer months, while they tend to be lower in the winter. Learning check (LC2.18) What other things do you notice about this faceted plot? How does a faceted plot help us see relationships between two variables? (LC2.19) What do the numbers 1-12 correspond to in the plot? What about 25, 50, 75, 100? (LC2.20) For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics. (LC2.21) Does the temp variable in the weather data set have a lot of variability? Why do you say that? 2.7 5NG#4: Boxplots While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal are side-by-side boxplots. A boxplot is constructed from the information provided in the five-number summary of a numerical variable (see Appendix A.1). To keep things simple for now, let’s only consider the 2141 hourly temperature recordings for the month of November, each represented as a point in Figure 2.15. FIGURE 2.15: November temperatures represented as points. These 2141 observations have the following five-number summary: Minimum: 21°F First quartile AKA 25th percentile: 36°F Median AKA second quartile AKA 50th percentile: 45°F Third quartile AKA 75th percentile: 52°F Maximum: 71°F In the left-most plot of Figure 2.16, let’s mark these 5 values with dashed horizontal lines on top of the 2141 points. 
In the middle plot of Figure 2.16 let’s add the boxplot. In the right-most plot of Figure 2.16, let’s remove the points and the dashed horizontal lines for clarity’s sake. FIGURE 2.16: Building up a boxplot of November temperatures. What the boxplot does is visually summarize the 2141 points by cutting the 2141 temperature recordings into quartiles at the dashed lines, where each quartile contains roughly 2141 \\(\\div\\) 4 \\(\\approx\\) 535 observations. Thus 25% of points fall below the bottom edge of the box, which is the first quartile of 36°F. In other words 25% of observations were colder than 36°F. 25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 45°F. In other words 25% of observations were between 36 and 45°F and 50% of observations were colder than 45°F. 25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 52°F. In other words 25% of observations were between 45 and 52°F and 75% of observations were colder than 52°F. 25% of points fall above the top edge of the box. In other words 25% of observations were warmer than 52°F. The middle 50% of points lie within the interquartile range (IQR) between the first and third quartile of 52 - 36 = 16°F. The interquartile range is a measure of a numerical variable’s spread. Furthermore, in the right-most plot of Figure 2.16, we see the whiskers of the boxplot. The whiskers stick out from either end of the box all the way to the minimum and maximum observed temperatures of 21°F and 71°F respectively. However, the whiskers don’t always extend to the smallest and largest observed values as they do here. They in fact extend no more than 1.5 \\(\\times\\) the interquartile range from either end of the box. In this case of the November temperatures, no more than 1.5 \\(\\times\\) 16°F = 24°F from either end of the box. 
Any observed values outside this range get marked with points called outliers, which we’ll see in the next section. 2.7.1 Boxplots via geom_boxplot Let’s now create a side-by-side boxplot of hourly temperatures split by the 12 months as we did previously with the faceted histograms. We do this by mapping the month variable to the x-position aesthetic, the temp variable to the y-position aesthetic, and by adding a geom_boxplot() layer: ggplot(data = weather, mapping = aes(x = month, y = temp)) + geom_boxplot() FIGURE 2.17: Invalid boxplot specification. Warning messages: 1: Continuous x aesthetic -- did you forget aes(group=...)? 2: Removed 1 rows containing non-finite values (stat_boxplot). Observe in Figure 2.17 that this plot does not provide information about temperature separated by month. The first warning message clues us in as to why. It is telling us that we have a “continuous”, or numerical variable, on the x-position aesthetic. Boxplots however require a categorical variable to be mapped to the x-position aesthetic. The second warning message is identical to the warning message when plotting a histogram of hourly temperatures: that one of the values was recorded as NA missing. We can convert the numerical variable month into a categorical variable by using the factor() function. So after applying factor(month), month goes from having numerical values 1, 2, …, and 12 to having labels “1”, “2”, …, and “12.” ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + geom_boxplot() FIGURE 2.18: Side-by-side boxplot of temperature split by month. The resulting Figure 2.18 shows 12 separate “box and whiskers” plots similar to the right-most plot of Figure 2.16 focusing only on November: The “box” portions of the visualization represent the 1st quartile, the median AKA the 2nd quartile, and the 3rd quartile. The height of each box, i.e. the value of the 3rd quartile minus the value of the 1st quartile, is the interquartile range (IQR). 
It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability. The “whisker” portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentile, respectively. They’re set to extend out no more than \\(1.5 \\times IQR\\) units away from either end of the boxes. We say “no more than” because the ends of the whiskers have to correspond to observed temperatures. The lengths of these whiskers show how the data outside the middle 50% of values vary, with longer whiskers indicating more variability. The dots representing values falling outside the whiskers are called outliers. These can be thought of as anomalous values. It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, they are defined by the length of the whiskers, which are no more than \\(1.5 \\times IQR\\) units long. Looking at this plot we can see, as expected, that summer months (6 through 8) have higher median temperatures as evidenced by the higher solid lines in the middle of the boxes. We can easily compare temperatures across months by drawing imaginary horizontal lines across the plot. Furthermore, the heights of the 12 boxes, as quantified by the interquartile ranges, are informative too; they tell us about variability, or spread, of temperatures recorded in a given month. Learning check (LC2.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point. (LC2.23) Which months have the highest variability in temperature? What reasons can you give for this? (LC2.24) We looked at the distribution of the numerical variable temp split by the numerical variable month that we converted to a categorical variable using the factor() function. 
Why would a boxplot of temp split by the numerical variable pressure similarly converted to a categorical variable using the factor() not be informative? (LC2.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram? 2.7.2 Summary Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by looking at the solid line in the center of the boxes. To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points. 2.8 5NG#5: Barplots Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another common task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories of a categorical variable, also known as the levels of the categorical variable. Often the best way to visualize these different counts, also known as frequencies, is with barplots (also called barcharts). One complication, however, is how your data is represented. Is the categorical variable of interest “pre-counted” or not? For example, run the following code that manually creates two data frames representing a collection of fruit: 3 apples and 2 oranges. fruits &lt;- tibble( fruit = c(&quot;apple&quot;, &quot;apple&quot;, &quot;orange&quot;, &quot;apple&quot;, &quot;orange&quot;) ) fruits_counted &lt;- tibble( fruit = c(&quot;apple&quot;, &quot;orange&quot;), number = c(3, 2) ) We see both the fruits and fruits_counted data frames represent the same collection of fruit. 
Whereas fruits just lists the fruit individually… # A tibble: 5 x 1 fruit &lt;chr&gt; 1 apple 2 apple 3 orange 4 apple 5 orange … fruits_counted has a variable number which represents the “pre-counted” values of each fruit. # A tibble: 2 x 2 fruit number &lt;chr&gt; &lt;dbl&gt; 1 apple 3 2 orange 2 Depending on how your categorical data is represented, you’ll need to add a different geometric layer type to your ggplot() to create a barplot, as we now explore. 2.8.1 Barplots via geom_bar or geom_col Let’s generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the fruits data frame where all 5 fruit are listed individually in 5 rows, we map the fruit variable to the x-position aesthetic and add a geom_bar() layer: ggplot(data = fruits, mapping = aes(x = fruit)) + geom_bar() FIGURE 2.19: Barplot when counts are not pre-counted. However, using the fruits_counted data frame where the fruit have been “pre-counted”, we once again map the fruit variable to the x-position aesthetic, but here we also map the number variable to the y-position aesthetic, and add a geom_col() layer instead. ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) + geom_col() FIGURE 2.20: Barplot when counts are pre-counted. Compare the barplots in Figures 2.19 and 2.20. They are identical because they reflect counts of the same five fruit. However depending on how our categorical data is represented, either “pre-counted” or not, we must add a different geom layer. When the categorical variable whose distribution you want to visualize Is not pre-counted in your data frame, we use geom_bar(). Is pre-counted in your data frame, we use geom_col() with the y-position aesthetic mapped to the variable that has the counts. Let’s now go back to the flights data frame in the nycflights13 package and visualize the distribution of the categorical variable carrier. 
In other words, let’s visualize the number of domestic flights out of New York City each airline company flew in 2013. Recall from Section 1.4.3 when you first explored the flights data frame you saw that each row corresponds to a flight. In other words the flights data frame is more like the fruits data frame than the fruits_counted data frame because the flights have not been pre-counted by carrier. Thus we should use geom_bar() instead of geom_col() to create a barplot. Much like a geom_histogram(), there is only one variable in the aes() aesthetic mapping: the variable carrier gets mapped to the x-position. ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() FIGURE 2.21: Number of flights departing NYC in 2013 by airline using geom_bar(). Observe in Figure 2.21 that United Air Lines (UA), JetBlue Airways (B6), and ExpressJet Airlines (EV) had the most flights depart New York City in 2013. If you don’t know which airlines correspond to which carrier codes, then run View(airlines) to see a directory of airlines. For example: AA is American Airlines; B6 is JetBlue Airways; DL is Delta Airlines; EV is ExpressJet Airlines; MQ is Envoy Air; while UA is United Airlines. Alternatively, say you had a data frame flights_counted where the number of flights for each carrier was pre-counted like in Table 2.3. TABLE 2.3: Number of flights pre-counted for each carrier. carrier number 9E 18460 AA 32729 AS 714 B6 54635 DL 48110 EV 54173 F9 685 FL 3260 HA 342 MQ 26397 OO 32 UA 58665 US 20536 VX 5162 WN 12275 YV 601 In order to create a barplot visualizing the distribution of the categorical variable carrier in this case, we would use geom_col() instead with x mapped to carrier and y mapped to number as shown in what follows. The resulting barplot would be identical to Figure 2.21. ggplot(data = flights_counted, mapping = aes(x = carrier, y = number)) + geom_col() Learning check (LC2.26) Why are histograms inappropriate for visualizing categorical variables? 
(LC2.27) What is the difference between histograms and barplots? (LC2.28) How many Envoy Air flights departed NYC in 2013? (LC2.29) What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly? 2.8.2 Must avoid pie charts! One of the most common plots used to visualize the distribution of categorical data is the pie chart. While they may seem harmless enough, they actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book “Creating More Effective Graphs” (Robbins 2013), we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another. Let’s examine the same data used in our previous barplot of the number of flights departing NYC by airline in Figure 2.21, but this time we will use a pie chart in Figure 2.22. Try to answer the following questions: How much larger is the portion of the pie for ExpressJet Airlines (EV) compared to US Airways (US), What is the third largest carrier in terms of departing flights, and How many carriers have fewer flights than United Airlines (UA)? FIGURE 2.22: The dreaded pie chart. While it is quite difficult to answer these questions when looking at the pie chart in Figure 2.22, we can much more easily answer these questions using the barchart in Figure 2.21. This is true since barplots present the information in a way such that comparisons between categories can be made with single horizontal lines, whereas pie charts present the information in a way such that comparisons must be made by comparing angles. Learning check (LC2.30) Why should pie charts be avoided and replaced by barplots? (LC2.31) Why do you think people continue to use pie charts? 
2.8.3 Two categorical variables Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the joint distribution of two categorical variables at the same time. Let’s examine the joint distribution of outgoing domestic flights from NYC by carrier as well as origin. In other words, the number of flights for each carrier and origin combination. For example, the number of WestJet flights from JFK, the number of WestJet flights from LGA, the number of WestJet flights from EWR, the number of American Airlines flights from JFK, and so on. Recall the ggplot() code that created the barplot of carrier frequency in Figure 2.21: ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() We can now map the additional variable origin by adding a fill = origin inside the aes() aesthetic mapping; the fill aesthetic of any bar corresponds to the color used to fill the bars. ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar() FIGURE 2.23: Stacked barplot comparing the number of flights by carrier and origin. Figure 2.23 is an example of a stacked barplot. While simple to make, in certain aspects it is not ideal. For example, it is difficult to compare the heights of the different colors between the bars, corresponding to comparing the number of flights from each origin airport between the carriers. Before we continue, let’s address some common points of confusion among new R users. First, note that fill is another aesthetic mapping much like x-position; thus we were careful to include it within the parentheses of the aes() mapping. The following code, where the fill aesthetic is specified outside the aes() mapping will yield an error. 
This is a fairly common error that new ggplot users make: ggplot(data = flights, mapping = aes(x = carrier), fill = origin) + geom_bar() Second, the fill aesthetic corresponds to the color used to fill the bars, while the color aesthetic corresponds to the color of the outline of the bars. This is identical to how we added color to our histogram in Subsection 2.5.1: we set the outline of the bars to white by setting color = &quot;white&quot; and the colors of the bars to be blue steel by setting fill = &quot;steelblue&quot;. Observe in Figure 2.24 that mapping origin to color and not fill yields grey bars with different colored outlines. ggplot(data = flights, mapping = aes(x = carrier, color = origin)) + geom_bar() FIGURE 2.24: Stacked barplot with color aesthetic used instead of fill. An alternative to the stacked barplot is the side-by-side barplot, also known as a dodged barplot, as seen in Figure 2.25. The code to create a side-by-side barplot is identical to the code to create a stacked barplot, but with a position = &quot;dodge&quot; argument added to geom_bar(). In other words, we are overriding the default barplot type, which is a stacked barplot, and specifying it to be a side-by-side barplot. ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar(position = &quot;dodge&quot;) FIGURE 2.25: Side-by-side barplot comparing number of flights by carrier and origin. Lastly, another type of barplot is a faceted barplot. Recall in Section 2.6 we visualized the distribution of hourly temperatures at the 3 NYC airports split by month using facets. We apply the same principle to our barplot visualizing the frequency of carrier split by origin: instead of mapping origin to the fill aesthetic, we facet by origin using facet_wrap(): ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() + facet_wrap(~ origin, ncol = 1) FIGURE 2.26: Faceted barplot comparing the number of flights by carrier and origin. Learning check (LC2.32) What kinds of questions are not easily answered by looking at Figure 2.23? 
(LC2.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights? (LC2.34) Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case? (LC2.35) What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general? (LC2.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case? (LC2.37) What information about the different carriers at different airports is more easily seen in the faceted barplot? 2.8.4 Summary Barplots are a very common way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories (also called levels) occur. They are easy to understand and make it easy to make comparisons across levels. Furthermore, when trying to visualize the relationship of two categorical variables, you have many options: stacked barplots, side-by-side barplots, and faceted barplots. Depending on what aspect of the relationship you are trying to emphasize, you will need to make a choice between these three types of barplots and own that choice. 2.9 Conclusion 2.9.1 Summary table Let’s recap all five of the Five Named Graphs (5NG) in Table 2.4 summarizing their differences. Using these 5NG, you’ll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each geometric object’s aesthetic attribute options, further unlocking the awesome power of the ggplot2 package. TABLE 2.4: Summary of Five Named Graphs Named graph Shows Geometric object Notes 1 Scatterplot Relationship between 2 numerical variables geom_point() 2 Linegraph Relationship between 2 numerical variables geom_line() Used when there is a sequential order to x-variable e.g. 
time 3 Histogram Distribution of 1 numerical variable geom_histogram() Faceted histograms show the distribution of 1 numerical variable split by the values of another variable 4 Boxplot Distribution of 1 numerical variable split by the values of another variable geom_boxplot() 5 Barplot Distribution of 1 categorical variable geom_bar() when counts are not pre-counted, geom_col() when counts are pre-counted Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables 2.9.2 Function argument specification Let’s go over some important points about specifying the arguments (i.e. inputs) to functions. Run the following two segments of code: # Segment 1: ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() # Segment 2: ggplot(flights, aes(x = carrier)) + geom_bar() You’ll notice that both code segments create the same barplot, even though in the second segment we omitted the data = and mapping = code argument names. This is because the ggplot() function by default assumes that the data argument comes first and the mapping argument comes second. So as long as you specify the data frame in question first and the aes() mapping second, you can omit the explicit statement of the argument names data = and mapping =. Going forward for the rest of this book, all ggplot() code will be like the second segment: with the data = and mapping = explicit naming of the argument omitted with the default ordering of arguments respected. We’ll do this for brevity’s sake and it’s common to see this style when reviewing the R code of other R users. 2.9.3 Additional resources An R script file of all R code used in this chapter is available here. If you want to further unlock the power of the ggplot2 package for data visualization, we suggest that you check out RStudio’s “Data Visualization with ggplot2” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter. 
In particular it presents many more than the 5 geometric objects we covered in this chapter while providing quick and easy to read visual descriptions. For all the geometric objects, it also lists all the possible aesthetic attributes one can tweak. You can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Visualization with ggplot2.” You can see a preview in the figure below. FIGURE 2.27: Data Visualization with ggplot2 cheatsheet. 2.9.4 What’s to come Recall in Figure 2.2 in Section 2.3 we visualized the relationship between departure delay and arrival delay for Alaska Airlines flights. This necessitated paring down the flights data frame to a new data frame alaska_flights consisting of only carrier == AS flights first: alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() Furthermore recall in Figure 2.7 in Section 2.4 we visualized hourly temperature recordings at Newark airport only for the first 15 days of January 2013. This necessitated paring down the weather data frame to a new data frame early_january_weather consisting of hourly temperature recordings only for origin == &quot;EWR&quot;, month == 1, and day less than or equal to 15 first: early_january_weather &lt;- weather %&gt;% filter(origin == &quot;EWR&quot; &amp; month == 1 &amp; day &lt;= 15) ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) + geom_line() These two code segments were a preview of Chapter 3 on data wrangling using the dplyr package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, the two code segments used the filter() function to create new data frames (alaska_flights and early_january_weather) by choosing only a subset of rows of existing data frames (flights and weather). 
In the next chapter, we’ll formally introduce the filter() and other data wrangling functions as well as the pipe operator %&gt;% which allows you to combine multiple data wrangling actions into a single sequential chain of actions. On to Chapter 3 on data wrangling! References "],
-["3-wrangling.html", "Chapter 3 Data Wrangling 3.1 The pipe operator: %&gt;% 3.2 filter rows 3.3 summarize variables 3.4 group_by rows 3.5 mutate existing variables 3.6 arrange and sort rows 3.7 join data frames 3.8 Other verbs 3.9 Conclusion", " Chapter 3 Data Wrangling So far in our journey, we’ve seen how to look at data saved in data frames using the glimpse() and View() functions in Chapter 1 and how to create data visualizations using the ggplot2 package in Chapter 2. In particular we studied what we term the “five named graphs” (5NG): scatterplots via geom_point() linegraphs via geom_line() boxplots via geom_boxplot() histograms via geom_histogram() barplots via geom_bar() or geom_col() We created these visualizations using the “Grammar of Graphics”, which maps variables in a data frame to the aesthetic attributes of one of the 5 geometric objects. We can also control other aesthetic attributes of the geometric objects such as the size and color as seen in the Gapminder data example in Figure 2.1. Recall however that for two of our visualizations, we first needed to transform/modify existing data frames a little. For example, recall the scatterplot in Figure 2.2 of departure and arrival delay only for Alaska Airlines flights. In order to create this visualization, we first needed to pare down the flights data frame to a smaller data frame alaska_flights consisting of only carrier == &quot;AS&quot; flights. We did this using the filter() function: alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) In this chapter, we’ll introduce a series of functions from the dplyr package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs. Such functions include: filter() a data frame’s existing rows to only pick out a subset of them. For example, the alaska_flights data frame. summarize() one of its columns/variables with a summary statistic. 
Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section 2.7 on boxplots. group_by() its rows. In other words, assign different rows to be part of the same group. Then we can combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one for each of the three origin airports. mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius. arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp. join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together. Notice how we used computer_code font to describe the actions we want to take on our data frames. This is because the dplyr package for data wrangling has intuitively verb-named functions that are easy to remember. There is a further benefit to learning to use the dplyr package for data wrangling: its similarity to the database querying language SQL (pronounced “sequel”). The SQL language is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn dplyr you can learn SQL easily. We’ll talk more about their similarities in Subsection 3.7.4. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages. 
library(dplyr) library(ggplot2) library(nycflights13) 3.1 The pipe operator: %&gt;% Before we start data wrangling, let’s first introduce a very nifty tool that gets loaded along with the dplyr package: the pipe operator %&gt;%. The pipe operator allows us to combine multiple operations on a computer into a single sequential chain of actions. Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x using hypothetical functions f(), g(), and h(): Take x then Use x as an input to a function f() then Use the output of f(x) as an input to a function g() then Use the output of g(f(x)) as an input to a function h() One way to achieve this sequence of operations is by using nesting parentheses as follows: h(g(f(x))) This code isn’t so hard to read since we are applying only three functions: f(), then g(), then h(). However, you can imagine that this will get progressively harder to read as the number of functions applied in your sequence increases. This is where the pipe operator %&gt;% comes in handy. %&gt;% takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read %&gt;% as “then” or “and then.” For example, you can obtain the same output as the hypothetical sequence of functions as follows: x %&gt;% f() %&gt;% g() %&gt;% h() You would read this sequence as: Take x then Use this output as the input to the next function f() then Use this output as the input to the next function g() then Use this output as the input to the next function h() So while both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical x, f(), g(), and h()? Throughout this chapter on data wrangling: The starting value x will be a data frame. For example, the flights data frame we explored in Section 1.4. 
The sequence of functions, here f(), g(), and h(), will mostly be a sequence of any number of the six data wrangling verb-named functions we listed in the introduction to this chapter. For example, the filter(carrier == &quot;AS&quot;) function we previewed earlier. The result will be the transformed/modified data frame that you want. In our example, we’ll save the result in a new data frame by using the &lt;- assignment operator with the name alaska_flights via alaska_flights &lt;-. alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) Much like when adding layers to a ggplot() using the + sign, you form a single chain of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator %&gt;%. Furthermore, much like how the + sign has to come at the end of lines when constructing plots, the pipe operator %&gt;% has to come at the end of lines as well. Keep in mind, there are many more advanced data wrangling functions than just the six listed in the introduction to this chapter; you’ll see some examples of these in Section 3.8. However, just with these six verb-named functions you’ll be able to perform a broad array of data wrangling tasks for the rest of this book. 3.2 filter rows FIGURE 3.1: Diagram of filter() rows operation. The filter() function here works much like the “Filter” option in Microsoft Excel; it allows you to specify criteria about the values of a variable in your dataset and then returns only the rows that match those criteria. We begin by focusing only on flights from New York City to Portland, Oregon. The dest destination code (or airport code) for Portland, Oregon is &quot;PDX&quot;. Run the following and look at the results in RStudio’s spreadsheet viewer to ensure that only flights heading to Portland are chosen here: portland_flights &lt;- flights %&gt;% filter(dest == &quot;PDX&quot;) View(portland_flights) Note the order of the code. 
First, take the flights data frame flights then filter() the data frame so that only those rows where dest equals &quot;PDX&quot; are included. We test for equality using the double equal sign == and not a single equal sign =. In other words filter(dest = &quot;PDX&quot;) will yield an error. This is a convention across many programming languages. If you are new to coding, you’ll probably forget to use the double equal sign == a few times before you get the hang of it. You can use other operators beyond just the == operator that tests for equality: &gt; corresponds to “greater than” &lt; corresponds to “less than” &gt;= corresponds to “greater than or equal to” &lt;= corresponds to “less than or equal to” != corresponds to “not equal to”. The ! is used in many programming languages to indicate “not”. Furthermore, you can combine multiple criteria together using logical operators: | corresponds to “or” &amp; corresponds to “and” To see many of these in action, let’s filter flights for all rows that departed from JFK and were heading to Burlington, Vermont (&quot;BTV&quot;) or Seattle, Washington (&quot;SEA&quot;) and departed in the months of October, November, or December. Run the following: btv_sea_flights_fall &lt;- flights %&gt;% filter(origin == &quot;JFK&quot; &amp; (dest == &quot;BTV&quot; | dest == &quot;SEA&quot;) &amp; month &gt;= 10) View(btv_sea_flights_fall) Note that even though colloquially speaking one might say “all flights heading to Burlington, Vermont and Seattle, Washington,” in terms of computer operations, we really mean “all flights heading to Burlington, Vermont or heading to Seattle, Washington.” For a given row in the data, dest can be &quot;BTV&quot;, or &quot;SEA&quot;, or something else, but not both &quot;BTV&quot; and &quot;SEA&quot; at the same time. Furthermore, note the careful use of parentheses around dest == &quot;BTV&quot; | dest == &quot;SEA&quot;. 
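Why the parentheses matter can be seen in a minimal sketch on made-up data. The toy_flights tibble and its five rows are hypothetical and are not taken from the real flights data frame. R evaluates “and” before “or”, so dropping the parentheses changes which rows are kept:

```r
library(dplyr)

# Hypothetical toy data for illustration only (not the real flights data frame)
toy_flights <- tibble(
  origin = c("JFK", "JFK", "LGA", "EWR", "JFK"),
  dest   = c("BTV", "SEA", "SEA", "BTV", "PDX"),
  month  = c(11, 3, 12, 10, 10)
)

# With parentheses: JFK departures to BTV or SEA in months 10 through 12
with_parens <- toy_flights %>%
  filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)

# Without parentheses, & is evaluated before |, so this condition reads as
# (origin is JFK and dest is BTV) or (dest is SEA and month >= 10)
without_parens <- toy_flights %>%
  filter(origin == "JFK" & dest == "BTV" | dest == "SEA" & month >= 10)

nrow(with_parens)     # 1 row: the JFK to BTV flight in month 11
nrow(without_parens)  # 2 rows: the LGA to SEA flight in month 12 is also kept
```

In this toy example the second filter lets through a flight that did not depart from JFK, which is exactly the kind of silent mistake the parentheses guard against.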
We can often skip the use of &amp; and just separate our conditions with a comma. In other words the previous code will return the identical output btv_sea_flights_fall as the following code: btv_sea_flights_fall &lt;- flights %&gt;% filter(origin == &quot;JFK&quot;, (dest == &quot;BTV&quot; | dest == &quot;SEA&quot;), month &gt;= 10) View(btv_sea_flights_fall) Let’s present another example that uses the ! “not” operator to pick rows that don’t match a criteria. As mentioned earlier, the ! can be read as “not.” Here we are filtering rows corresponding to flights that didn’t go to Burlington, VT or Seattle, WA. not_BTV_SEA &lt;- flights %&gt;% filter(!(dest == &quot;BTV&quot; | dest == &quot;SEA&quot;)) View(not_BTV_SEA) Again, note the careful use of parentheses around the (dest == &quot;BTV&quot; | dest == &quot;SEA&quot;). If we didn’t use parentheses as follows: flights %&gt;% filter(!dest == &quot;BTV&quot; | dest == &quot;SEA&quot;) We would be returning all flights not headed to &quot;BTV&quot; or those headed to &quot;SEA&quot;, which is an entirely different resulting data frame. Now say we have a larger number of airports we want to filter for, say &quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, and &quot;BDL&quot;. We could continue to use the | or operator as so: many_airports &lt;- flights %&gt;% filter(dest == &quot;SEA&quot; | dest == &quot;SFO&quot; | dest == &quot;PDX&quot; | dest == &quot;BTV&quot; | dest == &quot;BDL&quot;) View(many_airports) but as we progressively include more airports, this will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Subsection 1.2.1 that the c() function “combines” or “concatenates” values into a single vector of values. 
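To see what %in% does on its own, here is a small sketch on plain vectors. The names dest_codes and keep are hypothetical and used only for illustration:

```r
# Hypothetical vectors for illustration only
dest_codes <- c("SEA", "ORD", "BTV", "ATL")
keep <- c("SEA", "SFO", "PDX", "BTV", "BDL")

# %in% checks each element on the left against the vector on the right
# and returns one TRUE/FALSE value per element
in_result <- dest_codes %in% keep
in_result  # TRUE FALSE TRUE FALSE

# The same result, written as a chain of == comparisons joined by |
or_result <- dest_codes == "SEA" | dest_codes == "SFO" | dest_codes == "PDX" |
  dest_codes == "BTV" | dest_codes == "BDL"

identical(in_result, or_result)  # TRUE
```

Inside filter(), that TRUE/FALSE vector decides which rows are kept.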
many_airports &lt;- flights %&gt;% filter(dest %in% c(&quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, &quot;BDL&quot;)) View(many_airports) This code filters flights for all rows where dest is in the vector of airports c(&quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, &quot;BDL&quot;). Both outputs of many_airports are the same, but as you can see the latter takes much less effort to write. As a final note, we recommend that filter() should often be among the first verbs you consider applying to your data. This cleans your dataset to only those rows you care about, or put differently, it narrows down the scope of your data frame to just the observations you care about. Learning check (LC3.1) What’s another way of using the “not” operator ! to filter only the rows for flights going to neither Burlington, VT nor Seattle, WA in the flights data frame? Test this out using the previous code. 3.3 summarize variables The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (also called the average) and the median (the middle value). Other examples of summary statistics that might not immediately come to mind include the sum, the smallest value (also called the minimum), the largest value (also called the maximum), and the standard deviation. See Appendix A.1 for a glossary of such summary statistics. Let’s calculate two summary statistics of the temp temperature variable in the weather data frame: the mean and standard deviation (recall from Section 1.4 that the weather data frame is included in the nycflights13 package). To compute these summary statistics, we need the mean() and sd() summary functions in R. Summary functions in R take in many values and return a single value, as illustrated in Figure 3.2. 
FIGURE 3.2: Diagram illustrating a summary function in R. More precisely, we’ll use the mean() and sd() summary functions within the summarize() function from the dplyr package. Note you can also use the UK spelling of summarise(). As shown in Figure 3.3, the summarize() function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics. FIGURE 3.3: Diagram of summarize() rows. We’ll save the results in a new data frame called summary_temp that will have two columns/variables: the mean and the std_dev: summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp), std_dev = sd(temp)) summary_temp # A tibble: 1 x 2 mean std_dev &lt;dbl&gt; &lt;dbl&gt; 1 NA NA Why are the values returned NA? As we saw in Section 2.3.1 when creating the scatterplot of departure and arrival delays for alaska_flights, NA is how R encodes missing values where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data. Going back to our summary_temp output, by default any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. To work around this fact, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and only return the summary value for all non-missing values. 
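The effect of na.rm can be checked on a small made-up vector. The name temps and its values are hypothetical and are not taken from the weather data frame:

```r
# Hypothetical temperature readings with one missing value
temps <- c(39.0, 41.5, NA, 44.1)

# A single NA makes the default summary NA
mean_default <- mean(temps)
mean_default  # NA

# na.rm = TRUE drops the NA and averages the three remaining values
mean_no_na <- mean(temps, na.rm = TRUE)
mean_no_na

# Count how many values were missing before deciding to ignore them
n_missing <- sum(is.na(temps))
n_missing  # 1
```

Counting the missing values first, as in the last step, is one simple way to stay aware of how much data na.rm = TRUE is setting aside.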
The code that follows computes the mean and standard deviation of all non-missing values of temp: summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_temp # A tibble: 1 x 2 mean std_dev &lt;dbl&gt; &lt;dbl&gt; 1 55.3 17.8 Notice how na.rm = TRUE is used as an argument to the mean() and sd() summary functions individually, and not to the summarize() function. However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming Learning Checks we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the na.rm argument to any summary statistic function in R is set to FALSE by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis. What other summary functions can we use inside the summarize() verb to compute summary statistics? As seen in the diagram in Figure 3.2, you can use any function in R that takes many values and returns just one. Here are just a few: mean(): the mean AKA the average sd(): the standard deviation, which is a measure of spread min() and max(): the minimum and maximum values respectively IQR(): Interquartile range sum(): the sum n(): a count of the number of rows/observations in each group. This particular summary function will make more sense when group_by() is covered in Section 3.4. Learning check (LC3.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach? 
(LC3.3) Modify the summarize function to create summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to? (LC3.4) Why doesn’t the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) first. summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) %&gt;% summarize(std_dev = sd(temp, na.rm = TRUE)) 3.4 group_by rows FIGURE 3.4: Diagram of group_by() and summarize(). Say instead of a single mean temperature for the whole year, you would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean temperature split by month. We can do this by “grouping” temperature observations by the values of another variable, in this case by the 12 values of the variable month. Run the following code: summary_monthly_temp &lt;- weather %&gt;% group_by(month) %&gt;% summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_monthly_temp # A tibble: 12 x 3 month mean std_dev &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 35.6 10.2 2 2 34.3 6.98 3 3 39.9 6.25 4 4 51.7 8.79 5 5 61.8 9.68 6 6 72.2 7.55 7 7 80.1 7.12 8 8 74.5 5.19 9 9 67.4 8.47 10 10 60.1 8.85 11 11 45.0 10.4 12 12 38.4 9.98 This code is identical to the previous code that created summary_temp, but with an extra group_by(month) added before the summarize(). Grouping the weather dataset by month and then applying the summarize() functions yields a data frame that displays the mean and standard deviation temperature split by the 12 months of the year. It is important to note that the group_by() function doesn’t change data frames by itself. Rather it changes the meta-data, or data about the data, specifically the grouping structure. It is only after we apply the summarize() function that the data frame changes. 
For example, let’s consider the diamonds data frame included in the ggplot2 package. Run this code: diamonds # A tibble: 53,940 x 10 carat cut color clarity depth table price x y z &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows Observe that the first line of the output reads # A tibble: 53,940 x 10. This is an example of meta-data, in this case the number of observations/rows and variables/columns in diamonds. The actual data itself are the subsequent table of values. 
Now let’s pipe the diamonds data frame into group_by(cut): diamonds %&gt;% group_by(cut) # A tibble: 53,940 x 10 # Groups: cut [5] carat cut color clarity depth table price x y z &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows Observe that now there is additional meta-data: # Groups: cut [5] indicating that the grouping structure meta-data has been set based on the 5 possible levels of the categorical variable cut: &quot;Fair&quot;, &quot;Good&quot;, &quot;Very Good&quot;, &quot;Premium&quot;, &quot;Ideal&quot;. On the other hand, observe that the data has not changed: it is still a table of 53,940 \\(\\times\\) 10 values. Only by combining a group_by() with another data wrangling operation, in this case summarize(), will the data actually be transformed. diamonds %&gt;% group_by(cut) %&gt;% summarize(avg_price = mean(price)) # A tibble: 5 x 2 cut avg_price &lt;ord&gt; &lt;dbl&gt; 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458. 
If you would like to remove this grouping structure meta-data, you can pipe the resulting data frame into the ungroup() function: diamonds %&gt;% group_by(cut) %&gt;% ungroup() # A tibble: 53,940 x 10 carat cut color clarity depth table price x y z &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows Observe how the # Groups: cut [5] meta-data is no longer present. Let’s now revisit the n() counting summary function we briefly introduced previously. Recall that the n() function counts rows. This is in contrast to the sum() summary function that returns the sum of a numerical variable. For example, suppose we’d like to count how many flights departed from each of the three airports in New York City: by_origin &lt;- flights %&gt;% group_by(origin) %&gt;% summarize(count = n()) by_origin # A tibble: 3 x 2 origin count &lt;chr&gt; &lt;int&gt; 1 EWR 120835 2 JFK 111279 3 LGA 104662 We see that Newark (&quot;EWR&quot;) had the most flights departing in 2013 followed by &quot;JFK&quot; and lastly by LaGuardia (&quot;LGA&quot;). Note there is a subtle but important difference between sum() and n(); while sum() returns the sum of a numerical variable, n() returns a count of the number of rows/observations. 3.4.1 Grouping by more than one variable You are not limited to grouping by one variable. Say you want to know the number of flights leaving each of the three New York City airports for each month. 
We can also group by a second variable month using group_by(origin, month): by_origin_monthly &lt;- flights %&gt;% group_by(origin, month) %&gt;% summarize(count = n()) by_origin_monthly # A tibble: 36 x 3 # Groups: origin [3] origin month count &lt;chr&gt; &lt;int&gt; &lt;int&gt; 1 EWR 1 9893 2 EWR 2 9107 3 EWR 3 10420 4 EWR 4 10531 5 EWR 5 10592 6 EWR 6 10175 7 EWR 7 10475 8 EWR 8 10359 9 EWR 9 9550 10 EWR 10 10104 # … with 26 more rows Observe that there are 36 rows in by_origin_monthly because there are 12 months for 3 airports (EWR, JFK, and LGA). Why do we group_by(origin, month) and not group_by(origin) and then group_by(month)? Let’s investigate: by_origin_monthly_incorrect &lt;- flights %&gt;% group_by(origin) %&gt;% group_by(month) %&gt;% summarize(count = n()) by_origin_monthly_incorrect # A tibble: 12 x 2 month count &lt;int&gt; &lt;int&gt; 1 1 27004 2 2 24951 3 3 28834 4 4 28330 5 5 28796 6 6 28243 7 7 29425 8 8 29327 9 9 27574 10 10 28889 11 11 27268 12 12 28135 What happened here is that the second group_by(month) overwrote the grouping structure meta-data of the earlier group_by(origin), so that in the end we are only grouping by month. The lesson here is that if you want to group_by() two or more variables, you should include all the variables at the same time in the same group_by(), adding a comma between the variable names. Learning check (LC3.5) Recall from Chapter 2 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year? (LC3.6) What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC? (LC3.7) Recreate by_origin_monthly, but instead of grouping via group_by(origin, month), group the variables in a different order via group_by(month, origin). What differs in the resulting dataset? 
(LC3.8) How could we identify how many flights left each of the three airports for each carrier? (LC3.9) How does the filter() operation differ from a group_by() followed by a summarize()? 3.5 mutate existing variables FIGURE 3.5: Diagram of mutate() columns. Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius °C instead of degrees Fahrenheit °F. The formula to convert temperatures from °F to °C is \\[ \\text{temp in C} = \\frac{\\text{temp in F} - 32}{1.8} \\] We can apply this formula to the temp variable using the mutate() function from the dplyr package, which takes existing variables and mutates them to create new ones. weather &lt;- weather %&gt;% mutate(temp_in_C = (temp - 32) / 1.8) In this code we mutate() the weather data frame by creating a new variable temp_in_C = (temp-32) / 1.8 and then overwrite the original weather data frame. Why did we overwrite the data frame weather, instead of assigning the result to a new data frame like weather_new? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable temp, but instead create a new variable called temp_in_C? Because if we did this, we would have erased the original information contained in temp of temperatures in Fahrenheit that may still be valuable to us. 
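As a quick sanity check of the conversion formula, here is a minimal sketch on a made-up tibble. The name toy_weather and its three temperatures are hypothetical; they were chosen because their Celsius equivalents are well known:

```r
library(dplyr)

# Hypothetical toy data: freezing point, a mild day, and boiling point in °F
toy_weather <- tibble(temp = c(32, 50, 212))

# Add a new column and keep the original temp column untouched
toy_weather <- toy_weather %>%
  mutate(temp_in_C = (temp - 32) / 1.8)

toy_weather
# temp_in_C holds 0, 10, and 100, and temp still holds the °F values
```

Because temp is kept alongside temp_in_C, overwriting toy_weather loses no information, which mirrors the rule of thumb described above.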
Let’s now compute monthly average temperatures in both °F and °C using the group_by() and summarize() code we saw in Section 3.4: summary_monthly_temp &lt;- weather %&gt;% group_by(month) %&gt;% summarize(mean_temp_in_F = mean(temp, na.rm = TRUE), mean_temp_in_C = mean(temp_in_C, na.rm = TRUE)) summary_monthly_temp # A tibble: 12 x 3 month mean_temp_in_F mean_temp_in_C &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 35.6 2.02 2 2 34.3 1.26 3 3 39.9 4.38 4 4 51.7 11.0 5 5 61.8 16.6 6 6 72.2 22.3 7 7 80.1 26.7 8 8 74.5 23.6 9 9 67.4 19.7 10 10 60.1 15.6 11 11 45.0 7.22 12 12 38.4 3.58 Let’s consider another example. Passengers are often frustrated when their flight departs late, but aren’t as annoyed if, in the end, pilots can make up some time during the flight. This is known in the airline industry as “gain” and we will create this variable using the mutate() function: flights &lt;- flights %&gt;% mutate(gain = dep_delay - arr_delay) Let’s take a look at only the dep_delay, arr_delay, and the resulting gain variables for the first 5 rows in our updated flights data frame in Table 3.1. TABLE 3.1: First five rows of departure/arrival delay and gain variables. dep_delay arr_delay gain 2 11 -9 4 20 -16 2 33 -31 -1 -18 17 -6 -25 19 The flight in the first row departed 2 minutes late but arrived 11 minutes late, so its “gained time in the air” is a loss of 9 minutes, hence its gain is 2 - 11 = -9. On the other hand, the flight in the fourth row departed a minute early (dep_delay of -1) but arrived 18 minutes early (arr_delay of -18), so its “gained time in the air” is -1 - (-18) = -1 + 18 = 17 minutes, hence its gain is +17. 
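The arithmetic in Table 3.1 can be reproduced with a short sketch that uses only the five rows shown in the table. The tibble name delays is hypothetical and stands in for the first five rows of flights:

```r
library(dplyr)

# The dep_delay and arr_delay values shown in Table 3.1
delays <- tibble(
  dep_delay = c(2, 4, 2, -1, -6),
  arr_delay = c(11, 20, 33, -18, -25)
)

# gain is positive when time was made up in the air, negative when it was lost
delays_with_gain <- delays %>%
  mutate(gain = dep_delay - arr_delay)

delays_with_gain$gain  # -9 -16 -31 17 19
```

The computed values match the gain column of Table 3.1 row by row.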
Let’s look at some summary statistics of the gain variable by considering multiple summary functions at once in the same summarize() code: gain_summary &lt;- flights %&gt;% summarize( min = min(gain, na.rm = TRUE), q1 = quantile(gain, 0.25, na.rm = TRUE), median = quantile(gain, 0.5, na.rm = TRUE), q3 = quantile(gain, 0.75, na.rm = TRUE), max = max(gain, na.rm = TRUE), mean = mean(gain, na.rm = TRUE), sd = sd(gain, na.rm = TRUE), missing = sum(is.na(gain)) ) gain_summary # A tibble: 1 x 8 min q1 median q3 max mean sd missing &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; 1 -196 -3 7 17 109 5.66 18.0 9430 We see for example that the average gain is +5 minutes while the largest is +109 minutes! However, this code would take some time to type out in practice. We’ll see later on in Subsection 5.1.1 that there is a much more succinct way to compute a variety of common summary statistics: using the skim() function from the skimr package. Recall from Section 2.5 that since gain is a numerical variable, we can visualize its distribution using a histogram. ggplot(data = flights, mapping = aes(x = gain)) + geom_histogram(color = &quot;white&quot;, bins = 20) FIGURE 3.6: Histogram of gain variable. The resulting histogram in Figure 3.6 provides a different perspective on the gain variable than the summary statistics we computed earlier. For example, note that most values of gain are right around 0. To close out our discussion on the mutate() function to create new variables, note that we can create multiple new variables at once in the same mutate() code. Furthermore, within the same mutate() code we can refer to new variables we just created. 
As an example, consider the mutate() code Hadley Wickham and Garrett Grolemund show in Chapter 5 of “R for Data Science” (Grolemund and Wickham 2016): flights &lt;- flights %&gt;% mutate( gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) Learning check (LC3.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value? (LC3.11) Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights. (LC3.12) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values. 3.6 arrange and sort rows One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in alphanumeric order of one of the variables. The dplyr package’s arrange() function allows us to sort/reorder a data frame’s rows according to the values of the specified variable. Suppose we are interested in determining the most frequent destination airports for all domestic flights departing from New York City in 2013: freq_dest &lt;- flights %&gt;% group_by(dest) %&gt;% summarize(num_flights = n()) freq_dest # A tibble: 105 x 2 dest num_flights &lt;chr&gt; &lt;int&gt; 1 ABQ 254 2 ACK 265 3 ALB 439 4 ANC 8 5 ATL 17215 6 AUS 2439 7 AVL 275 8 BDL 443 9 BGR 375 10 BHM 297 # … with 95 more rows Observe that by default the rows of the resulting freq_dest data frame are sorted in alphabetical order of dest destination. 
Say instead we would like to see the same data, but sorted from the most to the least number of flights num_flights: freq_dest %&gt;% arrange(num_flights) # A tibble: 105 x 2 dest num_flights &lt;chr&gt; &lt;int&gt; 1 LEX 1 2 LGA 1 3 ANC 8 4 SBN 10 5 HDN 15 6 MTJ 15 7 EYW 17 8 PSP 19 9 JAC 25 10 BZN 36 # … with 95 more rows This is however the opposite of what we want. The rows are sorted with the least frequent destination airports displayed first. This is because arrange() returns rows sorted in ascending order by default. To switch the ordering to be in “descending” order instead, we use the desc() function as so: freq_dest %&gt;% arrange(desc(num_flights)) # A tibble: 105 x 2 dest num_flights &lt;chr&gt; &lt;int&gt; 1 ORD 17283 2 ATL 17215 3 LAX 16174 4 BOS 15508 5 MCO 14082 6 CLT 14064 7 SFO 13331 8 FLL 12055 9 MIA 11728 10 DCA 9705 # … with 95 more rows 3.7 join data frames Another common data transformation task is “joining” or “merging” two different datasets. For example, in the flights data frame the variable carrier lists the carrier code for the different flights. While the corresponding airline names for &quot;UA&quot; and &quot;AA&quot; might be somewhat easy to guess (United and American Airlines), what airlines have codes &quot;VX&quot;, &quot;HA&quot;, and &quot;B6&quot;? This information is provided in a separate data frame airlines. View(airlines) We see that in airlines, carrier is the carrier code while name is the full name of the airline company. Using this table, we can see that &quot;VX&quot;, &quot;HA&quot;, and &quot;B6&quot; correspond to Virgin America, Hawaiian Airlines, and JetBlue respectively. However, wouldn’t it be nice to have all this information in a single data frame instead of two separate data frames? We can do this by “joining” i.e. “merging” the flights and airlines data frames. 
Note that the values in the variable carrier in the flights data frame match the values in the variable carrier in the airlines data frame. In this case, we can use the variable carrier as a key variable to match the rows of the two data frames. Key variables are almost always identification variables that uniquely identify the observational units as we saw in Subsection 1.4.4. This ensures that rows in both data frames are appropriately matched during the join. Hadley Wickham and Garrett Grolemund (Grolemund and Wickham 2016) created the following diagram to help us understand how the different data frames in the nycflights13 package are linked by various key variables: FIGURE 3.7: Data relationships in nycflights13 from R for Data Science. 3.7.1 Matching “key” variable names In both the flights and airlines data frames, the key variable we want to join/merge/match the rows by has the same name: carrier. Let’s use the inner_join() function to join the two data frames, where the rows will be matched by the variable carrier, and then compare the resulting data frames: flights_joined &lt;- flights %&gt;% inner_join(airlines, by = &quot;carrier&quot;) View(flights) View(flights_joined) Observe that the flights and flights_joined data frames are identical except that flights_joined has an additional variable name. The values of name correspond to the airline companies’ names as indicated in the airlines data frame. A visual representation of the inner_join() is shown in Figure 3.8 (Grolemund and Wickham 2016). There are other types of joins available (such as left_join(), right_join(), full_join(), and anti_join()), but the inner_join() will solve nearly all of the problems you’ll encounter in this book. FIGURE 3.8: Diagram of inner join from R for Data Science. 
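To make the matching behaviour of inner_join() concrete, here is a minimal sketch on two made-up tibbles. The names toy_flights and toy_airlines, the flight numbers, and the carrier code "ZZ" are hypothetical; "ZZ" is included on purpose as a code with no match:

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_flights <- tibble(
  flight  = c(1545, 1714, 1141, 725),
  carrier = c("UA", "UA", "AA", "ZZ")  # "ZZ" is a made-up code with no match
)
toy_airlines <- tibble(
  carrier = c("UA", "AA", "B6"),
  name    = c("United Air Lines Inc.", "American Airlines Inc.", "JetBlue Airways")
)

# Rows are matched on the shared key variable carrier
toy_joined <- toy_flights %>%
  inner_join(toy_airlines, by = "carrier")

toy_joined
# Three rows remain: the "ZZ" flight has no match in toy_airlines and is dropped,
# and "B6" contributes no rows because no toy flight uses that carrier
```

An inner join keeps only rows whose key value appears in both data frames. In the real flights and airlines data every carrier code has a match, which is why flights_joined keeps all the rows of flights.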
3.7.2 Different “key” variable names Say instead you are interested in the destinations of all domestic flights departing NYC in 2013 and you ask yourself questions like: “What cities are these airports in?” or “Is &quot;ORD&quot; Orlando?” or “Where is &quot;FLL&quot;?” The airports data frame contains the airport codes for each airport: View(airports) However, if you look at both the airports and flights data frames, you’ll find that the airport codes are in variables that have different names. In airports the airport code is in faa whereas in flights the airport codes are in origin and dest. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure 3.7. In order to join these two data frames by airport code, our inner_join() operation will use the by = c(&quot;dest&quot; = &quot;faa&quot;) argument, which allows us to join two data frames where the key variable has a different name: flights_with_airport_names &lt;- flights %&gt;% inner_join(airports, by = c(&quot;dest&quot; = &quot;faa&quot;)) View(flights_with_airport_names) Let’s construct the chain of pipe operators %&gt;% that computes the number of flights from NYC to each destination, but also includes information about each destination airport: named_dests &lt;- flights %&gt;% group_by(dest) %&gt;% summarize(num_flights = n()) %&gt;% arrange(desc(num_flights)) %&gt;% inner_join(airports, by = c(&quot;dest&quot; = &quot;faa&quot;)) %&gt;% rename(airport_name = name) named_dests # A tibble: 101 x 9 dest num_flights airport_name lat lon alt tz dst tzone &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; 1 ORD 17283 Chicago Ohare Intl 42.0 -87.9 668 -6 A America… 2 ATL 17215 Hartsfield Jackson… 33.6 -84.4 1026 -5 A America… 3 LAX 16174 Los Angeles Intl 33.9 -118. 
126 -8 A America… 4 BOS 15508 General Edward Law… 42.4 -71.0 19 -5 A America… 5 MCO 14082 Orlando Intl 28.4 -81.3 96 -5 A America… 6 CLT 14064 Charlotte Douglas … 35.2 -80.9 748 -5 A America… 7 SFO 13331 San Francisco Intl 37.6 -122. 13 -8 A America… 8 FLL 12055 Fort Lauderdale Ho… 26.1 -80.2 9 -5 A America… 9 MIA 11728 Miami Intl 25.8 -80.3 8 -5 A America… 10 DCA 9705 Ronald Reagan Wash… 38.9 -77.0 15 -5 A America… # … with 91 more rows In case you didn’t know, &quot;ORD&quot; is the airport code of Chicago O’Hare airport and &quot;FLL&quot; is the main airport in Fort Lauderdale, Florida, which can be seen in the airport_name variable. 3.7.3 Multiple “key” variables Say instead we want to join two data frames by multiple key variables. For example, in Figure 3.7 we see that in order to join the flights and weather data frames, we need more than one key variable: year, month, day, hour, and origin. This is because the combination of these 5 variables acts to uniquely identify each observational unit in the weather data frame: hourly weather recordings at each of the 3 NYC airports. We achieve this by specifying a vector of key variables to join by using the c() function. Recall from Subsection 1.2.1 that c() is short for “combine” or “concatenate”. flights_weather_joined &lt;- flights %&gt;% inner_join(weather, by = c(&quot;year&quot;, &quot;month&quot;, &quot;day&quot;, &quot;hour&quot;, &quot;origin&quot;)) View(flights_weather_joined) Learning check (LC3.13) Looking at Figure 3.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour? (LC3.14) What surprises you about the top 10 destinations from NYC in 2013? 3.7.4 Normal forms The data frames included in the nycflights13 package are in a form that minimizes redundancy of data. 
For example, the flights data frame only saves the carrier code of the airline company; it does not include the actual name of the airline. For example the first row of flights has carrier equal to UA, but it does not include the airline name “United Air Lines Inc.” The names of the airline companies are included in the name variable of the airlines data frame. In order to have the airline company name included in flights, we could join these two data frames as follows: joined_flights &lt;- flights %&gt;% inner_join(airlines, by = &quot;carrier&quot;) View(joined_flights) We are capable of performing this join because each of the data frames has keys in common to relate one to another: the carrier variable in both the flights and airlines data frames. The key variable(s) that we base our joins on are often the identification variables we mentioned previously. This is an important property of what’s known as normal forms of data. The process of decomposing data frames into less redundant tables without losing information is called normalization. More information is available on Wikipedia. Both dplyr and the SQL database querying language (pronounced “sequel”) we mentioned in the introduction of this chapter use such normal forms. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily. Learning check (LC3.15) What are some advantages of data in normal forms? What are some disadvantages? 3.8 Other verbs Here are some other useful data wrangling verbs: select() only a subset of variables/columns. rename() variables/columns to have new names. Return only the top_n() values of a variable. 3.8.1 select variables FIGURE 3.9: Diagram of select() columns. We’ve seen that the flights data frame in the nycflights13 package contains 19 different variables. 
You can identify the names of these 19 variables by running the glimpse() function from the dplyr package: glimpse(flights) However, say you only need two of these 19 variables, say carrier and flight. You can select() these two variables: flights %&gt;% select(carrier, flight) This function makes it easier to explore large datasets since it allows us to limit the scope to only those variables we care most about. For example, if we select() only a smaller number of variables, it will make viewing the dataset in RStudio’s spreadsheet viewer more digestible. Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable year in the flights data frame. This variable isn’t quite a “variable” because it is always 2013 and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect year by using the - sign: flights_no_year &lt;- flights %&gt;% select(-year) Another way of selecting columns/variables is by specifying a range of columns: flight_arr_times &lt;- flights %&gt;% select(month:day, arr_time:sched_arr_time) flight_arr_times This will select() all columns between month and day, as well as between arr_time and sched_arr_time, and drop the rest. The select() function can also be used to reorder columns when used with the everything() helper function. For example, suppose we want the hour, minute, and time_hour variables to appear immediately after the year, month, and day variables, while not discarding the rest of the variables. In the following code, everything() will pick up all remaining variables: flights_reorder &lt;- flights %&gt;% select(year, month, day, hour, minute, time_hour, everything()) glimpse(flights_reorder) Lastly, the helper functions starts_with(), ends_with(), and contains() can be used to select variables/columns that match those conditions. 
For example: flights_begin_a &lt;- flights %&gt;% select(starts_with(&quot;a&quot;)) flights_begin_a flights_delays &lt;- flights %&gt;% select(ends_with(&quot;delay&quot;)) flights_delays flights_time &lt;- flights %&gt;% select(contains(&quot;time&quot;)) flights_time 3.8.2 rename variables Another useful function is rename(), which as you may have guessed renames variables. Suppose we want dep_time and arr_time to be departure_time and arrival_time instead in the flights_time data frame: flights_time_new &lt;- flights %&gt;% select(dep_time, arr_time) %&gt;% rename(departure_time = dep_time, arrival_time = arr_time) glimpse(flights_time_new) Note that in this case we used a single = sign within the rename(). For example departure_time = dep_time renames the dep_time variable to have the new name departure_time. This is because we are not testing for equality like we would using ==. Instead we want to assign a new variable departure_time to have the same values as dep_time and then delete the variable dep_time. It’s easy to forget if the new name comes before or after the equals sign. We usually remember this as “New Before, Old After” or NBOA. 3.8.3 top_n values of a variable We can also return the top n values of a variable using the top_n() function. For example, we can return a data frame of the top 10 destination airports using the example from Section 3.7.2. Observe that we set the number of values to return to n = 10 and wt = num_flights to indicate that we want the rows corresponding to the top 10 values of num_flights. See the help file for top_n() by running ?top_n for more information. named_dests %&gt;% top_n(n = 10, wt = num_flights) Let’s further arrange() these results in descending order of num_flights: named_dests %&gt;% top_n(n = 10, wt = num_flights) %&gt;% arrange(desc(num_flights)) Learning check (LC3.16) What are some ways to select all three of the dest, air_time, and distance variables from flights? 
Give the code showing how to do this in at least three different ways. (LC3.17) How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains. (LC3.18) Why might we want to use the select function on a data frame? (LC3.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013. 3.9 Conclusion 3.9.1 Summary table Let’s recap our data wrangling verbs in Table 3.2. Using these verbs and the pipe %&gt;% operator from Section 3.1, you’ll be able to write easily legible code to perform almost all the data wrangling and data transformation necessary for the rest of this book. TABLE 3.2: Summary of data wrangling verbs. Verb Data wrangling operation filter() Pick out a subset of rows summarize() Summarize many values to one using a summary statistic function like mean(), median(), etc. group_by() Add grouping structure to rows in data frame. Note this does not change values in data frame. mutate() Create new variables by mutating existing ones arrange() Arrange rows of a data variable in ascending (default) or descending order inner_join() Join/merge two data frames, matching rows by a key variable Learning check (LC3.20) Let’s now put your newly acquired data wrangling skills to the test! An airline industry measure of a passenger airline’s capacity is the available seat miles, which is equal to the number of seats available multiplied by the number of miles or kilometers flown summed over all flights. So for example say an airline had 2 flights using a plane with 10 seats that flew 500 miles and 3 flights using a plane with 20 seats that flew 1000 miles, the available seat miles would be 2 \\(\\times\\) 10 \\(\\times\\) 500 \\(+\\) 3 \\(\\times\\) 20 \\(\\times\\) 1000 = 70,000 seat miles. 
Using the datasets included in the nycflights13 package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints: Crucial: Unless you are very confident in what you are doing, it is worthwhile not to start coding right away. Rather first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code). Take a close look at all the datasets using the View() function: flights, weather, planes, airports, and airlines to identify which variables are necessary to compute available seat miles. Figure 3.7 showing how the various datasets can be joined will also be useful. Consider the data wrangling verbs in Table 3.2 as your toolbox! 3.9.2 Additional resources An R script file of all R code used in this chapter is available here. If you want to further unlock the power of the dplyr package for data wrangling, we suggest that you check out RStudio’s “Data Transformation with dplyr” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter, in particular more-intermediate level and advanced data wrangling functions, while providing quick and easy to read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure 3.1 on filter(), originate from this cheatsheet. You can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Transformation with dplyr”. You can see a preview in the figure below. FIGURE 3.10: Data Transformation with dplyr cheatsheet. 
On top of data wrangling verbs and examples we presented in this section, if you’d like to see more examples of using the dplyr package for data wrangling check out Chapter 5 of Garrett Grolemund and Hadley Wickham’s book (Grolemund and Wickham 2016). 3.9.3 What’s to come? So far in this book, we’ve explored, visualized, and wrangled data saved in data frames. These data frames were saved in a spreadsheet-like format: in a rectangular shape with a certain number of rows corresponding to observations and a certain number of columns corresponding to variables describing these observations. We’ll see in the upcoming Chapter 4 that there are actually two ways to represent data in spreadsheet-type rectangular format: 1) “wide” format and 2) “tall/narrow” format. The tall/narrow format is also known as “tidy” format in R user circles. While the distinction between “tidy” and non-“tidy” formatted data is very subtle, it has very large implications for our data science work. This is because almost all the packages used in this book, including the ggplot2 package for data visualization and the dplyr package for data wrangling, all assume that all data frames are in “tidy” format. Furthermore, up until now we’ve only explored, visualized, and wrangled data saved within R packages. But what if you want to analyze data that you have saved in a Microsoft Excel, a Google Sheets, or a “Comma-Separated Values” (CSV) file? In Section 4.1, we’ll show you how to import this data into R using the readr package. References "],
-["4-tidy.html", "Chapter 4 Data Importing &amp; “Tidy” Data 4.1 Importing data 4.2 Tidy data 4.3 Case study: Democracy in Guatemala 4.4 tidyverse package 4.5 Conclusion", " Chapter 4 Data Importing &amp; “Tidy” Data In Subsection 1.2.1, we introduced the concept of a data frame in R: a rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation. In Section 1.4, we started exploring our first data frame: the flights data frame included in the nycflights13 package. In Chapter 2 we created visualizations based on the data included in flights and other data frames such as weather. In Chapter 3, we learned how to wrangle data, in other words take existing data frames and transform/modify them to suit our ends. In this final chapter of the “Data Science via the tidyverse” portion of the book, we extend some of these ideas by discussing a type of data formatting called “tidy” data. You will see that having data stored in “tidy” format is about more than what the everyday definition of the term “tidy” might suggest: having your data “neatly organized.” Instead, we define the term “tidy” as it’s used by data scientists who use R, outlining a set of rules by which data is saved. Knowledge of this type of data formatting was not necessary for our treatment of data visualization in Chapter 2 and data wrangling in Chapter 3. This is because all the data used was already in “tidy” format. In this chapter, we’ll now see that this format is essential to using the tools we covered up until now. Furthermore, it will also be useful for all subsequent chapters in this book when we cover regression and statistical inference. First however, we’ll show you how to import spreadsheet data in R. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages. 
library(dplyr) library(ggplot2) library(readr) library(tidyr) library(nycflights13) library(fivethirtyeight) 4.1 Importing data Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online? How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats. First, a Comma Separated Values .csv file. You can think of a .csv file as a bare-bones spreadsheet where: Each line in the file corresponds to one row of data/one observation. Values for each line are separated with commas. In other words, the values of different variables are separated by commas. The first line is often, but not always, a header row indicating the names of the columns/variables. Second, an Excel .xlsx spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to a bare-bones .csv file, an .xlsx Excel file contains a lot of meta-data, or in other words, data about data. Recall we saw a previous example of meta-data in Section 3.4 when adding “group structure” meta-data to a data frame by using the group_by() verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros. Third, a Google Sheets file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values .csv and Excel .xlsx formats. One way to import Google Sheets data is to go to the Google Sheets menu bar -&gt; File -&gt; Download as -&gt; Select “Microsoft Excel” or “Comma-separated values” and then load that data into R. 
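To make the description of a .csv file concrete, here is a minimal sketch that writes a tiny three-line .csv to a temporary file and reads it back with read_csv(). The file and the names toy_path and toy are hypothetical and exist only for this illustration; the numbers are borrowed from the drinks data used in Section 4.2.

```r
library(readr)

# Hypothetical toy file, created only for this illustration
toy_path <- tempfile(fileext = ".csv")

writeLines(
  c("country,beer,wine",  # first line: header row with the column/variable names
    "Italy,85,237",       # each later line is one row of data/one observation
    "USA,249,84"),        # values of different variables are separated by commas
  toy_path
)

# Read the file back into a data frame (tibble)
toy <- read_csv(toy_path)
toy
```

The resulting data frame should have two rows (one per line after the header) and three columns named country, beer, and wine.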
We’ll cover two methods for importing .csv and .xlsx spreadsheet data in R: one using the console and the other using RStudio’s graphical user interface, abbreviated by “GUI.” 4.1.1 Using the console First, let’s import a Comma Separated Values .csv file that exists on the internet. The .csv file dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the read_csv() function from the readr (Wickham, Hester, and Francois 2018) package to read it off the web, import it into R, and save it in a data frame called dem_score. library(readr) dem_score &lt;- read_csv(&quot;https://moderndive.com/data/dem_score.csv&quot;) dem_score # A tibble: 96 x 10 country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Albania -9 -9 -9 -9 -9 -9 -9 -9 5 2 Argentina -9 -1 -1 -9 -9 -9 -8 8 7 3 Armenia -9 -7 -7 -7 -7 -7 -7 -7 7 4 Australia 10 10 10 10 10 10 10 10 10 5 Austria 10 10 10 10 10 10 10 10 10 6 Azerbaijan -9 -7 -7 -7 -7 -7 -7 -7 1 7 Belarus -9 -7 -7 -7 -7 -7 -7 -7 7 8 Belgium 10 10 10 10 10 10 10 10 10 9 Bhutan -10 -10 -10 -10 -10 -10 -10 -10 -10 10 Bolivia -4 -3 -3 -4 -7 -7 8 9 9 # … with 86 more rows In this dem_score data frame, the minimum value of -10 corresponds to a highly autocratic nation whereas a value of 10 corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number nor include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the dem_score data frame in a case study in the upcoming Section 4.3. Note that the read_csv() function included in the readr package is different than the read.csv() function that comes installed with R. 
While the difference in the names might seem trivial (an _ instead of a .), the read_csv() function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the read_csv() function included in the readr package saves data frames as tibbles by default. tibble is short for “tidy table”; we’ll discuss what it means for data to be “tidy” shortly in the upcoming Section 4.2. 4.1.2 Using RStudio’s interface Let’s read in the exact same data, but this time from an Excel file saved on your computer. Furthermore, we’ll do this using RStudio’s graphical interface instead of running read_csv() in the console. First, download the Excel file dem_score.xlsx by going to https://moderndive.com/data/dem_score.xlsx, then Go to the Files pane of RStudio. Navigate to the directory (i.e. folder on your computer) where the downloaded dem_score.xlsx Excel file is saved. For example, this might be in your Downloads folder. Click on dem_score.xlsx. Click “Import Dataset…” At this point you should see a screen pop-up like in Figure 4.1. After clicking on the “Import” button on the bottom right of Figure 4.1, RStudio will save this spreadsheet’s data in a data frame called dem_score and display its contents in the spreadsheet viewer. FIGURE 4.1: Importing an Excel file to R. Furthermore, note the “Code Preview” block in the bottom right of Figure 4.1. You can copy and paste this code to reload your data again later automatically, instead of repeating this manual point-and-click process. 4.2 Tidy data Let’s now switch gears and learn about the concept of “tidy” data format with a motivating example from the fivethirtyeight package. The fivethirtyeight package (Kim, Ismay, and Chunn 2018) provides access to the datasets used in many articles published by data journalism website FiveThirtyEight.com. 
For a complete list of all 107 data sets included in the fivethirtyeight package, check out the package webpage by going to https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html. Let’s focus our attention on the drinks data frame: drinks # A tibble: 193 x 5 country beer_servings spirit_servings wine_servings total_litres_of_pur… &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 Afghanistan 0 0 0 0 2 Albania 89 132 54 4.9 3 Algeria 25 0 14 0.7 4 Andorra 245 138 312 12.4 5 Angola 217 57 45 5.9 6 Antigua &amp; B… 102 128 45 4.9 7 Argentina 193 25 221 8.3 8 Armenia 21 179 11 3.8 9 Australia 261 72 212 10.4 10 Austria 279 75 191 9.7 # … with 183 more rows After reading the help file by running ?drinks, you’ll see that drinks is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?” Let’s apply some of the data wrangling verbs we learned in Chapter 3 on the drinks data frame: filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia then select() all columns except total_litres_of_pure_alcohol by using the - sign, then rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine respectively. 
and save the resulting data frame in drinks_smaller: drinks_smaller &lt;- drinks %&gt;% filter(country %in% c(&quot;USA&quot;, &quot;China&quot;, &quot;Italy&quot;, &quot;Saudi Arabia&quot;)) %&gt;% select(-total_litres_of_pure_alcohol) %&gt;% rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings) drinks_smaller # A tibble: 4 x 4 country beer spirit wine &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 China 79 192 8 2 Italy 85 42 237 3 Saudi Arabia 0 5 0 4 USA 249 158 84 Let’s now ask ourselves a question: “Using the drinks_smaller data frame, how would we create the side-by-side (i.e. dodged) barplot in Figure 4.2?” Recall we saw barplots displaying two categorical variables in Section 2.8.3. FIGURE 4.2: Comparing alcohol consumption in 4 countries. Let’s break down the Grammar of Graphics we introduced in Section 2.1: The categorical variable country with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the x-position of the bars. The numerical variable servings would have to be mapped to the y-position of the bars (the height of the bars). The categorical variable type with three levels (beer, spirit, wine) would have to be mapped to the fill color of the bars. Observe however that drinks_smaller has three separate variables beer, spirit, and wine. In order to use the ggplot() function to recreate the barplot in Figure 4.2 however, we need a single variable type with three possible values: beer, spirit, and wine. We could then map this type variable to the fill aesthetic of our plot. 
In other words, to recreate the barplot in Figure 4.2, our data frame would have to look like this: drinks_smaller_tidy # A tibble: 12 x 3 country type servings &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 China beer 79 2 Italy beer 85 3 Saudi Arabia beer 0 4 USA beer 249 5 China spirit 192 6 Italy spirit 42 7 Saudi Arabia spirit 5 8 USA spirit 158 9 China wine 8 10 Italy wine 237 11 Saudi Arabia wine 0 12 USA wine 84 Let’s compare drinks_smaller_tidy to the drinks_smaller data frame from earlier: drinks_smaller # A tibble: 4 x 4 country beer spirit wine &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 China 79 192 8 2 Italy 85 42 237 3 Saudi Arabia 0 5 0 4 USA 249 158 84 Observe that while drinks_smaller and drinks_smaller_tidy are both rectangular in shape and contain the same 12 numerical values (3 alcohol types \\(\\times\\) 4 countries), they are formatted differently. drinks_smaller is formatted in what’s known as “wide” format, whereas drinks_smaller_tidy is formatted in what’s known as “long/narrow” format. In the context of doing data science in R, long/narrow format is also known as “tidy” format. In order to use the ggplot2 and dplyr packages for data visualization and data wrangling, your input data frames must be in “tidy” format. Thus, all non-“tidy” data must be converted to “tidy” format first. Before we show you how to convert non-“tidy” data frames like drinks_smaller to “tidy” data frames like drinks_smaller_tidy, let’s go over the explicit definition of “tidy” data. 4.2.1 Definition of “tidy” data You have surely heard the word “tidy” in your life: “Tidy up your room!” “Please write your homework in a tidy way so that it is easier to grade and to provide feedback.” Marie Kondo’s best-selling book The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing and Netflix TV series Tidying Up with Marie Kondo. 
“I am not by any stretch of the imagination a tidy person, and the piles of unread books on the coffee table and by my bed have a plaintive, pleading quality to me - ‘Read me, please!’” - Linda Grant What does it mean for your data to be “tidy”? While “tidy” has a clear English meaning of “organized”, “tidy” in the context of data science using R means that your data follows a standardized format. We will follow Hadley Wickham’s definition of tidy data (Wickham 2014). A dataset is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes. “Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. FIGURE 4.3: Tidy data graphic from R for Data Science. For example, say you have the following table of stock prices in Table 4.1: TABLE 4.1: Stock prices (non-tidy format) Date Boeing stock price Amazon stock price Google stock price 2009-01-01 $173.55 $174.90 $174.34 2009-01-02 $172.61 $171.42 $170.04 Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format each variable should be its own column, as shown in Table 4.2. 
Notice that both tables present the same information, but in different formats. TABLE 4.2: Stock prices (tidy format) Date Stock name Stock price 2009-01-01 Boeing $173.55 2009-01-02 Boeing $172.61 2009-01-01 Amazon $174.90 2009-01-02 Amazon $171.42 2009-01-01 Google $174.34 2009-01-02 Google $170.04 Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table 4.3. TABLE 4.3: Example of tidy data. Date Boeing Price Weather 2009-01-01 $173.55 Sunny 2009-01-02 $172.61 Overcast In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table 4.1, the data is “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing stock price, and the weather that particular day. Learning check (LC4.1) What are common characteristics of “tidy” data frames? (LC4.2) What makes “tidy” data frames useful for organizing data? 4.2.2 Converting to “tidy” data In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already in “tidy” format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide i.e. non-“tidy” format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format using the gather() function in the tidyr package (Wickham and Henry 2019). 
Going back to our drinks_smaller data frame from earlier: drinks_smaller # A tibble: 4 x 4 country beer spirit wine &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 China 79 192 8 2 Italy 85 42 237 3 Saudi Arabia 0 5 0 4 USA 249 158 84 We convert it to “tidy” format by using the gather() function from the tidyr package as follows: drinks_smaller_tidy &lt;- drinks_smaller %&gt;% gather(key = type, value = servings, -country) drinks_smaller_tidy # A tibble: 12 x 3 country type servings &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 China beer 79 2 Italy beer 85 3 Saudi Arabia beer 0 4 USA beer 249 5 China spirit 192 6 Italy spirit 42 7 Saudi Arabia spirit 5 8 USA spirit 158 9 China wine 8 10 Italy wine 237 11 Saudi Arabia wine 0 12 USA wine 84 We set the arguments to gather() as follows: key is the name of the variable in the new “tidy” data frame that will contain the column names of the original data. Observe how we set key = type. In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. value is the name of the variable in the new “tidy” data frame that will contain the rows and columns of values of the original data. Observe how we set value = servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 \\(\\times\\) 3 = 12 numerical values. The third argument is the columns you either want to or don’t want to tidy. Observe how we set this to -country indicating that we don’t want to tidy the country variable in drinks_smaller and rather only beer, spirit, and wine. The third argument is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output: drinks_smaller_tidy &lt;- drinks_smaller %&gt;% gather(key = type, value = servings, c(beer, spirit, wine)) drinks_smaller_tidy Note that the third argument now specifies which columns we want to tidy c(beer, spirit, wine), instead of the columns we don’t want to tidy using -country. 
We use the c() function to create a vector of the columns in drinks_smaller that we’d like to tidy. With our drinks_smaller_tidy “tidy” formatted data frame, we can now produce the barplot you saw in Figure 4.2 using geom_col(). Recall from Section 2.8 on barplots that we use geom_col() and not geom_bar(), since we would like to map the “pre-counted” servings variable to the y-aesthetic of the bars. ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) + geom_col(position = &quot;dodge&quot;) FIGURE 4.4: Comparing alcohol consumption in 4 countries. Converting “wide” format data to “tidy” format often confuses new R users. The only way to get comfortable with the gather() function is with practice, practice, and more practice. For example, run ?gather and look at the examples in the bottom of the help file. We’ll show another example of using gather() to convert a “wide” formatted data frame to “tidy” format in Section 4.3. For other examples of converting a dataset into “tidy” format, check out the different functions available for data tidying and a case study using data from the World Health Organization in R for Data Science (Grolemund and Wickham 2016). Learning check (LC4.3) Take a look at the airline_safety data frame included in the fivethirtyeight data package. Run the following: airline_safety After reading the help file by running ?airline_safety, we see that airline_safety is a data frame containing information on different airline companies’ safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver’s article “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”. 
Let’s ignore the incl_reg_subsidiaries and avail_seat_km_per_week variables for simplicity: airline_safety_smaller &lt;- airline_safety %&gt;% select(-c(incl_reg_subsidiaries, avail_seat_km_per_week)) airline_safety_smaller # A tibble: 56 x 7 airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14 &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 Aer Li… 2 0 0 0 2 Aerofl… 76 14 128 6 3 Aeroli… 6 0 0 1 4 Aerome… 3 1 64 5 5 Air Ca… 2 0 0 2 6 Air Fr… 14 4 79 6 7 Air In… 2 1 329 4 8 Air Ne… 3 0 0 5 9 Alaska… 5 0 0 5 10 Alital… 7 2 50 4 # … with 46 more rows, and 2 more variables: fatal_accidents_00_14 &lt;int&gt;, # fatalities_00_14 &lt;int&gt; This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable incident_type_years indicating the incident type/year and a variable count of the counts? 4.2.3 nycflights13 package Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. Let’s revisit the flights data frame by running View(flights). We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to different characteristics/measurements of each flight. This satisfied the first two criteria of the definition of “tidy” data from Subsection 4.2.1: that “Each variable forms a column” and “Each observation forms a row.” But what about the third property of “tidy” data that “Each type of observational unit forms a table”? Recall that we also saw in Section 1.4.3 that the observational unit for the flights data frame is an individual flight. In other words, the rows of the flights data frame refer to characteristics/measurements of individual flights. 
Also included in the nycflights13 package are other data frames with their rows representing different observational units (Wickham 2018): airlines: translation between two letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company. planes: aircraft information about each of 3,322 planes used. i.e. the observational unit is an aircraft. weather: hourly meteorological data (about 8705 observations) for each of the three NYC airports. i.e. the observational unit is an hourly measurement of weather at one of the three airports. airports: airport names and locations. i.e. the observational unit is an airport. The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table i.e. data frame. You could think of this property as the old English expression: “birds of a feather flock together.” 4.3 Case study: Democracy in Guatemala In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (in other words is in “wide” format) to a data frame that is in “tidy” format (in other words is in “long/narrow” format). We’ll do this using the gather() function from the tidyr package again. Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala. Recall that we saw time-series plots in Section 2.4 on creating linegraphs using geom_line(). Let’s use the dem_score data frame we imported in Section 4.1, but focus on only data corresponding to Guatemala. 
guat_dem &lt;- dem_score %&gt;% filter(country == &quot;Guatemala&quot;) guat_dem # A tibble: 1 x 10 country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Guatemala 2 -6 -5 3 1 -3 -7 3 3 Let’s lay out the Grammar of Graphics we saw in Section 2.1. First we know we need to set data = guat_dem and use a geom_line() layer, but what is the aesthetic mapping of variables. We’d like to see how the democracy score has changed over the years, so we need to map: year to the x-position aesthetic and democracy_score to the y-position aesthetic Now we are stuck in a predicament, much like with our drinks_smaller example in Section 4.2. We see that we have a variable named country, but its only value is &quot;Guatemala&quot;. We have other variables denoted by different year values. Unfortunately, the guat_dem data frame is not “tidy” and hence is not in the appropriate format to apply the Grammar of Graphics and thus we cannot use the ggplot2 package just yet. We need to take the values of the columns corresponding to years in guat_dem and convert them into a new “key” variable called year. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “value” variable called democracy_score. Our resulting data frame will thus have three columns: country, year, and democracy_score. 
Recall that the gather() function in the tidyr package can complete this task for us: guat_dem_tidy &lt;- guat_dem %&gt;% gather(key = year, value = democracy_score, -country) guat_dem_tidy # A tibble: 9 x 3 country year democracy_score &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; 1 Guatemala 1952 2 2 Guatemala 1957 -6 3 Guatemala 1962 -5 4 Guatemala 1967 3 5 Guatemala 1972 1 6 Guatemala 1977 -3 7 Guatemala 1982 -7 8 Guatemala 1987 3 9 Guatemala 1992 3 We set the arguments to gather() as follows: key is the name of the variable in the new “tidy” data frame that will contain the column names of the original data. Observe how we set key = year. In the resulting guat_dem_tidy, the column year contains the years where Guatemala’s democracy scores were measured. value is the name of the variable in the new “tidy” data frame that will contain the rows and columns of values of the original data. Observe how we set value = democracy_score. In the resulting guat_dem_tidy the column democracy_score contains the 1 \\(\\times\\) 9 = 9 democracy scores. The third argument is the columns you either want to or don’t want to tidy. Observe how we set this to -country indicating that we don’t want to tidy the country variable in guat_dem and rather only variables 1952 through 1992. However, observe in the output for guat_dem_tidy that the year variable is of type chr or character. Before we can plot this variable on the x-axis, we need to convert it into a numerical variable using the as.numeric() function within the mutate() function, which we saw in Section 3.5 on mutating existing variables to create new ones. guat_dem_tidy &lt;- guat_dem_tidy %&gt;% mutate(year = as.numeric(year)) We can now create the time-series plot to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using a geom_line(). 
ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) + geom_line() + labs(x = &quot;Year&quot;, y = &quot;Democracy Score&quot;) FIGURE 4.5: Democracy scores in Guatemala 1952-1992. Learning check (LC4.4) Convert the dem_score data frame into a tidy data frame and assign the name of dem_score_tidy to the resulting long-formatted data frame. (LC4.5) Read in the life expectancy data stored at https://moderndive.com/data/le_mess.csv and convert it to a tidy data frame. 4.4 tidyverse package Notice at the beginning of the chapter we loaded the following four packages, which are among the most frequently used R packages for data science: library(dplyr) library(ggplot2) library(readr) library(tidyr) There is a much quicker way to load these packages than by individually loading them: by installing and loading the tidyverse package. The tidyverse package acts as an “umbrella” package whereby installing/loading it will install/load multiple packages at once for you. After installing the tidyverse package as you would a normal package via install.packages(&quot;tidyverse&quot;), running: library(tidyverse) would be the same as running: library(ggplot2) library(dplyr) library(tidyr) library(readr) library(purrr) library(tibble) library(stringr) library(forcats) You’ve seen the first 4 of these packages: ggplot2 for data visualization, dplyr for data wrangling, tidyr for converting data to “tidy” format, and readr for importing spreadsheet data into R. The remaining packages (purrr, tibble, stringr, and forcats) are left for a more advanced book; check out R for Data Science to learn about these packages. For the remainder of this book, we’ll start every chapter by running library(tidyverse), instead of loading the various component packages individually. The tidyverse “umbrella” package gets its name from the fact that all the functions in all its packages are designed to have common inputs and outputs: data frames are in “tidy” format. 
This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible. For more information, check out the tidyverse.org webpage for the package. 4.5 Conclusion 4.5.1 Additional resources An R script file of all R code used in this chapter is available here. If you want to learn more about using the readr and tidyr packages, we suggest that you check out RStudio’s “Data Import Cheat Sheet.” You can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Browse Cheatsheets” -&gt; Scroll down the page to the “Data Import Cheat Sheet”. The first page of this cheatsheet has information on using the readr package to import data while the second page has information on using the tidyr package to “tidy” data. You can see a preview of both pages in the figures below. FIGURE 4.6: Data Import cheatsheet (first page): readr package. FIGURE 4.7: Data Import cheatsheet (second page): tidyr package. 4.5.2 What’s to come? Congratulations! You’ve completed the “Data Science with tidyverse” portion of this book! We’ll now move to the “Data modeling with moderndive” portion of this book in Chapters 5 and 6, where you’ll leverage your data visualization and wrangling skills to model relationships between different variables in data frames. However, we’re going to leave Chapter 10 on “Inference for Regression” until after we’ve covered statistical inference in Chapters 7, 8, and 9. Onwards and upwards! FIGURE 4.8: ModernDive flowchart - On to Part II! References "],
-["5-regression.html", "Chapter 5 Basic Regression 5.1 One numerical explanatory variable 5.2 One categorical explanatory variable 5.3 Related topics 5.4 Conclusion", " Chapter 5 Basic Regression Now that we are equipped with data visualization skills from Chapter 2, data wrangling skills from Chapter 3, and an understanding of how to import data and the concept of “tidy” data format from Chapter 4, let’s now proceed with data modeling. The fundamental premise of data modeling is to make explicit the relationship between: an outcome variable \\(y\\), also called a dependent variable or response variable, and an explanatory/predictor variable \\(x\\), also called an independent variable or covariate. Another way to state this is using mathematical terminology: we will model the outcome variable \\(y\\) “as a function” of the explanatory/predictor variable \\(x\\). When we say “function” here, we aren’t referring to functions in R like the ggplot() function, but rather to a mathematical function. But why do we have two different labels, explanatory and predictor, for the variable \\(x\\)? That’s because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes: Modeling for explanation: When you want to explicitly describe and quantify the relationship between the outcome variable \\(y\\) and a set of explanatory variables \\(x\\), determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any causal relationships between the variables. Modeling for prediction: When you want to predict an outcome variable \\(y\\) based on the information contained in a set of predictor variables \\(x\\). Unlike modeling for explanation, however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about \\(y\\) using the information in \\(x\\). 
For example, say you are interested in an outcome variable \\(y\\) of whether patients develop lung cancer and information \\(x\\) on their risk factors, such as smoking habits, age, and socioeconomic status. If we are modeling for explanation, we would be interested in both describing and quantifying the effects of the different risk factors. One reason could be that you want to design an intervention to reduce lung cancer incidence in a population, such as targeting smokers of a specific age group with advertising for smoking cessation programs. If we are modeling for prediction, however, we wouldn’t care so much about understanding how all the individual risk factors contribute to lung cancer, but rather only whether we can make good predictions of who will contract lung cancer. In this book, we’ll focus on modeling for explanation and hence refer to \\(x\\) as explanatory variables. If you are interested in learning about modeling for prediction, we suggest you check out books and courses on the field of machine learning. Furthermore, while there exist many techniques for modeling, such as tree-based models and neural networks, in this book we’ll focus on one particular technique: linear regression. Linear regression is one of the most commonly used and easy-to-understand approaches to modeling. Linear regression involves a numerical outcome variable \\(y\\) and explanatory variables \\(x\\) that are either numerical or categorical. Furthermore, the relationship between \\(y\\) and \\(x\\) is assumed to be linear, or in other words, a line. However, we’ll see that what constitutes a “line” will vary depending on the nature of your \\(x\\) explanatory variables. In Chapter 5 on basic regression, we’ll only consider models with a single explanatory variable \\(x\\). In Section 5.1, the explanatory variable will be numerical. This scenario is known as simple linear regression. In Section 5.2, the explanatory variable will be categorical. 
In Chapter 6 on multiple regression, we’ll extend the ideas behind basic regression and consider models with two explanatory variables \\(x_1\\) and \\(x_2\\). In Section 6.2, we’ll have one numerical and one categorical explanatory variable. In particular, we’ll consider two such models: interaction and parallel slopes models. In Section 6.1, we’ll have two numerical explanatory variables. In Chapter 10 on inference for regression, we’ll revisit our regression models and analyze the results using the tools for statistical inference you’ll develop in Chapters 7, 8, and 9 on sampling, confidence intervals, and hypothesis tests/p-values, respectively. Let’s now begin with basic regression, which refers to linear regression models with a single explanatory variable \\(x\\). We’ll also discuss important statistical concepts like the correlation coefficient, that “correlation isn’t necessarily causation,” and what it means for a line to be “best-fitting.” Needed packages Let’s now load all the packages needed for this chapter (this assumes you’ve already installed them). In this chapter we introduce some new packages: The tidyverse “umbrella” (Wickham 2017) package. Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages The moderndive package of datasets and functions for tidyverse-friendly introductory linear regression. The skimr (Quinn et al. 2019) package, which provides a simple-to-use function to quickly compute a wide array of commonly used summary statistics. If needed, read Section 1.3 for information on how to install and load R packages. 
library(tidyverse) library(moderndive) library(skimr) library(gapminder) 5.1 One numerical explanatory variable Why do some professors and instructors at universities and colleges receive high teaching evaluations from students while others don’t? Are there differences in teaching evaluations between instructors of different demographic groups? Could there be an impact due to student biases? These are all questions that are of interest to university/college administrators, as teaching evaluations are among the many criteria considered in determining which instructors and professors get promoted. Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer the following research question: what factors can explain differences in instructor teaching evaluation scores? To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at openintro.org. In this section, we’ll keep things simple for now and try to explain differences in instructor teaching scores as a function of one numerical variable: the instructor’s “beauty” score (we’ll describe how this score was determined shortly). Could it be that instructors with higher “beauty” scores also have higher teaching evaluations? Could it be instead that instructors with higher “beauty” scores tend to have lower teaching evaluations? Or could it be there is no relationship between “beauty” score and teaching evaluations? We’ll answer these questions by modeling the relationship between teaching scores and “beauty” scores using simple linear regression where we have: A numerical outcome variable \\(y\\), the instructor’s teaching score and A single numerical explanatory variable \\(x\\), the instructor’s “beauty” score. 5.1.1 Exploratory data analysis The data on the 463 courses at UT Austin can be found in the evals data frame included in the moderndive package. 
However, to keep things simple, let’s select() only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called evals_ch6: evals_ch6 &lt;- evals %&gt;% select(ID, score, bty_avg, age) A crucial step before doing any kind of analysis or modeling is performing an exploratory data analysis, or EDA for short. Exploratory data analysis gives you a sense of the distributions of the individual variables in your data, whether any potential relationships exist between variables, whether there are outliers and/or missing values, and most importantly, how to build your model. Here are three common steps in an exploratory data analysis. Most crucially, looking at the raw data values. Computing summary statistics, such as means, medians, and interquartile ranges. Creating data visualizations. Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. Because this step seems so trivial, unfortunately many data analysts ignore it. However, getting an early sense of what your raw data looks like can often prevent many larger issues down the road. You can do this by using RStudio’s spreadsheet viewer or by using the glimpse() function as introduced in Section 1.4.3 on exploring data frames: glimpse(evals_ch6) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4… $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4… Observe that Observations: 463 indicates that there are 463 rows/observations in evals_ch6, where each row corresponds to one observed course at UT Austin. It is important to note that the observational units are individual courses and not individual instructors. 
Recall from Subsection 1.4.3 that the observational unit is the “type of thing” that is being measured by our variables. Since instructors teach more than one course in an academic year, the same instructor will appear more than once in the data. Hence there are fewer than 463 unique instructors being represented in evals_ch6. We’ll revisit this idea in Section 10.3, when we talk about the “independence assumption” for inference for regression. A full description of all the variables included in evals can be found at openintro.org and by reading the associated help file (run ?evals in the console). However, let’s fully describe the 4 variables we selected in evals_ch6: ID: An identification variable used to distinguish between the 1 through 463 courses in the dataset. score: A numerical variable of the course instructor’s average teaching score, where the average is computed from the evaluation scores from all students in that course. Teaching scores of 1 are lowest and 5 are highest. This is the outcome variable \\(y\\) of interest. bty_avg: A numerical variable of the course instructor’s average “beauty” score, where the average is computed from a separate panel of 6 students. “Beauty” scores of 1 are lowest and 10 are highest. This is the explanatory variable \\(x\\) of interest. age: A numerical variable of the course instructor’s age. This will be another explanatory variable \\(x\\) we’ll study later. An alternative way to look at the raw data values is by choosing a random sample of the rows in evals_ch6 by piping it into the sample_n() function from the dplyr package. Here we set the size argument to be 5, indicating that we want a random sample of 5 rows. We display the results in Table 5.1. Note that due to the random nature of the sampling, you will likely end up with a different subset of 5 rows. 
evals_ch6 %&gt;% sample_n(size = 5) TABLE 5.1: A random sample of 5 out of the 463 courses at UT Austin ID score bty_avg age 129 3.7 3.00 62 109 4.7 4.33 46 28 4.8 5.50 62 434 2.8 2.00 62 330 4.0 2.33 64 Now that we’ve looked at the raw values in our evals_ch6 data frame and got a preliminary sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s start by computing the mean and median of our numerical outcome variable score and our numerical explanatory variable bty_avg “beauty” score. We’ll do this by using the summarize() function from dplyr along with the mean() and median() summary functions we saw in Section 3.3. evals_ch6 %&gt;% summarize(mean_bty_avg = mean(bty_avg), mean_score = mean(score), median_bty_avg = median(bty_avg), median_score = median(score)) # A tibble: 1 x 4 mean_bty_avg mean_score median_bty_avg median_score &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 4.42 4.17 4.33 4.3 However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles? Typing out all these summary statistic functions in summarize() would be long and tedious. Instead, let’s use the convenient skim() function from the skimr package. This function takes in a data frame, “skims” it, and returns commonly used summary statistics. 
Let’s take our evals_ch6 data frame, select() only the outcome and explanatory variables teaching score and bty_avg, and pipe them into the skim() function: evals_ch6 %&gt;% select(score, bty_avg) %&gt;% skim() Skim summary statistics n obs: 463 n variables: 2 ── Variable type:numeric ─────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 bty_avg 0 463 463 4.42 1.53 1.67 3.17 4.33 5.5 8.17 score 0 463 463 4.17 0.54 2.3 3.8 4.3 4.6 5 (Note that for formatting purposes, the inline histogram that is usually printed with skim() has been removed.) For our two numerical variables teaching score and “beauty” score bty_avg it returns: missing: the number of missing values complete: the number of non-missing or complete values n: the total number of values mean: the mean AKA average sd: the standard deviation p0: the 0th percentile: the value at which 0% of observations are smaller than it AKA the minimum value p25: the 25th percentile: the value at which 25% of observations are smaller than it AKA the 1st quartile p50: the 50th percentile: the value at which 50% of observations are smaller than it AKA the 2nd quartile and more commonly the median p75: the 75th percentile: the value at which 75% of observations are smaller than it AKA the 3rd quartile p100: the 100th percentile: the value at which 100% of observations are smaller than it AKA the maximum value Looking at this output, we get an idea of how the values of both variables distribute. For example, the mean teaching score was 4.17 out of 5 whereas the mean “beauty” score was 4.42 out of 10. Furthermore, the middle 50% of teaching scores were between 3.80 and 4.6 (the first and third quartiles) whereas the middle 50% of “beauty” scores were between 3.17 and 5.5 out of 10. However, the skim() function only returns what are known as univariate summary statistics: functions that take a single variable and return some numerical summary of that variable. 
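As a small illustration of such univariate summaries, the percentile columns that skim() reports can be reproduced with base R’s quantile() function. The sketch below uses a hypothetical toy vector (toy_scores is an invented name holding invented values, not data from evals_ch6), chosen so that each requested percentile lands exactly on one of the five values:

```r
# Hypothetical toy data for illustration only; not taken from evals_ch6
toy_scores <- c(2.3, 3.8, 4.3, 4.6, 5.0)

# The 0th, 25th, 50th, 75th, and 100th percentiles,
# i.e. the p0, p25, p50, p75, and p100 columns in skim()'s naming
toy_percentiles <- quantile(toy_scores, probs = c(0, 0.25, 0.50, 0.75, 1))
toy_percentiles

# The middle 50% of values lies between the 1st and 3rd quartiles;
# the width of that range is the interquartile range
toy_iqr <- IQR(toy_scores)
toy_iqr
```

With real data such as evals_ch6, a requested percentile often falls between two observed values, in which case quantile() interpolates between them by default. 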
However, there also exist bivariate summary statistics: functions that take in two variables and return some summary of those two variables. In particular, when the two variables are numerical, we can compute the correlation coefficient. Generally speaking, coefficients are quantitative expressions of a specific phenomenon. A correlation coefficient is a quantitative expression of the strength of the linear relationship between two numerical variables. Its value ranges between -1 and 1 where: -1 indicates a perfect negative relationship: As the value of one variable goes up, the value of the other variable tends to go down. 0 indicates no relationship: The values of both variables go up/down independently of each other. +1 indicates a perfect positive relationship: As the value of one variable goes up, the value of the other variable tends to go up as well. Figure 5.1 gives examples of 9 different correlation coefficient values for hypothetical numerical variables \\(x\\) and \\(y\\). For example, observe in the top right plot that for a correlation coefficient of -0.75 there is a negative linear relationship between \\(x\\) and \\(y\\), but it is not as strong as the negative linear relationship between \\(x\\) and \\(y\\) when the correlation coefficient is -0.9 or -1. FIGURE 5.1: Different correlation coefficients. The correlation coefficient can be computed using the get_correlation() function in the moderndive package, where in this case the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient. We put the name of the response variable on the left-hand side of the ~ “tilde” sign, while putting the name of the explanatory variable on the right-hand side. This is known as R’s formula notation. We will use this same “formula” syntax with regression later in this chapter. 
evals_ch6 %&gt;% get_correlation(formula = score ~ bty_avg) # A tibble: 1 x 1 correlation &lt;dbl&gt; 1 0.187 An alternative way to compute the correlation coefficient is to use the cor() function within a summarize(): evals_ch6 %&gt;% summarize(correlation = cor(score, bty_avg)) # A tibble: 1 x 1 correlation &lt;dbl&gt; 1 0.187 In our case, the correlation coefficient of 0.187 indicates that the relationship between teaching evaluation score and “beauty” average is “weakly positive.” There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that aren’t close to the extreme values of -1, 0, and 1. To develop your intuition about correlation coefficients, play the “Guess the Correlation” 1980’s style video game in Subsection 5.4.1. Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Since both the score and bty_avg variables are numerical, a scatterplot is an appropriate graph to visualize this data. Let’s do this using geom_point() and display the result in Figure 5.2. Furthermore, let’s highlight the 6 points in the top right of the visualization in an orange box. ggplot(evals_ch6, aes(x = bty_avg, y = score)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Scatterplot of relationship of teaching and beauty scores&quot;) FIGURE 5.2: Instructor evaluation scores at UT Austin. Observe that most “beauty” scores lie between 2 and 8 while most teaching scores lie between 3 and 5. Furthermore, while opinions may vary, it is our opinion that the relationship between teaching score and “beauty” score is “weakly positive.” This is consistent with our earlier computed correlation coefficient of 0.187. Furthermore, there appear to be 6 points in the top-right of this plot highlighted in the orange box. However, this is not actually the case, as this plot suffers from overplotting. 
Recall from Subsection 2.3.2 that overplotting occurs when several points are stacked directly on top of each other, making it difficult to distinguish them. So while it may appear that there are only 6 points in the orange box, there are actually more. This fact is only apparent when using geom_jitter() in place of geom_point(). We display the resulting plot in Figure 5.3 along with the same orange box as in Figure 5.2. ggplot(evals_ch6, aes(x = bty_avg, y = score)) + geom_jitter() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Scatterplot of relationship of teaching and beauty scores&quot;) FIGURE 5.3: Instructor evaluation scores at UT Austin. It is now apparent that there are 12 points in the area highlighted in orange and not 6 as originally suggested in Figure 5.2. Recall from Section 2.3.2 on overplotting that jittering adds a little random “nudge” to each of the points to break up these ties. Furthermore, recall that jittering is strictly a visualization tool; it does not alter the original values in the data frame evals_ch6. To keep things simple going forward however, we’ll only present regular scatterplots rather than their jittered counterparts. Let’s build on the unjittered scatterplot in Figure 5.2 by adding a “best-fitting” line: of all possible lines we can draw on this scatterplot, it is the line that “best” fits through the cloud of points. We do this by adding a new geom_smooth(method = &quot;lm&quot;, se = FALSE) layer to the ggplot() code that created the scatterplot in Figure 5.2. The method = &quot;lm&quot; argument sets the line to be a “linear model” i.e. a line, while the se = FALSE argument suppresses “standard error” uncertainty bars. 
ggplot(evals_ch6, aes(x = bty_avg, y = score)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Relationship between teaching and beauty scores&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 5.4: Regression line. The blue line in the resulting Figure 5.4 is called a “regression line.” The regression line is a visual summary of the relationship between two numerical variables, in our case the outcome variable score and the explanatory variable bty_avg. The positive slope of the blue line is consistent with our earlier observed correlation coefficient of 0.187 suggesting that there is a positive relationship between these two variables: as instructors have higher “beauty” scores, so also do they receive higher teaching evaluations. We’ll see later however that while the correlation coefficient and the slope of a regression line always have the same sign (positive or negative), they do not necessarily have the same value. Furthermore, a regression line is “best-fitting” in that it minimizes a mathematical criterion. We present this criterion in Subsection 5.3.2, but we suggest you read this subsection only after reading the rest of this section on regression with one numerical explanatory variable. Learning check (LC5.1) Conduct a new exploratory data analysis with the same outcome variable \\(y\\) being score but with age as the new explanatory variable \\(x\\). Remember, this involves three things: Looking at the raw data values. Computing summary statistics. Creating data visualizations. What can you say about the relationship between age and teaching scores based on this exploration? 5.1.2 Simple linear regression You may recall from secondary/high school algebra that the equation of a line is \\(y = a + b\\cdot x\\). (Note that the \\(\\cdot\\) symbol is equivalent to the \\(\\times\\) “multiply by” mathematical symbol. 
We’ll use the \\(\\cdot\\) symbol in this book as it is more succinct.) It is defined by two coefficients \\(a\\) and \\(b\\): the intercept coefficient \\(a\\) i.e. the value of \\(y\\) when \\(x = 0\\) and the slope coefficient \\(b\\) for \\(x\\) i.e. the increase in \\(y\\) for every increase of one in \\(x\\). However, when defining a regression line like the regression line in Figure 5.4, we use slightly different notation: the equation of the regression line is \\(\\widehat{y} = b_0 + b_1 \\cdot x\\) where the intercept coefficient is \\(b_0\\) i.e. the value of \\(\\widehat{y}\\) when \\(x=0\\). The slope coefficient for \\(x\\) is \\(b_1\\) i.e. the increase in \\(\\widehat{y}\\) for every increase of one in \\(x\\). Why do we put a “hat” on top of the \\(y\\)? It’s a form of notation commonly used in regression to indicate that we have a “fitted value”, or the value of \\(y\\) on the regression line for a given \\(x\\) value. We’ll discuss this more in the upcoming Subsection 5.1.3. We know that the regression line in Figure 5.4 has a positive slope \\(b_1\\) corresponding to our explanatory \\(x\\) variable bty_avg. Why? Because as instructors have higher bty_avg scores, so also do they tend to have higher teaching evaluation scores. However, what is the numerical value of the slope \\(b_1\\)? What about the intercept \\(b_0\\)? Let’s not compute these two values by hand, but rather let’s use a computer! We can obtain the values of the intercept \\(b_0\\) and the slope for bty_avg \\(b_1\\) by outputting a linear regression table. This is done in two steps: We first “fit” the linear regression model using the lm() function and save it in score_model. We get the regression table by applying the get_regression_table() function from the moderndive package to score_model. 
# Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch6) # Get regression table: get_regression_table(score_model) TABLE 5.2: Linear regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Let’s first focus on interpreting the regression table output in Table 5.2 and then we’ll later revisit the code that produced it. In the estimate column of Table 5.2 are the intercept \\(b_0\\) = 3.88 and the slope \\(b_1\\) = 0.067 for bty_avg. Thus the equation of the regression line in Figure 5.4 follows: \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x\\\\ \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{bty}\\_\\text{avg}} \\cdot\\text{bty}\\_\\text{avg}\\\\ &amp;= 3.880 + 0.067\\cdot\\text{bty}\\_\\text{avg} \\end{aligned} \\] The intercept \\(b_0\\) = 3.880 is the average teaching score \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) for those courses where the instructor had a “beauty” score bty_avg of 0. Or in graphical terms, it’s where the line intersects the \\(y\\) axis when \\(x\\) = 0. Note however that while the intercept of the regression line has a mathematical interpretation, it has no practical interpretation, since observing a bty_avg of 0 is impossible; it is the average of six panelists’ “beauty” score ranging from 1 to 10. Furthermore, looking at the scatterplot with the regression line in Figure 5.4, no instructors had a “beauty” score anywhere near 0. Of greater interest is the slope \\(b_1\\) = \\(b_{\\text{bty\\_avg}}\\) for bty_avg of 0.067, as this summarizes the relationship between the teaching and “beauty” score variables. Note that the sign is positive, suggesting a positive relationship between these two variables, meaning teachers with higher “beauty” scores also tend to have higher teaching scores. Recall from earlier that the correlation coefficient is 0.187. 
They both have the same positive sign, but have different values. Recall further that the correlation’s interpretation is the “strength of linear association”. The slope’s interpretation is a little different: For every increase of 1 unit in bty_avg, there is an associated increase of, on average, 0.067 units of score. We only state that there is an associated increase and not necessarily a causal increase. For example, perhaps it’s not that higher “beauty” scores directly cause higher teaching scores per se. Instead it could be that individuals from wealthier backgrounds tend to have stronger educational backgrounds and hence have higher teaching scores, but that these wealthy individuals also have higher “beauty” scores. In other words, just because two variables are strongly associated, it doesn’t necessarily mean that one causes the other. This is summed up in the often quoted phrase “correlation is not necessarily causation.” We discuss this idea further in Subsection 5.3.1. Furthermore, we say that this associated increase is on average 0.067 units of teaching score, because you might have two instructors whose bty_avg scores differ by 1 unit, but their difference in teaching scores won’t necessarily be exactly 0.067. What the slope of 0.067 is saying is that across all possible courses, the average difference in teaching score between two instructors whose “beauty” scores differ by one is 0.067. Now that we’ve learned how to compute the equation for the regression line in Figure 5.4 using the values in the estimate column of Table 5.2 and how to interpret the resulting intercept and slope, let’s revisit the code that generated this table: # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch6) # Get regression table: get_regression_table(score_model) First, we “fit” the linear regression model to the data using the lm() function and save this to score_model. 
When we say “fit”, we mean “find the best fitting line to this data.” lm() stands for “linear model” and is used as follows: lm(y ~ x, data = data_frame_name) where: y is the outcome variable, followed by a tilde ~. In our case, y is set to score. x is the explanatory variable. In our case, x is set to bty_avg. The combination of y ~ x is called a model formula. (Note the order of y and x.) In our case, the model formula is score ~ bty_avg. We saw such model formulas earlier when we computed the correlation coefficient using the get_correlation() function in Subsection 5.1.1. data_frame_name is the name of the data frame that contains the variables y and x. In our case, data_frame_name is the evals_ch6 data frame. Second, we take the saved model in score_model and apply the get_regression_table() function from the moderndive package to it to obtain the regression table in Table 5.2. This function is an example of what’s known in computer programming as a wrapper function. Wrapper functions take other pre-existing functions and “wrap” them into a single function that hides their inner workings. This concept is illustrated in Figure 5.5. FIGURE 5.5: The concept of a wrapper function. So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details “under the hood of the car.” In our regression modeling example, the get_regression_table() function takes a saved lm() linear regression model as input and returns a data frame of the regression table as output. If you’re interested in learning more about the get_regression_table() function’s design and inner workings, check out Subsection 5.3.3. Lastly, you might be wondering what the remaining 5 columns in Table 5.2 are: std_error, statistic, p_value, lower_ci and upper_ci? 
They are the “standard error”, “test statistic”, “p-value”, “lower 95% confidence interval bound”, and “upper 95% confidence interval bound.” They tell us about both the statistical significance and practical significance of our results. You can think of this loosely as the “meaningfulness” of our results from a statistical perspective. We are going to put aside these ideas for now and revisit them in Chapter 10 on (statistical) inference for regression. We’ll do this after we’ve had a chance to cover standard errors in Chapter 7, confidence intervals in Chapter 8, and hypothesis testing and p-values in Chapter 9. Learning check (LC5.2) Fit a new simple linear regression using lm(score ~ age, data = evals_ch6) where age is the new explanatory variable \\(x\\). Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your earlier exploratory data analysis? 5.1.3 Observed/fitted values and residuals We just saw how to get the value of the intercept and the slope of a regression line from the estimate column of a regression table generated by the get_regression_table() function. Now instead say we want information on individual observations. For example, let’s focus on the 21st of the 463 courses in the evals_ch6 data frame in Table 5.3: TABLE 5.3: Data for the 21st course out of 463 ID score bty_avg age 21 4.9 7.33 31 What is the value \\(\\widehat{y}\\) on the blue regression line corresponding to this instructor’s bty_avg “beauty” score of 7.333? In Figure 5.6 we mark three values corresponding to the instructor for this 21st course and give their statistical names: Circle: The observed value \\(y\\) = 4.9 is this course’s instructor’s actual teaching score. Square: The fitted value \\(\\widehat{y}\\) is the value on the regression line for \\(x\\) = bty_avg = 7.333. 
This value is computed using the intercept and slope in the previous regression table: \\[\\widehat{y} = b_0 + b_1 \\cdot x = 3.88 + 0.067 \\cdot 7.333 = 4.369\\] Arrow: The length of this arrow is the residual and is computed by subtracting the fitted value \\(\\widehat{y}\\) from the observed value \\(y\\). The residual can be thought of as a model’s error or “lack of fit” for a particular observation. In the case of this course’s instructor, it is \\(y - \\widehat{y}\\) = 4.9 - 4.369 = 0.531. FIGURE 5.6: Example of observed value, fitted value, and residual. Now say we want to compute both the fitted value \\(\\widehat{y} = b_0 + b_1 \\cdot x\\) and the residual \\(y - \\widehat{y}\\) for all 463 courses in the study? Recall that each course corresponds to one of the 463 rows in the evals_ch6 data frame and also one of the 463 points in the regression plot in Figure 5.6. We could repeat the previous calculations we performed by hand 463 times, but that would be tedious and time consuming. Instead, let’s do this using a computer with the get_regression_points() function. Just like the get_regression_table() function, the get_regression_points() function is a “wrapper” function. However, this function returns a different output. Let’s apply the get_regression_points() function to score_model, which is where we saved our lm() model in the previous section. In Table 5.4 we present the results of only the 21st through 24th courses for brevity’s sake. regression_points &lt;- get_regression_points(score_model) regression_points TABLE 5.4: Regression points (for only the 21st through 24th courses) ID score bty_avg score_hat residual 21 4.9 7.33 4.37 0.531 22 4.6 7.33 4.37 0.231 23 4.5 7.33 4.37 0.131 24 4.4 5.50 4.25 0.153 Let’s inspect the individual columns and match them with the elements of Figure 5.6: The score column represents the observed outcome variable \\(y\\) i.e. the y-position of the 463 black points. 
The bty_avg column represents the values of the explanatory variable \\(x\\) i.e. the x-position of the 463 black points. The score_hat column represents the fitted values \\(\\widehat{y}\\) i.e. the corresponding value on the regression line for the 463 \\(x\\) values. The residual column represents the residuals \\(y - \\widehat{y}\\) i.e. the 463 vertical distances between the 463 black points and the regression line. Just as we did for the instructor of the 21st course in the evals_ch6 dataset (in the first row of the table), let’s repeat the calculations for the instructor of the 24th course (in the fourth row of Table 5.4): score = 4.4 is the observed teaching score \\(y\\) for this course’s instructor. bty_avg = 5.50 is the value of the explanatory variable bty_avg \\(x\\) for this course’s instructor. score_hat = 4.25 = 3.88 + 0.067 \\(\\cdot\\) 5.50 is the fitted value \\(\\widehat{y}\\) on the regression line for this course’s instructor. residual = 0.153 = 4.4 - 4.25 is the value of the residual for this instructor. In other words, the model was off by 0.153 teaching score units for this course’s instructor. At this point we suggest you read Section 5.3.2, where we define what we mean by “best-fitting” regression lines: of all possible lines we can draw through the points, it is the line that minimizes the sum of squared residuals. Learning check (LC5.3) Generate a data frame of the residuals of the model where you used age as the explanatory \\(x\\) variable. 5.2 One categorical explanatory variable It’s an unfortunate truth that life expectancy is not the same across all countries in the world. International development agencies are very interested in studying these differences in life expectancy in the hopes of identifying where governments should allocate resources to address this problem. 
In this section, we’ll explore differences in life expectancy in two ways: Differences between continents: Are there significant differences in average life expectancy between the five populated continents of the world: Africa, the Americas, Asia, Europe, and Oceania? Differences within continents: How does life expectancy vary within the world’s five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia? To answer such questions, we’ll use the gapminder data frame included in the gapminder package. This dataset has international development statistics such as life expectancy, GDP per capita, and population for 142 countries for 5-year intervals between 1952 and 2007. Recall we visualized some of this data in Figure 2.1 in Subsection 2.1.2 on the “Grammar of Graphics.” We’ll use this data for basic linear regression again, but now using an explanatory variable \\(x\\) that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section 5.1. Our model will have: A numerical outcome variable \\(y\\), a country’s life expectancy, and a single categorical explanatory variable \\(x\\), the continent the country is a part of. When the explanatory variable \\(x\\) is categorical, the concept of a “best-fitting” regression line is a little different than the one we saw previously in Section 5.1 where the explanatory variable \\(x\\) was numerical. We’ll study these differences shortly in Subsection 5.2.2, but first we conduct an exploratory data analysis. 5.2.1 Exploratory data analysis The data on the 142 countries can be found in the gapminder data frame included in the gapminder package. However, to keep things simple, let’s filter() for only those observations/rows corresponding to the year 2007, and select() only the subset of the variables we’ll consider in this chapter. 
We’ll save this data in a new data frame called gapminder2007: library(gapminder) gapminder2007 &lt;- gapminder %&gt;% filter(year == 2007) %&gt;% select(country, lifeExp, continent, gdpPercap) Recall from Section 5.1.1 that there are three common steps in an exploratory data analysis: Most crucially: Looking at the raw data values. Computing summary statistics, like means, medians, and interquartile ranges. Creating data visualizations. Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. You can do this by using RStudio’s spreadsheet viewer or by using the glimpse() command as introduced in Section 1.4.3 on exploring data frames: glimpse(gapminder2007) Observations: 142 Variables: 4 $ country &lt;fct&gt; Afghanistan, Albania, Algeria, Angola, Argentina, Australia… $ lifeExp &lt;dbl&gt; 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,… $ continent &lt;fct&gt; Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As… $ gdpPercap &lt;dbl&gt; 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33… Observe that Observations: 142 indicates that there are 142 rows/observations in gapminder2007, where each row corresponds to one country. In other words, the observational unit are individual countries. Furthermore, observe that the variable continent is of type &lt;fct&gt;, which stands for “factor,” which is R’s way of encoding categorical variables. A full description of all the variables included in gapminder can be found by reading the associated help file (run ?gapminder in the console). However, let’s fully describe the 4 variables we selected in gapminder2007: country: An identification variable used to distinguish the 142 countries in the dataset. lifeExp: A numerical variable of that country’s life expectancy at birth. This is the outcome variable \\(y\\) of interest. continent: A categorical variable with 5 levels i.e. possible categories: Africa, Asia, Americas, Europe, and Oceania. 
This is the explanatory variable \\(x\\) of interest. gdpPercap: A numerical variable of that country’s GDP per capita in US inflation-adjusted dollars that we’ll use as another outcome variable \\(y\\) in the Learning Check at the end of this section. Furthermore, let’s look at a random sample of 5 out of the 142 countries in Table 5.5. Note due to the random nature of the sampling, you will likely end up with a different subset of 5 rows. gapminder2007 %&gt;% sample_n(size = 5) TABLE 5.5: Random sample of 5 out of 142 countries country lifeExp continent gdpPercap Togo 58.4 Africa 883 Sao Tome and Principe 65.5 Africa 1598 Congo, Dem. Rep. 46.5 Africa 278 Lesotho 42.6 Africa 1569 Bulgaria 73.0 Europe 10681 Now that we’ve looked at the raw values in our gapminder2007 data frame and got a sense of the data, let’s move on to computing summary statistics. Let’s once again apply the skim() function from the skimr package. Recall from our previous EDA that this function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our gapminder2007 data frame, select() only the outcome and explanatory variables lifeExp and continent, and pipe them into the skim() function: gapminder2007 %&gt;% select(lifeExp, continent) %&gt;% skim() Skim summary statistics n obs: 142 n variables: 2 ── Variable type:factor ──────────────────────────────────────────────────────── variable missing complete n n_unique top_counts ordered continent 0 142 142 5 Afr: 52, Asi: 33, Eur: 30, Ame: 25 FALSE ── Variable type:numeric ─────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 lifeExp 0 142 142 67.01 12.07 39.61 57.16 71.94 76.41 82.6 The skim() output now reports summaries for categorical variables (Variable type:factor) separately from the numerical variables (Variable type:numeric). 
For the categorical variable continent, it reports: missing, complete, n which are the number of missing, complete, and total number of values as before. n_unique: The number of unique levels to this variable, corresponding to Africa, Asia, Americas, Europe, and Oceania. top_counts: In this case the top four counts: Africa has 52 countries, Asia has 33, Europe has 30, and Americas has 25. Not displayed is Oceania with 2 countries. ordered: This tells us whether the categorical variable is “ordinal”: whether there is encoded hierarchy (like low, medium, high). In this case, continent is not ordered. Turning our attention to the summary statistics of the numerical variable lifeExp, we observe that the global median life expectancy in 2007 was 71.94, or in other words, half of the world’s countries (71 countries) had a life expectancy less than 71.94. The mean life expectancy of 67.01 is lower however. Why is the mean life expectancy lower than the median? We can answer this question by performing the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s visualize the distribution of our outcome variable \\(y\\) = lifeExp in Figure 5.7. ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + labs(x = &quot;Life expectancy&quot;, y = &quot;Number of countries&quot;, title = &quot;Histogram of distribution of worldwide life expectancies&quot;) FIGURE 5.7: Histogram of Life Expectancy in 2007. We see that this data is left-skewed, also known as negatively skewed: there are a few countries with very low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers, hence the median is greater than the mean in this case. Remember however, that we want to compare life expectancies both between continents and within continents. 
In other words, our visualizations need to incorporate some notion of the variable continent. We can do this easily with a faceted histogram. Recall from Section 2.6 that facets allow us to split a visualization by the different values of another variable. We display the resulting visualization in Figure 5.8 by adding a facet_wrap(~ continent, nrow = 2) layer. ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + labs(x = &quot;Life expectancy&quot;, y = &quot;Number of countries&quot;, title = &quot;Histogram of distribution of worldwide life expectancies&quot;) + facet_wrap(~ continent, nrow = 2) FIGURE 5.8: Life expectancy in 2007. Observe that unfortunately the distribution of African life expectancies is much lower than the other continents, while in Europe life expectancies tend to be higher and furthermore do not vary as much. On the other hand, both Asia and Africa have the most variation in life expectancies. There is the least variation in Oceania, but keep in mind that there are only two countries in Oceania: Australia and New Zealand. Recall that an alternative method to visualize the distribution of a numerical variable split by a categorical variable is by using a side-by-side boxplot. We map the categorical variable continent to the \\(x\\)-axis and the different life expectancies within each continent on the \\(y\\)-axis in Figure 5.9. ggplot(gapminder2007, aes(x = continent, y = lifeExp)) + geom_boxplot() + labs(x = &quot;Continent&quot;, y = &quot;Life expectancy (years)&quot;, title = &quot;Life expectancy by continent&quot;) FIGURE 5.9: Life expectancy in 2007. Some people prefer comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram. This is because we can make quick comparisons between the categorical variable’s levels with imaginary horizontal lines. 
For example, observe in Figure 5.9 that we can quickly convince ourselves that Oceania has the highest median life expectancies by drawing an imaginary horizontal line at \\(y\\) = 80. Furthermore, as we observed in the faceted histogram in Figure 5.8, Africa and Asia have the largest variation in life expectancy as evidenced by their large interquartile ranges i.e. the heights of the boxes. It’s important to remember however that the solid lines in the middle of the boxes correspond to the medians (i.e. the middle value) rather than the mean (the average). So for example, if you look at Asia, the solid line denotes the median life expectancy of around 72 years. This tells us that half of all countries in Asia have a life expectancy below 72 years whereas half have a life expectancy above 72 years. Let’s compute the median and mean life expectancy for each continent with a little more data wrangling and display the results in Table 5.6. lifeExp_by_continent &lt;- gapminder2007 %&gt;% group_by(continent) %&gt;% summarize(median = median(lifeExp), mean = mean(lifeExp)) TABLE 5.6: Life expectancy by continent continent median mean Africa 52.9 54.8 Americas 72.9 73.6 Asia 72.4 70.7 Europe 78.6 77.6 Oceania 80.7 80.7 Observe the order of the second column median life expectancy: Africa is lowest, the Americas and Asia are next with similar medians, then Europe, then Oceania. This ordering corresponds to the ordering of the solid black lines inside the boxes in our side-by-side boxplot in Figure 5.9. Let’s now turn our attention to the values in the third column mean. Using Africa’s mean life expectancy of 54.8 as a baseline for comparison, let’s start making relative comparisons to the life expectancies of the other four continents: The mean life expectancy of the Americas is 73.6 - 54.8 = 18.8 years higher. The mean life expectancy of Asia is 70.7 - 54.8 = 15.9 years higher. The mean life expectancy of Europe is 77.6 - 54.8 = 22.8 years higher. 
The mean life expectancy of Oceania is 80.7 - 54.8 = 25.9 years higher. Let’s put these values in Table 5.7, which we’ll revisit later on in this section. TABLE 5.7: Mean life expectancy by continent and relative differences from mean for Africa. continent mean Difference versus Africa Africa 54.8 0.0 Americas 73.6 18.8 Asia 70.7 15.9 Europe 77.6 22.8 Oceania 80.7 25.9 Learning check (LC5.4) Conduct a new exploratory data analysis with the same explanatory variable \\(x\\) being continent but with gdpPercap as the new outcome variable \\(y\\). Remember, this involves three things: Most crucially: Looking at the raw data values. Computing summary statistics, such as means, medians, and interquartile ranges. Creating data visualizations. What can you say about the differences in GDP per capita between continents based on this exploration? 5.2.2 Linear regression In Subsection 5.1.2 we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable \\(y\\) and a numerical explanatory variable \\(x\\). In our life expectancy example, we now instead have a categorical explanatory variable \\(x\\) continent. Our model will not yield a “best-fitting” regression line like in Figure 5.4, but rather offsets relative to a baseline for comparison. As we did in Section 5.1.2 when studying the relationship between teaching scores and “beauty” scores, let’s output the regression table for this model. Recall that this is done in two steps: We first “fit” the linear regression model using the lm(y~x, data) function and save it in lifeExp_model. We get the regression table by applying the get_regression_table() function from the moderndive package to lifeExp_model. 
# Fit regression model: lifeExp_model &lt;- lm(lifeExp ~ continent, data = gapminder2007) # Get regression table: get_regression_table(lifeExp_model) TABLE 5.8: Linear regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 54.8 1.02 53.45 0 52.8 56.8 continentAmericas 18.8 1.80 10.45 0 15.2 22.4 continentAsia 15.9 1.65 9.68 0 12.7 19.2 continentEurope 22.8 1.70 13.47 0 19.5 26.2 continentOceania 25.9 5.33 4.86 0 15.4 36.5 Let’s once again focus on the values in the term and estimate columns of Table 5.8. Why are there now 5 rows? Let’s break them down one-by-one: intercept here corresponds to the mean life expectancy of countries in Africa of 54.8 years. continentAmericas corresponds to countries in the Americas and the value +18.8 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in the Americas is 54.8 + 18.8 = 73.6. continentAsia corresponds to countries in Asia and the value +15.9 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Asia is 54.8 + 15.9 = 70.7. continentEurope corresponds to countries in Europe and the value +22.8 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Europe is 54.8 + 22.8 = 77.6. continentOceania corresponds to countries in Oceania and the value +25.9 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Oceania is 54.8 + 25.9 = 80.7. To summarize, the 5 values in the estimate column in Table 5.8 correspond to the “baseline for comparison” continent Africa (the intercept) as well as four “offsets” from this baseline for the remaining 4 continents: the Americas, Asia, Europe, and Oceania. 
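The correspondence between the regression table and the group means can be checked on a much smaller example. The following is only a sketch using base R and a made-up data frame (the names toy, g, and y are hypothetical and are not part of the gapminder data): it fits a regression on a three-level categorical variable and compares the estimates to the group means computed directly.

```r
# Hypothetical toy data (NOT gapminder values): a numerical outcome y and
# a categorical explanatory variable g with three levels.
toy = data.frame(
  g = factor(c('A', 'A', 'B', 'B', 'C', 'C')),
  y = c(1, 3, 6, 8, 10, 14)
)

# Mean of y within each level of g, computed directly.
group_means = tapply(toy$y, toy$g, mean)

# Fit the regression with the categorical explanatory variable.
toy_model = lm(y ~ g, data = toy)
estimates = coef(toy_model)

# The intercept is the mean of the baseline level (first alphabetically);
# every other estimate is an offset from that baseline mean.
offsets_from_means = group_means[-1] - group_means[1]

print(group_means)
print(estimates)
print(offsets_from_means)
```

In this sketch the intercept matches the mean of level A, and the remaining two estimates match the differences between the means of levels B and C and the mean of level A, which is the same pattern that links Table 5.7 to Table 5.8.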
You might be asking at this point why Africa was chosen as the “baseline for comparison” group. This is the case for no other reason than it comes first alphabetically of the five continents; by default R arranges factors/categorical variables in alphanumeric order. You can change this baseline group to be another continent if you manipulate the variable continent’s factor “levels” using the forcats package. See Chapter 15 of Garrett Grolemund and Hadley Wickham’s book “R for Data Science” (Grolemund and Wickham 2016) for examples. Let’s now write the equation for our fitted values \\(\\widehat{y} = \\widehat{\\text{life exp}}\\). \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{life exp}} &amp;= b_0 + b_{\\text{Amer}}\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + b_{\\text{Asia}}\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + \\\\ &amp; \\qquad b_{\\text{Euro}}\\cdot\\mathbb{1}_{\\mbox{Euro}}(x) + b_{\\text{Ocean}}\\cdot\\mathbb{1}_{\\mbox{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + \\\\ &amp; \\qquad 22.8\\cdot\\mathbb{1}_{\\mbox{Euro}}(x) + 25.9\\cdot\\mathbb{1}_{\\mbox{Ocean}}(x) \\end{aligned} \\] Whoa! That looks very daunting! Don’t fret however, as once you understand what all the elements mean, things simplify greatly. First, \\(\\mathbb{1}_{A}(x)\\) is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where \\[ \\mathbb{1}_{A}(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\text{if } x \\text{ is in } A \\\\ 0 &amp; \\text{otherwise} \\end{array} \\right. \\] In a statistical modeling context this is also known as a dummy variable. In our case, let’s consider the first such indicator variable \\(\\mathbb{1}_{\\mbox{Amer}}(x)\\). 
This indicator function returns 1 if a country is in the Americas, 0 otherwise: \\[ \\mathbb{1}_{\\mbox{Amer}}(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\text{if } \\text{country } x \\text{ is in the Americas} \\\\ 0 &amp; \\text{otherwise}\\end{array} \\right. \\] Second, \\(b_0\\) corresponds to the intercept as before; in this case it’s the mean life expectancy of all countries in Africa. Third, the \\(b_{\\text{Amer}}\\), \\(b_{\\text{Asia}}\\), \\(b_{\\text{Euro}}\\), and \\(b_{\\text{Ocean}}\\) represent the 4 “offsets relative to the baseline for comparison” in the regression table output in Table 5.8: continentAmericas, continentAsia, continentEurope, and continentOceania. Let’s put this all together and compute the fitted value \\(\\widehat{y} = \\widehat{\\text{life exp}}\\) for a country in Africa. Since the country is in Africa, all four indicator functions \\(\\mathbb{1}_{\\mbox{Amer}}(x)\\), \\(\\mathbb{1}_{\\mbox{Asia}}(x)\\), \\(\\mathbb{1}_{\\mbox{Euro}}(x)\\), and \\(\\mathbb{1}_{\\mbox{Ocean}}(x)\\) will equal 0, and thus: \\[ \\begin{aligned} \\widehat{\\text{life exp}} &amp;= b_0 + b_{\\text{Amer}}\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + b_{\\text{Asia}}\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + \\\\ &amp; \\qquad b_{\\text{Euro}}\\cdot\\mathbb{1}_{\\text{Euro}}(x) + b_{\\text{Ocean}}\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\text{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\text{Asia}}(x) + \\\\ &amp; \\qquad 22.8\\cdot\\mathbb{1}_{\\text{Euro}}(x) + 25.9\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot 0 + 15.9\\cdot 0 + 22.8\\cdot 0 + 25.9\\cdot 0\\\\ &amp;= 54.8 \\end{aligned} \\] In other words, all that’s left is the intercept \\(b_0\\), corresponding to the average life expectancy of African countries of 54.8 years. Next, say we are considering a country in the Americas. 
In this case only the indicator function \\(\\mathbb{1}_{\\mbox{Amer}}(x)\\) for the Americas will equal 1, while all the others will equal 0, and thus: \\[ \\begin{aligned} \\widehat{\\text{life exp}} &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + 22.8\\cdot\\mathbb{1}_{\\mbox{Euro}}(x) + \\\\ &amp; \\qquad 25.9\\cdot\\mathbb{1}_{\\mbox{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot 1 + 15.9\\cdot 0 + 22.8\\cdot 0 + 25.9\\cdot 0\\\\ &amp;= 54.8 + 18.8\\\\ &amp;= 73.6 \\end{aligned} \\] which is the mean life expectancy for countries in the Americas of 73.6 years we computed in Table 5.7. Note the “offset from the baseline for comparison” here is +18.8 years. Let’s do one more. Say we are considering a country in Asia. In this case only the indicator function \\(\\mathbb{1}_{\\mbox{Asia}}(x)\\) for Asia will equal 1, while all the others will equal 0, and thus: \\[ \\begin{aligned} \\widehat{\\text{life exp}} &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + 22.8\\cdot\\mathbb{1}_{\\mbox{Euro}}(x) + \\\\ &amp; \\qquad 25.9\\cdot\\mathbb{1}_{\\mbox{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot 0 + 15.9\\cdot 1 + 22.8\\cdot 0 + 25.9\\cdot 0\\\\ &amp;= 54.8 + 15.9\\\\ &amp;= 70.7 \\end{aligned} \\] which is the mean life expectancy for countries in Asia of 70.7 years we computed in Table 5.7. Note the “offset from the baseline for comparison” here is +15.9 years. Let’s generalize this idea a bit. If we fit a linear regression model using a categorical explanatory variable \\(x\\) that has \\(k\\) levels i.e. possible categories, the regression table will return an intercept and \\(k - 1\\) “offsets.” In our case, since there are \\(k = 5\\) continents, the regression model returns an intercept corresponding to the baseline for comparison group of Africa and \\(k - 1 = 4\\) offsets corresponding to the Americas, Asia, Europe, and Oceania. Phew! That was a lot of work! 
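The indicator-function arithmetic worked through above can also be written as a few lines of base R. This is a minimal sketch: the helper name fitted_life_exp is hypothetical, and the numbers are the rounded estimates copied from Table 5.8 rather than output from the fitted lifeExp_model.

```r
# Rounded estimates copied from the estimate column of Table 5.8.
b0 = 54.8
offsets = c(Americas = 18.8, Asia = 15.9, Europe = 22.8, Oceania = 25.9)

# Hypothetical helper: evaluates the four indicator functions for a given
# continent (each is 0 or 1) and returns the fitted life expectancy.
fitted_life_exp = function(continent) {
  indicators = as.numeric(names(offsets) == continent)
  b0 + sum(offsets * indicators)
}

# For Africa all four indicators are 0, so only the intercept remains.
print(fitted_life_exp('Africa'))
# For any other continent exactly one indicator equals 1.
print(fitted_life_exp('Americas'))
print(fitted_life_exp('Asia'))
```

Note that offsets has \\(k - 1 = 4\\) entries for the \\(k = 5\\) continents; the baseline group Africa has no entry of its own because it is represented by the intercept.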
Understanding a regression table output when you’re using a categorical explanatory variable is a topic those new to regression often struggle with. The only real remedy for these struggles is practice, practice, practice. However, once you equip yourselves with an understanding of how to create regression models using categorical explanatory variables, you’ll be able to incorporate many new variables into your models given the large amount of the world’s data that is categorical. If you feel like you’re still struggling at this point however, we suggest you closely compare Tables 5.7 and 5.8 and note how you can compute all the values from one table using the values in the other. Learning check (LC5.5) Fit a new linear regression using lm(gdpPercap ~ continent, data = gapminder2007) where gdpPercap is the new outcome variable \\(y\\). Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your previous exploratory data analysis? 5.2.3 Observed/fitted values and residuals Recall in Subsection 5.1.3, we defined the following three concepts: Observed values \\(y\\), or the observed value of the outcome variable Fitted values \\(\\widehat{y}\\), or the value on the regression line for a given \\(x\\) value Residuals \\(y - \\widehat{y}\\), or the error between the observed value and the fitted value We obtained these values and other values using the get_regression_points() function from the moderndive package. This time however, let’s add an ID = &quot;country&quot; argument: this is telling the function to use the variable country in gapminder2007 as an identification variable in the output. This will help contextualize our analysis by matching values to countries. 
regression_points &lt;- get_regression_points(lifeExp_model, ID = &quot;country&quot;) regression_points TABLE 5.9: Regression points (First 10 out of 142 countries) country lifeExp continent lifeExp_hat residual Afghanistan 43.8 Asia 70.7 -26.900 Albania 76.4 Europe 77.6 -1.226 Algeria 72.3 Africa 54.8 17.495 Angola 42.7 Africa 54.8 -12.075 Argentina 75.3 Americas 73.6 1.712 Australia 81.2 Oceania 80.7 0.515 Austria 79.8 Europe 77.6 2.180 Bahrain 75.6 Asia 70.7 4.907 Bangladesh 64.1 Asia 70.7 -6.666 Belgium 79.4 Europe 77.6 1.792 Observe in Table 5.9 that lifeExp_hat are the fitted values \\(\\widehat{y}\\) = \\(\\widehat{\\text{lifeexp}}\\). If you look closely, there are only 5 possible values for lifeExp_hat. These correspond to the 5 mean life expectancies for the 5 continents that we displayed in Table 5.7 and computed using the values in the estimate column of the regression table in Table 5.8. The residual column is simply \\(y - \\widehat{y}\\) = lifeExp - lifeExp_hat. These values can be interpreted as the deviation of a country’s life expectancy from its continent’s average life expectancy. For example, look at the first row of Table 5.9 corresponding to Afghanistan. The residual of \\(y - \\widehat{y}\\) = 43.8 - 70.7 = -26.9 is telling us that Afghanistan’s life expectancy is a whopping 26.9 years lower than the mean life expectancy of all Asian countries. This can in part be explained by the many years of war that country has suffered. Learning check (LC5.6) Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter 3, identify the 5 countries with the 5 smallest (most negative) residuals. What do these negative residuals say about their life expectancy relative to their continents? (LC5.7) Repeat this process, but identify the 5 countries with the 5 largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents? 
5.3 Related topics 5.3.1 Correlation is not necessarily causation Throughout this chapter we’ve been very cautious when interpreting regression slope coefficients. We always discussed the “associated” effect of an explanatory variable \\(x\\) on an outcome variable \\(y\\). For example, recall our statement from Subsection 5.1.2 that “for every increase of 1 unit in bty_avg, there is an associated increase of on average 0.067 units of score.” We include the term “associated” to be extra careful not to suggest we are making a causal statement. So while “beauty” score bty_avg is positively correlated with teaching score, we can’t necessarily make any statements about “beauty” scores’ direct causal effect on teaching score without more information on how this study was conducted. Here is another example: a not-so-great medical doctor goes through their medical records and finds that patients who slept with their shoes on tended to wake up with headaches more often. So this doctor declares “Sleeping with shoes on causes headaches!” FIGURE 5.10: Does sleeping with shoes on cause headaches? However, there is a good chance that if someone is sleeping with their shoes on, it’s because they are intoxicated from alcohol. Furthermore, higher levels of drinking lead to more hangovers, and hence more headaches. In this instance, the amount of alcohol consumption is what’s known as a confounding/lurking variable. It “lurks” behind the scenes, confounding the causal relationship (if any) of “sleeping with shoes on” with “waking up with a headache.” We can summarize this notion in Figure 5.11 with a causal graph where: Y is a response variable; here “waking up with a headache.” X is a treatment variable whose causal effect we are interested in; here “sleeping with shoes on.” FIGURE 5.11: Causal graph. 
To study the relationship between Y and X, we could use a regression model where the response variable is set to Y and the explanatory variable is set to be X, as you’ve been doing throughout this chapter. However, Figure 5.11 also includes a third variable with arrows pointing at both X and Y: Z is a confounding variable that affects both X &amp; Y, thereby “confounding” their relationship. Here the confounding variable is alcohol. Alcohol will cause people to be both more likely to sleep with their shoes on as well as be more likely to wake up with a headache. Thus any regression model of the relationship between X and Y should also use Z as an explanatory variable. In other words, our doctor needs to take into account who had been drinking the night before. In the next chapter we’ll start covering multiple regression models that allow us to incorporate more than one variable in our regression models. Establishing causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of potential confounding variables. Both these approaches attempt to, as best they can, either take all possible confounding variables into account or negate their impact. This allows researchers to focus only on the relationship of interest: the relationship between the response variable Y and the treatment variable X. As you read news stories, be careful to not fall into the trap of thinking the correlation necessarily implies causation. Check out Spurious Correlations for some rather comical examples of variables that are correlated, but are definitely not causally related. 5.3.2 Best-fitting line Regression lines are also known as “best-fitting” lines. But what do we mean by “best”? 
Let’s unpack the criterion that is used in regression to determine “best.” Recall Figure 5.6, where for an instructor with a beauty score of \\(x\\) = 7.333 we mark the observed value \\(y\\) with a circle, the fitted value \\(\\widehat{y}\\) with a square, and the residual \\(y - \\widehat{y}\\) with an arrow. We re-display Figure 5.6 in the top-left plot of Figure 5.12. Furthermore, let’s repeat this for three more arbitrarily chosen courses’ instructors: A course whose instructor had a “beauty” score \\(x\\) = 2.333 and teaching score \\(y\\) = 2.7. The residual in this case is 2.7 - 4.036 = -1.336, which we mark with a new blue arrow in the top-right plot. A course whose instructor had a “beauty” score \\(x\\) = 3.667 and teaching score \\(y\\) = 4.4. The residual in this case is 4.4 - 4.125 = 0.2753, which we mark with a new blue arrow in the bottom-left plot. A course whose instructor had a “beauty” score \\(x\\) = 6 and teaching score \\(y\\) = 3.8. The residual in this case is 3.8 - 4.28 = -0.4802, which we mark with a new blue arrow in the bottom-right plot. FIGURE 5.12: Example of observed value, fitted value, and residual. Now say we repeated this process of computing residuals for all 463 courses’ instructors, then we squared all the residuals, and then we summed them. We call this quantity the sum of squared residuals and it is a measure of the “lack of fit” of a model. Larger values of the sum of squared residuals indicate a bigger “lack of fit,” in other words a worse fitting model. If the regression line fits all the points perfectly, then the sum of squared residuals is 0. This is because if the regression line fits all the points perfectly, then the fitted value \\(\\widehat{y}\\) equals the observed value \\(y\\) in all cases, and hence the residual \\(y-\\widehat{y}\\) = 0 in all cases, and the sum of a large number of 0’s is still 0. 
Furthermore, of all possible lines we can draw through the cloud of 463 points, the regression line minimizes this value. In other words, the regression line and its corresponding fitted values \\(\\widehat{y}\\) minimize the sum of the squared residuals: \\[ \\sum_{i=1}^{n}(y_i - \\widehat{y}_i)^2 \\] Let’s use our data wrangling tools from Chapter 3 to compute the sum of squared residuals exactly: # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch6) # Get regression points: regression_points &lt;- get_regression_points(score_model) regression_points # A tibble: 463 x 5 ID score bty_avg score_hat residual &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 4.7 5 4.21 0.486 2 2 4.1 5 4.21 -0.114 3 3 3.9 5 4.21 -0.314 4 4 4.8 5 4.21 0.586 5 5 4.6 3 4.08 0.52 6 6 4.3 3 4.08 0.22 7 7 2.8 3 4.08 -1.28 8 8 4.1 3.33 4.10 -0.002 9 9 3.4 3.33 4.10 -0.702 10 10 4.5 3.17 4.09 0.409 # … with 453 more rows # Compute sum of squared residuals regression_points %&gt;% mutate(squared_residuals = residual^2) %&gt;% summarize(sum_of_squared_residuals = sum(squared_residuals)) # A tibble: 1 x 1 sum_of_squared_residuals &lt;dbl&gt; 1 132. Any other line drawn in the figure would yield a sum of squared residuals greater than 132. This is a mathematically guaranteed fact that you can prove using calculus and linear algebra. That’s why alternative names for the linear regression line are the best-fitting line as well as the least-squares line. Why do we square the residuals (i.e. the arrow lengths)? We do this so that both positive and negative deviations of the same amount are treated equally. Learning check (LC5.8) Note in the following plot there are 3 points marked with black dots along with: The “best” fitting regression line in blue An arbitrarily chosen line in dashed red Another arbitrarily chosen line in dashed green FIGURE 5.13: Regression line and two others. 
Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value. 5.3.3 get_regression_x() functions Recall in this chapter we introduced two functions from the moderndive package: the get_regression_table() function that returns a regression table in Subsection 5.1.2 and the get_regression_points() function that returns point-by-point information from a regression model in Subsection 5.1.3. What is going on behind the scenes with the get_regression_table() and get_regression_points() functions? We mentioned in Section 5.1.2 that these were examples of wrapper functions. Such functions take other pre-existing functions and “wrap” them into single functions that hide their inner workings from the user. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection we’ll “get under the hood” of these functions and see how the “engine” of these wrapper functions works. Recall our two-step process to generate a regression table from Subsection 5.1.2: # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch6) # Get regression table: get_regression_table(score_model) TABLE 5.10: Regression table. term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 The get_regression_table() wrapper function takes two pre-existing functions in other R packages, the tidy() function from the broom package (Robinson and Hayes 2019) and the clean_names() function from the janitor package (Firke 2019), and “wraps” them into a single function that takes in a saved lm() linear model, here score_model, and returns a regression table saved as a “tidy” data frame. 
Here is how we used the tidy() and clean_names() functions: library(broom) library(janitor) score_model %&gt;% tidy(conf.int = TRUE) %&gt;% mutate_if(is.numeric, round, digits = 3) %&gt;% clean_names() %&gt;% rename(lower_ci = conf_low, upper_ci = conf_high) TABLE 5.11: Regression table using tidy() from broom package. term estimate std_error statistic p_value lower_ci upper_ci (Intercept) 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Yikes! That’s a lot of code! So in order to simplify your lives, we made the editorial decision to “wrap” all the code into get_regression_table(), freeing you from the need to understand the inner workings of the function. Note that the mutate_if() function is from the dplyr package and applies the round() function, rounding to 3 decimal places, only to those variables that are numerical. Similarly, the get_regression_points() function is another wrapper function, but this time returning information about the individual points involved in a regression model like the fitted values, observed values, and the residuals. get_regression_points() uses the augment() function in the broom package instead of the tidy() function as with get_regression_table(): library(broom) library(janitor) score_model %&gt;% augment() %&gt;% mutate_if(is.numeric, round, digits = 3) %&gt;% clean_names() %&gt;% select(-c(&quot;se_fit&quot;, &quot;hat&quot;, &quot;sigma&quot;, &quot;cooksd&quot;, &quot;std_resid&quot;)) TABLE 5.12: Regression points using augment() from broom package. 
score bty_avg fitted resid 4.7 5.00 4.21 0.486 4.1 5.00 4.21 -0.114 3.9 5.00 4.21 -0.314 4.8 5.00 4.21 0.586 4.6 3.00 4.08 0.520 4.3 3.00 4.08 0.220 2.8 3.00 4.08 -1.280 4.1 3.33 4.10 -0.002 3.4 3.33 4.10 -0.702 4.5 3.17 4.09 0.409 In this case, it outputs only the variables of interest to students learning regression: the outcome variable \\(y\\) (score), all explanatory/predictor variables (bty_avg), all resulting fitted values \\(\\hat{y}\\) obtained by applying the equation of the regression line to bty_avg, and the residual \\(y - \\hat{y}\\). If you’re even more curious about how these and other wrapper functions work, take a look at the source code for these functions on GitHub. 5.4 Conclusion 5.4.1 Additional resources An R script file of all R code used in this chapter is available here. As we suggested in Subsection 5.1.1, interpreting coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the following 80’s-style video game called “Guess the correlation” at http://guessthecorrelation.com/. FIGURE 5.14: Preview of “Guess the Correlation” Game. 5.4.2 What’s to come? In this chapter, you’ve studied what we term “basic regression,” where you fit models that only have one explanatory variable. In Chapter 6, we’ll study multiple regression, where our regression models can now have more than one explanatory variable! In particular, we’ll consider two scenarios: regression models with one numerical and one categorical explanatory variable and regression models with two numerical explanatory variables. This will allow you to construct more sophisticated and more powerful models, all in the hopes of better explaining your outcome variable \\(y\\). References "],
-["6-multiple-regression.html", "Chapter 6 Multiple Regression 6.1 One numerical &amp; one categorical explanatory variable 6.2 Two numerical explanatory variables 6.3 Related topics 6.4 Conclusion", " Chapter 6 Multiple Regression In Chapter 5 we introduced ideas related to modeling for explanation, in particular that the goal of modeling is to make explicit the relationship between some outcome variable \\(y\\) and some explanatory variable \\(x\\). While there are many approaches to modeling, we focused on one particular technique: linear regression, one of the most commonly-used and easy-to-understand approaches to modeling. Furthermore to keep things simple we only considered models with one explanatory \\(x\\) variable that was either numerical in Section 5.1 or categorical in Section 5.2. In this chapter on multiple regression, we’ll start considering models that include more than one explanatory variable \\(x\\). You can imagine when trying to model a particular outcome variable, like teaching evaluation scores as in Section 5.1 or life expectancy as in Section 5.2, that it would be very useful to include more than just one explanatory variable’s worth of information. Since our regression models will now consider more than one explanatory variable, the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables included in your model. Let’s begin! Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). 
Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(skimr) library(ISLR) 6.1 One numerical &amp; one categorical explanatory variable Let’s revisit the instructor evaluation data from UT Austin we introduced in Section 5.1. We studied the relationship between teaching evaluation scores as given by students and “beauty” scores. The variable teaching score was the numerical outcome variable \\(y\\) and the variable “beauty” score bty_avg was the numerical explanatory \\(x\\) variable. In this section, we are going to consider a different model. Our outcome variable will still be teaching score, but we’ll now include two different explanatory variables: age and gender. Could it be that instructors who are older receive better teaching evaluations from students? Or could it instead be that younger instructors receive better evaluations? Are there differences in evaluations given by students for instructors of different genders? We’ll answer these questions by modeling the relationship between these variables using multiple regression, where we have: A numerical outcome variable \\(y\\), the instructor’s teaching score, and Two explanatory variables: A numerical explanatory variable \\(x_1\\), the instructor’s age A categorical explanatory variable \\(x_2\\), the instructor’s gender. It is important to note that at the time of this study, due to then commonly held beliefs about gender, this variable was often recorded as a binary variable. 
While the results of a model that oversimplifies gender this way may be imperfect, we still found the results to be very pertinent and relevant today. 6.1.1 Exploratory data analysis Recall that data on the 463 courses at UT Austin can be found in the evals data frame included in the moderndive package. However, to keep things simple, let’s select() only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called evals_ch7. Note that these are different than the variables chosen in Chapter 5. evals_ch7 &lt;- evals %&gt;% select(ID, score, age, gender) Recall the three common steps in an exploratory data analysis we saw in Section 5.1.1: Looking at the raw data values. Computing summary statistics. Creating data visualizations. Let’s first look at the raw data values by either looking at evals_ch7 using RStudio’s spreadsheet viewer or by using the glimpse() function from the dplyr package: glimpse(evals_ch7) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40… $ gender &lt;fct&gt; female, female, female, female, male, male, male, male, male, … Let’s also display a random sample of 5 rows of the 463 rows corresponding to different courses in Table 6.1. Remember due to the random nature of the sampling, you will likely end up with a different subset of 5 rows. evals_ch7 %&gt;% sample_n(size = 5) TABLE 6.1: A random sample of 5 out of the 463 courses at UT Austin ID score age gender 129 3.7 62 male 109 4.7 46 female 28 4.8 62 male 434 2.8 62 male 330 4.0 64 male Now that we’ve looked at the raw values in our evals_ch7 data frame and got a sense of the data, let’s compute summary statistics. 
As we did in our exploratory data analyses in Sections 5.1.1 and 5.2.1 from the previous chapter, let’s use the skim() function from the skimr package, being sure to only select() the variables of interest in our model: evals_ch7 %&gt;% select(score, age, gender) %&gt;% skim() Skim summary statistics n obs: 463 n variables: 3 ── Variable type:factor ──────────────────────────────────────────────────────── variable missing complete n n_unique top_counts ordered gender 0 463 463 2 mal: 268, fem: 195, NA: 0 FALSE ── Variable type:integer ─────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 age 0 463 463 48.37 9.8 29 42 48 57 73 ── Variable type:numeric ─────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 score 0 463 463 4.17 0.54 2.3 3.8 4.3 4.6 5 Observe for example that we have no missing data, that there are 268 courses taught by male instructors and 195 courses taught by female instructors, and that the average instructor age is 48.37. Recall however that each row of our data represents a particular course and that the same instructor often teaches more than one course. Therefore the average age of the unique instructors may differ. Furthermore, let’s compute the correlation coefficient between our two numerical variables: score and age. Recall from Section 5.1.1 that correlation coefficients only exist between numerical variables. We observe that they are “weakly negatively” correlated. evals_ch7 %&gt;% get_correlation(formula = score ~ age) # A tibble: 1 x 1 correlation &lt;dbl&gt; 1 -0.107 Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Given that the outcome variable score and explanatory variable age are both numerical, we’ll use a scatterplot to display their relationship. How can we incorporate the categorical variable gender however? 
By mapping the variable gender to the color aesthetic, thereby creating a colored scatterplot. The following code is very similar to the code that created the scatterplot of teaching score over “beauty” score in Figure 5.2, but with color = gender added to the aes()thetic mapping. ggplot(evals_ch7, aes(x = age, y = score, color = gender)) + geom_point() + labs(x = &quot;Age&quot;, y = &quot;Teaching Score&quot;, color = &quot;Gender&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 6.1: Colored scatterplot of relationship of teaching score and age. In the resulting Figure 6.1, observe that ggplot assigns a default red/blue color scheme to the points and to the lines associated with the two levels of gender: female and male. Furthermore the geom_smooth(method = &quot;lm&quot;, se = FALSE) layer automatically fits a different regression line for each group. We notice some interesting trends. First, there are almost no women faculty over the age of 60 as evidenced by lack of red dots above \\(x\\) = 60. Second, while both regression lines are negatively sloped with age (i.e. older instructors tend to have lower scores), the slope for age for the female instructors is more negative. In other words, female instructors are paying a harsher penalty for their age than the male instructors. 6.1.2 Interaction model Let’s now quantify the relationship of our outcome variable \\(y\\) and the two explanatory variables using one type of multiple regression model known as an interaction model. We’ll explain where the term “interaction” comes from at the end of this section. In particular, we’ll write out the equations of the two regression lines in Figure 6.1 using the values from a regression table. Before we do this however, let’s go over a brief refresher of regression when you have a categorical explanatory variable \\(x\\). Recall in Section 5.2.2 we fit a regression model for countries’ life expectancies as a function of which continent the country was in. 
In other words, we had a numerical outcome variable \\(y\\) = lifeExp and a categorical explanatory variable \\(x\\) = continent which had 5 levels: Africa, Americas, Asia, Europe, and Oceania. Let’s re-display the regression table you saw in Table 5.8: TABLE 6.2: Regression table for life expectancy as a function of continent. term estimate std_error statistic p_value lower_ci upper_ci intercept 54.8 1.02 53.45 0 52.8 56.8 continentAmericas 18.8 1.80 10.45 0 15.2 22.4 continentAsia 15.9 1.65 9.68 0 12.7 19.2 continentEurope 22.8 1.70 13.47 0 19.5 26.2 continentOceania 25.9 5.33 4.86 0 15.4 36.5 Recall our interpretation of the estimate column. Since Africa was the “baseline for comparison” group, the intercept term corresponds to the mean life expectancy for all countries in Africa of 54.8 years. The other 4 values of estimate correspond to “offsets” relative to the baseline group. So for example, the “offset” corresponding to the Americas is +18.8 as compared to the baseline for comparison group Africa. In other words, the average life expectancy for countries in the Americas is 18.8 years higher. Thus the mean life expectancy for all countries in the Americas is 54.8 + 18.8 = 73.6. The same interpretation holds for Asia, Europe, and Oceania. Going back to our multiple regression model for teaching score using age and gender in Figure 6.1, we generate the regression table using the same two-step approach from Chapter 5: we first “fit” the model using the lm() “linear model” function and then we apply the get_regression_table() function. This time however, our model formula won’t be of the form y ~ x, but rather of the form y ~ x1 * x2. In other words, our two explanatory variables x1 and x2 are separated by a * sign: # Fit regression model: score_model_interaction &lt;- lm(score ~ age * gender, data = evals_ch7) # Get regression table: get_regression_table(score_model_interaction) TABLE 6.3: Regression table for interaction model. 
term estimate std_error statistic p_value lower_ci upper_ci intercept 4.883 0.205 23.80 0.000 4.480 5.286 age -0.018 0.004 -3.92 0.000 -0.026 -0.009 gendermale -0.446 0.265 -1.68 0.094 -0.968 0.076 age:gendermale 0.014 0.006 2.45 0.015 0.003 0.024 Looking at the regression table output in Table 6.3, we see there are four rows of values in the estimate column. While it is not immediately apparent, using these four values we can write out the equations of both lines in Figure 6.1. First, since the word female comes alphabetically before male, female instructors are the “baseline for comparison” group. Therefore intercept is the intercept for only the female instructors. This holds similarly for age. It is the slope for age for only the female instructors. Thus the red regression line in Figure 6.1 has an intercept of 4.883 and slope for age of -0.018. Remember that for this particular data, while the intercept has a mathematical interpretation, it has no practical interpretation since there can’t be any instructors with age = 0. What about the intercept and slope for age of the male instructors? In other words, the blue line in Figure 6.1? This is where our notion of “offsets” comes into play once again. The value for gendermale of -0.446 is not the intercept for the male instructors, but rather the offset in intercept for male instructors relative to female instructors. Therefore, the intercept for the male instructors is intercept + gendermale = 4.883 + (-0.446) = 4.883 - 0.446 = 4.437. Similarly, age:gendermale = 0.014 is not the slope for age for the male instructors, but rather the offset in slope for the male instructors. Therefore, the slope for age for the male instructors is age + age:gendermale = -0.018 + 0.014 = -0.004. Therefore the blue regression line in Figure 6.1 has intercept 4.437 and slope for age of -0.004. 
Let’s summarize these values in Table 6.4 and focus on the two slopes for age: TABLE 6.4: Comparison of intercepts and slopes for interaction model. Gender Intercept Slope for age Female instructors 4.883 -0.018 Male instructors 4.437 -0.004 Since the slope for age for the female instructors was -0.018, it means that on average, a female instructor who is a year older would have a teaching score that is 0.018 units lower. For the male instructors however, the corresponding associated decrease was on average only 0.004 units. While both slopes for age were negative, the slope for age for the female instructors is more negative. This is consistent with our observation from Figure 6.1, that this model is suggesting that age impacts teaching scores for female instructors more than for male instructors. Let’s now write the equation for our regression lines, which we can use to compute our fitted values \\(\\widehat{y} = \\widehat{\\text{score}}\\). \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= b_0 + b_{\\mbox{age}} \\cdot \\mbox{age} + b_{\\mbox{male}} \\cdot \\mathbb{1}_{\\mbox{is male}}(x) + b_{\\mbox{age,male}} \\cdot \\mbox{age} \\cdot \\mathbb{1}_{\\mbox{is male}}\\\\ &amp;= 4.883 -0.018 \\cdot \\mbox{age} - 0.446 \\cdot \\mathbb{1}_{\\mbox{is male}}(x) + 0.014 \\cdot \\mbox{age} \\cdot \\mathbb{1}_{\\mbox{is male}} \\end{aligned} \\] Whoa! That’s even more daunting than the equation you saw for the life expectancy as a function of continent in Section 5.2.2! However if you recall what an “indicator function” AKA “dummy variable” does, the equation simplifies greatly. In the previous equation, we have one indicator function of interest: \\[ \\mathbb{1}_{\\mbox{is male}}(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\text{if } \\text{instructor } x \\text{ is male} \\\\ 0 &amp; \\text{otherwise}\\end{array} \\right. 
\\] Next, let’s match coefficients in the previous equation with values in the estimate column in our regression table in Table 6.3: \\(b_0\\) is the intercept = 4.883 for the female instructors \\(b_{\\mbox{age}}\\) is the slope for age = -0.018 for the female instructors \\(b_{\\mbox{male}}\\) is the offset in intercept for the male instructors \\(b_{\\mbox{age,male}}\\) is the offset in slope for age for the male instructors Let’s put this all together and compute the fitted value \\(\\widehat{y} = \\widehat{\\text{score}}\\) for female instructors. Since for female instructors \\(\\mathbb{1}_{\\mbox{is male}}(x)\\) = 0, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.883 - 0.018 \\cdot \\mbox{age} - 0.446 \\cdot 0 + 0.014 \\cdot \\mbox{age} \\cdot 0\\\\ &amp;= 4.883 - 0.018 \\cdot \\mbox{age} - 0 + 0\\\\ &amp;= 4.883 - 0.018 \\cdot \\mbox{age}\\\\ \\end{aligned} \\] which is the equation of the red regression line in Figure 6.1 corresponding to the female instructors in Table 6.4. Correspondingly, since for male instructors \\(\\mathbb{1}_{\\mbox{is male}}(x)\\) = 1, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.883 - 0.018 \\cdot \\mbox{age} - 0.446 + 0.014 \\cdot \\mbox{age}\\\\ &amp;= (4.883 - 0.446) + (- 0.018 + 0.014) \\cdot \\mbox{age}\\\\ &amp;= 4.437 - 0.004 \\cdot \\mbox{age}\\\\ \\end{aligned} \\] which is the equation of the blue regression line in Figure 6.1 corresponding to the male instructors in Table 6.4. Phew! That was a lot of arithmetic! Don’t fret however, this is as hard as modeling will get in this book. If you’re still a little unsure about using indicator functions and using categorical explanatory variables in a regression model, we highly suggest you re-read Section 5.2.2. This involves only a single categorical explanatory variable and thus is much simpler. 
Before we end this section, we explain why we refer to this type of model as an “interaction model.” The \\(b_{\\mbox{age,male}}\\) term in the equation for the fitted value \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) is what’s known in statistical modeling as an “interaction effect.” The interaction term corresponds to the age:gendermale = 0.014 in the final row of the regression table in Table 6.3. We say there is an interaction effect if the associated effect of one variable depends on the value of another variable. In other words, the two variables are “interacting” with each other. In our case, the associated effect of the variable age depends on the value of the other variable gender. This was evidenced by the difference in slopes for age of +0.014 of male instructors relative to female instructors. Another way of thinking about interaction effects on teaching scores is as follows. For a given instructor at UT Austin, there might be an associated effect of their age by itself, there might be an associated effect of their gender by itself, but when age and gender are considered together there might be an additional effect above and beyond the two individual effects. 6.1.3 Parallel slopes model When creating regression models with one numerical and one categorical explanatory variable, we are not just limited to interaction models as we just saw. Another type of model we can use is known as a parallel slopes model. Unlike interaction models where the regression lines can have different intercepts and different slopes, parallel slopes models still allow for different intercepts but force all lines to have the same slope. The resulting regression lines are thus parallel. Let’s visualize the best-fitting parallel slopes model to our evals_ch7 data. Unfortunately, the ggplot2 package does not have a convenient way to plot a parallel slopes model. 
We therefore created our own special purpose function gg_parallel_slopes() and included it in the moderndive package: gg_parallel_slopes(y = &quot;score&quot;, num_x = &quot;age&quot;, cat_x = &quot;gender&quot;, data = evals_ch7) FIGURE 6.2: Parallel slopes model of relationship of score with age and gender. Note the arguments to this function: the outcome variable y = &quot;score&quot;, the numerical explanatory variable num_x = &quot;age&quot;, the categorical explanatory variable cat_x = &quot;gender&quot;, and the data frame that includes this data = evals_ch7. Be careful to include the quotation marks when specifying all variables. Note that the gg_parallel_slopes() function is quite different than all the ggplot() code you saw in Chapter 2. This is because the ggplot2 package does not include a function for plotting parallel slopes models. Thus we had to write a new function for ourselves and include it in the moderndive package. If you’re curious, you can see the code for this function on GitHub. Observe in Figure 6.2 that we now have parallel lines corresponding to the female and male instructors respectively: here they have the same negative slope. This is telling us that instructors who are older will tend to receive lower teaching scores than instructors who are younger. Furthermore, since the lines are parallel, the associated penalty for aging is assumed to be the same for both female and male instructors. However, observe also in Figure 6.2 that these two lines have different intercepts as evidenced by the fact that the blue line corresponding to the male instructors is higher than the red line corresponding to the female instructors. This is telling us that irrespective of age, female instructors tended to receive lower teaching scores than male instructors. 
In order to obtain the precise numerical values of the two intercepts and the single common slope, we once again “fit” the model using the lm() “linear model” function and then apply the get_regression_table() function. However, unlike the interaction model which had a model formula of the form y ~ x1 * x2, our model formula is now of the form y ~ x1 + x2. In other words, our two explanatory variables x1 and x2 are separated by a + sign: # Fit regression model: score_model_parallel_slopes &lt;- lm(score ~ age + gender, data = evals_ch7) # Get regression table: get_regression_table(score_model_parallel_slopes) TABLE 6.5: Regression table for parallel slopes model. term estimate std_error statistic p_value lower_ci upper_ci intercept 4.484 0.125 35.79 0.000 4.238 4.730 age -0.009 0.003 -3.28 0.001 -0.014 -0.003 gendermale 0.191 0.052 3.63 0.000 0.087 0.294 Similarly to the regression table for the interaction model from Table 6.3, we have an intercept term corresponding to the intercept for the “baseline for comparison” female instructor group and a gendermale term corresponding to the offset in intercept for the male instructors relative to female instructors. In other words, in Figure 6.2 the red regression line corresponding to the female instructors has an intercept of 4.484 while the blue regression line corresponding to the male instructors has an intercept of 4.484 + 0.191 = 4.675. Once again, since there aren’t any instructors of age 0, the intercepts only have a mathematical interpretation but no practical one. Unlike in Table 6.3 however, we now only have a single slope for age of -0.009. This is because the model specifies that both the female and male instructors have a common slope for age. This is telling us that an instructor who is a year older than another instructor received a teaching score that is on average 0.009 units lower. This penalty for aging applies equally for both female and male instructors. 
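To make the “different intercepts, common slope” reading concrete, here is a minimal sketch in R. It assumes the rounded estimates from Table 6.5 are typed in by hand; the helper fitted_score() is a hypothetical name used only for illustration and is not a moderndive function.

```r
# Rounded estimates copied by hand from Table 6.5 (assumption: not re-fit here)
b0 <- 4.484      # intercept for the baseline group (female instructors)
b_age <- -0.009  # common slope for age
b_male <- 0.191  # offset in intercept for male instructors

# Hypothetical helper: fitted score under the parallel slopes model
fitted_score <- function(age, gender) {
  b0 + b_age * age + b_male * (gender == "male")
}

# Two instructors of the same age differ only by the intercept offset,
# whatever that age is
gap_at_40 <- fitted_score(40, "male") - fitted_score(40, "female")
gap_at_60 <- fitted_score(60, "male") - fitted_score(60, "female")

# One extra year of age changes the fitted score by the same amount in both groups
drop_female <- fitted_score(41, "female") - fitted_score(40, "female")
drop_male <- fitted_score(41, "male") - fitted_score(40, "male")
```

The gap between the two groups equals the gendermale estimate at every age, and the one-year change equals the age estimate in both groups, which is exactly what it means for the lines to be parallel.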
Let’s summarize these values in Table 6.6, noting the different intercepts but common slope: TABLE 6.6: Comparison of intercepts and slope for parallel slopes model. Gender Intercept Slope for age Female instructors 4.484 -0.009 Male instructors 4.675 -0.009 Let’s now write the equation for our regression lines, which we can use to compute our fitted values \\(\\widehat{y} = \\widehat{\\text{score}}\\). \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= b_0 + b_{\\mbox{age}} \\cdot \\mbox{age} + b_{\\mbox{male}} \\cdot \\mathbb{1}_{\\mbox{is male}}(x)\\\\ &amp;= 4.484 -0.009 \\cdot \\mbox{age} + 0.191 \\cdot \\mathbb{1}_{\\mbox{is male}}(x) \\end{aligned} \\] Let’s put this all together and compute the fitted value \\(\\widehat{y} = \\widehat{\\text{score}}\\) for female instructors. Since for female instructors the indicator function \\(\\mathbb{1}_{\\mbox{is male}}(x)\\) = 0, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.484 -0.009 \\cdot \\mbox{age} + 0.191 \\cdot 0\\\\ &amp;= 4.484 -0.009 \\cdot \\mbox{age} \\end{aligned} \\] which is the equation of the red regression line in Figure 6.2 corresponding to the female instructors. Correspondingly, since for male instructors the indicator function \\(\\mathbb{1}_{\\mbox{is male}}(x)\\) = 1, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.484 -0.009 \\cdot \\mbox{age} + 0.191 \\cdot 1\\\\ &amp;= (4.484 + 0.191) - 0.009 \\cdot \\mbox{age}\\\\ &amp;= 4.675 -0.009 \\cdot \\mbox{age} \\end{aligned} \\] which is the equation of the blue regression line in Figure 6.2 corresponding to the male instructors. Great! We’ve considered both an interaction model and a parallel slopes model for our data. Let’s compare the visualizations for both models side-by-side in Figure 6.3. FIGURE 6.3: Comparison of interaction and parallel slopes models. 
At this point, you might be asking yourself: “Why would we ever use a parallel slopes model?” Looking at the left-hand plot in Figure 6.3, the two lines definitely do not appear to be parallel, so why would we force them to be parallel? For this data, we agree! It can easily be argued that the interaction model is more appropriate. However, in the upcoming Section 6.3.1 on model selection, we’ll present an example where it can be argued that the case for a parallel slopes model might be stronger. 6.1.4 Observed/fitted values and residuals For brevity’s sake, in this section we’ll only compute the observed values, fitted values, and residuals for the interaction model which we saved in score_model_interaction. You’ll have an opportunity to study these values for the parallel slopes model in the upcoming Learning Check. Say you have a professor who is female and is 36 years old. What fitted value \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) would our model yield? Say you have another professor who is male and is 59 years old. What would their fitted value \\(\\widehat{y}\\) be? We answer this question visually first by finding the intersection of the red regression line and the vertical line at \\(x\\) = age = 36. We mark this value with a large red dot in Figure 6.4. Similarly, we can identify the fitted value \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) for the male instructor by finding the intersection of the blue regression line and the vertical line at \\(x\\) = age = 59. We mark this value with a large blue dot in Figure 6.4. FIGURE 6.4: Fitted values for two new professors. What are these two values of \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) precisely? 
We can use the equations of the two regression lines we computed in Section 6.1.2, which in turn were based on values from the regression table in Table 6.3: For all female instructors: \\(\\widehat{y} = \\widehat{\\text{score}} = 4.883 - 0.018 \\cdot \\mbox{age}\\) For all male instructors: \\(\\widehat{y} = \\widehat{\\text{score}} = 4.437 - 0.004 \\cdot \\mbox{age}\\) So our fitted values would be: 4.883 - 0.018 \\(\\cdot\\) 36 = 4.25 and 4.437 - 0.004 \\(\\cdot\\) 59 = 4.20 respectively. Now say we want the fitted values not just for the instructors of these two courses, but for the instructors of all 463 courses included in the evals_ch7 data frame? Doing this by hand would be long and tedious! This is where the get_regression_points() function from the moderndive package can help: it will quickly automate this for all 463 courses. We present a preview of just the first 10 rows out of 463 in Table 6.7. regression_points &lt;- get_regression_points(score_model_interaction) regression_points TABLE 6.7: Regression points (First 10 out of 463 courses) ID score age gender score_hat residual 1 4.7 36 female 4.25 0.448 2 4.1 36 female 4.25 -0.152 3 3.9 36 female 4.25 -0.352 4 4.8 36 female 4.25 0.548 5 4.6 59 male 4.20 0.399 6 4.3 59 male 4.20 0.099 7 2.8 59 male 4.20 -1.401 8 4.1 51 male 4.23 -0.133 9 3.4 51 male 4.23 -0.833 10 4.5 40 female 4.18 0.318 In fact, it turns out that the female instructor of age 36 taught the first four courses, while the male instructor of age 59 taught the next three. The resulting \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) fitted values are in the score_hat column. Furthermore, the get_regression_points() function also returns the residuals \\(y-\\widehat{y}\\). Notice for example that the first and fourth courses the female instructor of age 36 taught had positive residuals, indicating that the actual teaching scores they received from students were greater than their fitted score of 4.25. 
On the other hand, the second and third courses this instructor taught had negative residuals, indicating that the actual teaching scores they received from students were less than their fitted score of 4.25. Learning check (LC6.1) Compute the observed values, fitted values, and residuals not for the interaction model as we just did, but rather for the parallel slopes model we saved in score_model_parallel_slopes. 6.2 Two numerical explanatory variables Let’s now switch gears and consider multiple regression models where instead of one numerical and one categorical explanatory variable, we now have two numerical explanatory variables. The dataset we’ll use is from “An Introduction to Statistical Learning with Applications in R (ISLR)”, an intermediate-level textbook on statistical and machine learning. Its accompanying ISLR R package contains the datasets that the authors apply various machine learning methods to. One frequently used dataset in this book is the Credit dataset, where the outcome variable of interest is the credit card debt of 400 individuals. Other variables like income, credit limit, credit rating, and age are included as well. Note that the Credit data is not based on real individuals’ financial information, but rather is a simulated dataset used for educational purposes. In this section, we’ll fit a regression model where we have A numerical outcome variable \\(y\\), the cardholder’s credit card debt Two explanatory variables: One numerical explanatory variable \\(x_1\\), the cardholder’s credit limit Another numerical explanatory variable \\(x_2\\), the cardholder’s income (in thousands of dollars). 6.2.1 Exploratory data analysis Let’s load the Credit dataset, but to keep things simple let’s select() only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called credit_ch7. Notice our slightly different use of the select() verb here than we introduced in Subsection 3.8.1. 
For example, we’ll select the Balance variable from Credit but then save it with a new variable name debt. We do this because here the term “debt” is a little more interpretable than “balance.” library(ISLR) credit_ch7 &lt;- Credit %&gt;% as_tibble() %&gt;% select(ID, debt = Balance, credit_limit = Limit, income = Income, credit_rating = Rating, age = Age) You can observe the effect of our use of select() in the first common step of an exploratory data analysis: looking at the raw values either in RStudio’s spreadsheet viewer or by using glimpse(). glimpse(credit_ch7) Observations: 400 Variables: 6 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … $ debt &lt;int&gt; 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 140… $ credit_limit &lt;int&gt; 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6… $ income &lt;dbl&gt; 14.9, 106.0, 104.6, 148.9, 55.9, 80.2, 21.0, 71.4, 15.1… $ credit_rating &lt;int&gt; 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, … $ age &lt;int&gt; 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49,… Furthermore, let’s look at a random sample of five out of the 400 credit card holders in Table 6.8. Note that due to the random nature of the sampling, you will likely end up with a different subset of five rows. set.seed(9) credit_ch7 %&gt;% sample_n(size = 5) TABLE 6.8: Random sample of 5 credit card holders. ID debt credit_limit income credit_rating age 272 436 4866 45.0 347 30 239 52 2910 26.5 236 58 87 815 6340 55.4 448 33 108 0 3189 39.1 263 72 149 0 2420 15.2 192 69 Now that we’ve looked at the raw values in our credit_ch7 data frame and got a sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. 
Let’s use the skim() function from the skimr package, being sure to only select() the columns of interest for our model: credit_ch7 %&gt;% select(debt, credit_limit, income) %&gt;% skim() Skim summary statistics n obs: 400 n variables: 3 ── Variable type:integer ─────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 credit_limit 0 400 400 4735.6 2308.2 855 3088 4622.5 5872.75 13913 debt 0 400 400 520.01 459.76 0 68.75 459.5 863 1999 ── Variable type:numeric ─────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 income 0 400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63 Observe the summary statistics for the outcome variable debt: the mean and median credit card debt are $520.01 and $459.50 respectively, and 25% of card holders had debts of $68.75 or less. Let’s now look at the two explanatory variables. For credit_limit, the mean and median credit card limit are $4735.60 and $4622.50 respectively. For income, which is recorded in thousands of dollars, 75% of card holders had incomes of $57,470 or less. Since our outcome variable debt and the explanatory variables credit_limit and income are numerical, we can compute the correlation coefficient between the different possible pairs of these variables. First, we can run the get_correlation() command as seen in Subsection 5.1.1 twice, once for each explanatory variable: credit_ch7 %&gt;% get_correlation(debt ~ credit_limit) credit_ch7 %&gt;% get_correlation(debt ~ income) Or we can simultaneously compute them by returning a correlation matrix which we display in Table 6.9. We can read off the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination. credit_ch7 %&gt;% select(debt, credit_limit, income) %&gt;% cor() TABLE 6.9: Correlation coefficients between credit card debt, credit limit, and income. 
debt credit_limit income debt 1.000 0.862 0.464 credit_limit 0.862 1.000 0.792 income 0.464 0.792 1.000 For example, the correlation coefficient of: debt with itself is 1 as we would expect based on the definition of the correlation coefficient. debt with credit_limit is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card debts. debt with income is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between debt and credit_limit. As an added bonus, we can read off the correlation coefficient between the two explanatory variables, credit_limit and income, of 0.792. We say there is a high degree of collinearity between the credit_limit and income explanatory variables. Collinearity (or multicollinearity) is a phenomenon where one explanatory variable in a multiple regression model is highly correlated with another. So in our case since credit_limit and income are highly correlated, if we knew someone’s credit_limit, we could make pretty good guesses about their income as well. Thus, these two variables provide somewhat redundant information. However, we’ll leave discussion on how to work with collinear explanatory variables to a more intermediate-level book on regression modeling. Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots in Figure 6.5. 
ggplot(credit_ch7, aes(x = credit_limit, y = debt)) + geom_point() + labs(x = &quot;Credit limit (in $)&quot;, y = &quot;Credit card debt (in $)&quot;, title = &quot;Debt and credit limit&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) ggplot(credit_ch7, aes(x = income, y = debt)) + geom_point() + labs(x = &quot;Income (in $1000)&quot;, y = &quot;Credit card debt (in $)&quot;, title = &quot;Debt and income&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 6.5: Relationship between credit card debt and credit limit/income. Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases so also does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier. In the case of income, the positive relationship doesn’t appear as strong, given the weakly positive correlation coefficient of 0.464. However, the two plots in Figure 6.5 only focus on the relationship of the outcome variable with each of the two explanatory variables separately. To visualize the joint relationship of all three variables simultaneously, we need a 3-dimensional (3D) scatterplot as seen in Figure 6.6. Each of the 400 observations in the credit_ch7 data frame is marked with a blue point where The numerical outcome variable \\(y\\) debt is on the vertical axis The two numerical explanatory variables, \\(x_1\\) income and \\(x_2\\) credit_limit, are on the two axes that form the bottom plane. FIGURE 6.6: 3D scatterplot and regression plane. Furthermore, we also include the regression plane. Recall from Section 5.3.2 that regression lines are “best-fitting” in that of all possible lines we can draw through a cloud of points, the regression line minimizes the sum of squared residuals. This concept also extends to models with two numerical explanatory variables. 
The difference is that instead of a “best-fitting” line, we now have a “best-fitting” plane that similarly minimizes the sum of squared residuals. An interactive version of this plot can be opened in your browser. Learning check (LC6.2) Conduct a new exploratory data analysis with the same outcome variable \\(y\\) being debt but with credit_rating and age as the new explanatory variables \\(x_1\\) and \\(x_2\\). Remember, this involves three things: Most crucially: Looking at the raw data values. Computing summary statistics, such as means, medians, and interquartile ranges. Creating data visualizations. What can you say about the relationship between a credit card holder’s debt and their credit rating and age? 6.2.2 Regression plane Let’s now fit a regression model and get the regression table corresponding to the regression plane in Figure 6.6. To keep things brief in this subsection, we won’t consider an interaction model for the two numerical explanatory variables income and credit_limit like we did in Section 6.1.2 using the model formula score ~ age * gender. Rather we’ll only consider a model fit with a formula of the form y ~ x1 + x2. Somewhat confusingly, however, since we now have a regression plane instead of multiple lines, the label “parallel slopes” doesn’t apply when you have two numerical explanatory variables. Just as we have done multiple times throughout Chapter 5 and this chapter, let’s get the regression table for this model using our two-step process and display the results in Table 6.10. We first “fit” the linear regression model using the lm(y ~ x1 + x2, data) function and save it in debt_model. We get the regression table by applying the get_regression_table() function from the moderndive package to debt_model. 
# Fit regression model: debt_model &lt;- lm(debt ~ credit_limit + income, data = credit_ch7) # Get regression table: get_regression_table(debt_model) TABLE 6.10: Multiple regression table term estimate std_error statistic p_value lower_ci upper_ci intercept -385.179 19.465 -19.8 0 -423.446 -346.912 credit_limit 0.264 0.006 45.0 0 0.253 0.276 income -7.663 0.385 -19.9 0 -8.420 -6.906 Let’s interpret the three values in the estimate column. First, intercept = -$385.179. The intercept represents the credit card debt for an individual who has credit_limit of $0 and income of $0. In our data however, the intercept has limited practical interpretation since no individuals had credit_limit or income values of $0. Rather, the intercept is used to situate the regression plane in 3D space. Second, credit_limit = $0.264. Taking into account all the other explanatory variables in our model, for every increase of one dollar in credit_limit, there is an associated increase of on average $0.26 in credit card debt. Just as we did in Subsection 5.1.2, we are cautious not to imply causality as we saw in Subsection 5.3.1 that “correlation is not necessarily causation.” We do this by merely stating there was an associated increase. Furthermore, we preface our interpretation with the statement “taking into account all the other explanatory variables in our model.” Here, by all other explanatory variables we mean income. We do this to emphasize that we are now jointly interpreting the associated effect of multiple explanatory variables in the same model at the same time. Third, income = -$7.663. Taking into account all the other explanatory variables in our model, for every increase of one unit in the variable income, in other words $1000 in actual income, there is an associated decrease of on average $7.663 in credit card debt. 
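The phrase “taking into account all the other explanatory variables” can be illustrated numerically. The following is a minimal sketch in R, assuming the rounded estimates from Table 6.10 are typed in by hand; fitted_debt() is a hypothetical helper name, and the credit limit of $5000 and income of 50 (i.e. $50,000) are arbitrary illustrative values.

```r
# Rounded estimates copied by hand from Table 6.10 (assumption: not re-fit here)
b0 <- -385.179
b_limit <- 0.264
b_income <- -7.663

# Hypothetical helper: fitted debt given credit limit (in $) and income (in $1000)
fitted_debt <- function(credit_limit, income) {
  b0 + b_limit * credit_limit + b_income * income
}

# Raise credit_limit by $1 while holding income fixed
effect_limit <- fitted_debt(5001, 50) - fitted_debt(5000, 50)

# Raise income by one unit ($1000) while holding credit_limit fixed
effect_income <- fitted_debt(5000, 51) - fitted_debt(5000, 50)
```

Each difference recovers the corresponding estimate precisely because the other explanatory variable is held at the same value in both terms of the subtraction.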
Putting these results together, the equation of the regression plane that gives us fitted values \\(\\widehat{y}\\) = \\(\\widehat{\\text{debt}}\\) is: \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x_1 + b_2 \\cdot x_2\\\\ \\widehat{\\text{debt}} &amp;= b_0 + b_{\\text{limit}} \\cdot \\text{limit} + b_{\\text{income}} \\cdot \\text{income}\\\\ &amp;= -385.179 + 0.264 \\cdot\\text{limit} - 7.663 \\cdot\\text{income} \\end{aligned} \\] Recall in the right-hand plot of Figure 6.5 that when plotting the relationship between debt and income in isolation, there appeared to be a positive relationship. In the multiple regression we just discussed, however, when jointly modeling the relationship between debt, credit_limit, and income, there appears to be a negative relationship of debt and income as evidenced by the negative slope for income of -$7.663. What explains these contradictory results? A phenomenon known as Simpson’s Paradox, whereby overall trends that exist in aggregate either disappear or reverse when the data are broken down into groups. In Subsection 6.3.3 we elaborate on this idea by looking at the relationship between income and credit card debt, but split along different credit limit brackets. Learning check (LC6.3) Fit a new multiple regression model using lm(debt ~ credit_rating + age, data = credit_ch7) where credit_rating and age are the new numerical explanatory variables \\(x_1\\) and \\(x_2\\). Get information about the “best-fitting” regression plane from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your previous exploratory data analysis? 6.2.3 Observed/fitted values and residuals Let’s also compute all fitted values and residuals for our regression model using the get_regression_points() function and present only the first 10 rows of output in Table 6.11. 
Remember that the coordinates of each of the blue points in our 3D scatterplot in Figure 6.6 can be found in the income, credit_limit, and debt columns. The fitted values on the regression plane are found in the debt_hat column and are computed using our equation for the regression plane in the previous section: \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{debt}} &amp;= -385.179 + 0.264 \\cdot \\text{limit} - 7.663 \\cdot \\text{income} \\end{aligned} \\] regression_points &lt;- get_regression_points(debt_model) regression_points TABLE 6.11: Regression points (First 10 credit card holders out of 400). ID debt credit_limit income debt_hat residual 1 333 3606 14.9 454 -120.8 2 903 6645 106.0 559 344.3 3 580 7075 104.6 683 -103.4 4 964 9504 148.9 986 -21.7 5 331 4897 55.9 481 -150.0 6 1151 8047 80.2 1127 23.6 7 203 3388 21.0 349 -146.4 8 872 7114 71.4 948 -76.0 9 279 3300 15.1 371 -92.2 10 1350 6819 71.1 873 477.3 6.3 Related topics 6.3.1 Model selection When do we use an interaction model versus a parallel slopes model? Recall in Sections 6.1.2 and 6.1.3 we fit both interaction and parallel slopes models for the outcome variable \\(y\\) teaching score using a numerical explanatory variable \\(x_1\\) age and a categorical explanatory variable \\(x_2\\) gender (recorded as a binary variable). We compared these models in Figure 6.3, which we display again now. FIGURE 6.7: Previously seen comparison of interaction and parallel slopes models. 
A lot of you might have asked yourselves: “Why would I force the lines to have parallel slopes (as seen in the right-hand plot) when they clearly have different slopes (as seen in the left-hand plot).” The answer lies in a philosophical principle known as “Occam’s Razor.” It states that “all other things being equal, simpler solutions are more likely to be correct than complex ones.” When viewed in a modeling framework, Occam’s Razor can be restated as “all other things being equal, simpler models are to be preferred over complex ones.” In other words, we should only favor the more complex model if the additional complexity is warranted. Let’s revisit the equations for the regression line for both the interaction and parallel slopes model: \\[ \\begin{aligned} \\text{Interaction} &amp;: \\widehat{y} = \\widehat{\\text{score}} = b_0 + b_{\\mbox{age}} \\cdot \\mbox{age} + b_{\\mbox{male}} \\cdot \\mathbb{1}_{\\mbox{is male}}(x) + \\\\ &amp; \\qquad b_{\\mbox{age,male}} \\cdot \\mbox{age} \\cdot \\mathbb{1}_{\\mbox{is male}}\\\\ \\text{Parallel slopes} &amp;: \\widehat{y} = \\widehat{\\text{score}} = b_0 + b_{\\mbox{age}} \\cdot \\mbox{age} + b_{\\mbox{male}} \\cdot \\mathbb{1}_{\\mbox{is male}}(x) \\end{aligned} \\] The interaction model is “more complex” in that there is an additional \\(b_{\\mbox{age,male}} \\cdot \\mbox{age} \\cdot \\mathbb{1}_{\\mbox{is male}}\\) element to the equation not present for the parallel slopes model. Or viewed alternatively, the regression table for the interaction model in Table 6.3 has four rows, whereas the regression table for the parallel slopes model in Table 6.5 has three rows. The question becomes: “Is this additional complexity warranted?” In this case, it can be argued that this additional complexity is warranted, as evidenced by the clear x-shaped pattern of the two regression lines in the left-hand plot of Figure 6.7. However, let’s consider an example where the additional complexity might not be warranted. 
Let’s consider the MA_schools data which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education; read the help file for this data by running ?MA_schools if you would like more details. Let’s model the numerical outcome variable \\(y\\), average SAT math score for that high school, as a function of two explanatory variables: A numerical explanatory variable \\(x_1\\), the percentage of that high school’s student body that are economically disadvantaged and A categorical explanatory variable \\(x_2\\), the school size as measured by enrollment: small (13-341 students), medium (342-541 students), and large (542-4264 students) Let’s create visualizations of both the interaction and parallel slopes model once again and display the output in Figure 6.8. Recall from Subsection 6.1.3 that the gg_parallel_slopes() function is a special purpose function included in the moderndive package, since the ggplot2 package does not include a function for plotting parallel slopes models. # Interaction model ggplot(MA_schools, aes(x = perc_disadvan, y = average_sat_math, color = size)) + geom_point(alpha = 0.25) + geom_smooth(method = &quot;lm&quot;, se = FALSE ) + labs(x = &quot;Percent economically disadvantaged&quot;, y = &quot;Math SAT Score&quot;, color = &quot;School size&quot;, title = &quot;Interaction model&quot;) # Parallel slopes model gg_parallel_slopes(y = &quot;average_sat_math&quot;, num_x = &quot;perc_disadvan&quot;, cat_x = &quot;size&quot;, data = MA_schools, alpha = 0.25) + labs(x = &quot;Percent economically disadvantaged&quot;, y = &quot;Math SAT Score&quot;, color = &quot;School size&quot;, title = &quot;Parallel slopes model&quot;) FIGURE 6.8: Comparison of interaction and parallel slopes models for MA schools. Look closely at the left-hand plot of Figure 6.8 corresponding to an interaction model. While the slopes are indeed different, they do not differ by much. In other words, they are near identical. 
Now compare the left-hand plot with the right-hand plot corresponding to a parallel slopes model. The two models don’t appear all that different. Therefore in this case, it can be argued that the additional complexity of the interaction model is not warranted. Thus following Occam’s Razor, we should prefer the “simpler” parallel slopes model. Let’s explicitly define what “simpler” means in this case. Let’s compare the regression tables for the interaction and parallel slopes models in Tables 6.12 and 6.13. model_2_interaction &lt;- lm(average_sat_math ~ perc_disadvan * size, data = MA_schools) get_regression_table(model_2_interaction) TABLE 6.12: Interaction model regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 594.327 13.288 44.726 0.000 568.186 620.469 perc_disadvan -2.932 0.294 -9.961 0.000 -3.511 -2.353 sizemedium -17.764 15.827 -1.122 0.263 -48.899 13.371 sizelarge -13.293 13.813 -0.962 0.337 -40.466 13.880 perc_disadvan:sizemedium 0.146 0.371 0.393 0.694 -0.585 0.877 perc_disadvan:sizelarge 0.189 0.323 0.586 0.559 -0.446 0.824 model_2_parallel_slopes &lt;- lm(average_sat_math ~ perc_disadvan + size, data = MA_schools) get_regression_table(model_2_parallel_slopes) TABLE 6.13: Parallel slopes regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 588.19 7.607 77.325 0.000 573.23 603.15 perc_disadvan -2.78 0.106 -26.120 0.000 -2.99 -2.57 sizemedium -11.91 7.535 -1.581 0.115 -26.74 2.91 sizelarge -6.36 6.923 -0.919 0.359 -19.98 7.26 Observe how the regression table for the interaction model has 2 more rows (6 versus 4). This reflects the additional “complexity” of the interaction model over the parallel slopes model. Furthermore, note in Table 6.12 how the offsets for the slopes perc_disadvan:sizemedium = 0.146 and perc_disadvan:sizelarge = 0.189 are very small relative to the slope for the baseline group of small schools. 
In other words, all three slopes are similarly negative: -2.932 for small schools, -2.786 (= -2.932 + 0.146) for medium schools, and -2.743 (= -2.932 + 0.189) for large schools. These results suggest that irrespective of school size, the relationship between average math SAT scores and the percent of the student body that is economically disadvantaged is similar and, alas, very negative. What you have just performed is a rudimentary model selection: choosing which model fits data best among a set of candidate models. While the model selection you just performed was somewhat qualitative, more statistically rigorous methods exist. If you’re curious, take a course on multiple regression! 6.3.2 Correlation coefficient Recall from Table 6.9 that the correlation coefficient between income in thousands of dollars and credit card debt was 0.464. What if instead we looked at the correlation coefficient between income and credit card debt, but where income was in dollars and not thousands of dollars? This can be done by multiplying income by 1000. credit_ch7 %&gt;% select(debt, income) %&gt;% mutate(income = income * 1000) %&gt;% cor() TABLE 6.14: Correlation between income (in dollars) and credit card debt debt income debt 1.000 0.464 income 0.464 1.000 We see it is the same! We say that the correlation coefficient is invariant to linear transformations! In other words, the correlation between \\(x\\) and \\(y\\) will be the same as the correlation between \\(a\\cdot x + b\\) and \\(y\\) for any positive number \\(a\\) and any number \\(b\\) (a negative \\(a\\) would flip the sign of the correlation). 6.3.3 Simpson’s Paradox Recall in Section 6.2, we saw two seemingly contradictory results when studying the relationship between credit card debt and income. On the one hand, the right-hand plot of Figure 6.5 suggested that the relationship between credit card debt and income was positive. We re-display this plot in Figure 6.9. FIGURE 6.9: Relationship between credit card debt and income. 
On the other hand, the multiple regression table in Table 6.10 suggested that the relationship between debt and income was negative. We re-display this table in Table 6.15. TABLE 6.15: Multiple regression table term estimate std_error statistic p_value lower_ci upper_ci intercept -385.179 19.465 -19.8 0 -423.446 -346.912 credit_limit 0.264 0.006 45.0 0 0.253 0.276 income -7.663 0.385 -19.9 0 -8.420 -6.906 Observe how the slope for income is -7.663 and, most importantly for now, it is negative. This contradicts our observation in Figure 6.9 that the relationship is positive. How can this be? Recall the interpretation of the slope for income in the context of a multiple regression model: taking into account all the other explanatory variables in our model, for every increase of one unit in income (i.e. $1000), there is an associated decrease of on average $7.663 in debt. In other words, while in isolation the relationship between debt and income may be positive, when taking into account credit limit as well, this relationship becomes negative. These seemingly paradoxical results are due to a phenomenon aptly named Simpson’s Paradox. Simpson’s paradox occurs when trends that exist for the data in aggregate either disappear or reverse when the data are broken down into groups. Let’s show how Simpson’s Paradox manifests itself in the credit_ch7 data. Let’s first visualize the distribution of the numerical explanatory variable credit limit with a histogram in Figure 6.10. FIGURE 6.10: Histogram of credit limits and brackets. The vertical dashed lines are the quartiles that cut up the variable credit limit into four equally sized groups. Let’s think of these quartiles as converting our numerical variable credit limit into a categorical variable “credit limit bracket” with 4 levels. This means 25% of credit limits were between $0 and $3088. Let’s assign these 100 people to the “low” credit limit bracket. 25% of credit limits were between $3088 and $4622. 
Let’s assign these 100 people to the “medium-low” credit limit bracket. 25% of credit limits were between $4622 and $5873. Let’s assign these 100 people to the “medium-high” credit limit bracket. 25% of credit limits were over $5873. Let’s assign these 100 people to the “high” credit limit bracket. Now in Figure 6.11 let’s re-display two versions of the scatterplot of debt and income from Figure 6.9, but with a slight twist: The left-hand plot shows the regular scatterplot and the single regression line, just as you saw previously. The right-hand plot shows the colored scatterplot, where the color aesthetic is mapped to “credit limit bracket.” Furthermore, there are now four separate regression lines. In other words, the locations of the 400 points are the same in both scatterplots, but the right-hand plot shows an additional variable of information: credit limit bracket. FIGURE 6.11: Relationship between credit card debt and income by credit limit bracket. The left-hand plot of Figure 6.11 focuses on the relationship between debt and income in aggregate. It is suggesting that overall there exists a positive relationship between debt and income. However, the right-hand plot of Figure 6.11 focuses on the relationship between debt and income broken down by credit limit bracket. In other words, we focus on four separate relationships between debt and income: one for the “low” credit limit bracket, one for the “medium-low” credit limit bracket, and so on. Observe in the right-hand plot that the relationship between debt and income is clearly negative for the “medium-low” and “medium-high” credit limit brackets, while the relationship is somewhat flat for the “low” credit limit bracket. The only credit limit bracket where the relationship remains positive is the “high” credit limit bracket. However, this relationship is less positive than the relationship in aggregate, since the slope is shallower than the slope of the regression line in the left-hand plot. 
In this example of Simpson’s Paradox, credit limit is a confounding variable of the relationship between credit card debt and income as we defined in Subsection 5.3.1, and thus needs to be accounted for in any appropriate model for the relationship between debt and income. 6.4 Conclusion 6.4.1 Additional resources An R script file of all R code used in this chapter is available here. 6.4.2 What’s to come? Congratulations! We’ve completed our first pass through the “Data modeling with moderndive” portion of this book. We’re ready to proceed to the next portion of this book: “Statistical inference with infer”. Statistical inference is the science of inferring about some unknown quantity using sampling. For example, one of the most well-known examples of sampling involves polls. Because asking an entire population about their opinions would be a long and arduous task, pollsters often take a smaller sample that is hopefully representative of the population. Based on the results of this sample, pollsters hope to make claims about the entire population. Once we’ve covered Chapters 7 on sampling, 8 on confidence intervals, and 9 on hypothesis testing, in Chapter 10 on inference for regression we’ll revisit the regression models we studied in Chapters 5 and 6. So far we’ve only studied the estimate column of all our regression tables. The next 4 chapters focus on what the remaining columns mean: the std_error standard error, the statistic test statistic, the p_value p-value, and the lower_ci and upper_ci lower and upper bounds of confidence intervals. Furthermore in Chapter 10, we’ll revisit the concept of residuals \\(y - \\widehat{y}\\) and discuss their importance when interpreting the results of a regression model. We’ll perform what is known as a residual analysis of the residual variable of all get_regression_points() outputs. Residual analyses allow you to verify what are known as the conditions for inference for regression. On to Chapter 7 on sampling! 
FIGURE 6.12: ModernDive flowchart - On to Part III! "],
-["7-sampling.html", "Chapter 7 Sampling 7.1 Sampling bowl activity 7.2 Virtual sampling 7.3 Sampling framework 7.4 Case study: Polls 7.5 Conclusion", " Chapter 7 Sampling In this chapter, we kick off the third portion of this book on statistical inference by learning about sampling. The concepts behind sampling form the basis of confidence intervals and hypothesis testing, which we’ll cover in Chapters 8 and 9. We will see that the tools that you learned in the data science portion of this book, in particular data visualization and data wrangling, will also play an important role in the development of your understanding. As mentioned before, the concepts throughout this text all build into a culmination allowing you to “tell the story with data.” Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) 7.1 Sampling bowl activity Let’s start with a hands-on activity. 7.1.1 What proportion of this bowl’s balls are red? Take a look at the bowl in Figure 7.1. It has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand, as there does not seem to be any coherent pattern to the spatial distribution of the red and white balls. Let’s now ask ourselves, what proportion of this bowl’s balls are red? FIGURE 7.1: A bowl with red and white balls. 
One way to answer this question would be to perform an exhaustive count: remove each ball individually, count the number of red balls and the number of white balls, and divide the number of red balls by the total number of balls. However, this would be a long and tedious process. 7.1.2 Using the shovel once Instead of performing an exhaustive count, let’s insert a shovel into the bowl as seen in Figure 7.2. Using the shovel let’s remove 5 \\(\\times\\) 10 = 50 balls, as seen in Figure 7.3. FIGURE 7.2: Inserting a shovel into the bowl. FIGURE 7.3: Fifty balls from the bowl. Observe that 17 of the balls are red and thus 0.34 = 34% of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count of all the balls in the bowl, our guess of 34% took much less time and energy to make. However, say, we started this activity over from the beginning. In other words, we replace the 50 balls back into the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe? What if we repeated this activity several times? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not. Let’s repeat this exercise several times with the help of 33 groups of friends to understand how the value differs with repetition. 7.1.3 Using the shovel 33 times Each of our 33 groups of friends will do the following: Use the shovel to remove 50 balls each. Count the number of red balls and thus compute the proportion of the 50 balls that are red. Return the balls into the bowl. Mix the contents of the bowl a little to not let a previous group’s results influence the next group’s. FIGURE 7.4: Repeating sampling activity 33 times. 
Before returning the balls into the bowl, each of our 33 groups of friends are going to mark their proportion of the 50 balls that were red in a hand-drawn histogram as seen in Figure 7.5. FIGURE 7.5: Constructing a histogram of proportions. Recall from Section 2.5 that histograms allow us to visualize the distribution of a numerical variable. In particular, where the center of the values falls and how the values vary. A partially completed histogram of the first 10 out of 33 groups of friends’ results can be seen in Figure 7.6. FIGURE 7.6: Hand-drawn histogram of first 10 out of 33 proportions. Observe the following in the histogram in Figure 7.6: At the low end, one group removed 50 balls from the bowl with proportion between 0.20 and 0.25. At the high end, another group removed 50 balls from the bowl with proportion between 0.45 and 0.5 red. However the most frequently occurring proportions were between 0.30 and 0.35 red, right in the middle of the distribution. The shape of this distribution is somewhat bell-shaped. Let’s construct this same hand-drawn histogram in R using your data visualization skills that you honed in Chapter 2. We saved our 33 groups of friends’ results in a data frame tactile_prop_red included in the moderndive package. Run the following to display the first 10 of 33 rows: tactile_prop_red # A tibble: 33 x 4 group replicate red_balls prop_red &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 Ilyas, Yohan 1 21 0.42 2 Morgan, Terrance 2 17 0.34 3 Martin, Thomas 3 21 0.42 4 Clark, Frank 4 21 0.42 5 Riddhi, Karina 5 18 0.36 6 Andrew, Tyler 6 19 0.38 7 Julia 7 19 0.38 8 Rachel, Lauren 8 11 0.22 9 Daniel, Caroline 9 15 0.3 10 Josh, Maeve 10 17 0.34 # … with 23 more rows Observe for each group that we have their names, the number of red_balls they obtained, and the corresponding proportion out of 50 balls that were red named prop_red. 
We also have a variable replicate enumerating each of the 33 groups; we chose this name because each row can be viewed as one instance of a replicated (in other words repeated) activity: using the shovel to remove 50 balls and computing the proportion of those balls that are red. Let’s visualize the distribution of these 33 proportions using a geom_histogram() with binwidth = 0.05 in Figure 7.7. This is a computerized and complete version of the partially completed hand-drawn histogram you saw in Figure 7.6. ggplot(tactile_prop_red, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Distribution of 33 proportions red&quot;) FIGURE 7.7: Distribution of 33 proportions based on 33 samples of size 50. 7.1.4 What did we just do? What we just demonstrated in this activity is the statistical concept of sampling. We would like to know the proportion of the bowl’s balls that are red. However, because the bowl has a very large number of balls, performing an exhaustive count of the red and white balls would be very time-consuming. We therefore extracted a sample of 50 balls using the shovel to make an estimate. Using this sample of 50 balls, we estimated the proportion of the bowl’s balls that are red to be 34%. Moreover, because we mixed the balls before each use of the shovel, the samples were randomly drawn. Because each sample was drawn at random, the samples were different from each other. Because the samples were different from each other, we obtained the different proportions red observed in Figure 7.7. This is known as the concept of sampling variation. The purpose of this sampling activity was to develop an understanding of two key concepts relating to sampling: Understanding the effect of sampling variation. Understanding the effect of sample size on sampling variation. 
In Section 7.2, we’ll mimic the hands-on sampling activity we just performed on a computer. This will allow us not only to repeat the sampling exercise much more than 33 times, but it will also allow us to use shovels with different numbers of slots than just 50. Afterwards, we’ll present you with definitions, terminology, and notation related to sampling in Section 7.3. As in many disciplines, such necessary background knowledge may seem very inaccessible and even confusing at first. However, as with many difficult topics, if you truly understand the underlying concepts and practice, practice, practice, you’ll be able to master them. To tie the contents of this chapter to the real world, we’ll present an example of one of the most recognizable uses of sampling: polls. In Section 7.4 we’ll look at a particular case study: a 2013 poll on then U.S. President Obama’s popularity among young Americans, conducted by the Harvard Kennedy School’s Institute of Politics. To close this chapter we’ll generalize the previous “sampling from a bowl” exercise to other sampling scenarios, present an important theoretical result known as the Central Limit Theorem, and present a few mathematical formulas related to sampling. Learning check (LC7.1) Why was it important to mix the bowl before we sampled the balls? (LC7.2) Why is it that our 33 groups of friends did not all have the same numbers of balls that were red out of 50, and hence different proportions red? 7.2 Virtual sampling In the previous Section 7.1, we performed a tactile sampling activity by hand. In other words, we used a physical bowl of balls and a physical shovel. We performed this sampling activity by hand first so that we develop a firm understanding of the root ideas behind sampling. In this section, we’ll mimic this tactile sampling activity with a virtual sampling activity using a computer. In other words, we’ll use a virtual analog to the bowl of balls and a virtual analog to the shovel. 
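As a rough preview of what a virtual analog to the bowl and the shovel can look like, here is a minimal sketch in base R. It is not the approach used in the rest of this chapter. The names toy_bowl, toy_shovel, and toy_props, as well as the split of 400 red and 600 white balls, are hypothetical and chosen only for illustration.

```r
# Hypothetical toy bowl (made-up numbers): 400 red and 600 white balls.
toy_bowl <- rep(c('red', 'white'), times = c(400, 600))

# One use of a 50-slot shovel: draw 50 balls without replacement.
toy_shovel <- sample(toy_bowl, size = 50)

# Proportion of the 50 sampled balls that are red.
mean(toy_shovel == 'red')

# 33 uses of the shovel, keeping one proportion red per use.
toy_props <- replicate(33, mean(sample(toy_bowl, size = 50) == 'red'))

# The 33 proportions differ from one another: sampling variation.
range(toy_props)
```

Because the draws are random, the values you obtain will likely differ from run to run unless you fix the random number generator with set.seed().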
7.2.1 Using the virtual shovel once Let’s start by performing the virtual analog of the tactile sampling exercise we performed in Section 7.1. We first need a virtual analog of the bowl seen in Figure 7.1. To this end, we included a data frame bowl in the moderndive package. The rows of bowl correspond exactly with the contents of the actual bowl. bowl # A tibble: 2,400 x 2 ball_ID color &lt;int&gt; &lt;chr&gt; 1 1 white 2 2 white 3 3 white 4 4 red 5 5 white 6 6 white 7 7 red 8 8 white 9 9 red 10 10 white # … with 2,390 more rows Observe that bowl has 2400 rows, telling us that the bowl contains 2400 equally-sized balls. The first variable ball_ID is used as an identification variable as discussed in Subsection 1.4.4; none of the balls in the actual bowl are marked with numbers. The second variable color indicates whether a particular virtual ball is red or white. View the contents of the bowl in RStudio’s data viewer and scroll through the contents to convince yourself that bowl is indeed a virtual analog of the actual bowl in Figure 7.1. Now that we have a virtual analog of our bowl, we now need a virtual analog to the shovel seen in Figure 7.2 to generate virtual samples of 50 balls. We’re going to use the rep_sample_n() function included in the moderndive package. This function allows us to take repeated, or replicated, samples of size n. virtual_shovel &lt;- bowl %&gt;% rep_sample_n(size = 50) virtual_shovel # A tibble: 50 x 3 # Groups: replicate [1] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 1970 white 2 1 842 red 3 1 2287 white 4 1 599 white 5 1 108 white 6 1 846 red 7 1 390 red 8 1 344 white 9 1 910 white 10 1 1485 white # … with 40 more rows Observe that virtual_shovel has 50 rows corresponding to our virtual sample of size 50. The ball_ID variable identifies which of the 2400 balls from bowl are included in our sample of 50 balls while color denotes its color. However what does the replicate variable indicate? 
In virtual_shovel’s case, replicate is equal to 1 for all 50 rows. This is telling us that these 50 rows correspond to the first repeated/replicated use of the shovel, in our case our first sample. We’ll see in what follows when we “virtually” take 33 samples, replicate will take values between 1 and 33. Let’s compute the proportion of balls in our virtual sample that are red using the dplyr data wrangling verbs you learned in Chapter 3. First, for each of our 50 sampled balls, let’s identify if it is red or not using a test for equality using ==. Let’s create a new Boolean variable is_red using the mutate() function from Section 3.5: virtual_shovel %&gt;% mutate(is_red = (color == &quot;red&quot;)) # A tibble: 50 x 4 # Groups: replicate [1] replicate ball_ID color is_red &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;lgl&gt; 1 1 1970 white FALSE 2 1 842 red TRUE 3 1 2287 white FALSE 4 1 599 white FALSE 5 1 108 white FALSE 6 1 846 red TRUE 7 1 390 red TRUE 8 1 344 white FALSE 9 1 910 white FALSE 10 1 1485 white FALSE # … with 40 more rows Observe that for every row where color == &quot;red&quot;, the Boolean TRUE is returned and for every row where color is not equal to &quot;red&quot;, the Boolean FALSE is returned. Second, let’s compute the number of balls out of 50 that are red using the summarize() function. Recall from Section 3.3 that summarize() takes a data frame with many rows and returns a data frame with a single row containing summary statistics, like the mean() or median(). In this case, we use the sum(): virtual_shovel %&gt;% mutate(is_red = (color == &quot;red&quot;)) %&gt;% summarize(num_red = sum(is_red)) # A tibble: 1 x 2 replicate num_red &lt;int&gt; &lt;int&gt; 1 1 12 Why does this work? Because R treats TRUE like the number 1 and FALSE like the number 0. So summing the number of TRUE’s and FALSE’s is equivalent to summing 1’s and 0’s. In the end, this operation counts the number of balls where color is red. In our case, 12 of the 50 balls were red. 
However, you might’ve gotten a different number red because of the randomness of the virtual sampling. Third and lastly, let’s compute the proportion of the 50 sampled balls that are red by dividing num_red by 50: virtual_shovel %&gt;% mutate(is_red = color == &quot;red&quot;) %&gt;% summarize(num_red = sum(is_red)) %&gt;% mutate(prop_red = num_red / 50) # A tibble: 1 x 3 replicate num_red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 12 0.24 In other words, 24% of this virtual sample’s balls were red. Let’s make this code a little more compact and succinct by combining the first mutate() and the summarize() as follows: virtual_shovel %&gt;% summarize(num_red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = num_red / 50) # A tibble: 1 x 3 replicate num_red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 12 0.24 Great! 24% of virtual_shovel’s 50 balls were red! So based on this particular sample of 50 balls, our guess at the proportion of the bowl’s balls that are red is 24%. But remember from our earlier tactile sampling activity that if we repeat this sampling, we will not necessarily obtain the same value of 24% again. There will likely be some variation. In fact, our 33 groups of friends computed 33 such proportions whose distribution we visualized in Figure 7.7. We saw that these estimates varied. Let’s now perform the virtual analog of having 33 groups of students use the sampling shovel! 7.2.2 Using the virtual shovel 33 times Recall that in our tactile sampling exercise in Section 7.1 we had 33 groups of students each use the shovel, yielding 33 samples of size 50 balls. We then used these 33 samples to compute 33 proportions. In other words we repeated/replicated using the shovel 33 times. We can perform this repeated/replicated sampling virtually by once again using our virtual shovel function rep_sample_n(), but by adding the reps = 33 argument. This is telling R that we want to repeat the sampling 33 times. 
We’ll save these results in a data frame called virtual_samples. While we provide a preview of the first 10 rows of virtual_samples in what follows, we highly suggest you scroll through its contents using RStudio’s spreadsheet viewer by running View(virtual_samples). virtual_samples &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 33) virtual_samples # A tibble: 1,650 x 3 # Groups: replicate [33] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 875 white 2 1 1851 red 3 1 1548 red 4 1 1975 white 5 1 835 white 6 1 16 white 7 1 327 white 8 1 1803 red 9 1 740 red 10 1 179 red # … with 1,640 more rows Observe in the spreadsheet viewer that the first 50 rows of replicate are equal to 1 while the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 rows correspond to the second sample of 50 balls. This pattern continues for all reps = 33 replicates and thus virtual_samples has 33 \\(\\times\\) 50 = 1650 rows. Let’s now take virtual_samples and compute the resulting 33 proportions red. We’ll use the same dplyr verbs as before, but this time with an additional group_by() of the replicate variable. Recall from Section 3.4 that by assigning the grouping variable “meta-data” before we summarize(), we’ll obtain 33 different proportions red. We display a preview of the first 10 out of 33 rows: virtual_prop_red &lt;- virtual_samples %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) virtual_prop_red # A tibble: 33 x 3 replicate red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 23 0.46 2 2 19 0.38 3 3 18 0.36 4 4 19 0.38 5 5 15 0.3 6 6 21 0.42 7 7 21 0.42 8 8 16 0.32 9 9 24 0.48 10 10 14 0.28 # … with 23 more rows As with our 33 groups of friends’ tactile samples, there is variation in the resulting 33 virtual proportions red. Let’s visualize this variation in a histogram in Figure 7.8. 
Note that we add binwidth = 0.05 and boundary = 0.4 arguments as well. Setting boundary = 0.4 indicates that we want a binning scheme such that one of the bins’ boundary is at 0.4. Since the binwidth = 0.05 is also set, this will create bins with boundaries at 0.30, 0.35, 0.45, 0.5, etc as well. ggplot(virtual_prop_red, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Distribution of 33 proportions red&quot;) FIGURE 7.8: Distribution of 33 proportions based on 33 samples of size 50. Observe that we occasionally obtained proportions red that are less than 30%. On the other hand, we occasionally obtained proportions that are greater than 45%. However, the most frequently occurring proportions were between 35% and 40% (for 11 out of 33 samples). Why do we have these differences in proportions red? Because of sampling variation. Let’s now compare our virtual results with our tactile results from the previous section in Figure 7.9. Observe that both histograms are somewhat similar in their center and variation, although not identical. These slight differences are again due to random sampling variation. Furthermore, observe that both distributions are somewhat bell-shaped. FIGURE 7.9: Comparing 33 virtual and 33 tactile proportions red. Learning check (LC7.3) Why couldn’t we study the effects of sampling variation when we used the virtual shovel only once? Why did we need to take more than one virtual sample (in our case 33 virtual samples)? 7.2.3 Using the virtual shovel 1000 times Now say we want to study the effects of sampling variation not for 33 samples, but rather for a very large number of samples, say 1000. We have two choices at this point. We could have our groups of friends manually take 1000 samples of 50 balls and compute the corresponding 1000 proportions. However, this would be a very tedious and time-consuming task. 
This is where computers excel: automating long and repetitive tasks while performing them very quickly. Thus at this point we will abandon tactile sampling in favor of only virtual sampling. Let’s once again use the rep_sample_n() function with sample size set to be 50 once again, but this time with the number of replicates reps = 1000. Be sure to scroll through the contents of virtual_samples in RStudio’s viewer. virtual_samples &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 1000) virtual_samples # A tibble: 50,000 x 3 # Groups: replicate [1,000] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 1236 red 2 1 1944 red 3 1 1939 white 4 1 780 white 5 1 1956 white 6 1 1003 white 7 1 2113 white 8 1 2213 white 9 1 782 white 10 1 898 white # … with 49,990 more rows Observe that now virtual_samples has 1000 \\(\\times\\) 50 = 50,000 rows, instead of the 33 \\(\\times\\) 50 = 1650 rows from earlier. Using the same data wrangling code as earlier, let’s take the data frame virtual_samples with 1000 \\(\\times\\) 50 = 50,000 and compute the resulting 1000 proportions red. virtual_prop_red &lt;- virtual_samples %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) virtual_prop_red # A tibble: 1,000 x 3 replicate red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 18 0.36 2 2 19 0.38 3 3 20 0.4 4 4 15 0.3 5 5 17 0.34 6 6 16 0.32 7 7 23 0.46 8 8 23 0.46 9 9 15 0.3 10 10 18 0.36 # … with 990 more rows Observe that we now have 1000 replicates of prop_red, the proportion of 50 balls that are red. Using the same code as earlier, let’s now visualize the distribution of these 1000 replicates of prop_red in a histogram in Figure 7.10. 
ggplot(virtual_prop_red, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Distribution of 1000 proportions red&quot;) FIGURE 7.10: Distribution of 1000 proportions based on 1000 samples of size 50. Once again, the most frequently occurring proportions red occur between 35% and 40%. Every now and then, we obtain proportions as low as between 20% and 25%, and others as high as between 55% and 60%. These are rare, however. Furthermore, observe that we now have a much more symmetric and smoother bell-shaped distribution. This distribution is, in fact, approximated well by a Normal distribution. At this point we recommend you read the “Normal distribution” section of Appendix A.2 for a brief discussion on the properties of the Normal distribution. Learning check (LC7.4) Why did we not take 1000 “tactile” samples of 50 balls by hand? (LC7.5) Looking at Figure 7.10, would you say that sampling 50 balls where 30% of them were red is likely or not? What about sampling 50 balls where 10% of them were red? 7.2.4 Using different shovels Now say instead of just one shovel, you have three choices of shovels to extract a sample of balls with: shovels of size 25, 50, and 100. FIGURE 7.11: Three shovels to extract three different sample sizes. If your goal is still to estimate the proportion of the bowl’s balls that are red, which shovel would you choose? In our experience, most people would choose the largest shovel with 100 slots because it would yield the “best” guess of the proportion of the bowl’s balls that are red. Let’s define some criteria for “best” in this subsection. Using our newly developed tools for virtual sampling, let’s unpack the effect of having different sample sizes! 
In other words, let’s use rep_sample_n() with size = 25, size = 50, and size = 100, while keeping the number of repeated/replicated samples at 1000: Virtually use the appropriate shovel to generate 1000 samples with size balls. Compute the resulting 1000 replicates of the proportion of the shovel’s balls that are red. Visualize the distribution of these 1000 proportions red using a histogram. Run each of the following code segments individually and then compare the three resulting histograms. # Segment 1: sample size = 25 ------------------------------ # 1.a) Virtually use shovel 1000 times virtual_samples_25 &lt;- bowl %&gt;% rep_sample_n(size = 25, reps = 1000) # 1.b) Compute resulting 1000 replicates of proportion red virtual_prop_red_25 &lt;- virtual_samples_25 %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 25) # 1.c) Plot distribution via a histogram ggplot(virtual_prop_red_25, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 25 balls that were red&quot;, title = &quot;25&quot;) # Segment 2: sample size = 50 ------------------------------ # 2.a) Virtually use shovel 1000 times virtual_samples_50 &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 1000) # 2.b) Compute resulting 1000 replicates of proportion red virtual_prop_red_50 &lt;- virtual_samples_50 %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) # 2.c) Plot distribution via a histogram ggplot(virtual_prop_red_50, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;50&quot;) # Segment 3: sample size = 100 ------------------------------ # 3.a) Virtually using shovel with 100 slots 1000 times virtual_samples_100 &lt;- bowl %&gt;% rep_sample_n(size = 100, reps = 1000) # 3.b) Compute 
resulting 1000 replicates of proportion red virtual_prop_red_100 &lt;- virtual_samples_100 %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 100) # 3.c) Plot distribution via a histogram ggplot(virtual_prop_red_100, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 100 balls that were red&quot;, title = &quot;100&quot;) For easy comparison, we present the three resulting histograms in a single row with matching x and y axes in Figure 7.12. FIGURE 7.12: Comparing the distributions of proportion red for different sample sizes. Observe that as the sample size increases, the variation of the 1000 replicates of the proportion of red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing Figure 7.12, all three histograms appear to center around roughly 40%. We can be numerically explicit about the amount of variation in our 3 sets of 1000 values of prop_red using the standard deviation . A standard deviation is a summary statistic that measures the amount of variation within a numerical variable (see Appendix A.1 for a brief discussion on the properties of the standard deviation). For all three sample sizes, let’s compute the standard deviation of the 1000 proportions red by running the following data wrangling code that uses the sd() summary function. # n = 25 virtual_prop_red_25 %&gt;% summarize(sd = sd(prop_red)) # n = 50 virtual_prop_red_50 %&gt;% summarize(sd = sd(prop_red)) # n = 100 virtual_prop_red_100 %&gt;% summarize(sd = sd(prop_red)) Let’s compare these three measures of variation of the distributions in Table 7.1. TABLE 7.1: Comparing standard deviations of proportions red for 3 different shovels. 
Number of slots in shovel Standard deviation of proportions red 25 0.099 50 0.071 100 0.048 As we observed in Figure 7.12, as the sample size increases, the variation decreases. In other words, there is less variation in the 1000 values of the proportion red. So as the sample size increases, our guesses at the true proportion of the bowl’s balls that are red get more precise. Learning check (LC7.6) In Figure 7.12, we used shovels to take 1000 samples each, computed the resulting 1000 proportions of the shovel’s balls that were red, and then visualized the distribution of these 1000 proportions in a histogram. We did this for shovels with 25, 50, and 100 slots in them. As the size of the shovels increased, the histograms got narrower. In other words, as the size of the shovels increased from 25 to 50 to 100, did the 1000 proportions A. Vary less, B. Vary by the same amount, or C. Vary more? (LC7.7) What summary statistic did we use to quantify how much the 1000 proportions red varied? A. The inter-quartile range B. The standard deviation C. The range: the largest value minus the smallest 7.3 Sampling framework In both our tactile and our virtual sampling activities, we used sampling for the purpose of estimation. We extracted samples in order to estimate the proportion of the bowl’s balls that are red. We used sampling as a less time-consuming alternative to performing an exhaustive count of all the balls. Our virtual sampling activity built up to the results shown in Figure 7.12 and Table 7.1: comparing 1000 proportions red based on samples of size 25, 50, and 100. This was our first attempt at understanding two key concepts relating to sampling for estimation: The effect of sampling variation on our estimates. The effect of sample size on sampling variation. Let’s now introduce some terminology and notation as well as statistical definitions related to sampling. 
Given the number of new words you’ll need to learn, you will likely have to read this section a few times. Keep in mind, however, that all of the concepts underlying this terminology, this notation, and these definitions tie directly to the concepts underlying our tactile and virtual sampling activities. It will simply take time and practice to master them. 7.3.1 Terminology &amp; notation Here is a list of terminology and mathematical notation relating to sampling. First, a (study) population is a collection of individuals or observations that we are interested in. We mathematically denote the population’s size using upper case \\(N\\). In our sampling activities, the (study) population is the collection of \\(N\\) = 2400 identically sized red and white balls contained in the bowl. Second, a population parameter is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the population mean. This is mathematically denoted with the Greek letter \\(\\mu\\) pronounced “mu” (we’ll see a sampling activity involving means in the upcoming Section 8.1). In our earlier bowl sampling activity, however, since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the population proportion. This is mathematically denoted with the letter \\(p\\). Third, a census is an exhaustive enumeration or counting of all \\(N\\) individuals or observations in the population in order to compute the population parameter’s value exactly. In our sampling activity, this would correspond to counting the number of balls out of \\(N\\) = 2400 that are red and computing the population proportion \\(p\\) that are red exactly. When the number \\(N\\) of individuals or observations in our population is large, as was the case with our bowl, a census can be very expensive in terms of time, energy, and money. 
Fourth, sampling is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case \\(n\\), as opposed to upper case \\(N\\) which denotes the population’s size. Typically the sample size \\(n\\) is much smaller than the population size \\(N\\). Thus sampling is a much cheaper alternative to performing a census. In our sampling activities, we used shovels with 25, 50, and 100 slots to extract a sample of size \\(n\\) = 25, \\(n\\) = 50, and \\(n\\) = 100. Fifth, a point estimate (AKA sample statistic) is a summary statistic computed from a sample that estimates an unknown population parameter. In our sampling activities, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with \\(p\\). Our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls that are red. We mathematically denote the sample proportion using \\(\\widehat{p}\\). The “hat” on top of the \\(p\\) indicates that it is an estimate of the unknown population proportion \\(p\\). Sixth, the idea of representative sampling. A sample is said to be a representative sample if it roughly looks like the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our sampling activity, are the samples of \\(n\\) balls extracted using our shovels representative of the bowl’s \\(N\\) = 2400 balls? Seventh, the idea of generalizability. We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, does the value of the point estimate generalize to the population? In our sampling activity, can we generalize the sample proportion from our shovels to the entire bowl? 
Using our mathematical notation, this is akin to asking if \\(\\widehat{p}\\) is a “good guess” of \\(p\\). Eighth, we say biased sampling occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is unbiased if every observation in a population has an equal chance of being sampled. In our sampling activities, since each of the equally sized balls had an equal chance of being sampled in our shovels, our samples were unbiased. Ninth and lastly, the idea of random sampling. We say a sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our sampling activities, this would correspond to sufficiently mixing the bowl before each use of the shovel. Phew, that’s a lot of new terminology and notation to learn! Let’s put them all together to describe the paradigm of sampling. In general: If the sampling of a sample of size \\(n\\) is done at random, then the sample is unbiased and representative of the population of size \\(N\\), thus any result based on the sample can generalize to the population, thus the point estimate is a “good guess” of the unknown population parameter, thus instead of performing a census, we can infer about the population using sampling. Specific to our sampling activity: If we extract a sample of \\(n=50\\) balls at random, in other words, we mix the equally sized balls before using the shovel, then the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus any result based on the shovel’s balls can generalize to the bowl, thus the sample proportion \\(\\widehat{p}\\) of the \\(n=50\\) balls in the shovel that are red is a “good guess” of the population proportion \\(p\\) of the \\(N\\)=2400 balls that are red, thus instead of manually going over all 2400 balls in the bowl, we can infer about the bowl using the shovel. Note that last word we wrote in bold: infer. 
The act of “inferring” means to deduce or conclude (information) from evidence and reasoning. In our sampling activities, we wanted to infer about the proportion of the bowl’s balls that are red. Statistical inference is the “theory, methods, and practice of forming judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling” (Wikipedia). In other words, statistical inference is the act of inference via sampling. In the upcoming Chapter 8 on confidence intervals, we’ll introduce the infer package, which makes statistical inference “tidy” and transparent. It is why this third portion of the book is called “Statistical inference via infer”. Learning check (LC7.8) In the case of our bowl activity, what is the population parameter? Do we know its value? (LC7.9) What would performing a census in our bowl activity correspond to? Why did we not perform a census? (LC7.10) What purpose do point estimates serve in general? What is the name of the point estimate specific to our bowl activity? What is its mathematical notation? (LC7.11) How did we ensure that our tactile samples using the shovel were random? (LC7.12) Why is it important that sampling be done at random? (LC7.13) What are we inferring about the bowl based on the samples using the shovel? 7.3.2 Statistical definitions Now for some important statistical definitions related to sampling. As a refresher of our 1000 repeated/replicated virtual samples of size \\(n\\) = 25, \\(n\\) = 50, and \\(n\\) = 100 in Section 7.2, let’s display Figure 7.12 again. FIGURE 7.13: Previously seen three sampling distributions of the sample proportion \\(\\widehat{p}\\). These types of distributions have a special name: sampling distributions; their visualization displays the effect of sampling variation on the distribution of any point estimate, in this case, the sample proportion \\(\\widehat{p}\\). 
Using these sampling distributions, for a given sample size \\(n\\), we can make statements about what values we can typically expect. For example, observe the centers of all three sampling distributions: they are all roughly centered around 0.4 = 40%. Furthermore, observe that while we are somewhat likely to observe sample proportions red of 0.2 = 20% when using the shovel with 25 slots, we will almost never observe a proportion of 20% when using the shovel with 100 slots. Observe also the effect of sample size on the sampling variation. As the sample size \\(n\\) increases from 25 to 50 to 100, the variation of the sampling distribution decreases and thus the values cluster more and more tightly around the same center of around 40%. We quantified this variation using the standard deviation of our sample proportions in Table 7.1, which we display again: TABLE 7.2: Previously seen comparing standard deviations of proportions red for 3 different shovels. Number of slots in shovel Standard deviation of proportions red 25 0.099 50 0.071 100 0.048 So as the sample size increases, the standard deviation decreases. This type of standard deviation has another special name: standard error. Standard errors quantify the effect of sampling variation on our estimates. In other words, they quantify how much we can expect different proportions of a shovel’s balls that are red to vary from one sample to another sample to another sample, and so on. Unfortunately, these names confuse many people new to statistical inference. For example, it’s common for people new to statistical inference to call the “sampling distribution” the “sample distribution.” Another source of confusion is the pair of names “standard deviation” and “standard error.” Remember that a standard error is merely a kind of standard deviation: the standard deviation of any point estimate from sampling. 
In other words, all standard errors are standard deviations, but not all standard deviations are necessarily a standard error. To help reinforce these concepts, let’s re-display Figure 7.12 but using our new terminology, notation, and definitions relating to sampling in Figure 7.14. FIGURE 7.14: Three sampling distributions of the sample proportion \\(\\widehat{p}\\). Furthermore, let’s re-display Table 7.1 but using our new terminology, notation, and definitions relating to sampling in Table 7.3. TABLE 7.3: Three standard errors of the sample proportion based on n = 25, 50, 100. Sample size (n) Standard error of \\(\\widehat{p}\\) n = 25 0.099 n = 50 0.071 n = 100 0.048 Remember the key message of this last table: that as the sample size \\(n\\) goes up, the “typical” error of your point estimate will go down (as quantified by the standard error). Learning check (LC7.14) What purpose did the sampling distributions serve? (LC7.15) What does the standard error of the sample proportion \\(\\widehat{p}\\) quantify? 7.3.3 The moral of the story Let’s recap this section so far. We’ve seen that if a sample is generated at random, then the resulting point estimate is a “good guess” of the true unknown population parameter. In our sampling activities, since we made sure to mix the balls first before extracting a sample with the shovel, the resulting sample proportion \\(\\widehat{p}\\) of the shovel’s balls that were red was a “good guess” of the population proportion \\(p\\) of the bowl’s balls that were red. However, what do we mean by our point estimate being a “good guess”? Sometimes we’ll get an estimate that is less than the true value of the population parameter, while at other times we’ll get an estimate that is greater. This is due to sampling variation. However, despite this sampling variation, our estimates will “on average” be correct and thus will be centered at the true value. This is because our sampling was done at random and thus in an unbiased fashion. 
In our sampling activities, sometimes our sample proportion \\(\\widehat{p}\\) was less than the true population proportion \\(p\\), while at other times it was greater. This was due to the sampling variability. However, despite this sampling variation, our sample proportions \\(\\widehat{p}\\) were “on average” correct and thus were centered at the true value of the population proportion \\(p\\). This is because we mixed our bowl before taking samples and thus the sampling was done at random and thus in an unbiased fashion. This is also known as having an accurate estimate. What was the value of the population proportion \\(p\\) of the \\(N\\) = 2400 balls in the actual bowl that were red? There were 900 red balls, for a proportion red of 900/2400 = 0.375 = 37.5%! How do we know this? Did the authors do an exhaustive count of all the balls? No! They were listed in the contents of the box that the bowl came in! Hence we were able to make the contents of the virtual bowl match the tactile bowl: bowl %&gt;% summarize(sum_red = sum(color == &quot;red&quot;), sum_not_red = sum(color != &quot;red&quot;)) # A tibble: 1 x 2 sum_red sum_not_red &lt;int&gt; &lt;int&gt; 1 900 1500 Let’s re-display our sampling distributions from Figures 7.12 and 7.14, but now with a vertical red line marking the true population proportion \\(p\\) of balls that are red = 37.5% in Figure 7.15. We see that while there is a certain amount of error in the sample proportions \\(\\widehat{p}\\) for all three sampling distributions, on average the \\(\\widehat{p}\\) are centered at the true population proportion red \\(p\\). FIGURE 7.15: Three sampling distributions with population proportion \\(p\\) marked in red. We also saw in this section that as your sample size \\(n\\) increases, your point estimates will vary less and less and be more and more concentrated around the true population parameter. This variation is quantified by the decreasing standard error. 
In other words, the typical error of your point estimates will decrease. In our sampling exercise, as the sample size increased, the variation of our sample proportions \\(\\widehat{p}\\) decreased. You can observe this behavior in Figure 7.15. This is also known as having a precise estimate. So random and unbiased sampling ensures our point estimates are accurate, while on the other hand having a large sample size ensures our point estimates are precise. While the terms “accuracy” and “precision” may sound like they mean the same thing, there is a subtle difference. Accuracy describes how “on target” our estimates are, whereas precision describes how “consistent” our estimates are. Figure 7.16 illustrates the difference. FIGURE 7.16: Comparing accuracy and precision. At this point, you might be asking yourself: “If we already knew the true proportion of the bowl’s balls that are red was 37.5%, then why did we do any sampling?” You might also be asking: “Why did we take 1000 repeated samples of size n = 25, 50, and 100? Shouldn’t we be taking only one sample that’s as large as possible?” If you did ask yourself these questions, your suspicion is merited! The sampling activity involving the bowl is merely an idealized version of how sampling is done in real life. We performed this exercise only to study and understand: The effect of sampling variation. The effect of sample size on sampling variation. This is not how sampling is done in real life. In a real-life scenario, we won’t know what the true value of the population parameter is. Furthermore, we wouldn’t take 1000 repeated/replicated samples, but rather a single sample that’s as large as we can afford. In the next section, let’s now study a real-life example of sampling: polls. 
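As a rough check on Table 7.3, and assuming the standard textbook approximation for the standard error of a sample proportion (a formula this chapter has not introduced, and one that ignores the small correction for sampling without replacement from a bowl of \\(N\\) = 2400 balls), we can compute \\(\\text{SE}(\\widehat{p}) \\approx \\sqrt{p(1-p)/n}\\) with \\(p\\) = 0.375: for \\(n\\) = 25 this gives \\(\\sqrt{0.375 \\times 0.625/25} \\approx 0.097\\), for \\(n\\) = 50 it gives \\(\\sqrt{0.375 \\times 0.625/50} \\approx 0.068\\), and for \\(n\\) = 100 it gives \\(\\sqrt{0.375 \\times 0.625/100} \\approx 0.048\\). These values are close to the simulated standard errors of 0.099, 0.071, and 0.048, and under this approximation quadrupling the sample size halves the standard error. 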
Learning check (LC7.16) The table that follows is a version of Table 7.3 matching sample sizes \\(n\\) to different standard errors of the sample proportion \\(\\widehat{p}\\), but with the rows randomly re-ordered and the sample sizes removed. Fill in the table by matching the correct sample sizes to the correct standard errors. TABLE 7.4: Three standard errors of the sample proportion based on n = 25, 50, 100. Sample size Standard error of p-hat n = 0.099 n = 0.048 n = 0.071 For the following four learning checks, let the estimate be the sample proportion \\(\\widehat{p}\\): the proportion of a shovel’s balls that were red. It estimates the population proportion \\(p\\): the proportion of the bowl’s balls that were red. (LC7.17) What is the difference between an accurate estimate and a precise estimate? (LC7.18) How do we ensure that an estimate is accurate? How do we ensure that an estimate is precise? (LC7.19) In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then what was the purpose of our exercises where we took 1000 different samples? (LC7.20) Figure 7.16 with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding sampling distributions of the sample proportion \\(\\widehat{p}\\), like the one in the left-most plot in Figure 7.15. 7.4 Case study: Polls Let’s now switch gears to a more realistic sampling scenario than our bowl activity: a poll. In practice, pollsters do not take 1000 repeated samples as we did in our previous sampling activities, but rather take only a single sample that’s as large as possible. On December 4, 2013, National Public Radio in the US reported on a poll of President Obama’s approval rating among young Americans aged 18-29 in an article “Poll: Support For Obama Among Young Americans Eroding”. The poll was conducted by the Harvard University Institute of Politics. 
A quote from the article: After voting for him in large numbers in 2008 and 2012, young Americans are souring on President Obama. According to a new Harvard University Institute of Politics poll, just 41 percent of millennials — adults ages 18-29 — approve of Obama’s job performance, his lowest-ever standing among the group and an 11-point drop from April. Let’s tie elements of the real-life poll in this news article with our “tactile” and “virtual” bowl activity from Sections 7.1 and 7.2 using the terminology, notations, and definitions we learned in Section 7.3. You’ll see that our sampling activity with the bowl is an idealized version of what pollsters are trying to do in real life. First, who is the (study) population of \\(N\\) individuals or observations of interest? Bowl: \\(N\\) = 2400 identically-sized red and white balls Obama poll: \\(N\\) = ? young Americans aged 18-29 Second, what is the population parameter? Bowl: The population proportion \\(p\\) of all the balls in the bowl that are red. Obama poll: The population proportion \\(p\\) of all young Americans who approve of Obama’s job performance. Third, what would a census look like? Bowl: Manually going over all \\(N\\) = 2400 balls and exactly computing the population proportion \\(p\\) of the balls that are red. Obama poll: Locating all \\(N\\) young Americans and asking them all if they approve of Obama’s job performance. In this case, we don’t even know what the population size \\(N\\) is! Fourth, how do you perform sampling to obtain a sample of size \\(n\\)? Bowl: Using a shovel with \\(n\\) slots. Obama poll: One method is to get a list of phone numbers of all young Americans and pick out \\(n\\) phone numbers. In this poll’s case, the sample size was \\(n\\) = 2089 young Americans. Fifth, what is your point estimate (AKA sample statistic) of the unknown population parameter? Bowl: The sample proportion \\(\\widehat{p}\\) of the balls in the shovel that were red. 
Obama poll: The sample proportion \\(\\widehat{p}\\) of young Americans in the sample that approve of Obama’s job performance. In this poll’s case, \\(\\widehat{p}\\) = 0.41 = 41%, the quoted percentage in the second paragraph of the article. Sixth, is the sampling procedure representative? Bowl: Are the contents of the shovel representative of the contents of the bowl? Because we mixed the bowl before sampling, we can feel confident that they are. Obama poll: Is the sample of \\(n\\) = 2089 young Americans representative of all young Americans aged 18-29? This depends on whether the sampling was random. Seventh, are the samples generalizable to the greater population? Bowl: Is the sample proportion \\(\\widehat{p}\\) of the shovel’s balls that are red a “good guess” of the population proportion \\(p\\) of the bowl’s balls that are red? Given that the sample was representative, the answer is yes. Obama poll: Is the sample proportion \\(\\widehat{p}\\) = 0.41 of the sample of young Americans who support Obama a “good guess” of the population proportion \\(p\\) of all young Americans who support Obama? In other words, can we confidently say that roughly 41% of all young Americans approve of Obama? Again, this depends on whether the sampling was random. Eighth, is the sampling procedure unbiased? In other words, do all observations have an equal chance of being included in the sample? Bowl: Since each ball was equally sized and we mixed the bowl before using the shovel, each ball had an equal chance of being included in a sample and hence the sampling was unbiased. Obama poll: Did all young Americans have an equal chance at being represented in this poll? Again, this depends on whether the sampling was random. Ninth and lastly, was the sampling done at random? Bowl: As long as you mixed the bowl sufficiently before sampling, your samples would be random. Obama poll: Was the sample conducted at random? 
We can’t answer this question without knowing about the sampling methodology used by the Harvard University Institute of Politics. We’ll discuss this more at the end of this section. In other words, the Harvard University Institute of Politics poll can be thought of as an instance of using the shovel to sample balls from the bowl. Furthermore, if another polling company conducted a similar poll of young Americans at roughly the same time, they would likely get a different estimate than 41%. This is due to sampling variation. Let’s now revisit the sampling paradigm from Section 7.3.1: In general: If the sampling of a sample of size \\(n\\) is done at random, then the sample is unbiased and representative of the population of size \\(N\\), thus any result based on the sample can generalize to the population, thus the point estimate is a “good guess” of the unknown population parameter, thus instead of performing a census, we can infer about the population using sampling. Specific to the bowl: If we extract a sample of \\(n=50\\) balls at random, in other words, we mix the equally sized balls before using the shovel, then the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus any result based on the shovel’s balls can generalize to the bowl, thus the sample proportion \\(\\widehat{p}\\) of the \\(n=50\\) balls in the shovel that are red is a “good guess” of the population proportion \\(p\\) of the \\(N\\)=2400 balls that are red, thus instead of manually going over all 2400 balls in the bowl, we can infer about the bowl using the shovel. 
Specific to the Obama poll: If we had a way of contacting a randomly chosen sample of 2089 young Americans and polling their approval of President Obama, then these 2089 young Americans would be an unbiased and representative sample of all young Americans, thus any results based on this sample of 2089 young Americans can generalize to the entire population of all young Americans, thus the reported sample approval rating of 41% of these 2089 young Americans is a good guess of the true approval rating among all young Americans, thus instead of performing an expensive census of all young Americans, we can infer about all young Americans using polling. So as you can see, it was critical for the Harvard University Institute of Politics sample to be truly random in order to infer about all young Americans’ opinions about Obama. Was their sample truly random? It’s hard to answer such questions without knowing about the sampling methodology used. For example, if this poll was conducted using only mobile phone numbers, people without mobile phones would be left out and therefore not represented in the sample. What if the Harvard University Institute of Politics conducted this poll on an internet news site? Then people who don’t read this internet news site would be left out. Ensuring that our samples were random was easy to do in our sampling bowl exercises; however, in a real-life situation like the Obama poll, this is much harder to do. Learning check Comment on the representativeness of the following sampling methodologies: (LC7.21) The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force). (LC7.22) Imagine it is 1993, a time when almost all households had landlines. You want to know the average number of people in each household in your city. 
You randomly pick out 500 phone numbers from the phone book and conduct a phone survey. (LC7.23) You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them “How many times did you download a pirated TV show last week?” (LC7.24) A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of 5 randomly chosen graduates, contact them, and obtain their answers. 7.5 Conclusion 7.5.1 Sampling scenarios In this chapter, we performed both tactile and virtual sampling exercises to infer about an unknown proportion. We also presented a case study of sampling in real-life: polls. In both cases, we used the sample proportion \\(\\widehat{p}\\) to estimate the population proportion \\(p\\). However, we are not just limited to scenarios related to proportions. In other words, we can use sampling to estimate other population parameters using other point estimates as well. We present 5 more such scenarios in Table 7.5. TABLE 7.5: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 
1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Fitted regression slope \\(b_1\\) or \\(\\widehat{\\beta}_1\\) 6 Population regression intercept \\(\\beta_0\\) Fitted regression intercept \\(b_0\\) or \\(\\widehat{\\beta}_0\\) In the rest of this book, we’ll cover all the remaining scenarios as follows: In Chapter 8, we’ll cover examples of statistical inference for Scenario 2: The mean age \\(\\mu\\) of all pennies in circulation in the US. Scenario 3: The difference \\(p_1 - p_2\\) in the proportion of people who yawn when seeing someone else yawn first minus the proportion of people who yawn without seeing someone else yawn first. This is an example of two-sample inference. In Chapter 9, we’ll cover an example of statistical inference for Scenario 4: The difference \\(\\mu_1 - \\mu_2\\) in mean IMDb ratings for action and romance movies. This is another example of two-sample inference. In Chapter 10, we’ll cover an example of statistical inference for regression by revisiting the regression models for teaching score as a function of various instructor demographic variables you saw in Chapters 5 and 6. Specifically Scenario 5: The slope \\(\\beta_1\\) of the population regression line. Scenario 6: The intercept \\(\\beta_0\\) of the population regression line. 7.5.2 Central Limit Theorem What you visualized in Figure 7.12 and summarized in Table 7.1 was a demonstration of a very famous theorem, or mathematically proven truth, called the Central Limit Theorem. 
It loosely states that when sample means are based on larger and larger sample sizes, the sampling distribution of these sample means becomes both more and more normally shaped and more and more narrow. In other words, their sampling distribution increasingly follows a normal distribution and the variation of these sampling distributions gets smaller, as quantified by their standard errors. Shuyi Chiou, Casey Dunn, and Pathikrit Bhattacharyya created a 3 minute and 38 second video at https://youtu.be/jvoxEYmQHNM explaining this crucial statistical theorem using the average weight of wild bunny rabbits and the average wingspan of dragons as examples. Figure 7.17 shows a preview of this video. FIGURE 7.17: Preview of Central Limit Theorem video. 7.5.3 Additional resources An R script file of all R code used in this chapter is available here. 7.5.4 What’s to come? Recall in our Obama poll case study in Section 7.4 that based on this particular sample, the Harvard University Institute of Politics’ best guess of U.S. President Obama’s approval rating among all young Americans was 41%. However, this isn’t the end of the story. If you read the article further, it states: The online survey of 2,089 adults was conducted from Oct. 30 to Nov. 11, just weeks after the federal government shutdown ended and the problems surrounding the implementation of the Affordable Care Act began to take center stage. The poll’s margin of error was plus or minus 2.1 percentage points. Note the term margin of error, which here is plus or minus 2.1 percentage points. Most polls won’t produce an estimate that’s perfectly right; there will always be a certain amount of error caused by sampling variation. The margin of error of plus or minus 2.1 percentage points is saying that a typical range of errors for polls of this type is about \\(\\pm\\) 2.1%, in words from about 2.1% too small to about 2.1% too big. 
We can restate this as the interval [41% - 2.1%, 41% + 2.1%] = [37.9%, 43.1%] (this notation indicates that the interval contains all values between 37.9% and 43.1%, inclusive). We’ll see in the next chapter that such intervals are known as confidence intervals. "],
-["8-confidence-intervals.html", "Chapter 8 Bootstrapping &amp; Confidence Intervals 8.1 Pennies activity 8.2 Computer simulation of resampling 8.3 Understanding confidence intervals 8.4 Constructing confidence intervals 8.5 Interpreting confidence intervals 8.6 Case study: Is yawning contagious? 8.7 Conclusion", " Chapter 8 Bootstrapping &amp; Confidence Intervals In Chapter 7, we studied sampling. We started with a “tactile” exercise where we wanted to know the proportion of balls in the sampling bowl in Figure 7.1 that are red. While we could have performed an exhaustive count, this would have been a tedious process. So instead we used a shovel to extract a sample of 50 balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness induced by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red. We then mimicked this “tactile” sampling exercise with an equivalent “virtual” sampling exercise performed on the computer. Using our computers’ random number generator, we very quickly mimicked the above sampling procedure a large number of times. In Section 7.2.4, we quickly repeated this sampling procedure 1000 times, using three different “virtual” shovels with 25, 50, and 100 slots. We visualized these three sets of 1000 estimates in Figure 7.15 and saw that as the sample size increased, the variation in the estimates decreased. What we did was construct sampling distributions. The motivation for taking 1000 repeated samples and visualizing the resulting estimates was to study how these estimates varied from one sample to another; in other words we wanted to study the effect of sampling variation. We quantified the variation of these estimates using their standard deviation, which has a special name: the standard error. 
In particular, we saw that as the sample size increased from 25 to 50 to 100, the standard error decreased and thus the sampling distributions narrowed. In other words, larger sample sizes lead to more precise estimates that varied less around the center. We then tied these sampling exercises to terminology and mathematical notation related to sampling in Section 7.3.1. Our study population was the large bowl with \\(N\\) = 2400 balls, while the population parameter, the unknown quantity of interest, here was the population proportion \\(p\\) of the bowl’s balls that are red. Since performing a census would be very expensive in terms of time and energy, we instead extracted a sample of size \\(n\\) = 50. The point estimate, also known as a sample statistic, used to estimate \\(p\\) was the sample proportion \\(\\widehat{p}\\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and representative of the population. Thus any results based on the sample could be generalized to the population. Thus, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that are red. In other words, we used the sample to infer about the population. However, as described in Section 7.2, both the tactile and virtual sampling exercises are not what one would do in real life; this was merely an activity used to study the effects of sampling variation. In a real life situation, we would not take 1000 samples of size \\(n\\), but rather take a single representative sample that’s as large as possible. Additionally, we knew what the true proportion of the bowl’s balls that were red was 37.5%. In a real life situation, we will not know what this value is. Because if we did, then why would we take a sample to estimate it? An example of a realistic sampling situation would be a poll, like the Obama poll you saw in Section 7.4. 
Pollsters did not know the true proportion of all young Americans who supported President Obama, and thus they took a single sample of size \\(n\\) = 2089 young Americans to estimate this value. So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrap resampling, which will be the focus of the earlier sections of this chapter. Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a range of highly plausible values? Going back to the Obama poll article, it stated that the pollsters’ estimate of the proportion of all young Americans who supported President Obama was 41%. But in addition it stated that the poll’s “margin of error was plus or minus 2.1 percentage points.” In other words, this “plausible range” was [41% - 2.1%, 41% + 2.1%] = [37.9%, 43.1%]. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this chapter. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(infer) 8.1 Pennies activity As we did in Chapter 7, we’ll begin with a hands-on tactile activity. 8.1.1 What is the average year on US pennies in 2019? 
Try to imagine all the pennies being used in the United States in 2019. That’s a lot of pennies! Now say we’re interested in the average year of minting of all these pennies. One way to compute this value would be to gather up all pennies being used in the US, record the year, and compute the average. However, this would be near impossible! So instead, let’s collect a sample of 50 pennies from a local bank in downtown Northampton, Massachusetts, USA as seen in Figure 8.1. FIGURE 8.1: Collecting a sample of 50 US pennies from a local bank. An image of these 50 pennies can be seen in Figure 8.2. For each of the 50 pennies starting in the top left, progressing row-by-row, and ending in the bottom right, we assigned an “ID” identification variable and marked the year of minting. FIGURE 8.2: 50 US pennies labelled. The moderndive package contains this data on our 50 sampled pennies in the pennies_sample data frame: pennies_sample # A tibble: 50 x 2 ID year &lt;int&gt; &lt;dbl&gt; 1 1 2002 2 2 1986 3 3 2017 4 4 1988 5 5 2008 6 6 1983 7 7 2008 8 8 1996 9 9 2004 10 10 2000 # … with 40 more rows The pennies_sample data frame has 50 rows corresponding to each penny with two variables. The first variable ID corresponds to the ID labels in Figure 8.2 whereas the second variable year corresponds to the year of minting saved as an integer, in other words a whole number. Based on these 50 sampled pennies, what can we say about all US pennies in 2019? Let’s study some properties of our sample by performing an exploratory data analysis. Let’s first visualize the distribution of the year of these 50 pennies using our data visualization tools from Chapter 2. Since year is a numerical variable, we use a histogram in Figure 8.3 to visualize its distribution. ggplot(pennies_sample, aes(x = year)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) FIGURE 8.3: Distribution of year on 50 US pennies. 
Observe a slightly left-skewed distribution, since most pennies fall in between 1980 and 2010 with only a few pennies older than 1970. What is the average year for the 50 sampled pennies? Eyeballing the histogram it appears to be around 1990. Let’s now compute this value exactly using our data wrangling tools from Chapter 3. pennies_sample %&gt;% summarize(mean_year = mean(year)) # A tibble: 1 x 1 mean_year &lt;dbl&gt; 1 1995.44 Thus, if we’re willing to assume that pennies_sample is a representative sample from all US pennies, a “good guess” of the average year of minting of all US pennies would be 1995.44. In other words, around 1995. This should all start sounding similar to what we did previously in Chapter 7! In Chapter 7, our study population was the bowl of \\(N\\) = 2400 balls. Our population parameter was the population proportion of these balls that were red, denoted mathematically by \\(p\\). In order to estimate \\(p\\), we extracted a sample of 50 balls using the shovel. We then computed the relevant point estimate: the sample proportion of these 50 balls that were red, denoted mathematically by \\(\\widehat{p}\\). Here our population is \\(N\\) = whatever the number of pennies are being used in the US, a value which we don’t know and probably never will. The population parameter of interest is now the population mean year of all these pennies, a value denoted mathematically by the Greek letter \\(\\mu\\) (pronounced “mu”). In order to estimate \\(\\mu\\), we went to the bank and obtained a sample of 50 pennies and computed the relevant point estimate: the sample mean year of these 50 pennies, denoted mathematically by \\(\\overline{x}\\) (pronounced “x-bar”). An alternative and more intuitive notation for the sample mean is \\(\\widehat{\\mu}\\). However this is unfortunately not as commonly used, so in this book we’ll stick with convention and always denote the sample mean as \\(\\overline{x}\\). 
We summarize the correspondence between the sampling bowl exercise in Chapter 7 and our pennies exercise in Table 8.1, which are the first two rows of the previously seen Table 7.5 of the various sampling scenarios we’ll cover in this text. TABLE 8.1: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) Going back to our 50 sampled pennies in Figure 8.2, the point estimate of interest is the sample mean \\(\\overline{x}\\) of 1995.44. This quantity is an estimate of the population mean year of all US pennies \\(\\mu\\). Recall that we also saw in Chapter 7 that such estimates are prone to sampling variation. For example, in this particular sample in Figure 8.2, we observed three pennies with the year of 1999. If we sampled another 50 pennies, would we observe exactly three pennies with the year of 1999 again? More than likely not. We might observe none, or one, or two, or maybe even all 50! The same can be said for the other 26 unique years that are represented in our sample of 50 pennies. To study the effects of sampling variation in Chapter 7 we took many samples, something we could easily do with our shovel. In our case with pennies however, how would we obtain another sample? By going to the bank and getting another roll of 50 pennies. Say we’re feeling lazy however and don’t want to go back to the bank. How can we study the effects of sampling variation using our single sample? We will do so using a technique known as “bootstrap resampling with replacement,” which we now illustrate. 8.1.2 Resampling once Step 1: Let’s print out identically-sized slips of paper representing our 50 pennies as seen in Figure 8.4. FIGURE 8.4: Step 1: 50 slips of paper representing 50 US pennies. Step 2: Put the 50 slips of paper into a hat or tuque as seen in Figure 8.5. 
FIGURE 8.5: Step 2: Putting 50 slips of paper in a hat. Step 3: Mix the hat’s contents and draw one slip of paper at random as seen in Figure 8.6. Record the year. FIGURE 8.6: Step 3: Drawing one slip of paper at random. Step 4: Put the slip of paper back in the hat! In other words, replace it as seen in Figure 8.7. FIGURE 8.7: Step 4: Replacing slip of paper. Step 5: Repeat Steps 3 and 4 49 more times, resulting in 50 recorded years. What we just performed was a resampling of the original sample of 50 pennies. We are not sampling 50 pennies from the population of all US pennies as we did in our trip to the bank. Instead, we are mimicking this act by resampling 50 pennies from our original sample of 50 pennies. Now ask yourselves, why did we replace our resampled slip of paper back into the hat in Step 4? Because if we left the slip of paper out of the hat each time we performed Step 4, we would end up with the same 50 original pennies! In other words, replacing the slips of paper induces sampling variation. Being more precise with our terminology, we just performed a resampling with replacement from the original sample of 50 pennies. Had we left the slip of paper out of the hat each time we performed Step 4, this would be resampling without replacement. Let’s study our 50 resampled pennies via an exploratory data analysis. First, let’s load the data into R by manually creating a data frame pennies_resample of our 50 resampled values. We’ll do this using the tibble() command from the dplyr package. Note that the 50 values you resample will almost certainly not be the same as ours given the inherent randomness. 
pennies_resample &lt;- tibble( year = c(1976, 1962, 1976, 1983, 2017, 2015, 2015, 1962, 2016, 1976, 2006, 1997, 1988, 2015, 2015, 1988, 2016, 1978, 1979, 1997, 1974, 2013, 1978, 2015, 2008, 1982, 1986, 1979, 1981, 2004, 2000, 1995, 1999, 2006, 1979, 2015, 1979, 1998, 1981, 2015, 2000, 1999, 1988, 2017, 1992, 1997, 1990, 1988, 2006, 2000) ) The 50 values of year in pennies_resample represent a resample of size 50 from the original sample of 50 pennies. We display the 50 resampled pennies in Figure 8.8. FIGURE 8.8: 50 resampled US pennies labelled. Let’s compare the distribution of the numerical variable year of our 50 resampled pennies with the distribution of the numerical variable year of our original sample of 50 pennies in Figure 8.9. ggplot(pennies_resample, aes(x = year)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) + labs(title = &quot;Resample of 50 pennies&quot;) ggplot(pennies_sample, aes(x = year)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) + labs(title = &quot;Original sample of 50 pennies&quot;) FIGURE 8.9: Comparing year in the resampled pennies_resample with the original sample pennies_sample. Observe in Figure 8.9 that while the general shapes of both distributions of year are roughly similar, they are not identical. Recall from the previous section that the sample mean of the original sample of 50 pennies from the bank was 1995.44. What about for our resample? Any guesses? Let’s have dplyr help us out as before: pennies_resample %&gt;% summarize(mean_year = mean(year)) # A tibble: 1 x 1 mean_year &lt;dbl&gt; 1 1994.82 We obtained a different mean year of 1994.82. This variation is induced by the resampling with replacement we performed earlier. What if we repeated this resampling exercise many times? Would we obtain the same mean year each time? In other words, would our guess at the mean year of all pennies in the US in 2019 be exactly 1994.82 every time? 
Just as we did in Chapter 7, let’s perform this resampling activity with the help of 35 of our friends. 8.1.3 Resampling 35 times Each of our 35 friends will repeat the same 5 steps: Start with 50 identically-sized slips of paper representing the 50 pennies. Put the 50 small pieces of paper into a hat or beanie cap. Mix the hat’s contents and draw one slip of paper at random. Record the year in a spreadsheet. Replace the slip of paper back in the hat! Repeat Steps 3 and 4 49 more times, resulting in 50 recorded years. Since we had 35 of our friends perform this task, we ended up with 35 \\(\\times\\) 50 = 1750 values. We recorded these values in a shared spreadsheet with 50 rows (plus a header row) and 35 columns. We display a snapshot of the first 10 rows and 5 columns of this shared spreadsheet in Figure 8.10. FIGURE 8.10: Snapshot of shared spreadsheet of resampled pennies. For your convenience, we’ve taken these 35 \\(\\times\\) 50 = 1750 values and saved them in pennies_resamples, a “tidy” data frame included in the moderndive package. We saw what it means for a data frame to be “tidy” in Subsection 4.2.1. pennies_resamples # A tibble: 1,750 x 3 replicate name year &lt;int&gt; &lt;chr&gt; &lt;dbl&gt; 1 1 A 1988 2 1 A 2002 3 1 A 2015 4 1 A 1998 5 1 A 1979 6 1 A 1971 7 1 A 1971 8 1 A 2015 9 1 A 1988 10 1 A 1979 # … with 1,740 more rows What did each of our 35 friends obtain as the mean year? Once again, dplyr to the rescue! After grouping the rows by name, we summarize each group of 50 rows by their mean year: resampled_means &lt;- pennies_resamples %&gt;% group_by(name) %&gt;% summarize(mean_year = mean(year)) resampled_means # A tibble: 35 x 2 name mean_year &lt;chr&gt; &lt;dbl&gt; 1 A 1992.5 2 AA 1995.86 3 B 1996.42 4 BB 1992.4 5 C 1996.32 6 CC 1995.88 7 D 1996.9 8 DD 1997.46 9 E 1991.22 10 EE 1998.44 # … with 25 more rows Observe that resampled_means has 35 rows corresponding to the 35 means based on the 35 resamples. 
Furthermore, observe the variation in the 35 values in the variable mean_year. Let’s visualize this variation using a histogram in Figure 8.11. Recall that adding the argument boundary = 1990 to geom_histogram() sets the binning structure so that one of the bin boundaries is 1990 exactly. ggplot(resampled_means, aes(x = mean_year)) + geom_histogram(binwidth = 1, color = &quot;white&quot;, boundary = 1990) + labs(x = &quot;Sampled mean year&quot;) FIGURE 8.11: Distribution of 35 sample means from 35 resamples. Observe in Figure 8.11 that the distribution looks roughly normal and that we rarely observe sample mean years less than 1992 or greater than 2000. Also observe how the distribution is roughly centered at 1995, which is close to the sample mean of 1995.44 of the original sample of 50 pennies from the bank. 8.1.4 What did we just do? What we just demonstrated in this activity is the statistical procedure known as bootstrap resampling with replacement. We used resampling to mimic the sampling variation we studied in Chapter 7 on sampling. However in this case, we did so using only a single sample from the population. In fact, the histogram of sample means from 35 resamples in Figure 8.11 is called the bootstrap distribution. It is an approximation to the sampling distribution of the sample mean, in the sense that both distributions will have a similar shape and similar spread. In fact, in the upcoming Section 8.7, we’ll show you that this is the case. Using this bootstrap distribution, we can study the effect of sampling variation on our estimates. In particular, we’ll study the typical “error” of our estimates, known as the standard error. In Section 8.2 we’ll mimic our tactile resampling activity virtually on the computer, allowing us to quickly perform the resampling many more than 35 times. In Section 8.3 we’ll define the statistical concept of a confidence interval, which builds off bootstrap distributions. 
In Section 8.4, we’ll construct confidence intervals using the dplyr package, as well as a new package: the infer package for “tidy” and transparent statistical inference. We’ve already used one of the infer package’s functions, rep_sample_n(), but there’s a lot more. We’ll introduce the “tidy” statistical inference framework that was the motivation for the infer package pipeline that will be the driving package throughout the rest of this book. As we did in Chapter 7, we’ll tie all these ideas together with a real-life case study in Section 8.6. This time we’ll look at data from an experiment about yawning from the US television show Mythbusters. 8.2 Computer simulation of resampling Let’s now mimic our tactile resampling activity virtually by using our computer. 8.2.1 Virtually resampling once First, let’s perform the virtual analog of resampling once. Recall that the pennies_sample data frame included in the moderndive package contains the years of our original sample of 50 pennies from the bank. Furthermore, recall in Chapter 7 on sampling that we used the rep_sample_n() function as a virtual shovel to sample balls from our virtual bowl of 2400 balls as follows: virtual_shovel &lt;- bowl %&gt;% rep_sample_n(size = 50) Let’s modify this code to perform the resampling with replacement of the 50 slips of paper representing our original sample of 50 pennies: virtual_resample &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE) Observe how we explicitly set the replace argument to TRUE in order to tell rep_sample_n() that we would like to sample pennies with replacement. Had we not set replace = TRUE, the function would’ve assumed the default value of FALSE and hence done resampling without replacement. Additionally, since we didn’t specify the number of replicates via the reps argument, the function assumes the default of one replicate reps = 1. Lastly, observe also that the size argument is set to match the original sample size of 50 pennies. 
Let’s look at only the first 10 out of 50 rows of virtual_resample: virtual_resample # A tibble: 50 x 3 # Groups: replicate [1] replicate ID year &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 37 1962 2 1 1 2002 3 1 45 1997 4 1 28 2006 5 1 50 2017 6 1 10 2000 7 1 16 2015 8 1 47 1982 9 1 23 1998 10 1 44 2015 # … with 40 more rows The replicate variable only takes on the value of 1 corresponding to us only having reps = 1, the ID variable indicates which of the 50 pennies from pennies_sample was resampled, and year denotes the year of minting. Let’s now compute the mean year in our virtual resample of size 50 using data wrangling functions included in the dplyr package: virtual_resample %&gt;% summarize(resample_mean = mean(year)) # A tibble: 1 x 2 replicate resample_mean &lt;int&gt; &lt;dbl&gt; 1 1 1996 As we saw when we did our tactile resampling exercise, the resulting mean year is different than the mean year of our 50 originally sampled pennies of 1995.44. 8.2.2 Virtually resampling 35 times Let’s now perform the virtual analog of our 35 friends’ resampling. Using these results, we’ll be able to study the variability in the sample means from 35 resamples of size 50. Let’s first add a reps = 35 argument to rep_sample_n() to indicate we would like 35 replicates. Thus, we want to repeat the resampling with replacement of 50 pennies 35 times. virtual_resamples &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 35) virtual_resamples # A tibble: 1,750 x 3 # Groups: replicate [35] replicate ID year &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 21 1981 2 1 34 1985 3 1 4 1988 4 1 11 1994 5 1 26 1979 6 1 8 1996 7 1 19 1983 8 1 21 1981 9 1 49 2006 10 1 2 1986 # … with 1,740 more rows The resulting virtual_resamples data frame has 35 \\(\\times\\) 50 = 1750 rows corresponding to 35 resamples of 50 pennies. 
Let’s now compute the resulting 35 sample means using the same dplyr code as we did in the previous section, but this time adding a group_by(replicate): virtual_resampled_means &lt;- virtual_resamples %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) virtual_resampled_means # A tibble: 35 x 2 replicate mean_year &lt;int&gt; &lt;dbl&gt; 1 1 1995.58 2 2 1999.74 3 3 1993.7 4 4 1997.1 5 5 1999.42 6 6 1995.12 7 7 1994.94 8 8 1997.78 9 9 1991.26 10 10 1996.88 # … with 25 more rows Observe that virtual_resampled_means has 35 rows, corresponding to the 35 resampled means. Furthermore, observe that the values of mean_year vary. Let’s visualize this variation using a histogram in Figure 8.12. ggplot(virtual_resampled_means, aes(x = mean_year)) + geom_histogram(binwidth = 1, color = &quot;white&quot;, boundary = 1990) + labs(x = &quot;Resample mean year&quot;) FIGURE 8.12: Distribution of 35 sample means from 35 resamples. Let’s compare our virtually constructed bootstrap distribution with the one our 35 friends constructed via our tactile resampling exercise in Figure 8.13. Observe how they are somewhat similar, but not identical. FIGURE 8.13: Comparing distributions of means from resamples. Recall that in the “resampling with replacement” scenario we are illustrating here both of these histograms have a special name: the bootstrap distribution of the sample mean. Furthermore, they are an approximation to the sampling distribution of the sample mean, a concept you saw in Chapter 7 on sampling. These distributions allow us to study the effect of sampling variation on our estimates of the true population mean, in this case the true mean year for all US pennies. However, unlike in Chapter 7 where we took multiple samples (something one would never do in practice), bootstrap distributions are constructed by taking multiple resamples from a single sample: in this case, the 50 original pennies from the bank. 
8.2.3 Virtually resampling 1000 times Remember that one of the goals of resampling with replacement is to construct the bootstrap distribution, which is an approximation of the sampling distribution. However, the bootstrap distribution in Figure 8.12 is based only on 35 resamples and hence looks a little coarse. Let’s increase the number of resamples to 1000, so that we can hopefully better see the shape and the variability between different resamples. # Repeat resampling 1000 times virtual_resamples &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) # Compute 1000 sample means virtual_resampled_means &lt;- virtual_resamples %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) However, in the interest of brevity, going forward let’s combine these two operations into a single chain of %&gt;% pipe operators: virtual_resampled_means &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) virtual_resampled_means # A tibble: 1,000 x 2 replicate mean_year &lt;int&gt; &lt;dbl&gt; 1 1 1992.6 2 2 1994.78 3 3 1994.74 4 4 1997.88 5 5 1990 6 6 1999.48 7 7 1990.26 8 8 1993.2 9 9 1994.88 10 10 1996.3 # … with 990 more rows In Figure 8.14 let’s visualize the bootstrap distribution of these 1000 means based on 1000 virtual resamples: ggplot(virtual_resampled_means, aes(x = mean_year)) + geom_histogram(binwidth = 1, color = &quot;white&quot;, boundary = 1990) + labs(x = &quot;sample mean&quot;) FIGURE 8.14: Bootstrap resampling distribution based on 1000 resamples. Note here that the bell shape is starting to become much more apparent. We now have a general sense for the range of values that the sample mean may take on. But where is this histogram centered? 
Let’s compute the mean of the 1000 resample means: virtual_resampled_means %&gt;% summarize(mean_of_means = mean(mean_year)) # A tibble: 1 x 1 mean_of_means &lt;dbl&gt; 1 1995.36 The mean of these 1000 means is 1995.36, which is quite close to the mean of our original sample of 50 pennies of 1995.44. This is the case since each of the 1000 resamples is based on the original sample of 50 pennies. Congratulations! You’ve just constructed your first bootstrap distribution! In the next section, you’ll see how to use this bootstrap distribution to construct confidence intervals. Learning check (LC8.1) What is the chief difference between a bootstrap distribution and a sampling distribution? (LC8.2) Looking at the bootstrap distribution for the sample mean in Figure 8.14, between what two values would you say most values lie? 8.3 Understanding confidence intervals Let’s start this section with an analogy involving fishing. Say you are trying to catch a fish. On the one hand, you could use a spear, while on the other you could use a net. Using the net will probably allow you to catch more fish! Now think back to our pennies exercise where you are trying to estimate the true population mean year \\(\\mu\\) of all US pennies. Think of the value of \\(\\mu\\) as a fish. On the one hand, we could use the appropriate point estimate/sample statistic to estimate \\(\\mu\\), which we saw in Table 8.1 is the sample mean \\(\\overline{x}\\). Based on our sample of 50 pennies from the bank, the sample mean was 1995.44. Think of using this value as “fishing with a spear.” What would “fishing with a net” correspond to? Look at the bootstrap distribution in Figure 8.14 once more. Between which two years would you say that “most” sample means lie? While this question is somewhat subjective, saying that most sample means lie between 1992 and 2000 would not be unreasonable. 
Think of this interval as the “net.” What we’ve just illustrated is the concept of a confidence interval, which we’ll abbreviate with “CI” throughout this book. As opposed to a point estimate/sample statistic that estimates the value of an unknown population parameter with a single value, a confidence interval gives what can be interpreted as a range of plausible values. Going back to our analogy, point estimates/sample statistics can be thought of as spears, whereas confidence intervals can be thought of as nets. FIGURE 8.15: Analogy of difference between point estimates and confidence intervals. Our proposed interval of 1992 to 2000 was constructed by eye and was thus somewhat subjective. We now introduce two methods for constructing such intervals in a more exact fashion: the percentile method and the standard error method. Both methods for confidence interval construction share some commonalities. First, they are both constructed from a bootstrap distribution, as you constructed in Subsection 8.2.3 and visualized in Figure 8.14. Second, they both require you to specify the confidence level. Commonly used confidence levels include 90%, 95%, and 99%. All other things being equal, higher confidence levels correspond to wider confidence intervals and lower confidence levels correspond to narrower confidence intervals. In this book, we’ll be mostly using 95% and hence constructing “95% confidence intervals for \\(\\mu\\).” 8.3.1 Percentile method One method to construct a confidence interval is to use the middle 95% of values of the bootstrap distribution. We can do this by computing the 2.5th and 97.5th percentiles, which are 1991.059 and 1999.283 respectively. This is known as the percentile method for constructing confidence intervals. For now, let’s focus only on the concepts behind a percentile method constructed confidence interval; we’ll show you the code to compute these values in the next section. 
Let’s mark these percentiles on the bootstrap distribution with vertical lines in Figure 8.16. About 95% of the values in the mean_year variable in virtual_resampled_means fall between the 1991.059 and 1999.283 endpoints, with 2.5% to the left of the left-most line and 2.5% to the right of the right-most line. FIGURE 8.16: Percentile method 95 percent confidence interval. Interval marked by vertical lines. 8.3.2 Standard error method Recall in Appendix A.2, we saw that if a numerical variable follows a normal distribution, or in other words the histogram of this variable is bell-shaped, then roughly 95% of values fall between \\(\\pm\\) 1.96 standard deviations of the mean. Given that our bootstrap distribution based on 1000 resamples with replacement in Figure 8.14 is normally shaped, let’s use this fact about normal distributions to construct a confidence interval in a different way. First, recall the bootstrap distribution has a mean equal to 1995.36. This value almost coincides exactly with the value of the sample mean \\(\\overline{x}\\) of our original 50 pennies of 1995.44. Second, let’s compute the standard deviation of the bootstrap distribution using the values of mean_year in the virtual_resampled_means data frame: virtual_resampled_means %&gt;% summarize(SE = sd(mean_year)) # A tibble: 1 x 1 SE &lt;dbl&gt; 1 2.15466 What is this value? Recall that the bootstrap distribution is an approximation to the sampling distribution. Recall also that the standard deviation of a sampling distribution has a special name: the standard error. Putting these two facts together, we can say that 2.155 is an approximation of the standard error of \\(\\overline{x}\\). 
Thus, using our 95% rule of thumb about normal distributions from Appendix A.2, we can use the following formula to determine the lower and upper endpoints of a 95% confidence interval for \\(\\mu\\): \\[ \\begin{aligned} \\overline{x} \\pm 1.96 \\cdot SE &amp;= (\\overline{x} - 1.96 \\cdot SE, \\overline{x} + 1.96 \\cdot SE)\\\\ &amp;= (1995.44 - 1.96 \\cdot 2.155, 1995.44 + 1.96 \\cdot 2.155)\\\\ &amp;= (1991.22, 1999.66) \\end{aligned} \\] Let’s now add the SE method confidence interval with dashed lines in Figure 8.17. FIGURE 8.17: Comparing two 95 percent confidence interval methods. We see that both methods produce nearly identical 95% confidence intervals for \\(\\mu\\), with the percentile method yielding \\((1991.06, 1999.28)\\) while the standard error method yields \\((1991.22, 1999.66)\\). However, recall that we can only use the standard error rule when the bootstrap distribution is roughly normally-shaped. Now that we’ve introduced the concept of confidence intervals and laid out the intuition behind two methods for constructing them, let’s explore the code that allows us to construct them. Learning check (LC8.3) What condition about the bootstrap distribution must be met for us to be able to construct confidence intervals using the standard error method? (LC8.4) Say we wanted to construct a 68% confidence interval instead of a 95% confidence interval for \\(\\mu\\). Describe what changes are needed to make this happen. Hint: we suggest you look at Appendix A.2 on the normal distribution. 8.4 Constructing confidence intervals Recall that the process of resampling with replacement we performed by hand in Section 8.1 and virtually in Section 8.2 is known as bootstrapping. 
The term bootstrapping originates in the expression of “pulling oneself up by their bootstraps,” meaning to “succeed only by one’s own efforts or abilities.” From a statistical perspective, it alludes to succeeding in being able to study the effects of sampling variation on estimates from the “effort” of a single sample. Or more precisely, constructing an approximation to the sampling distribution using only one sample. To perform this resampling with replacement virtually in Section 8.2, we used the rep_sample_n() function, making sure that the size of the resamples matched the original sample size of 50. In this section, we’ll build off these ideas to construct confidence intervals using a new package: the infer package for “tidy” and transparent statistical inference. 8.4.1 Original workflow Recall that in Section 8.2, we virtually performed bootstrap resampling with replacement to construct bootstrap distributions. Such distributions are approximations to the sampling distributions we saw in Chapter 7, but are constructed using only a single sample. Let’s revisit the original workflow using the %&gt;% pipe operator: First, we used the rep_sample_n() function to resample size = 50 pennies with replacement from the original sample of 50 pennies in pennies_sample by setting replace = TRUE. 
Furthermore, we repeated this resampling 1000 times by setting reps = 1000: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) Second, since for each of our 1000 resamples of size 50, we wanted to compute a separate sample mean, we used the dplyr verb group_by() to group observations/rows together by the replicate variable… pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) … followed by using summarize() to compute the sample mean() year for each replicate group: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) For this simple case, we can get by with using the rep_sample_n() function and a couple of dplyr verbs to construct the bootstrap distribution. However, dplyr verbs alone provide us with only a limited set of tools. For more complicated situations, we’ll need a little more firepower. Let’s repeat this using the infer package. 8.4.2 infer package workflow The infer package is an R package for statistical inference. It makes efficient use of the %&gt;% pipe operator we saw in Section 3.1 to spell out the sequence of steps necessary to perform statistical inference in a “tidy” and transparent fashion. Furthermore, just as the dplyr package provides functions with intuitive verb-like names to perform data wrangling, the infer package provides functions with intuitive verb-like names to perform statistical inference. Let’s go back to our pennies. Previously, we computed the value of the sample mean \\(\\overline{x}\\) using the dplyr function summarize(): pennies_sample %&gt;% summarize(stat = mean(year)) We’ll see that we can also do this using infer functions specify() and calculate(): pennies_sample %&gt;% specify(response = year) %&gt;% calculate(stat = &quot;mean&quot;) You might be asking yourself: “Isn’t the infer code longer? 
Why would I use that code?” While not immediately apparent, you’ll see that there are three chief benefits to the infer workflow as opposed to the dplyr workflow. First, the infer verb names better align with the overall resampling framework you need to understand to construct confidence intervals and to conduct hypothesis tests (in Chapter 9). We’ll see flowchart diagrams of this framework in the upcoming Figures 8.23 and 9.14. Second, you can jump back and forth seamlessly between confidence intervals and hypothesis testing with minimal changes to your code. This will become apparent in Subsection 9.3.2 when we’ll compare the infer code for both these inferential methods. Third, the infer workflow is much simpler for conducting inference when you have more than one variable. We’ll see two such situations. We’ll first see situations of two-sample inference where the sample data is collected from two groups, such as in Section 8.6 where we study the contagiousness of yawning and in Section 9.1 where we compare promotion rates of two groups at banks in the 1970s. Then in Section 10.4, we’ll see situations of inference for regression using the regression models you fit in Chapter 5. Let’s now illustrate the sequence of verbs necessary to construct a confidence interval for \\(\\mu\\), the population mean year of minting of all pennies in the US. 1. specify variables FIGURE 8.18: Diagram of specify() variables. The specify() function is used to choose which variables in a data frame will be the focus of our statistical inference. We do this by specifying the response argument. 
For example, in our pennies_sample data frame of the 50 pennies sampled from the bank, the variable of interest is year: pennies_sample %&gt;% specify(response = year) Response: year (numeric) # A tibble: 50 x 1 year &lt;dbl&gt; 1 2002 2 1986 3 2017 4 1988 5 2008 6 1983 7 2008 8 1996 9 2004 10 2000 # … with 40 more rows Notice how the data itself doesn’t change, but the Response: year (numeric) meta-data does. This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Section 3.4. We can also specify which variables will be the focus of our statistical inference using a formula = y ~ x. This is the same formula notation you saw in Chapters 5 and 6 on regression models: the response variable y is separated from the explanatory variable x by a ~ “tilde.” The following use of specify() with the formula argument yields the same result seen previously: pennies_sample %&gt;% specify(formula = year ~ NULL) Since in the case of pennies we only have a response variable and no explanatory variable of interest, we set the x on the right-hand side of the ~ to be NULL. While in the case of the pennies either specification works just fine, we’ll see examples later on where we have no choice but to use the formula specification. In particular in the upcoming Sections 8.6 on comparing two proportions and 10.4 on inference for regression. 2. generate replicates FIGURE 8.19: Diagram of generate() replicates. After we specify() the variables of interest, we pipe the results into the generate() function to generate replicates. In other words, repeat the resampling process a large number of times. Recall in Sections 8.2.2 and 8.2.3 we did this 35 and 1000 times. The generate() function’s first argument is reps, which sets the number of replicates we would like to generate. Since we want to resample the 50 pennies in pennies_sample with replacement 1000 times, we set reps = 1000. 
The second argument type determines the type of computer simulation we’d like to perform. We set this to type = &quot;bootstrap&quot; indicating that we want to perform bootstrap resampling. You’ll see different options for type in Chapter 9. pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) Response: year (numeric) # A tibble: 50,000 x 2 # Groups: replicate [1,000] replicate year &lt;int&gt; &lt;dbl&gt; 1 1 1996 2 1 1988 3 1 1979 4 1 1978 5 1 1983 6 1 1981 7 1 1993 8 1 1996 9 1 1992 10 1 1978 # … with 49,990 more rows Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 pennies with replacement 1000 times and 50,000 = 50 \\(\\times\\) 1000. The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times. The default value of the type argument is &quot;bootstrap&quot;, so if the last line was written as generate(reps = 1000), we’d obtain the same results. Comparing with original workflow: Note that the steps of the infer workflow up to this point produce the same results as the original workflow using the rep_sample_n() function we saw earlier. In other words, the following two code chunks produce similar results: # infer workflow: pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) # Original workflow: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) 3. calculate summary statistics FIGURE 8.20: Diagram of calculate() summary statistics. After we generate() many replicates of bootstrap resampling with replacement, we next want to summarize each of the 1000 resamples of size 50 to a single statistic value. As seen in the diagram, the calculate() function does this. In our case, we want to calculate the mean year for each bootstrap resample of size 50. To do so, we set the stat argument to &quot;mean&quot;. 
You can also set the stat argument to a variety of other common summary statistics, like &quot;median&quot;, &quot;sum&quot;, &quot;sd&quot; (standard deviation), and &quot;prop&quot; (proportion). To see a list of all possible summary statistics you can use, type ?calculate to read the help file. We’ll use these stat functions throughout this book. Let’s save the result in a data frame called bootstrap_distribution and explore its contents: bootstrap_distribution &lt;- pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) %&gt;% calculate(stat = &quot;mean&quot;) bootstrap_distribution # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 1993.48 2 2 1993.8 3 3 1996.88 4 4 1995.34 5 5 1996.98 6 6 1995.72 7 7 1995.36 8 8 1992.6 9 9 1994.24 10 10 1993.16 # … with 990 more rows Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 replicate values and the mean year for each bootstrap resample saved in the variable stat. Comparing with original workflow: You may have recognized at this point that the calculate() step in the infer workflow produces the same output as the group_by() %&gt;% summarize() steps in the original workflow: # infer workflow: pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) %&gt;% calculate(stat = &quot;mean&quot;) # Original workflow: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) 4. visualize the results FIGURE 8.21: Diagram of visualize() results. The visualize() verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical stat variable’s values. visualize(bootstrap_distribution) FIGURE 8.22: Bootstrap distribution. Comparing with original workflow: In fact, visualize() is a wrapper function for the ggplot() function that uses a geom_histogram() layer. 
Recall that we illustrated the concept of a wrapper function in Figure 5.5 in Section 5.1.2. # infer workflow: # Original workflow: visualize(bootstrap_distribution) ggplot(bootstrap_distribution, aes(x = stat)) + geom_histogram() The visualize() function can take many other arguments which we’ll see momentarily to customize the plot further. It also works with helper functions to do the shading of the histogram values corresponding to the confidence interval values. Let’s recap the steps of the infer workflow for constructing a bootstrap distribution and then visualizing it. FIGURE 8.23: infer package workflow for confidence intervals. Recall how we introduced two different methods for constructing 95% confidence intervals for an unknown population parameter in Section 8.3: the percentile method and the standard error method. Let’s now check out the infer package code that explicitly constructs these. There are also some additional neat functions to visualize the resulting confidence intervals built-in! 8.4.3 Percentile method with infer Recall the percentile method for constructing 95% confidence intervals we introduced in Section 8.3.1. This method sets the lower endpoint of the confidence interval at the 2.5th percentile of the bootstrap distribution and similarly sets the upper endpoint at the 97.5th percentile. The resulting interval captures the middle 95% of the values of the sample mean in the bootstrap distribution. We can compute the 95% confidence interval by piping the bootstrap_distribution data frame we created into the get_confidence_interval() function from the infer package, with the confidence level set to 0.95 and the confidence interval type to be percentile. Let’s save the results in percentile_ci. 
percentile_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 1991.16 1999.58 Alternatively, we can visualize the interval (1991.16, 1999.58) by piping the bootstrap_distribution data frame into the visualize() function and adding a shade_confidence_interval() layer. We set the endpoints argument to be percentile_ci. visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = percentile_ci) FIGURE 8.24: Percentile method 95 percent confidence interval shaded corresponding to potential values. Observe in Figure 8.24 that 95% of the sample means stored in the stat variable in bootstrap_distribution fall between the two endpoints marked with the darker lines, with 2.5% of the sample means to the left of the shaded area and 2.5% of the sample means to the right. You also have the option to change the colors of the shading using the color and fill arguments. You can also use the shorter named function shade_ci() and the results will be the same. This is for folks that don’t want to type out all of confidence_interval and prefer to type out ci instead. Try out the following code! visualize(bootstrap_distribution) + shade_ci(endpoints = percentile_ci, color = &quot;hotpink&quot;, fill = &quot;khaki&quot;) 8.4.4 Standard error method with infer Recall the standard error method for constructing 95% confidence intervals we introduced in Section 8.3.2. For any distribution that is normally shaped, roughly 95% of the values lie within two standard deviations of the mean. In the case of the bootstrap distribution, the standard deviation has a special name: the standard error. So in our case, 95% of values of the bootstrap distribution will lie within \\(\\pm\\) 1.96 standard errors of \\(\\overline{x}\\). 
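As a brief aside on where the 1.96 multiplier comes from, the following base R sketch looks it up from the standard normal distribution using qnorm(). The helper name z_multiplier is hypothetical and introduced only for illustration; it is not part of the infer package.

```r
# Hypothetical helper: normal-distribution multiplier for a given confidence level
z_multiplier <- function(level) {
  alpha <- 1 - level
  # Leave alpha / 2 in each tail of the standard normal curve
  qnorm(1 - alpha / 2)
}

z_95 <- z_multiplier(0.95)
round(z_95, 2)
```

Calling z_multiplier() with another level, such as 0.80 or 0.99, gives the multiplier that would replace 1.96 in the standard error method for that confidence level.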
Thus, a 95% confidence interval is \\(\\overline{x} \\pm 1.96 \\cdot SE\\) = \\((\\overline{x} - 1.96 \\cdot SE,\\) \\(\\overline{x} + 1.96 \\cdot SE)\\). Computation of the 95% confidence interval can once again be done by piping the bootstrap_distribution data frame we created into the get_confidence_interval() function. This time, however, we first set the type argument to be &quot;se&quot;. Second, we must specify the point_estimate argument in order to set the center of the confidence interval. We set this to be the sample mean of the original sample of 50 pennies of 1995.44. standard_error_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(type = &quot;se&quot;, point_estimate = 1995.44) standard_error_ci # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 1991.16 1999.72 If we would like to visualize the interval (1991.16, 1999.72), we can once again pipe the bootstrap_distribution data frame into the visualize() function and add a shade_confidence_interval() layer to our plot. We set the endpoints argument to be standard_error_ci. The resulting standard-error method based 95% confidence interval for \\(\\mu\\) can be seen in Figure 8.25. visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = standard_error_ci) FIGURE 8.25: Standard error method 95 percent confidence interval. As noted in Section 8.3, both methods produce similar confidence intervals: Percentile method: (1991.16, 1999.58) Standard error method: (1991.16, 1999.72) Learning check (LC8.5) Construct a 95% confidence interval for the median year of minting of all US pennies. Use the percentile method and, if appropriate, then use the standard-error method. 8.5 Interpreting confidence intervals Now that we’ve shown you how to construct confidence intervals using a sample drawn from a population, let’s focus on how to interpret their effectiveness. 
The effectiveness of a confidence interval is judged by whether or not it contains the true value of the population parameter. Going back to our fishing analogy in Section 8.3, this is like asking “Did our net capture the fish?” So for example, does our percentile-based confidence interval of (1991.16, 1999.58) “capture” the true mean year \\(\\mu\\) of all US pennies? Alas, we’ll never know, because we don’t know what the true value of \\(\\mu\\) is. After all, we’re sampling to estimate it! In order to interpret a confidence interval’s effectiveness, we need to know what the value of the population parameter is. That way we can say whether or not a confidence interval “captured” this value. Let’s revisit our sampling bowl from Chapter 7. What proportion of the bowl’s 2400 balls are red? Let’s compute this: bowl %&gt;% summarize(p_red = mean(color == &quot;red&quot;)) # A tibble: 1 x 1 p_red &lt;dbl&gt; 1 0.375 In this case, we know what the value of the population parameter is: we know that the population proportion \\(p\\) is 0.375. In other words, we know that 37.5% of the bowl’s balls are red. As we stated in Subsection 7.3.3, the sampling bowl exercise doesn’t really reflect how sampling is done in real-life, but rather was an idealized activity. In real-life, we won’t know what the true value of the population parameter is, hence the need for estimation. Let’s now construct confidence intervals for \\(p\\) using our 33 groups of friends’ samples from the bowl in Chapter 7. We’ll then see if the confidence intervals “captured” the true value of \\(p\\), which we know to be 37.5%. In other words: “Did the net capture the fish?” 8.5.1 Did the net capture the fish? Recall that we had 33 groups of friends each take samples of size 50 from the bowl and then compute the sample proportion of red \\(\\widehat{p}\\). This resulted in 33 such estimates of \\(p\\). 
Let’s focus on Ilyas and Yohan’s sample, which is saved in the bowl_sample_1 data frame in the moderndive package: bowl_sample_1 # A tibble: 50 x 1 color &lt;chr&gt; 1 white 2 white 3 red 4 red 5 white 6 white 7 red 8 white 9 white 10 white # … with 40 more rows They observed 21 red balls out of 50 and thus their sample proportion \\(\\widehat{p}\\) was 21/50 = 0.42 = 42%. Think of this as the “spear” from our fishing analogy. Let’s now follow the infer package workflow from Section 8.4.2 to create a percentile method based 95% confidence interval for \\(p\\) using Ilyas and Yohan’s sample. Think of this as the “net.” 1. specify variables First, we specify() the response variable of interest color: bowl_sample_1 %&gt;% specify(response = color) Error: A level of the response variable `color` needs to be specified for the `success` argument in `specify()`. Whoops! We need to define which event is of interest! red or white balls? Since we are interested in the proportion red, let’s set success to be &quot;red&quot;: bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) Response: color (factor) # A tibble: 50 x 1 color &lt;fct&gt; 1 white 2 white 3 red 4 red 5 white 6 white 7 red 8 white 9 white 10 white # … with 40 more rows 2. generate replicates Second, we generate() 1000 replicates of bootstrap resampling with replacement from bowl_sample_1 by setting reps = 1000 and type = &quot;bootstrap&quot;. bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) Response: color (factor) # A tibble: 50,000 x 2 # Groups: replicate [1,000] replicate color &lt;int&gt; &lt;fct&gt; 1 1 white 2 1 white 3 1 red 4 1 white 5 1 white 6 1 white 7 1 white 8 1 red 9 1 white 10 1 white # … with 49,990 more rows Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 balls with replacement 1000 times and thus 50,000 = 50 \\(\\times\\) 1000. 
The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times. 3. calculate summary statistics Third, we summarize each of 1000 resamples of size 50 with the proportion of “successes”. In other words, the proportion of the balls that are &quot;red&quot;. We can set the summary statistic to be calculated to be the proportion by setting the stat argument to be &quot;prop&quot;. Let’s save the result in a data frame called sample_1_bootstrap: sample_1_bootstrap &lt;- bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;prop&quot;) sample_1_bootstrap # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.36 2 2 0.42 3 3 0.52 4 4 0.38 5 5 0.38 6 6 0.38 7 7 0.46 8 8 0.3 9 9 0.5 10 10 0.46 # … with 990 more rows Observe there are 1000 rows in this data frame and thus 1000 values of the variable stat. These 1000 values of stat represent our 1000 replicated values of the proportion, each based on a different resample. 4. visualize the results Fourth and lastly, let’s compute the resulting 95% confidence interval. percentile_ci_1 &lt;- sample_1_bootstrap %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci_1 # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.28 0.540500 Let’s visualize the bootstrap distribution along with the percentile_ci_1 percentile-based 95% confidence interval for \\(p\\) in Figure 8.26. We’ll adjust the number of bins to better see the resulting shape. Furthermore, we’ll add a dashed vertical line at Ilyas and Yohan’s observed \\(\\widehat{p}\\) = 21/50 = 0.42 = 42% using geom_vline(). 
sample_1_bootstrap %&gt;% visualize(bins = 15) + shade_confidence_interval(endpoints = percentile_ci_1) + geom_vline(xintercept = 0.42, linetype = &quot;dashed&quot;) FIGURE 8.26: Bootstrap distribution. Did Ilyas and Yohan’s net capture the fish? In other words, did their 95% confidence interval for \\(p\\) based on their sample contain the true value of \\(p\\) of 0.375? Yes! 0.375 is between the endpoints of our confidence interval (0.28, 0.54). However, will every 95% confidence interval for \\(p\\) capture this value? In other words, if we had a different sample of 50 balls and constructed a different confidence interval, would it necessarily contain \\(p\\) = 0.375 as well? Let’s see! Let’s first take a different sample from the bowl, this time using the computer as we did in Chapter 7: bowl_sample_2 &lt;- bowl %&gt;% rep_sample_n(size = 50) bowl_sample_2 # A tibble: 50 x 3 # Groups: replicate [1] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 1665 red 2 1 1312 red 3 1 2105 red 4 1 810 white 5 1 189 white 6 1 1429 white 7 1 2294 red 8 1 1233 white 9 1 1951 white 10 1 2061 white # … with 40 more rows Let’s reapply the same infer functions on bowl_sample_2 to generate a different 95% confidence interval for \\(p\\). 
First we create the new bootstrap distribution and save the results in sample_2_bootstrap: sample_2_bootstrap &lt;- bowl_sample_2 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;prop&quot;) sample_2_bootstrap # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.36 2 2 0.38 3 3 0.42 4 4 0.26 5 5 0.5 6 6 0.32 7 7 0.4 8 8 0.32 9 9 0.5 10 10 0.44 # … with 990 more rows We once again compute a percentile-based 95% confidence interval for \\(p\\): percentile_ci_2 &lt;- sample_2_bootstrap %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci_2 # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.22 0.5 Does this new net capture the fish? In other words, does the 95% confidence interval for \\(p\\) based on the new sample contain the true value of \\(p\\) of 0.375? Yes again! 0.375 is between the endpoints of our confidence interval (0.22, 0.5). Let’s now repeat this process 100 more times: we take 100 virtual samples from the bowl and construct 100 95% confidence intervals. Let’s visualize the results in Figure 8.27 where: We mark the true value of \\(p\\) = 0.375 with a vertical line. We mark each of the 100 95% confidence intervals with horizontal lines. These are the “nets.” The horizontal line is colored grey if the confidence interval “captures” the true value of \\(p\\) marked with the vertical line. The horizontal line is colored black otherwise. FIGURE 8.27: 100 percentile-based 95 percent confidence intervals for \\(p\\). Of the 100 95% confidence intervals, 96 of them captured the true value \\(p\\) = 0.375, whereas 4 of them didn’t. In other words, 96 of our nets caught the fish, whereas 4 of our nets didn’t. This is where the “95% confidence level” we defined in Section 8.3 comes into play: for every 100 95% confidence intervals, we expect that 95 of them will capture \\(p\\) and that 5 of them won’t. 
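The repeated-sampling experiment behind Figure 8.27 can be sketched in base R. This is a rough stand-in rather than the code used to produce the figure: bowl_colors is a reconstructed vector with 900 red balls out of 2400 (so that \\(p\\) = 0.375), one_interval() is a hypothetical helper, and sample(), replicate(), and quantile() replace the infer verbs. Because the random draws differ, the count of capturing intervals will generally not equal the count reported for Figure 8.27.

```r
set.seed(2019)

# Stand-in for the bowl: 2400 balls, 900 of them red, so p = 0.375
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))
p_true <- mean(bowl_colors == "red")

# One "net": draw n balls, bootstrap the sample, return a percentile interval
one_interval <- function(n = 50, reps = 1000, level = 0.95) {
  balls <- sample(bowl_colors, size = n)
  boot_props <- replicate(
    reps,
    mean(sample(balls, size = n, replace = TRUE) == "red")
  )
  alpha <- 1 - level
  quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Build 100 intervals (one row each) and count how many capture the true p
intervals <- t(replicate(100, one_interval()))
captured <- intervals[, 1] <= p_true & p_true <= intervals[, 2]
sum(captured)
```

Rerunning the last three lines with a different seed gives a different count each time, which is the sense in which the confidence level describes the procedure rather than any single interval.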
Note that “expect” is a probabilistic statement referring to a long-run average. In other words, for every 100 confidence intervals, we will observe about 95 confidence intervals that capture \\(p\\), but not necessarily exactly 95. In Figure 8.27 for example, 96 of the confidence intervals capture \\(p\\). To further accentuate our point about confidence levels, let’s generate a figure similar to Figure 8.27, but this time constructing 80% standard-error method based confidence intervals instead. Let’s visualize the results in Figure 8.28 with the scale on the x-axis being the same as in Figure 8.27 to make comparison easy. Furthermore, since all standard-error method confidence intervals for \\(p\\) are centered at their respective point estimates \\(\\widehat{p}\\), we mark this value on each line with dots. FIGURE 8.28: 100 SE-based 80 percent confidence intervals for \\(p\\) with point estimate center marked with dots. Observe how the 80% confidence intervals are narrower than the 95% confidence intervals, reflecting our lower degree of confidence. Think of this as using a smaller “net.” We’ll explore other determinants of confidence interval width in the upcoming Section 8.5.3. Furthermore, observe that of the 100 80% confidence intervals, 82 of them captured the population proportion \\(p\\) = 0.375, whereas 18 of them did not. Since we lowered the confidence level from 95% to 80%, we now have a much larger number of confidence intervals that failed to “catch the fish.” 8.5.2 Precise &amp; shorthand interpretation Let’s return our attention to 95% confidence intervals. The precise and mathematically correct interpretation of a 95% confidence interval is a little long-winded: Precise interpretation: If we repeated our sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population parameter. This is what we observed in Figure 8.27. 
Our confidence interval construction procedure is 95% “reliable.” In other words, we can expect our confidence intervals to include the true population parameter about 95% of the time. A common but incorrect interpretation is: “There is a 95% probability that the confidence interval contains \\(p\\).” Looking at Figure 8.27, each of the confidence intervals either does or doesn’t contain \\(p\\). In other words, the probability is either a 1 or a 0. So if the 95% confidence level only relates to the reliability of the confidence interval construction procedure and not to a given confidence interval itself, what insight can be derived from a given confidence interval? For example, going back to the pennies example, we found that the percentile method 95% confidence interval for \\(\\mu\\) was (1991.16, 1999.58) whereas the standard error method 95% confidence interval was (1991.16, 1999.72). What can be said about these two intervals? Loosely speaking, we can think of these intervals as our “best guess” of a plausible range of values for the mean year \\(\\mu\\) of all US pennies. For the rest of this book, we’ll use the following shorthand summary of the precise interpretation. Short-hand interpretation: We are 95% “confident” that a 95% confidence interval captures the value of the population parameter. We use quotation marks around “confident” to emphasize that while 95% relates to the reliability of our confidence interval construction procedure, ultimately a constructed confidence interval is our best guess of an interval that contains the population parameter. In other words, it’s our best net. So returning to our pennies example and focusing on the percentile-method, we are 95% “confident” that the true mean year of pennies in circulation in 2019 is somewhere between 1991.16 and 1999.58. 8.5.3 Width of confidence intervals Now that we know how to interpret confidence intervals, let’s go over some factors that determine their width. 
Impact of confidence level One factor that determines confidence interval widths is the pre-specified confidence level. For example, in Figures 8.27 and 8.28, we compared the widths of 95% and 80% confidence intervals and observed that the 95% confidence intervals were wider. The quantification of the confidence level should match what many expect of the word “confident.” In order to be more confident in our best guess of a range of values, we need to widen the range of values. To elaborate on this, imagine we want to guess the forecasted high temperature in Seoul, South Korea on August 15th. Given Seoul’s temperate climate with 4 distinct seasons, we could say somewhat confidently that the high temperature would be between 50°F - 95°F (10°C - 35°C). However, if we wanted a temperature range we were absolutely confident about, we would need to widen it. We need this wider range to allow for the possibility of anomalous weather, like a freak cold spell or an extreme heat wave. So a range of temperatures we could be near certain about would be between 32°F - 110°F (0°C - 43°C). On the other hand, if we could tolerate being a little less confident, we could narrow this range to between 70°F - 85°F (21°C - 30°C). Let’s revisit our sampling bowl from Chapter 7. Let’s compare \\(10 \\times 3 = 30\\) confidence intervals for \\(p\\) based on three different confidence levels: 80%, 95%, and 99%. Specifically, we’ll first take 30 different random samples of size \\(n\\) = 50 balls from the bowl. Then we’ll construct 10 percentile-based confidence intervals using each of the three different confidence levels. Finally, we’ll compare the widths of these intervals. We visualize the resulting confidence intervals in Figure 8.29 along with a vertical line marking the true value of \\(p\\) = 0.375. FIGURE 8.29: Ten 80, 95, and 99 percent confidence intervals for \\(p\\) based on \\(n = 50\\). 
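The effect of the confidence level on width can also be checked with a small base R sketch. It simplifies the setup of Figure 8.29: instead of ten samples per level, it uses a single sample of 50 balls and one bootstrap distribution, and computes percentile intervals at all three levels from it. The bowl_colors vector is a reconstructed stand-in for the bowl (900 red out of 2400), and the widths it prints are not the ones behind the figure.

```r
set.seed(8)

# Stand-in for the bowl: 2400 balls, 900 of them red
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))

# One sample of 50 balls and its bootstrap distribution of proportions red
balls <- sample(bowl_colors, size = 50)
boot_props <- replicate(
  1000,
  mean(sample(balls, size = 50, replace = TRUE) == "red")
)

# Width of the percentile interval at each confidence level
conf_levels <- c(0.80, 0.95, 0.99)
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  endpoints <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2),
                        names = FALSE)
  endpoints[2] - endpoints[1]
})

data.frame(level = conf_levels, width = widths)
```

Because all three intervals come from the same bootstrap distribution, a higher level can only move the endpoints outward, so the widths never decrease as the level rises.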
Observe that as the confidence level increases from 80% to 95% to 99%, the confidence intervals tend to get wider. Let’s compare their average widths in Table 8.2. TABLE 8.2: Average width of 80, 95, and 99 percent confidence intervals. Confidence level Mean width 80% 0.166 95% 0.264 99% 0.338 So in order to have a higher confidence level, our confidence intervals must be wider. Ideally, we would have both a high confidence level and narrow confidence intervals. However, we cannot have it both ways. If we want to “be more confident”, we need to allow for wider intervals. Conversely, if we would like a narrow interval, we must tolerate a lower confidence level. The moral of the story is: Higher confidence levels tend to produce wider confidence intervals. However, when looking at Figure 8.29 it is important to keep in mind that we kept the sample size fixed at \\(n\\) = 50. In other words, all \\(10 \\times 3 = 30\\) random samples from the bowl had the same sample size. What happens if instead we took samples of different sizes? Recall that we did this in Section 7.2.4 using virtual shovels with 25, 50, and 100 slots. We delve into this next. Impact of sample size This time, let’s fix the confidence level at 95%, but consider three different sample sizes \\(n\\): 25, 50, and 100. Specifically, we’ll first take 10 different random samples of size 25, 10 different random samples of size 50, and 10 different random samples of size 100. We’ll then construct 95% percentile-based confidence intervals. Finally, we’ll compare the widths of these intervals. We visualize the resulting 30 confidence intervals in Figure 8.30. Note also the vertical line marking the true value of \\(p\\) = 0.375. FIGURE 8.30: Ten 95 percent confidence intervals for \\(p\\) based on n = 25, 50, and 100. Observe that as the confidence intervals are constructed from larger and larger sample sizes, they tend to get narrower. Let’s compare the average widths in Table 8.3. 
TABLE 8.3: Average width of 95 percent confidence intervals based on n = 25, 50, and 100. Sample size Mean width n = 25 0.380 n = 50 0.270 n = 100 0.183 The moral of the story is: Larger sample sizes tend to produce narrower confidence intervals. Recall that this was a key message in Section 7.3.3. As we used larger and larger shovels for our samples, the sample proportions red \\(\\widehat{p}\\) tended to vary less. In other words, our estimates got more and more precise. Recall that we visualized these results in Figure 7.15, where we compared the sampling distributions for \\(\\widehat{p}\\) based on samples of size \\(n\\) equal to 25, 50, and 100. We also quantified the sampling variation of these sampling distributions using their standard deviation, which has that special name: the standard error. So as the sample size increases, the standard error decreases. In fact, the standard error is another related factor in determining confidence interval width. We’ll explore this fact in Subsection 8.7.2 when we discuss theory-based methods for constructing confidence intervals using mathematical formulas. Such methods are an alternative to the computer-based methods we’ve been using so far. 8.6 Case study: Is yawning contagious? Let’s apply our knowledge of confidence intervals to answer the question: “Is yawning contagious?” If you see someone else yawn, are you more likely to yawn? In an episode of the US show Mythbusters, the hosts conducted an experiment to answer this question. The episode is available to view in the United States on the Discovery Network website here and more information about the episode is also available on IMDb. 8.6.1 Mythbusters study data Fifty adult participants who thought they were being considered for an appearance on the show were interviewed by a show recruiter. In the interview, the recruiter either yawned or did not. Participants then sat by themselves in a large van and were asked to wait. 
While in the van, the Mythbusters team watched the participants using a hidden camera to see if they yawned. The data frame containing the results of their experiment is available in the mythbusters_yawn data frame included in the moderndive package: mythbusters_yawn # A tibble: 50 x 3 subj group yawn &lt;int&gt; &lt;chr&gt; &lt;chr&gt; 1 1 seed yes 2 2 control yes 3 3 seed no 4 4 seed yes 5 5 seed no 6 6 control no 7 7 seed yes 8 8 control no 9 9 control no 10 10 seed no # … with 40 more rows The variables are: subj: The participant ID with values 1 through 50. group: A binary treatment variable indicating whether the participant was exposed to yawning. &quot;seed&quot; indicates the participant was exposed to yawning while &quot;control&quot; indicates the participant was not. yawn: A binary response variable indicating whether the participant ultimately yawned. Recall that you learned about treatment and response variables in Subsection 5.3.1 in our discussion on confounding variables. Let’s use some data wrangling to obtain counts of the four possible outcomes: mythbusters_yawn %&gt;% group_by(group, yawn) %&gt;% summarize(count = n()) # A tibble: 4 x 3 # Groups: group [2] group yawn count &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 control no 12 2 control yes 4 3 seed no 24 4 seed yes 10 Let’s first focus on the &quot;control&quot; group participants who were not exposed to yawning. 12 such participants did not yawn, while 4 such participants did. So out of the 16 people who were not exposed to yawning, 4/16 = 0.25 = 25% did yawn. Let’s now focus on the &quot;seed&quot; group participants who were exposed to yawning. 24 such participants did not yawn, while 10 such participants did yawn. So out of the 34 people who were exposed to yawning, 10/34 = 0.294 = 29.4% did yawn. Comparing these two percentages, the participants who were exposed to yawning yawned 29.4% - 25% = 4.4% more often than those who were not. 
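The arithmetic above can be double-checked in base R. This is only a minimal sketch: the four counts are typed in by hand from the summary table, and the object names yawned, total, p_hat, and diff_in_props are our own hypothetical choices rather than anything from the moderndive or infer packages.

```r
# Counts copied by hand from the summary table above
yawned = c(seed = 10, control = 4)   # participants who yawned
total  = c(seed = 34, control = 16)  # participants in each group

# Proportion of each group that yawned
p_hat = yawned / total
round(p_hat, 3)

# Difference in sample proportions: seed minus control
diff_in_props = unname(p_hat['seed'] - p_hat['control'])
round(diff_in_props, 3)
```

Keeping the subtraction order as seed minus control here matches the order used in the comparison of the two percentages.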
8.6.2 Sampling scenario Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Section 7.3.1. In Chapter 7 our study population was the bowl of \\(N\\) = 2400 balls. Our population parameter of interest was the population proportion of these balls that were red, denoted mathematically by \\(p\\). In order to estimate \\(p\\), we extracted a sample of 50 balls using the shovel and computed the relevant point estimate: the sample proportion that were red, denoted mathematically by \\(\\widehat{p}\\). Who is the study population here? All humans? All the people who watch the show Mythbusters? It’s hard to say! This question can only be answered if we know how the show’s hosts recruited participants! In other words, what was the sampling methodology used by the Mythbusters to recruit participants? Alas, we are not provided with this information. Only for the purposes of this case study, however, we’ll assume that the 50 participants are a representative sample of all Americans given the popularity of this show. Thus, we’ll be assuming that any results of this experiment will generalize to all \\(N\\) = 327 million Americans (2018 population). Just like with our sampling bowl, the population parameter here will involve proportions. However, in this case it will be the difference in population proportions \\(p_{seed} - p_{control}\\), where \\(p_{seed}\\) is the proportion of all Americans who if exposed to yawning will yawn themselves, and \\(p_{control}\\) is the proportion of all Americans who if not exposed to yawning still yawn themselves. Correspondingly, the point estimate/sample statistic based on the Mythbusters’ sample of participants will be the difference in sample proportions \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\). Let’s extend Table 7.5 of scenarios of sampling for inference to include our latest scenario. 
TABLE 8.4: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) This is known as a two-sample inference situation since we have two separate samples. Based on their two samples of size \\(n_{seed}\\) = 34 and \\(n_{control}\\) = 16, their point estimate is \\[ \\widehat{p}_{seed} - \\widehat{p}_{control} = \\frac{10}{34} - \\frac{4}{16} = 0.04411765 \\approx 4.4\\% \\] However, say the Mythbusters repeated this experiment. In other words, say they recruited 50 new participants and exposed 34 of them to yawning and 16 not. Would they obtain the exact same estimated difference of 4.4%? Probably not, again, because of sampling variation. How does this sampling variation affect their estimate of 4.4%? In other words, what would be a plausible range of values for this difference that accounts for this sampling variation? We can answer this question with confidence intervals! Furthermore, since the Mythbusters only have a single two-sample of 50 participants, they would have to construct a 95% confidence interval for \\(p_{seed} - p_{control}\\) using bootstrap resampling with replacement. We make a couple of important notes. First, for the comparison between the &quot;seed&quot; and &quot;control&quot; groups to make sense, both groups need to be independent of each other. Otherwise, they could influence each other’s results. Second, the order of the subtraction in the difference doesn’t matter so long as you are consistent and tailor your interpretations accordingly. 
In other words, using a point estimate of \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) or \\(\\widehat{p}_{control} - \\widehat{p}_{seed}\\) does not make a material difference, you just need to stay consistent and interpret your results accordingly. 8.6.3 Constructing the confidence interval As we did in Section 8.4.2, let’s first construct the bootstrap distribution for \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) and then use this to construct 95% confidence intervals for \\(p_{seed} - p_{control}\\). We’ll do this using the infer workflow again. However, since the difference in proportions is a new scenario for inference, we’ll need to use some new arguments in the infer functions along the way. 1. specify variables Let’s take our mythbusters_yawn data frame and specify() which variables are of interest using the y ~ x formula interface where: Our response variable is yawn: whether or not a participant yawned. It has levels &quot;yes&quot; and &quot;no&quot;. The explanatory variable is group: whether or not a participant was exposed to yawning. It has levels &quot;seed&quot; (exposed to yawning) and &quot;control&quot; (not exposed to yawning). mythbusters_yawn %&gt;% specify(formula = yawn ~ group) Error: A level of the response variable `yawn` needs to be specified for the `success` argument in `specify()`. Alas, we got an error message similar to the one from Subsection 8.5.1: infer is telling us that one of the levels of the categorical variable yawn needs to be defined as the success. Recall that we define success to be the event of interest we are trying to count and compute proportions of. Are we interested in those participants who &quot;yes&quot; yawned or those who &quot;no&quot; didn’t yawn? 
This isn’t clear to R, so we need to set the success argument to &quot;yes&quot; as follows: mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) Response: yawn (factor) Explanatory: group (factor) # A tibble: 50 x 2 yawn group &lt;fct&gt; &lt;fct&gt; 1 yes seed 2 yes control 3 no seed 4 yes seed 5 no seed 6 no control 7 yes seed 8 no control 9 no control 10 no seed # … with 40 more rows 2. generate replicates Our next step is to perform bootstrap resampling with replacement like we did with the slips of paper in our pennies activity in Section 8.1. We saw how it works with a single variable, both in computing bootstrap means in Section 8.4 and in computing bootstrap proportions in Section 8.5, but we haven’t yet worked with bootstrapping involving multiple variables. In the infer package, bootstrapping with multiple variables means that each row is potentially resampled. Let’s investigate this by looking at the first few rows of mythbusters_yawn: head(mythbusters_yawn) # A tibble: 6 x 3 subj group yawn &lt;int&gt; &lt;chr&gt; &lt;chr&gt; 1 1 seed yes 2 2 control yes 3 3 seed no 4 4 seed yes 5 5 seed no 6 6 control no When we bootstrap this data, we are potentially pulling the same subject’s readings multiple times. Thus, we could see the entries of &quot;seed&quot; for group and &quot;no&quot; for yawn together in a new row in a bootstrap sample. This is further seen by exploring the sample_n() function in dplyr on this smaller 6-row data frame composed of head(mythbusters_yawn). The sample_n() function can perform this bootstrapping procedure and is similar to the rep_sample_n() function in infer, except that it is not repeated, but rather only performs one sample with or without replacement. 
head(mythbusters_yawn) %&gt;% sample_n(size = 6, replace = TRUE) # A tibble: 6 x 3 subj group yawn &lt;int&gt; &lt;chr&gt; &lt;chr&gt; 1 1 seed yes 2 6 control no 3 1 seed yes 4 5 seed no 5 4 seed yes 6 4 seed yes We can see that in this bootstrap sample generated from the first six rows of mythbusters_yawn, we have some rows repeated. The same is true when we perform the generate() step in infer as done in what follows. Using this fact, we generate 1000 replicates, or in other words, we bootstrap resample the 50 participants with replacement 1000 times. mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) Response: yawn (factor) Explanatory: group (factor) # A tibble: 50,000 x 3 # Groups: replicate [1,000] replicate yawn group &lt;int&gt; &lt;fct&gt; &lt;fct&gt; 1 1 no seed 2 1 no seed 3 1 yes control 4 1 yes seed 5 1 no control 6 1 yes seed 7 1 no control 8 1 no seed 9 1 no seed 10 1 no seed # … with 49,990 more rows Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 participants with replacement 1000 times and 50,000 = 1000 \\(\\times\\) 50. The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times. 3. calculate summary statistics After we generate() many replicates of bootstrap resampling with replacement, we next want to summarize the bootstrap resamples of size 50 with a single summary statistic, the difference in proportions. We do this by setting the stat argument to &quot;diff in props&quot;: mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;) Error: Statistic is based on a difference; specify the `order` in which to subtract the levels of the explanatory variable. 
We see another error here. We need to specify the order of the subtraction. Is it \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) or \\(\\widehat{p}_{control} - \\widehat{p}_{seed}\\)? We specify it to be \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) by setting order = c(&quot;seed&quot;, &quot;control&quot;). Note that you could’ve also set order = c(&quot;control&quot;, &quot;seed&quot;). As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly. Let’s save the output in a data frame bootstrap_distribution_yawning: bootstrap_distribution_yawning &lt;- mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;seed&quot;, &quot;control&quot;)) bootstrap_distribution_yawning # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 -0.0213904 2 2 0.0459770 3 3 0 4 4 -0.0129870 5 5 0.326765 6 6 0.122807 7 7 0.293718 8 8 0.0761905 9 9 0.0679117 10 10 -0.0231729 # … with 990 more rows Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 replicate IDs and the 1000 differences in proportions for each bootstrap resample in stat. 4. visualize the results In Figure 8.31 we visualize() the resulting bootstrap resampling distribution. Let’s also add a vertical line at 0 by adding a geom_vline() layer. visualize(bootstrap_distribution_yawning) + geom_vline(xintercept = 0) FIGURE 8.31: Bootstrap distribution. First, let’s compute the 95% confidence interval for \\(p_{seed} - p_{control}\\) using the percentile method, in other words by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped. 
bootstrap_distribution_yawning %&gt;% get_confidence_interval(type = &quot;percentile&quot;, level = 0.95) # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -0.218313 0.304763 Second, since the bootstrap distribution is roughly bell-shaped, we can construct a confidence interval using the standard error method as well. Recall that to construct a confidence interval using the standard error method, we need to specify the center of the interval using the point_estimate argument. In our case, we need to set it to be the difference in sample proportions of 4.4% that the Mythbusters observed. However, we can also use the infer workflow to compute this value by excluding the generate() 1000 bootstrap replicates step. In other words, do not generate replicates, but rather use only the original sample data. We can achieve this by commenting out the generate() line, telling R to ignore it: mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% # generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;seed&quot;, &quot;control&quot;)) # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.0441176 We thus plug this value as the point_estimate argument. bootstrap_distribution_yawning %&gt;% get_confidence_interval(type = &quot;se&quot;, point_estimate = 0.0441176) # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 -0.213435 0.301670 Let’s visualize both confidence intervals in Figure 8.32, with the percentile method interval marked with solid lines and the standard error method marked with dashed lines. Observe that they are both similar to each other. FIGURE 8.32: Two 95 percent confidence intervals: percentile method (solid) and standard error method (dashed). 8.6.4 Interpreting the confidence interval Given that both confidence intervals are quite similar, let’s focus our interpretation to only the percentile method confidence interval of (-0.218, 0.305). 
Recall from Subsection 8.5.2 that the precise statistical interpretation of a 95% confidence interval is: if we repeated this construction procedure 100 times, then we expect about 95 of the confidence intervals to capture the true value of \\(p_{seed} - p_{control}\\). In other words, if we gathered 100 samples of \\(n\\) = 50 participants from a similar pool of people and constructed 100 confidence intervals, about 95 of them will contain the true value of \\(p_{seed} - p_{control}\\) while about 5 won’t. Given that this is a little long winded, we use the shorthand interpretation: we’re 95% “confident” that the true difference in proportions \\(p_{seed} - p_{control}\\) is between -0.22 and 0.30. There is one value of particular interest that this 95% confidence interval contains: zero. If \\(p_{seed} - p_{control}\\) were equal to 0, then there would be no difference in proportion yawning between the two groups. This would suggest that there is no associated effect of being exposed to yawning on whether you yawn yourself. In our case, since the 95% confidence interval includes 0, we cannot conclusively say if either proportion is larger. Of our 1000 bootstrap resamples with replacement, sometimes \\(\\widehat{p}_{seed}\\) was higher and thus those exposed to yawning yawned themselves more often. At other times, the reverse happened. Say on the other hand the 95% confidence interval was entirely above zero. This would suggest that \\(p_{seed} - p_{control} &gt; 0\\), or in other words \\(p_{seed} &gt; p_{control}\\), and thus we’d have evidence suggesting those exposed to yawning do yawn more often. 8.7 Conclusion 8.7.1 Comparing bootstrap and sampling distributions Let’s talk more about the relationship between sampling distributions and bootstrap distributions. 
Recall back in Section 7.2.3, we took 1000 virtual samples from the bowl using a virtual shovel, computed 1000 values of the sample proportion red \\(\\widehat{p}\\), then visualized their distribution in a histogram. Recall that this distribution is called the sampling distribution of \\(\\widehat{p}\\) . Furthermore, the standard deviation of the sampling distribution has a special name: the standard error. We also mentioned that this sampling activity does not reflect how sampling is done in real-life. Rather, it was an idealized version of sampling so that we could study the effects of sampling variation on estimates, like the proportion of the shovel’s balls that are red. In real-life however, one would take a single sample that’s as large as possible, much like in the Obama poll we saw in Section 7.4. But how can we get a sense of the effect of sampling variation on estimates if we only have one sample and thus only one estimate? Don’t we need many samples and hence many estimates? The workaround to having a single sample was to perform bootstrap resampling with replacement from the single sample. We did this in the resampling activity in Section 8.1 where we focused on the mean year of minting of pennies. We used pieces of paper representing the original sample of 50 pennies from the bank and resampled them with replacement from a hat. We had 35 of our friends perform this activity and visualized the resulting 35 sample means \\(\\overline{x}\\) in a histogram in Figure 8.11. This distribution was called the bootstrap distribution of \\(\\overline{x}\\). We stated at the time that the bootstrap distribution is an approximation to the sampling distribution of \\(\\overline{x}\\) in the sense that both distributions will have a similar shape and similar spread. Thus the standard error of the bootstrap distribution can be used as an approximation to the standard error of the sampling distribution. 
Let’s show you that this is the case by now comparing these two types of distributions. Specifically, we’ll compare: The sampling distribution of \\(\\widehat{p}\\) based on 1000 virtual samples from the bowl from Section 7.2.3. The bootstrap distribution of \\(\\widehat{p}\\) based on 1000 virtual resamples with replacement from Ilyas and Yohan’s single sample bowl_sample_1 from Section 8.5.1. Sampling distribution Here is the code you previously saw in Section 7.2.3 to construct the sampling distribution of \\(\\widehat{p}\\), with some small changes to incorporate the statistical terminology relating to sampling you learned in Section 7.3.1. FIGURE 8.33: Previously seen sampling distribution of sample proportion red based on 1000 samples of size \\(n = 50\\). An important thing to keep in mind is the default value for replace is FALSE when using rep_sample_n(). This is because when sampling 50 balls with a shovel, we are extracting 50 balls one-by-one without replacing them. This is in contrast to bootstrap resampling with replacement, where we resample a ball and put it back, and repeat this process 50 times. Let’s quantify the variability in this sampling distribution by calculating the standard deviation of the prop_red variable representing 1000 values of the sample proportion \\(\\widehat{p}\\). Remember that the standard deviation of the sampling distribution is the standard error, frequently denoted as se. sampling_distribution %&gt;% summarize(se = sd(prop_red)) # A tibble: 1 x 1 se &lt;dbl&gt; 1 0.0673987 Bootstrap distribution Here is the code you previously saw in Section 8.5.1 to construct the bootstrap distribution of \\(\\widehat{p}\\) based on Ilyas and Yohan’s original sample of 50 balls saved in bowl_sample_1. 
# Compute the bootstrap distribution using infer workflow: bootstrap_distribution &lt;- bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;prop&quot;) FIGURE 8.34: Bootstrap distribution of sample proportion red based on 1000 resamples of size \\(n = 50\\). bootstrap_distribution %&gt;% summarize(se = sd(stat)) # A tibble: 1 x 1 se &lt;dbl&gt; 1 0.0693340 Comparison Now that we have computed both the sampling distribution and the bootstrap distribution, let’s compare them side-by-side in Figure 8.35. We’ll make both histograms have matching scales on the x and y-axes to make them more comparable. Furthermore, we’ll add: To the sampling distribution on the top: a solid line denoting the proportion of the bowl’s balls that are red \\(p\\) = 0.375. To the bootstrap distribution on the bottom: a dashed line at the sample proportion \\(\\widehat{p}\\) = 21/50 = 0.42 = 42% that Ilyas and Yohan observed. FIGURE 8.35: Comparing the sampling and bootstrap distributions of \\(\\widehat{p}\\) There is a lot going on in Figure 8.35, so let’s break down all the comparisons slowly. First, observe how the sampling distribution on top is centered at \\(p\\) = 0.375. This is because the sampling is done at random and in an unbiased fashion. So the estimates \\(\\widehat{p}\\) are centered at the true value of \\(p\\). However, this is not the case with the bootstrap distribution on the bottom. The bootstrap distribution is centered at 0.42, which is the proportion red of Ilyas and Yohan’s 50 sampled balls. This is because we are resampling from the same sample over and over again. Since the bootstrap distribution is centered at the original sample’s proportion, it doesn’t necessarily provide a better estimate of \\(p\\) = 0.375. This leads us to our first lesson about bootstrapping: The bootstrap distribution will likely not have the same center as the sampling distribution. 
In other words, bootstrapping cannot improve the quality of an estimate. Second, let’s now compare the spread (in other words, the variation) of the two distributions: they are somewhat similar. In the previous code, we computed the standard deviations of both distributions as well. Recall that such standard deviations have a special name: standard errors. Let’s compare them in Table 8.5. TABLE 8.5: Comparing standard errors Distribution type Standard error Sampling distribution 0.067 Bootstrap distribution 0.069 Notice that the bootstrap distribution’s standard error is a rather good approximation to the sampling distribution’s standard error. This leads us to our second lesson about bootstrapping: Even if the bootstrap distribution might not have the same center as the sampling distribution, it will likely have very similar shape and spread. In other words, bootstrapping will give you a good estimate of the standard error. Thus, using the fact that the bootstrap distribution and sampling distributions have similar spreads, we can build confidence intervals using bootstrapping as we’ve done all throughout this chapter! 8.7.2 Theory-based confidence intervals So far in this chapter, we’ve constructed confidence intervals using two methods: the percentile method and the standard error method. Recall also from Section 8.3.2 that we can only use the standard-error method if the bootstrap distribution is bell-shaped, i.e., normally distributed. In a similar vein, if the sampling distribution is normally shaped, there is another method for constructing confidence intervals that does not involve using your computer. You can use a theory-based method involving mathematical formulas! The formula uses the rule of thumb we saw in Appendix A.2 that 95% of values in a normal distribution are within \\(\\pm 1.96\\) standard deviations of the mean. In the case of sampling and bootstrap distributions, recall that the standard deviation has a special name: the standard error. 
Theory-based method for computing standard errors There exists in many cases a formula that approximates the standard error! In the case of our bowl where we used the sample proportion red \\(\\widehat{p}\\) to estimate the proportion of the bowl’s balls that are red, the formula that approximates the standard error is: \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] For example, recall from bowl_sample_1 that Yohan and Ilyas sampled \\(n\\) = 50 balls and observed a sample proportion \\(\\widehat{p}\\) of 21/50 = 0.42. So using the formula, an approximation of the standard error of \\(\\widehat{p}\\) is \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{0.42(1-0.42)}{50}} = \\sqrt{0.004872} = 0.0698 \\approx 0.070\\] The key observation to make here is that there is an \\(n\\) in the denominator. In other words, as the sample size \\(n\\) increases, the standard error decreases. We demonstrated this fact using our virtual shovels in Section 7.3.3. If you don’t recall this demonstration, we highly recommend you go back and read that section. Let’s compare this theory-based standard error to the standard error of the sampling and bootstrap distributions you computed previously in Subsection 8.7.1 in Table 8.6. Notice how they are all similar! TABLE 8.6: Comparing standard errors Distribution type Standard error Sampling distribution 0.067 Bootstrap distribution 0.069 Formula approximation 0.070 Going back to Yohan and Ilyas’ sample proportion \\(\\widehat{p}\\) of 21/50 = 0.42, say this were based on a sample of size \\(n\\) = 100 instead of 50. Then the standard error would be: \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{0.42(1-0.42)}{100}} = \\sqrt{0.002436} = 0.0494\\] Observe that the standard error has gone down from 0.0698 to 0.0494. In other words, the “typical” error of our estimates using \\(n\\) = 100 will go down and hence our estimates will be more precise. 
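The two standard error calculations above can be reproduced with a few lines of base R. This is a minimal sketch under our own naming: the helper se_prop() is hypothetical and not part of the moderndive or infer packages.

```r
# Hypothetical helper: formula-based standard error of a sample proportion
se_prop = function(p_hat, n) {
  sqrt(p_hat * (1 - p_hat) / n)
}

# Same sample proportion, two different sample sizes
se_50  = se_prop(p_hat = 0.42, n = 50)
se_100 = se_prop(p_hat = 0.42, n = 100)
round(c(se_50, se_100), 4)

# Ratio of the two standard errors; doubling n divides the SE by sqrt(2)
se_50 / se_100
```

Because \\(n\\) sits under a square root, doubling the sample size does not halve the standard error; it only shrinks it by a factor of about 1.41.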
Recall we illustrated the difference between accuracy and precision of estimates in Figure 7.16. Why is this formula true? Unfortunately, we don’t have the tools at this point to prove this; you’ll need to take a more advanced course in probability and statistics. Theory-based method for constructing confidence intervals Using these theory-based standard errors, let’s present a theory-based method for constructing 95% confidence intervals that does not involve using a computer, but rather mathematical formulas. Note that this theory-based method only holds if the sampling distribution is normally shaped, so that we can use the 95% rule of thumb about normal distributions in Appendix A.2. Collect a single representative sample of size \\(n\\) that’s as large as possible. Compute the point estimate: the sample proportion \\(\\widehat{p}\\). Think of this as the center of your “net.” Compute the approximation to the standard error \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Compute a quantity known as the margin of error (more later): \\[\\text{MoE}_{\\widehat{p}} = 1.96 \\cdot \\text{SE}_{\\widehat{p}} = 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Compute both endpoints of the confidence interval. The lower end-point. Think of this as the left end-point of the net: \\[\\widehat{p} - \\text{MoE}_{\\widehat{p}} = \\widehat{p} - 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} - 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] The upper endpoint. 
Think of this as the right end-point of the net: \\[\\widehat{p} + \\text{MoE}_{\\widehat{p}} = \\widehat{p} + 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} + 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Alternatively, you can succinctly summarize a 95% confidence interval for \\(p\\) using the \\(\\pm\\) symbol: \\[\\widehat{p} \\pm \\text{MoE}_{\\widehat{p}} = \\widehat{p} \\pm 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} \\pm 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] So going back to Yohan and Ilyas’ sample of \\(n=50\\) balls that had 21 red balls, the 95% confidence interval for \\(p\\) is 0.42 \\(\\pm\\) 1.96 \\(\\cdot\\) 0.0698 = 0.42 \\(\\pm\\) 0.137 = (0.42 - 0.137, 0.42 + 0.137) = (0.283, 0.557). In other words, Yohan and Ilyas are 95% “confident” that the true proportion red of the bowl’s balls is between 28.3% and 55.7%. Given that the true population proportion \\(p\\) was 0.375, in this case they successfully captured the fish. In Step 4, we defined a statistical quantity known as the margin of error. You can think of this quantity as how much the net extends to the left and to the right of the center of our net. The 1.96 multiplier is rooted in the 95% rule of thumb we introduced earlier and the fact that we want the confidence level to be 95%. The value of the margin of error entirely determines the width of the confidence interval. Recall from Section 8.5.3 that confidence interval widths are determined by an interplay of the confidence level, the sample size \\(n\\), and the standard error. Let’s revisit the poll of President Obama’s approval rating among young Americans aged 18-29 we introduced in Section 7.4. Pollsters found that based on a representative sample of \\(n\\) = 2089 young Americans, \\(\\widehat{p}\\) = 0.41 = 41% supported President Obama. 
If you look towards the end of the article, it also states: “The poll’s margin of error was plus or minus 2.1 percentage points.” This is precisely the \\(\\text{MoE}\\): \\[ \\begin{aligned} \\text{MoE} &amp;= 1.96 \\cdot \\text{SE} = 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}} = 1.96 \\cdot \\sqrt{\\frac{0.41(1-0.41)}{2089}} \\\\ &amp;= 1.96 \\cdot 0.0108 = 0.021 = 2.1\\% \\end{aligned} \\] Their poll results are based on a confidence level of 95% and the resulting 95% confidence interval for the proportion of all young Americans who support Obama is: \\(\\widehat{p} \\pm \\text{MoE}\\) = 0.41 \\(\\pm\\) 0.021 = (0.389, 0.431) = (38.9%, 43.1%). Confidence intervals based on 33 tactile samples Let’s revisit our 33 friends’ samples from the bowl from Section 7.1.3. We’ll use their 33 samples to construct 33 theory-based 95% confidence intervals for \\(p\\). Recall this data was saved in the tactile_prop_red data frame included in the moderndive package. We’ll: rename() the variable prop_red to p_hat, the statistical name of the sample proportion \\(\\widehat{p}\\). mutate() a new variable n making explicit the sample size of 50. mutate() other new variables computing: The standard error SE for \\(\\widehat{p}\\) using the previous formula. 
The margin of error MoE by multiplying the SE by 1.96 The left endpoint of the confidence interval lower_ci The right endpoint of the confidence interval upper_ci conf_ints &lt;- tactile_prop_red %&gt;% rename(p_hat = prop_red) %&gt;% mutate( n = 50, SE = sqrt(p_hat * (1 - p_hat) / n), MoE = 1.96 * SE, lower_ci = p_hat - MoE, upper_ci = p_hat + MoE ) conf_ints # A tibble: 33 x 9 group replicate red_balls p_hat n SE MoE lower_ci upper_ci &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Ilyas, … 1 21 0.42 50 0.0697997 0.136807 0.283193 0.556807 2 Morgan,… 2 17 0.34 50 0.0669925 0.131305 0.208695 0.471305 3 Martin,… 3 21 0.42 50 0.0697997 0.136807 0.283193 0.556807 4 Clark, … 4 21 0.42 50 0.0697997 0.136807 0.283193 0.556807 5 Riddhi,… 5 18 0.36 50 0.0678823 0.133049 0.226951 0.493049 6 Andrew,… 6 19 0.38 50 0.0686440 0.134542 0.245458 0.514542 7 Julia 7 19 0.38 50 0.0686440 0.134542 0.245458 0.514542 8 Rachel,… 8 11 0.22 50 0.0585833 0.114823 0.105177 0.334823 9 Daniel,… 9 15 0.3 50 0.0648074 0.127023 0.172977 0.427023 10 Josh, M… 10 17 0.34 50 0.0669925 0.131305 0.208695 0.471305 # … with 23 more rows In Figure 8.36, let’s plot the 33 confidence intervals for \\(p\\) saved in conf_ints along with a vertical line at \\(p\\) = 0.375 indicating the true proportion of the bowl’s balls that are red. Furthermore, let’s mark the sample proportions \\(\\widehat{p}\\) that are the centers of the confidence intervals with dots. FIGURE 8.36: 33 95 percent confidence intervals based on 33 tactile samples of size n = 50. Observe that 31 of the 33 confidence intervals “captured” the true value of \\(p\\), for a success rate of 31 / 33 = 93.94%. While this is not quite 95%, recall that we expect about 95% of such confidence intervals to capture \\(p\\). The actual observed success rate will vary slightly. 
Theory-based methods like this have largely been used in the past because we didn’t have the computing power to perform simulation-based methods such as bootstrapping. They are still commonly used, however, and if the sampling distribution is normally distributed, we have access to an alternative method for constructing confidence intervals as well as performing hypothesis tests as we will see in Chapter 9. The kind of computer-based statistical inference we’ve seen so far has a particular name in the field of statistics: simulation-based inference. This is because we are performing statistical inference using computer simulations. In our opinion, two large benefits of simulation-based methods over theory-based methods are that 1) they are easier for people new to statistical inference to understand and 2) they also work in situations where theory-based methods and mathematical formulas don’t exist. 8.7.3 Additional resources An R script file of all R code used in this chapter is available here. If you want more examples of the infer workflow to construct confidence intervals, we suggest you check out the infer package homepage, in particular, a series of example analyses available at https://infer.netlify.com/articles/. 8.7.4 What’s to come? Now that we’ve equipped ourselves with confidence intervals, in Chapter 9 we’ll cover the other common tool for statistical inference: hypothesis testing. "],
-["9-hypothesis-testing.html", "Chapter 9 Hypothesis Testing 9.1 Promotions activity 9.2 Understanding hypothesis tests 9.3 Conducting hypothesis tests 9.4 Interpreting hypothesis tests 9.5 Case study: Are action or romance movies rated higher? 9.6 Conclusion", " Chapter 9 Hypothesis Testing Now that we’ve studied confidence intervals in Chapter 8, let’s study another commonly used method for statistical inference: hypothesis testing. Hypothesis tests allow us to take a sample of data from a population and infer about the plausibility of competing hypotheses. For example, in the upcoming “promotions” activity in Section 9.1, you’ll study the data collected from a psychology study in the 1970’s to investigate whether there exists gender-based discrimination in promotion rates in the banking industry. The good news is we’ve already covered many of the necessary concepts to understand hypothesis testing in Chapters 7 and 8. We will expand further on these ideas here and also provide a general framework for understanding hypothesis tests. By understanding this general framework, you’ll be able to adapt it to many different scenarios. The same can be said for confidence intervals. There was one general framework that applies to all confidence intervals and the infer package was designed around this framework. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same. We believe that this approach is much better for long-term learning than focusing on specific details for specific confidence intervals and, as you’ll now see, for hypothesis tests as well. If you’d like more practice or you’re curious to see how this framework applies to different scenarios, you can find fully worked-out examples for many common hypothesis tests and their corresponding confidence intervals in Appendix B. 
We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional theory-based methods like the \\(t\\)-test and normal-theory confidence intervals. You’ll see there that these traditional methods are just approximations for the computer-based methods we’ve been focusing on. However, they also require conditions to be met for their results to be valid. Computer-based methods using randomization, simulation, and bootstrapping have far fewer restrictions. Furthermore, they help develop your computational thinking, which is one big reason they are emphasized throughout this book. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(infer) library(moderndive) library(nycflights13) library(ggplot2movies) 9.1 Promotions activity Let’s start with an activity studying the effect of gender on promotions at a bank. 9.1.1 Does gender affect promotions at a bank? Say you are working at a bank in the 1970’s and you are submitting your resume to apply for a promotion. Will your gender affect your chances of getting promoted? To answer this question, we’ll focus on data from a study published in the “Journal of Applied Psychology” in 1974. This data is also used in the OpenIntro series of statistics textbooks. To begin the study, 48 bank supervisors were asked to assume the role of a hypothetical director of a bank with multiple branches. 
Every one of the bank supervisors was given a resume and asked whether or not the candidate on the resume was fit to be promoted to a new position in one of their branches. However, each of these 48 resumes was identical in all respects except one: the name of the applicant at the top of the resume. 24 of the supervisors were randomly given resumes with stereotypically “male” names while 24 of the supervisors were randomly given resumes with stereotypically “female” names. Since only (binary) gender varied from resume to resume, researchers could isolate the effect of this variable in promotion rates. While many people today (including us, the authors) disagree with such binary views of gender, it is important to remember that this study was conducted at a time when more nuanced views of gender were not as prevalent. Despite this imperfection, we decided to still use this example as we feel it presents ideas still relevant today about how we could study discrimination in the workplace. The moderndive package contains the data on the 48 applicants in the promotions data frame. Let’s explore this data first: promotions # A tibble: 48 x 3 id decision gender &lt;int&gt; &lt;fct&gt; &lt;fct&gt; 1 1 promoted male 2 2 promoted male 3 3 promoted male 4 4 promoted male 5 5 promoted male 6 6 promoted male 7 7 promoted male 8 8 promoted male 9 9 promoted male 10 10 promoted male # … with 38 more rows The variable id acts as an identification variable for all 48 rows, the decision variable indicates whether the applicant was selected for promotion or not, while the gender variable indicates the gender of the name used on the resume. Recall that this data does not pertain to 24 actual men and 24 actual women, but rather 48 identical resumes of which 24 were assigned stereotypically “male” names and 24 were assigned stereotypically “female” names. Let’s perform an exploratory data analysis of the relationship between the two categorical variables decision and gender. 
Recall that we saw in Section 2.8.3 that one way we can visualize such a relationship is using a stacked barplot. ggplot(promotions, aes(x = gender, fill = decision)) + geom_bar() + labs(x = &quot;Gender of name on resume&quot;) FIGURE 9.1: Barplot of relationship between gender and promotion decision. Observe in Figure 9.1 that it appears that resumes with female names were much less likely to be accepted for promotion. Let’s quantify these promotion rates by computing the proportion of resumes accepted for promotion for each group using the dplyr package for data wrangling: promotions %&gt;% group_by(gender, decision) %&gt;% summarize(n = n()) # A tibble: 4 x 3 # Groups: gender [2] gender decision n &lt;fct&gt; &lt;fct&gt; &lt;int&gt; 1 male not 3 2 male promoted 21 3 female not 10 4 female promoted 14 So of the 24 resumes with male names, 21 were selected for promotion, for a proportion of 21/24 = 0.875 = 87.5%. On the other hand, of the 24 resumes with female names, 14 were selected for promotion, for a proportion of 14/24 = 0.583 = 58.3%. Comparing these two rates of promotion, it appears that resumes with male names were selected for promotion at a rate 0.875 - 0.583 = 0.292 = 29.2% higher than resumes with female names. This is suggestive of an advantage for resumes with a male name on it. The question is however, does this provide conclusive evidence that there is gender discrimination in promotions at banks? Could a difference in promotion rates of 29.2% still occur by chance, even in a hypothetical world where no gender-based discrimination existed? In other words, what is the role of sampling variation? To answer this question, we’ll again rely on a computer to run simulations. 9.1.2 Shuffling once First, try to imagine a hypothetical universe where no gender discrimination in promotions existed. In such a hypothetical universe, the gender of an applicant would have no bearing on their chances of promotion. 
Bringing things back to our promotions data frame, the gender variable would thus be an irrelevant label. If these gender labels were irrelevant, then we could randomly reassign them by “shuffling” them to no consequence! To illustrate this idea, let’s narrow our focus to six arbitrarily chosen resumes of the 48 in Table 9.1. The decision column shows that three resumes resulted in promotion while three didn’t. The gender column shows what the original gender of the resume name was. However, in our hypothesized universe of no gender discrimination, gender is irrelevant and thus it is of no consequence to randomly “shuffle” the values of gender. The shuffled_gender column shows one such possible random shuffling. Observe how the number of male and female names remains the same at three each, but they are now listed in a different order. TABLE 9.1: One example of shuffling gender variable. resume number decision gender shuffled gender 1 not male male 2 not female male 3 not female female 4 promoted male female 5 promoted male female 6 promoted female male Again, such random shuffling of the gender label only makes sense in our hypothesized universe of no gender discrimination. How could we extend this shuffling of the gender variable to all 48 resumes by hand? One way would be by using a standard deck of 52 playing cards, which we display in Figure 9.2. FIGURE 9.2: Standard deck of 52 playing cards. Since half the cards are red and the other half are black, by removing 2 red cards and 2 black cards, we would end up with 24 red cards and 24 black cards. After shuffling these 48 cards as seen in Figure 9.3, we can flip the cards over one-by-one, assigning “male” for each red card and “female” for each black card. FIGURE 9.3: Shuffling a deck of cards. We’ve saved one such shuffling in the promotions_shuffled data frame of the moderndive package. 
If you view both the original promotions and the shuffled promotions_shuffled data frames and compare them, you’ll see that while the decision variables are identical, the gender variables are different. promotions_shuffled # A tibble: 48 x 3 id decision gender &lt;int&gt; &lt;fct&gt; &lt;fct&gt; 1 1 promoted female 2 2 promoted female 3 3 promoted male 4 4 promoted female 5 5 promoted male 6 6 promoted male 7 7 promoted male 8 8 promoted female 9 9 promoted male 10 10 promoted female # … with 38 more rows Let’s repeat the same exploratory data analysis we did for the original promotions data on our shuffled promotions_shuffled data frame. Let’s create a barplot visualizing the relationship between decision and the new shuffled gender variable and compare this to the original unshuffled version in Figure 9.4. ggplot(promotions_shuffled, aes(x = gender, fill = decision)) + geom_bar() + labs(x = &quot;Gender of resume name&quot;) FIGURE 9.4: Barplots of relationship of promotion with gender (left) and shuffled gender (right). It appears the difference in “male names” versus “female names” promotion rates is now different. Compared to the original data in the left barplot, the new “shuffled” data in the right barplot has promotion rates that are much more similar. Let’s also compute the proportion of resumes accepted for promotion for each group: promotions_shuffled %&gt;% group_by(gender, decision) %&gt;% summarize(n = n()) # A tibble: 4 x 3 # Groups: gender [2] gender decision n &lt;fct&gt; &lt;fct&gt; &lt;int&gt; 1 male not 6 2 male promoted 18 3 female not 7 4 female promoted 17 So in this hypothetical universe of no discrimination, 18/24 = 0.75 = 75% of “male” resumes were selected for promotion. On the other hand, 17/24 = 0.708 = 70.8% of “female” resumes were selected for promotion. 
Comparing these two values, it appears that resumes with male names were selected for promotion at a rate that was 0.75 - 0.708 = 0.042 = 4.2% different than resumes with female names. Observe how this difference in rates is different than the difference in rates of 0.292 = 29.2% we originally observed. This is once again due to sampling variation. How can we better understand the effect of this sampling variation? By repeating this shuffling several times! 9.1.3 Shuffling 16 times We recruited 16 groups of our friends to repeat this shuffling exercise. They recorded these values in a shared spreadsheet; we display a snapshot of the first 10 rows and 5 columns in Figure 9.5. FIGURE 9.5: Snapshot of shared spreadsheet of shuffling results. For each of these 16 columns of “shuffles”, we computed the difference in promotion rates, and in Figure 9.6 we display their distribution in a histogram. We also mark the observed difference in promotion rate that happened in real-life of 0.292 = 29.2% with a red line. FIGURE 9.6: Distribution of shuffled differences in promotions. Before we discuss the distribution of the histogram, we emphasize the key thing to remember: this histogram represents differences in promotion rates that one would observe in our hypothesized universe of no gender discrimination. Observe first that the histogram is roughly centered at 0. Saying that the difference in promotion rates is 0 is equivalent to saying that both genders had the same promotion rate. In other words, the center of these 16 values is consistent with what we would expect in our hypothesized universe of no gender discrimination. However, while the values are centered at 0, there is variation about 0. This is because even in a hypothesized universe of no gender discrimination, you will still likely observe small differences in promotion rates because of chance sampling variation. Looking at the histogram in Figure 9.6, such differences could even be as extreme as -0.292 or 0.208. 
Turning our attention to what we observed in real-life: the difference of 0.292 = 29.2% is marked with a red line. Ask yourself: in a hypothesized world of no gender discrimination, how likely would it be that we observe this difference? While opinions may differ, in our opinion not often! Now ask yourself: what do these results say about our hypothesized universe of no gender discrimination? 9.1.4 What did we just do? What we just demonstrated in this activity is the statistical procedure known as hypothesis testing using a permutation test. The term “permutation” is the mathematical term for “shuffling”: take a series of values and reorder them randomly, as you did with the playing cards. In fact, permutations are another form of resampling, like the bootstrap method you performed in Chapter 8. While the bootstrap method involves resampling with replacement, permutation methods involve resampling without replacement. Think of our exercise involving the slips of paper representing pennies and the hat in Section 8.1: after sampling a penny, you put it back in the hat. Now think of our deck of cards. After drawing a card, you laid it out in front of you, recorded the color, and then you did not put it back in the deck. In our previous example, we tested the validity of the hypothesized universe of no gender discrimination. The evidence contained in our observed sample of 48 resumes was somewhat inconsistent with our hypothesized universe. Thus, we would be inclined to reject this hypothesized universe and declare that the evidence suggests there is gender discrimination. Recall our case study on whether yawning is contagious from Section 8.6. The previous example involves inference about an unknown difference of population proportions as well. This time it will be \\(p_{m} - p_{f}\\), where \\(p_{m}\\) is the population proportion of resumes with male names being recommended for promotion and \\(p_{f}\\) is the equivalent for resumes with female names. 
Recall that this is one of the scenarios for inference we’ve seen so far in Table 9.2. TABLE 9.2: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) So based on our sample of \\(n_m\\) = 24 “male” applicants and \\(n_f\\) = 24 “female” applicants, the point estimate for \\(p_{m} - p_{f}\\) is the difference in sample proportions \\(\\widehat{p}_{m} -\\widehat{p}_{f}\\) = 0.875 - 0.583 = 0.292 = 29.2%. This difference in favor of “male” resumes of 0.292 is greater than 0, suggesting discrimination in favor of men. However, the question we asked ourselves was “is this difference meaningfully different than 0?” In other words, is that difference indicative of true discrimination, or can we just attribute it to sampling variation? Hypothesis testing allows us to make such distinctions. 9.2 Understanding hypothesis tests Much like the terminology, notation, and definitions relating to sampling you saw in Section 7.3, there is a lot of terminology, notation, and definitions related to hypothesis testing. Learning these may seem like a very daunting task at first. However, with practice, practice, and practice, anyone can master them. First, a hypothesis is a statement about the value of an unknown population parameter. In our resume activity, our population parameter of interest is the difference in population proportions \\(p_{m} - p_{f}\\). Hypothesis tests can involve any of the population parameters in Table 7.5 of the 6 inference scenarios we’ll cover in this book and more. 
Second, a hypothesis test consists of a test between two competing hypotheses: 1) a null hypothesis \\(H_0\\) (pronounced “H-naught”) versus 2) an alternative hypothesis \\(H_A\\) (also denoted \\(H_1\\)). Generally the null hypothesis is a claim that there is “no effect” or “no difference of interest.” In many cases, the null hypothesis represents the status quo or a situation that nothing interesting is happening. Furthermore, generally the alternative hypothesis is the claim the experimenter or researcher wants to establish or find evidence to support. It is viewed as a “challenger” hypothesis to the null hypothesis \\(H_0\\). In our resume activity, an appropriate hypothesis test would be: \\[ \\begin{aligned} H_0 &amp;: \\text{men and women are promoted at the same rate}\\\\ \\text{vs } H_A &amp;: \\text{men are promoted at a higher rate than women} \\end{aligned} \\] Note some of the choices we have made. First, we set the null hypothesis \\(H_0\\) to be that there is no difference in promotion rate and the “challenger” alternative hypothesis \\(H_A\\) to be that there is a difference. While it would not be wrong in principle to reverse the two, it is a convention in statistical inference that the null hypothesis is set to reflect a “null” situation where “nothing is going on.” As we discussed earlier, in this case, that there is no difference in promotion rates. Furthermore we set \\(H_A\\) to be that men are promoted at a higher rate, a subjective choice reflecting a prior suspicion we have that this is the case. We call such alternative hypotheses one-sided alternatives. If someone else however does not share such suspicions and only wants to investigate that there is a difference, whether higher or lower, they would set what is known as a two-sided alternative. 
We can re-express the formulation of our hypothesis test using the mathematical notation for our population parameter of interest, the difference in population proportions \\(p_{m} - p_{f}\\): \\[ \\begin{aligned} H_0 &amp;: p_{m} - p_{f} = 0\\\\ \\text{vs } H_A&amp;: p_{m} - p_{f} &gt; 0 \\end{aligned} \\] Observe how the alternative hypothesis \\(H_A\\) is one-sided \\(p_{m} - p_{f} &gt; 0\\). Had we opted for a two-sided alternative, we would have set \\(p_{m} - p_{f} \\neq 0\\). To keep things simple for now, we’ll stick with the simpler one-sided alternative. We’ll present an example of a two-sided alternative in Section 9.5. Third, a test statistic is a point estimate/sample statistic formula used for hypothesis testing. Note that a sample statistic is merely a summary statistic based on a sample of observations. Recall we saw in Section 3.3 that a summary statistic takes in many values and returns only one. Here, the sample would be the \\(n_m\\) = 24 resumes with male names and the \\(n_f\\) = 24 resumes with female names. Hence, the point estimate of interest is the difference in sample proportions \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\). Fourth, the observed test statistic is the value of the test statistic that we observed in real-life. In our case, we computed this value using the data saved in the promotions data frame. It was the observed difference of \\(\\widehat{p}_{m} -\\widehat{p}_{f}\\) = 0.875 - 0.583 = 0.292 = 29.2% in favor of resumes with male names. Fifth, the null distribution is the sampling distribution of the test statistic assuming the null hypothesis \\(H_0\\) is true. Ooof! That’s a long one! Let’s unpack it slowly. The key to understanding the null distribution is that the null hypothesis \\(H_0\\) is assumed to be true. We’re not saying that \\(H_0\\) is true at this point, we’re only assuming it to be true for hypothesis testing purposes. 
In our case, this corresponds to our hypothesized universe of no gender discrimination in promotion rates. Assuming the null hypothesis \\(H_0\\), also stated as “Under \\(H_0\\),” how does the test statistic vary due to sampling variation? In our case, how will the difference in sample proportions \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) vary due to sampling? Recall from Section 7.3.2 that distributions that display how point estimates vary due to sampling variation are called sampling distributions. The only additional thing to keep in mind about null distributions is that they are sampling distributions assuming the null hypothesis \\(H_0\\) is true. In our case, we previously visualized a null distribution in Figure 9.6, which we re-display in Figure 9.7 using our new notation and terminology. It is the distribution of the 16 differences in sample proportions our friends computed assuming a hypothetical universe of no gender discrimination. We also mark the value of the observed test statistic of 0.292 with a vertical line. FIGURE 9.7: Null distribution and observed test statistic. Sixth, the p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. Double ooof! Let’s unpack this slowly as well. You can think of the p-value as a quantification of “surprise”: assuming \\(H_0\\) is true, how surprised are we with what we observed? Or in our case, in our hypothesized universe of no gender discrimination, how surprised are we that we observed a difference in promotion rates of 0.292? Very surprised? Somewhat surprised? The p-value quantifies this probability, or in the case of our 16 differences in sample proportions in Figure 9.7, what proportion had a more “extreme” result? Here, extreme is defined in terms of the alternative hypothesis \\(H_A\\) that “male” applicants are promoted at a higher rate than “female” applicants. 
In other words, how often was the discrimination in favor of men at least as pronounced as 0.875 - 0.583 = 0.292 = 29.2%? In this case, 0 times out of 16 did we obtain a difference in proportion greater than or equal to the observed difference of 0.292 = 29.2%. A very rare outcome! Given the rarity of such a pronounced difference in promotion rates in our hypothesized universe of no gender discrimination, we’re inclined to reject our hypothesized universe in favor of one stating there is discrimination in favor of the “male” applicants. In other words, we reject \\(H_0\\) in favor of \\(H_A\\). Seventh and lastly, in many hypothesis testing procedures, it is commonly recommended to set the significance level of the test beforehand. It is denoted by the Greek letter \\(\\alpha\\) (pronounced “alpha”). This value acts as a cutoff on the p-value, where if the p-value falls below \\(\\alpha\\), we would “reject the null hypothesis \\(H_0\\).” Alternatively, if the p-value does not fall below \\(\\alpha\\), we would “fail to reject \\(H_0\\).” Note the latter statement is not quite the same as saying we “accept \\(H_0\\).” This distinction is rather subtle and not immediately obvious. So we’ll revisit it later in Section 9.4. While different fields tend to use different values of \\(\\alpha\\), some commonly used values for \\(\\alpha\\) are 0.1, 0.01, and 0.05, with 0.05 being the choice people often make without putting much thought into it. We’ll talk more about \\(\\alpha\\) significance levels in Section 9.4, but first let’s fully conduct the hypothesis test corresponding to our promotions activity using the infer package. 9.3 Conducting hypothesis tests In Section 8.4, we showed you how to construct confidence intervals. We first illustrated how to do this using raw dplyr data wrangling verbs and the rep_sample_n() function from Section 7.2.3 which we used as a virtual shovel. 
In particular, we constructed confidence intervals by resampling with replacement by setting the replace = TRUE argument to the rep_sample_n() function. We then showed you how to perform the same task using the infer package workflow. While both workflows resulted in the same bootstrap distribution from which we can construct confidence intervals, the infer package workflow emphasizes each of the steps in the overall process in Figure 9.8. It does so using function names that are intuitively named with verbs: specify() the variables of interest in your data frame. generate() replicates of bootstrap resamples with replacement. calculate() the summary statistic of interest. visualize() the resulting bootstrap distribution and confidence interval. FIGURE 9.8: Confidence intervals with the infer package. In this section, we’ll now show you how to seamlessly modify the previously seen infer code for constructing confidence intervals to conduct hypothesis tests. You’ll notice that the basic outline of the workflow is almost identical, except for an additional hypothesize() step between the specify() and generate() steps, as can be seen in Figure 9.9. FIGURE 9.9: Hypothesis testing with the infer package. Furthermore, we’ll use a pre-specified significance level \\(\\alpha\\) = 0.05 for this hypothesis test. Let’s leave discussion on the choice of this \\(\\alpha\\) value until later on in Section 9.4. 9.3.1 infer package workflow 1. specify variables Recall that we use the specify() verb to specify the response variable and, if needed, any explanatory variables for our study. In this case, since we are interested in any potential effects of gender on promotion decisions, we set decision as the response variable and gender as the explanatory variable. We do so using a formula = response ~ explanatory argument where response is the name of the response variable in the data frame and explanatory is the name of the explanatory variable. 
So in our case it is decision ~ gender. Furthermore, since we are interested in the proportion of resumes &quot;promoted&quot;, and not the proportion of resumes not promoted, we set the argument success = &quot;promoted&quot;. promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) Response: decision (factor) Explanatory: gender (factor) # A tibble: 48 x 2 decision gender &lt;fct&gt; &lt;fct&gt; 1 promoted male 2 promoted male 3 promoted male 4 promoted male 5 promoted male 6 promoted male 7 promoted male 8 promoted male 9 promoted male 10 promoted male # … with 38 more rows Again, notice how the promotions data itself doesn’t change, but the Response: decision (factor) and Explanatory: gender (factor) meta-data do. This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Section 3.4. 2. hypothesize the null In order to conduct hypothesis tests using the infer workflow, we need a new step not present for confidence intervals: hypothesize(). Recall from Section 9.2 that our hypothesis test was \\[ \\begin{aligned} H_0 &amp;: p_{m} - p_{f} = 0\\\\ \\text{vs } H_A&amp;: p_{m} - p_{f} &gt; 0 \\end{aligned} \\] In other words, the null hypothesis \\(H_0\\) corresponding to our “hypothesized universe” stated that there was no difference in gender-based discrimination rates. We set this null hypothesis \\(H_0\\) in our infer workflow using the null argument of the hypothesize() function to either: &quot;point&quot; for hypotheses involving a single sample or &quot;independence&quot; for hypotheses involving two samples In our case, since we have two samples (the resumes with “male” and “female” names), we set null = &quot;independence&quot;. 
promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) # A tibble: 48 x 2 decision gender &lt;fct&gt; &lt;fct&gt; 1 promoted male 2 promoted male 3 promoted male 4 promoted male 5 promoted male 6 promoted male 7 promoted male 8 promoted male 9 promoted male 10 promoted male # … with 38 more rows Again, the data has not changed yet. This will occur at the upcoming generate() step; we’re merely setting meta-data for now. Where do the terms &quot;point&quot; and &quot;independence&quot; come from? These are two technical statistics terms. The term “point” relates to the fact that for a single group of observations, you will test the value of a single point. Going back to the pennies example from Chapter 8, say we wanted to test if the mean year of all US pennies was equal to 1993 or not. We would be testing the value of a “point” \\(\\mu\\), the mean year of all US pennies, as follows \\[ \\begin{aligned} H_0 &amp;: \\mu = 1993\\\\ \\text{vs } H_A&amp;: \\mu \\neq 1993 \\end{aligned} \\] The term “independence” relates to the fact that for two groups of observations, you are testing whether or not the response variable is independent of the explanatory variable that assigns the groups. In our case, we are testing whether the decision response variable is “independent” of the explanatory variable gender that assigns each resume to either of the two groups. 3. generate replicates After we hypothesize() the null hypothesis, we generate() replicates of “shuffled” datasets assuming the null hypothesis is true. We do this by repeating the shuffling exercise you performed in Section 9.1 several times. Instead of merely doing it 16 times as our groups of friends did, let’s use the computer to repeat this 1000 times by setting reps = 1000 in the generate() function. 
However, unlike for confidence intervals where we generated replicates using type = &quot;bootstrap&quot; resampling with replacement, we’ll now perform shuffles/permutations by setting type = &quot;permute&quot;. Recall that shuffles/permutations are a kind of resampling, but unlike the bootstrap method, they involve resampling without replacement. promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) Response: decision (factor) Explanatory: gender (factor) Null Hypothesis: independence # A tibble: 48,000 x 3 # Groups: replicate [1,000] decision gender replicate &lt;fct&gt; &lt;fct&gt; &lt;int&gt; 1 promoted male 1 2 not male 1 3 promoted male 1 4 promoted female 1 5 promoted female 1 6 promoted female 1 7 promoted female 1 8 promoted female 1 9 promoted female 1 10 not female 1 # … with 47,990 more rows Observe that the resulting data frame has 48,000 rows. This is because we performed shuffles/permutations of the 48 values of gender 1000 times and 48,000 = 1000 \\(\\times\\) 48. The variable replicate indicates which resample each row belongs to. So it has the value 1 48 times, the value 2 48 times, all the way through to the value 1000 48 times. 4. calculate summary statistics Now that we have generated 1000 replicates of “shuffles” assuming the null hypothesis is true, let’s calculate() the appropriate summary statistic for each of our 1000 shuffles. Recall from Section 9.2 that point estimates/summary statistics related to hypothesis testing have a specific name: test statistics. Since the unknown population parameter of interest is the difference in population proportions \\(p_{m} - p_{f}\\), the test statistic of interest here is the difference in sample proportions \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\). 
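To make the shuffling concrete, here is a minimal base R sketch of a single shuffle and of 1000 repeated shuffles, written without infer. It assumes 24 resumes per gender with 21 male-named and 14 female-named resumes promoted, which is what the rates 0.875 and 0.583 and the 48 rows imply; the vectors and the helper diff_in_props() are hypothetical names used only for illustration, and this is not a description of how generate() or calculate() work internally.

```r
# Reconstructed data: 48 resumes, assuming 24 per gender (an assumption),
# with 21 male-named and 14 female-named resumes promoted.
gender <- rep(c("male", "female"), each = 24)
decision <- c(rep(c("promoted", "not"), times = c(21, 3)),
              rep(c("promoted", "not"), times = c(14, 10)))

# Hypothetical helper: difference in sample proportions, male minus female.
diff_in_props <- function(decision, gender) {
  mean(decision[gender == "male"] == "promoted") -
    mean(decision[gender == "female"] == "promoted")
}

# Observed test statistic: 21/24 - 14/24.
obs_diff <- diff_in_props(decision, gender)

set.seed(2019)

# One shuffle: permute the gender labels without replacement.
one_shuffle <- diff_in_props(decision, sample(gender))

# Repeat the shuffle 1000 times to approximate a null distribution.
null_stats <- replicate(1000, diff_in_props(decision, sample(gender)))
```

Because sample(gender) only reorders the 48 labels, every shuffled dataset keeps 24 resumes per gender and 35 promotions in total, which is the sense in which permutation is resampling without replacement.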
For each of our 1000 shuffles, we can calculate this test statistic by setting stat = &quot;diff in props&quot;. Furthermore, since we are interested in \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) we set order = c(&quot;male&quot;, &quot;female&quot;). As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly. Let’s save the result in a data frame called null_distribution: null_distribution &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) null_distribution # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 -0.208333 2 2 0.291667 3 3 0.125 4 4 -0.208333 5 5 -0.125 6 6 0.0416667 7 7 -0.0416667 8 8 0.291667 9 9 0.0416667 10 10 0.125 # … with 990 more rows Observe that we have 1000 values of stat, each representing one instance of \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) in a hypothesized world of no gender discrimination. Observe as well we chose the name of this data frame carefully: null_distribution. Recall once again from Section 9.2 that sampling distributions when the null hypothesis \\(H_0\\) is assumed to be true have a special name: the null distribution. But wait! What happened in real-life? What was the observed difference in promotion rates? In other words, what was the observed test statistic \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\)? Recall from Section 9.1 that we computed this observed difference by hand to be 0.875 - 0.583 = 0.292 = 29.2%. We can also compute this value using the previous infer code but with the hypothesize() and generate() steps removed. 
Let’s save this in obs_diff_prop: obs_diff_prop &lt;- promotions %&gt;% specify(decision ~ gender, success = &quot;promoted&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) obs_diff_prop # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.291667 5. visualize the p-value The final step is to measure how surprised we are by a promotion difference of 29.2% in a hypothesized universe of no gender discrimination. If the observed difference of 0.292 is highly unlikely, then we would be inclined to reject the validity of our hypothesized universe. We start by visualizing the null distribution of our 1000 values of \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) using visualize() in Figure 9.10. Recall that these are values of the difference in promotion rates assuming \\(H_0\\) is true, in other words in our hypothesized universe of no gender discrimination. visualize(null_distribution, binwidth = 0.1) FIGURE 9.10: Null distribution Let’s now add what happened in real-life to Figure 9.10, the observed difference in promotion rates of 0.875 - 0.583 = 0.292 = 29.2%. However, instead of merely adding a vertical line using geom_vline(), let’s use the shade_p_value() function with obs_stat set to the observed test statistic value we saved in obs_diff_prop. Furthermore, we’ll set the direction = &quot;right&quot; reflecting our alternative hypothesis \\(H_A: p_{m} - p_{f} &gt; 0\\). Recall our alternative hypothesis \\(H_A\\) is that \\(p_{m} - p_{f} &gt; 0\\), stating that there is a difference in promotion rates in favor of resumes with male names. “More extreme” here corresponds to differences that are “bigger” or “more positive” or “more to the right.” Hence we set the direction argument of shade_p_value() to be &quot;right&quot;. 
On the other hand, had our alternative hypothesis \\(H_A\\) been the other possible one-sided alternative \\(p_{m} - p_{f} &lt; 0\\), suggesting discrimination in favor of resumes with female names, we would’ve set direction = &quot;left&quot;. Had our alternative hypothesis \\(H_A\\) been two-sided \\(p_{m} - p_{f} \\neq 0\\), suggesting discrimination in either direction, we would’ve set direction = &quot;both&quot;. visualize(null_distribution, bins = 10) + shade_p_value(obs_stat = obs_diff_prop, direction = &quot;right&quot;) FIGURE 9.11: Shaded histogram to show p-value. In the resulting Figure 9.11, the solid red line marks 0.292 = 29.2%. However, what does the shaded-region correspond to? This is the p-value. Recall the definition of the p-value from Section 9.2: A p-value is the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. So judging by the shaded region in Figure 9.11, it seems we would somewhat rarely observe differences in promotion rates of 0.292 = 29.2% or more in a hypothesized universe of no gender discrimination. In other words, the p-value is somewhat small. Hence, we would be inclined to reject this hypothesized universe, or using statistical language we would “reject \\(H_0\\).” What fraction of the null distribution is shaded? In other words, what is the exact value of the p-value? We can compute it using the get_p_value() function with the same arguments as the previous visualize() code: null_distribution %&gt;% get_p_value(obs_stat = obs_diff_prop, direction = &quot;right&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.027 Keeping the definition of a p-value in mind, the probability of observing a difference in promotion rates as large as 0.292 = 29.2% due to sampling variation alone is 0.027 = 2.7%. 
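The shaded fraction can also be sketched by hand in base R. The example below uses only the first ten stat values printed for null_distribution earlier, not all 1000, so its result is an illustration and will not match 0.027. It assumes that “just as or more extreme” is counted with an inclusive comparison; the variable names are hypothetical.

```r
# First ten `stat` values printed for null_distribution (not the full 1000).
null_stats <- c(-0.208333, 0.291667, 0.125, -0.208333, -0.125,
                0.0416667, -0.0416667, 0.291667, 0.0416667, 0.125)

# Observed difference in promotion rates saved in obs_diff_prop.
obs_stat <- 0.291667

# Right-tailed p-value: fraction of null values at least as large as observed.
p_right <- mean(null_stats >= obs_stat)

# Left-tailed analogue, for the alternative p_m - p_f < 0.
p_left <- mean(null_stats <= obs_stat)
```

With these ten values, two of them equal the observed statistic, so p_right is 0.2. The p-value of 0.027 reported by get_p_value() plays the same role but is based on all 1000 shuffles.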
Since this p-value is greater than our pre-specified significance level \\(\\alpha\\) = 0.001, we fail to reject the null hypothesis \\(H_0: p_{m} - p_{f} = 0\\). In other words, this p-value wasn’t sufficiently small to reject our hypothesized universe of no gender discrimination. Observe that whether we reject the null hypothesis \\(H_0\\) or not depends in large part on our choice of significance level \\(\\alpha\\). We’ll discuss this more in Section 9.4.3. 9.3.2 Comparison with confidence intervals One of the great things about the infer package is that we can jump seamlessly between conducting hypothesis tests and constructing confidence intervals with minimal changes! Recall the code from the previous section that creates the null distribution, which in turn is needed to compute the p-value: null_distribution &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) To create the corresponding bootstrap distribution needed to construct a 95% confidence interval for \\(p_{m} - p_{f}\\), we only need to make two changes. First, we remove the hypothesize() step since we are no longer assuming a null hypothesis \\(H_0\\) is true. We can do this by deleting or commenting out the hypothesize() line of code. Second, we switch the type of resampling in the generate() step to be &quot;bootstrap&quot; instead of &quot;permute&quot;. 
bootstrap_distribution &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% # Change 1 - Remove hypothesize(): # hypothesize(null = &quot;independence&quot;) %&gt;% # Change 2 - Switch type from &quot;permute&quot; to &quot;bootstrap&quot;: generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) Using this bootstrap_distribution, let’s first compute the percentile-based confidence intervals, as we did in Section 8.4: percentile_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.0414187 0.522222 Using our shorthand interpretation for 95% confidence intervals from Section 8.5.2, we are 95% “confident” that the true difference in population proportions \\(p_{m} - p_{f}\\) is between (0.041, 0.522). Let’s visualize bootstrap_distribution and this percentile-based 95% confidence interval for \\(p_{m} - p_{f}\\) in Figure 9.12. visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = percentile_ci) FIGURE 9.12: Percentile-based 95 percent confidence interval. Notice a key value that is not included in the 95% confidence interval for \\(p_{m} - p_{f}\\): the value 0. In other words, a difference of 0 is not included in our net, suggesting that \\(p_{m}\\) and \\(p_{f}\\) are truly different! Furthermore, observe how the entirety of the 95% confidence interval for \\(p_{m} - p_{f}\\) lies above 0, suggesting that this difference is in favor of men. Since the bootstrap distribution appears to be roughly normally shaped, we can also use the standard error method as we did in Section 8.4. In this case, we must specify the point_estimate argument as the observed difference in promotion rates 0.292 = 29.2% saved in obs_diff_prop. This value acts as the center of the confidence interval. 
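As a reminder of what the standard error method computes, here is a sketch assuming the usual normal-approximation form with a multiplier of 1.96 for a 95% confidence level, where \\(\\text{SE}\\) stands for the standard deviation of the bootstrap distribution (the exact multiplier used by the software is an assumption here): \\[ \\left(\\widehat{p}_{m} - \\widehat{p}_{f}\\right) \\pm 1.96 \\cdot \\text{SE} \\] The point estimate \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) = 0.292 sits at the center, and the interval extends the same distance on either side.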
se_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;se&quot;, point_estimate = obs_diff_prop) se_ci # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 0.0490607 0.534273 Let’s visualize bootstrap_distribution again, but now with the standard error-based 95% confidence interval for \\(p_{m} - p_{f}\\) in Figure 9.13. Again, notice how the value 0 is not included in our confidence interval, again suggesting that \\(p_{m}\\) and \\(p_{f}\\) are truly different! visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = se_ci) FIGURE 9.13: Standard error-based 95 percent confidence interval. Learning check (LC9.1) Conduct the same analysis comparing male and female promotion rates using the median rating instead of the mean rating. What was different and what was the same? (LC9.2) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between the promotion rate of males and females using this study. (LC9.3) Why are we relatively confident that the distributions of the sample proportions will be good approximations of the population distributions of promotion proportions for the two genders? (LC9.4) Using the definition of “\\(p\\)-value”, write in words what the \\(p\\)-value represents for the hypothesis test comparing the promotion rates for males and females. (LC9.5) What is the value of the \\(p\\)-value for the hypothesis test comparing the mean rating of romance to action movies? How can it be interpreted in the context of the problem? 9.3.3 “There is only one test” Let’s recap the steps necessary to conduct a hypothesis test using the terminology, notation, and definitions related to sampling you saw in Section 9.2 and the infer workflow from Section 9.3.1: specify() the variables of interest in your data frame. hypothesize() the null hypothesis \\(H_0\\). In other words, set a “model for the universe” assuming \\(H_0\\) is true. 
generate() shuffles assuming \\(H_0\\) is true. In other words, simulate data assuming \\(H_0\\) is true. calculate() the test statistic of interest, both for the observed data and your simulated data. visualize() the resulting null distribution and compute the p-value by comparing the null distribution to the observed test statistic. While this is a lot to digest, especially the first time you encounter hypothesis testing, the nice thing is that once you understand this general framework, then you can understand any hypothesis test. In a famous blog post, computer scientist Allen Downey called this the “There is only one test” framework, for which he created the flowchart displayed in Figure 9.14. FIGURE 9.14: Allen Downey’s hypothesis testing framework. Notice its similarity with the “hypothesis testing via infer” diagram you saw in Figure 9.9. That’s because the infer package was explicitly designed to match the “There is only one test” framework. So if you can understand the framework, you can easily generalize these ideas for all hypothesis testing scenarios, whether for population proportions \\(p\\), population means \\(\\mu\\), differences in population proportions \\(p_1 - p_2\\), differences in population means \\(\\mu_1 - \\mu_2\\), or, as you’ll see in Chapter 10 on inference for regression, population regression intercepts \\(\\beta_0\\) and population regression slopes \\(\\beta_1\\). 9.4 Interpreting hypothesis tests Interpreting the results of hypothesis tests is one of the more challenging aspects of this method for statistical inference. In this section, we’ll focus on ways to help with deciphering the process and address some common misconceptions. 9.4.1 Two possible outcomes In Section 9.2, we mentioned that given a pre-specified significance level \\(\\alpha\\) there are two possible outcomes of a hypothesis test: If the p-value is less than \\(\\alpha\\), then we reject the null hypothesis \\(H_0\\) in favor of \\(H_A\\). 
If the p-value is greater than or equal to \\(\\alpha\\), we fail to reject the null hypothesis \\(H_0\\). Unfortunately, the latter result is often misinterpreted as “accepting the null hypothesis \\(H_0\\).” While at first glance it may seem that the statements “failing to reject \\(H_0\\)” and “accepting \\(H_0\\)” are equivalent, there actually is a subtle difference. Saying that we “accept the null hypothesis \\(H_0\\)” is equivalent to stating “we think the null hypothesis \\(H_0\\) is true.” However, saying that we “fail to reject the null hypothesis \\(H_0\\)” is saying something else: “While \\(H_0\\) might still be false, we don’t have enough evidence to say so.” In other words, there is an absence of enough proof. However, the absence of proof is not proof of absence. To further shed light on this distinction, let’s use the United States criminal justice system as an analogy. A criminal trial in the United States is a similar situation to hypothesis tests whereby a choice between two contradictory claims must be made about a defendant who is on trial: The defendant is truly either “innocent” or “guilty.” The defendant is presumed “innocent until proven guilty.” The defendant is found guilty only if there is strong evidence that the defendant is guilty. The phrase “beyond a reasonable doubt” is often used as a guideline for determining a cutoff for when enough evidence exists to find the defendant guilty. The defendant is found to be either “not guilty” or “guilty” in the ultimate verdict. In other words, “not guilty” verdicts are not suggesting the defendant is “innocent”, but instead that “while the defendant may still actually be guilty, there wasn’t enough evidence to prove this fact.” Now let’s make the connection with hypothesis tests: Either the null hypothesis \\(H_0\\) or the alternative hypothesis \\(H_A\\) is true. Hypothesis tests are always conducted assuming the null hypothesis \\(H_0\\) is true. 
We reject the null hypothesis \\(H_0\\) in favor of \\(H_A\\) only if the evidence found in the sample suggests that \\(H_A\\) is true. The significance level \\(\\alpha\\) is used as a guideline to set the threshold on how strong the evidence must be. We ultimately decide to either “fail to reject \\(H_0\\)” or “reject \\(H_0\\).” So while gut instinct may suggest “failing to reject \\(H_0\\)” and “accepting \\(H_0\\)” are equivalent statements, they are not. “Accepting \\(H_0\\)” is equivalent to finding a defendant innocent. However, courts do not find defendants “innocent,” but rather they find them “not guilty.” Putting things differently, defense attorneys do not need to prove that their clients are innocent, rather they only need to prove that clients are “not guilty beyond a reasonable doubt”. So going back to our resumes activity in Section 9.3, recall that our hypothesis test was \\(H_0: p_{m} - p_{f} = 0\\) versus \\(H_A: p_{m} - p_{f} &gt; 0\\) and that we used a pre-specified significance level of \\(\\alpha\\) = 0.001. We found a p-value of 0.027. Since the p-value was greater than \\(\\alpha\\) = 0.001, we failed to reject \\(H_0\\). In other words, we didn’t find sufficient evidence in this particular sample to say that \\(H_0\\) is false at the \\(\\alpha\\) = 0.001 significance level. We also state this conclusion using non-statistical language: we didn’t find enough evidence in this data to suggest that there was gender discrimination. 9.4.2 Types of errors Unfortunately, there is some chance a jury or a judge can make an incorrect decision in a criminal trial by reaching the wrong verdict. For example, finding a truly innocent defendant “guilty”. Or on the other hand, finding a truly guilty defendant “not guilty.” This can often stem from the fact that prosecutors don’t have access to all the relevant evidence, but instead are limited to whatever evidence the police can find. The same holds for hypothesis tests. 
We can make incorrect decisions about a population parameter because we only have a sample of data from the population and thus sampling variation can lead us to incorrect conclusions. There are two possible erroneous conclusions in a criminal trial: either 1) a truly innocent person is found guilty or 2) a truly guilty person is found not guilty. Similarly, there are two possible errors in a hypothesis test: either 1) rejecting \\(H_0\\) when in fact \\(H_0\\) is true, called a Type I error or 2) failing to reject \\(H_0\\) when in fact \\(H_0\\) is false, called a Type II error. Another term used for “Type I error” is “false positive” while another term for “Type II error” is “false negative.” This risk of error is the price researchers pay for basing inference on a sample instead of performing a census on the entire population. But as we’ve seen in our numerous examples and activities so far, censuses are often very expensive and other times impossible, and thus researchers have no choice but to use a sample. Thus in any hypothesis test based on a sample, we have no choice but to tolerate some chance that a Type I error will be made and some chance that a Type II error will occur. To help understand the concepts of Type I and Type II errors, we apply these terms to our criminal justice analogy in Figure 9.15. FIGURE 9.15: Type I and Type II errors in criminal trials. Thus a Type I error corresponds to incorrectly putting a truly innocent person in jail whereas a Type II error corresponds to letting a truly guilty person go free. Let’s show the corresponding table for hypothesis tests in Figure 9.16. FIGURE 9.16: Type I and Type II errors in hypothesis tests. 9.4.3 How do we choose alpha? If we are using a sample to make inferences about a population, we run the risk of making errors. For confidence intervals, a corresponding “error” would be constructing a confidence interval that does not contain the true value of the population parameter. 
For hypothesis tests, this would be making either a Type I or Type II error. Obviously, we want to minimize the probability of either error; we want a small probability of making an incorrect conclusion: The probability of a Type I Error occurring is denoted by \\(\\alpha\\). The value of \\(\\alpha\\) is called the significance level of the hypothesis test, which we defined in Section 9.2. The probability of a Type II Error is denoted by \\(\\beta\\). The value of \\(1-\\beta\\) is known as the power of the hypothesis test. In other words, \\(\\alpha\\) corresponds to the probability of incorrectly rejecting \\(H_0\\) when in fact \\(H_0\\) is true. On the other hand, \\(\\beta\\) corresponds to the probability of incorrectly failing to reject \\(H_0\\) when in fact \\(H_0\\) is false. Ideally, we want \\(\\alpha = 0\\) and \\(\\beta = 0\\), meaning that the chance of making either error is 0. However, this can never be the case in any situation where we are sampling for inference. There will always be the possibility of making either error when we use sample data. Furthermore, these two error probabilities are inversely related. As the probability of a Type I error goes down, the probability of a Type II error goes up. What is typically done in practice is to fix the probability of a Type I error by pre-specifying a significance level \\(\\alpha\\) and then try to minimize \\(\\beta\\). In other words, we will tolerate a certain fraction of incorrect rejections of the null hypothesis \\(H_0\\), and then try to minimize the fraction of incorrect non-rejections of \\(H_0\\). So for example if we used \\(\\alpha\\) = 0.01, we would be using a hypothesis testing procedure that in the long run would incorrectly reject the null hypothesis \\(H_0\\) one percent of the time. This is analogous to setting the confidence level of a confidence interval. So what value should you use for \\(\\alpha\\)? 
Different fields have different conventions, but some commonly used values include 0.10, 0.05, 0.01, and 0.001. However, it is important to keep in mind that if you use a relatively small value of \\(\\alpha\\) then all things being equal, p-values will have a harder time being less than \\(\\alpha\\). Thus we would reject the null hypothesis less often. In other words, we would reject the null hypothesis \\(H_0\\) only if we have very strong evidence to do so. This is known as a “conservative” test. On the other hand, if we used a relatively large value of \\(\\alpha\\) then all things being equal, p-values will have an easier time being less than \\(\\alpha\\). Thus we would reject the null hypothesis more often. In other words, we would reject the null hypothesis \\(H_0\\) even if we only have mild evidence to do so. This is known as a “liberal” test. Learning check (LC9.6) What is wrong about saying “The defendant is innocent.” based on the US system of criminal trials? (LC9.7) What is the purpose of hypothesis testing? (LC9.8) What are some flaws with hypothesis testing? How could we alleviate them? (LC9.9) Consider two \\(\\alpha\\) significance levels of 0.1 and 0.01. Of the two, which would lead to a more liberal hypothesis testing procedure? In other words, one that will, all things being equal, lead to more rejections of the null hypothesis \\(H_0\\)? 9.5 Case study: Are action or romance movies rated higher? Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?” IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb. 9.5.1 IMDb ratings data The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of IMDB.com. 
movies # A tibble: 58,788 x 24 title year length budget rating votes r1 r2 r3 r4 r5 r6 &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 2 $100… 1939 71 NA 6 20 0 14.5 4.5 24.5 14.5 14.5 3 $21 … 1941 7 NA 8.200 5 0 0 0 0 0 24.5 4 $40,… 1996 70 NA 8.200 6 14.5 0 0 0 0 0 5 $50,… 1975 71 NA 3.4 17 24.5 4.5 0 14.5 14.5 4.5 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 7 $win… 2002 93 NA 5.3 200 4.5 0 4.5 4.5 24.5 24.5 8 &#39;15&#39; 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5 14.5 9 &#39;38 1987 97 NA 6.6 18 4.5 4.5 4.5 0 0 0 10 &#39;49-… 1917 61 NA 6 51 4.5 0 4.5 4.5 4.5 44.5 # … with 58,778 more rows, and 12 more variables: r7 &lt;dbl&gt;, r8 &lt;dbl&gt;, r9 &lt;dbl&gt;, # r10 &lt;dbl&gt;, mpaa &lt;chr&gt;, Action &lt;int&gt;, Animation &lt;int&gt;, Comedy &lt;int&gt;, # Drama &lt;int&gt;, Documentary &lt;int&gt;, Romance &lt;int&gt;, Short &lt;int&gt; We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original movies dataset was a little messy, we provide a pre-wrangled version of our data in the movies_sample data frame included in the moderndive package. If you’re curious, you can look at the necessary data wrangling code to do this on GitHub. 
movies_sample # A tibble: 68 x 4 title year rating genre &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; 1 Underworld 1985 3.1 Action 2 Love Affair 1932 6.3 Romance 3 Junglee 1961 6.8 Romance 4 Eversmile, New Jersey 1989 5 Romance 5 Search and Destroy 1979 4 Action 6 Secreto de Romelia, El 1988 4.9 Romance 7 Amants du Pont-Neuf, Les 1991 7.4 Romance 8 Illicit Dreams 1995 3.5 Action 9 Kabhi Kabhie 1976 7.7 Romance 10 Electric Horseman, The 1979 5.8 Romance # … with 58 more rows The variables include the title and year the movie was filmed. Furthermore, we have a numerical variable rating, which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating if the movie was an Action or Romance movie. We are interested in whether Action or Romance movies got a higher rating on average. Let’s perform an exploratory data analysis of this data. Recall from Section 2.7.1 that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section 2.6 would be to use a faceted histogram. However in the interest of brevity, let’s only present the boxplot in Figure 9.17. ggplot(data = movies_sample, aes(x = genre, y = rating)) + geom_boxplot() + labs(y = &quot;IMDb rating&quot;) FIGURE 9.17: Boxplot of IMDb rating vs genre. Eyeballing Figure 9.17, it appears that romance movies have a higher median rating. Do we have reason to believe however, that there is a significant difference between the mean rating for action movies compared to romance movies? It’s hard to say just based on the plot. The boxplot does show that the median sample rating is higher for romance movies. However, there is a large amount of overlap between the boxes. Let’s calculate some summary statistics split by the binary categorical variable genre: the number of movies, the mean rating, and the standard deviation of the ratings. We’ll do this using dplyr data wrangling verbs. 
Notice in particular how we count the number of each type of movie using the n() summary function. movies_sample %&gt;% group_by(genre) %&gt;% summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating)) # A tibble: 2 x 4 genre n mean_rating std_dev &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Action 32 5.275 1.36121 2 Romance 36 6.32222 1.60963 Observe that we have 36 romance movies with an average rating of 6.32 stars and 32 action movies with an average rating of 5.28 stars. The difference in these average ratings is thus 6.32 - 5.28 = 1.05. So there appears to be an edge of 1.05 stars in favor of romance movies. The question is however, are these results indicative of a true difference for all romance and action movies? Or could we attribute this difference to chance sampling variation? 9.5.2 Sampling scenario Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Section 7.3.1. The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset. Since this sample was randomly taken from the population movies, it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on movies_sample can generalize to the entire population. What are the relevant population parameter and point estimates? We introduce the fourth sampling scenario in Table 9.3. TABLE 9.3: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 
1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) So whereas the sampling bowl exercise in Section 7.1 concerned proportions, the pennies exercise in Section 8.1 concerned means, the case study on whether yawning is contagious in Section 8.6 and the promotions activity in Section 9.1 concerned differences in proportions, we are now concerned with differences in means. In other words, the population parameter of interest is the difference in population mean ratings \\(\\mu_a - \\mu_r\\), where \\(\\mu_a\\) is the mean rating of all action movies on IMDb and similarly \\(\\mu_r\\) is the mean rating of all romance movies. Additionally the point estimate/sample statistic of interest is the difference in sample means \\(\\overline{x}_a - \\overline{x}_r\\), where \\(\\overline{x}_a\\) is the mean rating of the \\(n_a\\) = 32 movies in our sample and \\(\\overline{x}_r\\) is the mean rating of the \\(n_r\\) = 36 in our sample. Based on our earlier exploratory data analysis, our estimate \\(\\overline{x}_a - \\overline{x}_r\\) is 5.28 - 6.32 = -1.05. So there appears to be a slight difference of -1.05 in favor of romance movies. The question is however, could this difference of -1.05 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for all romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing. 
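The arithmetic behind this estimate can be checked from the summary table in Section 9.5.1. The sketch below is a minimal base R check with hypothetical variable names; it shows that the reported -1.05 comes from subtracting the unrounded means and rounding afterward, whereas subtracting the already-rounded values 5.28 and 6.32 would give -1.04.

```r
# Means copied from the summary table (mean_rating column).
mean_action <- 5.275
mean_romance <- 6.32222

# Difference in sample means, action minus romance, before rounding.
diff_means <- mean_action - mean_romance

# Rounded to two decimal places, as reported in the text.
round(diff_means, 2)  # -1.05
```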
9.5.3 Conducting the hypothesis test We’ll be testing: \\[ \\begin{aligned} H_0 &amp;: \\mu_a - \\mu_r = 0\\\\ \\text{vs } H_A&amp;: \\mu_a - \\mu_r \\neq 0 \\end{aligned} \\] In other words, the null hypothesis \\(H_0\\) suggests that both romance and action movies have the same mean rating. This is the “hypothesized universe” we’ll assume is true. On the other hand, the alternative hypothesis \\(H_A\\) suggests that there is a difference. Unlike the one-sided alternative we used in the promotions exercise \\(H_A: p_{m} - p_{f} &gt; 0\\), we are now considering a two-sided alternative of \\(H_A: \\mu_a - \\mu_r \\neq 0\\). Furthermore, we’ll pre-specify a relatively high significance level of \\(\\alpha\\) = 0.2. By setting this value high, all things being equal, there is a higher chance that the p-value will be less than \\(\\alpha\\). Thus there is a higher chance that we’ll reject the null hypothesis \\(H_0\\) in favor of the alternative hypothesis \\(H_A\\). In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies, even if we only have mild evidence. 1. specify variables Let’s now perform all the steps of the infer workflow. We first specify() the variables of interest in the movies_sample data frame using the formula rating ~ genre. This tells infer that the numerical variable rating is the outcome variable while the binary categorical variable genre is the explanatory variable. Note that unlike when we were previously interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the success argument. 
movies_sample %&gt;% specify(formula = rating ~ genre) Response: rating (numeric) Explanatory: genre (factor) # A tibble: 68 x 2 rating genre &lt;dbl&gt; &lt;fct&gt; 1 3.1 Action 2 6.3 Romance 3 6.8 Romance 4 5 Romance 5 4 Action 6 4.9 Romance 7 7.4 Romance 8 3.5 Action 9 7.7 Romance 10 5.8 Romance # … with 58 more rows Observe at this point that the data in movies_sample has not changed. The only change so far is the newly defined Response: rating (numeric) and Explanatory: genre (factor) meta-data. 2. hypothesize the null We set the null hypothesis \\(H_0: \\mu_a - \\mu_r = 0\\) by using the hypothesize() function. Since we have two samples, action and romance movies, we set null = &quot;independence&quot; as we described in Section 9.3. movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) # A tibble: 68 x 2 rating genre &lt;dbl&gt; &lt;fct&gt; 1 3.1 Action 2 6.3 Romance 3 6.8 Romance 4 5 Romance 5 4 Action 6 4.9 Romance 7 7.4 Romance 8 3.5 Action 9 7.7 Romance 10 5.8 Romance # … with 58 more rows 3. generate replicates After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section 9.1. We’ll repeat this resampling without replacement of type = &quot;permute&quot; a total of reps = 1000 times . movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) Response: rating (numeric) Explanatory: genre (factor) Null Hypothesis: independence # A tibble: 68,000 x 3 # Groups: replicate [1,000] rating genre replicate &lt;dbl&gt; &lt;fct&gt; &lt;int&gt; 1 4.4 Action 1 2 5.2 Romance 1 3 7.3 Romance 1 4 4.9 Romance 1 5 4.100 Action 1 6 7.4 Romance 1 7 5 Romance 1 8 5.100 Action 1 9 4.4 Romance 1 10 8 Romance 1 # … with 67,990 more rows Observe that the resulting data frame has 68,000 rows. 
This is because we performed resampling of 68 movies without replacement 1000 times and 68,000 = 68 \\(\\times\\) 1000. The variable replicate indicates which resample each row belongs to. So it has the value 1 68 times, the value 2 68 times, all the way through to the value 1000 68 times. 4. calculate summary statistics Now that we have 1000 replicated “shuffles” assuming the null hypothesis \\(H_0\\) that both Action and Romance movies on average have the same ratings on IMDb, let’s calculate() the appropriate summary statistic for these 1000 replicated shuffles. Recall from Section 9.2 that point estimates/summary statistics relating to hypothesis testing have a specific name: test statistics. Since the unknown population parameter of interest is the difference in population means \\(\\mu_{a} - \\mu_{r}\\), the test statistic of interest here is the difference in sample means \\(\\overline{x}_{a} - \\overline{x}_{r}\\). For each of our 1000 shuffles, we can calculate this test statistic by setting stat = &quot;diff in means&quot;. Furthermore, since we are interested in \\(\\overline{x}_{a} - \\overline{x}_{r}\\), we set order = c(&quot;Action&quot;, &quot;Romance&quot;). Let’s save the results in a data frame called null_distribution_movies: null_distribution_movies &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) null_distribution_movies # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 -0.923264 2 2 0.363542 3 3 0.404861 4 4 0.463889 5 5 -0.610417 6 6 -0.279861 7 7 -0.262153 8 8 -0.291667 9 9 -0.114583 10 10 0.398958 # … with 990 more rows Observe that we have 1000 values of stat, each representing one instance of \\(\\overline{x}_{a} - \\overline{x}_{r}\\). 
The 1000 values form the null distribution, which is the technical term for the sampling distribution of the difference in sample means \\(\\overline{x}_{a} - \\overline{x}_{r}\\) assuming \\(H_0\\) is true. But wait! What happened in real-life? What was the observed difference in mean ratings? In other words, what was the observed test statistic \\(\\overline{x}_{a} - \\overline{x}_{r}\\)? Recall from our earlier data wrangling that this observed difference in means was 5.28 - 6.32 = -1.05. We can also achieve this using the code that constructed the null distribution null_distribution_movies but with the hypothesize() and generate() steps removed. Let’s save this in obs_diff_means: obs_diff_means &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) obs_diff_means # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -1.04722 5. visualize the p-value Lastly, in order to compute the p-value, we have to assess how “extreme” the observed difference in means of -1.05 is. We do this by comparing -1.05 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings. Let’s visualize both the null distribution and the p-value in Figure 9.18. However, unlike our example in Section 9.3.1 involving promotions, since we have a two-sided alternative hypothesis \\(H_A: \\mu_a - \\mu_r \\neq 0\\), we have to allow for both possibilities for “more extreme”, so we set direction = &quot;both&quot;. visualize(null_distribution_movies, bins = 10) + shade_p_value(obs_stat = obs_diff_means, direction = &quot;both&quot;) FIGURE 9.18: Null distribution, observed test statistic, and p-value. Let’s go over the elements of this plot. First, the histogram is the null distribution. Second, the solid line is the observed test statistic, or the difference in sample means we observed in real-life of 5.28 - 6.32 = -1.05. 
Third, the two shaded areas of the histogram form the p-value, or the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. What proportion of the null distribution is shaded? In other words, what is the numerical value of the p-value? We use the get_p_value() function to compute this value: null_distribution_movies %&gt;% get_p_value(obs_stat = obs_diff_means, direction = &quot;both&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.016 This p-value of 0.016 is somewhat small. In other words, there is a somewhat small chance that we’d observe a difference of 5.28 - 6.32 = -1.05 in a hypothesized universe where there was truly no difference in ratings. This p-value is in fact much smaller than our pre-specified \\(\\alpha\\) significance level of 0.2. Thus, we are very inclined to reject the null hypothesis \\(H_0: \\mu_a - \\mu_r = 0\\), in favor of the alternative hypothesis \\(H_A: \\mu_a - \\mu_r \\neq 0\\). In non-statistical language, the conclusion is: the evidence in this sample of data suggests that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies in favor of the hypothesis that there is a difference. Learning check (LC9.10) Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same? (LC9.11) What conclusions can you make from viewing the faceted histogram looking at rating versus genre that you couldn’t see when looking at the boxplot? (LC9.12) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies. (LC9.13) Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres? 
(LC9.14) Using the definition of \\(p\\)-value, write in words what the \\(p\\)-value represents for the hypothesis test comparing the mean rating of romance to action movies. (LC9.15) What is the value of the \\(p\\)-value for the hypothesis test comparing the mean rating of romance to action movies? (LC9.16) Do the results of the hypothesis test match up with the original plots we made looking at the population of movies? Why or why not? 9.6 Conclusion 9.6.1 Theory-based hypothesis tests Much as we did in Section 8.7.2 when we showed you a theory-based method for constructing confidence intervals that involved mathematical formulas, we now present an example of a traditional theory-based method to conduct hypothesis tests. This method relies on probability models, probability distributions, and a few assumptions to construct the null distribution. This is in contrast to the approach we’ve been using throughout this book where we relied on computer simulations to construct the null distribution. These traditional theory-based methods have been used for decades mostly because researchers didn’t have access to computers that could run thousands of calculations quickly and efficiently. Now that computing power is much cheaper and more accessible, simulation-based methods are much more feasible. However researchers in many fields continue to use theory-based methods. Hence we make it a point to include an example here. As we’ll show in this section, any theory-based method is ultimately an approximation to the simulation-based method. The theory-based method we’ll focus on is known as the two-sample \\(t\\)-test for testing differences in sample means. However, the test statistic we’ll use won’t be the difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\), but rather the related two-sample \\(t\\)-statistic. The data we’ll use will once again be the movies_sample data of action and romance movies from Section 9.5. 
Two-sample t-statistic A common task in statistics is the process of “standardizing a variable.” By standardizing different variables, we make them more comparable. For example, say you are interested in comparing the distribution of temperature recordings from Portland, Oregon, USA with that of temperature recordings from Montreal, Quebec, Canada. Given that US temperatures are generally recorded in degrees Fahrenheit and Canadian temperatures are generally recorded in degrees Celsius, how can we make them comparable? One approach would be to convert degrees Fahrenheit into Celsius, or vice versa. Another approach would be to convert them both to a common “standardized” scale, like Kelvin. One common method for standardizing a variable from probability theory is to compute the \\(z\\)-score: \\[z = \\frac{x - \\mu}{\\sigma}\\] where \\(x\\) represents one value of a variable, \\(\\mu\\) represents the mean of that variable, and \\(\\sigma\\) represents the standard deviation of that variable. You first subtract the mean \\(\\mu\\) from each value of \\(x\\) and then divide \\(x - \\mu\\) by the standard deviation \\(\\sigma\\). These operations will have the effect of “re-centering” your variable around 0 and “re-scaling” your variable \\(x\\) so that its values have what are known as “standard units.” Thus for every value that your variable can take, it has a corresponding \\(z\\)-score that gives how many standard units away that value is from the mean \\(\\mu\\). When the original variable is normally distributed, its \\(z\\)-scores are normally distributed with mean 0 and standard deviation 1. Such a curve is called a “\\(z\\)-distribution” as well as a “standard normal” curve and it has the common, bell-shaped pattern from Figure 9.19. We discuss this further in Appendix A.2. FIGURE 9.19: Standard normal z curve. Bringing these ideas back to the difference of sample mean ratings \\(\\overline{x}_a - \\overline{x}_r\\) of action versus romance movies, how would we standardize this variable? 
By once again subtracting its mean and dividing by its standard deviation. Recall two facts from Section 7.3.3. First, if the sampling was done in a representative fashion, then the sampling distribution of \\(\\overline{x}_a - \\overline{x}_r\\) will be centered at the true population parameter \\(\\mu_a - \\mu_r\\). Second, the standard deviation of a point estimate like \\(\\overline{x}_a - \\overline{x}_r\\) has a special name: the standard error. Applying these ideas, we present the two-sample \\(t\\)-statistic: \\[t = \\dfrac{ (\\bar{x}_a - \\bar{x}_r) - (\\mu_a - \\mu_r)}{ \\text{SE}_{\\bar{x}_a - \\bar{x}_r} } = \\dfrac{ (\\bar{x}_a - \\bar{x}_r) - (\\mu_a - \\mu_r)}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} }\\] Oofda! There is a lot to try to unpack here! Let’s go slowly. In the numerator \\(\\bar{x}_a-\\bar{x}_r\\) is the difference in sample means while \\(\\mu_a - \\mu_r\\) is the difference in population means. In the denominator \\(s_a\\) and \\(s_r\\) are the sample standard deviations of the action and romance movies in our sample movies_sample. Lastly, \\(n_a\\) and \\(n_r\\) are the sample sizes of the action and romance movies. Putting this together gives us the standard error \\(\\text{SE}_{\\bar{x}_a - \\bar{x}_r}\\). Observe that the formula for \\(\\text{SE}_{\\bar{x}_a - \\bar{x}_r}\\) has the sample sizes \\(n_a\\) and \\(n_r\\) in its denominators. So as the sample sizes increase, the standard error goes down. We’ve seen this concept numerous times now, in particular in our simulations using the three virtual shovels with \\(n\\) = 25, 50, and 100 slots in Figure 7.15 and in Section 8.5.3 where we studied the effect of using larger sample sizes on the widths of confidence intervals. So how can we use the two-sample \\(t\\)-statistic as a test statistic in our hypothesis test? First, assuming the null hypothesis \\(H_0: \\mu_a - \\mu_r = 0\\) is true, the right-hand side of the numerator, \\(\\mu_a - \\mu_r\\), becomes 0. 
Second, similarly to how the Central Limit Theorem from Section 7.5.2 states that sample means follow a normal distribution, it can be mathematically proven that the two-sample \\(t\\)-statistic follows a \\(t\\) distribution with degrees of freedom “roughly equal” to \\(df = n_a + n_r - 2\\). We display three examples of \\(t\\)-distributions in Figure 9.20 along with the standard normal \\(z\\) curve. FIGURE 9.20: Examples of t-distributions and the z curve. Begin by looking at the center of the plot at 0 on the horizontal axis. As you move up from the value of 0, follow along with the labels and note that the bottom curve corresponds to 1 degree of freedom, the curve above it is for 3 degrees of freedom, the curve above that is for 10 degrees of freedom, and lastly the dashed curve is the standard normal \\(z\\) curve. Observe that all four curves have a bell shape, are centered at 0, and that as the degrees of freedom increase, the \\(t\\)-distribution more and more resembles the standard normal \\(z\\) curve. The “degrees of freedom” measures how different the \\(t\\) distribution will be from a normal distribution. \\(t\\)-distributions tend to have more values in the tails of their distributions than the standard normal \\(z\\) curve. This “roughly equal” statement indicates that the equation \\(df = n_a + n_r - 2\\) is a “good enough” approximation to the true degrees of freedom. The true formula is a bit more complicated than this simple expression, but we’ve found the formula to be beyond the reach of those new to statistical inference and it does little to build the intuition of the \\(t\\)-test. The message to retain however is that small sample sizes lead to small degrees of freedom and thus small sample sizes lead to \\(t\\)-distributions that are different than the \\(z\\) curve. On the other hand, large sample sizes lead to large degrees of freedom and thus lead to \\(t\\) distributions that closely align with the standard normal \\(z\\)-curve. 
So, assuming the null hypothesis \\(H_0\\) is true, our formula for the test statistic simplifies a bit: \\[t = \\dfrac{ (\\bar{x}_a - \\bar{x}_r) - 0}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} } = \\dfrac{ \\bar{x}_a - \\bar{x}_r}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} }\\] Let’s compute the values necessary for this two-sample \\(t\\)-statistic. Recall the summary statistics we computed during our exploratory data analysis in Section 9.5.1. movies_sample %&gt;% group_by(genre) %&gt;% summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating)) # A tibble: 2 x 4 genre n mean_rating std_dev &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Action 32 5.275 1.36121 2 Romance 36 6.32222 1.60963 Using these values, the observed two-sample \\(t\\)-test statistic is \\[ \\dfrac{ \\bar{x}_a - \\bar{x}_r}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} } = \\dfrac{5.28 - 6.32}{ \\sqrt{\\dfrac{{1.36}^2}{32} + \\dfrac{{1.61}^2}{36}} } = -2.906 \\] Great! How can we compute the p-value using this theory-based test statistic? We need to compare it to a null distribution, which we construct next. Null distribution Let’s revisit the null distribution for the test statistic \\(\\bar{x}_a - \\bar{x}_r\\) we constructed in Section 9.5. Let’s visualize this in the left-hand plot of Figure 9.21. # Construct null distribution of xbar_a - xbar_r: null_distribution_movies &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) visualize(null_distribution_movies, bins = 10) The infer package also includes some built-in theory-based test statistics. 
So instead of calculating the test statistic of interest as the &quot;diff in means&quot; \\(\\bar{x}_a - \\bar{x}_r\\), we can calculate the two-sample \\(t\\)-statistic defined earlier by setting stat = &quot;t&quot;. Let’s visualize this in the right-hand plot of Figure 9.21. # Construct null distribution of t: null_distribution_movies_t &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% # Notice we switched stat from &quot;diff in means&quot; to &quot;t&quot; calculate(stat = &quot;t&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) visualize(null_distribution_movies_t, bins = 10) FIGURE 9.21: Comparing the null distributions of two test statistics. Observe that while the shapes of the null distributions of both the difference in means \\(\\bar{x}_a - \\bar{x}_r\\) and the two-sample \\(t\\)-statistic are similar, the scales on the x-axis are different. The two-sample \\(t\\)-statistics are spread out over a larger range. However, a traditional theory-based \\(t\\)-test doesn’t look at the simulated histogram in null_distribution_movies_t, but instead it looks at the \\(t\\)-distribution curve with degrees of freedom equal to roughly 65.85. This calculation is based on the complicated formula referenced previously, which we approximated with \\(df = n_a + n_r - 2\\) = 32 + 36 - 2 = 66. Let’s overlay this \\(t\\)-distribution curve over the top of our simulated two-sample \\(t\\)-statistics using the method = &quot;both&quot; argument in visualize(). visualize(null_distribution_movies_t, bins = 10, method = &quot;both&quot;) FIGURE 9.22: Null distribution using t-statistic and t-distribution. Observe that the curve does a good job of approximating the histogram here. 
To calculate the \\(p\\)-value in this case, we need to figure out how much of the total area under the \\(t\\)-distribution curve is equal to or “more extreme” than our observed two-sample \\(t\\)-statistic. Since our alternative hypothesis \\(H_A: \\mu_a - \\mu_r \\neq 0\\) is a two-sided alternative, we need to add up the areas in both tails. We first compute the observed two-sample \\(t\\)-statistic using infer verbs: obs_two_sample_t &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% calculate(stat = &quot;t&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) obs_two_sample_t # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -2.90589 So we are interested in finding the percentage of values that are at or below obs_two_sample_t = -2.906 or at or above -obs_two_sample_t = 2.906. We do this using the shade_p_value() function with the direction argument set to &quot;both&quot;: visualize(null_distribution_movies_t, method = &quot;both&quot;) + shade_p_value(obs_stat = obs_two_sample_t, direction = &quot;both&quot;) FIGURE 9.23: Null distribution using t-statistic and t-distribution with p-value shaded. (We’ll discuss this warning message shortly.) What is the p-value? We apply get_p_value() to our null distribution saved in null_distribution_movies_t: null_distribution_movies_t %&gt;% get_p_value(obs_stat = obs_two_sample_t, direction = &quot;both&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.004 We have a very small p-value, and thus it is very unlikely that these results are due to sampling variation. Thus, we are inclined to reject \\(H_0\\). Let’s come back to that earlier warning message: Check to make sure the conditions have been met for the theoretical method. {infer} currently does not check these for you. To be able to use the \\(t\\)-test and other such theoretical methods, there are always a few conditions to check. The infer package does not automatically check these conditions, hence the warning message we received. 
These conditions are necessary so that the underlying mathematical theory holds. In order for the results of our two-sample \\(t\\)-test to be valid, three conditions must be met: Nearly normal populations or large sample sizes. A general rule of thumb that works in many (but not all) situations is that the sample size \\(n\\) should be greater than 30. Both samples are selected independently of each other. All observations are independent from each other. Let’s see if these conditions hold for our movies_sample data: This is met since \\(n_a\\) = 32 and \\(n_r\\) = 36 are both larger than 30, satisfying our rule of thumb. This is met since we sampled the action and romance movies at random and in an unbiased fashion from the database of all IMDb movies. Unfortunately, we don’t know how IMDb computes the ratings. For example, if the same person rated multiple movies, then those observations would be related and hence not independent. Assuming all three conditions are met, we can be reasonably certain that the theory-based \\(t\\)-test results are valid. If any of the conditions were not met, we couldn’t put as much faith into any conclusions. 9.6.2 When inference is not needed We’ve now walked through several different examples of how to use the infer package to perform statistical inference: constructing confidence intervals and conducting hypothesis tests. For each of these examples, we made it a point to always perform an exploratory data analysis (EDA) first. Specifically by looking at the raw data values, by using data visualization via ggplot2, and by data wrangling via dplyr beforehand. We highly encourage you to always do the same. As a beginner to statistics, EDA helps you develop intuition as to what statistical methods like confidence intervals and hypothesis tests can tell us. Even as a seasoned practitioner of statistics, EDA helps guide your statistical investigations. In particular, is statistical inference even needed? Let’s consider an example. 
Say we’re interested in the following question: Of all flights leaving a New York City airport, are Hawaiian Airlines flights in the air for longer than Alaska Airlines flights? Furthermore, let’s assume that 2013 flights are a representative sample of all such flights. Then we can use the flights data frame in the nycflights13 package we introduced in Section 1.4 to answer our question. Let’s filter this data frame to only include Hawaiian and Alaska Airlines using their carrier codes HA and AS: flights_sample &lt;- flights %&gt;% filter(carrier %in% c(&quot;HA&quot;, &quot;AS&quot;)) There are two possible statistical inference methods we could use to answer such questions. First, we could construct a 95% confidence interval for the difference in population means \\(\\mu_{HA} - \\mu_{AS}\\), where \\(\\mu_{HA}\\) is the mean air time of all Hawaiian Airlines flights and \\(\\mu_{AS}\\) is the mean air time of all Alaska Airlines flights. We could then check if the entirety of the interval is greater than 0, suggesting that \\(\\mu_{HA} - \\mu_{AS} &gt; 0\\), or in other words suggesting that \\(\\mu_{HA} &gt; \\mu_{AS}\\). Second, we could perform a hypothesis test of the null hypothesis \\(H_0: \\mu_{HA} - \\mu_{AS} = 0\\) versus the alternative hypothesis \\(H_A: \\mu_{HA} - \\mu_{AS} &gt; 0\\). However, let’s first construct an exploratory visualization as we suggested earlier. Since air_time is numerical and carrier is categorical, a boxplot can display the relationship between these two variables, which we display in Figure 9.24 ggplot(data = flights_sample, mapping = aes(x = carrier, y = air_time)) + geom_boxplot() + labs(x = &quot;Carrier&quot;, y = &quot;Air Time&quot;) FIGURE 9.24: Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013. This is what we like to call “you don’t need no PhD in statistics” moments. 
You don’t need to be an expert in statistics to know that Alaska Airlines and Hawaiian Airlines have significantly different air times. The two boxes don’t even overlap! Constructing a confidence interval or conducting a hypothesis test would frankly not provide much more insight than Figure 9.24. Let’s investigate why we observe such a clear cut difference between these two airlines using data wrangling. Let’s first group the rows of flights_sample not only by carrier but also by destination dest. Subsequently we’ll compute two summary statistics: the number of observations using n() and the mean air time: flights_sample %&gt;% group_by(carrier, dest) %&gt;% summarize(n = n(), mean_time = mean(air_time, na.rm = TRUE)) # A tibble: 2 x 4 # Groups: carrier [2] carrier dest n mean_time &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; 1 AS SEA 714 325.618 2 HA HNL 342 623.088 It turns out that Alaska only flies to SEA (Seattle) from New York City (NYC) while Hawaiian only flies to HNL (Honolulu) from NYC. Given the clear difference in distance from New York City to Seattle versus New York City to Honolulu, it is not surprising that we observe such different air times in flights. This is a clear example of not needing to do anything more than a simple exploratory data analysis using data visualization and descriptive statistics to reach an appropriate conclusion. This is why we highly recommend you perform an EDA of any sample data before running statistical inference methods like confidence intervals and hypothesis tests. Learning check (LC9.17) Could we make the same type of immediate conclusion that SFO had a statistically greater air_time if, say, its corresponding standard deviation was 200 minutes? What about 100 minutes? Explain. 
9.6.3 Problems with p-values On top of the many common misunderstandings about hypothesis testing and p-values we listed in Section 9.4, another unfortunate consequence of the expanded use of p-values and hypothesis testing is a phenomenon known as “p-hacking.” p-hacking is the act of “cherry-picking” only results that are “statistically significant” while dismissing those that aren’t, even if at the expense of the scientific ideas. There are lots of articles written recently about misunderstandings and the problems with p-values. We encourage you to check some of them out: Misunderstandings of \\(p\\)-values What a nerdy debate about p-values shows about science - and how to fix it Statisticians issue warning over misuse of \\(P\\) values You Can’t Trust What You Read About Nutrition A Litany of Problems with p-values Such issues were getting so bad that the American Statistical Association (ASA) put out a statement in 2016 titled “The ASA Statement on p-Values: Context, Process, and Purpose” with six principles underlying the proper use and interpretation of p-values. The ASA released this guidance on p-values to improve the conduct and interpretation of quantitative science and inform the growing emphasis on reproducibility of science research. We as authors much prefer the use of confidence intervals for statistical inference, since in our opinion they are much less prone to large misinterpretation. However, many fields still exclusively use \\(p\\)-values for statistical inference, thus we still included them in our text. We encourage you to learn more about “p-hacking” as well and its implication for science. 9.6.4 Additional resources An R script file of all R code used in this chapter is available here. If you want more examples of the infer workflow to conducting hypothesis tests, we suggest you check out the infer package homepage, in particular, a series of example analyses available at https://infer.netlify.com/articles/. 
9.6.5 What’s to come We conclude by showing the infer pipeline diagram for hypothesis testing. FIGURE 9.25: infer package workflow for hypothesis testing. Now that we’ve armed ourselves with an understanding of confidence intervals from Chapter 8 and hypothesis tests from this chapter, we’ll now study inference for regression in the upcoming Chapter 10. We’ll revisit the regression models we studied in Chapters 5 on basic regression and 6. For example, recall Table 5.2, where we displayed the regression table corresponding to our regression model for an instructor’s teaching score as a function of their “beauty” score. # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals) # Get regression table: get_regression_table(score_model) TABLE 9.4: Linear regression table. term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 We previously saw in Section 5.1.2 that the values in the estimate column are the fitted intercept \\(b_0\\) and fitted slope for beauty score \\(b_1\\). In Chapter 10, we’ll unpack the remaining columns: std_error which is the standard error, statistic which is the observed standardized test statistic to compute the p_value, and the 95% confidence intervals as given by lower_ci and upper_ci. "],
-["10-inference-for-regression.html", "Chapter 10 Inference for Regression 10.1 Regression refresher 10.2 Interpreting regression tables 10.3 Conditions for inference for regression 10.4 Simulation-based inference for regression 10.5 Conclusion", " Chapter 10 Inference for Regression In our penultimate chapter, we’ll revisit the regression models we first studied in Chapters 5 and 6. Armed with our knowledge of confidence intervals and hypothesis tests from Chapters 8 and 9, we’ll be able to apply statistical inference to further our understanding of relationships between outcome and explanatory variables. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(infer) 10.1 Regression refresher Before jumping into inference for regression, let’s remind ourselves of the University of Texas teaching evaluations analysis in Section 5.1. 10.1.1 Teaching evals analysis Recall using simple linear regression we modeled the relationship between A numerical outcome variable \\(y\\), the instructor’s teaching score and A single numerical explanatory variable \\(x\\), the instructor’s “beauty” score. We first created an evals_ch6 data frame that selected a subset of variables from the evals data frame included in the moderndive package. 
This evals_ch6 data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching score and the “beauty” rating bty_avg: evals_ch6 &lt;- evals %&gt;% select(ID, score, bty_avg, age) glimpse(evals_ch6) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4… $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4… In Section 5.1.1, we performed an exploratory data analysis of the relationship between these two variables. We saw there that there was a weakly positive correlation of 0.187 between the two variables. This was evidenced in Figure 10.1 of the scatterplot along with the “best-fitting” regression line that summarizes the linear relationship between the two variables. Recall in Subsection 5.3.2 that we defined a “best-fitting” line as the line that minimizes the sum of squared residuals. ggplot(evals_ch6, aes(x = bty_avg, y = score)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Relationship between teaching and beauty scores&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 10.1: Relationship with regression line. Looking at this plot again, you might be asking “Does that line really have all that positive of a slope?” It does increase from left to right as the bty_avg variable increases, but by how much? To get to this information, recall that we followed a two-step procedure: We first “fit” the linear regression model using the lm() function with the formula score ~ bty_avg. We saved this model in score_model. We get the regression table by applying the get_regression_table() from the moderndive package to score_model. 
# Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch6) # Get regression table: get_regression_table(score_model) TABLE 10.1: Previously seen linear regression table. term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Using the values in the estimate column of the resulting regression table in Table 10.1, we could then obtain the equation of the “best-fitting” regression line in Figure 10.1: \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x\\\\ \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{bty}\\_\\text{avg}} \\cdot\\text{bty}\\_\\text{avg}\\\\ &amp;= 3.880 + 0.067\\cdot\\text{bty}\\_\\text{avg} \\end{aligned} \\] where \\(b_0\\) is the fitted intercept and \\(b_1\\) is the fitted slope for bty_avg. Recall the interpretation of the \\(b_1\\) = 0.067 value of the fitted slope: For every increase of one unit in “beauty” rating, there is an associated increase, on average, of 0.067 units of evaluation score. Thus, the slope value quantifies the relationship between the y variable of score and the x variable bty_avg. We also discussed the intercept value of \\(b_0\\) = 3.88 and its lack of practical interpretation, since the range of possible “beauty” scores does not include 0. 10.1.2 Sampling scenario Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Section 7.3.1. First, let’s view the instructors for these 463 courses as a representative sample from a greater study population. In our case, let’s assume that the study population is all instructors at UT Austin and that the sample of instructors who taught these 463 is a representative sample. Unfortunately, we can only assume these two facts without more knowledge of the sampling methodology used by the researchers. 
Since we are viewing these \\(n\\) = 463 courses as a sample, we can view our fitted slope \\(b_1\\) = 0.067 as a point estimate of the population slope \\(\\beta_1\\). In other words, \\(\\beta_1\\) quantifies the relationship between teaching score and “beauty” average bty_avg for all instructors at UT Austin. Similarly, we can view our fitted intercept \\(b_0\\) = 3.88 as a point estimate of the population intercept \\(\\beta_0\\) for all instructors at UT Austin. Putting these two ideas together, we can view the equation of the fitted line \\(\\widehat{y}\\) = \\(b_0 + b_1 \\cdot x\\) = \\(3.880 + 0.067 \\cdot \\text{bty}\\_\\text{avg}\\) as an estimate of some true and unknown population line \\(y = \\beta_0 + \\beta_1 \\cdot x\\). Thus we can draw parallels between our teaching evals analysis and all the sampling scenarios we’ve seen previously in Table 7.5. In this chapter, we’ll focus on the final two scenarios: regression slopes and regression intercepts. TABLE 10.2: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Fitted regression slope \\(b_1\\) or \\(\\widehat{\\beta}_1\\) 6 Population regression intercept \\(\\beta_0\\) Fitted regression intercept \\(b_0\\) or \\(\\widehat{\\beta}_0\\) Since we are now viewing our fitted slope \\(b_1\\) and fitted intercept \\(b_0\\) as point estimates based on a sample, these estimates will be subject to sampling variability, as we’ve seen numerous times throughout this book. 
In other words, if we collected a new sample of data on a different set of \\(n\\) = 463 courses and their instructors, the new fitted slope \\(b_1\\) would likely differ from 0.067. The same goes for the new fitted intercept \\(b_0\\). But by how much will they differ? In other words, by how much will these estimates vary? This information is contained in the remaining columns of the regression table in Table 10.1. Our knowledge of sampling from Chapter 7, confidence intervals from Chapter 8, and hypothesis tests from Chapter 9 will help us interpret these remaining columns. 10.2 Interpreting regression tables In Chapters 5 and 6 and in our regression refresher earlier, we focused only on the two leftmost columns of the regression table in Table 10.1: term and estimate. Let’s now shift our attention to the remaining columns: std_error, statistic, p_value, lower_ci and upper_ci. TABLE 10.3: Previously seen regression table. term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Given the lack of practical interpretation for the fitted intercept \\(b_0\\), in this section we’ll focus only on the second row of the table corresponding to the fitted slope \\(b_1\\). We’ll first interpret the std_error, statistic, p_value, lower_ci and upper_ci columns. Afterwards in the upcoming Subsection 10.2.5, we’ll discuss how R computes these values. 10.2.1 Standard error The third column of the regression table in Table 10.1, std_error, corresponds to the standard error of our estimates. Recall the definition of standard error we saw in Subsection 7.3.2: The standard error is the standard deviation of any point estimate computed from a sample. So what does this mean in terms of the fitted slope \\(b_1\\) = 0.067? This value is just one possible value of the fitted slope resulting from this particular sample of \\(n\\) = 463 pairs of teaching and beauty scores. 
However, if we collected a different sample of \\(n\\) = 463 pairs of teaching and beauty scores, we will almost certainly obtain a different fitted slope \\(b_1\\). This is due to sampling variability. Say we hypothetically collected 1000 such samples of pairs of teaching and beauty scores, computed the 1000 resulting values of the fitted slope \\(b_1\\), and visualized them in a histogram. This would be a visualization of the sampling distribution of \\(b_1\\), which we defined in Subsection 7.3.2. Further recall that the standard deviation of the sampling distribution of \\(b_1\\) has a special name: the standard error. Recall that we constructed three sampling distributions for the sample proportion \\(\\widehat{p}\\) using shovels of size 25, 50, and 100 in Figure 7.12. We observed that as the sample size increased, the standard error decreased as evidenced by the narrowing sampling distribution. The standard error of \\(b_1\\) similarly quantifies how much variation in the fitted slope \\(b_1\\) one would expect between different samples. So in our case, we can expect about 0.016 units of variation in the bty_avg slope variable. Recall that the estimate and std_error values play a key role in inferring the value of the unknown population slope \\(\\beta_1\\) relating to all instructors. In Section 10.4, we’ll perform a simulation using the infer package to construct the bootstrap distribution for \\(b_1\\) in this case. Recall from Subsection 8.7.1 that the bootstrap distribution is an approximation to the sampling distribution in that they have a similar shape. Since they have a similar shape, they have similar standard errors. However, unlike the sampling distribution, the bootstrap distribution is constructed from a single sample, which is a practice more aligned with what’s done in real-life. 
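To make the idea of "the standard deviation of the fitted slope across many samples" concrete, here is a minimal sketch in base R. It does not use the evals data: the population, its intercept and slope, and the noise level are all made-up values chosen only for illustration, and the helper fitted_slope() is a hypothetical name.

```r
# Hypothetical population: the intercept, slope, and noise level below are
# made up for illustration; they are NOT estimates from the evals data.
set.seed(76)
n_population <- 10000
population <- data.frame(x = runif(n_population, min = 2, max = 8))
population$y <- 3.9 + 0.07 * population$x + rnorm(n_population, sd = 0.5)

# Draw one sample of size n (without replacement) and return the fitted slope
fitted_slope <- function(n) {
  rows <- sample(nrow(population), size = n)
  unname(coef(lm(y ~ x, data = population[rows, ]))[2])
}

# 1000 samples of size 463 give 1000 fitted slopes:
# an approximation to the sampling distribution of the slope
slopes <- replicate(1000, fitted_slope(463))

# The standard deviation of these slopes approximates the standard error
se_b1 <- sd(slopes)
se_b1
```

Running hist(slopes) afterwards would display the simulated sampling distribution. In practice we only ever have one sample, which is why the bootstrap approach of Section 10.4 is needed.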
10.2.2 Test statistic The fourth column of the regression table in Table 10.1 statistic corresponds to a test statistic relating to the following hypothesis test: \\[ \\begin{aligned} H_0 &amp;: \\beta_1 = 0\\\\ \\text{vs } H_A&amp;: \\beta_1 \\neq 0 \\end{aligned} \\] Recall our terminology, notation, and definitions related to hypothesis tests we introduced in Section 9.2. A hypothesis test consists of a test between two competing hypotheses: 1) a null hypothesis \\(H_0\\) versus 2) an alternative hypothesis \\(H_A\\). A test statistic is a point estimate/sample statistic formula used for hypothesis testing. Here, our null hypothesis \\(H_0\\) assumes that the population slope \\(\\beta_1\\) is 0. If the population slope \\(\\beta_1\\) is truly 0, then this is saying that there is no true relationship between teaching and “beauty” scores for all the instructors in our population. In other words, \\(x\\) = “beauty” score would have no associated effect on \\(y\\) = teaching score. The alternative hypothesis \\(H_A\\), on the other hand, assumes that population slope \\(\\beta_1\\) is not 0, meaning it could be either positive or negative, suggesting either a positive or negative relationship between teaching and “beauty” scores. Recall we called such alternative hypotheses two-sided. By convention, all hypothesis testing for regression assumes two-sided alternatives. Recall our “hypothesized universe” of no gender discrimination we assumed in our promotions activity in Section 9.1. Similarly here when conducting this hypothesis test, we’ll assume a “hypothesized universe” where there is no relationship between teaching and “beauty” scores. In other words, we’ll assume the null hypothesis \\(H_0: \\beta_1 = 0\\) is true. The statistic column in the regression table is a tricky one however. 
It corresponds to a standardized t-test statistic, much like the two-sample \\(t\\) statistic we saw in Subsection 9.6.1 where we used a theory-based method for conducting hypothesis tests. In both these cases, the null distribution can be mathematically proven to be a \\(t\\)-distribution. Since such test statistics are tricky for individuals new to statistical inference to study, we’ll skip this and jump into interpreting the p-value. If you’re curious however, we’ve included a discussion of this standardized t-test statistic in Subsection 10.5.1. 10.2.3 p-value The fifth column of the regression table in Table 10.1, p_value, corresponds to the p-value of the hypothesis test \\(H_0: \\beta_1 = 0\\) versus \\(H_A: \\beta_1 \\neq 0\\). Again recalling our terminology, notation, and definitions related to hypothesis tests we introduced in Section 9.2, let’s focus on the definition of the p-value: A p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. Recall that you can intuitively think of the p-value as quantifying how “extreme” the observed fitted slope of \\(b_1\\) = 0.067 is in a “hypothesized universe” where there is no relationship between teaching and “beauty” scores. Following the hypothesis testing procedure we outlined in Section 9.4, since the p-value in this case is reported as 0 (the table shows a rounded value), for any commonly used significance level \\(\\alpha\\) we would reject \\(H_0\\) in favor of \\(H_A\\). Using non-statistical language, this is saying: we reject the hypothesis that there is no relationship between teaching and “beauty” scores in favor of the hypothesis that there is. In other words, the evidence suggests there is a significant relationship, one that is positive. More precisely however, the p-value corresponds to how extreme the observed test statistic of 4.09 is when compared to the appropriate null distribution. 
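As a rough check on that last statement, the following sketch computes a two-sided p-value directly from the observed test statistic of 4.09. It assumes the null distribution is a t-distribution with n - 2 = 461 degrees of freedom; the degrees of freedom are our assumption here and are not stated in the regression table.

```r
# Values taken from the bty_avg row of the regression table
t_stat <- 4.09
n <- 463

# Assumption: degrees of freedom are n - 2 (an intercept and a slope are estimated)
df <- n - 2

# Two-sided p-value: area in both tails of the t-distribution beyond |t_stat|
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value

# Rounded to three decimal places, as in the regression table
round(p_value, 3)
```

The unrounded p-value is a very small positive number, not exactly zero; it only appears as 0 in the table because of rounding.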
In Section 10.4, we’ll perform a simulation using the infer package to construct the null distribution in this case. An extra caveat here is that the results of this hypothesis test are only valid if certain “conditions for inference for regression” are met, which we’ll introduce shortly in Section 10.3. 10.2.4 Confidence interval The two rightmost columns of the regression table in Table 10.1 lower_ci and upper_ci correspond to the endpoints of the 95% confidence interval for the population slope \\(\\beta_1\\). Recall our analogy of “nets are to fish” what “confidence intervals are to population parameters” from Section 8.3. The resulting 95% confidence interval for \\(\\beta_1\\) of (0.035, 0.099) is a range of plausible values for the population slope \\(\\beta_1\\) of the linear relationship between teaching and “beauty” scores. As we introduced in Section 8.5.2 on the precise and shorthand interpretation of confidence intervals, the statistically precise interpretation of this confidence interval is: “if we repeated this sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population slope \\(\\beta_1\\).” However, we’ll summarize this using our shorthand interpretation that “we’re 95% ‘confident’ that the true population slope \\(\\beta_1\\) lies between 0.035 and 0.099.” Notice in this case that the resulting 95% confidence interval for \\(\\beta_1\\) of (0.035, 0.099) does not contain a very particular value: \\(\\beta_1\\) equals 0. Recall we mentioned that if the population regression slope \\(\\beta_1\\) is 0, this is equivalent to saying there is no relationship between teaching and “beauty” scores. Since \\(\\beta_1\\) = 0 is not in our plausible range of values for \\(\\beta_1\\), we are inclined to believe that there in fact is a relationship between teaching and “beauty” scores. 
So in this case, the conclusion about the population slope \\(\\beta_1\\) from the 95% confidence interval matches the conclusion from the hypothesis test: evidence suggests that there is a meaningful relationship between teaching and “beauty” scores! Recall from Subsection 8.5.3 however, that the confidence level is one of many factors that determine confidence interval widths. So for example, say we used a higher confidence level of 99% instead of 95%. The resulting confidence intervals for \\(\\beta_1\\) would be wider and thus might now include 0. The lesson to remember here is that any confidence interval based conclusion depends highly on the confidence level used. What are the calculations that went into computing the two endpoints of the 95% confidence interval for \\(\\beta_1\\), lower_ci and upper_ci? Recall our sampling bowl example from Section 8.7.2. Since the sampling and bootstrap distributions of the sample proportion \\(\\widehat{p}\\) were roughly normal, we could use the rule of thumb for bell-shaped distributions from Appendix A.2 to create a 95% confidence interval for \\(p\\) with the following equation: \\[\\widehat{p} \\pm \\text{MoE}_{\\widehat{p}} = \\widehat{p} \\pm 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} \\pm 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] We can generalize this to other point estimates that have roughly normally shaped sampling and bootstrap distributions: \\[\\text{point estimate} \\pm \\text{MoE} = \\text{point estimate} \\pm 1.96 \\cdot \\text{SE}\\] We’ll show in Section 10.4 that the sampling/bootstrap distribution for the fitted slope \\(b_1\\) is in fact bell-shaped as well. Thus we can construct a 95% confidence interval for \\(\\beta_1\\) with the following equation: \\[b_1 \\pm \\text{MoE}_{b_1} = b_1 \\pm 1.96 \\cdot \\text{SE}_{b_1}\\] What is the value of the standard error \\(\\text{SE}_{b_1}\\)? It is in fact in the third column of the regression table in Table 10.1: 0.016. 
Thus \\[ \\begin{aligned} b_1 \\pm 1.96 \\cdot \\text{SE}_{b_1} &amp;= 0.067 \\pm 1.96 \\cdot 0.016 = 0.067 \\pm 0.031\\\\ &amp;= (0.036, 0.098) \\end{aligned} \\] This closely matches the (0.035, 0.099) confidence interval in the last two columns of Table 10.1. Much like hypothesis tests however, the results of this confidence interval are also only valid if the “conditions for inference for regression” discussed in Section 10.3 are met. 10.2.5 How does R compute the table? Since we didn’t perform the simulation to get the values of the standard error, test statistic, p-value, and endpoints of the 95% confidence interval in Table 10.1, you might be wondering how these values were computed. What did R do behind the scenes? Does R run simulations like we did using the infer package in Chapters 8 and 9 on confidence intervals and hypothesis testing? The answer is no! Much like the theory-based method for constructing confidence intervals you saw in Section 8.7.2 and the theory-based hypothesis test you saw in Section 9.6.1, there exist mathematical formulas that allow you to construct confidence intervals and conduct hypothesis tests for inference for regression. These formulas were derived in a time when computers didn’t exist, so it would’ve been impossible to run the extensive computer simulations we have in this book. We present these formulas in Subsection 10.5.1 on “theory-based inference for regression.” In the upcoming Section 10.4, we’ll go over a simulation-based approach to constructing confidence intervals and conducting hypothesis tests using the infer package. In particular, we’ll convince you that the bootstrap distribution of the fitted slope \\(b_1\\) is indeed bell-shaped. 10.3 Conditions for inference for regression Recall in Section 8.3.2 we stated that we could only use the standard-error based method for constructing confidence intervals if the bootstrap distribution was bell shaped. 
Similarly, there are certain conditions that need to be met in order for the results of our hypothesis tests and confidence intervals we described in Section 10.2 to have valid meaning. These conditions must be met for the assumed underlying mathematical and probability theory to hold true. For inference for regression, there are four conditions that need to be met. Note the first letter of each of these four conditions, as highlighted in bold in what follows: LINE. This can serve as a nice reminder of what to check for whenever you perform linear regression. Linearity of relationship between variables Independence of the residuals Normality of the residuals Equality of variance of the residuals Conditions L, N, and E can be verified through what is known as a residual analysis. Condition I can only be verified through an understanding of how the data was collected. In this section, we’ll go over a refresher on residuals, verify whether each of the 4 LINE conditions holds true, and then discuss the implications. 10.3.1 Residuals refresher Recall our definition of a residual from Section 5.1.3: it is the observed value minus the fitted value \\(y - \\widehat{y}\\). Recall that residuals can be thought of as the error or the “lack-of-fit” between the observed value \\(y\\) and the fitted value \\(\\widehat{y}\\) on the regression line in Figure 10.1. In Figure 10.2, we illustrate one particular residual out of 463 using an arrow, as well as its corresponding observed and fitted values using a circle and a square. FIGURE 10.2: Example of observed value, fitted value, and residual. Furthermore, we can automate the calculation of all \\(n\\) = 463 residuals by applying the get_regression_points() function to our saved regression model in score_model. Observe how the resulting values of residual are roughly equal to score - score_hat (there is a slight difference due to rounding error). 
# Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch6) # Get regression points: regression_points &lt;- get_regression_points(score_model) regression_points # A tibble: 463 x 5 ID score bty_avg score_hat residual &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 4.7 5 4.214 0.486 2 2 4.100 5 4.214 -0.114 3 3 3.9 5 4.214 -0.314 4 4 4.8 5 4.214 0.586 5 5 4.600 3 4.08 0.52 6 6 4.3 3 4.08 0.22 7 7 2.8 3 4.08 -1.28 8 8 4.100 3.333 4.102 -0.002 9 9 3.4 3.333 4.102 -0.702 10 10 4.5 3.16700 4.091 0.40900 # … with 453 more rows A residual analysis is used to verify conditions L, N, and E and can be performed using appropriate data visualizations. While there are more sophisticated statistical approaches that can also be used, we’ll focus on the much simpler approach of looking at plots. 10.3.2 Linearity of relationship The first condition is that the relationship between the outcome variable \\(y\\) and the explanatory variable \\(x\\) must be Linear. Recall the scatterplot in Figure 10.1 where we had the explanatory variable \\(x\\) “beauty” score and the outcome variable \\(y\\) teaching score. Would you say that the relationship between \\(x\\) and \\(y\\) is linear? It’s hard to say because of the scatter of the points about the line. In the authors’ opinions, we feel this relationship is “linear enough”. Let’s present an example where the relationship between \\(x\\) and \\(y\\) is clearly not linear in Figure 10.3. In this case, the points clearly do not form a line, but rather a U-shaped polynomial curve. In this case, any results from an inference for regression would not be valid. FIGURE 10.3: Example of clearly non-linear relationship. 10.3.3 Independence of residuals The second condition is that the residuals must be Independent. In other words, the different observations in our data must be independent of one another. 
For our UT Austin data, while there is data on 463 courses, these 463 courses were actually taught by 94 unique instructors. In other words, the same professor is often included more than once in our data. The original evals data frame that we used to construct the evals_ch6 data frame has a variable prof_ID, which is an anonymized identification variable for the professor: evals %&gt;% select(ID, prof_ID, score, bty_avg) # A tibble: 463 x 4 ID prof_ID score bty_avg &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 1 4.7 5 2 2 1 4.100 5 3 3 1 3.9 5 4 4 1 4.8 5 5 5 2 4.600 3 6 6 2 4.3 3 7 7 2 2.8 3 8 8 3 4.100 3.333 9 9 3 3.4 3.333 10 10 4 4.5 3.16700 # … with 453 more rows For example, the professor with prof_ID equal to 1 taught the first 4 courses in the data, the professor with prof_ID equal to 2 taught the next 3, and so on. Given that the same professor taught these first four courses, it is reasonable to expect that these four teaching scores are related to each other. If a professor gets a high score in one class, chances are fairly good they’ll get a high score in another. This dataset thus provides different information than if we had 463 unique instructors teaching the 463 courses. In this case we say there exists dependence between observations. The first four courses taught by professor 1 are dependent, the next 3 courses taught by professor 2 are related, and so on. Any proper analysis of this data needs to take into account that we have repeated measures for the same profs. So in this case, the independence condition is not met. What does this mean for our analysis? We’ll address this in Subsection 10.3.6 coming up, after we check the remaining two conditions. 10.3.4 Normality of residuals The third condition is that the residuals should follow a Normal distribution. Furthermore, the center of this distribution should be 0. In other words, sometimes the regression model will make positive errors: \\(y - \\widehat{y} &gt; 0\\). 
Other times, the regression model will make equally negative errors: \\(y - \\widehat{y} &lt; 0\\). However, on average the errors should equal 0. The simplest way to check the normality of the residuals is to look at a histogram, which we visualize in Figure 10.4. ggplot(regression_points, aes(x = residual)) + geom_histogram(binwidth = 0.25, color = &quot;white&quot;) + labs(x = &quot;Residual&quot;) FIGURE 10.4: Histogram of residuals. This histogram shows that we have more positive residuals than negative. Since the residual \\(y-\\widehat{y}\\) is positive when \\(y &gt; \\widehat{y}\\), it seems our regression model’s fitted teaching scores \\(\\widehat{y}\\) tend to underestimate the true teaching scores \\(y\\). Furthermore, this histogram has a slight left-skew in that there is a tail on the left. Another way to say this is that the residuals exhibit a negative skew. Is this a problem? Again, there is a certain amount of subjectivity in the response. In the authors’ opinion, while there is a slight skew to the residuals, we feel it isn’t drastic. On the other hand, others might disagree with our assessment. Let’s present examples where the residuals clearly do and don’t follow a normal distribution in Figure 10.5. In this case of the model yielding the clearly non-normal residuals on the right, any results from an inference for regression would not be valid. FIGURE 10.5: Example of clearly normal and clearly non-normal residuals. 10.3.5 Equality of variance The fourth and final condition is that the residuals should exhibit Equal variance across all values of the explanatory variable \\(x\\). In other words, the value and spread of the residuals should not depend on the value of the explanatory variable \\(x\\). Recall the scatterplot in Figure 10.1: we had the explanatory variable \\(x\\) “beauty” score on the x-axis and the outcome variable \\(y\\) teaching score on the y-axis. 
Instead, let’s create a scatterplot that has the same values on the x-axis, but now with the residual \\(y-\\widehat{y}\\) on the y-axis as seen in Figure 10.6. ggplot(regression_points, aes(x = bty_avg, y = residual)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Residual&quot;) + geom_hline(yintercept = 0, col = &quot;blue&quot;, size = 1) FIGURE 10.6: Plot of residuals over beauty score. You can think of this plot as a modified version of the plot with the regression line in Figure 10.1, but with the regression line flattened out to \\(y=0\\). Looking at this plot, would you say that the spread of the residuals around the blue line is constant across all values of the explanatory variable \\(x\\) “beauty” score? This question is rather qualitative and subjective in nature, thus different people may respond with different answers. For example, some people might say that there is slightly more variation in the residuals for smaller values of \\(x\\) than for larger ones. However, it can be argued that there isn’t a drastic non-constancy. In Figure 10.7 let’s present an example where the residuals clearly do not have equal variance across all values of the explanatory variable \\(x\\). FIGURE 10.7: Example of clearly non-equal variance. Observe how the spread of the residuals increases as the value of \\(x\\) increases. This is a situation known as heteroskedasticity. Any inference for regression based on a model yielding such a pattern in the residuals would not be valid. 10.3.6 What’s the conclusion? Let’s list our four conditions for inference for regression again and indicate whether or not they were satisfied in our analysis: Linearity of relationship between variables: Yes Independence of residuals: No Normality of residuals: Somewhat Equality of variance: Yes So what does this mean for the results of our confidence intervals and hypothesis tests in Section 10.2? First, the Independence condition. 
The fact that there exist dependencies between different rows in evals_ch6 must be addressed. In more advanced statistics courses, you’ll learn how to incorporate such dependencies into your regression models. One such technique is called hierarchical/multilevel modeling. Second, when conditions L, N, E are not met, it often means there is a shortcoming in our model. For example, it may be the case that using only a single explanatory variable is insufficient, as we did with “beauty” score. We may need to incorporate more explanatory variables in a multiple regression model as we did in Chapter 6. In our case, the best we can do is view the results suggested by our confidence intervals and hypothesis tests as preliminary. That while a preliminary analysis suggests that there is a significant relationship between teaching and “beauty” scores, further investigation is warranted. In particular, by improving the preliminary score ~ bty_avg model so that the 4 conditions are met. When the 4 conditions are roughly met, then we can put more faith into our confidence intervals and p-values. The conditions for inference in regression problems are a key part of regression analysis that are of vital importance to the processes of constructing confidence intervals and conducting hypothesis tests. However, it is often the case with regression analysis in the real-world that not all the conditions are completely met. Furthermore, as you saw there is a level of subjectivity in the residual analyses to verify the L, N, and E conditions. So what can you do? We as authors advocate for transparency in communicating all results. This lets the stakeholders of any analysis know about a model’s shortcomings or whether the model is “good enough.” Learning check (LC10.1) Continue with our regression using age as the explanatory variable and teaching score as the outcome variable. 
Use the get_regression_points() function to get the observed values, fitted values, and residuals for all 463 instructors. Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern. 10.4 Simulation-based inference for regression Recall in Subsection 10.2.5 when we interpreted the third through seventh columns of a regression table, we stated that R doesn’t do simulations to compute these values. Rather R uses theory-based methods that involve mathematical formulas. In this section, we’ll use the simulation-based methods you previously learned in Chapters 8 and 9 to recreate the values in the regression table in Table 10.1. In particular, we’ll use the infer package workflow to Construct a 95% confidence interval for the population slope \\(\\beta_1\\) using bootstrap resampling with replacement. We did this previously in Sections 8.4 with the pennies data and 8.6 with the mythbusters_yawn data. Conduct a hypothesis test of \\(H_0: \\beta_1 = 0\\) vs \\(H_A: \\beta_1 \\neq 0\\) using a permutation test. We did this previously in Sections 9.3 with the promotions data and 9.5 with the movies_sample IMDb data. 10.4.1 Confidence interval for slope We’ll construct a 95% confidence interval for \\(\\beta_1\\) using the infer workflow outlined in Subsection 8.4.2. Specifically, we’ll first construct the bootstrap distribution for the fitted slope \\(b_1\\) using our single sample of 463 courses: specify() the variables of interest in evals_ch6 with the formula: score ~ bty_avg. generate() replicates by using bootstrap resampling with replacement from the original sample of 463 courses. We generate reps = 1000 replicates using type = &quot;bootstrap&quot;. calculate() the summary statistic of interest: the fitted slope \\(b_1\\). Then using this bootstrap distribution we’ll construct the 95% confidence interval using the percentile method and (if appropriate) the standard error method as well. 
It is important to note in this case that the bootstrapping with replacement is done row-by-row. Thus, the original pairs of score and bty_avg values are always kept together, but different pairs of score and bty_avg values may be resampled multiple times. The resulting confidence interval will denote a range of plausible values for the unknown population slope \\(\\beta_1\\) quantifying the relationship between teaching and “beauty” scores for all professors at UT Austin. Let’s first construct the bootstrap distribution for the fitted slope \\(b_1\\): bootstrap_distn_slope &lt;- evals_ch6 %&gt;% specify(formula = score ~ bty_avg) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;slope&quot;) bootstrap_distn_slope # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.0651055 2 2 0.0382313 3 3 0.108056 4 4 0.0666601 5 5 0.0715932 6 6 0.0854565 7 7 0.0624868 8 8 0.0412859 9 9 0.0796269 10 10 0.0761299 # … with 990 more rows Observe how we have 1000 values of the bootstrapped slope \\(b_1\\) in the stat column. Let’s visualize these resulting 1000 bootstrapped values in Figure 10.8. visualize(bootstrap_distn_slope) FIGURE 10.8: Bootstrap distribution of slope. Observe how the bootstrap distribution is roughly bell-shaped. Recall from Section 8.7.1 that the shape of the bootstrap distribution of \\(b_1\\) closely approximates the shape of the sampling distribution of \\(b_1\\). Percentile-method First, let’s compute the 95% confidence interval for \\(\\beta_1\\) using the percentile method, in other words by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped. 
percentile_ci &lt;- bootstrap_distn_slope %&gt;% get_confidence_interval(type = &quot;percentile&quot;, level = 0.95) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.0323411 0.0990027 The resulting percentile-based 95% confidence interval for \\(\\beta_1\\) of (0.032, 0.099) is similar to the confidence interval in the regression Table 10.1 of (0.035, 0.099). Standard error method Since the bootstrap distribution in Figure 10.8 appears to be roughly bell-shaped, we can also construct a 95% confidence interval for \\(\\beta_1\\) using the standard error method. In order to do this, we need to first compute fitted slope \\(b_1\\), which will act as the center of our standard error-based confidence interval. While we saw in the regression table in Table 10.1 that this was \\(b_1\\) = 0.067, we can also use the infer pipeline with the generate() step removed: observed_slope &lt;- evals %&gt;% specify(score ~ bty_avg) %&gt;% calculate(stat = &quot;slope&quot;) observed_slope # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.0666370 We then use the get_ci() function with level = 0.95 to compute the 95% confidence interval for \\(\\beta_1\\). Note that setting the point_estimate argument to the observed_slope of 0.067 sets the center of the confidence interval. se_ci &lt;- bootstrap_distn_slope %&gt;% get_ci(level = 0.95, type = &quot;se&quot;, point_estimate = observed_slope) se_ci # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 0.0333767 0.0998974 The resulting standard error-based 95% confidence interval for \\(\\beta_1\\) of (0.033, 0.1) is however slightly different than the confidence interval in the regression Table 10.1 of (0.035, 0.099). 
Comparing all three Let’s compare all three confidence intervals in Figure 10.9, where the percentile-based confidence interval is marked with solid lines, the standard error based confidence interval is marked with dashed lines, and the theory-based confidence interval (0.035, 0.099) from the regression table in Table 10.1 is marked with dotted lines. visualize(bootstrap_distn_slope) + shade_confidence_interval(endpoints = percentile_ci, fill = NULL, linetype = &quot;solid&quot;, color = &quot;black&quot;) + shade_confidence_interval(endpoints = se_ci, fill = NULL, linetype = &quot;dashed&quot;, color = &quot;black&quot;) + shade_confidence_interval(endpoints = c(0.035, 0.099), fill = NULL, linetype = &quot;dotted&quot;, color = &quot;black&quot;) FIGURE 10.9: Comparing three confidence intervals for the slope. Observe that all three are quite similar! Furthermore, none of the three confidence intervals for \\(\\beta_1\\) contains 0; all three are located entirely above 0. This suggests that there is in fact a meaningful positive relationship between teaching and “beauty” scores. 10.4.2 Hypothesis test for slope Let’s now conduct a hypothesis test of \\(H_0: \\beta_1 = 0\\) vs \\(H_A: \\beta_1 \\neq 0\\). We will use the infer package, which follows the hypothesis testing paradigm in the “There is Only One Test” diagram in Figure 9.14. Let’s first think about what it means for \\(\\beta_1\\) to be zero as assumed in the null hypothesis \\(H_0\\). Recall we said if \\(\\beta_1 = 0\\), then this is saying there is no relationship between the teaching and “beauty” scores. Thus assuming this particular null hypothesis \\(H_0\\) means that in our “hypothesized universe” there is no relationship between score and bty_avg. We can therefore shuffle/permute the bty_avg variable to no consequence. We construct the null distribution of the fitted slope \\(b_1\\) by following the steps. 
Recall from Section 9.2 on terminology, notation, and definitions related to hypothesis testing where we defined the null distribution: the sampling distribution of our test statistic \\(b_1\\) assuming the null hypothesis \\(H_0\\) is true. specify() the variables of interest in evals_ch6 with the formula: score ~ bty_avg. hypothesize() the null hypothesis of independence. Recall from Section 9.3 that this is an additional step that needs to be added for hypothesis testing. generate() replicates by permuting/shuffling the explanatory variable bty_avg from the original sample of 463 courses. We generate reps = 1000 replicates using type = &quot;permute&quot;. calculate() the test statistic of interest: the fitted slope \\(b_1\\). In this case, we permute the values of bty_avg across the values of score 1000 times. We can do this shuffling/permuting since we assumed a “hypothesized universe” of no relationship between these two variables. Then we calculate the &quot;slope&quot; coefficient for each of these 1000 generated samples. null_distn_slope &lt;- evals %&gt;% specify(score ~ bty_avg) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;slope&quot;) Observe the resulting null distribution for the fitted slope \\(b_1\\) in Figure 10.10. visualize(null_distn_slope) FIGURE 10.10: Null distribution. Notice how it is centered at \\(b_1\\) = 0. This is because in our hypothesized universe, there is no relationship between score and bty_avg. In other words \\(\\beta_1\\) = 0. Thus the most typical fitted slope \\(b_1\\) we observe across our simulations is 0. Observe furthermore how there is variation around this central value of 0. Let’s visualize the p-value in the null distribution by comparing it to the observed test statistic of \\(b_1\\) = 0.067 in Figure 10.11. We’ll do this by adding a shade_p_value() layer to the previous visualize() code. 
visualize(null_distn_slope) + shade_p_value(obs_stat = observed_slope, direction = &quot;both&quot;) FIGURE 10.11: Null distribution and p-value. Since the observed fitted slope 0.067 falls far to the right of this null distribution and thus the shaded region doesn’t overlap it, we’ll have a \\(p\\)-value of 0. For completeness’s sake, however, let’s compute the numerical value of the p-value anyways using the get_p_value() function. It takes the same inputs as the shade_p_value() function: null_distn_slope %&gt;% get_p_value(obs_stat = observed_slope, direction = &quot;both&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0 This matches the p-value of 0 in the regression table in Table 10.1. We therefore reject the null hypothesis \\(H_0: \\beta_1 = 0\\) in favor of the alternative hypothesis \\(H_A: \\beta_1 \\neq 0\\). We thus have evidence that suggests there is a significant relationship between teaching and “beauty” scores for all instructors at UT Austin. When the conditions for inference for regression are met and the null distribution has a bell shape, we are likely to see similar results between the simulation-based results we just demonstrated and the theory-based results shown in the regression table in Table 10.1. Learning check (LC10.2) Repeat the inference but this time for the correlation coefficient instead of the slope. Note the implementation of stat = &quot;correlation&quot; in the calculate() function of the infer package. 10.5 Conclusion 10.5.1 Theory-based inference for regression Recall in Section 10.2.5 when we interpreted the regression table in Table 10.1, we mentioned that R does not compute its values using simulation-based methods for constructing confidence intervals and conducting hypothesis tests as we did in Chapters 8 and 9 using the infer package. 
Rather, R uses a theory-based approach using mathematical formulas, much like the theory-based confidence intervals you saw in Subsection 8.7.2 and the theory-based hypothesis tests you saw in Subsection 9.6.1. These formulas were derived in a time when computers didn’t exist, so it would’ve been impossible to run the extensive computer simulations. In particular, there is a formula for the standard error of the fitted slope \\(b_1\\): \\[\\text{SE}_{b_1} = \\dfrac{\\dfrac{s_y}{s_x} \\cdot \\sqrt{1-r^2}}{\\sqrt{n-2}}\\] As with many formulas in statistics, there’s a lot going on here, so let’s first break down what each symbol represents. First \\(s_x\\) and \\(s_y\\) are the sample standard deviations of the explanatory variable bty_avg and the response variable score respectively. Second, \\(r\\) is the sample correlation coefficient between score and bty_avg. This was computed as 0.187 in Chapter 5. Lastly, \\(n\\) is the number of pairs of points in the evals_ch6 data frame, here 463. To put this formula into words, the standard error of \\(b_1\\) depends on the relationship between the variability of the response variable and the variability of the explanatory variable as measured in the \\(s_y / s_x\\) term. Next it looks into the relationship of how the two variables relate to each other in the \\(\\sqrt{1-r^2}\\) term. However, the most important observation to make in the previous formula is that there is an \\(n - 2\\) in the denominator. In other words, as the sample size \\(n\\) increases, the standard error \\(\\text{SE}_{b_1}\\) decreases. Just as we demonstrated in Section 7.3.3 when we used shovels with \\(n\\) = 25, 50, and 100, the amount of sampling variation of the fitted slope \\(b_1\\) will depend on the sample size \\(n\\). In particular, as the sample size increases, both the sampling and bootstrap distributions narrow. In other words, the standard error \\(\\text{SE}_{b_1}\\) decreases. 
Hence our estimates \\(b_1\\) of the true population slope \\(\\beta_1\\) get more and more precise. R then uses this formula for the standard error of \\(b_1\\) in the third column of the regression table and subsequently to construct 95% confidence intervals. But what about the hypothesis test? Much like with our theory-based hypothesis test in Subsection 9.6.1, R uses the following \\(t\\)-statistic as the test statistic for hypothesis testing: \\[ t = \\dfrac{ b_1 - \\beta_1}{ \\text{SE}_{b_1}} \\] And since the null hypothesis \\(H_0: \\beta_1 = 0\\) is assumed during the hypothesis test, the \\(t\\)-statistic becomes \\[ t = \\dfrac{ b_1 - 0}{ \\text{SE}_{b_1}} = \\dfrac{ b_1 }{ \\text{SE}_{b_1}} \\] What are the values of \\(b_1\\) and \\(\\text{SE}_{b_1}\\)? They are in the estimate and std_error columns of the regression table in Table 10.1. Thus the value of 4.09 in the table is computed as 0.067/0.016 = 4.188. Note there is a slight difference due to rounding error. Lastly, to compute the p-value, we need to compare the observed test statistic of 4.09 to the appropriate null distribution. Recall from Section 9.2, that a null distribution is the sampling distribution of the test statistic assuming the null hypothesis \\(H_0\\) is true. Much like in our theory-based hypothesis test in Section 9.6.1, it can be mathematically proven that this distribution is a \\(t\\)-distribution with degrees of freedom equal to \\(df\\) = n - 2 = 463 - 2 = 461. Don’t worry if you’re feeling a little overwhelmed at this point. There is a lot of background theory to understand before you can fully make sense of the equations for theory-based methods. That being said, theory-based methods and simulation-based methods for constructing confidence intervals and conducting hypothesis tests often yield consistent results. 
In our opinion, two large benefits of simulation-based methods over theory-based ones are that 1) they are easier for people new to statistical inference to understand and 2) they also work in situations where theory-based methods and mathematical formulas don’t exist. 10.5.2 Summary of statistical inference We’ve now completed the last two sampling scenarios first introduced in the “Scenarios of sampling for inference” table in Subsection 7.5.1, which we re-display in Table 10.4. Armed with the regression modeling techniques you learned in Chapters 5 and 6, your understanding of sampling for inference in Chapter 7, and the tools for statistical inference like confidence intervals and hypothesis tests in Chapters 8 and 9, you’re now equipped to study the significance of relationships between variables in a wide array of data! TABLE 10.4: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Fitted regression slope \\(b_1\\) or \\(\\widehat{\\beta}_1\\) 6 Population regression intercept \\(\\beta_0\\) Fitted regression intercept \\(b_0\\) or \\(\\widehat{\\beta}_0\\) 10.5.3 Additional resources An R script file of all R code used in this chapter is available here. 10.5.4 What’s to come You’ve now concluded the last major part of the book on “Statistical Inference via infer.” The closing Chapter 11 concludes this book with various case studies involving real data, such as house prices in Seattle, WA. 
You’ll see how the principles in this book can help you become a great storyteller with data! "],
-["11-thinking-with-data.html", "Chapter 11 Tell the Story with Data 11.1 Case study: Seattle house prices 11.2 Case study: Effective data storytelling Concluding remarks", " Chapter 11 Tell the Story with Data Recall in the Preface and at the end of chapters throughout this book, we displayed the “ModernDive flowchart” mapping your journey through this book. FIGURE 11.1: ModernDive Flowchart. Let’s go over a refresher of what you’ve covered so far. You first got started with data in Chapter 1 where you learned about the difference between R and RStudio, started coding in R, installed and loaded your first R packages, and explored your first dataset: all domestic departure flights from a New York City airport in 2013. Then you covered the following three portions of this book: Data science with tidyverse. You assembled your data science toolbox using tidyverse packages. In particular you Ch.2: Visualized data using the ggplot2 package. Ch.3: Wrangled data using the dplyr package. Ch.4: Learned about the concept of “tidy” data as a standardized data frame input and output format for all packages in the tidyverse. Furthermore, you learned how to import spreadsheet files into R using the readr package. Data modeling with moderndive. Using these data science tools and helper functions from the moderndive package, you fit your first data models. In particular: Ch.5: Basic regression models with only one explanatory variable. Ch.6: Multiple regression models with more than one explanatory variable. Statistical inference with infer. Once again using your newly acquired data science tools, you unpacked statistical inference using the infer package. In particular you: Ch.7: Learned about the role that sampling variability plays in statistical inference and the role that sample size plays in sampling variability. Ch.8: Constructed confidence intervals. Ch.9: Conducted hypothesis tests. 
Data modeling with moderndive (revisited): Armed with your understanding of statistical inference, you revisited and reviewed the models you constructed in Ch.5 &amp; Ch.6. In particular you: Ch.10: Interpreted confidence intervals and hypothesis tests in a regression setting. All this was our way of guiding you through your first experiences of “thinking with data,” an expression originally coined by Google’s Diane Lambert . The philosophy underlying this expression guided the path we set for you in the flowchart in Figure 11.1. This philosophy is well summarized in the introduction to “Practical Data Science for Stats”: a collection of pre-prints focusing on the practical side of data science workflows and statistical analysis curated by Jennifer Bryan and Hadley Wickham. They quote: There are many aspects of day-to-day analytical work that are almost absent from the conventional statistics literature and curriculum. And yet these activities account for a considerable share of the time and effort of data analysts and applied statisticians. The goal of this collection is to increase the visibility and adoption of modern data analytical workflows. We aim to facilitate the transfer of tools and frameworks between industry and academia, between software engineering and statistics and computer science, and across different domains. In other words, to be equipped to “think with data” in the 21st century, analysts need practice going through the “Data/Science Pipeline” we saw in the Preface (re-displayed in Figure 11.2). It is our opinion that for too long, statistics education only focused on parts of this pipeline, instead of going through it in its entirety . FIGURE 11.2: Data/Science Pipeline. To conclude this book, we’ll present you with some additional case studies of working with data. In Section 11.1 we’ll take you through a full-pass of the “Data/Science Pipeline” in order to analyze the sale price of houses in Seattle, WA, USA. 
In Section 11.2, we’ll present you with some examples of effective data storytelling drawn from the data journalism website FiveThirtyEight.com. We present these case studies to you because we believe that you should not only be able to “think with data,” but also be able to “tell the story with data.” Let’s explore how this might be done! Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(skimr) library(fivethirtyeight) 11.1 Case study: Seattle house prices Kaggle.com is a machine learning and predictive modeling competition website that hosts datasets uploaded by companies, governmental organizations, and other individuals. One of their datasets is the “House Sales in King County, USA”. It consists of sale prices of homes sold between May 2014 and May 2015 in King County, Washington, USA, which includes the greater Seattle metropolitan area. This dataset is in the house_prices data frame included in the moderndive package. The dataset consists of 21,613 houses and 21 variables describing these houses (for a full list and description of these variables, see the help file by running ?house_prices in the console). In this case study, we’ll create a multiple regression model where: The outcome variable \\(y\\) is the sale price of houses. Two explanatory variables: A numerical explanatory variable \\(x_1\\): house size sqft_living as measured in square feet of living space. Note that 1 square foot is about 0.09 square meters. A categorical explanatory variable \\(x_2\\): house condition, a categorical variable with 5 levels where 1 indicates “poor” and 5 indicates “excellent.” 11.1.1 Exploratory data analysis: Part I As we’ve said numerous times throughout this book, a crucial first step when presented with data is to perform an exploratory data analysis (EDA). 
Exploratory data analysis can give you a sense of your data, help identify issues with your data, bring to light any outliers, and help inform model construction. Recall the three common steps in an exploratory data analysis we introduced in Section 5.1.1: Looking at the raw data values. Computing summary statistics. Creating data visualizations. First, let’s look at the raw data using View() to bring up RStudio’s spreadsheet viewer and the glimpse() function from the dplyr package: View(house_prices) glimpse(house_prices) Observations: 21,613 Variables: 21 $ id &lt;chr&gt; &quot;7129300520&quot;, &quot;6414100192&quot;, &quot;5631500400&quot;, &quot;2487200875&quot;,… $ date &lt;date&gt; 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-0… $ price &lt;dbl&gt; 221900, 538000, 180000, 604000, 510000, 1225000, 257500… $ bedrooms &lt;int&gt; 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2… $ bathrooms &lt;dbl&gt; 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2… $ sqft_living &lt;int&gt; 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18… $ sqft_lot &lt;int&gt; 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470… $ floors &lt;dbl&gt; 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, … $ waterfront &lt;lgl&gt; FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… $ view &lt;int&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0… $ condition &lt;fct&gt; 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4… $ grade &lt;fct&gt; 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, … $ sqft_above &lt;int&gt; 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18… $ sqft_basement &lt;int&gt; 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,… $ yr_built &lt;int&gt; 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2… $ yr_renovated &lt;int&gt; 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ zipcode &lt;fct&gt; 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,… $ lat &lt;dbl&gt; 47.5, 47.7, 47.7, 
47.5, 47.6, 47.7, 47.3, 47.4, 47.5, 4… $ long &lt;dbl&gt; -122, -122, -122, -122, -122, -122, -122, -122, -122, -… $ sqft_living15 &lt;int&gt; 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2… $ sqft_lot15 &lt;int&gt; 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,… Here are some questions you can ask yourself at this stage of an EDA: Which variables are numerical and which are categorical? For the categorical variables, what are their levels? Besides the variables we’ll be using in our regression model, what other variables do you think would be useful to use in a model for house price? Observe, for example, that while the condition variable has values 1 through 5, these are saved in R as fct factors. This is R’s way of saving categorical variables. So you should think of these as the “labels” 1 through 5 and not the numerical values 1 through 5. Let’s now perform the second step in an EDA: computing summary statistics. Recall from Section 3.3 that summary statistics are single numerical values that summarize a large number of values. Examples of summary statistics include the mean, the median, the standard deviation, and various percentiles. We could do this using the summarize() function from the dplyr package along with R’s built-in summary functions, like mean() and median(). However, recall in Section 3.5, we saw the following code that computes a variety of summary statistics of the variable gain, which is the amount of time that a flight makes up mid-air: gain_summary &lt;- flights %&gt;% summarize( min = min(gain, na.rm = TRUE), q1 = quantile(gain, 0.25, na.rm = TRUE), median = quantile(gain, 0.5, na.rm = TRUE), q3 = quantile(gain, 0.75, na.rm = TRUE), max = max(gain, na.rm = TRUE), mean = mean(gain, na.rm = TRUE), sd = sd(gain, na.rm = TRUE), missing = sum(is.na(gain)) ) To repeat this for all three price, sqft_living, and condition variables would be tedious to code up. 
So instead, let’s use the convenient skim() function from the skimr package we first used in Subsection 6.1.1, being sure to only select() the variables of interest for our model: house_prices %&gt;% select(price, sqft_living, condition) %&gt;% skim() Skim summary statistics n obs: 21613 n variables: 3 ── Variable type:factor ───────────────────────────────────────────────────────────── variable missing complete n n_unique top_counts ordered condition 0 21613 21613 5 3: 14031, 4: 5679, 5: 1701, 2: 172 FALSE ── Variable type:integer ───────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 sqft_living 0 21613 21613 2079.9 918.44 290 1427 1910 2550 13540 ── Variable type:numeric ───────────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 price 0 21613 21613 540088.14 367127.2 75000 321950 450000 645000 7700000 Observe that the mean price of $540,088 is larger than the median of $450,000. This is because a small number of very expensive houses are inflating the average. In other words, there are “outlier” house prices in our dataset. (This fact will become very apparent when we create our visualizations next.) However, the median is not as sensitive to such outlier house prices. This is why news about the real estate market generally reports median house prices and not mean/average house prices. We say here that the median is more robust to outliers than the mean. Similarly, while the standard deviation and interquartile-range (IQR) are both measures of spread and variability, the IQR is more robust to outliers. Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s first create univariate visualizations, in other words, produce plots focusing on single variables at a time. 
Since price and sqft_living are numerical variables, we can visualize their distributions using a geom_histogram() as seen in Section 2.5 on histograms. On the other hand, since condition is categorical, we can visualize its distribution using a geom_bar(). Recall from Section 2.8 on barplots that since condition is not “pre-counted”, we use a geom_bar() and not a geom_col(). # Histogram of house price: ggplot(house_prices, aes(x = price)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;price (USD)&quot;, title = &quot;House price&quot;) # Histogram of sqft_living: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;living space (square feet)&quot;, title = &quot;House size&quot;) # Barplot of condition: ggplot(house_prices, aes(x = condition)) + geom_bar() + labs(x = &quot;condition&quot;, title = &quot;House condition&quot;) In Figure 11.3, we display all three of these visualizations at once. FIGURE 11.3: Exploratory visualizations of Seattle house prices data. First, observe in the bottom plot that most houses are of condition “3”, with a few more of condition “4” and “5”, and almost none that are “1” or “2”. Next, observe in the histogram for price in the top-left plot that a majority of houses are less than two million dollars. Observe also that the x-axis stretches out to 8 million dollars, even though there do not appear to be any houses close to that price. This is because there are a very small number of houses with prices closer to 8 million. These are the outlier house prices we mentioned earlier. We say that the variable price is right skewed as exhibited by the long right tail. Similarly, observe in the histogram of sqft_living in the middle plot that most houses appear to have less than 5000 square feet of living space. For comparison an American football field is about 57,600 square feet whereas a standard soccer/association football field is about 64,000 square feet. 
Observe that this variable is also right skewed, although not as drastically as the price variable. For both the price and sqft_living variables, the right-skew makes distinguishing houses at the lower end of the x-axis hard. This is because the scale of the x-axis is compressed by the small number of very expensive and very large houses. So what can we do about this skew? Let’s apply a log10-transformation to these variables. If you are unfamiliar with such transformations, we highly recommend you read Appendix A.3 on log-transformations. Briefly however, log-transformations allow us to alter the scale of a variable to focus on multiplicative changes instead of additive changes. In other words, relative changes instead of absolute changes. Such multiplicative/relative changes are also called changes in orders of magnitude. Let’s create new log10-transformed versions of the right-skewed variables price and sqft_living using the mutate() function from Section 3.5, but we’ll give the latter the name log10_size, which is shorter and easier to understand than the name log10_sqft_living. house_prices &lt;- house_prices %&gt;% mutate( log10_price = log10(price), log10_size = log10(sqft_living) ) Let’s display the before and after effects of this transformation on these variables for only the first 10 rows of house_prices: house_prices %&gt;% select(price, log10_price, sqft_living, log10_size) # A tibble: 21,613 x 4 price log10_price sqft_living log10_size &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; 1 221900 5.34616 1180 3.07188 2 538000 5.73078 2570 3.40993 3 180000 5.25527 770 2.88649 4 604000 5.78104 1960 3.29226 5 510000 5.70757 1680 3.22531 6 1225000 6.08814 5420 3.73400 7 257500 5.41078 1715 3.23426 8 291850 5.46516 1060 3.02531 9 229500 5.36078 1780 3.25042 10 323000 5.50920 1890 3.27646 # … with 21,603 more rows Observe in particular the houses in the sixth and third row. 
The house in the sixth row has price $1,225,000, which is just above one million dollars. Since \\(10^6\\) is one million, its log10_price is 6.09. Contrast this with all other houses with log10_price less than six, since they all have price less than $1,000,000. The house in the third row is the only house with sqft_living less than 1000. Since \\(1000 = 10^3\\), it’s the lone house with log10_size less than 3. Let’s now visualize the before and after effects of this transformation for price in Figure 11.4. # Before log10-transformation: ggplot(house_prices, aes(x = price)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;price (USD)&quot;, title = &quot;House price: Before&quot;) # After log10-transformation: ggplot(house_prices, aes(x = log10_price)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;log10 price (USD)&quot;, title = &quot;House price: After&quot;) FIGURE 11.4: House price before and after log10-transformation. Observe that after the transformation, the distribution is much less skewed, and in this case, more symmetric and more bell-shaped. Now you can more easily distinguish the lower priced houses. Let’s do the same for house size, where the variable sqft_living was log10-transformed to log10_size. # Before log10-transformation: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;living space (square feet)&quot;, title = &quot;House size: Before&quot;) # After log10-transformation: ggplot(house_prices, aes(x = log10_size)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;log10 living space (square feet)&quot;, title = &quot;House size: After&quot;) FIGURE 11.5: House size before and after log10-transformation. Observe in Figure 11.5 that the log10-transformation has a similar effect of un-skewing the variable. 
We emphasize that while in these two cases the resulting distributions are more symmetric and bell-shaped, this is not always necessarily the case. Given the now un-skewed nature of log10_price and log10_size, we are going to revise our multiple regression model to use our new variables: The outcome variable \\(y\\) is the sale log10_price of houses. Two explanatory variables: A numerical explanatory variable \\(x_1\\): house size log10_size as measured in log10 square feet of living space. A categorical explanatory variable \\(x_2\\): house condition, a categorical variable with 5 levels where 1 indicates “poor” and 5 indicates “excellent.” 11.1.2 Exploratory data analysis: Part II Let’s now continue our EDA by creating multivariate visualizations. Unlike the univariate histograms and barplot in the earlier Figures 11.3, 11.4, and 11.5, multivariate visualizations show relationships between more than one variable. This is an important step of an EDA to perform since the goal of modeling is to explore relationships between variables. Since our model involves a numerical outcome variable, a numerical explanatory variable, and a categorical explanatory variable, we are in a similar regression modeling situation as in Section 6.1 where we studied the UT Austin teaching scores dataset. Recall in that case the numerical outcome variable was teaching score, the numerical explanatory variable was instructor age, and the categorical explanatory variable was (binary) gender. We thus have two choices of models we can fit. Either 1) an interaction model where the regression line for each condition level will have both a different slope and a different intercept or 2) a parallel slopes model where the regression line for each condition level will have the same slope but different intercepts. Recall from Subsection 6.1.3 on the parallel slopes model that the ggplot2 package does not have a convenient way to plot a parallel slopes model. 
We therefore use the special purpose gg_parallel_slopes() function included in the moderndive package. We plot both resulting models in Figure 11.6, with the interaction model in the left-hand plot. # Plot interaction model ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.05) + geom_smooth(method = &quot;lm&quot;, se = FALSE) + labs(y = &quot;log10 price&quot;, x = &quot;log10 size&quot;, title = &quot;House prices in Seattle&quot;) # Plot parallel slopes model gg_parallel_slopes(y = &quot;log10_price&quot;, num_x = &quot;log10_size&quot;, cat_x = &quot;condition&quot;, data = house_prices, alpha = 0.05) FIGURE 11.6: Interaction and parallel slopes models. In both cases, we see there is a positive relationship between house price and size, meaning that larger houses tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes as evidenced by the fact that the purple line is highest, followed by condition 4 and 3. As for condition 1 and 2, this pattern isn’t as clear. Recall from the univariate barplot of condition in Figure 11.3, there are very few houses of condition 1 or 2. Let’s also show a faceted version of just the interaction model in Figure 11.7. It is now much more apparent that there are very few houses of condition 1 or 2. ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.4) + geom_smooth(method = &quot;lm&quot;, se = FALSE) + labs(y = &quot;log10 price&quot;, x = &quot;log10 size&quot;, title = &quot;House prices in Seattle&quot;) + facet_wrap(~condition) FIGURE 11.7: Faceted plot of interaction model. Which exploratory visualization of the interaction model is better, the one in the left-hand plot of Figure 11.6 or the faceted version in Figure 11.7? There is no universal right answer. 
You need to make a choice depending on what you want to convey, and own that choice. 11.1.3 Regression modeling Which of the two models in Figure 11.6 is “better”? The interaction model in the left-hand plot or the parallel slopes model in the right-hand plot? We had a similar discussion in Subsection 6.3.1 on model selection. Recall that we stated that we should only favor more complex models if the additional complexity is warranted. In this case, the more complex model is the interaction model since it considers five intercepts and five slopes total. This is in contrast to the parallel slopes model which considers five intercepts but only one common slope. Is the additional complexity of the interaction model warranted? Looking at the left-hand plot of Figure 11.6, we’re of the opinion that it is, as evidenced by the slight x-like pattern to some of the lines. Therefore, we’ll focus the rest of this analysis only on the interaction model. This visual approach is somewhat subjective however, so feel free to disagree! What are the 5 different slopes and 5 different intercepts for the interaction model? We can obtain these values from the regression table. Recall our two-step process for getting the regression table: # Fit regression model: price_interaction &lt;- lm(log10_price ~ log10_size * condition, data = house_prices) # Get regression table: get_regression_table(price_interaction) TABLE 11.1: Regression table for interaction model. 
term estimate std_error statistic p_value lower_ci upper_ci intercept 3.330 0.451 7.380 0.000 2.446 4.215 log10_size 0.690 0.148 4.652 0.000 0.399 0.980 condition2 0.047 0.498 0.094 0.925 -0.930 1.024 condition3 -0.367 0.452 -0.812 0.417 -1.253 0.519 condition4 -0.398 0.453 -0.879 0.380 -1.286 0.490 condition5 -0.883 0.457 -1.931 0.053 -1.779 0.013 log10_size:condition2 -0.024 0.163 -0.148 0.882 -0.344 0.295 log10_size:condition3 0.133 0.148 0.893 0.372 -0.158 0.424 log10_size:condition4 0.146 0.149 0.979 0.328 -0.146 0.437 log10_size:condition5 0.310 0.150 2.067 0.039 0.016 0.604 Recall we saw in Section 6.1.2 how to interpret a regression table when there exist both numerical and categorical explanatory variables. Let’s now do the same for all 10 values in the estimate column of Table 11.1. In this case, the “baseline for comparison” group for the categorical variable condition are the condition 1 houses, since “1” comes first alphanumerically. Thus, the intercept and log10_size values are the intercept and slope for log10_size for this baseline group. Next, the condition2 through condition5 terms are the offsets in intercepts relative to the condition 1 intercept. Finally, the log10_size:condition2 through log10_size:condition5 are the offsets in slopes for log10_size relative to the condition 1 slope for log10_size. Let’s simplify this by writing out the equation of each of the five regression lines using these 10 estimate values. 
We’ll write out each line in the following format: \\[ \\widehat{\\log10(\\text{price})} = \\hat{\\beta}_0 + \\hat{\\beta}_{\\text{size}} \\cdot \\log10(\\text{size}) \\] Condition 1: \\(\\widehat{\\log10(\\text{price})} = 3.33 + 0.69 \\cdot \\log10(\\text{size})\\) Condition 2: \\(\\widehat{\\log10(\\text{price})} = (3.33 + 0.047) + (0.69 - 0.024) \\cdot \\log10(\\text{size}) = 3.38 + 0.666 \\cdot \\log10(\\text{size})\\) Condition 3: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.367) + (0.69 + 0.133) \\cdot \\log10(\\text{size}) = 2.96 + 0.823 \\cdot \\log10(\\text{size})\\) Condition 4: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.398) + (0.69 + 0.146) \\cdot \\log10(\\text{size}) = 2.93 + 0.836 \\cdot \\log10(\\text{size})\\) Condition 5: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.883) + (0.69 + 0.31) \\cdot \\log10(\\text{size}) = 2.45 + 1 \\cdot \\log10(\\text{size})\\) These correspond to the regression lines in the left-hand plot of Figure 11.6 and the faceted plot in Figure 11.7. For homes of all 5 condition types, as the size of the house increases, the price increases. This is what most would expect. However, the rate of increase of price with size is fastest for the homes with condition 3, 4, and 5 of 0.823, 0.836, and 1 respectively. These are the three largest slopes out of the five. 11.1.4 Making predictions Say you’re a realtor and someone calls you asking you how much their home will sell for. They tell you that it’s in condition = 5 and is sized 1900 square feet. What do you tell them? Let’s use the interaction model we fit to make predictions! We first make this prediction visually in Figure 11.8. The predicted log10_price of this house is marked with a black dot. 
This is where the following two lines intersect: The purple regression line for the condition = 5 homes and The vertical dashed black line at log10_size equals 3.28, since our predictor variable is the log10-transformed square feet of living space of \\(\\log10(1900) = 3.28\\). FIGURE 11.8: Interaction model with prediction. Eyeballing it, the predicted log10_price seems to be around 5.75. Let’s now obtain the exact numerical value for the prediction using the equation of the regression line for the condition = 5 houses, being sure to log10() the square footage first. 2.45 + 1 * log10(1900) [1] 5.73 This value is very close to our earlier visually made prediction of 5.75. But wait! Is our prediction for the price of this house $5.75? No! Remember that we are using log10_price as our outcome variable! So if we want a prediction in dollar units of price, we need to un-log this by taking a power of 10 as described in Appendix A.3. 10^(2.45 + 1 * log10(1900)) [1] 535493 So our predicted price for this home of condition 5 and size 1900 square feet is $535,493. Learning check (LC11.1) Repeat the regression modeling in Subsection 11.1.3 and the prediction making you just did on the house of condition 5 and size 1900 square feet in Subsection 11.1.4, but using the parallel slopes model you visualized in Figure 11.6. Hint: it’s $524,807! 11.2 Case study: Effective data storytelling As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, you’ve seen the value of statistical inference as a process to come to conclusions about a population by using sampling. 
Lastly, you’ve explored how to fit linear regression models and the importance of checking the conditions required so that all confidence intervals and hypothesis tests have valid interpretations. All throughout, you’ve learned many computational techniques and focused on writing R code that’s reproducible. We now present another set of case studies, but this time on the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather engulf them in understanding the importance that data plays in our lives through storytelling. 11.2.1 Bechdel test for Hollywood gender representation We recommend you read and analyze Walt Hickey’s FiveThirtyEight.com article “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.” In it, Walt Hickey did a study across several decades of how many movies pass the Bechdel test, an informal test of gender representation in a movie created by Alison Bechdel. As you read over the article, think carefully about how Walt is using data, graphics, and analyses to tell the reader a story. In the spirit of reproducibility, FiveThirtyEight has also shared the data and R code that they used for this article. You can also find the data used in many more of their articles on their GitHub page. ModernDive co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the fivethirtyeight package which provides access to these datasets. For a complete list of all 107 datasets included in the fivethirtyeight package, check out the package webpage at https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html. Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using dplyr, ggplot2, and other packages in the tidyverse are available here. For example, a vignette showing how to reproduce one of the plots at the end of the article on the Bechdel test is available here. 
11.2.2 US Births in 1999 Here is another example involving the US_births_1994_2003 data frame included in the fivethirtyeight package. This data provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame including a link to the original article on FiveThirtyEight.com, check out the help file by running ?US_births_1994_2003 in the console. It’s always a good idea to preview your data, either by using RStudio’s spreadsheet View() function or using glimpse() from the dplyr package: glimpse(US_births_1994_2003) Observations: 3,652 Variables: 6 $ year &lt;int&gt; 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1… $ month &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ date_of_month &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … $ date &lt;date&gt; 1994-01-01, 1994-01-02, 1994-01-03, 1994-01-04, 1994-0… $ day_of_week &lt;ord&gt; Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu… $ births &lt;int&gt; 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79… We’ll focus on the number of births for each date, but only for births that occurred in 1999. Recall from Section 3.2 we can do this using the filter() function from the dplyr package: US_births_1999 &lt;- US_births_1994_2003 %&gt;% filter(year == 1999) As discussed in Section 2.4, since date is a notion of time and thus has sequential ordering to it, a linegraph would be a more appropriate visualization to use than a scatterplot. In other words, we should use a geom_line() instead of geom_point(). Recall that such plots are called time series plots. ggplot(US_births_1999, aes(x = date, y = births)) + geom_line() + labs(x = &quot;Date&quot;, y = &quot;Number of births&quot;, title = &quot;US Births in 1999&quot;) FIGURE 11.9: Number of births in US in 1999. We see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. 
However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike? Let’s sort the rows of US_births_1999 in descending order of the number of births. Recall from Section 3.6 that we can use the arrange() function from the dplyr package to do this, making sure to sort births in descending order: US_births_1999 %&gt;% arrange(desc(births)) # A tibble: 365 x 6 year month date_of_month date day_of_week births &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;date&gt; &lt;ord&gt; &lt;int&gt; 1 1999 9 9 1999-09-09 Thurs 14540 2 1999 12 21 1999-12-21 Tues 13508 3 1999 9 8 1999-09-08 Wed 13437 4 1999 9 21 1999-09-21 Tues 13384 5 1999 9 28 1999-09-28 Tues 13358 6 1999 7 7 1999-07-07 Wed 13343 7 1999 7 8 1999-07-08 Thurs 13245 8 1999 8 17 1999-08-17 Tues 13201 9 1999 9 10 1999-09-10 Fri 13181 10 1999 12 28 1999-12-28 Tues 13158 # … with 355 more rows The date with the highest number of births (14,540) is in fact 1999-09-09. If we write down this date in month/day/year format (a standard format in the US), the date with the highest number of births is 9/9/99! All nines! Could it be that parents deliberately induced labor at a higher rate on this date? Maybe? Whatever the cause may be, this fact makes a fun story! Learning check (LC11.2) What date between 1994 and 2003 has the lowest number of births in the US? What story could you tell about why this is the case? Time to think with data and further tell the story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? We leave these questions to you as the reader to explore and examine. Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with! 11.2.3 Script of R code An R script file of all R code used in this chapter is available here. 
Concluding remarks Now that you’ve made it to this point in the book, we suspect that you know a thing or two about how to work with data in R! You’ve also gained a lot of knowledge about how to use simulation techniques for statistical inference and how these techniques help build intuition about traditional theory-based inferential methods like the \\(t\\)-test. The hope is that you’ve come to appreciate the power of data in all respects, such as data wrangling, tidying datasets, data visualization, data modeling, and statistical inference. In our opinion, however, data visualization may be the most important tool for a data scientist to have in their toolbox. If you can create truly beautiful graphics that display information in ways that the reader can clearly understand, you have great power to tell your tale with data. Let’s hope that these skills help you tell great stories with data into the future. Thanks for coming along this journey as we dove into modern data analysis using R and the tidyverse! "],
-["A-appendixA.html", "A Statistical Background A.1 Basic statistical terms A.2 Normal distribution A.3 log10 transformations", " A Statistical Background A.1 Basic statistical terms Note that all the following statistical terms apply only to numerical variables, except the distribution which can exist for both numerical and categorical variables. A.1.1 Mean The mean is the most commonly reported measure of center. It is commonly called the average though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have \\(n\\) data points, the mean is given by: \\[Mean = \\frac{x_1 + x_2 + \\cdots + x_n}{n}\\] A.1.2 Median The median is calculated by first sorting a variable’s data from smallest to largest. After sorting the data, the middle element in the list is the median. If the middle falls between two values, then the median is the mean of those two middle values. A.1.3 Standard deviation We will next discuss the standard deviation of a variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far we expect a given data value will be from its mean: \\[Standard \\, deviation = \\sqrt{\\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \\cdots + (x_n - Mean)^2}{n - 1}}\\] A.1.4 Five-number summary The five-number summary consists of five summary statistics: the minimum, the first quartile AKA 25th percentile, the second quartile AKA median AKA 50th percentile, the third quartile AKA 75th percentile, and the maximum. The five-number summary of a variable is used when constructing boxplots, as seen in Section 2.7. The quartiles are calculated as first quartile (\\(Q_1\\)): the median of the first half of the sorted data third quartile (\\(Q_3\\)): the median of the second half of the sorted data The interquartile range (IQR) is defined as \\(Q_3 - Q_1\\) and is a measure of how spread out the middle 50% of values are. 
The IQR corresponds to the length of a box in a boxplot. The median and the interquartile range are not influenced by the presence of outliers in the ways that the mean and standard deviation are. They are, thus, recommended for skewed datasets. We say in this case that the median and interquartile range are more robust to outliers. A.1.5 Distribution The distribution of a variable shows how frequently different values of a variable occur. Looking at a visualization of a distribution can show where the values are centered, show how the values vary, and give some information about where a typical value might fall. It can also alert you to the presence of outliers. Recall from Chapter 2 that we can visualize the distribution of a numerical variable using a histogram and that we can visualize the distribution of a categorical variable using a barplot. A.1.6 Outliers Outliers correspond to values in the dataset that fall far outside the range of “ordinary” values. In context of a boxplot, by default they correspond to values below \\(Q_1 - (1.5 * IQR)\\) or above \\(Q_3 + (1.5 * IQR)\\). A.2 Normal distribution Let’s discuss one particular kind of distribution: normal distributions. Such bell-shaped distributions are defined by two values: 1) the mean \\(\\mu\\) (“mu”) which locates the center of the distribution and 2) the standard deviation \\(\\sigma\\) (“sigma”) which determines the variation of the distribution. In Figure A.1, we plot three normal distributions where: The solid normal curve has mean \\(\\mu\\) = 5 and standard deviation \\(\\sigma\\) = 2. The dashed normal curve has mean \\(\\mu\\) = 5 and standard deviation \\(\\sigma\\) = 5. The dotted normal curve has mean \\(\\mu\\) = 15 and standard deviation \\(\\sigma\\) = 2. FIGURE A.1: Three normal distributions. Notice how the solid and dashed line normal curves have the same center due to their common mean \\(\\mu\\) = 5. 
However the dashed line normal curve is wider due to its larger standard deviation of \\(\\sigma\\) = 5. On the other hand, the solid and dotted line normal curves have the same variation due to their common standard deviation \\(\\sigma\\) = 2. However, they are centered at different locations. When the mean \\(\\mu\\) = 0 and the standard deviation \\(\\sigma\\) = 1, the normal distribution has a special name: the standard normal distribution or the \\(z\\)-curve. Furthermore, if a variable follows a normal curve, there are three rules of thumb we can use: 68% of values will lie within \\(\\pm\\) 1 standard deviation of the mean. 95% of values will lie within \\(\\pm\\) 1.96 \\(\\approx\\) 2 standard deviations of the mean. 99.7% of values will lie within \\(\\pm\\) 3 standard deviations of the mean. Let’s illustrate this on a standard normal curve in Figure A.2. The dashed lines are at -3, -1.96, -1, 0, 1, 1.96, and 3. These 7 lines cut up the x-axis into 8 segments. The areas under the normal curve for each of the 8 segments are marked and add up to 100%. For example: The middle two segments represent the interval -1 to 1. The shaded area above this interval represents 34% + 34% = 68% of the area under the curve. In other words, 68% of values. The middle four segments represent the interval -1.96 to 1.96. The shaded area above this interval represents 13.5% + 34% + 34% + 13.5%= 95% of the area under the curve. In other words, 95% of values. The middle six segments represent the interval -3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values. FIGURE A.2: Rules of thumb about areas under normal curves Learning check Say you have a normal distribution with mean \\(\\mu\\) = 6 and standard deviation \\(\\sigma\\) = 3. (LC11.3) What proportion of the area under the normal curve is less than 3? Greater than 12? Between 0 and 12? 
(LC11.4) What is the 2.5th percentile of the area under the normal curve? The 95th percentile? The 100th percentile? A.3 log10 transformations At its simplest, log10 transformations return base 10 logarithms. For example, since \\(1000 = 10^3\\), running log10(1000) returns 3 in R. To undo a log10-transformation, we raise 10 to this value. For example, to undo the previous log10-transformation and return the original value of 1000, we raise 10 to the power of 3 by running 10^(3) in R, which returns 1000. Log-transformations allow us to focus on changes in orders of magnitude. In other words, they allow us to focus on multiplicative changes instead of additive ones. Let’s illustrate this idea in Table A.1 with examples of prices of consumer goods in US dollars. TABLE A.1: log10-transformed prices, orders of magnitude, and examples Price log10(Price) Order of magnitude Examples $1 0 Singles Cups of coffee $10 1 Tens Books $100 2 Hundreds Mobile phones $1,000 3 Thousands High definition TV’s $10,000 4 Tens of thousands Cars $100,000 5 Hundreds of thousands Luxury cars &amp; houses $1,000,000 6 Millions Luxury houses Let’s make some remarks about log10-transformations based on Table A.1: When purchasing a cup of coffee, we tend to think of prices ranging in single dollars. Ex: $2 or $3. However when purchasing a mobile phone, we don’t tend to think of their prices in units of single dollars such as $313 or $727. Instead, we tend to think of their prices in units of hundreds of dollars. Ex: $300 or $700. Thus cups of coffee and mobile phones are of different orders of magnitude of price. Let’s say we want to know the log10-transformed value of $76. This would be hard to compute exactly without a calculator. However, since $76 is between $10 and $100 and since log10(10) = 1 and log10(100) = 2, we know log10(76) will be between 1 and 2. In fact, log10(76) is 1.880814. log10-transformations are monotonic, meaning they preserve orders. 
So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B). Most importantly, increments of one in log10-scale correspond to relative multiplicative changes in the original scale and not absolute additive changes. For example, increasing a log10(Price) from 3 to 4 corresponds to a multiplicative increase by a factor of 10: $1,000 to $10,000. "],
-["B-appendixB.html", "B Inference Examples Needed packages B.1 Inference mind map B.2 One mean B.3 One proportion B.4 Two proportions B.5 Two means (independent samples) B.6 Two means (paired samples)", " B Inference Examples This appendix is designed to provide you with examples of the five basic hypothesis tests and their corresponding confidence intervals. Traditional theory-based methods as well as computational-based methods are presented. Note: This appendix is still under construction. If you would like to contribute, please check us out on GitHub at https://github.com/moderndive/moderndive_book. Please check out our sneak peek of infer below in the meantime. For more details on infer visit https://infer.netlify.com/. Needed packages library(dplyr) library(ggplot2) library(infer) library(knitr) library(kableExtra) library(readr) library(janitor) B.1 Inference mind map To help you better navigate and choose the appropriate analysis, we’ve created a mind map on http://coggle.it available here and below. FIGURE B.1: Mind map for Inference. B.2 One mean B.2.1 Problem statement The National Survey of Family Growth conducted by the Centers for Disease Control gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. One of the variables collected on this survey is the age at first marriage. 5,534 randomly sampled US women between 2006 and 2010 completed the survey. The women sampled here had been married at least once. Do we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years? (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 4]) B.2.2 Competing hypotheses In words Null hypothesis: The mean age of first marriage for all US women from 2006 to 2010 is equal to 23 years. Alternative hypothesis: The mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years. 
In symbols (with annotations) \\(H_0: \\mu = \\mu_{0}\\), where \\(\\mu\\) represents the mean age of first marriage for all US women from 2006 to 2010 and \\(\\mu_0\\) is 23. \\(H_A: \\mu &gt; 23\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.2.3 Exploring the sample data age_at_marriage &lt;- read_csv(&quot;https://moderndive.com/data/ageAtMar.csv&quot;) age_summ &lt;- age_at_marriage %&gt;% summarize(sample_size = n(), mean = mean(age), sd = sd(age), minimum = min(age), lower_quartile = quantile(age, 0.25), median = median(age), upper_quartile = quantile(age, 0.75), max = max(age)) kable(age_summ) %&gt;% kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), latex_options = c(&quot;hold_position&quot;)) sample_size mean sd minimum lower_quartile median upper_quartile max 5534 23.4 4.72 10 20 23 26 43 The histogram below also shows the distribution of age. ggplot(data = age_at_marriage, mapping = aes(x = age)) + geom_histogram(binwidth = 3, color = &quot;white&quot;) The observed statistic of interest here is the sample mean: x_bar &lt;- age_at_marriage %&gt;% specify(response = age) %&gt;% calculate(stat = &quot;mean&quot;) x_bar # A tibble: 1 x 1 stat &lt;dbl&gt; 1 23.4402 Guess about statistical significance We are looking to see if the observed sample mean of 23.44 is statistically greater than \\(\\mu_0 = 23\\). They seem to be quite close, but we have a large sample size here. Let’s guess that the large sample size will lead us to reject this practically small difference. B.2.4 Non-traditional methods Bootstrapping for hypothesis test In order to look to see if the observed sample mean of 23.44 is statistically greater than \\(\\mu_0 = 23\\), we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 5534 was selected. 
We can use the idea of bootstrapping to simulate the population from which the sample came and then generate samples from that simulated population to account for sampling variability. Recall how bootstrapping would apply in this context: Sample with replacement from our original sample of 5534 women and repeat this process 10,000 times, calculate the mean for each of the 10,000 bootstrap samples created in Step 1., combine all of these bootstrap statistics calculated in Step 2 into a boot_distn object, and shift the center of this distribution over to the null value of 23. (This is needed since it will be centered at 23.44 via the process of bootstrapping.) set.seed(2018) null_distn_one_mean &lt;- age_at_marriage %&gt;% specify(response = age) %&gt;% hypothesize(null = &quot;point&quot;, mu = 23) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) null_distn_one_mean %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a right-tailed test so we will be looking for values that are greater than or equal to 23.44 for our \\(p\\)-value. null_distn_one_mean %&gt;% visualize(obs_stat = x_bar, direction = &quot;greater&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_one_mean %&gt;% get_pvalue(obs_stat = x_bar, direction = &quot;greater&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0 So our \\(p\\)-value is 0 and we reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are far into the tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu\\) using our sample data using bootstrapping. Note that we don’t need to shift this distribution since we want the center of our confidence interval to be our point estimate \\(\\bar{x}_{obs} = 23.44\\). 
boot_distn_one_mean &lt;- age_at_marriage %&gt;% specify(response = age) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) ci &lt;- boot_distn_one_mean %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 23.3159 23.5651 boot_distn_one_mean %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 23 is not contained in this confidence interval as a plausible value of \\(\\mu\\) (the unknown population mean) and the entire interval is larger than 23. This matches with our hypothesis test results of rejecting the null hypothesis in favor of the alternative (\\(\\mu &gt; 23\\)). Interpretation: We are 95% confident the true mean age of first marriage for all US women from 2006 to 2010 is between 23.316 and 23.565. B.2.5 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are collected independently. The cases are selected independently through random sampling so this condition is met. Approximately normal: The distribution of the response variable should be normal or the sample size should be at least 30. The histogram for the sample above does show some skew. The Q-Q plot below also shows some skew. ggplot(data = age_at_marriage, mapping = aes(sample = age)) + stat_qq() The sample size here is quite large though (\\(n = 5534\\)) so both conditions are met. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean \\(\\mu\\). A good guess is the sample mean \\(\\bar{X}\\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. 
We are looking to see how likely is it for us to have observed a sample mean of \\(\\bar{x}_{obs} = 23.44\\) or larger assuming that the population mean is 23 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can “standardize” this original test statistic of \\(\\bar{X}\\) into a \\(T\\) statistic that follows a \\(t\\) distribution with degrees of freedom equal to \\(df = n - 1\\): \\[ T =\\dfrac{ \\bar{X} - \\mu_0}{ S / \\sqrt{n} } \\sim t (df = n - 1) \\] where \\(S\\) represents the standard deviation of the sample and \\(n\\) is the sample size. Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t_test() function to perform this analysis for us. t_test_results &lt;- age_at_marriage %&gt;% infer::t_test(formula = age ~ NULL, alternative = &quot;greater&quot;, mu = 23) t_test_results # A tibble: 1 x 6 statistic t_df p_value alternative lower_ci upper_ci &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; 1 6.93570 5533 2.25216e-12 greater 23.3358 Inf We see here that the \\(t_{obs}\\) value is 6.936. Compute \\(p\\)-value The \\(p\\)-value—the probability of observing an \\(t_{obs}\\) value of 6.936 or more in our null distribution of a \\(t\\) with 5533 degrees of freedom—is essentially 0. State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean was statistically greater than the hypothesized mean has supporting evidence here. Based on this sample, we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years. 
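As a quick check on the reported \\(p\\)-value, the right-tail area can be recomputed from the test statistic and degrees of freedom printed in the t_test() output. This is only a sketch: the variable names t_obs_check, df_check, and p_value_check are illustrative and not part of the original analysis, and the inputs are the rounded values from the output above.

```r
# Values copied (rounded) from the t_test() output shown above
t_obs_check = 6.93570
df_check = 5533

# Right-tailed p-value: area under the t curve at or above the observed statistic
p_value_check = pt(t_obs_check, df = df_check, lower.tail = FALSE)
p_value_check
```

The result should agree with the p_value column of the output up to rounding of the inputs.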
Confidence interval t.test(x = age_at_marriage$age, alternative = &quot;two.sided&quot;, mu = 23)$conf [1] 23.3 23.6 attr(,&quot;conf.level&quot;) [1] 0.95 B.2.6 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met (the large sample size was the driver here) leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.3 One proportion B.3.1 Problem statement The CEO of a large electric utility claims that 80 percent of his 1,000,000 customers are satisfied with the service they receive. To test this claim, the local newspaper surveyed 100 customers, using simple random sampling. 73 were satisfied and the remaining were unsatisfied. Based on these findings from the sample, can we reject the CEO’s hypothesis that 80% of the customers are satisfied? [Tweaked a bit from http://stattrek.com/hypothesis-test/proportion.aspx?Tutorial=AP] B.3.2 Competing hypotheses In words Null hypothesis: The proportion of all customers of the large electric utility satisfied with the service they receive is equal to 0.80. Alternative hypothesis: The proportion of all customers of the large electric utility satisfied with the service they receive is different from 0.80. In symbols (with annotations) \\(H_0: \\pi = p_{0}\\), where \\(\\pi\\) represents the proportion of all customers of the large electric utility satisfied with the service they receive and \\(p_0\\) is 0.8. \\(H_A: \\pi \\ne 0.8\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. 
B.3.3 Exploring the sample data elec &lt;- c(rep(&quot;satisfied&quot;, 73), rep(&quot;unsatisfied&quot;, 27)) %&gt;% as_data_frame() %&gt;% rename(satisfy = value) The bar graph below also shows the distribution of satisfy. ggplot(data = elec, aes(x = satisfy)) + geom_bar() The observed statistic is computed as p_hat &lt;- elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% calculate(stat = &quot;prop&quot;) p_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.73 Guess about statistical significance We are looking to see if the sample proportion of 0.73 is statistically different from \\(p_0 = 0.8\\) based on this sample. They seem to be quite close, and our sample size is not huge here (\\(n = 100\\)). Let’s guess that we do not have evidence to reject the null hypothesis. B.3.4 Non-traditional methods Simulation for hypothesis test In order to look to see if 0.73 is statistically different from 0.8, we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 100 was selected. We can use the idea of an unfair coin to simulate this process. We will simulate flipping an unfair coin (with probability of success 0.8 matching the null hypothesis) 100 times. Then we will keep track of how many heads come up in those 100 flips. Our simulated statistic matches with how we calculated the original statistic \\(\\hat{p}\\): the number of heads (satisfied) out of our total sample of 100. We then repeat this process many times (say 10,000) to create the null distribution looking at the simulated proportions of successes: set.seed(2018) null_distn_one_prop &lt;- elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% hypothesize(null = &quot;point&quot;, p = 0.8) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;prop&quot;) null_distn_one_prop %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. 
Recall this is a two-tailed test so we will be looking for values that are 0.8 - 0.73 = 0.07 away from 0.8 in BOTH directions for our \\(p\\)-value: null_distn_one_prop %&gt;% visualize(obs_stat = p_hat, direction = &quot;both&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_one_prop %&gt;% get_pvalue(obs_stat = p_hat, direction = &quot;both&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.1136 So our \\(p\\)-value is 0.114 and we fail to reject the null hypothesis at the 5% level. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\pi\\) using our sample data. To do so, we use bootstrapping, which involves sampling with replacement from our original sample of 100 survey respondents and repeating this process 10,000 times, calculating the proportion of successes for each of the 10,000 bootstrap samples created in Step 1., combining all of these bootstrap statistics calculated in Step 2 into a boot_distn object, identifying the 2.5th and 97.5th percentiles of this distribution (corresponding to the 5% significance level chosen) to find a 95% confidence interval for \\(\\pi\\), and interpret this confidence interval in the context of the problem. boot_distn_one_prop &lt;- elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;prop&quot;) Just as we use the mean function for calculating the mean over a numerical variable, we can also use it to compute the proportion of successes for a categorical variable where we specify what we are calling a “success” after the ==. (Think about the formula for calculating a mean and how R handles logical statements such as satisfy == &quot;satisfied&quot; for why this must be true.) 
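The parenthetical above can be made concrete with a small base R sketch. It rebuilds the 73 satisfied and 27 unsatisfied responses as a plain character vector (the names satisfy_vec, is_success, and p_hat_check are illustrative and do not appear in the original code) and shows that the mean of a logical comparison is the proportion of successes.

```r
# Rebuild the survey responses: 73 satisfied, 27 unsatisfied
satisfy_vec = c(rep('satisfied', 73), rep('unsatisfied', 27))

# The comparison yields TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0
is_success = satisfy_vec == 'satisfied'

# Sum of the 1s divided by the number of responses: 73 / 100
p_hat_check = mean(is_success)
p_hat_check
```

Because summing the TRUE values counts the successes and dividing by the length gives the sample size in the denominator, the mean is exactly the sample proportion.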
ci &lt;- boot_distn_one_prop %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.64 0.81 boot_distn_one_prop %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0.80 is contained in this confidence interval as a plausible value of \\(\\pi\\) (the unknown population proportion). This matches with our hypothesis test results of failing to reject the null hypothesis. Interpretation: We are 95% confident the true proportion of customers who are satisfied with the service they receive is between 0.64 and 0.81. B.3.5 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are collected independently. The cases are selected independently through random sampling so this condition is met. Approximately normal: The number of expected successes and expected failures is at least 10. This condition is met since 73 and 27 are both greater than 10. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population proportion \\(\\pi\\). A good guess is the sample proportion \\(\\hat{P}\\). Recall that this sample proportion is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely is it for us to have observed a sample proportion of \\(\\hat{p}_{obs} = 0.73\\) or larger assuming that the population proportion is 0.80 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can standardize this original test statistic of \\(\\hat{P}\\) into a \\(Z\\) statistic that follows a \\(N(0, 1)\\) distribution. 
\\[ Z =\\dfrac{ \\hat{P} - p_0}{\\sqrt{\\dfrac{p_0(1 - p_0)}{n} }} \\sim N(0, 1) \\] Observed test statistic While one could compute this observed test statistic by “hand” by plugging the observed values into the formula, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. The calculation has been done in R below for completeness though: p_hat &lt;- 0.73 p0 &lt;- 0.8 n &lt;- 100 (z_obs &lt;- (p_hat - p0) / sqrt( (p0 * (1 - p0)) / n)) [1] -1.75 We see here that the \\(z_{obs}\\) value is around -1.75. Our observed sample proportion of 0.73 is 1.75 standard errors below the hypothesized parameter value of 0.8. Visualize and compute \\(p\\)-value elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% hypothesize(null = &quot;point&quot;, p = 0.8) %&gt;% calculate(stat = &quot;z&quot;) %&gt;% visualize(method = &quot;theoretical&quot;, obs_stat = z_obs, direction = &quot;both&quot;) 2 * pnorm(z_obs) [1] 0.0801 The \\(p\\)-value—the probability of observing an \\(z_{obs}\\) value of -1.75 or more extreme (in both directions) in our null distribution—is around 8%. Note that we could also do this test directly using the prop.test function. stats::prop.test(x = table(elec$satisfy), n = length(elec$satisfy), alternative = &quot;two.sided&quot;, p = 0.8, correct = FALSE) 1-sample proportions test without continuity correction data: table(elec$satisfy), null probability 0.8 X-squared = 3, df = 1, p-value = 0.08 alternative hypothesis: true p is not equal to 0.8 95 percent confidence interval: 0.636 0.807 sample estimates: p 0.73 prop.test does a \\(\\chi^2\\) test here but this matches up exactly with what we would expect: \\(x^2_{obs} = 3.06 = (-1.75)^2 = (z_{obs})^2\\) and the \\(p\\)-values are the same because we are focusing on a two-tailed test. Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping. 
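The agreement between the \\(z\\) test and the \\(\\chi^2\\) test reported by prop.test can be checked numerically. The sketch below reuses the values p_hat, p0, n, and z_obs from the calculation above; chi_sq_check, p_from_z, and p_from_chi_sq are illustrative names introduced only for this check.

```r
# Same inputs as the hand calculation above
p_hat = 0.73
p0 = 0.8
n = 100
z_obs = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Squaring the z statistic gives the chi-squared statistic with 1 degree of freedom
chi_sq_check = z_obs^2

# Two-tailed normal p-value and upper-tail chi-squared p-value
p_from_z = 2 * pnorm(-abs(z_obs))
p_from_chi_sq = pchisq(chi_sq_check, df = 1, lower.tail = FALSE)

c(z_obs = z_obs, chi_sq = chi_sq_check, p_from_z = p_from_z, p_from_chi_sq = p_from_chi_sq)
```

The two \\(p\\)-values coincide because the square of a standard normal random variable follows a \\(\\chi^2\\) distribution with one degree of freedom, which is why the equivalence holds only for the two-tailed test.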
State conclusion We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample proportion was not statistically different from the hypothesized proportion has not been invalidated. Based on this sample, we do not have evidence that the proportion of all customers of the large electric utility satisfied with the service they receive is different from 0.80, at the 5% level. B.3.6 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.4 Two proportions B.4.1 Problem statement A 2010 survey asked 827 randomly sampled registered voters in California “Do you support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Conduct a hypothesis test to determine if the data provide strong evidence that the proportion of college graduates who do not have an opinion on this issue is different than that of non-college graduates. (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 6]) B.4.2 Competing hypotheses In words Null hypothesis: There is no association between having an opinion on drilling and having a college degree for all registered California voters in 2010. Alternative hypothesis: There is an association between having an opinion on drilling and having a college degree for all registered California voters in 2010. 
Another way in words Null hypothesis: The probability that a California voter in 2010 has no opinion on drilling is the same for college graduates as for non-college graduates. Alternative hypothesis: These parameter probabilities are different. In symbols (with annotations) \\(H_0: \\pi_{college} = \\pi_{no\\_college}\\) or \\(H_0: \\pi_{college} - \\pi_{no\\_college} = 0\\), where \\(\\pi\\) represents the probability of not having an opinion on drilling. \\(H_A: \\pi_{college} - \\pi_{no\\_college} \\ne 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.4.3 Exploring the sample data offshore &lt;- read_csv(&quot;https://moderndive.com/data/offshore.csv&quot;) offshore %&gt;% tabyl(college_grad, response) college_grad no opinion opinion no 131 258 yes 104 334 off_summ &lt;- offshore %&gt;% group_by(college_grad) %&gt;% summarize(prop_no_opinion = mean(response == &quot;no opinion&quot;), sample_size = n()) ggplot(offshore, aes(x = college_grad, fill = response)) + geom_bar(position = &quot;fill&quot;) + coord_flip() Guess about statistical significance We are looking to see if a difference exists in the size of the bars corresponding to no opinion for the plot. Based solely on the plot, we have little reason to believe that a difference exists since the bars seem to be about the same size, BUT…it’s important to use statistics to see if that difference is actually statistically significant! 
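The group proportions behind the plot can be recomputed directly from the counts in the tabyl() output, without downloading the data. This is a sketch only: the vectors no_opinion_counts and opinion_counts and the other names below are illustrative and are not objects from the original code.

```r
# Counts copied from the tabyl() output above, indexed by college_grad
no_opinion_counts = c(no = 131, yes = 104)
opinion_counts = c(no = 258, yes = 334)

# Group sizes and the proportion with no opinion in each group
group_sizes = no_opinion_counts + opinion_counts
prop_no_opinion_check = no_opinion_counts / group_sizes
round(prop_no_opinion_check, 3)

# Difference taken as college graduates minus non-college graduates
diff_check = prop_no_opinion_check[['yes']] - prop_no_opinion_check[['no']]
diff_check
```

The proportions come out to roughly 0.337 for non-college graduates and 0.237 for college graduates, so the difference (college minus non-college) is about -0.099.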
B.4.4 Non-traditional methods Collecting summary info The observed statistic is d_hat &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) d_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -0.0993180 Randomization for hypothesis test In order to look to see if the observed sample proportion of no opinion for non-college graduates of 0.337 is statistically different than that for college graduates of 0.237, we need to account for the sample sizes. Note that this is the same as looking to see if \\(\\hat{p}_{grad} - \\hat{p}_{nograd}\\) is statistically different than 0. We also need to determine a process that replicates how the original group sizes of 389 and 438 were selected. We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability. set.seed(2018) null_distn_two_props &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) null_distn_two_props %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a two-tailed test so we will be looking for values that are less than or equal to -0.099 or greater than or equal to 0.099 for our \\(p\\)-value. 
null_distn_two_props %&gt;% visualize(obs_stat = d_hat, direction = &quot;two_sided&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_two_props %&gt;% get_pvalue(obs_stat = d_hat, direction = &quot;two_sided&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.003 So our \\(p\\)-value is 0.003 and we reject the null hypothesis at the 5% level. You can also see from the histogram above that we are far into the tails of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\pi_{college} - \\pi_{no\\_college}\\) using our sample data with bootstrapping. boot_distn_two_props &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) ci &lt;- boot_distn_two_props %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -0.161207 -0.0378500 boot_distn_two_props %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0 is not contained in this confidence interval as a plausible value of \\(\\pi_{college} - \\pi_{no\\_college}\\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. Since zero is not a plausible value of the population parameter, we have evidence that the proportion of college graduates in California with no opinion on drilling is different than that of non-college graduates. Interpretation: We are 95% confident the true proportion of college graduates with no opinion on offshore drilling in California is between 0.038 and 0.161 smaller than that for non-college graduates. B.4.5 Traditional methods B.4.6 Check conditions Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met. 
Independent observations: Each case that was selected must be independent of all the other cases selected. This condition is met since cases were selected at random to observe. Sample size: The number of pooled successes and pooled failures must be at least 10 for each group. We need to first figure out the pooled success rate: \\[\\hat{p}_{obs} = \\dfrac{131 + 104}{827} = 0.28.\\] We now determine expected (pooled) success and failure counts: \\(0.28 \\cdot (131 + 258) = 108.92\\), \\(0.72 \\cdot (131 + 258) = 280.08\\) \\(0.28 \\cdot (104 + 334) = 122.64\\), \\(0.72 \\cdot (104 + 334) = 315.36\\) All four counts are at least 10, so this condition is met. Independent selection of samples: The cases are not paired in any meaningful way. We have no reason to suspect that a college graduate selected would have any relationship to a non-college graduate selected. B.4.7 Test statistic The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample proportions corresponding to no opinion on drilling (\\(\\hat{p}_{college, obs} - \\hat{p}_{no\\_college, obs}\\) = -0.099) is statistically different than 0. Assuming that conditions are met and the null hypothesis is true, we can use the standard normal distribution to standardize the difference in sample proportions (\\(\\hat{P}_{college} - \\hat{P}_{no\\_college}\\)) using the standard error of \\(\\hat{P}_{college} - \\hat{P}_{no\\_college}\\) and the pooled estimate: \\[ Z =\\dfrac{ (\\hat{P}_1 - \\hat{P}_2) - 0}{\\sqrt{\\dfrac{\\hat{P}(1 - \\hat{P})}{n_1} + \\dfrac{\\hat{P}(1 - \\hat{P})}{n_2} }} \\sim N(0, 1) \\] where \\(\\hat{P} = \\dfrac{\\text{total number of successes} }{ \\text{total number of cases}}.\\) Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the infer package to compute the observed statistic for us. 
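Before relying on packaged functions, the pooled formula above can be evaluated directly from the table counts. The sketch below assumes 1 = college graduates and 2 = non-college graduates, matching the order used elsewhere in this example; every variable name is illustrative rather than taken from the original code.

```r
# Counts of no opinion responses and group sizes from the tabyl() output
x_college = 104
n_college = 438
x_no_college = 131
n_no_college = 389

# Pooled proportion: total number of successes over total number of cases
p_pooled = (x_college + x_no_college) / (n_college + n_no_college)

# Standard error under the null hypothesis, using the pooled estimate for both groups
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n_college + 1 / n_no_college))

# Observed z statistic: college minus non-college, divided by the standard error
z_by_hand = (x_college / n_college - x_no_college / n_no_college) / se_pooled
z_by_hand
```

The value should land close to -3.16, which is what the packaged calculation that follows reports.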
z_hat &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% calculate(stat = &quot;z&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) z_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -3.16081 The observed difference in sample proportions is 3.16 standard errors smaller than 0. The \\(p\\)-value—the probability of observing a \\(Z\\) value of -3.16 or more extreme (in both directions) in our null distribution—is 0.0016. This can also be calculated in R directly: 2 * pnorm(-3.16, lower.tail = TRUE) [1] 0.00158 B.4.8 State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference did not exist in the proportions of no opinion on offshore drilling between college educated and non-college educated Californians was not validated. We do have evidence to suggest that there is a dependency between college graduation and having an opinion on offshore drilling for Californians. B.4.9 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.5 Two means (independent samples) B.5.1 Problem statement Average income varies from one region of the country to another, and it often reflects both lifestyles and regional living expenses. Suppose a new graduate is considering a job in two locations, Cleveland, OH and Sacramento, CA, and he wants to see whether the average income in one of these cities is higher than the other. 
He would like to conduct a hypothesis test based on two randomly selected samples from the 2000 Census. (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 5]) B.5.2 Competing hypotheses In words Null hypothesis: There is no association between income and location (Cleveland, OH and Sacramento, CA). Alternative hypothesis: There is an association between income and location (Cleveland, OH and Sacramento, CA). Another way in words Null hypothesis: The mean income is the same for both cities. Alternative hypothesis: The mean income is different for the two cities. In symbols (with annotations) \\(H_0: \\mu_{sac} = \\mu_{cle}\\) or \\(H_0: \\mu_{sac} - \\mu_{cle} = 0\\), where \\(\\mu\\) represents the average income. \\(H_A: \\mu_{sac} - \\mu_{cle} \\ne 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.5.3 Exploring the sample data cle_sac &lt;- read.delim(&quot;https://moderndive.com/data/cleSac.txt&quot;) %&gt;% rename(metro_area = Metropolitan_area_Detailed, income = Total_personal_income) %&gt;% na.omit() inc_summ &lt;- cle_sac %&gt;% group_by(metro_area) %&gt;% summarize(sample_size = n(), mean = mean(income), sd = sd(income), minimum = min(income), lower_quartile = quantile(income, 0.25), median = median(income), upper_quartile = quantile(income, 0.75), max = max(income)) kable(inc_summ) %&gt;% kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), latex_options = c(&quot;hold_position&quot;)) metro_area sample_size mean sd minimum lower_quartile median upper_quartile max Cleveland_ OH 212 27467 27681 0 8475 21000 35275 152400 Sacramento_ CA 175 32428 35774 0 8050 20000 49350 206900 The boxplot below also shows the mean for each group highlighted by the red dots. 
ggplot(cle_sac, aes(x = metro_area, y = income)) + geom_boxplot() + stat_summary(fun.y = &quot;mean&quot;, geom = &quot;point&quot;, color = &quot;red&quot;) Guess about statistical significance We are looking to see if a difference exists in the mean income of the two levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that no difference exists. The distributions of income seem similar and the means fall in roughly the same place. B.5.4 Non-traditional methods Collecting summary info We now compute the observed statistic: d_hat &lt;- cle_sac %&gt;% specify(income ~ metro_area) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Sacramento_ CA&quot;, &quot;Cleveland_ OH&quot;)) d_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 4960.48 Randomization for hypothesis test In order to look to see if the observed sample mean for Sacramento of 32427.543 is statistically different than that for Cleveland of 27467.066, we need to account for the sample sizes. Note that this is the same as looking to see if \\(\\bar{x}_{sac} - \\bar{x}_{cle}\\) is statistically different than 0. We also need to determine a process that replicates how the original group sizes of 175 (Sacramento) and 212 (Cleveland) were selected. We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability. set.seed(2018) null_distn_two_means &lt;- cle_sac %&gt;% specify(income ~ metro_area) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Sacramento_ CA&quot;, &quot;Cleveland_ OH&quot;)) null_distn_two_means %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. 
Recall this is a two-tailed test so we will be looking for values that are greater than or equal to 4960.477 or less than or equal to -4960.477 for our \\(p\\)-value. null_distn_two_means %&gt;% visualize(obs_stat = d_hat, direction = &quot;both&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_two_means %&gt;% get_pvalue(obs_stat = d_hat, direction = &quot;both&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.1298 So our \\(p\\)-value is 0.13 and we fail to reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are not very far into the tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu_{sac} - \\mu_{cle}\\) using our sample data with bootstrapping. Here we will bootstrap each of the groups with replacement instead of shuffling. This is done using the groups argument in the resample function to fix the size of each group to be the same as the original group sizes of 175 for Sacramento and 212 for Cleveland. boot_distn_two_means &lt;- cle_sac %&gt;% specify(income ~ metro_area) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Sacramento_ CA&quot;, &quot;Cleveland_ OH&quot;)) ci &lt;- boot_distn_two_means %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -1445.53 11307.8 boot_distn_two_means %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0 is contained in this confidence interval as a plausible value of \\(\\mu_{sac} - \\mu_{cle}\\) (the unknown population parameter). This matches with our hypothesis test results of failing to reject the null hypothesis. Since zero is a plausible value of the population parameter, we do not have evidence that Sacramento incomes are different than Cleveland incomes. 
Interpretation: We are 95% confident the true mean yearly income for those living in Sacramento is between 1445.53 dollars smaller and 11307.82 dollars higher than for Cleveland. Note: You could also use the null distribution based on randomization with a shift to have its center at \\(\\bar{x}_{sac} - \\bar{x}_{cle} = \\$4960.48\\) instead of at 0 and calculate its percentiles. The confidence interval produced via this method should be comparable to the one done using bootstrapping above. B.5.5 Traditional methods Check conditions Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are independent in both groups. This condition is met since the cases are randomly selected from each city. Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30. ggplot(cle_sac, aes(x = income)) + geom_histogram(color = &quot;white&quot;, binwidth = 20000) + facet_wrap(~ metro_area) We have some reason to doubt the normality assumption here since both histograms deviate from what a normal model would look like for each group. The sample sizes for each group are greater than 100 though so the assumptions should still apply. Independent samples: The samples should be collected without any natural pairing. There is no mention of there being a relationship between those selected in Cleveland and in Sacramento. B.5.6 Test statistic The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample means (\\(\\bar{x}_{sac, obs} - \\bar{x}_{cle, obs}\\) = 4960.477) is statistically different than 0. 
Assuming that conditions are met and the null hypothesis is true, we can use the \\(t\\) distribution to standardize the difference in sample means (\\(\\bar{X}_{sac} - \\bar{X}_{cle}\\)) using the approximate standard error of \\(\\bar{X}_{sac} - \\bar{X}_{cle}\\) (invoking \\(S_{sac}\\) and \\(S_{cle}\\) as estimates of unknown \\(\\sigma_{sac}\\) and \\(\\sigma_{cle}\\)). \\[ T =\\dfrac{ (\\bar{X}_1 - \\bar{X}_2) - 0}{ \\sqrt{\\dfrac{S_1^2}{n_1} + \\dfrac{S_2^2}{n_2}} } \\sim t (df = min(n_1 - 1, n_2 - 1)) \\] where 1 = Sacramento and 2 = Cleveland with \\(S_1^2\\) and \\(S_2^2\\) the sample variance of the incomes of both cities, respectively, and \\(n_1 = 175\\) for Sacramento and \\(n_2 = 212\\) for Cleveland. Observed test statistic We can compute the observed test statistic with the infer package. Note that the order argument below puts Cleveland first, so the value returned is the negative of the statistic defined above with 1 = Sacramento; this does not affect a two-tailed test. cle_sac %&gt;% specify(income ~ metro_area) %&gt;% calculate(stat = &quot;t&quot;, order = c(&quot;Cleveland_ OH&quot;, &quot;Sacramento_ CA&quot;)) # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -1.50062 We see here that the observed test statistic value is around -1.5. While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. B.5.7 Compute \\(p\\)-value The \\(p\\)-value—the probability of observing a \\(t_{174}\\) value of -1.501 or more extreme (in both directions) in our null distribution—is 0.13. This can also be calculated in R directly: 2 * pt(-1.501, df = min(212 - 1, 175 - 1), lower.tail = TRUE) [1] 0.135 We can also approximate by using the standard normal curve: 2 * pnorm(-1.501) [1] 0.133 Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping. B.5.8 State conclusion We, therefore, do not have sufficient evidence to reject the null hypothesis. 
Our initial guess that a statistically significant difference does not exist in the means was backed by this statistical analysis. We do not have evidence to suggest that the true mean income differs between Cleveland, OH and Sacramento, CA based on this data. B.5.9 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.6 Two means (paired samples) Problem statement Trace metals in drinking water affect the flavor and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water at 10 randomly selected locations on a stretch of river. Do the data suggest that the true average concentration in the surface water is smaller than that of bottom water? (Note that units are not given.) [Tweaked a bit from https://onlinecourses.science.psu.edu/stat500/node/51] B.6.1 Competing hypotheses In words Null hypothesis: The mean concentration in the bottom water is the same as that of the surface water at different paired locations. Alternative hypothesis: The mean concentration in the surface water is smaller than that of the bottom water at different paired locations. In symbols (with annotations) \\(H_0: \\mu_{diff} = 0\\), where \\(\\mu_{diff}\\) represents the mean difference in concentration for surface water minus bottom water. \\(H_A: \\mu_{diff} &lt; 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. 
B.6.2 Exploring the sample data zinc_tidy &lt;- read_csv(&quot;https://moderndive.com/data/zinc_tidy.csv&quot;) We want to look at the differences in surface - bottom for each location: zinc_diff &lt;- zinc_tidy %&gt;% group_by(loc_id) %&gt;% summarize(pair_diff = diff(concentration)) %&gt;% ungroup() Next we calculate the mean difference as our observed statistic: d_hat &lt;- zinc_diff %&gt;% specify(response = pair_diff) %&gt;% calculate(stat = &quot;mean&quot;) d_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -0.0804 The histogram below also shows the distribution of pair_diff. ggplot(zinc_diff, aes(x = pair_diff)) + geom_histogram(binwidth = 0.04, color = &quot;white&quot;) Guess about statistical significance We are looking to see if the sample paired mean difference of -0.08 is statistically less than 0. They seem to be quite close, but we have a small number of pairs here. Let’s guess that we will fail to reject the null hypothesis. B.6.3 Non-traditional methods Bootstrapping for hypothesis test In order to look to see if the observed sample mean difference \\(\\bar{x}_{diff} = -0.0804\\) is statistically less than 0, we need to account for the number of pairs. We also need to determine a process that replicates how the paired data was selected in a way similar to how we calculated our original difference in sample means. Treating the differences as our data of interest, we next use the process of bootstrapping to build other simulated samples and then calculate the mean of the bootstrap samples. We hypothesize that the mean difference is zero. This process is similar to the One Mean example seen above, but using the differences between the two groups as a single sample with a hypothesized mean difference of 0. 
set.seed(2018) null_distn_paired_means &lt;- zinc_diff %&gt;% specify(response = pair_diff) %&gt;% hypothesize(null = &quot;point&quot;, mu = 0) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) null_distn_paired_means %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a left-tailed test so we will be looking for values that are less than or equal to -0.0804 for our \\(p\\)-value. null_distn_paired_means %&gt;% visualize(obs_stat = d_hat, direction = &quot;less&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_paired_means %&gt;% get_pvalue(obs_stat = d_hat, direction = &quot;less&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0 So our \\(p\\)-value is essentially 0 and we reject the null hypothesis at the 5% level. You can also see from the histogram above that we are far into the left tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu_{diff}\\) using our sample data (the calculated differences) with bootstrapping. This is similar to the bootstrapping done in a one sample mean case, except now our data is differences instead of raw numerical data. Note that this code is identical to the pipeline shown in the hypothesis test above except the hypothesize() function is not called. boot_distn_paired_means &lt;- zinc_diff %&gt;% specify(response = pair_diff) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) ci &lt;- boot_distn_paired_means %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -0.112200 -0.0503 boot_distn_paired_means %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0 is not contained in this confidence interval as a plausible value of \\(\\mu_{diff}\\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. 
Since zero is not a plausible value of the population parameter and since the entire confidence interval falls below zero, we have evidence that surface zinc concentration levels are lower, on average, than bottom level zinc concentrations. Interpretation: We are 95% confident the true mean zinc concentration on the surface is between 0.05 units and 0.11 units smaller than on the bottom. B.6.4 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations among pairs are independent. The locations are selected independently through random sampling so this condition is met. Approximately normal: The distribution of the population of differences is normal or the number of pairs is at least 30. The histogram above does show some skew so we have reason to doubt the population being normal based on this sample. We also only have 10 pairs which is fewer than the 30 needed. A theory-based test may not be valid here. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean difference \\(\\mu_{diff}\\). A good guess is the sample mean difference \\(\\bar{X}_{diff}\\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely it is for us to have observed a sample mean of \\(\\bar{x}_{diff, obs} = -0.0804\\) or smaller assuming that the population mean difference is 0 (assuming the null hypothesis is true). 
If the conditions are met and assuming \\(H_0\\) is true, we can “standardize” this original test statistic of \\(\\bar{X}_{diff}\\) into a \\(T\\) statistic that follows a \\(t\\) distribution with degrees of freedom equal to \\(df = n - 1\\): \\[ T =\\dfrac{ \\bar{X}_{diff} - 0}{ S_{diff} / \\sqrt{n} } \\sim t (df = n - 1) \\] where \\(S_{diff}\\) represents the standard deviation of the sample differences and \\(n\\) is the number of pairs. Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t_test function on the differences to perform this analysis for us. t_test_results &lt;- zinc_diff %&gt;% infer::t_test(formula = pair_diff ~ NULL, alternative = &quot;less&quot;, mu = 0) t_test_results # A tibble: 1 x 6 statistic t_df p_value alternative lower_ci upper_ci &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; 1 -4.86381 9 0.000445558 less -Inf -0.0500982 We see here that the \\(t_{obs}\\) value is -4.864. Compute \\(p\\)-value The \\(p\\)-value, the probability of observing a \\(t_{obs}\\) value of -4.864 or less in our null distribution of a \\(t\\) with 9 degrees of freedom, is approximately 0.0004. This can also be calculated in R directly: pt(-4.8638, df = nrow(zinc_diff) - 1, lower.tail = TRUE) [1] 0.000446 State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean difference was not statistically less than the hypothesized mean of 0 has been invalidated here. Based on this sample, we have evidence that the mean concentration in the bottom water is greater than that of the surface water at different paired locations. 
B.6.5 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions were not met since the number of pairs was small, but the sample data was not highly skewed. Using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) lead to similar results here. References "],
+["index.html", "Statistical Inference via Data Science A ModernDive into R and the tidyverse Special Announcement", " Statistical Inference via Data Science A ModernDive into R and the tidyverse Chester Ismay and Albert Y. Kim Foreword by Kelly S. McConville November 25, 2019 Special Announcement We’re excited to announce that we’ve signed a book deal with CRC Press! We will be publishing our first fully complete online version of ModernDive in November 2019, with a corresponding print edition to follow in December 2019. Don’t worry though, our content will remain freely available on ModernDive.com. "],
+["foreword.html", "Foreword", " Foreword These are exciting times in statistics and data science education. (I am predicting this statement will continue to be true regardless of whether you are reading this foreword in 2020 or 2050.) But (isn’t there always a but?), as a statistics educator, it can also feel a bit overwhelming to stay on top of all the new statistical, technological, and pedagogical innovations. I find myself constantly asking, “Am I teaching my students the correct content, with the relevant software, and in the most effective way?”. Before I make all of us feel lost at sea, let me point out how great a life raft I have found in ModernDive. In a sea of intro stats textbooks, ModernDive floats to the top of my list, and let me tell you why. (Note my use of ModernDive here refers to the book in its shortened title version. This also matches up nicely with the neat hex sticker Drs. Ismay and Kim created for the cover of ModernDive, too.) My favorite aspect of ModernDive, if I must pick a favorite, is that students gain experience with the whole data analysis pipeline (see Figure 0.2). In particular, ModernDive is one of the few intro stats textbooks that teaches students how to wrangle data. And, while data cleaning may not be as groovy as model building, it’s often a prerequisite step! The world is full of messy data and ModernDive equips students to transform their data via the dplyr package. Speaking of dplyr, students of ModernDive are exposed to the tidyverse suite of R packages. Designed with a common structure, tidyverse functions are written to be easy to learn and use. And, since most intro stats students are programming newbies, ModernDive carefully walks the students through each new function it presents and provides frequent reinforcement through the many Learning checks dispersed throughout the chapters. Overall, ModernDive includes wise choices for the placement of topics. 
Starting with data visualization, ModernDive gets students building ggplot2 graphs early on and then continues to reinforce important concepts graphically throughout the book. After moving through data wrangling and data importing, modeling plays a prominent role, with two chapters devoted to building regression models and a later chapter on inference for regression. Lastly, statistical inference is presented through a computational lens with calculations done via the infer package. I first met Drs. Ismay and Kim while attending their workshop at the 2017 US Conference on Teaching Statistics. They pushed us as participants to put data first and to use computers, instead of math, as the engine for statistical inference. That experience helped me modernize my own intro stats course and introduced me to two really forward-thinking statistics and data science educators. It has been exciting to see ModernDive develop and grow into such a wonderful, timely textbook. I hope you have decided to dive on in! Kelly S. McConville, Reed College "],
+["preface.html", "Preface Introduction for students Introduction for instructors Connect and contribute Acknowledgements About this book", " Preface Help! I’m new to R and RStudio and I need to learn about them! However, I’m completely new to coding! What do I do? If you’re asking yourself this question, then you’ve come to the right place! Start with the “Introduction for students” section. Are you an instructor hoping to use this book in your courses? We recommend you first read the “Introduction for students” section. Then, read the “Introduction for instructors” section for more information on how to teach with this book. Are you looking to connect with and contribute to ModernDive? Then, read the “Connect and contribute” section for information on how. Are you curious about the publishing of this book? Then, read the “About this book” section for more information on the open-source technology, in particular R Markdown and the bookdown package. This is version 1.0.0 of ModernDive published on November 25, 2019. For previous versions of ModernDive, see the “About this book” section below. Introduction for students This book assumes no prerequisites: no algebra, no calculus, and no prior programming/coding experience. This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would. We present a map of your upcoming journey in Figure 0.1. FIGURE 0.1: ModernDive flowchart. You’ll first get started with data in Chapter 1 where you’ll learn about the difference between R and RStudio, start coding in R, install and load your first R packages, and explore your first dataset: all domestic departure flights from a New York City airport in 2013. Then you’ll cover the following three portions of this book (Parts 2 and 4 are combined into a single portion): Data science with tidyverse. 
You’ll assemble your data science toolbox using tidyverse packages. In particular, you’ll Ch.2: Visualize data using the ggplot2 package. Ch.3: Wrangle data using the dplyr package. Ch.4: Learn about the concept of “tidy” data as a standardized data input and output format for all packages in the tidyverse. Furthermore, you’ll learn how to import spreadsheet files into R using the readr package. Data modeling with moderndive. Using these data science tools and helper functions from the moderndive package, you’ll fit your first data models. In particular, you’ll Ch.5: Discover basic regression models with only one explanatory variable. Ch.6: Examine multiple regression models with more than one explanatory variable. Statistical inference with infer. Once again using your newly acquired data science tools, you’ll unpack statistical inference using the infer package. In particular, you’ll: Ch.7: Learn about the role that sampling variability plays in statistical inference and the role that sample size plays in this sampling variability. Ch.8: Construct confidence intervals using bootstrapping. Ch.9: Conduct hypothesis tests using permutation. Data modeling with moderndive (revisited): Armed with your understanding of statistical inference, you’ll revisit and review the models you’ll construct in Ch.5 and Ch.6. In particular, you’ll: Ch.10: Interpret confidence intervals and hypothesis tests in a regression setting. We’ll end with a discussion on what it means to “tell your story with data” in Chapter 11 by presenting example case studies.1 What we hope you will learn from this book We hope that by the end of this book, you’ll have learned how to: Use R and the tidyverse suite of R packages for data science. Fit your first models to data, using a method known as linear regression. Perform statistical inference using sampling, confidence intervals, and hypothesis tests. Tell your story with data using these tools. What do we mean by data stories? 
We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion. Further discussions on data stories can be found in the blog post “Tell a Meaningful Story With Data.” Over the course of this book, you will develop your “data science toolbox,” equipping yourself with tools such as data visualization, data formatting, data wrangling, and data modeling using regression. In particular, this book will lean heavily on data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways are used to convey relationships within data. In general, we’ll use visualization as a way of building almost all of the ideas in this book. To impart the statistical lessons of this book, we have intentionally minimized the number of mathematical formulas used. Instead, you’ll develop a conceptual understanding of statistics using data visualization and computer simulations. We hope this is a more intuitive experience than the way statistics has traditionally been taught in the past and how it is commonly perceived. Finally, you’ll learn the importance of literate programming. By this we mean you’ll learn how to write code that is useful not just for a computer to execute, but also for readers to understand exactly what your analysis is doing and how you did it. This is part of a greater effort to encourage reproducible research (see the “Reproducible research” subsection in this Preface for more details). Hal Abelson coined the phrase that we will follow throughout this book: Programs must be written for people to read, and only incidentally for machines to execute. We understand that there may be challenging moments as you learn to program. Both of us continue to struggle and find ourselves often using web searches to find answers and reach out to colleagues for help. 
In the long run though, we all can solve problems faster and more elegantly via programming. We wrote this book as our way to help you get started and you should know that there is a huge community of R users that are happy to help everyone along as well. This community exists in particular on the internet on various forums and websites such as stackoverflow.com. Data/science pipeline You may think of statistics as just being a bunch of numbers. We commonly hear the phrase “statistician” when listening to broadcasts of sporting events. Statistics (in particular, data analysis), in addition to describing numbers like with baseball batting averages, plays a vital role in all of the sciences. You’ll commonly hear the phrase “statistically significant” thrown around in the media. You’ll see articles that say, “Science now shows that chocolate is good for you.” Underpinning these claims is data analysis. By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be wary. Inside data analysis are many sub-fields that we will discuss throughout this book (though not necessarily in this order): data collection data wrangling data visualization data modeling inference correlation and regression interpretation of results data communication/storytelling These sub-fields are summarized in what Grolemund and Wickham have previously termed the “data/science pipeline” in Figure 0.2. FIGURE 0.2: Data/science pipeline. We will begin by digging into the grey Understand portion of the cycle with data visualization, then with a discussion on what is meant by tidy data and data wrangling, and then conclude by talking about interpreting and discussing the results of our models via Communication. These steps are vital to any statistical analysis. But, why should you care about statistics? There’s a reason that many fields require a statistics course. 
Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn’t be intimidated by statistics. It’s not the beast that it used to be and, paired with computation, you’ll see how reproducible research in the sciences particularly increases scientific knowledge. Reproducible research The most important tool is the mindset, when starting, that the end product will be reproducible. – Keith Baggerly Another goal of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we’ll be trying to help you build new habits. This will take practice and be difficult at times. You’ll see just why it is so important for you to keep track of your code and document it well to help yourself later and any potential collaborators as well. Copying and pasting results from one program into a word processor is not an ideal way to conduct efficient and effective scientific research. It’s much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a variety of programs. In traditional analyses, if an error was made with the original data, we’d need to step through the entire process again: recreate the plots and copy-and-paste all of the new plots and our statistical analysis into our document. This is error prone and a frustrating use of time. We want to help you to get away from this tedious activity so that we can spend more time doing science. We are talking about computational reproducibility. - Yihui Xie Reproducibility means a lot of things in terms of different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as computational reproducibility. 
This refers to being able to pass all of one’s data analysis, datasets, and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent interpreting results and considering assumptions instead of the more error prone way of starting from scratch or following a list of steps that may be different from machine to machine. Final note for students At this point, if you are interested in instructor perspectives on this book, ways to contribute and collaborate, or the technical details of this book’s construction and publishing, then continue with the rest of the chapter. Otherwise, let’s get started with R and RStudio in Chapter 1! Introduction for instructors Resources Here are some resources to help you use ModernDive: We’ve included review questions posed as Learning checks. You can find all the solutions to all Learning checks in Appendix D of the online version of the book at https://moderndive.com/D-appendixD.html. Dr. Jenny Smetzer and Albert Y. Kim have written a series of labs and problem sets. You can find them at https://moderndive.com/labs. You can see the webpages for two courses that use ModernDive: Smith College “SDS192 Introduction to Data Science”: https://rudeboybert.github.io/SDS192/. Smith College “SDS220 Introduction to Probability and Statistics”: https://rudeboybert.github.io/SDS220/. Why did we write this book? This book is inspired by Mathematical Statistics with Resampling and R (Chihara and Hesterberg 2011) OpenIntro: Intro Stat with Randomization and Simulation (Diez, Barr, and Çetinkaya-Rundel 2014) R for Data Science (Grolemund and Wickham 2017) The first book, designed for upper-level undergraduates and graduate students, provides an excellent resource on how to use resampling to impart statistical concepts like sampling distributions using computation instead of large-sample approximations and other mathematical formulas. 
The last two books are free options for learning about introductory statistics and data science, providing an alternative to the many traditionally expensive introductory statistics textbooks. When looking over the introductory statistics textbooks that currently exist, we found there wasn’t one that incorporated many newly developed R packages directly into the text, in particular the many packages included in the tidyverse set of packages, such as ggplot2, dplyr, tidyr, and readr that will be the focus of this book’s first part on “Data Science with tidyverse.” Additionally, there wasn’t an open-source and easily reproducible textbook available that exposed new learners to all four of the learning goals we listed in the “Introduction for students” subsection. We wanted to write a book that could develop theory via computational techniques and help novices master the R language in doing so. Who is this book for? This book is intended for instructors of traditional introductory statistics classes using RStudio, who would like to inject more data science topics into their syllabus. RStudio can be used in either the server version or the desktop version. (This is discussed further in Subsection 1.1.1.) We assume that students taking the class will have no prior algebra, no calculus, nor programming/coding experience. Here are some principles and beliefs we kept in mind while writing this text. If you agree with them, this is the book for you. Blur the lines between lecture and lab With increased availability and accessibility of laptops and open-source non-proprietary statistical software, the strict dichotomy between lab and lecture can be loosened. It’s much harder for students to understand the importance of using software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the grammar rules. Frequent reinforcement is key. 
Focus on the entire data/science research pipeline We believe that the entirety of Grolemund and Wickham’s data/science pipeline as seen in Figure 0.2 should be taught. We heed George Cobb’s call to “minimize prerequisites to research”: students should be answering questions with data as soon as possible. It’s all about the data We leverage R packages for rich, real, and realistic datasets that at the same time are easy-to-load into R, such as the nycflights13 and fivethirtyeight packages. We believe that data visualization is a “gateway drug” for statistics and that the grammar of graphics as implemented in the ggplot2 package is the best way to impart such lessons. However, we often hear: “You can’t teach ggplot2 for data visualization in intro stats!” We, like David Robinson, are much more optimistic and have found our students have been largely successful in learning it. dplyr has made data wrangling much more accessible to novices, and hence much more interesting datasets can be explored. Use simulation/resampling to introduce statistical inference, not probability/mathematical formulas Instead of using formulas, large-sample approximations, and probability tables, we teach statistical concepts using simulation-based inference. This allows for a de-emphasis of traditional probability topics, freeing up room in the syllabus for other topics. Bridges to these mathematical concepts are given as well to help with relation of these traditional topics with more modern approaches. Don’t fence off students from the computation pool, throw them in! Computing skills are essential to working with data in the 21st century. Given this fact, we feel that to shield students from computing is to ultimately do them a disservice. We are not teaching a course on coding/programming per se, but rather just enough of the computational and algorithmic thinking necessary for data analysis. 
Complete reproducibility and customizability We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the whole book! While we have made choices to occasionally hide the code that produces more complicated figures, reviewing the book’s GitHub repository will provide you with all the code (see below). Ultimately the best textbook is one you’ve written yourself. You know best your audience, their background, and their priorities. You know best your own style and the types of examples and problems you like best. Customization is the ultimate end. We encourage you to take what we’ve provided and make it work for your own needs. For more about how to make this book your own, see “About this book” later in this Preface. Connect and contribute If you would like to connect with ModernDive, check out the following links: If you would like to receive periodic updates about ModernDive (roughly every 6 months), please sign up for our mailing list. Contact Albert at albert.ys.kim@gmail.com and Chester at chester.ismay@gmail.com. We’re on Twitter at https://twitter.com/ModernDive. If you would like to contribute to ModernDive, there are many ways! We would love your help and feedback to make this book as great as possible! For example, if you find any errors, typos, or areas for improvement, then please email us or post an issue on our GitHub issues page. If you are familiar with GitHub and would like to contribute, see the “About this book” section. Acknowledgements The authors would like to thank Nina Sonneborn, Dr. Alison Hill, Kristin Bott, Dr. Jenny Smetzer, and the participants of our 2017 and 2019 USCOTS workshops for their feedback and suggestions. We’d also like to thank Dr. 
Andrew Heiss for contributing nearly all of Subsection 1.2.3 on “Errors, warnings, and messages,” Evgeni Chasnovski for creating the new geom_parallel_slopes() extension to the ggplot2 package for plotting parallel slopes models, and Starry Zhou for her many edits to the book. A special thanks goes to Dr. Yana Weinstein, cognitive psychological scientist and co-founder of The Learning Scientists, for her extensive feedback. We were both honored to have Dr. Kelly S. McConville write the Foreword of the book. Dr. McConville is a pioneer in statistics education and was a source of great inspiration to both of us as we continued to update the book to get it to its current form. Thanks additionally to the continued contributions by members of the community to the book on GitHub and to the many individuals that have recommended this book to others. We are so very appreciative of all of you! Lastly, a very special shout out to any student who has ever taken a class with us at either Pacific University, Reed College, Middlebury College, Amherst College, or Smith College. We couldn’t have made this book without you! About this book This book was written using RStudio’s bookdown package by Yihui Xie (Xie 2019). This package simplifies the publishing of books by having all content written in R Markdown. The bookdown/R Markdown source code for all versions of ModernDive is available on GitHub: Latest online version The most up-to-date release: Version 1.0.0 released on November 25, 2019 (source code) Available at https://moderndive.com/ Print version The CRC Press print version of ModernDive corresponds to Version 1.0.0. Development online version The working copy of the next version which is currently being edited: Preview of development version is available at https://moderndive.netlify.com/. Source code: Available on ModernDive’s GitHub repository page at https://github.com/moderndive/moderndive_book. 
Previous online versions Older versions that may be out of date: Version 0.6.1 released on August 28, 2019 (source code) Version 0.6.0 released on August 7, 2019 (source code) Version 0.5.0 released on February 24, 2019 (source code) Version 0.4.0 released on July 21, 2018 (source code) Version 0.3.0 released on February 3, 2018 (source code) Version 0.2.0 released on August 2, 2017 (source code) Version 0.1.3 released on February 9, 2017 (source code) Version 0.1.2 released on January 22, 2017 (source code) Could this be a new paradigm for textbooks? Instead of the traditional model of textbook companies publishing updated editions of the textbook every few years, we apply a software design influenced model of publishing more easily updated versions. We can then leverage open-source communities of instructors and developers for ideas, tools, resources, and feedback. As such, we welcome your GitHub pull requests. Finally, since this book is under a Creative Commons Attribution - NonCommercial - ShareAlike 4.0 license, feel free to modify the book as you wish for your own non-commercial needs, but please list the authors at the top of index.Rmd as: “Chester Ismay, Albert Y. Kim, and YOU!” References "],
+["about-the-authors.html", "About the authors", " About the authors Chester Ismay Albert Y. Kim Chester Ismay is a Data Science Evangelist at DataRobot in Portland, OR, USA. In this role, he leads data science, machine learning, and data engineering in-person workshops for DataRobot University. He completed his PhD in statistics from Arizona State University in 2013. He has previously worked in a variety of roles including as an actuary at Scottsdale Insurance Company (now Nationwide E&amp;S/Specialty), as a freelance data science consultant, and at Ripon College, Reed College, and Pacific University. In addition to his work for ModernDive, he also contributed as initial developer of the infer R package and is author and maintainer of the thesisdown R package. Email: chester.ismay@gmail.com Webpage: https://chester.rbind.io/ Twitter: old_man_chester GitHub: https://github.com/ismayc Albert Y. Kim is an Assistant Professor of Statistical &amp; Data Sciences at Smith College in Northampton, MA, USA. He completed his PhD in statistics at the University of Washington in 2011. Previously he worked in the Search Ads Metrics Team at Google Inc. as well as at Reed, Middlebury, and Amherst Colleges. In addition to his work for ModernDive, he is a co-author of the resampledata and SpatialEpi R packages. Email: albert.ys.kim@gmail.com Webpage: https://rudeboybert.rbind.io/ Twitter: rudeboybert GitHub: https://github.com/rudeboybert Both Drs. Ismay and Kim, along with Jennifer Chunn, are co-authors of the fivethirtyeight package of code and datasets published by the data journalism website FiveThirtyEight.com. "],
+["1-getting-started.html", "Chapter 1 Getting Started with Data in R 1.1 What are R and RStudio? 1.2 How do I code in R? 1.3 What are R packages? 1.4 Explore your first datasets 1.5 Conclusion", " Chapter 1 Getting Started with Data in R Before we can start exploring data in R, there are some key concepts to understand first: What are R and RStudio? How do I code in R? What are R packages? We’ll introduce these concepts in the upcoming Sections 1.1-1.3. If you are already somewhat familiar with these concepts, feel free to skip to Section 1.4 where we’ll introduce our first dataset: all domestic flights departing one of the three main New York City (NYC) airports in 2013. This is a dataset we will explore in depth for much of the rest of this book. 1.1 What are R and RStudio? Throughout this book, we will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard as illustrated in Figure 1.1. FIGURE 1.1: Analogy of difference between R and RStudio. More precisely, R is a programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. So just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well. 1.1.1 Installing R and RStudio Note about RStudio Server or RStudio Cloud: If your instructor has provided you with a link and access to RStudio Server or RStudio Cloud, then you can skip this section. We do recommend after a few months of working on RStudio Server/Cloud that you return to these instructions to install this software on your own computer though. You will first need to download and install both R and RStudio (Desktop version) on your computer. It is important that you install R first and then install RStudio. 
You must do this first: Download and install R by going to https://cloud.r-project.org/. If you are a Windows user: Click on “Download R for Windows”, then click on “base”, then click on the Download link. If you are a macOS user: Click on “Download R for (Mac) OS X”, then under “Latest release:” click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of November 25, 2019 was R-3.6.1. If you are a Linux user: Click on “Download R for Linux” and choose your distribution for more information on installing R for your setup. You must do this second: Download and install RStudio at https://www.rstudio.com/products/rstudio/download/. Scroll down to “Installers for Supported Platforms” near the bottom of the page. Click on the download link corresponding to your computer’s operating system. 1.1.2 Using R via RStudio Recall our car analogy from earlier. Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new programs (also called applications) you can open. We’ll always work in RStudio and not in the R application. Figure 1.2 shows what icon you should be clicking on your computer. FIGURE 1.2: Icons of R versus RStudio on your computer. After you open RStudio, you should see something similar to Figure 1.3. (Note that slight differences might exist if RStudio’s default interface has changed since 2019.) FIGURE 1.3: RStudio interface to R. Note the three panes which are three panels dividing the screen: the console pane, the files pane, and the environment pane. Over the course of this chapter, you’ll come to learn what purpose each of these panes serves. 1.2 How do I code in R? Now that you’re set up with R and RStudio, you are probably asking yourself, “OK. Now how do I use R?” 
The first thing to note is that unlike other statistical software programs like Excel, SPSS, or Minitab that provide point-and-click interfaces, R is an interpreted language. This means you have to type in commands written in R code. In other words, you have to code/program in R. Note that we’ll use the terms “coding” and “programming” interchangeably in this book. While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that new R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively. 1.2.1 Basic programming concepts and terminology We now introduce some basic programming concepts and terminology. Instead of asking you to memorize all these concepts and terminology right now, we’ll guide you so that you’ll “learn by doing.” To help you learn, we will always use a different font to distinguish regular text from computer_code. The best way to master these topics is, in our opinion, through deliberate practice with R and lots of repetition. Basics: Console pane: where you enter in commands. Running code: the act of telling R to perform a task by giving it commands in the console. Objects: where values are saved in R. We’ll show you how to assign values to objects and how to display the contents of objects. Data types: integers, doubles/numerics, logicals, and characters. Integers are values like -1, 0, 2, 4092. Doubles or numerics are a larger set of values containing not only the integers but also fractions and decimal values like -24.932 and 0.8. Logicals are either TRUE or FALSE while characters are text such as “cabbage”, “Hamilton”, “The Wire is the greatest TV show ever”, and “This ramen is delicious.” Note that characters are often denoted with the quotation marks around them. Vectors: a series of values. 
These are created using the c() function, where c() stands for “combine” or “concatenate.” For example, c(6, 11, 13, 31, 90, 92) creates a six-element series of positive integer values. Factors: categorical data are commonly represented in R as factors. Categorical data can also be represented as strings. We’ll study this difference as we progress through the book. Data frames: rectangular spreadsheets. They are representations of datasets in R where the rows correspond to observations and the columns correspond to variables that describe the observations. We’ll cover data frames later in Section 1.4. Conditionals: Testing for equality in R using == (and not =, which is typically used for assignment). For example, 2 + 1 == 3 compares 2 + 1 to 3 and is correct R code, while 2 + 1 = 3 will return an error. Boolean algebra: TRUE/FALSE statements and mathematical operators such as &lt; (less than), &lt;= (less than or equal), and != (not equal to). For example, 4 + 2 &gt;= 3 will return TRUE, but 3 + 5 &lt;= 1 will return FALSE. Logical operators: &amp; representing “and” as well as | representing “or.” For example, (2 + 1 == 3) &amp; (2 + 1 == 4) returns FALSE since not both clauses are TRUE (only the first clause is TRUE). On the other hand, (2 + 1 == 3) | (2 + 1 == 4) returns TRUE since at least one of the two clauses is TRUE. Functions, also called commands: Functions perform tasks in R. They take in inputs called arguments and return outputs. You can either manually specify a function’s arguments or use the function’s default values. For example, the function seq() in R generates a sequence of numbers. If you just run seq() it will return the value 1. That doesn’t seem very useful! This is because the default arguments are set as seq(from = 1, to = 1). Thus, if you don’t pass in different values for from and to to change this behavior, R just assumes all you want is the number 1. You can change the argument values by updating the values after the = sign. 
If we try out seq(from = 2, to = 5) we get the result 2 3 4 5 that we might expect. We’ll work with functions a lot throughout this book and you’ll get lots of practice in understanding their behaviors. To further assist you in understanding when a function is mentioned in the book, we’ll also include the () after them as we did with seq() above. This list is by no means an exhaustive list of all the programming concepts and terminology needed to become a savvy R user; such a list would be so large it wouldn’t be very useful, especially for novices. Rather, we feel this is a minimally viable list of programming concepts and terminology you need to know before getting started. We feel that you can learn the rest as you go. Remember that your mastery of all of these concepts and terminology will build as you practice more and more. 1.2.2 Errors, warnings, and messages One thing that intimidates new R and RStudio users is how it reports errors, warnings, and messages. R reports errors, warnings, and messages in a glaring red font, which makes it seem like it is scolding you. However, seeing red text in the console is not always bad. R will show red text in the console pane in three different situations: Errors: When the red text is a legitimate error, it will be prefaced with “Error in…” and will try to explain what went wrong. Generally when there’s an error, the code will not run. For example, we’ll see in Subsection 1.3.3 if you see Error in ggplot(...) : could not find function &quot;ggplot&quot;, it means that the ggplot() function is not accessible because the package that contains the function (ggplot2) was not loaded with library(ggplot2). Thus you cannot use the ggplot() function without the ggplot2 package being loaded first. Warnings: When the red text is a warning, it will be prefaced with “Warning:” and R will try to explain why there’s a warning. Generally your code will still work, but with some caveats. 
For example, you will see in Chapter 2 if you create a scatterplot based on a dataset where two of the rows of data have missing entries that would be needed to create points in the scatterplot, you will see this warning: Warning: Removed 2 rows containing missing values (geom_point). R will still produce the scatterplot with all the remaining non-missing values, but it is warning you that two of the points aren’t there. Messages: When the red text doesn’t start with either “Error” or “Warning”, it’s just a friendly message. You’ll see these messages when you load R packages in the upcoming Subsection 1.3.2 or when you read data saved in spreadsheet files with the read_csv() function as you’ll see in Chapter 4. These are helpful diagnostic messages and they don’t stop your code from working. Additionally, you’ll see these messages when you install packages too using install.packages() as discussed in Subsection 1.3.1. Remember, when you see red text in the console, don’t panic. It doesn’t necessarily mean anything is wrong. Rather: If the text starts with “Error”, figure out what’s causing it. Think of errors as a red traffic light: something is wrong! If the text starts with “Warning”, figure out if it’s something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you’re fine. If that’s surprising, look at your data and see what’s missing. Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention. Otherwise, the text is just a message. Read it, wave back at R, and thank it for talking to you. Think of messages as a green traffic light: everything is working fine and keep on going! 1.2.3 Tips on learning to code Learning to code/program is quite similar to learning a foreign language. It can be daunting and frustrating at first. Such frustrations are common and it is normal to feel discouraged as you learn. 
However, just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn and improve. Here are a few useful tips to keep in mind as you learn to program: Remember that computers are not actually that smart: You may think your computer or smartphone is “smart,” but really people spent a lot of time and energy designing them to appear “smart.” In reality, you have to tell a computer everything it needs to do. Furthermore, the instructions you give your computer can’t have any mistakes in them, nor can they be ambiguous in any way. Take the “copy, paste, and tweak” approach: Especially when you learn your first programming language or you need to understand particularly complicated code, it is often much easier to take existing code that you know works and modify it to suit your ends. This is as opposed to trying to type out the code from scratch. We call this the “copy, paste, and tweak” approach. So early on, we suggest not trying to write code from memory, but rather take existing examples we have provided you, then copy, paste, and tweak them to suit your goals. After you start feeling more confident, you can slowly move away from this approach and write code from scratch. Think of the “copy, paste, and tweak” approach as training wheels for a child learning to ride a bike. After getting comfortable, they won’t need them anymore. The best way to learn to code is by doing: Rather than learning to code for its own sake, we find that learning to code goes much smoother when you have a goal in mind or when you are working on a particular project, like analyzing data that you are interested in and that is important to you. Practice is key: Just as the only method to improve your foreign language skills is through lots of practice and speaking, the only method to improving your coding skills is through lots of practice. Don’t worry, however, we’ll give you plenty of opportunities to do so! 
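As a first chance to practice, here is a minimal sketch of the concepts from Subsection 1.2.1 that you can copy, paste, and tweak in the console pane. The object names (lucky_numbers, both_true, and so on) are our own illustrative choices and are not used elsewhere in the book.

```r
# Objects and vectors: save a series of values under a name with <-
lucky_numbers <- c(6, 11, 13, 31, 90, 92)
n_values <- length(lucky_numbers)        # how many elements the vector has

# Data types: class() reports how R stores a value
type_of_number <- class(0.8)             # doubles are reported as "numeric"
type_of_text <- class("cabbage")
type_of_logical <- class(TRUE)

# Conditionals and Boolean algebra: == tests equality (a single = does not)
sum_is_three <- 2 + 1 == 3
at_least_three <- 4 + 2 >= 3
at_most_one <- 3 + 5 <= 1

# Logical operators: & is "and", | is "or"
both_true <- (2 + 1 == 3) & (2 + 1 == 4)
either_true <- (2 + 1 == 3) | (2 + 1 == 4)

# Functions: default arguments versus arguments you specify yourself
default_seq <- seq()
custom_seq <- seq(from = 2, to = 5)

# Display the contents of an object by printing it
print(custom_seq)
```

Try changing one value at a time, for example the to argument of seq(), and rerun the line to see how the result changes.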
1.3 What are R packages? Another point of confusion with many new R users is the idea of an R package. R packages extend the functionality of R by providing additional functions, data, and documentation. They are written by a worldwide community of R users and can be downloaded for free from the internet. For example, among the many packages we will use in this book are the ggplot2 package (Wickham, Chang, et al. 2019) for data visualization in Chapter 2, the dplyr package (Wickham, François, et al. 2019) for data wrangling in Chapter 3, the moderndive package (Kim and Ismay 2019) that accompanies this book, and the infer package (Bray et al. 2019) for “tidy” and transparent statistical inference in Chapters 8, 9, and 10. A good analogy for R packages is they are like apps you can download onto a mobile phone: FIGURE 1.4: Analogy of R versus R packages. So R is like a new mobile phone: while it has a certain amount of features when you use it for the first time, it doesn’t have everything. R packages are like the apps you can download onto your phone from Apple’s App Store or Android’s Google Play. Let’s continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a photo you have just taken with friends on Instagram. You need to: Install the app: Since your phone is new and does not include the Instagram app, you need to download the app from either the App Store or Google Play. You do this once and you’re set for the time being. You might need to do this again in the future when there is an update to the app. Open the app: After you’ve installed Instagram, you need to open it. Once Instagram is open on your phone, you can then proceed to share your photo with your friends and family. The process is very similar for using an R package. You need to: Install the package: This is like installing an app on your phone. 
Most packages are not installed by default when you install R and RStudio. Thus if you want to use a package for the first time, you need to install it first. Once you’ve installed a package, you likely won’t install it again unless you want to update it to a newer version. “Load” the package: “Loading” a package is like opening an app on your phone. Packages are not “loaded” by default when you start RStudio on your computer; you need to “load” each package you want to use every time you start RStudio. Let’s perform these two steps for the ggplot2 package for data visualization. 1.3.1 Package installation Note about RStudio Server or RStudio Cloud: If your instructor has provided you with a link and access to RStudio Server or RStudio Cloud, you might not need to install packages, as they might be preinstalled for you by your instructor. That being said, it is still a good idea to know this process for later on when you are not using RStudio Server or Cloud, but rather RStudio Desktop on your own computer. There are two ways to install an R package: an easy way and a more advanced way. Let’s install the ggplot2 package the easy way first as shown in Figure 1.5. In the Files pane of RStudio: Click on the “Packages” tab. Click on “Install” next to Update. Type the name of the package under “Packages (separate multiple with space or comma):” In this case, type ggplot2. Click “Install.” FIGURE 1.5: Installing packages in R the easy way. An alternative but slightly less convenient way to install a package is by typing install.packages(&quot;ggplot2&quot;) in the console pane of RStudio and pressing Return/Enter on your keyboard. Note you must include the quotation marks around the name of the package. Much like an app on your phone, you only have to install a package once. However, if you want to update a previously installed package to a newer version, you need to reinstall it by repeating the earlier steps. 
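If you are unsure whether a package is already installed, you can check before reinstalling. The following is only a sketch: not_yet_installed() is a hypothetical helper we define here for illustration (it is not part of R or of any package), built on base R's requireNamespace() function.

```r
# Hypothetical helper (our own, for illustration only): given a vector of
# package names, return the ones that are not installed on this computer.
not_yet_installed <- function(packages) {
  is_installed <- vapply(
    packages,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  packages[!is_installed]
}

# Check the two packages mentioned so far
missing_packages <- not_yet_installed(c("ggplot2", "dplyr"))
print(missing_packages)

# To install whatever is listed, you could then run:
# install.packages(missing_packages)
```

The install.packages() line is left as a comment so that nothing is downloaded until you decide to run it yourself.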
Learning check (LC1.1) Repeat the earlier installation steps, but for the dplyr, nycflights13, and knitr packages. This will install the earlier mentioned dplyr package for data wrangling, the nycflights13 package containing data on all domestic flights leaving a NYC airport in 2013, and the knitr package for generating easy-to-read tables in R. We’ll use these packages in the next section. Note that if you’d like your output on your computer to match up exactly with the output presented throughout the book, you may want to use the exact versions of the packages that we used. You can find a full listing of these packages and their versions in Appendix E. This likely won’t be relevant for novices, but we included it for reproducibility reasons. 1.3.2 Package loading Recall that after you’ve installed a package, you need to “load it.” In other words, you need to “open it.” We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy-and-paste the following code into the console pane and then hit the Enter key. library(ggplot2) If after running the earlier code, a blinking cursor returns next to the &gt; “prompt” sign, it means you were successful and the ggplot2 package is now loaded and ready to use. If, however, you get a red “error message” that reads ... Error in library(ggplot2) : there is no package called ‘ggplot2’ ... it means that you didn’t successfully install it. This is an example of an “error message” we discussed in Subsection 1.2.2. If you get this error message, go back to Subsection 1.3.1 on R package installation and make sure to install the ggplot2 package before proceeding. Learning check (LC1.2) “Load” the dplyr, nycflights13, and knitr packages as well by repeating the earlier steps. 
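The “no package called” error is one of the three kinds of red text described in Subsection 1.2.2. As a rough sketch, you can produce each kind on purpose and see how R classifies it. Here feedback_kind() is a hypothetical helper of our own built on base R's tryCatch(), and aPackageThatDoesNotExist123 is a made-up package name assumed not to be installed.

```r
# Hypothetical helper (our own, for illustration only): run a piece of code
# and report which kind of console feedback it produced.
feedback_kind <- function(code) {
  tryCatch(
    {
      code          # the code is evaluated here
      "nothing"     # reached only if no feedback was signaled
    },
    message = function(m) "message",
    warning = function(w) "warning",
    error = function(e) "error"
  )
}

print(feedback_kind(message("Just a friendly note")))  # green traffic light
print(feedback_kind(warning("Pay attention")))         # yellow traffic light
print(feedback_kind(stop("Something is wrong")))       # red traffic light

# Loading a package that is not installed is reported as an error
print(feedback_kind(library(aPackageThatDoesNotExist123)))
```

You won't need tryCatch() for the rest of this book; it is used here only to label the feedback instead of printing it in red.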
1.3.3 Package use One very common mistake new R users make when wanting to use particular packages is they forget to “load” them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. If you don’t first “load” a package, but attempt to use one of its features, you’ll see an error message similar to: Error: could not find function This is a different error message than the one you just saw on a package not having been installed yet. R is telling you that you are trying to use a function in a package that has not yet been “loaded.” R doesn’t know where to find the function you are using. Almost all new users forget to do this when starting out, and it is a little annoying to get used to doing it. However, you’ll remember with practice and after some time it will become second nature for you. 1.4 Explore your first datasets Let’s put everything we’ve learned so far into practice and start exploring some real data! Data comes to us in a variety of formats, from pictures to text to numbers. Throughout this book, we’ll focus on datasets that are saved in “spreadsheet”-type format. This is probably the most common way data are collected and saved in many fields. Remember from Subsection 1.2.1 that these “spreadsheet”-type datasets are called data frames in R. We’ll focus on working with data saved as data frames throughout this book. Let’s first load all the packages needed for this chapter, assuming you’ve already installed them. Read Section 1.3 for information on how to install and load R packages if you haven’t already. library(nycflights13) library(dplyr) library(knitr) At the beginning of all subsequent chapters in this book, we’ll always have a list of packages that you should have installed and loaded in order to work with that chapter’s R code. 1.4.1 nycflights13 package Many of us have flown on airplanes or know someone who has. 
Air travel has become an ever-present aspect of many people’s lives. If you look at the Departures flight information board at an airport, you will frequently see that some flights are delayed for a variety of reasons. Are there ways that we can understand the reasons that cause flight delays? We’d all like to arrive at our destinations on time whenever possible. (Unless you secretly love hanging out at airports. If you are one of these people, pretend for a moment that you are very much anticipating being at your final destination.) Throughout this book, we’re going to analyze data related to all domestic flights departing from one of New York City’s three main airports in 2013: Newark Liberty International (EWR), John F. Kennedy International (JFK), and LaGuardia Airport (LGA). We’ll access this data using the nycflights13 R package, which contains five datasets saved in five data frames: flights: Information on all 336,776 flights. airlines: A table matching airline names and their two-letter International Air Transport Association (IATA) airline codes (also known as carrier codes) for 16 airline companies. For example, “DL” is the two-letter code for Delta. planes: Information about each of the 3,322 physical aircraft used. weather: Hourly meteorological data for each of the three NYC airports. This data frame has 26,115 rows, roughly corresponding to the \\(365 \\times 24 \\times 3 = 26,280\\) possible hourly measurements one can observe at three locations over the course of a year. airports: Names, codes, and locations of the 1,458 domestic destinations. 1.4.2 flights data frame We’ll begin by exploring the flights data frame and get an idea of its structure. Run the following code in your console, either by typing it or by cutting-and-pasting it. It displays the contents of the flights data frame in your console. Note that depending on the size of your monitor, the output may vary slightly. 
flights # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 7 2013 1 1 555 600 -5 913 854 8 2013 1 1 557 600 -3 709 723 9 2013 1 1 557 600 -3 838 846 10 2013 1 1 558 600 -2 753 745 # … with 336,766 more rows, and 11 more variables: arr_delay &lt;dbl&gt;, # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt; Let’s unpack this output: A tibble: 336,776 x 19: A tibble is a specific kind of data frame in R. This particular data frame has 336,776 rows corresponding to different observations. Here, each observation is a flight. 19 columns corresponding to 19 variables describing each observation. year, month, day, dep_time, sched_dep_time, dep_delay, and arr_time are the different columns, in other words, the different variables of this dataset. We then have a preview of the first 10 rows of observations corresponding to the first 10 flights. R is only showing the first 10 rows, because if it showed all 336,776 rows, it would overwhelm your screen. ... with 336,766 more rows, and 11 more variables: indicating to us that 336,766 more rows of data and 11 more variables could not fit in this screen. Unfortunately, this output does not allow us to explore the data very well, but it does give a nice preview. Let’s look at some different ways to explore data frames. 1.4.3 Exploring data frames There are many ways to get a feel for the data contained in a data frame such as flights. We present three functions that take as their “argument” (their input) the data frame in question. 
We also include a fourth method for exploring one particular column of a data frame: Using the View() function, which brings up RStudio’s built-in data viewer. Using the glimpse() function, which is included in the dplyr package. Using the kable() function, which is included in the knitr package. Using the $ “extraction operator,” which is used to view a single variable/column in a data frame. 1. View(): Run View(flights) in your console in RStudio, either by typing it or cutting-and-pasting it into the console pane. Explore this data frame in the resulting pop up viewer. You should get into the habit of viewing any data frames you encounter. Note the uppercase V in View(). R is case-sensitive, so you’ll get an error message if you run view(flights) instead of View(flights). Learning check (LC1.3) What does any ONE row in this flights dataset refer to? A. Data on an airline B. Data on a flight C. Data on an airport D. Data on multiple flights By running View(flights), we can explore the different variables listed in the columns. Observe that there are many different types of variables. Some of the variables like distance, day, and arr_delay are what we will call quantitative variables. These variables are numerical in nature. Other variables here are categorical. Note that if you look in the leftmost column of the View(flights) output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row is representing. This will allow you to identify what object is being described in a given row by taking note of the values of the columns in that specific row. This is often called the observational unit. The observational unit in this example is an individual flight departing from New York City in 2013. You can identify the observational unit by determining what “thing” is being measured or described by each of the variables. 
We’ll talk more about observational units in Subsection 1.4.4 on identification and measurement variables. 2. glimpse(): The second way we’ll cover to explore a data frame is using the glimpse() function included in the dplyr package. Thus, you can only use the glimpse() function after you’ve loaded the dplyr package by running library(dplyr). This function provides us with an alternative perspective for exploring a data frame than the View() function: glimpse(flights) Observations: 336,776 Variables: 19 $ year &lt;int&gt; 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, … $ month &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ day &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ dep_time &lt;int&gt; 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,… $ sched_dep_time &lt;int&gt; 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,… $ dep_delay &lt;dbl&gt; 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -… $ arr_time &lt;int&gt; 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849… $ sched_arr_time &lt;int&gt; 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851… $ arr_delay &lt;dbl&gt; 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -… $ carrier &lt;chr&gt; &quot;UA&quot;, &quot;UA&quot;, &quot;AA&quot;, &quot;B6&quot;, &quot;DL&quot;, &quot;UA&quot;, &quot;B6&quot;, &quot;EV&quot;, &quot;B6&quot;, … $ flight &lt;int&gt; 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, … $ tailnum &lt;chr&gt; &quot;N14228&quot;, &quot;N24211&quot;, &quot;N619AA&quot;, &quot;N804JB&quot;, &quot;N668DN&quot;, &quot;N39… $ origin &lt;chr&gt; &quot;EWR&quot;, &quot;LGA&quot;, &quot;JFK&quot;, &quot;JFK&quot;, &quot;LGA&quot;, &quot;EWR&quot;, &quot;EWR&quot;, &quot;LGA&quot;… $ dest &lt;chr&gt; &quot;IAH&quot;, &quot;IAH&quot;, &quot;MIA&quot;, &quot;BQN&quot;, &quot;ATL&quot;, &quot;ORD&quot;, &quot;FLL&quot;, &quot;IAD&quot;… $ air_time &lt;dbl&gt; 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, … $ 
distance &lt;dbl&gt; 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,… $ hour &lt;dbl&gt; 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, … $ minute &lt;dbl&gt; 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, … $ time_hour &lt;dttm&gt; 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 … Observe that glimpse() will give you the first few entries of each variable in a row after the variable name. In addition, the data type (see Subsection 1.2.1) of the variable is given immediately after each variable’s name inside &lt; &gt;. Here, int and dbl refer to “integer” and “double”, which are computer coding terminology for quantitative/numerical variables. “Doubles” take up twice the size to store on a computer compared to integers. In contrast, chr refers to “character”, which is computer terminology for text data. In most forms, text data, such as the carrier or origin of a flight, are categorical variables. The time_hour variable is another data type: dttm. These types of variables represent date and time combinations. However, we won’t work with dates and times in this book; we leave this topic for other data science books like Introduction to Data Science by Tiffany-Anne Timbers, Melissa Lee, and Trevor Campbell or R for Data Science (Grolemund and Wickham 2017). Learning check (LC1.4) What are some other examples in this dataset of categorical variables? What makes them different than quantitative variables? 3. kable(): The final way to explore the entirety of a data frame is using the kable() function from the knitr package. Let’s explore the different carrier codes for all the airlines in our dataset two ways. Run both of these lines of code in the console: airlines kable(airlines) At first glance, it may not appear that there is much difference in the outputs. However, when using tools for producing reproducible reports such as R Markdown, the latter code produces output that is much more legible and reader-friendly. 
You’ll see us use this reader-friendly style in many places in the book when we want to print a data frame as a nice table. 4. $ operator Lastly, the $ operator allows us to extract and then explore a single variable within a data frame. For example, run the following in your console airlines$name We used the $ operator to extract only the name variable and return it as a vector of length 16. We’ll only be occasionally exploring data frames using the $ operator, instead favoring the View() and glimpse() functions. 1.4.4 Identification and measurement variables There is a subtle difference between the kinds of variables that you will encounter in data frames. There are identification variables and measurement variables. For example, let’s explore the airports data frame by showing the output of glimpse(airports): glimpse(airports) Observations: 1,458 Variables: 8 $ faa &lt;chr&gt; &quot;04G&quot;, &quot;06A&quot;, &quot;06C&quot;, &quot;06N&quot;, &quot;09J&quot;, &quot;0A9&quot;, &quot;0G6&quot;, &quot;0G7&quot;, &quot;0P2&quot;, … $ name &lt;chr&gt; &quot;Lansdowne Airport&quot;, &quot;Moton Field Municipal Airport&quot;, &quot;Schaumbu… $ lat &lt;dbl&gt; 41.1, 32.5, 42.0, 41.4, 31.1, 36.4, 41.5, 42.9, 39.8, 48.1, 39.… $ lon &lt;dbl&gt; -80.6, -85.7, -88.1, -74.4, -81.4, -82.2, -84.5, -76.8, -76.6, … $ alt &lt;dbl&gt; 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1… $ tz &lt;dbl&gt; -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5,… $ dst &lt;chr&gt; &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;U&quot;, &quot;A&quot;, &quot;A&quot;, &quot;U&quot;, &quot;A&quot;… $ tzone &lt;chr&gt; &quot;America/New_York&quot;, &quot;America/Chicago&quot;, &quot;America/Chicago&quot;, &quot;Amer… The variables faa and name are what we will call identification variables, variables that uniquely identify each observational unit. 
In this case, the identification variables uniquely identify airports. Such variables are mainly used in practice to uniquely identify each row in a data frame. faa gives the unique code provided by the FAA for that airport, while the name variable gives the longer official name of the airport. The remaining variables (lat, lon, alt, tz, dst, tzone) are often called measurement or characteristic variables: variables that describe properties of each observational unit. For example, lat and lon describe the latitude and longitude of each airport. Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the leftmost columns of your data frame. Learning check (LC1.5) What properties of each airport do the variables lat, lon, alt, tz, dst, and tzone describe in the airports data frame? Take your best guess. (LC1.6) Provide the names of variables in a data frame with at least three variables where one of them is an identification variable and the other two are not. Further, create your own tidy data frame that matches these conditions. 1.4.5 Help files Another nice feature of R is its help files, which provide documentation for various functions and datasets. You can bring up help files by adding a ? before the name of a function or data frame and then running this in the console. You will then be presented with a page showing the corresponding documentation if it exists. For example, let’s look at the help file for the flights data frame. ?flights The help file should pop up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away. Learning check (LC1.7) Look at the help file for the airports data frame. 
Revise your earlier guesses about what the variables lat, lon, alt, tz, dst, and tzone each describe. 1.5 Conclusion We’ve given you what we feel is a minimally viable set of tools to explore data in R. Does this chapter contain everything you need to know? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn’t be useful! As we said earlier, the best way to add to your toolbox is to get into RStudio and run and write code as much as possible. 1.5.1 Additional resources If you are new to the world of coding, R, and RStudio and feel you could benefit from a more detailed introduction, we suggest you check out the short book, Getting Used to R, RStudio, and R Markdown (Ismay and Kennedy 2016). It includes screencast recordings that you can follow along and pause as you learn. This book also contains an introduction to R Markdown, a tool used for reproducible research in R. FIGURE 1.6: Preview of Getting Used to R, RStudio, and R Markdown. 1.5.2 What’s to come? We’re now going to start the “Data Science with tidyverse” portion of this book in Chapter 2 as shown in Figure 1.7 with what we feel is the most important tool in a data scientist’s toolbox: data visualization. We’ll continue to explore the data included in the nycflights13 package using the ggplot2 package for data visualization. You’ll see that data visualization is a powerful tool to add to your toolbox for data exploration that provides additional insight to what the View() and glimpse() functions can provide. FIGURE 1.7: ModernDive flowchart - on to Part I! References "],
+["2-viz.html", "Chapter 2 Data Visualization 2.1 The grammar of graphics 2.2 Five named graphs - the 5NG 2.3 5NG#1: Scatterplots 2.4 5NG#2: Linegraphs 2.5 5NG#3: Histograms 2.6 Facets 2.7 5NG#4: Boxplots 2.8 5NG#5: Barplots 2.9 Conclusion", " Chapter 2 Data Visualization We begin the development of your data science toolbox with data visualization. By visualizing data, we gain valuable insights we couldn’t initially obtain from just looking at the raw data values. We’ll use the ggplot2 package, as it provides an easy way to customize your plots. ggplot2 is rooted in the data visualization theory known as the grammar of graphics (Wilkinson 2005), developed by Leland Wilkinson. At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the presence of outliers, distributions of individual variables, and relationships between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does, however, require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms your audience. As we will see, plots also help us to identify patterns and outliers in our data. We’ll see that a common extension of these ideas is to compare the distribution of one numerical variable, such as what are the center and spread of the values, as we go across the levels of a different categorical variable. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 1.3 for information on how to install and load R packages. 
library(nycflights13) library(ggplot2) library(dplyr) 2.1 The grammar of graphics We start with a discussion of a theoretical framework for data visualization known as “the grammar of graphics.” This framework serves as the foundation for the ggplot2 package which we’ll use extensively in this chapter. Think of how we construct and form sentences in English by combining different elements, like nouns, verbs, articles, subjects, objects, etc. We can’t just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, “the grammar of graphics” defines a set of rules for constructing statistical graphics by combining different types of layers. This grammar was created by Leland Wilkinson (Wilkinson 2005) and has been implemented in a variety of data visualization software platforms like R, but also Plotly and Tableau. 2.1.1 Components of the grammar In short, the grammar tells us that: A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects. Specifically, we can break a graphic into the following three essential components: data: the dataset containing the variables of interest. geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars. aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset. You might be wondering why we wrote the terms data, geom, and aes in a computer code type font. We’ll see very shortly that we’ll specify the elements of the grammar in R using these terms. However, let’s first break down the grammar with an example. 
2.1.2 Gapminder data In February 2006, a Swedish physician and data advocate named Hans Rosling gave a TED talk titled “The best stats you’ve ever seen” where he presented global economic, health, and development data from the website gapminder.org. For example, for data on 142 countries in 2007, let’s consider only a few countries in Table 2.1 as a peek into the data. TABLE 2.1: Gapminder 2007 Data: First 3 of 142 countries Country Continent Life Expectancy Population GDP per Capita Afghanistan Asia 43.8 31889923 975 Albania Europe 76.4 3600523 5937 Algeria Africa 72.3 33333216 6223 Each row in this table corresponds to a country in 2007. For each row, we have 5 columns: Country: Name of country. Continent: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America and that Antarctica is excluded. Life Expectancy: Life expectancy in years. Population: Number of people living in the country. GDP per Capita: Gross domestic product per person (in US dollars). Now consider Figure 2.1, which plots these data for all 142 countries. FIGURE 2.1: Life expectancy over GDP per capita in 2007. Let’s view this plot through the grammar of graphics: The data variable GDP per Capita gets mapped to the x-position aesthetic of the points. The data variable Life Expectancy gets mapped to the y-position aesthetic of the points. The data variable Population gets mapped to the size aesthetic of the points. The data variable Continent gets mapped to the color aesthetic of the points. We’ll see shortly that data corresponds to the particular data frame where our data is saved and that “data variables” correspond to particular columns in the data frame. Furthermore, the geometric objects considered in this plot are points. That being said, while in this example we are considering points, graphics are not limited to just points. We can also use lines, bars, and other geometric objects. 
Let’s summarize the three essential components of the grammar in Table 2.2. TABLE 2.2: Summary of the grammar of graphics for this plot data variable aes geom GDP per Capita x point Life Expectancy y point Population size point Continent color point 2.1.3 Other components There are other components of the grammar of graphics we can control as well. As you start to delve deeper into the grammar of graphics, you’ll start to encounter these topics more frequently. In this book, we’ll keep things simple and only work with these two additional components: faceting breaks up a plot into several plots split by the values of another variable (Section 2.6) position adjustments for barplots (Section 2.8) Other more complex components like scales and coordinate systems are left for a more advanced text such as R for Data Science (Grolemund and Wickham 2017). Generally speaking, the grammar of graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them. 2.1.4 ggplot2 package In this book, we will use the ggplot2 package for data visualization, which is an implementation of the grammar of graphics for R (Wickham, Chang, et al. 2019). As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the grammar of graphics are specified in the ggplot() function included in the ggplot2 package. For the purposes of this book, we’ll always provide the ggplot() function with the following arguments (i.e., inputs) at a minimum: The data frame where the variables exist: the data argument. The mapping of the variables to aesthetic attributes: the mapping argument which specifies the aesthetic attributes involved. After we’ve specified these components, we then add layers to the plot using the + sign. 
The most essential layer to add to a plot is the layer that specifies which type of geometric object we want the plot to involve: points, lines, bars, and others. Other layers we can add to a plot include the plot title, axes labels, visual themes for the plots, and facets (which we’ll see in Section 2.6). Let’s now put the theory of the grammar of graphics into practice. 2.2 Five named graphs - the 5NG In order to keep things simple in this book, we will only focus on five different types of graphics, each with a commonly given name. We term these “five named graphs” or in abbreviated form, the 5NG: scatterplots linegraphs boxplots histograms barplots We’ll also present some variations of these plots, but with this basic repertoire of five graphics in your toolbox, you can visualize a wide array of different variable types. Note that certain plots are only appropriate for categorical variables, while others are only appropriate for numerical variables. 2.3 5NG#1: Scatterplots The simplest of the 5NG are scatterplots, also called bivariate plots. They allow you to visualize the relationship between two numerical variables. While you may already be familiar with scatterplots, let’s view them through the lens of the grammar of graphics we presented in Section 2.1. Specifically, we will visualize the relationship between the following two numerical variables in the flights data frame included in the nycflights13 package: dep_delay: departure delay on the horizontal “x” axis and arr_delay: arrival delay on the vertical “y” axis for Alaska Airlines flights leaving NYC in 2013. This requires paring down the data from all 336,776 flights that left NYC in 2013, to only the 714 Alaska Airlines flights that left NYC in 2013. We do this so our scatterplot will involve a manageable 714 points, and not an overwhelmingly large number like 336,776. 
To achieve this, we’ll take the flights data frame, filter the rows so that only the 714 rows corresponding to Alaska Airlines flights are kept, and save this in a new data frame called alaska_flights using the &lt;- assignment operator: alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) For now, we suggest you don’t worry if you don’t fully understand this code. We’ll see later in Chapter 3 on data wrangling that this code uses the dplyr package for data wrangling to achieve our goal: it takes the flights data frame and filters it to only return the rows where carrier is equal to &quot;AS&quot;, Alaska Airlines’ carrier code. Recall from Section 1.2 that testing for equality is specified with == and not =. Convince yourself that this code achieves what it is supposed to by exploring the resulting data frame by running View(alaska_flights). You’ll see that it has 714 rows, consisting of only 714 Alaska Airlines flights. Learning check (LC2.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights). In what respect do these data frames differ? For example, think about the number of rows in each dataset. 2.3.1 Scatterplots via geom_point Let’s now go over the code that will create the desired scatterplot, while keeping in mind the grammar of graphics framework we introduced in Section 2.1. Let’s take a look at the code and break it down piece-by-piece. ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() Within the ggplot() function, we specify two of the components of the grammar of graphics as arguments (i.e., inputs): The data as the alaska_flights data frame via data = alaska_flights. The aesthetic mapping by setting mapping = aes(x = dep_delay, y = arr_delay). Specifically, the variable dep_delay maps to the x position aesthetic, while the variable arr_delay maps to the y position. We then add a layer to the ggplot() function call using the + sign. 
The added layer in question specifies the third component of the grammar: the geometric object. In this case, the geometric object is set to be points by specifying geom_point(). After running these two lines of code in your console, you’ll notice two outputs: a warning message and the graphic shown in Figure 2.2. Warning: Removed 5 rows containing missing values (geom_point). FIGURE 2.2: Arrival delays versus departure delays for Alaska Airlines flights from NYC in 2013. Let’s first unpack the graphic in Figure 2.2. Observe that a positive relationship exists between dep_delay and arr_delay: as departure delays increase, arrival delays tend to also increase. Observe also the large mass of points clustered near (0, 0), the point indicating flights that neither departed nor arrived late. Let’s turn our attention to the warning message. R is alerting us that five rows were left out of the plot because they contain missing values. For these 5 rows, either the value for dep_delay or arr_delay or both were missing (recorded in R as NA), and thus these rows were ignored in our plot. Before we continue, let’s make a few more observations about this code that created the scatterplot. Note that the + sign comes at the end of lines, and not at the beginning. You’ll get an error in R if you put it at the beginning of a line. When adding layers to a plot, you are encouraged to start a new line after the + (by pressing the Return/Enter key on your keyboard) so that the code for each layer is on a new line. As we add more and more layers to plots, you’ll see this will greatly improve the legibility of your code. To stress the importance of adding the layer specifying the geometric object, consider Figure 2.3 where no layers are added. Because the geometric object was not specified, we have a blank plot which is not very useful! ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) FIGURE 2.3: A plot with no layers. 
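Returning to the warning about removed rows discussed earlier in this subsection: a row is dropped from a scatterplot when either of its two plotted values is NA. The following is a minimal sketch of that check in base R. The delays data frame and its values are hypothetical and made up for illustration; they are not taken from flights.

```r
# A small hypothetical data frame of delays in minutes (made-up values)
delays <- data.frame(
  dep_delay = c(-3, 12, NA, 45, 0),
  arr_delay = c(-10, 20, NA, NA, 5)
)

# A row cannot be plotted if either value is missing (NA)
incomplete <- is.na(delays$dep_delay) | is.na(delays$arr_delay)

# Number of rows that a scatterplot of these two variables would leave out
sum(incomplete)
```

Running the same kind of check on alaska_flights should let you confirm the count reported in the warning message.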
Learning check (LC2.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship? (LC2.3) What variables in the weather data frame would you expect to have a negative correlation (i.e., a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here. Hint: Explore the weather dataset by using the View() function. (LC2.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaska Air flights? (LC2.5) What are some other features of the plot that stand out to you? (LC2.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example given. 2.3.2 Overplotting The large mass of points near (0, 0) in Figure 2.2 can cause some confusion since it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called overplotting. As one may guess, this corresponds to points being plotted on top of each other over and over again. When overplotting occurs, it is difficult to know the number of points being plotted. There are two methods to address the issue of overplotting. Either by Adjusting the transparency of the points or Adding a little random “jitter”, or random “nudges”, to each of the points. Method 1: Changing the transparency The first way of addressing overplotting is to change the transparency/opacity of the points by setting the alpha argument in geom_point(). We can change the alpha argument to be any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque. By default, alpha is set to 1. In other words, if we don’t explicitly set an alpha value, R will use alpha = 1. 
Note how the following code is identical to the code in Section 2.3 that created the scatterplot with overplotting, but with alpha = 0.2 added to the geom_point() function: ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) FIGURE 2.4: Arrival vs. departure delays scatterplot with alpha = 0.2. The key feature to note in Figure 2.4 is that the transparency of the points is cumulative: areas with a high-degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no aes() surrounding alpha = 0.2. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of alpha. In fact, you’ll receive an error if you try to change the second line to read geom_point(aes(alpha = 0.2)). Method 2: Jittering the points The second way of addressing overplotting is by jittering all the points. This means giving each point a small “nudge” in a random direction. You can think of “jittering” as shaking the points around a bit on the plot. Let’s illustrate using a simple example first. Say we have a data frame with 4 identical rows of x and y values: (0,0), (0,0), (0,0), and (0,0). In Figure 2.5, we present both the regular scatterplot of these 4 points (on the left) and its jittered counterpart (on the right). FIGURE 2.5: Regular and jittered scatterplot. In the left-hand regular scatterplot, observe that the 4 points are superimposed on top of each other. While we know there are 4 values being plotted, this fact might not be apparent to others. In the right-hand jittered scatterplot, it is now plainly evident that this plot involves four points since each point is given a random “nudge.” Keep in mind, however, that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged. 
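To see that jittering changes only what is drawn and not what is stored, here is a minimal sketch using base R’s jitter() function as a stand-in for the idea. The vector x and the nudge size of 0.5 are assumptions chosen for illustration.

```r
# Four identical hypothetical values, all equal to 0
x <- c(0, 0, 0, 0)

set.seed(76)  # arbitrary seed so the random nudges are reproducible

# Nudge each value by a random amount between -0.5 and 0.5
x_jittered <- jitter(x, amount = 0.5)

x_jittered  # four nudged values near 0
x           # the original values are still all 0
```

The nudged copies are what a jittered plot would draw, while x itself is left as it was.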
To create a jittered scatterplot, instead of using geom_point(), we use geom_jitter(). Observe how the following code is very similar to the code that created the scatterplot with overplotting in Subsection 2.3.1, but with geom_point() replaced with geom_jitter(). ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_jitter(width = 30, height = 30) FIGURE 2.6: Arrival versus departure delays jittered scatterplot. In order to specify how much jitter to add, we adjusted the width and height arguments to geom_jitter(). This corresponds to how hard you’d like to shake the plot in horizontal x-axis units and vertical y-axis units, respectively. In this case, both axes are in minutes. How much jitter should we add using the width and height arguments? On the one hand, it is important to add just enough jitter to break any overlap in points, but on the other hand, not so much that we completely alter the original pattern in points. As can be seen in the resulting Figure 2.6, in this case jittering doesn’t really provide much new insight. In this particular case, it can be argued that changing the transparency of the points by setting alpha proved more effective. When would it be better to use a jittered scatterplot? When would it be better to alter the points’ transparency? There is no single right answer that applies to all situations. You need to make a subjective choice and own that choice. At the very least when confronted with overplotting, however, we suggest you make both types of plots and see which one better emphasizes the point you are trying to make. Learning check (LC2.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot? (LC2.8) After viewing Figure 2.4, give an approximate range of arrival delays and departure delays that occur most frequently. 
How has that region changed compared to when you observed the same plot without alpha = 0.2 set in Figure 2.2? 2.3.3 Summary Scatterplots display the relationship between two numerical variables. They are among the most commonly used plots because they can provide an immediate way to see the trend in one numerical variable versus another. However, if you try to create a scatterplot where either one of the two variables is not numerical, you might get strange results. Be careful! With medium to large datasets, you may need to play around with the different modifications to scatterplots we saw such as changing the transparency/opacity of the points or by jittering the points. This tweaking is often a fun part of data visualization, since you’ll have the chance to see different relationships emerge as you tinker with your plots. 2.4 5NG#2: Linegraphs The next of the five named graphs are linegraphs. Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the explanatory variable, is of a sequential nature. In other words, there is an inherent ordering to the variable. The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called time series plots. Let’s illustrate linegraphs using another dataset in the nycflights13 package: the weather data frame. Let’s explore the weather data frame by running View(weather) and glimpse(weather). Furthermore let’s read the associated help file by running ?weather to bring up the help file. Observe that there is a variable called temp of hourly temperature recordings in Fahrenheit at weather stations near all three major airports in New York City: Newark (origin code EWR), John F. Kennedy International (JFK), and LaGuardia (LGA). 
However, instead of considering hourly temperatures for all days in 2013 for all three airports, for simplicity let’s only consider hourly temperatures at Newark airport for the first 15 days in January. Recall in Section 2.3, we used the filter() function to only choose the subset of rows of flights corresponding to Alaska Airlines flights. We similarly use filter() here, but by using the &amp; operator we only choose the subset of rows of weather where the origin is &quot;EWR&quot;, the month is January, and the day is between 1 and 15. We’ll explore filter() further in Chapter 3 on data wrangling. early_january_weather &lt;- weather %&gt;% filter(origin == &quot;EWR&quot; &amp; month == 1 &amp; day &lt;= 15) Learning check (LC2.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather). In what respect do these data frames differ? (LC2.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement, whereas the hour variable does not? 2.4.1 Linegraphs via geom_line Let’s create a time series plot of the hourly temperatures saved in the early_january_weather data frame by using geom_line() to create a linegraph, instead of using geom_point() like we used previously to create scatterplots: ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) + geom_line() FIGURE 2.7: Hourly temperature in Newark for January 1-15, 2013. 
Much as with the ggplot() code that created the scatterplot of departure and arrival delays for Alaska Airlines flights in Figure 2.2, let’s break down this code piece-by-piece in terms of the grammar of graphics: Within the ggplot() function call, we specify two of the components of the grammar of graphics as arguments: The data to be the early_january_weather data frame by setting data = early_january_weather. The aesthetic mapping by setting mapping = aes(x = time_hour, y = temp). Specifically, the variable time_hour maps to the x position aesthetic, while the variable temp maps to the y position aesthetic. We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object in question. In this case, the geometric object is a line set by specifying geom_line(). Learning check (LC2.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis? (LC2.12) Why are linegraphs frequently used when time is the explanatory variable on the x-axis? (LC2.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013. 2.4.2 Summary Linegraphs, just like scatterplots, display the relationship between two numerical variables. However, it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e., the explanatory variable) has an inherent ordering, such as some notion of time. 2.5 5NG#3: Histograms Let’s consider the temp variable in the weather data frame once again, but unlike with the linegraphs in Section 2.4, let’s say we don’t care about its relationship with time, but rather we only care about how the values of temp distribute. In other words: What are the smallest and largest values? What is the “center” or “most typical” value? How do the values spread out? What are frequent and infrequent values? 
One way to visualize the distribution of the single variable temp is to plot its values on a horizontal line as we do in Figure 2.8: FIGURE 2.8: Plot of hourly temperature recordings from NYC in 2013. This gives us a general idea of how the values of temp distribute: observe that temperatures vary from around 11°F (-12°C) up to 100°F (38°C). Furthermore, there appear to be more recorded temperatures between 40°F and 60°F than outside this range. However, because of the high degree of overplotting in the points, it’s hard to get a sense of exactly how many values are between say 50°F and 55°F. What is commonly produced instead of Figure 2.8 is known as a histogram. A histogram is a plot that visualizes the distribution of a numerical variable as follows: We first cut up the x-axis into a series of bins, where each bin represents a range of values. For each bin, we count the number of observations that fall in the range corresponding to that bin. Then for each bin, we draw a bar whose height marks the corresponding count. Let’s drill down on an example of a histogram, shown in Figure 2.9. FIGURE 2.9: Example histogram. Let’s focus only on temperatures between 30°F (-1°C) and 60°F (16°C) for now. Observe that there are three bins of equal width between 30°F and 60°F. Thus we have three bins of width 10°F each: one bin for the 30-40°F range, another bin for the 40-50°F range, and another bin for the 50-60°F range. Specifically: The bin for the 30-40°F range has a height of around 5000. In other words, around 5000 of the hourly temperature recordings are between 30°F and 40°F. The bin for the 40-50°F range has a height of around 4300. In other words, around 4300 of the hourly temperature recordings are between 40°F and 50°F. The bin for the 50-60°F range has a height of around 3500. In other words, around 3500 of the hourly temperature recordings are between 50°F and 60°F. All nine bins spanning 10°F to 100°F on the x-axis have this interpretation. 
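The binning and counting steps described above can be sketched by hand in base R. This is only an illustration: the temps vector is hypothetical rather than taken from the weather data frame, and the choice to treat each bin as including its lower edge but not its upper edge is an assumption made here for simplicity.

```r
# Hypothetical temperature readings in degrees Fahrenheit (made-up values)
temps <- c(31, 35, 38, 42, 47, 49, 44, 51, 58, 33)

# Step 1: cut the axis into bins of width 10: [30, 40), [40, 50), [50, 60)
breaks <- seq(from = 30, to = 60, by = 10)
bins <- cut(temps, breaks = breaks, right = FALSE)

# Step 2: count the observations that fall in each bin
bin_counts <- table(bins)

# Step 3: each count would become the height of that bin's bar
bin_counts
```

Here the three bins contain 4, 4, and 2 readings respectively, so a histogram of temps would have three bars with those heights.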
2.5.1 Histograms via geom_histogram Let’s now present the ggplot() code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in aes(): the single numerical variable temp. The y-aesthetic of a histogram, the count of the observations in each bin, gets computed for you automatically. Furthermore, the geometric object layer is now a geom_histogram(). After running the following code, you’ll see the histogram in Figure 2.10 as well as warning messages. We’ll discuss the warning messages first. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram() `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. Warning: Removed 1 rows containing non-finite values (stat_bin). FIGURE 2.10: Histogram of hourly temperatures at three NYC airports. The first message is telling us that the histogram was constructed using bins = 30 for 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We’ll see in the next section how to change the number of bins to another value than the default. The second message is telling us something similar to the warning message we received when we ran the code to create a scatterplot of departure and arrival delays for Alaska Airlines flights in Figure 2.2: that because one row has a missing NA value for temp, it was omitted from the histogram. R is just giving us a friendly heads up that this was the case. Now let’s unpack the resulting histogram in Figure 2.10. Observe that values less than 25°F as well as values above 80°F are rather rare. However, because of the large number of bins, it’s hard to get a sense for which range of temperatures is spanned by each bin; everything is one giant amorphous blob. 
So let’s add white vertical borders demarcating the bins by adding a color = &quot;white&quot; argument to geom_histogram() and ignore the warning about setting the number of bins to a better value: ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = &quot;white&quot;) FIGURE 2.11: Histogram of hourly temperatures at three NYC airports with white borders. We now have an easier time associating ranges of temperatures to each of the bins in Figure 2.11. We can also vary the color of the bars by setting the fill argument. For example, you can set the bin colors to be “blue steel” by setting fill = &quot;steelblue&quot;: ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = &quot;white&quot;, fill = &quot;steelblue&quot;) If you’re curious, run colors() to see all 657 possible choices of colors in R! 2.5.2 Adjusting the bins Observe in Figure 2.11 that in the 50-75°F range there appear to be roughly 8 bins. Thus each bin has width 25 divided by 8, or 3.125°F, which is not a very easily interpretable range to work with. Let’s improve this by adjusting the number of bins in our histogram in one of two ways: By adjusting the number of bins via the bins argument to geom_histogram(). By adjusting the width of the bins via the binwidth argument to geom_histogram(). Using the first method, we have the power to specify how many bins we would like to cut the x-axis up into. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 40 bins, as follows: ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 40, color = &quot;white&quot;) Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the binwidth argument in the geom_histogram() layer. For example, let’s set the width of each bin to be 10°F. 
ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) We compare both resulting histograms side-by-side in Figure 2.12. FIGURE 2.12: Setting histogram bins in two ways. Learning check (LC2.14) What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures? (LC2.15) Would you classify the distribution of temperatures as symmetric or skewed in one direction or another? (LC2.16) What would you guess is the “center” value in this distribution? Why did you make that choice? (LC2.17) Is this data spread out greatly from the center or is it close? Why? 2.5.3 Summary Histograms, unlike scatterplots and linegraphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question. 2.6 Facets Before continuing with the next of the 5NG, let’s briefly introduce a new concept called faceting. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ. For example, suppose we were interested in looking at how the histogram of hourly temperature recordings at the three NYC airports we saw in Figure 2.9 differed in each month. We could “split” this histogram by the 12 possible months in a given year. In other words, we would plot histograms of temp for each month separately. We do this by adding a facet_wrap(~ month) layer. Note the ~ is a “tilde” and can generally be found on the key next to the “1” key on US keyboards. The tilde is required; if you don’t include it here, you’ll receive the error Error in as.quoted(facets) : object 'month' not found. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + facet_wrap(~ month) FIGURE 2.13: Faceted histogram of hourly temperatures by month. 
We can also specify the number of rows and columns in the grid by using the nrow and ncol arguments inside of facet_wrap(). For example, say we would like our faceted histogram to have 4 rows instead of 3. We simply add an nrow = 4 argument to facet_wrap(~ month) ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + facet_wrap(~ month, nrow = 4) FIGURE 2.14: Faceted histogram with 4 instead of 3 rows. Observe in both Figures 2.13 and 2.14 that as we might expect in the Northern Hemisphere, temperatures tend to be higher in the summer months, while they tend to be lower in the winter. Learning check (LC2.18) What other things do you notice about this faceted plot? How does a faceted plot help us see relationships between two variables? (LC2.19) What do the numbers 1-12 correspond to in the plot? What about 25, 50, 75, 100? (LC2.20) For which types of datasets would faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics. (LC2.21) Does the temp variable in the weather dataset have a lot of variability? Why do you say that? 2.7 5NG#4: Boxplots While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal is a side-by-side boxplot. A boxplot is constructed from the information provided in the five-number summary of a numerical variable (see Appendix A.1). To keep things simple for now, let’s only consider the 2141 hourly temperature recordings for the month of November, each represented as a jittered point in Figure 2.15. FIGURE 2.15: November temperatures represented as jittered points. 
These 2141 observations have the following five-number summary: Minimum: 21°F First quartile (25th percentile): 36°F Median (second quartile, 50th percentile): 45°F Third quartile (75th percentile): 52°F Maximum: 71°F In the leftmost plot of Figure 2.16, let’s mark these 5 values with dashed horizontal lines on top of the 2141 points. In the middle plot of Figure 2.16 let’s add the boxplot. In the rightmost plot of Figure 2.16, let’s remove the points and the dashed horizontal lines for clarity’s sake. FIGURE 2.16: Building up a boxplot of November temperatures. What the boxplot does is visually summarize the 2141 points by cutting the 2141 temperature recordings into quartiles at the dashed lines, where each quartile contains roughly 2141 \\(\\div\\) 4 \\(\\approx\\) 535 observations. Thus 25% of points fall below the bottom edge of the box, which is the first quartile of 36°F. In other words, 25% of observations were below 36°F. 25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 45°F. Thus, 25% of observations were between 36°F and 45°F and 50% of observations were below 45°F. 25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 52°F. It follows that 25% of observations were between 45°F and 52°F and 75% of observations were below 52°F. 25% of points fall above the top edge of the box. In other words, 25% of observations were above 52°F. The middle 50% of points lie within the interquartile range (IQR) between the first and third quartile. Thus, the IQR for this example is 52 - 36 = 16°F. The interquartile range is a measure of a numerical variable’s spread. Furthermore, in the rightmost plot of Figure 2.16, we see the whiskers of the boxplot. The whiskers stick out from either end of the box all the way to the minimum and maximum observed temperatures of 21°F and 71°F, respectively. 
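The five-number summary and the interquartile range described above can be recomputed directly from the data. The following is a minimal sketch in base R, assuming the nycflights13 package is installed; the object names november_temps, five_numbers, and iqr_value are made up for this illustration, and quantile() may use a slightly different interpolation rule than the one behind the rounded values quoted in the text.

```r
library(nycflights13)

# Hourly temperature recordings for November only (month == 11).
november_temps <- weather$temp[weather$month == 11]

# Minimum, first quartile, median, third quartile, and maximum.
# na.rm = TRUE guards against any missing temperature recordings.
five_numbers <- quantile(november_temps,
                         probs = c(0, 0.25, 0.5, 0.75, 1),
                         na.rm = TRUE)
five_numbers

# Interquartile range: third quartile minus first quartile.
iqr_value <- IQR(november_temps, na.rm = TRUE)
iqr_value
```

Comparing the printed values with the rounded figures in the text is a quick way to confirm which observations the boxplot is summarizing.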
However, the whiskers don’t always extend to the smallest and largest observed values as they do here. They in fact extend no more than 1.5 \\(\\times\\) the interquartile range from either end of the box. In the case of the November temperatures, that is no more than 1.5 \\(\\times\\) 16°F = 24°F from either end of the box. Any observed values outside this range get marked with points called outliers, which we’ll see in the next section. 2.7.1 Boxplots via geom_boxplot Let’s now create a side-by-side boxplot of hourly temperatures split by the 12 months as we did previously with the faceted histograms. We do this by mapping the month variable to the x-position aesthetic, the temp variable to the y-position aesthetic, and by adding a geom_boxplot() layer: ggplot(data = weather, mapping = aes(x = month, y = temp)) + geom_boxplot() FIGURE 2.17: Invalid boxplot specification. Warning messages: 1: Continuous x aesthetic -- did you forget aes(group=...)? 2: Removed 1 rows containing non-finite values (stat_boxplot). Observe in Figure 2.17 that this plot does not provide information about temperature separated by month. The first warning message clues us in as to why. It is telling us that we have a “continuous”, or numerical variable, on the x-position aesthetic. Boxplots, however, require a categorical variable to be mapped to the x-position aesthetic. The second warning message is the same kind of warning we saw when plotting a histogram of hourly temperatures: one of the values was recorded as missing (NA). We can convert the numerical variable month into a factor categorical variable by using the factor() function. After applying factor(month), month goes from having just the numerical values 1, 2, …, and 12 to being a categorical variable whose 12 levels have an associated ordering. With this ordering, ggplot() now knows how to work with this variable to produce the needed plot. 
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + geom_boxplot() FIGURE 2.18: Side-by-side boxplot of temperature split by month. The resulting Figure 2.18 shows 12 separate “box and whiskers” plots similar to the rightmost plot of Figure 2.16 of only November temperatures. Thus the different boxplots are shown “side-by-side.” The “box” portions of the visualization represent the 1st quartile, the median (the 2nd quartile), and the 3rd quartile. The height of each box (the value of the 3rd quartile minus the value of the 1st quartile) is the interquartile range (IQR). It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability. The “whisker” portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentile, respectively. They’re set to extend out no more than \\(1.5 \\times IQR\\) units away from either end of the boxes. We say “no more than” because the ends of the whiskers have to correspond to observed temperatures. The lengths of these whiskers show how the data outside the middle 50% of values vary, with longer whiskers indicating more variability. The dots representing values falling outside the whiskers are called outliers. These can be thought of as anomalous (“out-of-the-ordinary”) values. It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, they are defined by the length of the whiskers, which are no more than \\(1.5 \\times IQR\\) units long for each boxplot. Looking at this side-by-side plot we can see, as expected, that summer months (6 through 8) have higher median temperatures as evidenced by the higher solid lines in the middle of the boxes. We can easily compare temperatures across months by drawing imaginary horizontal lines across the plot. 
Furthermore, the heights of the 12 boxes as quantified by the interquartile ranges are informative too; they tell us about variability, or spread, of temperatures recorded in a given month. Learning check (LC2.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point. (LC2.23) Which months have the highest variability in temperature? What reasons can you give for this? (LC2.24) We looked at the distribution of the numerical variable temp split by the numerical variable month that we converted using the factor() function in order to make a side-by-side boxplot. Why would a boxplot of temp split by the numerical variable pressure similarly converted to a categorical variable using the factor() not be informative? (LC2.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram? 2.7.2 Summary Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes. To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points. 2.8 5NG#5: Barplots Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another commonly desired task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories within a categorical variable, also known as the levels of the categorical variable. 
Often the best way to visualize these different counts, also known as frequencies, is with barplots (also called barcharts). One complication, however, is how your data is represented. Is the categorical variable of interest “pre-counted” or not? For example, run the following code that manually creates two data frames representing a collection of fruit: 3 apples and 2 oranges. fruits &lt;- tibble( fruit = c(&quot;apple&quot;, &quot;apple&quot;, &quot;orange&quot;, &quot;apple&quot;, &quot;orange&quot;) ) fruits_counted &lt;- tibble( fruit = c(&quot;apple&quot;, &quot;orange&quot;), number = c(3, 2) ) We see that both the fruits and fruits_counted data frames represent the same collection of fruit. Whereas fruits just lists the fruit individually… # A tibble: 5 x 1 fruit &lt;chr&gt; 1 apple 2 apple 3 orange 4 apple 5 orange … fruits_counted has a variable number which represents the “pre-counted” values of each fruit. # A tibble: 2 x 2 fruit number &lt;chr&gt; &lt;dbl&gt; 1 apple 3 2 orange 2 Depending on how your categorical data is represented, you’ll need to add a different geometric layer type to your ggplot() to create a barplot, as we now explore. 2.8.1 Barplots via geom_bar or geom_col Let’s generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the fruits data frame where all 5 fruits are listed individually in 5 rows, we map the fruit variable to the x-position aesthetic and add a geom_bar() layer: ggplot(data = fruits, mapping = aes(x = fruit)) + geom_bar() FIGURE 2.19: Barplot when counts are not pre-counted. However, using the fruits_counted data frame where the fruits have been “pre-counted”, we once again map the fruit variable to the x-position aesthetic, but here we also map the number variable to the y-position aesthetic, and add a geom_col() layer instead. ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) + geom_col() FIGURE 2.20: Barplot when counts are pre-counted. 
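The link between the two representations can be made concrete by tallying the un-counted data. The sketch below is only an illustration and uses base R alone: the name fruits_tallied and the use of data.frame() in place of tibble() are assumptions made for this example, not part of the text.

```r
# Hypothetical re-creation of the fruit basket, one row per piece of fruit.
fruits <- data.frame(
  fruit = c('apple', 'apple', 'orange', 'apple', 'orange')
)

# table() tallies how often each category occurs; as.data.frame() turns the
# tally into a two-column data frame, with the count column named 'number'
# to mirror the pre-counted representation.
fruits_tallied <- as.data.frame(table(fruit = fruits$fruit),
                                responseName = 'number')
fruits_tallied
```

In this sense geom_bar() does the tallying step for you, while geom_col() expects the tally to exist already as a column.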
Compare the barplots in Figures 2.19 and 2.20. They are identical because they reflect counts of the same five fruits. However, depending on how our categorical data is represented, either “pre-counted” or not, we must add a different geom layer. When the categorical variable whose distribution you want to visualize Is not pre-counted in your data frame, we use geom_bar(). Is pre-counted in your data frame, we use geom_col() with the y-position aesthetic mapped to the variable that has the counts. Let’s now go back to the flights data frame in the nycflights13 package and visualize the distribution of the categorical variable carrier. In other words, let’s visualize the number of domestic flights out of New York City each airline company flew in 2013. Recall from Subsection 1.4.3 when you first explored the flights data frame, you saw that each row corresponds to a flight. In other words, the flights data frame is more like the fruits data frame than the fruits_counted data frame because the flights have not been pre-counted by carrier. Thus we should use geom_bar() instead of geom_col() to create a barplot. Much like a geom_histogram(), there is only one variable in the aes() aesthetic mapping: the variable carrier gets mapped to the x-position. As a difference though, histograms have bars that touch whereas bar graphs have white space between the bars going from left to right. ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() FIGURE 2.21: Number of flights departing NYC in 2013 by airline using geom_bar(). Observe in Figure 2.21 that United Airlines (UA), JetBlue Airways (B6), and ExpressJet Airlines (EV) had the most flights depart NYC in 2013. If you don’t know which airlines correspond to which carrier codes, then run View(airlines) to see a directory of airlines. For example, B6 is JetBlue Airways. Alternatively, say you had a data frame where the number of flights for each carrier was pre-counted as in Table 2.3. 
TABLE 2.3: Number of flights pre-counted for each carrier carrier number 9E 18460 AA 32729 AS 714 B6 54635 DL 48110 EV 54173 F9 685 FL 3260 HA 342 MQ 26397 OO 32 UA 58665 US 20536 VX 5162 WN 12275 YV 601 In order to create a barplot visualizing the distribution of the categorical variable carrier in this case, we would now use geom_col() instead of geom_bar(), with an additional y = number in the aesthetic mapping on top of the x = carrier. The resulting barplot would be identical to Figure 2.21. Learning check (LC2.26) Why are histograms inappropriate for categorical variables? (LC2.27) What is the difference between histograms and barplots? (LC2.28) How many Envoy Air flights departed NYC in 2013? (LC2.29) What was the 7th highest airline for departed flights from NYC in 2013? How could we better present the table to get this answer quickly? 2.8.2 Must avoid pie charts! One of the most common plots used to visualize the distribution of categorical data is the pie chart. While they may seem harmless enough, pie charts actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book, Creating More Effective Graphs (Robbins 2013), we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another. Let’s examine the same data used in our previous barplot of the number of flights departing NYC by airline in Figure 2.21, but this time we will use a pie chart in Figure 2.22. Try to answer the following questions: How much larger is the portion of the pie for ExpressJet Airlines (EV) compared to US Airways (US)? What is the third largest carrier in terms of departing flights? How many carriers have fewer flights than United Airlines (UA)? FIGURE 2.22: The dreaded pie chart. 
While it is quite difficult to answer these questions when looking at the pie chart in Figure 2.22, we can much more easily answer these questions using the barchart in Figure 2.21. This is true since barplots present the information in a way such that comparisons between categories can be made with single horizontal lines, whereas pie charts present the information in a way such that comparisons must be made by comparing angles. Learning check (LC2.30) Why should pie charts be avoided and replaced by barplots? (LC2.31) Why do you think people continue to use pie charts? 2.8.3 Two categorical variables Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the joint distribution of two categorical variables at the same time. Let’s examine the joint distribution of outgoing domestic flights from NYC by carrier as well as origin. In other words, the number of flights for each carrier and origin combination. For example, the number of WestJet flights from JFK, the number of WestJet flights from LGA, the number of WestJet flights from EWR, the number of American Airlines flights from JFK, and so on. Recall the ggplot() code that created the barplot of carrier frequency in Figure 2.21: ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() We can now map the additional variable origin by adding a fill = origin inside the aes() aesthetic mapping. ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar() FIGURE 2.23: Stacked barplot of flight amount by carrier and origin. Figure 2.23 is an example of a stacked barplot. While simple to make, in certain aspects it is not ideal. For example, it is difficult to compare the heights of the different colors between the bars, corresponding to comparing the number of flights from each origin airport between the carriers. 
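The counts that a stacked barplot encodes can also be inspected as numbers. The following is a rough sketch using base R's table() function, which the chapter itself does not use; it assumes the nycflights13 package is installed, and the name carrier_by_origin is made up for this illustration.

```r
library(nycflights13)

# Cross-tabulate carrier (rows) against origin airport (columns).
# Each cell is the number of flights for one carrier and origin combination,
# which corresponds to the height of one colored segment in a stacked bar.
carrier_by_origin <- table(flights$carrier, flights$origin)
carrier_by_origin

# Row totals give the overall number of flights per carrier, which is what
# the single-variable barplot of carrier displays.
rowSums(carrier_by_origin)
```

Reading the table row by row makes it easier to see why segments that do not share a common baseline are hard to compare by eye.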
Before we continue, let’s address some common points of confusion among new R users. First, the fill aesthetic corresponds to the color used to fill the bars, while the color aesthetic corresponds to the color of the outline of the bars. This is identical to how we added color to our histogram in Subsection 2.5.1: we set the outline of the bars to white by setting color = &quot;white&quot; and the colors of the bars to blue steel by setting fill = &quot;steelblue&quot;. Observe in Figure 2.24 that mapping origin to color and not fill yields grey bars with different colored outlines. ggplot(data = flights, mapping = aes(x = carrier, color = origin)) + geom_bar() FIGURE 2.24: Stacked barplot with color aesthetic used instead of fill. Second, note that fill is another aesthetic mapping much like x-position; thus we were careful to include it within the parentheses of the aes() mapping. The following code, where the fill aesthetic is specified outside the aes() mapping will yield an error. This is a fairly common error that new ggplot users make: ggplot(data = flights, mapping = aes(x = carrier), fill = origin) + geom_bar() An alternative to stacked barplots are side-by-side barplots, also known as dodged barplots, as seen in Figure 2.25. The code to create a side-by-side barplot is identical to the code to create a stacked barplot, but with a position = &quot;dodge&quot; argument added to geom_bar(). In other words, we are overriding the default barplot type, which is a stacked barplot, and specifying it to be a side-by-side barplot instead. ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar(position = &quot;dodge&quot;) FIGURE 2.25: Side-by-side barplot comparing number of flights by carrier and origin. Note the width of the bars for AS, F9, FL, HA and YV is different than the others. 
We can make one tweak to the position argument to get them to be the same size in terms of width as the other bars by using the more robust position_dodge() function. ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar(position = position_dodge(preserve = &quot;single&quot;)) FIGURE 2.26: Side-by-side barplot comparing number of flights by carrier and origin (with formatting tweak). Lastly, another type of barplot is a faceted barplot. Recall in Section 2.6 we visualized the distribution of hourly temperatures at the 3 NYC airports split by month using facets. We apply the same principle to our barplot visualizing the frequency of carrier split by origin: instead of mapping origin to fill we include it as the variable to create small multiples of the plot across the levels of origin. ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() + facet_wrap(~ origin, ncol = 1) FIGURE 2.27: Faceted barplot comparing the number of flights by carrier and origin. Learning check (LC2.32) What kinds of questions are not easily answered by looking at Figure 2.23? (LC2.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights? (LC2.34) Why might the side-by-side barplot be preferable to a stacked barplot in this case? (LC2.35) What are the disadvantages of using a dodged barplot, in general? (LC2.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case? (LC2.37) What information about the different carriers at different airports is more easily seen in the faceted barplot? 2.8.4 Summary Barplots are a common way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories (also called levels) occur. They are easy to understand and make it easy to make comparisons across levels. 
Furthermore, when trying to visualize the relationship of two categorical variables, you have many options: stacked barplots, side-by-side barplots, and faceted barplots. Depending on what aspect of the relationship you are trying to emphasize, you will need to make a choice between these three types of barplots and own that choice. 2.9 Conclusion 2.9.1 Summary table Let’s recap all five of the five named graphs (5NG) in Table 2.4 summarizing their differences. Using these 5NG, you’ll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each geometric object’s aesthetic attribute options, further unlocking the awesome power of the ggplot2 package. TABLE 2.4: Summary of Five Named Graphs Named graph Shows Geometric object Notes 1 Scatterplot Relationship between 2 numerical variables geom_point() 2 Linegraph Relationship between 2 numerical variables geom_line() Used when there is a sequential order to x-variable, e.g., time 3 Histogram Distribution of 1 numerical variable geom_histogram() Faceted histograms show the distribution of 1 numerical variable split by the values of another variable 4 Boxplot Distribution of 1 numerical variable split by the values of another variable geom_boxplot() 5 Barplot Distribution of 1 categorical variable geom_bar() when counts are not pre-counted, geom_col() when counts are pre-counted Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables 2.9.2 Function argument specification Let’s go over some important points about specifying the arguments (i.e., inputs) to functions. 
Run the following two segments of code: # Segment 1: ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() # Segment 2: ggplot(flights, aes(x = carrier)) + geom_bar() You’ll notice that both code segments create the same barplot, even though in the second segment we omitted the data = and mapping = code argument names. This is because the ggplot() function by default assumes that the data argument comes first and the mapping argument comes second. As long as you specify the data frame in question first and the aes() mapping second, you can omit the explicit statement of the argument names data = and mapping =. Going forward for the rest of this book, all ggplot() code will be like the second segment: with the data = and mapping = explicit naming of the argument omitted with the default ordering of arguments respected. We’ll do this for brevity’s sake; it’s common to see this style when reviewing other R users’ code. 2.9.3 Additional resources An R script file of all R code used in this chapter is available here. If you want to further unlock the power of the ggplot2 package for data visualization, we suggest that you check out RStudio’s “Data Visualization with ggplot2” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter. In particular, it presents many more than the 5 geometric objects we covered in this chapter while providing quick and easy to read visual descriptions. For all the geometric objects, it also lists all the possible aesthetic attributes one can tweak. In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Visualization with ggplot2.” You can see a preview in the figure below. FIGURE 2.28: Data Visualization with ggplot2 cheatsheet. 2.9.4 What’s to come Recall in Figure 2.2 in Section 2.3 we visualized the relationship between departure delay and arrival delay for Alaska Airlines flights. 
This necessitated paring down the flights data frame to a new data frame alaska_flights consisting of only carrier == &quot;AS&quot; flights first: alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() Furthermore recall in Figure 2.7 in Section 2.4 we visualized hourly temperature recordings at Newark airport only for the first 15 days of January 2013. This necessitated paring down the weather data frame to a new data frame early_january_weather consisting of hourly temperature recordings only for origin == &quot;EWR&quot;, month == 1, and day less than or equal to 15 first: early_january_weather &lt;- weather %&gt;% filter(origin == &quot;EWR&quot; &amp; month == 1 &amp; day &lt;= 15) ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) + geom_line() These two code segments were a preview of Chapter 3 on data wrangling using the dplyr package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, these two code segments used the filter() function to create new data frames (alaska_flights and early_january_weather) by choosing only a subset of rows of existing data frames (flights and weather). In the next chapter, we’ll formally introduce the filter() and other data wrangling functions as well as the pipe operator %&gt;% which allows you to combine multiple data wrangling actions into a single sequential chain of actions. On to Chapter 3 on data wrangling! References "],
+["3-wrangling.html", "Chapter 3 Data Wrangling 3.1 The pipe operator: %&gt;% 3.2 filter rows 3.3 summarize variables 3.4 group_by rows 3.5 mutate existing variables 3.6 arrange and sort rows 3.7 join data frames 3.8 Other verbs 3.9 Conclusion", " Chapter 3 Data Wrangling So far in our journey, we’ve seen how to look at data saved in data frames using the glimpse() and View() functions in Chapter 1, and how to create data visualizations using the ggplot2 package in Chapter 2. In particular we studied what we term the “five named graphs” (5NG): scatterplots via geom_point() linegraphs via geom_line() boxplots via geom_boxplot() histograms via geom_histogram() barplots via geom_bar() or geom_col() We created these visualizations using the grammar of graphics, which maps variables in a data frame to the aesthetic attributes of one of the 5 geometric objects. We can also control other aesthetic attributes of the geometric objects such as the size and color as seen in the Gapminder data example in Figure 2.1. Recall however that for two of our visualizations, we first needed to transform/modify existing data frames a little. For example, recall the scatterplot in Figure 2.2 of departure and arrival delays only for Alaska Airlines flights. In order to create this visualization, we first needed to pare down the flights data frame to an alaska_flights data frame consisting of only carrier == &quot;AS&quot; flights. Thus, alaska_flights will have fewer rows than flights. We did this using the filter() function: alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) In this chapter, we’ll extend this example and we’ll introduce a series of functions from the dplyr package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs. Such functions include: filter() a data frame’s existing rows to only pick out a subset of them. For example, the alaska_flights data frame. 
summarize() one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section 2.7 on boxplots. group_by() its rows. In other words, assign different rows to be part of the same group. We can then combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one computed for each of the three origin airports. mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius. arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp. join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together. Notice how we used computer_code font to describe the actions we want to take on our data frames. This is because the dplyr package for data wrangling has intuitively verb-named functions that are easy to remember. There is a further benefit to learning to use the dplyr package for data wrangling: its similarity to the database querying language SQL (pronounced “sequel” or spelled out as “S”, “Q”, “L”). SQL (which stands for “Structured Query Language”) is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn dplyr, you can learn SQL easily. We’ll talk more about their similarities in Subsection 3.7.4. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages. 
library(dplyr) library(ggplot2) library(nycflights13) 3.1 The pipe operator: %&gt;% Before we start data wrangling, let’s first introduce a nifty tool that gets loaded with the dplyr package: the pipe operator %&gt;%. The pipe operator allows us to combine multiple operations in R into a single sequential chain of actions. Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x using hypothetical functions f(), g(), and h(): Take x then Use x as an input to a function f() then Use the output of f(x) as an input to a function g() then Use the output of g(f(x)) as an input to a function h() One way to achieve this sequence of operations is by using nesting parentheses as follows: h(g(f(x))) This code isn’t so hard to read since we are applying only three functions: f(), then g(), then h() and each of the functions is short in its name. Further, each of these functions also only has one argument. However, you can imagine that this will get progressively harder to read as the number of functions applied in your sequence increases and the arguments in each function increase as well. This is where the pipe operator %&gt;% comes in handy. %&gt;% takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read %&gt;% as “then” or “and then.” For example, you can obtain the same output as the hypothetical sequence of functions as follows: x %&gt;% f() %&gt;% g() %&gt;% h() You would read this sequence as: Take x then Use this output as the input to the next function f() then Use this output as the input to the next function g() then Use this output as the input to the next function h() So while both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical x, f(), g(), and h()? 
Throughout this chapter on data wrangling: The starting value x will be a data frame. For example, the flights data frame we explored in Section 1.4. The sequence of functions, here f(), g(), and h(), will mostly be a sequence of any number of the six data wrangling verb-named functions we listed in the introduction to this chapter. For example, the filter(carrier == &quot;AS&quot;) function call we previewed earlier. The result will be the transformed/modified data frame that you want. In our example, we’ll save the result in a new data frame by using the &lt;- assignment operator with the name alaska_flights via alaska_flights &lt;-. alaska_flights &lt;- flights %&gt;% filter(carrier == &quot;AS&quot;) Much like when adding layers to a ggplot() using the + sign, you form a single chain of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator %&gt;%. Furthermore, much like how the + sign has to come at the end of lines when constructing plots, the pipe operator %&gt;% has to come at the end of lines as well. Keep in mind, there are many more advanced data wrangling functions than just the six listed in the introduction to this chapter; you’ll see some examples of these in Section 3.8. However, just with these six verb-named functions you’ll be able to perform a broad array of data wrangling tasks for the rest of this book. 3.2 filter rows FIGURE 3.1: Diagram of filter() rows operation. The filter() function here works much like the “Filter” option in Microsoft Excel; it allows you to specify criteria about the values of a variable in your dataset and then keeps only the rows that match those criteria. We begin by focusing only on flights from New York City to Portland, Oregon. The dest destination code (or airport code) for Portland, Oregon is &quot;PDX&quot;. 
Run the following and look at the results in RStudio’s spreadsheet viewer to ensure that only flights heading to Portland are chosen here: portland_flights &lt;- flights %&gt;% filter(dest == &quot;PDX&quot;) View(portland_flights) Note the order of the code. First, take the flights data frame, then filter() it so that only those rows where dest equals &quot;PDX&quot; are included. We test for equality using the double equal sign == and not a single equal sign =. In other words filter(dest = &quot;PDX&quot;) will yield an error. This is a convention across many programming languages. If you are new to coding, you’ll probably forget to use the double equal sign == a few times before you get the hang of it. You can use other operators beyond just the == operator that tests for equality: &gt; corresponds to “greater than” &lt; corresponds to “less than” &gt;= corresponds to “greater than or equal to” &lt;= corresponds to “less than or equal to” != corresponds to “not equal to.” The ! is used in many programming languages to indicate “not.” Furthermore, you can combine multiple criteria using logical operators: | corresponds to “or” &amp; corresponds to “and” To see many of these in action, let’s filter flights for all rows that departed from JFK and were heading to Burlington, Vermont (&quot;BTV&quot;) or Seattle, Washington (&quot;SEA&quot;) and departed in the months of October, November, or December. 
Run the following: btv_sea_flights_fall &lt;- flights %&gt;% filter(origin == &quot;JFK&quot; &amp; (dest == &quot;BTV&quot; | dest == &quot;SEA&quot;) &amp; month &gt;= 10) View(btv_sea_flights_fall) Note that even though colloquially speaking one might say “all flights heading to Burlington, Vermont and Seattle, Washington,” in terms of computer operations, we really mean “all flights heading to Burlington, Vermont or heading to Seattle, Washington.” For a given row in the data, dest can be &quot;BTV&quot;, or &quot;SEA&quot;, or something else, but not both &quot;BTV&quot; and &quot;SEA&quot; at the same time. Furthermore, note the careful use of parentheses around dest == &quot;BTV&quot; | dest == &quot;SEA&quot;. Without them, R would evaluate the &amp; operators before the | operator and the criteria would be grouped differently. We can often skip the use of &amp; and just separate our conditions with a comma. The previous code will return the identical output btv_sea_flights_fall as the following code: btv_sea_flights_fall &lt;- flights %&gt;% filter(origin == &quot;JFK&quot;, (dest == &quot;BTV&quot; | dest == &quot;SEA&quot;), month &gt;= 10) View(btv_sea_flights_fall) Let’s present another example that uses the ! “not” operator to pick rows that don’t match a criterion. As mentioned earlier, the ! can be read as “not.” Here we are filtering rows corresponding to flights that didn’t go to Burlington, VT or Seattle, WA. not_BTV_SEA &lt;- flights %&gt;% filter(!(dest == &quot;BTV&quot; | dest == &quot;SEA&quot;)) View(not_BTV_SEA) Again, note the careful use of parentheses around the (dest == &quot;BTV&quot; | dest == &quot;SEA&quot;). If we didn’t use parentheses as follows: flights %&gt;% filter(!dest == &quot;BTV&quot; | dest == &quot;SEA&quot;) we would be returning all flights not headed to &quot;BTV&quot; or those headed to &quot;SEA&quot;, which is an entirely different resulting data frame. Now say we have a larger number of airports we want to filter for, say &quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, and &quot;BDL&quot;. 
We could continue to use the | (or) operator: many_airports &lt;- flights %&gt;% filter(dest == &quot;SEA&quot; | dest == &quot;SFO&quot; | dest == &quot;PDX&quot; | dest == &quot;BTV&quot; | dest == &quot;BDL&quot;) but as we progressively include more airports, this will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Subsection 1.2.1 that the c() function “combines” or “concatenates” values into a single vector of values. many_airports &lt;- flights %&gt;% filter(dest %in% c(&quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, &quot;BDL&quot;)) View(many_airports) What this code is doing is filtering flights for all flights where dest is in the vector of airports c(&quot;SEA&quot;, &quot;SFO&quot;, &quot;PDX&quot;, &quot;BTV&quot;, &quot;BDL&quot;). Both outputs of many_airports are the same, but as you can see the latter takes much less effort to code. The %in% operator is useful for checking whether each value of one vector/variable appears among the values of another. As a final note, we recommend that filter() should often be among the first verbs you consider applying to your data. This pares your dataset down to only those rows you care about, or put differently, it narrows down the scope of your data frame to just the observations you care about. Learning check (LC3.1) What’s another way of using the “not” operator ! to filter only the rows that are going to neither Burlington, VT nor Seattle, WA in the flights data frame? Test this out using the previous code. 3.3 summarize variables The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (also called the average) and the median (the middle value). 
Other examples of summary statistics that might not immediately come to mind include the sum, the smallest value also called the minimum, the largest value also called the maximum, and the standard deviation. See Appendix A.1 for a glossary of such summary statistics. Let’s calculate two summary statistics of the temp temperature variable in the weather data frame: the mean and standard deviation (recall from Section 1.4 that the weather data frame is included in the nycflights13 package). To compute these summary statistics, we need the mean() and sd() summary functions in R. Summary functions in R take in many values and return a single value, as illustrated in Figure 3.2. FIGURE 3.2: Diagram illustrating a summary function in R. More precisely, we’ll use the mean() and sd() summary functions within the summarize() function from the dplyr package. Note you can also use the British English spelling of summarise(). As shown in Figure 3.3, the summarize() function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics. FIGURE 3.3: Diagram of summarize() rows. We’ll save the results in a new data frame called summary_temp that will have two columns/variables: the mean and the std_dev: summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp), std_dev = sd(temp)) summary_temp # A tibble: 1 x 2 mean std_dev &lt;dbl&gt; &lt;dbl&gt; 1 NA NA Why are the values returned NA? As we saw in Subsection 2.3.1 when creating the scatterplot of departure and arrival delays for alaska_flights, NA is how R encodes missing values where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? 
Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data. Going back to our summary_temp output, by default any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. To work around this fact, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and only return the summary value for all non-missing values. The code that follows computes the mean and standard deviation of all non-missing values of temp: summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_temp # A tibble: 1 x 2 mean std_dev &lt;dbl&gt; &lt;dbl&gt; 1 55.3 17.8 Notice how na.rm = TRUE is passed as an argument to each of the mean() and sd() summary functions individually, and not to the summarize() function. However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming Learning check questions, we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the na.rm argument to any summary statistic function in R is set to FALSE by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis. What are other summary functions we can use inside the summarize() verb to compute summary statistics? As seen in the diagram in Figure 3.2, you can use any function in R that takes many values and returns just one. 
Here are just a few: mean(): the average sd(): the standard deviation, which is a measure of spread min() and max(): the minimum and maximum values, respectively IQR(): interquartile range sum(): the total amount when adding multiple numbers n(): a count of the number of rows in each group. This particular summary function will make more sense when group_by() is covered in Section 3.4. Learning check (LC3.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach? (LC3.3) Modify the summarize() function to create summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to? (LC3.4) Why doesn’t the following code work? Run the code line-by-line instead of all at once, and then look at the data. In other words, run summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) first. summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) %&gt;% summarize(std_dev = sd(temp, na.rm = TRUE)) 3.4 group_by rows FIGURE 3.4: Diagram of group_by() and summarize(). Say instead of a single mean temperature for the whole year, you would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean temperature split by month. We can do this by “grouping” temperature observations by the values of another variable, in this case by the 12 values of the variable month. 
Run the following code: summary_monthly_temp &lt;- weather %&gt;% group_by(month) %&gt;% summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_monthly_temp # A tibble: 12 x 3 month mean std_dev &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 35.6 10.2 2 2 34.3 6.98 3 3 39.9 6.25 4 4 51.7 8.79 5 5 61.8 9.68 6 6 72.2 7.55 7 7 80.1 7.12 8 8 74.5 5.19 9 9 67.4 8.47 10 10 60.1 8.85 11 11 45.0 10.4 12 12 38.4 9.98 This code is identical to the previous code that created summary_temp, but with an extra group_by(month) added before the summarize(). Grouping the weather dataset by month and then applying the summarize() functions yields a data frame that displays the mean and standard deviation temperature split by the 12 months of the year. It is important to note that the group_by() function doesn’t change data frames by itself. Rather it changes the meta-data, or data about the data, specifically the grouping structure. It is only after we apply the summarize() function that the data frame changes. For example, let’s consider the diamonds data frame included in the ggplot2 package. Run this code: diamonds # A tibble: 53,940 x 10 carat cut color clarity depth table price x y z &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows Observe that the first line of the output reads # A tibble: 53,940 x 10. 
This is an example of meta-data, in this case the number of observations/rows and variables/columns in diamonds. The actual data itself are the subsequent table of values. Now let’s pipe the diamonds data frame into group_by(cut): diamonds %&gt;% group_by(cut) # A tibble: 53,940 x 10 # Groups: cut [5] carat cut color clarity depth table price x y z &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows Observe that now there is additional meta-data: # Groups: cut [5] indicating that the grouping structure meta-data has been set based on the 5 possible levels of the categorical variable cut: &quot;Fair&quot;, &quot;Good&quot;, &quot;Very Good&quot;, &quot;Premium&quot;, and &quot;Ideal&quot;. On the other hand, observe that the data has not changed: it is still a table of 53,940 \\(\\times\\) 10 values. Only by combining a group_by() with another data wrangling operation, in this case summarize(), will the data actually be transformed. diamonds %&gt;% group_by(cut) %&gt;% summarize(avg_price = mean(price)) # A tibble: 5 x 2 cut avg_price &lt;ord&gt; &lt;dbl&gt; 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458. 
If you would like to remove this grouping structure meta-data, we can pipe the resulting data frame into the ungroup() function: diamonds %&gt;% group_by(cut) %&gt;% ungroup() # A tibble: 53,940 x 10 carat cut color clarity depth table price x y z &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows Observe how the # Groups: cut [5] meta-data is no longer present. Let’s now revisit the n() counting summary function we briefly introduced previously. Recall that the n() function counts rows. This is opposed to the sum() summary function that returns the sum of a numerical variable. For example, suppose we’d like to count how many flights departed each of the three airports in New York City: by_origin &lt;- flights %&gt;% group_by(origin) %&gt;% summarize(count = n()) by_origin # A tibble: 3 x 2 origin count &lt;chr&gt; &lt;int&gt; 1 EWR 120835 2 JFK 111279 3 LGA 104662 We see that Newark (&quot;EWR&quot;) had the most flights departing in 2013 followed by &quot;JFK&quot; and lastly by LaGuardia (&quot;LGA&quot;). Note there is a subtle but important difference between sum() and n(); while sum() returns the sum of a numerical variable, n() returns a count of the number of rows/observations. 3.4.1 Grouping by more than one variable You are not limited to grouping by one variable. Say you want to know the number of flights leaving each of the three New York City airports for each month. 
We can also group by a second variable month using group_by(origin, month): by_origin_monthly &lt;- flights %&gt;% group_by(origin, month) %&gt;% summarize(count = n()) by_origin_monthly # A tibble: 36 x 3 # Groups: origin [3] origin month count &lt;chr&gt; &lt;int&gt; &lt;int&gt; 1 EWR 1 9893 2 EWR 2 9107 3 EWR 3 10420 4 EWR 4 10531 5 EWR 5 10592 6 EWR 6 10175 7 EWR 7 10475 8 EWR 8 10359 9 EWR 9 9550 10 EWR 10 10104 # … with 26 more rows Observe that there are 36 rows in by_origin_monthly because there are 12 months for 3 airports (EWR, JFK, and LGA). Why do we group_by(origin, month) and not group_by(origin) and then group_by(month)? Let’s investigate: by_origin_monthly_incorrect &lt;- flights %&gt;% group_by(origin) %&gt;% group_by(month) %&gt;% summarize(count = n()) by_origin_monthly_incorrect # A tibble: 12 x 2 month count &lt;int&gt; &lt;int&gt; 1 1 27004 2 2 24951 3 3 28834 4 4 28330 5 5 28796 6 6 28243 7 7 29425 8 8 29327 9 9 27574 10 10 28889 11 11 27268 12 12 28135 What happened here is that the second group_by(month) overwrote the grouping structure meta-data of the earlier group_by(origin), so that in the end we are only grouping by month. The lesson here is if you want to group_by() two or more variables, you should include all the variables at the same time in the same group_by(), adding a comma between the variable names. Learning check (LC3.5) Recall from Chapter 2 when we looked at temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in NYC throughout the year? (LC3.6) What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC? (LC3.7) Recreate by_origin_monthly, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting dataset? (LC3.8) How could we identify how many flights left each of the three airports for each carrier? 
(LC3.9) How does the filter() operation differ from a group_by() followed by a summarize()? 3.5 mutate existing variables FIGURE 3.5: Diagram of mutate() columns. Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F). The formula to convert temperatures from °F to °C is \\[ \\text{temp in C} = \\frac{\\text{temp in F} - 32}{1.8} \\] We can apply this formula to the temp variable using the mutate() function from the dplyr package, which takes existing variables and mutates them to create new ones. weather &lt;- weather %&gt;% mutate(temp_in_C = (temp - 32) / 1.8) In this code, we mutate() the weather data frame by creating a new variable temp_in_C = (temp - 32) / 1.8 and then overwrite the original weather data frame. Why did we overwrite the data frame weather, instead of assigning the result to a new data frame like weather_new? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable temp, but instead created a new variable called temp_in_C? Because if we did this, we would have erased the original information contained in temp of temperatures in Fahrenheit that may still be valuable to us. 
Let’s now compute monthly average temperatures in both °F and °C using the group_by() and summarize() code we saw in Section 3.4: summary_monthly_temp &lt;- weather %&gt;% group_by(month) %&gt;% summarize(mean_temp_in_F = mean(temp, na.rm = TRUE), mean_temp_in_C = mean(temp_in_C, na.rm = TRUE)) summary_monthly_temp # A tibble: 12 x 3 month mean_temp_in_F mean_temp_in_C &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 35.6 2.02 2 2 34.3 1.26 3 3 39.9 4.38 4 4 51.7 11.0 5 5 61.8 16.6 6 6 72.2 22.3 7 7 80.1 26.7 8 8 74.5 23.6 9 9 67.4 19.7 10 10 60.1 15.6 11 11 45.0 7.22 12 12 38.4 3.58 Let’s consider another example. Passengers are often frustrated when their flight departs late, but aren’t as annoyed if, in the end, pilots can make up some time during the flight. This is known in the airline industry as gain, and we will create this variable using the mutate() function: flights &lt;- flights %&gt;% mutate(gain = dep_delay - arr_delay) Let’s take a look at only the dep_delay, arr_delay, and the resulting gain variables for the first 5 rows in our updated flights data frame in Table 3.1. TABLE 3.1: First five rows of departure/arrival delay and gain variables dep_delay arr_delay gain 2 11 -9 4 20 -16 2 33 -31 -1 -18 17 -6 -25 19 The flight in the first row departed 2 minutes late but arrived 11 minutes late, so its “gained time in the air” is a loss of 9 minutes, hence its gain is 2 - 11 = -9. On the other hand, the flight in the fourth row departed a minute early (dep_delay of -1) but arrived 18 minutes early (arr_delay of -18), so its “gained time in the air” is \\(-1 - (-18) = -1 + 18 = 17\\) minutes, hence its gain is +17. 
Let’s look at some summary statistics of the gain variable by considering multiple summary functions at once in the same summarize() code: gain_summary &lt;- flights %&gt;% summarize( min = min(gain, na.rm = TRUE), q1 = quantile(gain, 0.25, na.rm = TRUE), median = quantile(gain, 0.5, na.rm = TRUE), q3 = quantile(gain, 0.75, na.rm = TRUE), max = max(gain, na.rm = TRUE), mean = mean(gain, na.rm = TRUE), sd = sd(gain, na.rm = TRUE), missing = sum(is.na(gain)) ) gain_summary # A tibble: 1 x 8 min q1 median q3 max mean sd missing &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; 1 -196 -3 7 17 109 5.66 18.0 9430 We see for example that the average gain is about +5.7 minutes, while the largest is +109 minutes! However, this code would take some time to type out in practice. We’ll see later on in Subsection 5.1.1 that there is a much more succinct way to compute a variety of common summary statistics: using the skim() function from the skimr package. Recall from Section 2.5 that since gain is a numerical variable, we can visualize its distribution using a histogram. ggplot(data = flights, mapping = aes(x = gain)) + geom_histogram(color = &quot;white&quot;, bins = 20) FIGURE 3.6: Histogram of gain variable. The resulting histogram in Figure 3.6 provides a different perspective on the gain variable than the summary statistics we computed earlier. For example, note that most values of gain are right around 0. To close out our discussion on the mutate() function to create new variables, note that we can create multiple new variables at once in the same mutate() code. Furthermore, within the same mutate() code we can refer to new variables we just created. 
As an example, consider the mutate() code Hadley Wickham and Garrett Grolemund show in Chapter 5 of R for Data Science (Grolemund and Wickham 2017): flights &lt;- flights %&gt;% mutate( gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) Learning check (LC3.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value? (LC3.11) Could we create the dep_delay and arr_delay columns by simply subtracting sched_dep_time from dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights. (LC3.12) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values. 3.6 arrange and sort rows One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in the alphanumeric order of one of the variables. The dplyr package’s arrange() function allows us to sort/reorder a data frame’s rows according to the values of the specified variable. Suppose we are interested in determining the most frequent destination airports for all domestic flights departing from New York City in 2013: freq_dest &lt;- flights %&gt;% group_by(dest) %&gt;% summarize(num_flights = n()) freq_dest # A tibble: 105 x 2 dest num_flights &lt;chr&gt; &lt;int&gt; 1 ABQ 254 2 ACK 265 3 ALB 439 4 ANC 8 5 ATL 17215 6 AUS 2439 7 AVL 275 8 BDL 443 9 BGR 375 10 BHM 297 # … with 95 more rows Observe that by default the rows of the resulting freq_dest data frame are sorted in alphabetical order of destination. 
Say instead we would like to see the same data, but sorted from the most to the least number of flights (num_flights): freq_dest %&gt;% arrange(num_flights) # A tibble: 105 x 2 dest num_flights &lt;chr&gt; &lt;int&gt; 1 LEX 1 2 LGA 1 3 ANC 8 4 SBN 10 5 HDN 15 6 MTJ 15 7 EYW 17 8 PSP 19 9 JAC 25 10 BZN 36 # … with 95 more rows This is, however, the opposite of what we want. The rows are sorted with the least frequent destination airports displayed first. This is because arrange() returns rows sorted in ascending order by default. To switch the ordering to be in “descending” order instead, we use the desc() function like so: freq_dest %&gt;% arrange(desc(num_flights)) # A tibble: 105 x 2 dest num_flights &lt;chr&gt; &lt;int&gt; 1 ORD 17283 2 ATL 17215 3 LAX 16174 4 BOS 15508 5 MCO 14082 6 CLT 14064 7 SFO 13331 8 FLL 12055 9 MIA 11728 10 DCA 9705 # … with 95 more rows 3.7 join data frames Another common data transformation task is “joining” or “merging” two different datasets. For example, in the flights data frame, the variable carrier lists the carrier code for the different flights. While the corresponding airline names for &quot;UA&quot; and &quot;AA&quot; might be somewhat easy to guess (United and American Airlines), what airlines have codes &quot;VX&quot;, &quot;HA&quot;, and &quot;B6&quot;? This information is provided in a separate data frame airlines. View(airlines) We see that in airlines, carrier is the carrier code, while name is the full name of the airline company. Using this table, we can see that &quot;VX&quot;, &quot;HA&quot;, and &quot;B6&quot; correspond to Virgin America, Hawaiian Airlines, and JetBlue, respectively. However, wouldn’t it be nice to have all this information in a single data frame instead of two separate data frames? We can do this by “joining” the flights and airlines data frames. Note that the values in the variable carrier in the flights data frame match the values in the variable carrier in the airlines data frame. 
In this case, we can use the variable carrier as a key variable to match the rows of the two data frames. Key variables are almost always identification variables that uniquely identify the observational units as we saw in Subsection 1.4.4. This ensures that rows in both data frames are appropriately matched during the join. Hadley and Garrett (Grolemund and Wickham 2017) created the diagram shown in Figure 3.7 to help us understand how the different data frames in the nycflights13 package are linked by various key variables: FIGURE 3.7: Data relationships in nycflights13 from R for Data Science. 3.7.1 Matching “key” variable names In both the flights and airlines data frames, the key variable we want to join/merge/match the rows by has the same name: carrier. Let’s use the inner_join() function to join the two data frames, where the rows will be matched by the variable carrier, and then compare the resulting data frames: flights_joined &lt;- flights %&gt;% inner_join(airlines, by = &quot;carrier&quot;) View(flights) View(flights_joined) Observe that the flights and flights_joined data frames are identical except that flights_joined has an additional variable name. The values of name correspond to the airline companies’ names as indicated in the airlines data frame. A visual representation of the inner_join() is shown in Figure 3.8 (Grolemund and Wickham 2017). There are other types of joins available (such as left_join(), right_join(), full_join(), and anti_join()), but the inner_join() will solve nearly all of the problems you’ll encounter in this book. FIGURE 3.8: Diagram of inner join from R for Data Science. 3.7.2 Different “key” variable names Say instead you are interested in the destinations of all domestic flights departing NYC in 2013, and you ask yourself questions like: “What cities are these airports in?”, or “Is &quot;ORD&quot; Orlando?”, or “Where is &quot;FLL&quot;?”. 
The airports data frame contains the airport codes for each airport: View(airports) However, if you look at both the airports and flights data frames, you’ll find that the airport codes are in variables that have different names. In airports the airport code is in faa, whereas in flights the airport codes are in origin and dest. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure 3.7. In order to join these two data frames by airport code, our inner_join() operation will use the by = c(&quot;dest&quot; = &quot;faa&quot;) argument with modified code syntax allowing us to join two data frames where the key variable has a different name: flights_with_airport_names &lt;- flights %&gt;% inner_join(airports, by = c(&quot;dest&quot; = &quot;faa&quot;)) View(flights_with_airport_names) Let’s construct the chain of pipe operators %&gt;% that computes the number of flights from NYC to each destination, but also includes information about each destination airport: named_dests &lt;- flights %&gt;% group_by(dest) %&gt;% summarize(num_flights = n()) %&gt;% arrange(desc(num_flights)) %&gt;% inner_join(airports, by = c(&quot;dest&quot; = &quot;faa&quot;)) %&gt;% rename(airport_name = name) named_dests # A tibble: 101 x 9 dest num_flights airport_name lat lon alt tz dst tzone &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; 1 ORD 17283 Chicago Ohare Intl 42.0 -87.9 668 -6 A America… 2 ATL 17215 Hartsfield Jackson… 33.6 -84.4 1026 -5 A America… 3 LAX 16174 Los Angeles Intl 33.9 -118. 126 -8 A America… 4 BOS 15508 General Edward Law… 42.4 -71.0 19 -5 A America… 5 MCO 14082 Orlando Intl 28.4 -81.3 96 -5 A America… 6 CLT 14064 Charlotte Douglas … 35.2 -80.9 748 -5 A America… 7 SFO 13331 San Francisco Intl 37.6 -122. 
13 -8 A America… 8 FLL 12055 Fort Lauderdale Ho… 26.1 -80.2 9 -5 A America… 9 MIA 11728 Miami Intl 25.8 -80.3 8 -5 A America… 10 DCA 9705 Ronald Reagan Wash… 38.9 -77.0 15 -5 A America… # … with 91 more rows In case you didn’t know, &quot;ORD&quot; is the airport code of Chicago O’Hare airport and &quot;FLL&quot; is the main airport in Fort Lauderdale, Florida, which can be seen in the airport_name variable. 3.7.3 Multiple “key” variables Say instead we want to join two data frames by multiple key variables. For example, in Figure 3.7, we see that in order to join the flights and weather data frames, we need more than one key variable: year, month, day, hour, and origin. This is because the combination of these 5 variables acts to uniquely identify each observational unit in the weather data frame: hourly weather recordings at each of the 3 NYC airports. We achieve this by specifying a vector of key variables to join by using the c() function. Recall from Subsection 1.2.1 that c() is short for “combine” or “concatenate.” flights_weather_joined &lt;- flights %&gt;% inner_join(weather, by = c(&quot;year&quot;, &quot;month&quot;, &quot;day&quot;, &quot;hour&quot;, &quot;origin&quot;)) View(flights_weather_joined) Learning check (LC3.13) Looking at Figure 3.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour? (LC3.14) What surprises you about the top 10 destinations from NYC in 2013? 3.7.4 Normal forms The data frames included in the nycflights13 package are in a form that minimizes redundancy of data. For example, the flights data frame only saves the carrier code of the airline company; it does not include the actual name of the airline. 
For example, the first row of flights has carrier equal to UA, but it does not include the airline name of “United Air Lines Inc.” The names of the airline companies are included in the name variable of the airlines data frame. In order to have the airline company name included in flights, we could join these two data frames as follows: joined_flights &lt;- flights %&gt;% inner_join(airlines, by = &quot;carrier&quot;) View(joined_flights) We are capable of performing this join because the two data frames have a key in common that relates one to the other: the carrier variable in both the flights and airlines data frames. The key variable(s) that we base our joins on are often identification variables as we mentioned previously. This is an important property of what’s known as normal forms of data. The process of decomposing data frames into less redundant tables without losing information is called normalization. More information is available on Wikipedia. Both dplyr and SQL, which we mentioned in the introduction of this chapter, use such normal forms. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily. Learning check (LC3.15) What are some advantages of data in normal forms? What are some disadvantages? 3.8 Other verbs Here are some other useful data wrangling verbs: select() only a subset of variables/columns. rename() variables/columns to have new names. Return only the top_n() values of a variable. 3.8.1 select variables FIGURE 3.9: Diagram of select() columns. We’ve seen that the flights data frame in the nycflights13 package contains 19 different variables. You can identify the names of these 19 variables by running the glimpse() function from the dplyr package: glimpse(flights) However, say you only need two of these 19 variables, say carrier and flight. 
You can select() these two variables: flights %&gt;% select(carrier, flight) This function makes it easier to explore large datasets since it allows us to limit the scope to only those variables we care most about. For example, if we select() only a smaller number of variables as is shown in Figure 3.9, it will make viewing the dataset in RStudio’s spreadsheet viewer more digestible. Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable year in the flights data frame. This variable isn’t quite a “variable” because it is always 2013 and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect year by using the - sign: flights_no_year &lt;- flights %&gt;% select(-year) Another way of selecting columns/variables is by specifying a range of columns: flight_arr_times &lt;- flights %&gt;% select(month:day, arr_time:sched_arr_time) flight_arr_times This will select() all columns between month and day, as well as between arr_time and sched_arr_time, and drop the rest. The select() function can also be used to reorder columns when used with the everything() helper function. For example, suppose we want the hour, minute, and time_hour variables to appear immediately after the year, month, and day variables, while not discarding the rest of the variables. In the following code, everything() will pick up all remaining variables: flights_reorder &lt;- flights %&gt;% select(year, month, day, hour, minute, time_hour, everything()) glimpse(flights_reorder) Lastly, the helper functions starts_with(), ends_with(), and contains() can be used to select variables/columns that match those conditions. As examples, flights %&gt;% select(starts_with(&quot;a&quot;)) flights %&gt;% select(ends_with(&quot;delay&quot;)) flights %&gt;% select(contains(&quot;time&quot;)) 3.8.2 rename variables Another useful function is rename(), which as you may have guessed changes the name of variables. 
Suppose we want to focus only on dep_time and arr_time and rename them to departure_time and arrival_time in a new flights_time_new data frame: flights_time_new &lt;- flights %&gt;% select(dep_time, arr_time) %&gt;% rename(departure_time = dep_time, arrival_time = arr_time) glimpse(flights_time_new) Note that in this case we used a single = sign within the rename(). This is because we are not testing for equality like we would using ==. Instead we want to assign a new variable departure_time to have the same values as dep_time and then delete the variable dep_time. For example, departure_time = dep_time renames the dep_time variable to have the new name departure_time. Note that new dplyr users often forget that the new variable name comes before the equal sign. 3.8.3 top_n values of a variable We can also return the top n values of a variable using the top_n() function. For example, we can return a data frame of the top 10 destination airports using the example from Subsection 3.7.2. Observe that we set the number of values to return to n = 10 and wt = num_flights to indicate that we want the rows corresponding to the top 10 values of num_flights. See the help file for top_n() by running ?top_n for more information. named_dests %&gt;% top_n(n = 10, wt = num_flights) Let’s further arrange() these results in descending order of num_flights: named_dests %&gt;% top_n(n = 10, wt = num_flights) %&gt;% arrange(desc(num_flights)) Learning check (LC3.16) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways. (LC3.17) How could one use starts_with(), ends_with(), and contains() to select columns from the flights data frame? Provide three different examples in total: one for starts_with(), one for ends_with(), and one for contains(). (LC3.18) Why might we want to use the select function on a data frame? 
(LC3.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013. 3.9 Conclusion 3.9.1 Summary table Let’s recap our data wrangling verbs in Table 3.2. Using these verbs and the pipe %&gt;% operator from Section 3.1, you’ll be able to write easily legible code to perform almost all the data wrangling and data transformation necessary for the rest of this book. TABLE 3.2: Summary of data wrangling verbs Verb Data wrangling operation filter() Pick out a subset of rows summarize() Summarize many values to one using a summary statistic function like mean(), median(), etc. group_by() Add grouping structure to rows in data frame. Note this does not change values in data frame, rather only the meta-data mutate() Create new variables by mutating existing ones arrange() Arrange rows of a data variable in ascending (default) or descending order inner_join() Join/merge two data frames, matching rows by a key variable Learning check (LC3.20) Let’s now put your newly acquired data wrangling skills to the test! An airline industry measure of a passenger airline’s capacity is the available seat miles, which is equal to the number of seats available multiplied by the number of miles or kilometers flown summed over all flights. For example, let’s consider the scenario in Figure 3.10. Since the airplane has 4 seats and it travels 200 miles, the available seat miles are \\(4 \\times 200 = 800\\). FIGURE 3.10: Example of available seat miles for one flight. Extending this idea, let’s say an airline had 2 flights using a plane with 10 seats that flew 500 miles and 3 flights using a plane with 20 seats that flew 1000 miles, the available seat miles would be \\(2 \\times 10 \\times 500 + 3 \\times 20 \\times 1000 = 70,000\\) seat miles. Using the datasets included in the nycflights13 package, compute the available seat miles for each airline sorted in descending order. 
After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints: Crucial: Unless you are very confident in what you are doing, it is worthwhile not starting to code right away. Rather, first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code). Take a close look at all the datasets using the View() function: flights, weather, planes, airports, and airlines to identify which variables are necessary to compute available seat miles. Figure 3.7 showing how the various datasets can be joined will also be useful. Consider the data wrangling verbs in Table 3.2 as your toolbox! 3.9.2 Additional resources An R script file of all R code used in this chapter is available here. If you want to further unlock the power of the dplyr package for data wrangling, we suggest that you check out RStudio’s “Data Transformation with dplyr” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter, in particular more intermediate level and advanced data wrangling functions, while providing quick and easy-to-read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure 3.1 on filter(), originate from this cheatsheet. In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Data Transformation with dplyr.” You can see a preview in the figure below. FIGURE 3.11: Data Transformation with dplyr cheatsheet. 
On top of the data wrangling verbs and examples we presented in this section, if you’d like to see more examples of using the dplyr package for data wrangling, check out Chapter 5 of R for Data Science (Grolemund and Wickham 2017). 3.9.3 What’s to come? So far in this book, we’ve explored, visualized, and wrangled data saved in data frames. These data frames were saved in a spreadsheet-like format: in a rectangular shape with a certain number of rows corresponding to observations and a certain number of columns corresponding to variables describing these observations. We’ll see in the upcoming Chapter 4 that there are actually two ways to represent data in spreadsheet-type rectangular format: (1) “wide” format and (2) “tall/narrow” format. The tall/narrow format is also known as “tidy” format in R user circles. While the distinction between “tidy” and non-“tidy” formatted data is subtle, it has immense implications for our data science work. This is because almost all the packages used in this book, including the ggplot2 package for data visualization and the dplyr package for data wrangling, assume that data frames are in “tidy” format. Furthermore, up until now we’ve only explored, visualized, and wrangled data saved within R packages. But what if you want to analyze data that you have saved in a Microsoft Excel, a Google Sheets, or a “Comma-Separated Values” (CSV) file? In Section 4.1, we’ll show you how to import this data into R using the readr package. References "],
+["4-tidy.html", "Chapter 4 Data Importing and “Tidy” Data 4.1 Importing data 4.2 “Tidy” data 4.3 Case study: Democracy in Guatemala 4.4 tidyverse package 4.5 Conclusion", " Chapter 4 Data Importing and “Tidy” Data In Subsection 1.2.1, we introduced the concept of a data frame in R: a rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation. In Section 1.4, we started exploring our first data frame: the flights data frame included in the nycflights13 package. In Chapter 2, we created visualizations based on the data included in flights and other data frames such as weather. In Chapter 3, we learned how to take existing data frames and transform/modify them to suit our ends. In this final chapter of the “Data Science with tidyverse” portion of the book, we extend some of these ideas by discussing a type of data formatting called “tidy” data. You will see that having data stored in “tidy” format is about more than just what the everyday definition of the term “tidy” might suggest: having your data “neatly organized.” Instead, we define the term “tidy” as it’s used by data scientists who use R, outlining a set of rules by which data is saved. Knowledge of this type of data formatting was not necessary for our treatment of data visualization in Chapter 2 and data wrangling in Chapter 3. This is because all the data used were already in “tidy” format. In this chapter, we’ll now see that this format is essential to using the tools we covered up until now. Furthermore, it will also be useful for all subsequent chapters in this book when we cover regression and statistical inference. First, however, we’ll show you how to import spreadsheet data in R. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages. 
library(dplyr) library(ggplot2) library(readr) library(tidyr) library(nycflights13) library(fivethirtyeight) 4.1 Importing data Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online. How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats: First, a Comma Separated Values .csv file. You can think of a .csv file as a bare-bones spreadsheet where: Each line in the file corresponds to one row of data/one observation. Values for each line are separated with commas. In other words, the values of different variables are separated by commas in each row. The first line is often, but not always, a header row indicating the names of the columns/variables. Second, an Excel .xlsx spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to bare-bones .csv files, .xlsx Excel files contain a lot of meta-data (data about data). Recall we saw a previous example of meta-data in Section 3.4 when adding “group structure” meta-data to a data frame by using the group_by() verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros. Third, a Google Sheets file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values .csv and Excel .xlsx formats. One way to import Google Sheets data in R is to go to the Google Sheets menu bar -&gt; File -&gt; Download as -&gt; Select “Microsoft Excel” or “Comma-separated values” and then load that data into R. A more advanced way to import Google Sheets data in R is by using the googlesheets package, a method we leave to a more advanced data science book. 
We’ll cover two methods for importing .csv and .xlsx spreadsheet data in R: one using the console and the other using RStudio’s graphical user interface, abbreviated as “GUI.” 4.1.1 Using the console First, let’s import a Comma Separated Values .csv file that exists on the internet. The .csv file dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the read_csv() function from the readr (Wickham, Hester, and Francois 2018) package to read it off the web, import it into R, and save it in a data frame called dem_score. library(readr) dem_score &lt;- read_csv(&quot;https://moderndive.com/data/dem_score.csv&quot;) dem_score # A tibble: 96 x 10 country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Albania -9 -9 -9 -9 -9 -9 -9 -9 5 2 Argentina -9 -1 -1 -9 -9 -9 -8 8 7 3 Armenia -9 -7 -7 -7 -7 -7 -7 -7 7 4 Australia 10 10 10 10 10 10 10 10 10 5 Austria 10 10 10 10 10 10 10 10 10 6 Azerbaijan -9 -7 -7 -7 -7 -7 -7 -7 1 7 Belarus -9 -7 -7 -7 -7 -7 -7 -7 7 8 Belgium 10 10 10 10 10 10 10 10 10 9 Bhutan -10 -10 -10 -10 -10 -10 -10 -10 -10 10 Bolivia -4 -3 -3 -4 -7 -7 8 9 9 # … with 86 more rows In this dem_score data frame, the minimum value of -10 corresponds to a highly autocratic nation, whereas a value of 10 corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number nor include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the dem_score data frame in a case study in the upcoming Section 4.3. Note that the read_csv() function included in the readr package is different than the read.csv() function that comes installed with R. 
While the difference in the names might seem trivial (an _ instead of a .), the read_csv() function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the read_csv() function included in the readr package saves data frames as tibbles by default. 4.1.2 Using RStudio’s interface Let’s read in the exact same data, but this time from an Excel file saved on your computer. Furthermore, we’ll do this using RStudio’s graphical interface instead of running read_csv() in the console. First, download the Excel file dem_score.xlsx by going to https://moderndive.com/data/dem_score.xlsx, then Go to the Files pane of RStudio. Navigate to the directory (i.e., folder on your computer) where the downloaded dem_score.xlsx Excel file is saved. For example, this might be in your Downloads folder. Click on dem_score.xlsx. Click “Import Dataset…” At this point, you should see a screen pop-up like in Figure 4.1. After clicking on the “Import” button on the bottom right of Figure 4.1, RStudio will save this spreadsheet’s data in a data frame called dem_score and display its contents in the spreadsheet viewer. FIGURE 4.1: Importing an Excel file to R. Furthermore, note the “Code Preview” block in the bottom right of Figure 4.1. You can copy and paste this code to reload your data again later programmatically, instead of repeating this manual point-and-click process. 4.2 “Tidy” data Let’s now switch gears and learn about the concept of “tidy” data format with a motivating example from the fivethirtyeight package. The fivethirtyeight package (Kim, Ismay, and Chunn 2019) provides access to the datasets used in many articles published by the data journalism website, FiveThirtyEight.com. For a complete list of all 127 datasets included in the fivethirtyeight package, check out the package webpage by going to: https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html. 
Let’s focus our attention on the drinks data frame and look at its first 5 rows: # A tibble: 5 x 5 country beer_servings spirit_servings wine_servings total_litres_of_pure_a… &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 Afghanist… 0 0 0 0 2 Albania 89 132 54 4.9 3 Algeria 25 0 14 0.7 4 Andorra 245 138 312 12.4 5 Angola 217 57 45 5.9 After reading the help file by running ?drinks, you’ll see that drinks is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”. Let’s apply some of the data wrangling verbs we learned in Chapter 3 on the drinks data frame: filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then select() all columns except total_litres_of_pure_alcohol by using the - sign, then rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively. and save the resulting data frame in drinks_smaller: drinks_smaller &lt;- drinks %&gt;% filter(country %in% c(&quot;USA&quot;, &quot;China&quot;, &quot;Italy&quot;, &quot;Saudi Arabia&quot;)) %&gt;% select(-total_litres_of_pure_alcohol) %&gt;% rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings) drinks_smaller # A tibble: 4 x 4 country beer spirit wine &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 China 79 192 8 2 Italy 85 42 237 3 Saudi Arabia 0 5 0 4 USA 249 158 84 Let’s now ask ourselves a question: “Using the drinks_smaller data frame, how would we create the side-by-side barplot in Figure 4.2?”. Recall we saw barplots displaying two categorical variables in Subsection 2.8.3. FIGURE 4.2: Comparing alcohol consumption in 4 countries. 
Let’s break down the grammar of graphics we introduced in Section 2.1: The categorical variable country with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the x-position of the bars. The numerical variable servings would have to be mapped to the y-position of the bars (the height of the bars). The categorical variable type with three levels (beer, spirit, wine) would have to be mapped to the fill color of the bars. Observe, however, that drinks_smaller has three separate variables beer, spirit, and wine. To recreate the barplot in Figure 4.2 with the ggplot() function, we instead need a single variable type with three possible values: beer, spirit, and wine. We could then map this type variable to the fill aesthetic of our plot. In other words, to recreate the barplot in Figure 4.2, our data frame would have to look like this: drinks_smaller_tidy # A tibble: 12 x 3 country type servings &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 China beer 79 2 Italy beer 85 3 Saudi Arabia beer 0 4 USA beer 249 5 China spirit 192 6 Italy spirit 42 7 Saudi Arabia spirit 5 8 USA spirit 158 9 China wine 8 10 Italy wine 237 11 Saudi Arabia wine 0 12 USA wine 84 Observe that while drinks_smaller and drinks_smaller_tidy are both rectangular in shape and contain the same 12 numerical values (3 alcohol types by 4 countries), they are formatted differently. drinks_smaller is formatted in what’s known as “wide” format, whereas drinks_smaller_tidy is formatted in what’s known as “long/narrow” format. In the context of doing data science in R, long/narrow format is also known as “tidy” format. In order to use the ggplot2 and dplyr packages for data visualization and data wrangling, your input data frames must be in “tidy” format. Thus, all non-“tidy” data must be converted to “tidy” format first. Before we convert non-“tidy” data frames like drinks_smaller to “tidy” data frames like drinks_smaller_tidy, let’s define “tidy” data. 
4.2.1 Definition of “tidy” data You have surely heard the word “tidy” in your life: “Tidy up your room!” “Write your homework in a tidy way so it is easier to provide feedback.” Marie Kondo’s best-selling book, The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing, and Netflix TV series Tidying Up with Marie Kondo. “I am not by any stretch of the imagination a tidy person, and the piles of unread books on the coffee table and by my bed have a plaintive, pleading quality to me - ‘Read me, please!’” - Linda Grant What does it mean for your data to be “tidy”? While “tidy” has a clear English meaning of “organized,” the word “tidy” in data science using R means that your data follows a standardized format. We will follow Hadley Wickham’s definition of “tidy” data (Wickham 2014) shown also in Figure 4.3: A dataset is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes. “Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. FIGURE 4.3: Tidy data graphic from R for Data Science. 
For example, say you have the following table of stock prices in Table 4.1: TABLE 4.1: Stock prices (non-tidy format) Date Boeing stock price Amazon stock price Google stock price 2009-01-01 $173.55 $174.90 $174.34 2009-01-02 $172.61 $171.42 $170.04 Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format, each variable should be its own column, as shown in Table 4.2. Notice that both tables present the same information, but in different formats. TABLE 4.2: Stock prices (tidy format) Date Stock Name Stock Price 2009-01-01 Boeing $173.55 2009-01-01 Amazon $174.90 2009-01-01 Google $174.34 2009-01-02 Boeing $172.61 2009-01-02 Amazon $171.42 2009-01-02 Google $170.04 Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table 4.3. TABLE 4.3: Example of tidy data Date Boeing Price Weather 2009-01-01 $173.55 Sunny 2009-01-02 $172.61 Overcast In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table 4.1, the data is “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing price, and the Weather that particular day. Learning check (LC4.1) What are common characteristics of “tidy” data frames? (LC4.2) What makes “tidy” data frames useful for organizing data? 4.2.2 Converting to “tidy” data In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already in “tidy” format as well. This is not always the case however with all datasets in the world. 
If your original data frame is in wide (non-“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package (Wickham and Henry 2019). Going back to our drinks_smaller data frame from earlier: drinks_smaller # A tibble: 4 x 4 country beer spirit wine &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 China 79 192 8 2 Italy 85 42 237 3 Saudi Arabia 0 5 0 4 USA 249 158 84 We convert it to “tidy” format by using the pivot_longer() function from the tidyr package as follows: drinks_smaller_tidy &lt;- drinks_smaller %&gt;% pivot_longer(names_to = &quot;type&quot;, values_to = &quot;servings&quot;, cols = -country) drinks_smaller_tidy # A tibble: 12 x 3 country type servings &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 China beer 79 2 China spirit 192 3 China wine 8 4 Italy beer 85 5 Italy spirit 42 6 Italy wine 237 7 Saudi Arabia beer 0 8 Saudi Arabia spirit 5 9 Saudi Arabia wine 0 10 USA beer 249 11 USA spirit 158 12 USA wine 84 We set the arguments to pivot_longer() as follows: names_to here corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we set names_to = &quot;type&quot;. In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here. values_to here is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = &quot;servings&quot; since each of the numeric values in each of the beer, wine, and spirit columns of the drinks_smaller data corresponds to a value of servings. 
In the resulting drinks_smaller_tidy, the column servings contains the 4 \\(\\times\\) 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument. The third argument cols is the columns in the drinks_smaller data frame you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it. The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output: drinks_smaller %&gt;% pivot_longer(names_to = &quot;type&quot;, values_to = &quot;servings&quot;, cols = c(beer, spirit, wine)) Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller data frame, we could also do the following for the cols argument: drinks_smaller %&gt;% pivot_longer(names_to = &quot;type&quot;, values_to = &quot;servings&quot;, cols = beer:wine) With our drinks_smaller_tidy “tidy” formatted data frame, we can now produce the barplot you saw in Figure 4.2 using geom_col(). This is done in Figure 4.4. Recall from Section 2.8 on barplots that we use geom_col() and not geom_bar(), since we would like to map the “pre-counted” servings variable to the y-aesthetic of the bars. ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) + geom_col(position = &quot;dodge&quot;) FIGURE 4.4: Comparing alcohol consumption in 4 countries using geom_col(). 
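The claim that the three ways of writing the cols argument give the same result can be checked directly. The following is a minimal sketch, not code from the book: it assumes the dplyr and tidyr packages are installed, rebuilds drinks_smaller by hand from the values printed earlier (so the fivethirtyeight package is not needed), and uses the illustrative names tidy_by_exclusion, tidy_by_vector, and tidy_by_range, which do not appear in the original text.

```r
library(dplyr)
library(tidyr)

# Hand-built copy of drinks_smaller, using the values printed earlier in this section.
drinks_smaller <- tibble(
  country = c("China", "Italy", "Saudi Arabia", "USA"),
  beer    = c(79L, 85L, 0L, 249L),
  spirit  = c(192L, 42L, 5L, 158L),
  wine    = c(8L, 237L, 0L, 84L)
)

# 1. Name the column we do NOT want to pivot.
tidy_by_exclusion <- drinks_smaller %>%
  pivot_longer(names_to = "type", values_to = "servings", cols = -country)

# 2. Name the columns we DO want to pivot as a vector.
tidy_by_vector <- drinks_smaller %>%
  pivot_longer(names_to = "type", values_to = "servings",
               cols = c(beer, spirit, wine))

# 3. Name the columns we DO want to pivot as a range of adjacent columns.
tidy_by_range <- drinks_smaller %>%
  pivot_longer(names_to = "type", values_to = "servings", cols = beer:wine)

# Each comparison checks that two of the specifications give the same data frame.
identical(tidy_by_exclusion, tidy_by_vector)
identical(tidy_by_vector, tidy_by_range)

# 4 countries x 3 alcohol types = 12 rows in the "tidy" result.
nrow(tidy_by_exclusion)
```

Which form to use is mostly a matter of readability: -country is shortest when nearly every column is being pivoted, while c(beer, spirit, wine) states the pivoted columns explicitly and does not depend on their order in the data frame.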
Converting “wide” format data to “tidy” format often confuses new R users. The only way to get comfortable with the pivot_longer() function is with practice, practice, and more practice using different datasets. For example, run ?pivot_longer and look at the examples at the bottom of the help file. We’ll show another example of using pivot_longer() to convert a “wide” formatted data frame to “tidy” format in Section 4.3. If, however, you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider() function instead. Run ?pivot_wider and look at the examples at the bottom of the help file. You can also view examples of both pivot_longer() and pivot_wider() on the tidyverse.org webpage. That webpage has a nice example showing the different functions available for data tidying, as well as a case study using data from the World Health Organization. Furthermore, each week the R4DS Online Learning Community posts a dataset in the weekly #TidyTuesday event that might serve as a nice place for you to find other data to explore and transform. Learning check (LC4.3) Take a look at the airline_safety data frame included in the fivethirtyeight data package. Run the following: airline_safety After reading the help file by running ?airline_safety, we see that airline_safety is a data frame containing information on different airline companies’ safety records. This data was originally reported on the data journalism website, FiveThirtyEight.com, in Nate Silver’s article, “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”. 
Let’s only consider the variable airline and those relating to fatalities for simplicity: airline_safety_smaller &lt;- airline_safety %&gt;% select(airline, starts_with(&quot;fatalities&quot;)) airline_safety_smaller # A tibble: 56 x 3 airline fatalities_85_99 fatalities_00_14 &lt;chr&gt; &lt;int&gt; &lt;int&gt; 1 Aer Lingus 0 0 2 Aeroflot 128 88 3 Aerolineas Argentinas 0 0 4 Aeromexico 64 0 5 Air Canada 0 0 6 Air France 79 337 7 Air India 329 158 8 Air New Zealand 0 7 9 Alaska Airlines 0 88 10 Alitalia 50 0 # … with 46 more rows This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable fatalities_years indicating the incident year and a variable count of the fatality counts? 4.2.3 nycflights13 package Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. Let’s revisit the flights data frame by running View(flights). We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to different characteristics/measurements of each flight. This satisfied the first two criteria of the definition of “tidy” data from Subsection 4.2.1: that “Each variable forms a column” and “Each observation forms a row.” But what about the third property of “tidy” data that “Each type of observational unit forms a table”? Recall that we saw in Subsection 1.4.3 that the observational unit for the flights data frame is an individual flight. In other words, the rows of the flights data frame refer to characteristics/measurements of individual flights. Also included in the nycflights13 package are other data frames with their rows representing different observational units (Wickham 2019a): airlines: translation between two letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company. 
planes: aircraft information about each of 3,322 planes used, i.e., the observational unit is an aircraft. weather: hourly meteorological data (about 8,705 observations) for each of the three NYC airports, i.e., the observational unit is an hourly measurement of weather at one of the three airports. airports: airport names and locations. The observational unit is an airport. The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e., data frame. You could think of this property as the old English expression: “birds of a feather flock together.” 4.3 Case study: Democracy in Guatemala In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (“wide” format) to a data frame that is in “tidy” format (“long/narrow” format). We’ll do this using the pivot_longer() function from the tidyr package again. Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala. Recall that we saw time-series plots in Section 2.4 on creating linegraphs using geom_line(). Let’s use the dem_score data frame we imported in Section 4.1, but focus on only data corresponding to Guatemala. guat_dem &lt;- dem_score %&gt;% filter(country == &quot;Guatemala&quot;) guat_dem # A tibble: 1 x 10 country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Guatemala 2 -6 -5 3 1 -3 -7 3 3 Let’s lay out the grammar of graphics we saw in Section 2.1. First we know we need to set data = guat_dem and use a geom_line() layer, but what is the aesthetic mapping of variables? 
We’d like to see how the democracy score has changed over the years, so we need to map: year to the x-position aesthetic and democracy_score to the y-position aesthetic Now we are stuck in a predicament, much like with our drinks_smaller example in Section 4.2. We see that we have a variable named country, but its only value is &quot;Guatemala&quot;. We have other variables denoted by different year values. Unfortunately, the guat_dem data frame is not “tidy” and hence is not in the appropriate format to apply the grammar of graphics, and thus we cannot use the ggplot2 package just yet. We need to take the values of the columns corresponding to years in guat_dem and convert them into a new “names” variable called year. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “values” variable called democracy_score. Our resulting data frame will have three columns: country, year, and democracy_score. Recall that the pivot_longer() function in the tidyr package does this for us: guat_dem_tidy &lt;- guat_dem %&gt;% pivot_longer(names_to = &quot;year&quot;, values_to = &quot;democracy_score&quot;, cols = -country, names_ptypes = list(year = integer())) guat_dem_tidy # A tibble: 9 x 3 country year democracy_score &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; 1 Guatemala 1952 2 2 Guatemala 1957 -6 3 Guatemala 1962 -5 4 Guatemala 1967 3 5 Guatemala 1972 1 6 Guatemala 1977 -3 7 Guatemala 1982 -7 8 Guatemala 1987 3 9 Guatemala 1992 3 We set the arguments to pivot_longer() as follows: names_to is the name of the variable in the new “tidy” data frame that will contain the column names of the original data. Observe how we set names_to = &quot;year&quot;. In the resulting guat_dem_tidy, the column year contains the years where Guatemala’s democracy scores were measured. values_to is the name of the variable in the new “tidy” data frame that will contain the values of the original data. 
Observe how we set values_to = &quot;democracy_score&quot;. In the resulting guat_dem_tidy the column democracy_score contains the 1 \\(\\times\\) 9 = 9 democracy scores as numeric values. The third argument is the columns you either want to or don’t want to “tidy.” Observe how we set this to cols = -country indicating that we don’t want to “tidy” the country variable in guat_dem and rather only variables 1952 through 1992. The last argument of names_ptypes tells R what type of variable year should be set to. Without specifying that it is an integer as we’ve done here, pivot_longer() will set it to be a character value by default. We can now create the time-series plot in Figure 4.5 to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using a geom_line(). Furthermore, we’ll use the labs() function in the ggplot2 package to add informative labels to all the aes()thetic attributes of our plot, in this case the x and y positions. ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) + geom_line() + labs(x = &quot;Year&quot;, y = &quot;Democracy Score&quot;) FIGURE 4.5: Democracy scores in Guatemala 1952-1992. Note that if we forgot to include the names_ptypes argument specifying that year was not of character format, we would have gotten an error here since geom_line() wouldn’t have known how to sort the character values in year in the right order. Learning check (LC4.4) Convert the dem_score data frame into a “tidy” data frame and assign the name of dem_score_tidy to the resulting long-formatted data frame. (LC4.5) Read in the life expectancy data stored at https://moderndive.com/data/le_mess.csv and convert it to a “tidy” data frame. 
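As a supplement to the discussion of names_ptypes in this section, here is a hedged sketch of another way to end up with an integer year column: pivot first, then convert year with as.integer() inside mutate(). This is an alternative we are adding for illustration, not the approach used above, and it may be handy if your installed version of tidyr treats names_ptypes differently. The data frame is rebuilt by hand from the guat_dem output shown earlier, and the name guat_dem_tidy_alt is hypothetical.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Rebuild the one-row guat_dem data frame from the values printed earlier
guat_dem <- tibble(
  country = "Guatemala",
  `1952` = 2, `1957` = -6, `1962` = -5, `1967` = 3, `1972` = 1,
  `1977` = -3, `1982` = -7, `1987` = 3, `1992` = 3
)

# Pivot to long format; year starts out as a character column,
# so we convert it to integer in a separate step
guat_dem_tidy_alt <- guat_dem %>%
  pivot_longer(names_to = "year",
               values_to = "democracy_score",
               cols = -country) %>%
  mutate(year = as.integer(year))

guat_dem_tidy_alt
```

Splitting the work into two steps is slightly longer to type, but it makes the type conversion explicit and easy to spot when you reread your code.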
4.4 tidyverse package Notice at the beginning of the chapter we loaded the following four packages, which are four of the most frequently used R packages for data science: library(ggplot2) library(dplyr) library(readr) library(tidyr) Recall that ggplot2 is for data visualization, dplyr is for data wrangling, readr is for importing spreadsheet data into R, and tidyr is for converting data to “tidy” format. There is a much quicker way to load these packages than by individually loading them: by installing and loading the tidyverse package. The tidyverse package acts as an “umbrella” package whereby installing/loading it will install/load multiple packages at once for you. After installing the tidyverse package as you would a normal package as seen in Section 1.3, running: library(tidyverse) would be the same as running: library(ggplot2) library(dplyr) library(readr) library(tidyr) library(purrr) library(tibble) library(stringr) library(forcats) The purrr, tibble, stringr, and forcats packages are left for a more advanced book; check out R for Data Science to learn about these packages. For the remainder of this book, we’ll start every chapter by running library(tidyverse), instead of loading the various component packages individually. The tidyverse “umbrella” package gets its name from the fact that all the functions in all its packages are designed to have common inputs and outputs: data frames are in “tidy” format. This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible. For more information, check out the tidyverse.org webpage for the package. 4.5 Conclusion 4.5.1 Additional resources An R script file of all R code used in this chapter is available here. 
If you want to learn more about using the readr and tidyr package, we suggest that you check out RStudio’s “Data Import Cheat Sheet.” In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -&gt; Help -&gt; Cheatsheets -&gt; “Browse Cheatsheets” -&gt; Scroll down the page to the “Data Import Cheat Sheet.” The first page of this cheatsheet has information on using the readr package to import data, while the second page has information on using the tidyr package to “tidy” data. You can see a preview of both cheatsheets in the figures below. FIGURE 4.6: Data Import cheatsheet (first page): readr package. FIGURE 4.7: Data Import cheatsheet (second page): tidyr package. 4.5.2 What’s to come? Congratulations! You’ve completed the “Data Science with tidyverse” portion of this book. We’ll now move to the “Data modeling with moderndive” portion of this book in Chapters 5 and 6, where you’ll leverage your data visualization and wrangling skills to model relationships between different variables in data frames. However, we’re going to leave Chapter 10 on “Inference for Regression” until after we’ve covered statistical inference in Chapters 7, 8, and 9. Onwards and upwards into Data Modeling as shown in Figure 4.8! FIGURE 4.8: ModernDive flowchart - on to Part II! References "],
+["5-regression.html", "Chapter 5 Basic Regression 5.1 One numerical explanatory variable 5.2 One categorical explanatory variable 5.3 Related topics 5.4 Conclusion", " Chapter 5 Basic Regression Now that we are equipped with data visualization skills from Chapter 2, data wrangling skills from Chapter 3, and an understanding of how to import data and the concept of a “tidy” data format from Chapter 4, let’s now proceed with data modeling. The fundamental premise of data modeling is to make explicit the relationship between: an outcome variable \\(y\\), also called a dependent variable or response variable, and an explanatory/predictor variable \\(x\\), also called an independent variable or covariate. Another way to state this is using mathematical terminology: we will model the outcome variable \\(y\\) “as a function” of the explanatory/predictor variable \\(x\\). When we say “function” here, we aren’t referring to functions in R like the ggplot() function, but rather to a mathematical function. But why do we have two different labels, explanatory and predictor, for the variable \\(x\\)? That’s because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes: Modeling for explanation: When you want to explicitly describe and quantify the relationship between the outcome variable \\(y\\) and a set of explanatory variables \\(x\\), determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any causal relationships between the variables. Modeling for prediction: When you want to predict an outcome variable \\(y\\) based on the information contained in a set of predictor variables \\(x\\). Unlike modeling for explanation, however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about \\(y\\) using the information in \\(x\\). 
For example, say you are interested in an outcome variable \\(y\\) of whether patients develop lung cancer and information \\(x\\) on their risk factors, such as smoking habits, age, and socioeconomic status. If we are modeling for explanation, we would be interested in both describing and quantifying the effects of the different risk factors. One reason could be that you want to design an intervention to reduce lung cancer incidence in a population, such as targeting smokers of a specific age group with advertising for smoking cessation programs. If we are modeling for prediction, however, we wouldn’t care so much about understanding how all the individual risk factors contribute to lung cancer, but rather only whether we can make good predictions of which people will contract lung cancer. In this book, we’ll focus on modeling for explanation and hence refer to \\(x\\) as explanatory variables. If you are interested in learning about modeling for prediction, we suggest you check out books and courses on the field of machine learning such as An Introduction to Statistical Learning with Applications in R (ISLR) (James et al. 2017). Furthermore, while there exist many techniques for modeling, such as tree-based models and neural networks, in this book we’ll focus on one particular technique: linear regression. Linear regression is one of the most commonly-used and easy-to-understand approaches to modeling. Linear regression involves a numerical outcome variable \\(y\\) and explanatory variables \\(x\\) that are either numerical or categorical. Furthermore, the relationship between \\(y\\) and \\(x\\) is assumed to be linear, or in other words, a line. However, we’ll see that what constitutes a “line” will vary depending on the nature of your explanatory variables \\(x\\) . In Chapter 5 on basic regression, we’ll only consider models with a single explanatory variable \\(x\\). In Section 5.1, the explanatory variable will be numerical. 
This scenario is known as simple linear regression. In Section 5.2, the explanatory variable will be categorical. In Chapter 6 on multiple regression, we’ll extend the ideas behind basic regression and consider models with two explanatory variables \\(x_1\\) and \\(x_2\\). In Section 6.1, we’ll have two numerical explanatory variables. In Section 6.2, we’ll have one numerical and one categorical explanatory variable. In particular, we’ll consider two such models: interaction and parallel slopes models. In Chapter 10 on inference for regression, we’ll revisit our regression models and analyze the results using the tools for statistical inference you’ll develop in Chapters 7, 8, and 9 on sampling, bootstrapping and confidence intervals, and hypothesis testing and \\(p\\)-values, respectively. Let’s now begin with basic regression, which refers to linear regression models with a single explanatory variable \\(x\\). We’ll also discuss important statistical concepts like the correlation coefficient, that “correlation isn’t necessarily causation,” and what it means for a line to be “best-fitting.” Needed packages Let’s now load all the packages needed for this chapter (this assumes you’ve already installed them). In this chapter, we introduce some new packages: The tidyverse “umbrella” (Wickham 2019b) package. Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages The moderndive package of datasets and functions for tidyverse-friendly introductory linear regression. The skimr (Quinn et al. 2019) package, which provides a simple-to-use function to quickly compute a wide array of commonly used summary statistics. 
If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(skimr) library(gapminder) 5.1 One numerical explanatory variable Why do some professors and instructors at universities and colleges receive high teaching evaluations scores from students while others receive lower ones? Are there differences in teaching evaluations between instructors of different demographic groups? Could there be an impact due to student biases? These are all questions that are of interest to university/college administrators, as teaching evaluations are among the many criteria considered in determining which instructors and professors get promoted. Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer the following research question: what factors explain differences in instructor teaching evaluation scores? To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at openintro.org. In this section, we’ll keep things simple for now and try to explain differences in instructor teaching scores as a function of one numerical variable: the instructor’s “beauty” score (we’ll describe how this score was determined shortly). Could it be that instructors with higher “beauty” scores also have higher teaching evaluations? Could it be instead that instructors with higher “beauty” scores tend to have lower teaching evaluations? Or could it be that there is no relationship between “beauty” score and teaching evaluations? We’ll answer these questions by modeling the relationship between teaching scores and “beauty” scores using simple linear regression where we have: A numerical outcome variable \\(y\\) (the instructor’s teaching score) and A single numerical explanatory variable \\(x\\) (the instructor’s “beauty” score). 
5.1.1 Exploratory data analysis The data on the 463 courses at UT Austin can be found in the evals data frame included in the moderndive package. However, to keep things simple, let’s select() only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called evals_ch5: evals_ch5 &lt;- evals %&gt;% select(ID, score, bty_avg, age) A crucial step before doing any kind of analysis or modeling is performing an exploratory data analysis, or EDA for short. EDA gives you a sense of the distributions of the individual variables in your data, whether any potential relationships exist between variables, whether there are outliers and/or missing values, and (most importantly) how to build your model. Here are three common steps in an EDA: Most crucially, looking at the raw data values. Computing summary statistics, such as means, medians, and interquartile ranges. Creating data visualizations. Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. Because this step seems so trivial, unfortunately many data analysts ignore it. However, getting an early sense of what your raw data looks like can often prevent many larger issues down the road. You can do this by using RStudio’s spreadsheet viewer or by using the glimpse() function as introduced in Subsection 1.4.3 on exploring data frames: glimpse(evals_ch5) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4… $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4… Observe that Observations: 463 indicates that there are 463 rows/observations in evals_ch5, where each row corresponds to one observed course at UT Austin. 
It is important to note that the observational unit is an individual course and not an individual instructor. Recall from Subsection 1.4.3 that the observational unit is the “type of thing” that is being measured by our variables. Since instructors teach more than one course in an academic year, the same instructor will appear more than once in the data. Hence there are fewer than 463 unique instructors being represented in evals_ch5. We’ll revisit this idea in Section 10.3, when we talk about the “independence assumption” for inference for regression. A full description of all the variables included in evals can be found at openintro.org or by reading the associated help file (run ?evals in the console). However, let’s fully describe only the 4 variables we selected in evals_ch5: ID: An identification variable used to distinguish between the 1 through 463 courses in the dataset. score: A numerical variable of the course instructor’s average teaching score, where the average is computed from the evaluation scores from all students in that course. Teaching scores of 1 are lowest and 5 are highest. This is the outcome variable \\(y\\) of interest. bty_avg: A numerical variable of the course instructor’s average “beauty” score, where the average is computed from a separate panel of six students. “Beauty” scores of 1 are lowest and 10 are highest. This is the explanatory variable \\(x\\) of interest. age: A numerical variable of the course instructor’s age. This will be another explanatory variable \\(x\\) that we’ll use in the Learning check at the end of this subsection. An alternative way to look at the raw data values is by choosing a random sample of the rows in evals_ch5 by piping it into the sample_n() function from the dplyr package. Here we set the size argument to be 5, indicating that we want a random sample of 5 rows. We display the results in Table 5.1. 
Note that due to the random nature of the sampling, you will likely end up with a different subset of 5 rows. evals_ch5 %&gt;% sample_n(size = 5) TABLE 5.1: A random sample of 5 out of the 463 courses at UT Austin ID score bty_avg age 129 3.7 3.00 62 109 4.7 4.33 46 28 4.8 5.50 62 434 2.8 2.00 62 330 4.0 2.33 64 Now that we’ve looked at the raw values in our evals_ch5 data frame and got a preliminary sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s start by computing the mean and median of our numerical outcome variable score and our numerical explanatory variable “beauty” score denoted as bty_avg. We’ll do this by using the summarize() function from dplyr along with the mean() and median() summary functions we saw in Section 3.3. evals_ch5 %&gt;% summarize(mean_bty_avg = mean(bty_avg), mean_score = mean(score), median_bty_avg = median(bty_avg), median_score = median(score)) # A tibble: 1 x 4 mean_bty_avg mean_score median_bty_avg median_score &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 4.42 4.17 4.33 4.3 However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles? Typing out all these summary statistic functions in summarize() would be long and tedious. Instead, let’s use the convenient skim() function from the skimr package. This function takes in a data frame, “skims” it, and returns commonly used summary statistics. 
Let’s take our evals_ch5 data frame, select() only the outcome and explanatory variables teaching score and bty_avg, and pipe them into the skim() function: evals_ch5 %&gt;% select(score, bty_avg) %&gt;% skim() Skim summary statistics n obs: 463 n variables: 2 ── Variable type:numeric variable missing complete n mean sd p0 p25 p50 p75 p100 bty_avg 0 463 463 4.42 1.53 1.67 3.17 4.33 5.5 8.17 score 0 463 463 4.17 0.54 2.3 3.8 4.3 4.6 5 (For formatting purposes in this book, the inline histogram that is usually printed with skim() has been removed. This can be done by using skim_with(numeric = list(hist = NULL)) prior to using the skim() function for version 1.0.6 of skimr.) For the numerical variables teaching score and bty_avg it returns: missing: the number of missing values complete: the number of non-missing or complete values n: the total number of values mean: the average sd: the standard deviation p0: the 0th percentile: the value at which 0% of observations are smaller than it (the minimum value) p25: the 25th percentile: the value at which 25% of observations are smaller than it (the 1st quartile) p50: the 50th percentile: the value at which 50% of observations are smaller than it (the 2nd quartile and more commonly called the median) p75: the 75th percentile: the value at which 75% of observations are smaller than it (the 3rd quartile) p100: the 100th percentile: the value at which 100% of observations are smaller than it (the maximum value) Looking at this output, we can see how the values of both variables distribute. For example, the mean teaching score was 4.17 out of 5, whereas the mean “beauty” score was 4.42 out of 10. Furthermore, the middle 50% of teaching scores was between 3.80 and 4.6 (the first and third quartiles), whereas the middle 50% of “beauty” scores falls within 3.17 to 5.5 out of 10. 
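To connect the skim() output back to tools you already know, here is a minimal sketch that computes several of the same summary statistics for score by hand with summarize(). It assumes the evals data frame from the moderndive package; the name score_summary and the column names are ours, chosen only for illustration.

```r
library(dplyr)
library(moderndive)

# Compute a few of the statistics that skim() reports, one by one
score_summary <- evals %>%
  summarize(
    mean_score   = mean(score),
    sd_score     = sd(score),
    min_score    = min(score),                                       # p0
    q1_score     = quantile(score, probs = 0.25, names = FALSE),     # p25
    median_score = median(score),                                    # p50
    q3_score     = quantile(score, probs = 0.75, names = FALSE),     # p75
    max_score    = max(score)                                        # p100
  )

score_summary
```

Typing all of this out for every variable is exactly the tedium that skim() saves you, but seeing it spelled out once can help make clear what each column of the skim() output means.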
The skim() function only returns what are known as univariate summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist bivariate summary statistics: functions that take in two variables and return some summary of those two variables. In particular, when the two variables are numerical, we can compute the correlation coefficient. Generally speaking, coefficients are quantitative expressions of a specific phenomenon. A correlation coefficient is a quantitative expression of the strength of the linear relationship between two numerical variables. Its value ranges between -1 and 1 where: -1 indicates a perfect negative relationship: As one variable increases, the value of the other variable tends to go down, following a straight line. 0 indicates no relationship: The values of both variables go up/down independently of each other. +1 indicates a perfect positive relationship: As the value of one variable goes up, the value of the other variable tends to go up as well in a linear fashion. Figure 5.1 gives examples of 9 different correlation coefficient values for hypothetical numerical variables \\(x\\) and \\(y\\). For example, observe in the top right plot that for a correlation coefficient of -0.75 there is a negative linear relationship between \\(x\\) and \\(y\\), but it is not as strong as the negative linear relationship between \\(x\\) and \\(y\\) when the correlation coefficient is -0.9 or -1. FIGURE 5.1: Nine different correlation coefficients. The correlation coefficient can be computed using the get_correlation() function in the moderndive package. In this case, the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient. We put the name of the outcome variable on the left-hand side of the ~ “tilde” sign, while putting the name of the explanatory variable on the right-hand side. This is known as R’s formula notation. 
We will use this same “formula” syntax with regression later in this chapter. evals_ch5 %&gt;% get_correlation(formula = score ~ bty_avg) # A tibble: 1 x 1 cor &lt;dbl&gt; 1 0.187 An alternative way to compute correlation is to use the cor() summary function within a summarize(): evals_ch5 %&gt;% summarize(correlation = cor(score, bty_avg)) In our case, the correlation coefficient of 0.187 indicates that the relationship between teaching evaluation score and “beauty” average is “weakly positive.” There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that aren’t close to the extreme values of -1, 0, and 1. To develop your intuition about correlation coefficients, play the “Guess the Correlation” 1980’s style video game mentioned in Subsection 5.4.1. Let’s now perform the last of the steps in an exploratory data analysis: creating data visualizations. Since both the score and bty_avg variables are numerical, a scatterplot is an appropriate graph to visualize this data. Let’s do this using geom_point() and display the result in Figure 5.2. Furthermore, let’s highlight the six points in the top right of the visualization in a box. ggplot(evals_ch5, aes(x = bty_avg, y = score)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Scatterplot of relationship of teaching and beauty scores&quot;) FIGURE 5.2: Instructor evaluation scores at UT Austin. Observe that most “beauty” scores lie between 2 and 8, while most teaching scores lie between 3 and 5. Furthermore, while opinions may vary, it is our opinion that the relationship between teaching score and “beauty” score is “weakly positive.” This is consistent with our earlier computed correlation coefficient of 0.187. Furthermore, there appear to be six points in the top-right of this plot highlighted in the box. However, this is not actually the case, as this plot suffers from overplotting. 
Recall from Subsection 2.3.2 that overplotting occurs when several points are stacked directly on top of each other, making it difficult to distinguish them. So while it may appear that there are only six points in the box, there are actually more. This fact is only apparent when using geom_jitter() in place of geom_point(). We display the resulting plot in Figure 5.3 along with the same small box as in Figure 5.2. ggplot(evals_ch5, aes(x = bty_avg, y = score)) + geom_jitter() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Scatterplot of relationship of teaching and beauty scores&quot;) FIGURE 5.3: Instructor evaluation scores at UT Austin. It is now apparent that there are 12 points in the area highlighted in the box and not six as originally suggested in Figure 5.2. Recall from Subsection 2.3.2 on overplotting that jittering adds a little random “nudge” to each of the points to break up these ties. Furthermore, recall that jittering is strictly a visualization tool; it does not alter the original values in the data frame evals_ch5. To keep things simple going forward, however, we’ll only present regular scatterplots rather than their jittered counterparts. Let’s build on the unjittered scatterplot in Figure 5.2 by adding a “best-fitting” line: of all possible lines we can draw on this scatterplot, it is the line that “best” fits through the cloud of points. We do this by adding a new geom_smooth(method = &quot;lm&quot;, se = FALSE) layer to the ggplot() code that created the scatterplot in Figure 5.2. The method = &quot;lm&quot; argument sets the line to be a “linear model.” The se = FALSE argument suppresses standard error uncertainty bars. (We’ll define the concept of standard error later in Subsection 7.3.2.) 
ggplot(evals_ch5, aes(x = bty_avg, y = score)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Relationship between teaching and beauty scores&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 5.4: Regression line. The line in the resulting Figure 5.4 is called a “regression line.” The regression line is a visual summary of the relationship between two numerical variables, in our case the outcome variable score and the explanatory variable bty_avg. The positive slope of the blue line is consistent with our earlier observed correlation coefficient of 0.187 suggesting that there is a positive relationship between these two variables: as instructors have higher “beauty” scores, so also do they receive higher teaching evaluations. We’ll see later, however, that while the correlation coefficient and the slope of a regression line always have the same sign (positive or negative), they typically do not have the same value. Furthermore, a regression line is “best-fitting” in that it minimizes some mathematical criteria. We present these mathematical criteria in Subsection 5.3.2, but we suggest you read this subsection only after first reading the rest of this section on regression with one numerical explanatory variable. Learning check (LC5.1) Conduct a new exploratory data analysis with the same outcome variable \\(y\\) being score but with age as the new explanatory variable \\(x\\). Remember, this involves three things: Looking at the raw data values. Computing summary statistics. Creating data visualizations. What can you say about the relationship between age and teaching scores based on this exploration? 5.1.2 Simple linear regression You may recall from secondary/high school algebra that the equation of a line is \\(y = a + b\\cdot x\\). (Note that the \\(\\cdot\\) symbol is equivalent to the \\(\\times\\) “multiply by” mathematical symbol. 
We’ll use the \\(\\cdot\\) symbol in the rest of this book as it is more succinct.) It is defined by two coefficients \\(a\\) and \\(b\\). The intercept coefficient \\(a\\) is the value of \\(y\\) when \\(x = 0\\). The slope coefficient \\(b\\) for \\(x\\) is the increase in \\(y\\) for every increase of one in \\(x\\). This is also called the “rise over run.” However, when defining a regression line like the regression line in Figure 5.4, we use slightly different notation: the equation of the regression line is \\(\\widehat{y} = b_0 + b_1 \\cdot x\\). The intercept coefficient is \\(b_0\\), so \\(b_0\\) is the value of \\(\\widehat{y}\\) when \\(x = 0\\). The slope coefficient for \\(x\\) is \\(b_1\\), i.e., the increase in \\(\\widehat{y}\\) for every increase of one in \\(x\\). Why do we put a “hat” on top of the \\(y\\)? It’s a form of notation commonly used in regression to indicate that we have a “fitted value,” or the value of \\(y\\) on the regression line for a given \\(x\\) value. We’ll discuss this more in the upcoming Subsection 5.1.3. We know that the regression line in Figure 5.4 has a positive slope \\(b_1\\) corresponding to our explanatory \\(x\\) variable bty_avg. Why? Because as instructors tend to have higher bty_avg scores, so also do they tend to have higher teaching evaluation scores. However, what is the numerical value of the slope \\(b_1\\)? What about the intercept \\(b_0\\)? Let’s not compute these two values by hand, but rather let’s use a computer! We can obtain the values of the intercept \\(b_0\\) and the slope \\(b_1\\) for bty_avg by outputting a linear regression table. This is done in two steps: We first “fit” the linear regression model using the lm() function and save it in score_model. We get the regression table by applying the get_regression_table() function from the moderndive package to score_model. 
# Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch5) # Get regression table: get_regression_table(score_model) TABLE 5.2: Linear regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Let’s first focus on interpreting the regression table output in Table 5.2, and then we’ll later revisit the code that produced it. In the estimate column of Table 5.2 are the intercept \\(b_0\\) = 3.88 and the slope \\(b_1\\) = 0.067 for bty_avg. Thus the equation of the regression line in Figure 5.4 follows: \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x\\\\ \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{bty}\\_\\text{avg}} \\cdot\\text{bty}\\_\\text{avg}\\\\ &amp;= 3.880 + 0.067\\cdot\\text{bty}\\_\\text{avg} \\end{aligned} \\] The intercept \\(b_0\\) = 3.88 is the average teaching score \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) for those courses where the instructor had a “beauty” score bty_avg of 0. Or in graphical terms, it’s where the line intersects the \\(y\\) axis when \\(x\\) = 0. Note, however, that while the intercept of the regression line has a mathematical interpretation, it has no practical interpretation here, since observing a bty_avg of 0 is impossible; it is the average of six panelists’ “beauty” scores ranging from 1 to 10. Furthermore, looking at the scatterplot with the regression line in Figure 5.4, no instructors had a “beauty” score anywhere near 0. Of greater interest is the slope \\(b_1\\) = \\(b_{\\text{bty\\_avg}}\\) for bty_avg of 0.067, as this summarizes the relationship between the teaching and “beauty” score variables. Note that the sign is positive, suggesting a positive relationship between these two variables, meaning teachers with higher “beauty” scores also tend to have higher teaching scores. Recall from earlier that the correlation coefficient is 0.187. 
They both have the same positive sign, but have a different value. Recall further that the correlation’s interpretation is the “strength of linear association”. The slope’s interpretation is a little different: For every increase of 1 unit in bty_avg, there is an associated increase of, on average, 0.067 units of score. We only state that there is an associated increase and not necessarily a causal increase. For example, perhaps it’s not that higher “beauty” scores directly cause higher teaching scores per se. Instead, the following could hold true: individuals from wealthier backgrounds tend to have stronger educational backgrounds and hence have higher teaching scores, while at the same time these wealthy individuals also tend to have higher “beauty” scores. In other words, just because two variables are strongly associated, it doesn’t necessarily mean that one causes the other. This is summed up in the often quoted phrase, “correlation is not necessarily causation.” We discuss this idea further in Subsection 5.3.1. Furthermore, we say that this associated increase is on average 0.067 units of teaching score, because you might have two instructors whose bty_avg scores differ by 1 unit, but their difference in teaching scores won’t necessarily be exactly 0.067. What the slope of 0.067 is saying is that across all possible courses, the average difference in teaching score between two instructors whose “beauty” scores differ by one is 0.067. Now that we’ve learned how to compute the equation for the regression line in Figure 5.4 using the values in the estimate column of Table 5.2, and how to interpret the resulting intercept and slope, let’s revisit the code that generated this table: # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch5) # Get regression table: get_regression_table(score_model) First, we “fit” the linear regression model to the data using the lm() function and save this as score_model. 
When we say “fit”, we mean “find the best fitting line to this data.” lm() stands for “linear model” and is used as follows: lm(y ~ x, data = data_frame_name) where: y is the outcome variable, followed by a tilde ~. In our case, y is set to score. x is the explanatory variable. In our case, x is set to bty_avg. The combination of y ~ x is called a model formula. (Note the order of y and x.) In our case, the model formula is score ~ bty_avg. We saw such model formulas earlier when we computed the correlation coefficient using the get_correlation() function in Subsection 5.1.1. data_frame_name is the name of the data frame that contains the variables y and x. In our case, data_frame_name is the evals_ch5 data frame. Second, we take the saved model in score_model and apply the get_regression_table() function from the moderndive package to it to obtain the regression table in Table 5.2. This function is an example of what’s known in computer programming as a wrapper function. They take other pre-existing functions and “wrap” them into a single function that hides its inner workings. This concept is illustrated in Figure 5.5. FIGURE 5.5: The concept of a wrapper function. So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details “under the hood of the car.” In our regression modeling example, the get_regression_table() function takes a saved lm() linear regression model as input and returns a data frame of the regression table as output. If you’re interested in learning more about the get_regression_table() function’s inner workings, check out Subsection 5.3.3. Lastly, you might be wondering what the remaining five columns in Table 5.2 are: std_error, statistic, p_value, lower_ci and upper_ci. They are the standard error, test statistic, p-value, lower 95% confidence interval bound, and upper 95% confidence interval bound. 
They tell us about both the statistical significance and practical significance of our results. This is loosely the “meaningfulness” of our results from a statistical perspective. Let’s put aside these ideas for now and revisit them in Chapter 10 on (statistical) inference for regression. We’ll do this after we’ve had a chance to cover standard errors in Chapter 7, confidence intervals in Chapter 8, and hypothesis testing and \\(p\\)-values in Chapter 9. Learning check (LC5.2) Fit a new simple linear regression using lm(score ~ age, data = evals_ch5) where age is the new explanatory variable \\(x\\). Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your earlier exploratory data analysis? 5.1.3 Observed/fitted values and residuals We just saw how to get the value of the intercept and the slope of a regression line from the estimate column of a regression table generated by the get_regression_table() function. Now instead say we want information on individual observations. For example, let’s focus on the 21st of the 463 courses in the evals_ch5 data frame in Table 5.3: TABLE 5.3: Data for the 21st course out of 463 ID score bty_avg age 21 4.9 7.33 31 What is the value \\(\\widehat{y}\\) on the regression line corresponding to this instructor’s bty_avg “beauty” score of 7.333? In Figure 5.6 we mark three values corresponding to the instructor for this 21st course and give their statistical names: Circle: The observed value \\(y\\) = 4.9 is this course’s instructor’s actual teaching score. Square: The fitted value \\(\\widehat{y}\\) is the value on the regression line for \\(x\\) = bty_avg = 7.333. 
This value is computed using the intercept and slope in the previous regression table: \\[\\widehat{y} = b_0 + b_1 \\cdot x = 3.88 + 0.067 \\cdot 7.333 = 4.369\\] Arrow: The length of this arrow is the residual and is computed by subtracting the fitted value \\(\\widehat{y}\\) from the observed value \\(y\\). The residual can be thought of as a model’s error or “lack of fit” for a particular observation. In the case of this course’s instructor, it is \\(y - \\widehat{y}\\) = 4.9 - 4.369 = 0.531. FIGURE 5.6: Example of observed value, fitted value, and residual. Now say we want to compute both the fitted value \\(\\widehat{y} = b_0 + b_1 \\cdot x\\) and the residual \\(y - \\widehat{y}\\) for all 463 courses in the study. Recall that each course corresponds to one of the 463 rows in the evals_ch5 data frame and also one of the 463 points in the regression plot in Figure 5.6. We could repeat the previous calculations we performed by hand 463 times, but that would be tedious and time consuming. Instead, let’s do this using a computer with the get_regression_points() function. Just like the get_regression_table() function, the get_regression_points() function is a “wrapper” function. However, this function returns a different output. Let’s apply the get_regression_points() function to score_model, which is where we saved our lm() model in the previous section. In Table 5.4 we present the results of only the 21st through 24th courses for brevity’s sake. regression_points &lt;- get_regression_points(score_model) regression_points TABLE 5.4: Regression points (for only the 21st through 24th courses) ID score bty_avg score_hat residual 21 4.9 7.33 4.37 0.531 22 4.6 7.33 4.37 0.231 23 4.5 7.33 4.37 0.131 24 4.4 5.50 4.25 0.153 Let’s inspect the individual columns and match them with the elements of Figure 5.6: The score column represents the observed outcome variable \\(y\\). This is the y-position of the 463 black points. 
The bty_avg column represents the values of the explanatory variable \\(x\\). This is the x-position of the 463 black points. The score_hat column represents the fitted values \\(\\widehat{y}\\). This is the corresponding value on the regression line for the 463 \\(x\\) values. The residual column represents the residuals \\(y - \\widehat{y}\\). This is the 463 vertical distances between the 463 black points and the regression line. Just as we did for the instructor of the 21st course in the evals_ch5 dataset (in the first row of the table), let’s repeat the calculations for the instructor of the 24th course (in the fourth row of Table 5.4): score = 4.4 is the observed teaching score \\(y\\) for this course’s instructor. bty_avg = 5.50 is the value of the explanatory variable bty_avg \\(x\\) for this course’s instructor. score_hat = 4.25 = 3.88 + 0.067 \\(\\cdot\\) 5.50 is the fitted value \\(\\widehat{y}\\) on the regression line for this course’s instructor. residual = 0.153 = 4.4 - 4.25 is the value of the residual for this instructor. In other words, the model’s fitted value was off by 0.153 teaching score units for this course’s instructor. At this point, you can skip ahead if you like to Subsection 5.3.2 to learn about the processes behind what makes “best-fitting” regression lines. As a primer, a “best-fitting” line refers to the line that minimizes the sum of squared residuals out of all possible lines we can draw through the points. In Section 5.2, we’ll discuss another common scenario of having a categorical explanatory variable and a numerical outcome variable. Learning check (LC5.3) Generate a data frame of the residuals of the model where you used age as the explanatory \\(x\\) variable. 5.2 One categorical explanatory variable It’s an unfortunate truth that life expectancy is not the same across all countries in the world. 
International development agencies are interested in studying these differences in life expectancy in the hopes of identifying where governments should allocate resources to address this problem. In this section, we’ll explore differences in life expectancy in two ways: Differences between continents: Are there significant differences in average life expectancy between the five populated continents of the world: Africa, the Americas, Asia, Europe, and Oceania? Differences within continents: How does life expectancy vary within the world’s five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia? To answer such questions, we’ll use the gapminder data frame included in the gapminder package. This dataset has international development statistics such as life expectancy, GDP per capita, and population for 142 countries for 5-year intervals between 1952 and 2007. Recall we visualized some of this data in Figure 2.1 in Subsection 2.1.2 on the grammar of graphics. We’ll use this data for basic regression again, but now using an explanatory variable \\(x\\) that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section 5.1. In other words, we’ll model the relationship between: A numerical outcome variable \\(y\\) (a country’s life expectancy) and A single categorical explanatory variable \\(x\\) (the continent that the country is a part of). When the explanatory variable \\(x\\) is categorical, the concept of a “best-fitting” regression line is a little different than the one we saw previously in Section 5.1 where the explanatory variable \\(x\\) was numerical. We’ll study these differences shortly in Subsection 5.2.2, but first we conduct an exploratory data analysis. 5.2.1 Exploratory data analysis The data on the 142 countries can be found in the gapminder data frame included in the gapminder package. 
However, to keep things simple, let’s filter() for only those observations/rows corresponding to the year 2007. Additionally, let’s select() only the subset of the variables we’ll consider in this chapter. We’ll save this data in a new data frame called gapminder2007: library(gapminder) gapminder2007 &lt;- gapminder %&gt;% filter(year == 2007) %&gt;% select(country, lifeExp, continent, gdpPercap) Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. You can do this by using RStudio’s spreadsheet viewer or by using the glimpse() command as introduced in Subsection 1.4.3 on exploring data frames: glimpse(gapminder2007) Observations: 142 Variables: 4 $ country &lt;fct&gt; Afghanistan, Albania, Algeria, Angola, Argentina, Australia… $ lifeExp &lt;dbl&gt; 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,… $ continent &lt;fct&gt; Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As… $ gdpPercap &lt;dbl&gt; 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33… Observe that Observations: 142 indicates that there are 142 rows/observations in gapminder2007, where each row corresponds to one country. In other words, the observational unit is an individual country. Furthermore, observe that the variable continent is of type &lt;fct&gt;, which stands for factor, which is R’s way of encoding categorical variables. A full description of all the variables included in gapminder can be found by reading the associated help file (run ?gapminder in the console). However, let’s fully describe only the 4 variables we selected in gapminder2007: country: An identification variable containing each country’s name, stored as a factor (&lt;fct&gt;), used to distinguish the 142 countries in the dataset. lifeExp: A numerical variable of that country’s life expectancy at birth. This is the outcome variable \\(y\\) of interest. continent: A categorical variable with five levels. 
Here “levels” correspond to the possible categories: Africa, Asia, Americas, Europe, and Oceania. This is the explanatory variable \\(x\\) of interest. gdpPercap: A numerical variable of that country’s GDP per capita in US inflation-adjusted dollars that we’ll use as another outcome variable \\(y\\) in the Learning check at the end of this subsection. Let’s look at a random sample of five out of the 142 countries in Table 5.5. gapminder2007 %&gt;% sample_n(size = 5) TABLE 5.5: Random sample of 5 out of 142 countries country lifeExp continent gdpPercap Togo 58.4 Africa 883 Sao Tome and Principe 65.5 Africa 1598 Congo, Dem. Rep. 46.5 Africa 278 Lesotho 42.6 Africa 1569 Bulgaria 73.0 Europe 10681 Note that random sampling will likely produce a different subset of 5 rows for you than what’s shown. Now that we’ve looked at the raw values in our gapminder2007 data frame and got a sense of the data, let’s move on to computing summary statistics. Let’s once again apply the skim() function from the skimr package. Recall from our previous EDA that this function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our gapminder2007 data frame, select() only the outcome and explanatory variables lifeExp and continent, and pipe them into the skim() function: gapminder2007 %&gt;% select(lifeExp, continent) %&gt;% skim() Skim summary statistics n obs: 142 n variables: 2 ── Variable type:factor variable missing complete n n_unique top_counts ordered continent 0 142 142 5 Afr: 52, Asi: 33, Eur: 30, Ame: 25 FALSE ── Variable type:numeric variable missing complete n mean sd p0 p25 p50 p75 p100 lifeExp 0 142 142 67.01 12.07 39.61 57.16 71.94 76.41 82.6 The skim() output now reports summaries for categorical variables (Variable type:factor) separately from the numerical variables (Variable type:numeric). 
For the categorical variable continent, it reports: missing, complete, and n, which are the number of missing, complete, and total number of values as before, respectively. n_unique: The number of unique levels of this variable, here five, corresponding to Africa, Asia, Americas, Europe, and Oceania. top_counts: How many countries are in the data for each continent, showing in this case only the top four counts: Africa has 52 countries, Asia has 33, Europe has 30, and Americas has 25. Not displayed is Oceania with 2 countries. ordered: This tells us whether the categorical variable is “ordinal”: whether there is an encoded hierarchy (like low, medium, high). In this case, continent is not ordered. Turning our attention to the summary statistics of the numerical variable lifeExp, we observe that the global median life expectancy in 2007 was 71.94. Thus, half of the world’s countries (71 countries) had a life expectancy less than 71.94. The mean life expectancy of 67.01 is lower, however. Why is the mean life expectancy lower than the median? We can answer this question by performing the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s visualize the distribution of our outcome variable \\(y\\) = lifeExp in Figure 5.7. ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + labs(x = &quot;Life expectancy&quot;, y = &quot;Number of countries&quot;, title = &quot;Histogram of distribution of worldwide life expectancies&quot;) FIGURE 5.7: Histogram of life expectancy in 2007. We see that this data is left-skewed, also known as negatively skewed: there are a few countries with low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers; hence, the median is greater than the mean in this case. Remember, however, that we want to compare life expectancies both between continents and within continents. 
In other words, our visualizations need to incorporate some notion of the variable continent. We can do this easily with a faceted histogram. Recall from Section 2.6 that facets allow us to split a visualization by the different values of another variable. We display the resulting visualization in Figure 5.8 by adding a facet_wrap(~ continent, nrow = 2) layer. ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5, color = &quot;white&quot;) + labs(x = &quot;Life expectancy&quot;, y = &quot;Number of countries&quot;, title = &quot;Histogram of distribution of worldwide life expectancies&quot;) + facet_wrap(~ continent, nrow = 2) FIGURE 5.8: Life expectancy in 2007. Observe that unfortunately the distribution of African life expectancies is much lower than the other continents, while in Europe life expectancies tend to be higher and furthermore do not vary as much. On the other hand, both Asia and Africa have the most variation in life expectancies. There is the least variation in Oceania, but keep in mind that there are only two countries in Oceania: Australia and New Zealand. Recall that an alternative method to visualize the distribution of a numerical variable split by a categorical variable is by using a side-by-side boxplot. We map the categorical variable continent to the \\(x\\)-axis and the different life expectancies within each continent on the \\(y\\)-axis in Figure 5.9. ggplot(gapminder2007, aes(x = continent, y = lifeExp)) + geom_boxplot() + labs(x = &quot;Continent&quot;, y = &quot;Life expectancy&quot;, title = &quot;Life expectancy by continent&quot;) FIGURE 5.9: Life expectancy in 2007. Some people prefer comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram. This is because we can make quick comparisons between the categorical variable’s levels with imaginary horizontal lines. 
For example, observe in Figure 5.9 that we can quickly convince ourselves that Oceania has the highest median life expectancies by drawing an imaginary horizontal line at \\(y\\) = 80. Furthermore, as we observed in the faceted histogram in Figure 5.8, Africa and Asia have the largest variation in life expectancy as evidenced by their large interquartile ranges (the heights of the boxes). It’s important to remember, however, that the solid lines in the middle of the boxes correspond to the medians (the middle value) rather than the mean (the average). So, for example, if you look at Asia, the solid line denotes the median life expectancy of around 72 years. This tells us that half of all countries in Asia have a life expectancy below 72 years, whereas half have a life expectancy above 72 years. Let’s compute the median and mean life expectancy for each continent with a little more data wrangling and display the results in Table 5.6. lifeExp_by_continent &lt;- gapminder2007 %&gt;% group_by(continent) %&gt;% summarize(median = median(lifeExp), mean = mean(lifeExp)) TABLE 5.6: Life expectancy by continent continent median mean Africa 52.9 54.8 Americas 72.9 73.6 Asia 72.4 70.7 Europe 78.6 77.6 Oceania 80.7 80.7 Observe the order of the second column median life expectancy: Africa is lowest, the Americas and Asia are next with similar medians, then Europe, then Oceania. This ordering corresponds to the ordering of the solid black lines inside the boxes in our side-by-side boxplot in Figure 5.9. Let’s now turn our attention to the values in the third column mean. Using Africa’s mean life expectancy of 54.8 as a baseline for comparison, let’s start making comparisons to the mean life expectancies of the other four continents and put these values in Table 5.7, which we’ll revisit later on in this section. For the Americas, it is 73.6 - 54.8 = 18.8 years higher. For Asia, it is 70.7 - 54.8 = 15.9 years higher. For Europe, it is 77.6 - 54.8 = 22.8 years higher. 
For Oceania, it is 80.7 - 54.8 = 25.9 years higher. TABLE 5.7: Mean life expectancy by continent and relative differences from mean for Africa continent mean Difference versus Africa Africa 54.8 0.0 Americas 73.6 18.8 Asia 70.7 15.9 Europe 77.6 22.8 Oceania 80.7 25.9 Learning check (LC5.4) Conduct a new exploratory data analysis with the same explanatory variable \\(x\\) being continent but with gdpPercap as the new outcome variable \\(y\\). What can you say about the differences in GDP per capita between continents based on this exploration? 5.2.2 Linear regression In Subsection 5.1.2 we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable \\(y\\) and a numerical explanatory variable \\(x\\). In our life expectancy example, we now instead have a categorical explanatory variable continent. Our model will not yield a “best-fitting” regression line like in Figure 5.4, but rather offsets relative to a baseline for comparison. As we did in Subsection 5.1.2 when studying the relationship between teaching scores and “beauty” scores, let’s output the regression table for this model. Recall that this is done in two steps: We first “fit” the linear regression model using the lm(y ~ x, data) function and save it in lifeExp_model. We get the regression table by applying the get_regression_table() function from the moderndive package to lifeExp_model. lifeExp_model &lt;- lm(lifeExp ~ continent, data = gapminder2007) get_regression_table(lifeExp_model) TABLE 5.8: Linear regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 54.8 1.02 53.45 0 52.8 56.8 continentAmericas 18.8 1.80 10.45 0 15.2 22.4 continentAsia 15.9 1.65 9.68 0 12.7 19.2 continentEurope 22.8 1.70 13.47 0 19.5 26.2 continentOceania 25.9 5.33 4.86 0 15.4 36.5 Let’s once again focus on the values in the term and estimate columns of Table 5.8. Why are there now 5 rows? 
Let’s break them down one-by-one: intercept corresponds to the mean life expectancy of countries in Africa of 54.8 years. continentAmericas corresponds to countries in the Americas and the value +18.8 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in the Americas is \\(54.8 + 18.8 = 73.6\\). continentAsia corresponds to countries in Asia and the value +15.9 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Asia is \\(54.8 + 15.9 = 70.7\\). continentEurope corresponds to countries in Europe and the value +22.8 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Europe is \\(54.8 + 22.8 = 77.6\\). continentOceania corresponds to countries in Oceania and the value +25.9 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Oceania is \\(54.8 + 25.9 = 80.7\\). To summarize, the 5 values in the estimate column in Table 5.8 correspond to the “baseline for comparison” continent Africa (the intercept) as well as four “offsets” from this baseline for the remaining 4 continents: the Americas, Asia, Europe, and Oceania. You might be asking at this point why was Africa chosen as the “baseline for comparison” group. This is the case for no other reason than it comes first alphabetically of the five continents; by default R arranges factors/categorical variables in alphanumeric order. You can change this baseline group to be another continent if you manipulate the variable continent’s factor “levels” using the forcats package. See Chapter 15 of R for Data Science (Grolemund and Wickham 2017) for examples. 
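As a hedged sketch of changing the baseline group, the following uses base R’s relevel() rather than the forcats package mentioned above (forcats::fct_relevel() plays a similar role). To stay self-contained, it relies on a hypothetical five-row data frame, continent_means, holding only the rounded continent means from Table 5.7 instead of the full gapminder2007 data; the names continent_means, continent_eu, model_africa, and model_europe are illustrative assumptions. With one row per continent, the fitted model reproduces those means exactly.

```r
# Hypothetical helper data: one row per continent, using the rounded
# mean life expectancies reported in Table 5.7 (not the full gapminder2007 data)
continent_means <- data.frame(
  continent = factor(c('Africa', 'Americas', 'Asia', 'Europe', 'Oceania')),
  lifeExp = c(54.8, 73.6, 70.7, 77.6, 80.7)
)

# Default: factor levels are in alphabetical order, so Africa is the baseline
levels(continent_means$continent)
model_africa <- lm(lifeExp ~ continent, data = continent_means)
round(coef(model_africa), 1)

# Make Europe the baseline with base R's relevel()
continent_means$continent_eu <- relevel(continent_means$continent, ref = 'Europe')
model_europe <- lm(lifeExp ~ continent_eu, data = continent_means)
round(coef(model_europe), 1)
```

With Europe as the baseline, the intercept becomes Europe’s mean of 77.6 and the offsets for the other continents change accordingly, for example 54.8 - 77.6 = -22.8 for Africa.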
Let’s now write the equation for our fitted values \\(\\widehat{y} = \\widehat{\\text{life exp}}\\). \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{life exp}} &amp;= b_0 + b_{\\text{Amer}}\\cdot\\mathbb{1}_{\\text{Amer}}(x) + b_{\\text{Asia}}\\cdot\\mathbb{1}_{\\text{Asia}}(x) + \\\\ &amp; \\qquad b_{\\text{Euro}}\\cdot\\mathbb{1}_{\\text{Euro}}(x) + b_{\\text{Ocean}}\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\text{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\text{Asia}}(x) + \\\\ &amp; \\qquad 22.8\\cdot\\mathbb{1}_{\\text{Euro}}(x) + 25.9\\cdot\\mathbb{1}_{\\text{Ocean}}(x) \\end{aligned} \\] Whoa! That looks daunting! Don’t fret, however, as once you understand what all the elements mean, things simplify greatly. First, \\(\\mathbb{1}_{A}(x)\\) is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where \\[ \\mathbb{1}_{A}(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\text{if } x \\text{ is in } A \\\\ 0 &amp; \\text{if } \\text{otherwise} \\end{array} \\right. \\] In a statistical modeling context, this is also known as a dummy variable. In our case, let’s consider the first such indicator variable \\(\\mathbb{1}_{\\text{Amer}}(x)\\). This indicator function returns 1 if a country is in the Americas, 0 otherwise: \\[ \\mathbb{1}_{\\text{Amer}}(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\text{if } \\text{country } x \\text{ is in the Americas} \\\\ 0 &amp; \\text{otherwise}\\end{array} \\right. \\] Second, \\(b_0\\) corresponds to the intercept as before; in this case, it’s the mean life expectancy of all countries in Africa. Third, the \\(b_{\\text{Amer}}\\), \\(b_{\\text{Asia}}\\), \\(b_{\\text{Euro}}\\), and \\(b_{\\text{Ocean}}\\) represent the 4 “offsets relative to the baseline for comparison” in the regression table output in Table 5.8: continentAmericas, continentAsia, continentEurope, and continentOceania. 
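To see the indicator functions concretely, here is a small sketch using base R’s model.matrix(), which displays the 0/1 dummy variables that lm() constructs from a factor. The data frame one_per_continent is a hypothetical stand-in with a single row per continent, not the gapminder2007 data, and the object name dummies is likewise just illustrative.

```r
# Hypothetical stand-in: one row for each of the five continents
one_per_continent <- data.frame(
  continent = factor(c('Africa', 'Americas', 'Asia', 'Europe', 'Oceania'))
)

# Each column after the intercept is an indicator (dummy) variable
dummies <- model.matrix(~ continent, data = one_per_continent)

# Label the rows by continent so the pattern of 0s and 1s is easy to read
rownames(dummies) <- as.character(one_per_continent$continent)
dummies
```

The Africa row contains 0 in all four indicator columns, while every other row contains a single 1 in its own continent’s column.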
Let’s put this all together and compute the fitted value \\(\\widehat{y} = \\widehat{\\text{life exp}}\\) for a country in Africa. Since the country is in Africa, all four indicator functions \\(\\mathbb{1}_{\\text{Amer}}(x)\\), \\(\\mathbb{1}_{\\text{Asia}}(x)\\), \\(\\mathbb{1}_{\\text{Euro}}(x)\\), and \\(\\mathbb{1}_{\\text{Ocean}}(x)\\) will equal 0, and thus: \\[ \\begin{aligned} \\widehat{\\text{life exp}} &amp;= b_0 + b_{\\text{Amer}}\\cdot\\mathbb{1}_{\\text{Amer}}(x) + b_{\\text{Asia}}\\cdot\\mathbb{1}_{\\text{Asia}}(x) + \\\\ &amp; \\qquad b_{\\text{Euro}}\\cdot\\mathbb{1}_{\\text{Euro}}(x) + b_{\\text{Ocean}}\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\text{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\text{Asia}}(x) + \\\\ &amp; \\qquad 22.8\\cdot\\mathbb{1}_{\\text{Euro}}(x) + 25.9\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot 0 + 15.9\\cdot 0 + 22.8\\cdot 0 + 25.9\\cdot 0\\\\ &amp;= 54.8 \\end{aligned} \\] In other words, all that’s left is the intercept \\(b_0\\), corresponding to the average life expectancy of African countries of 54.8 years. Next, say we are considering a country in the Americas. In this case, only the indicator function \\(\\mathbb{1}_{\\text{Amer}}(x)\\) for the Americas will equal 1, while all the others will equal 0, and thus: \\[ \\begin{aligned} \\widehat{\\text{life exp}} &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\text{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\text{Asia}}(x) + 22.8\\cdot\\mathbb{1}_{\\text{Euro}}(x) + \\\\ &amp; \\qquad 25.9\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot 1 + 15.9\\cdot 0 + 22.8\\cdot 0 + 25.9\\cdot 0\\\\ &amp;= 54.8 + 18.8 \\\\ &amp; = 73.6 \\end{aligned} \\] which is the mean life expectancy for countries in the Americas of 73.6 years in Table 5.7. Note the “offset from the baseline for comparison” is +18.8 years. Let’s do one more. Say we are considering a country in Asia. 
In this case, only the indicator function \\(\\mathbb{1}_{\\text{Asia}}(x)\\) for Asia will equal 1, while all the others will equal 0, and thus: \\[ \\begin{aligned} \\widehat{\\text{life exp}} &amp;= 54.8 + 18.8\\cdot\\mathbb{1}_{\\text{Amer}}(x) + 15.9\\cdot\\mathbb{1}_{\\text{Asia}}(x) + 22.8\\cdot\\mathbb{1}_{\\text{Euro}}(x) + \\\\ &amp; \\qquad 25.9\\cdot\\mathbb{1}_{\\text{Ocean}}(x)\\\\ &amp;= 54.8 + 18.8\\cdot 0 + 15.9\\cdot 1 + 22.8\\cdot 0 + 25.9\\cdot 0\\\\ &amp;= 54.8 + 15.9 \\\\ &amp; = 70.7 \\end{aligned} \\] which is the mean life expectancy for Asian countries of 70.7 years in Table 5.7. The “offset from the baseline for comparison” here is +15.9 years. Let’s generalize this idea a bit. If we fit a linear regression model using a categorical explanatory variable \\(x\\) that has \\(k\\) possible categories, the regression table will return an intercept and \\(k - 1\\) “offsets.” In our case, since there are \\(k = 5\\) continents, the regression model returns an intercept corresponding to the baseline for comparison group of Africa and \\(k - 1 = 4\\) offsets corresponding to the Americas, Asia, Europe, and Oceania. Understanding a regression table output when you’re using a categorical explanatory variable is a topic those new to regression often struggle with. The only real remedy for these struggles is practice, practice, practice. However, once you equip yourselves with an understanding of how to create regression models using categorical explanatory variables, you’ll be able to incorporate many new variables into your models, given the large amount of the world’s data that is categorical. If you feel like you’re still struggling at this point, however, we suggest you closely compare Tables 5.7 and 5.8 and note how you can compute all the values from one table using the values in the other. 
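To see how one table can be rebuilt from the other, here is a small sketch on made-up data. The data frame `toy`, its four groups, and its values are all hypothetical; only lm(), coef(), and tapply() from base R are used.

```r
# Hypothetical toy data with k = 4 categories (not the gapminder values)
toy <- data.frame(
  group = factor(c('a', 'a', 'b', 'b', 'c', 'c', 'd', 'd')),
  y = c(1, 3, 5, 7, 10, 12, 20, 22)
)

# Group means, analogous to Table 5.7
group_means <- tapply(toy$y, toy$group, mean)

# Intercept and offsets, analogous to Table 5.8
b <- coef(lm(y ~ group, data = toy))
n_offsets <- length(b) - 1  # k - 1 offsets for k categories

# Rebuild each group mean as intercept + offset (offset is 0 for the baseline)
from_regression <- b[['(Intercept)']] + c(0, b[['groupb']], b[['groupc']], b[['groupd']])

# The two routes agree up to floating-point error
all.equal(as.numeric(group_means), from_regression)
```

With four categories the model reports one intercept and three offsets, matching the \\(k - 1\\) rule stated above.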
Learning check (LC5.5) Fit a new linear regression using lm(gdpPercap ~ continent, data = gapminder2007) where gdpPercap is the new outcome variable \\(y\\). Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your previous exploratory data analysis? 5.2.3 Observed/fitted values and residuals Recall in Subsection 5.1.3, we defined the following three concepts: Observed values \\(y\\), or the observed value of the outcome variable Fitted values \\(\\widehat{y}\\), or the value on the regression line for a given \\(x\\) value Residuals \\(y - \\widehat{y}\\), or the error between the observed value and the fitted value We obtained these values and other values using the get_regression_points() function from the moderndive package. This time, however, let’s add an argument setting ID = &quot;country&quot;: this is telling the function to use the variable country in gapminder2007 as an identification variable in the output. This will help contextualize our analysis by matching values to countries. regression_points &lt;- get_regression_points(lifeExp_model, ID = &quot;country&quot;) regression_points TABLE 5.9: Regression points (First 10 out of 142 countries) country lifeExp continent lifeExp_hat residual Afghanistan 43.8 Asia 70.7 -26.900 Albania 76.4 Europe 77.6 -1.226 Algeria 72.3 Africa 54.8 17.495 Angola 42.7 Africa 54.8 -12.075 Argentina 75.3 Americas 73.6 1.712 Australia 81.2 Oceania 80.7 0.515 Austria 79.8 Europe 77.6 2.180 Bahrain 75.6 Asia 70.7 4.907 Bangladesh 64.1 Asia 70.7 -6.666 Belgium 79.4 Europe 77.6 1.792 Observe in Table 5.9 that lifeExp_hat contains the fitted values \\(\\widehat{y}\\) = \\(\\widehat{\\text{lifeExp}}\\). If you look closely, there are only 5 possible values for lifeExp_hat. 
These correspond to the five mean life expectancies for the 5 continents that we displayed in Table 5.7 and computed using the values in the estimate column of the regression table in Table 5.8. The residual column is simply \\(y - \\widehat{y}\\) = lifeExp - lifeExp_hat. These values can be interpreted as the deviation of a country’s life expectancy from its continent’s average life expectancy. For example, look at the first row of Table 5.9 corresponding to Afghanistan. The residual of \\(y - \\widehat{y} = 43.8 - 70.7 = -26.9\\) is telling us that Afghanistan’s life expectancy is a whopping 26.9 years lower than the mean life expectancy of all Asian countries. This can in part be explained by the many years of war that country has suffered. Learning check (LC5.6) Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter 3, identify the five countries with the five smallest (most negative) residuals? What do these negative residuals say about their life expectancy relative to their continents’ life expectancy? (LC5.7) Repeat this process, but identify the five countries with the five largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents’ life expectancy? 5.3 Related topics 5.3.1 Correlation is not necessarily causation Throughout this chapter we’ve been cautious when interpreting regression slope coefficients. We always discussed the “associated” effect of an explanatory variable \\(x\\) on an outcome variable \\(y\\). For example, our statement from Subsection 5.1.2 that “for every increase of 1 unit in bty_avg, there is an associated increase of on average 0.067 units of score.” We include the term “associated” to be extra careful not to suggest we are making a causal statement. 
So while “beauty” score of bty_avg is positively correlated with teaching score, we can’t necessarily make any statements about “beauty” scores’ direct causal effect on teaching score without more information on how this study was conducted. Here is another example: a not-so-great medical doctor goes through medical records and finds that patients who slept with their shoes on tended to wake up with headaches more often. So this doctor declares, “Sleeping with shoes on causes headaches!” FIGURE 5.10: Does sleeping with shoes on cause headaches? However, there is a good chance that if someone is sleeping with their shoes on, it’s potentially because they are intoxicated from alcohol. Furthermore, higher levels of drinking lead to more hangovers, and hence more headaches. The amount of alcohol consumption here is what’s known as a confounding/lurking variable. It “lurks” behind the scenes, confounding the causal relationship (if any) of “sleeping with shoes on” with “waking up with a headache.” We can summarize this in Figure 5.11 with a causal graph where: Y is a response variable; here it is “waking up with a headache.” X is a treatment variable whose causal effect we are interested in; here it is “sleeping with shoes on.” FIGURE 5.11: Causal graph. To study the relationship between Y and X, we could use a regression model where the outcome variable is set to Y and the explanatory variable is set to be X, as you’ve been doing throughout this chapter. However, Figure 5.11 also includes a third variable with arrows pointing at both X and Y: Z is a confounding variable that affects both X and Y, thereby “confounding” their relationship. Here the confounding variable is alcohol. Alcohol will cause people to be both more likely to sleep with their shoes on as well as be more likely to wake up with a headache. Thus any regression model of the relationship between X and Y should also use Z as an explanatory variable. 
In other words, our doctor needs to take into account who had been drinking the night before. In the next chapter, we’ll start covering multiple regression models that allow us to incorporate more than one variable in our regression models. Establishing causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of confounding variables. Both these approaches attempt, as best they can, either to take all possible confounding variables into account or negate their impact. This allows researchers to focus only on the relationship of interest: the relationship between the outcome variable Y and the treatment variable X. As you read news stories, be careful not to fall into the trap of thinking that correlation necessarily implies causation. Check out the Spurious Correlations website for some rather comical examples of variables that are correlated, but are definitely not causally related. 5.3.2 Best-fitting line Regression lines are also known as “best-fitting” lines. But what do we mean by “best”? Let’s unpack the criterion that is used in regression to determine “best.” Recall Figure 5.6, where for an instructor with a beauty score of \\(x = 7.333\\) we mark the observed value \\(y\\) with a circle, the fitted value \\(\\widehat{y}\\) with a square, and the residual \\(y - \\widehat{y}\\) with an arrow. We re-display Figure 5.6 in the top-left plot of Figure 5.12 in addition to three more arbitrarily chosen course instructors: FIGURE 5.12: Example of observed value, fitted value, and residual. The three other plots refer to: A course whose instructor had a “beauty” score \\(x\\) = 2.333 and teaching score \\(y\\) = 2.7. The residual in this case is \\(2.7 - 4.036 = -1.336\\), which we mark with a new blue arrow in the top-right plot. A course whose instructor had a “beauty” score \\(x = 3.667\\) and teaching score \\(y = 4.4\\). 
The residual in this case is \\(4.4 - 4.125 = 0.2753\\), which we mark with a new blue arrow in the bottom-left plot. A course whose instructor had a “beauty” score \\(x = 6\\) and teaching score \\(y = 3.8\\). The residual in this case is \\(3.8 - 4.28 = -0.4802\\), which we mark with a new blue arrow in the bottom-right plot. Now say we repeated this process of computing residuals for all 463 courses’ instructors, then we squared all the residuals, and then we summed them. We call this quantity the sum of squared residuals; it is a measure of the lack of fit of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse fitting model. If the regression line fits all the points perfectly, then the sum of squared residuals is 0. This is because if the regression line fits all the points perfectly, then the fitted value \\(\\widehat{y}\\) equals the observed value \\(y\\) in all cases, and hence the residual \\(y-\\widehat{y}\\) = 0 in all cases, and the sum of even a large number of 0’s is still 0. Furthermore, of all possible lines we can draw through the cloud of 463 points, the regression line minimizes this value. 
In other words, the regression line and its corresponding fitted values \\(\\widehat{y}\\) minimize the sum of the squared residuals: \\[ \\sum_{i=1}^{n}(y_i - \\widehat{y}_i)^2 \\] Let’s use our data wrangling tools from Chapter 3 to compute the sum of squared residuals exactly: # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch5) # Get regression points: regression_points &lt;- get_regression_points(score_model) regression_points # A tibble: 463 x 5 ID score bty_avg score_hat residual &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 4.7 5 4.21 0.486 2 2 4.1 5 4.21 -0.114 3 3 3.9 5 4.21 -0.314 4 4 4.8 5 4.21 0.586 5 5 4.6 3 4.08 0.52 6 6 4.3 3 4.08 0.22 7 7 2.8 3 4.08 -1.28 8 8 4.1 3.33 4.10 -0.002 9 9 3.4 3.33 4.10 -0.702 10 10 4.5 3.17 4.09 0.409 # … with 453 more rows # Compute sum of squared residuals regression_points %&gt;% mutate(squared_residuals = residual^2) %&gt;% summarize(sum_of_squared_residuals = sum(squared_residuals)) # A tibble: 1 x 1 sum_of_squared_residuals &lt;dbl&gt; 1 132. Any other straight line drawn in the figure would yield a sum of squared residuals greater than 132. This is a mathematically guaranteed fact that you can prove using calculus and linear algebra. That’s why alternative names for the linear regression line are the best-fitting line and the least-squares line. Why do we square the residuals (i.e., the arrow lengths)? So that both positive and negative deviations of the same amount are treated equally. Learning check (LC5.8) Note in Figure 5.13 there are 3 points marked with dots and: The “best” fitting solid regression line in blue An arbitrarily chosen dotted red line Another arbitrarily chosen dashed green line FIGURE 5.13: Regression line and two others. Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value. 
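As a quick numerical check of the least-squares property discussed in this subsection, the following sketch compares the fitted line against two arbitrarily perturbed lines. The vectors `x` and `y` and the helper `ssr()` are hypothetical and made up for illustration; they are not the evals_ch5 data or the points in Figure 5.13.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for any line with the given intercept and slope
ssr <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

fit <- lm(y ~ x)
b <- coef(fit)

ssr_best <- ssr(b[[1]], b[[2]])           # least-squares line
ssr_steeper <- ssr(b[[1]], b[[2]] + 0.1)  # same intercept, steeper slope
ssr_shifted <- ssr(b[[1]] + 0.5, b[[2]])  # same slope, shifted upward

c(best = ssr_best, steeper = ssr_steeper, shifted = ssr_shifted)
```

Both perturbed lines give a larger sum of squared residuals than the line returned by lm(), which is what “best-fitting” means here.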
5.3.3 get_regression_x() functions Recall in this chapter we introduced two functions from the moderndive package: get_regression_table() that returns a regression table in Subsection 5.1.2 and get_regression_points() that returns point-by-point information from a regression model in Subsection 5.1.3. What is going on behind the scenes with the get_regression_table() and get_regression_points() functions? We mentioned in Subsection 5.1.2 that these were examples of wrapper functions. Such functions take other pre-existing functions and “wrap” them into single functions that hide their inner workings from the user. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection, we’ll “get under the hood” of these functions and see how the “engine” of these wrapper functions works. Recall our two-step process to generate a regression table from Subsection 5.1.2: # Fit regression model: score_model &lt;- lm(formula = score ~ bty_avg, data = evals_ch5) # Get regression table: get_regression_table(score_model) TABLE 5.10: Regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 The get_regression_table() wrapper function takes two pre-existing functions in other R packages: tidy() from the broom package (Robinson and Hayes 2019) and clean_names() from the janitor package (Firke 2019) and “wraps” them into a single function that takes in a saved lm() linear model, here score_model, and returns a regression table saved as a “tidy” data frame. 
Here is how we used the tidy() and clean_names() functions to produce Table 5.11: library(broom) library(janitor) score_model %&gt;% tidy(conf.int = TRUE) %&gt;% mutate_if(is.numeric, round, digits = 3) %&gt;% clean_names() %&gt;% rename(lower_ci = conf_low, upper_ci = conf_high) TABLE 5.11: Regression table using tidy() from broom package term estimate std_error statistic p_value lower_ci upper_ci (Intercept) 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Yikes! That’s a lot of code! So, in order to simplify your lives, we made the editorial decision to “wrap” all the code into get_regression_table(), freeing you from the need to understand the inner workings of the function. Note that the mutate_if() function is from the dplyr package and applies the round() function, rounding to three decimal places, only to those variables that are numerical. Similarly, the get_regression_points() function is another wrapper function, but this time returning information about the individual points involved in a regression model like the fitted values, observed values, and the residuals. 
get_regression_points() uses the augment() function in the broom package instead of the tidy() function as with get_regression_table() to produce the data shown in Table 5.12: library(broom) library(janitor) score_model %&gt;% augment() %&gt;% mutate_if(is.numeric, round, digits = 3) %&gt;% clean_names() %&gt;% select(-c(&quot;se_fit&quot;, &quot;hat&quot;, &quot;sigma&quot;, &quot;cooksd&quot;, &quot;std_resid&quot;)) TABLE 5.12: Regression points using augment() from broom package score bty_avg fitted resid 4.7 5.00 4.21 0.486 4.1 5.00 4.21 -0.114 3.9 5.00 4.21 -0.314 4.8 5.00 4.21 0.586 4.6 3.00 4.08 0.520 4.3 3.00 4.08 0.220 2.8 3.00 4.08 -1.280 4.1 3.33 4.10 -0.002 3.4 3.33 4.10 -0.702 4.5 3.17 4.09 0.409 In this case, it outputs only the variables of interest to students learning regression: the outcome variable \\(y\\) (score), all explanatory/predictor variables (bty_avg), all resulting fitted values \\(\\hat{y}\\) computed by applying the equation of the regression line to bty_avg, and the residual \\(y - \\hat{y}\\). If you’re even more curious about how these and other wrapper functions work, take a look at the source code for these functions on GitHub. 5.4 Conclusion 5.4.1 Additional resources An R script file of all R code used in this chapter is available here. As we suggested in Subsection 5.1.1, interpreting correlation coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the 80s-style video game called, “Guess the Correlation”, at http://guessthecorrelation.com/. FIGURE 5.14: Preview of “Guess the Correlation” game. 5.4.2 What’s to come? In this chapter, you’ve studied the term basic regression, where you fit models that only have one explanatory variable. In Chapter 6, we’ll study multiple regression, where our regression models can now have more than one explanatory variable! 
In particular, we’ll consider two scenarios: regression models with one numerical and one categorical explanatory variable and regression models with two numerical explanatory variables. This will allow you to construct more sophisticated and more powerful models, all in the hopes of better explaining your outcome variable \\(y\\). References "],
+["6-multiple-regression.html", "Chapter 6 Multiple Regression 6.1 One numerical and one categorical explanatory variable 6.2 Two numerical explanatory variables 6.3 Related topics 6.4 Conclusion", " Chapter 6 Multiple Regression In Chapter 5 we introduced ideas related to modeling for explanation, in particular that the goal of modeling is to make explicit the relationship between some outcome variable \\(y\\) and some explanatory variable \\(x\\). While there are many approaches to modeling, we focused on one particular technique: linear regression, one of the most commonly used and easy-to-understand approaches to modeling. Furthermore to keep things simple, we only considered models with one explanatory \\(x\\) variable that was either numerical in Section 5.1 or categorical in Section 5.2. In this chapter on multiple regression, we’ll start considering models that include more than one explanatory variable \\(x\\). You can imagine when trying to model a particular outcome variable, like teaching evaluation scores as in Section 5.1 or life expectancy as in Section 5.2, that it would be useful to include more than just one explanatory variable’s worth of information. Since our regression models will now consider more than one explanatory variable, the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables included in your model. Let’s begin! Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). 
Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(skimr) library(ISLR) 6.1 One numerical and one categorical explanatory variable Let’s revisit the instructor evaluation data from UT Austin we introduced in Section 5.1. We studied the relationship between teaching evaluation scores as given by students and “beauty” scores. The variable teaching score was the numerical outcome variable \\(y\\), and the variable “beauty” score (bty_avg) was the numerical explanatory \\(x\\) variable. In this section, we are going to consider a different model. Our outcome variable will still be teaching score, but we’ll now include two different explanatory variables: age and (binary) gender. Could it be that instructors who are older receive better teaching evaluations from students? Or could it instead be that younger instructors receive better evaluations? Are there differences in evaluations given by students for instructors of different genders? We’ll answer these questions by modeling the relationship between these variables using multiple regression, where we have: A numerical outcome variable \\(y\\), the instructor’s teaching score, and Two explanatory variables: A numerical explanatory variable \\(x_1\\), the instructor’s age. A categorical explanatory variable \\(x_2\\), the instructor’s (binary) gender. It is important to note that at the time of this study due to then commonly held beliefs about gender, this variable was often recorded as a binary variable. 
While the results of a model that oversimplifies gender this way may be imperfect, we still found the results to be pertinent and relevant today. 6.1.1 Exploratory data analysis Recall that data on the 463 courses at UT Austin can be found in the evals data frame included in the moderndive package. However, to keep things simple, let’s select() only the subset of the variables we’ll consider in this chapter, and save this data in a new data frame called evals_ch6. Note that these are different than the variables chosen in Chapter 5. evals_ch6 &lt;- evals %&gt;% select(ID, score, age, gender) Recall the three common steps in an exploratory data analysis we saw in Subsection 5.1.1: Looking at the raw data values. Computing summary statistics. Creating data visualizations. Let’s first look at the raw data values by either looking at evals_ch6 using RStudio’s spreadsheet viewer or by using the glimpse() function from the dplyr package: glimpse(evals_ch6) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40… $ gender &lt;fct&gt; female, female, female, female, male, male, male, male, male, … Let’s also display a random sample of 5 rows of the 463 rows corresponding to different courses in Table 6.1. Remember due to the random nature of the sampling, you will likely end up with a different subset of 5 rows. evals_ch6 %&gt;% sample_n(size = 5) TABLE 6.1: A random sample of 5 out of the 463 courses at UT Austin ID score age gender 129 3.7 62 male 109 4.7 46 female 28 4.8 62 male 434 2.8 62 male 330 4.0 64 male Now that we’ve looked at the raw values in our evals_ch6 data frame and got a sense of the data, let’s compute summary statistics. 
As we did in our exploratory data analyses in Sections 5.1.1 and 5.2.1 from the previous chapter, let’s use the skim() function from the skimr package, being sure to only select() the variables of interest in our model: evals_ch6 %&gt;% select(score, age, gender) %&gt;% skim() Skim summary statistics n obs: 463 n variables: 3 ── Variable type:factor variable missing complete n n_unique top_counts ordered gender 0 463 463 2 mal: 268, fem: 195, NA: 0 FALSE ── Variable type:integer variable missing complete n mean sd p0 p25 p50 p75 p100 age 0 463 463 48.37 9.8 29 42 48 57 73 ── Variable type:numeric variable missing complete n mean sd p0 p25 p50 p75 p100 score 0 463 463 4.17 0.54 2.3 3.8 4.3 4.6 5 Observe that we have no missing data, that there are 268 courses taught by male instructors and 195 courses taught by female instructors, and that the average instructor age is 48.37. Recall that each row represents a particular course and that the same instructor often teaches more than one course. Therefore, the average age of the unique instructors may differ. Furthermore, let’s compute the correlation coefficient between our two numerical variables: score and age. Recall from Subsection 5.1.1 that correlation coefficients only exist between numerical variables. We observe that they are “weakly negatively” correlated. evals_ch6 %&gt;% get_correlation(formula = score ~ age) # A tibble: 1 x 1 cor &lt;dbl&gt; 1 -0.107 Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Given that the outcome variable score and explanatory variable age are both numerical, we’ll use a scatterplot to display their relationship. How can we incorporate the categorical variable gender, however? By mapping the variable gender to the color aesthetic, thereby creating a colored scatterplot. 
The following code is similar to the code that created the scatterplot of teaching score over “beauty” score in Figure 5.2, but with color = gender added to the aes()thetic mapping. ggplot(evals_ch6, aes(x = age, y = score, color = gender)) + geom_point() + labs(x = &quot;Age&quot;, y = &quot;Teaching Score&quot;, color = &quot;Gender&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 6.1: Colored scatterplot of relationship of teaching score and age. In the resulting Figure 6.1, observe that ggplot() assigns a default red/blue color scheme to the points and to the lines associated with the two levels of gender: female and male. Furthermore, the geom_smooth(method = &quot;lm&quot;, se = FALSE) layer automatically fits a different regression line for each group. We notice some interesting trends. First, there are almost no women faculty over the age of 60 as evidenced by lack of red dots above \\(x\\) = 60. Second, while both regression lines are negatively sloped with age (i.e., older instructors tend to have lower scores), the slope for age for the female instructors is more negative. In other words, female instructors are paying a harsher penalty for advanced age than the male instructors. 6.1.2 Interaction model Let’s now quantify the relationship of our outcome variable \\(y\\) and the two explanatory variables using one type of multiple regression model known as an interaction model. We’ll explain where the term “interaction” comes from at the end of this section. In particular, we’ll write out the equation of the two regression lines in Figure 6.1 using the values from a regression table. Before we do this, however, let’s go over a brief refresher of regression when you have a categorical explanatory variable \\(x\\). Recall in Subsection 5.2.2 we fit a regression model for countries’ life expectancies as a function of which continent the country was in. 
In other words, we had a numerical outcome variable \\(y\\) = lifeExp and a categorical explanatory variable \\(x\\) = continent which had 5 levels: Africa, Americas, Asia, Europe, and Oceania. Let’s re-display the regression table you saw in Table 5.8: TABLE 6.2: Regression table for life expectancy as a function of continent term estimate std_error statistic p_value lower_ci upper_ci intercept 54.8 1.02 53.45 0 52.8 56.8 continentAmericas 18.8 1.80 10.45 0 15.2 22.4 continentAsia 15.9 1.65 9.68 0 12.7 19.2 continentEurope 22.8 1.70 13.47 0 19.5 26.2 continentOceania 25.9 5.33 4.86 0 15.4 36.5 Recall our interpretation of the estimate column. Since Africa was the “baseline for comparison” group, the intercept term corresponds to the mean life expectancy for all countries in Africa of 54.8 years. The other four values of estimate correspond to “offsets” relative to the baseline group. So, for example, the “offset” corresponding to the Americas is +18.8 as compared to the baseline for comparison group Africa. In other words, the average life expectancy for countries in the Americas is 18.8 years higher. Thus the mean life expectancy for all countries in the Americas is 54.8 + 18.8 = 73.6. The same interpretation holds for Asia, Europe, and Oceania. Going back to our multiple regression model for teaching score using age and gender in Figure 6.1, we generate the regression table using the same two-step approach from Chapter 5: we first “fit” the model using the lm() “linear model” function and then we apply the get_regression_table() function. This time, however, our model formula won’t be of the form y ~ x, but rather of the form y ~ x1 * x2. 
In other words, our two explanatory variables x1 and x2 are separated by a * sign: # Fit regression model: score_model_interaction &lt;- lm(score ~ age * gender, data = evals_ch6) # Get regression table: get_regression_table(score_model_interaction) TABLE 6.3: Regression table for interaction model term estimate std_error statistic p_value lower_ci upper_ci intercept 4.883 0.205 23.80 0.000 4.480 5.286 age -0.018 0.004 -3.92 0.000 -0.026 -0.009 gendermale -0.446 0.265 -1.68 0.094 -0.968 0.076 age:gendermale 0.014 0.006 2.45 0.015 0.003 0.024 Looking at the regression table output in Table 6.3, there are four rows of values in the estimate column. While it is not immediately apparent, using these four values we can write out the equations of both lines in Figure 6.1. First, since the word female comes alphabetically before male, female instructors are the “baseline for comparison” group. Thus, intercept is the intercept for only the female instructors. This holds similarly for age. It is the slope for age for only the female instructors. Thus, the red regression line in Figure 6.1 has an intercept of 4.883 and slope for age of -0.018. Remember that for this data, while the intercept has a mathematical interpretation, it has no practical interpretation since instructors can’t have zero age. What about the intercept and slope for age of the male instructors in the blue line in Figure 6.1? This is where our notion of “offsets” comes into play once again. The value for gendermale of -0.446 is not the intercept for the male instructors, but rather the offset in intercept for male instructors relative to female instructors. The intercept for the male instructors is intercept + gendermale = 4.883 + (-0.446) = 4.883 - 0.446 = 4.437. Similarly, age:gendermale = 0.014 is not the slope for age for the male instructors, but rather the offset in slope for the male instructors. 
Therefore, the slope for age for the male instructors is age + age:gendermale \\(= -0.018 + 0.014 = -0.004\\). Thus, the blue regression line in Figure 6.1 has intercept 4.437 and slope for age of -0.004. Let’s summarize these values in Table 6.4 and focus on the two slopes for age: TABLE 6.4: Comparison of intercepts and slopes for interaction model Gender Intercept Slope for age Female instructors 4.883 -0.018 Male instructors 4.437 -0.004 Since the slope for age for the female instructors was -0.018, it means that on average, a female instructor who is a year older would have a teaching score that is 0.018 units lower. For the male instructors, however, the corresponding associated decrease was on average only 0.004 units. While both slopes for age were negative, the slope for age for the female instructors is more negative. This is consistent with our observation from Figure 6.1, that this model is suggesting that age impacts teaching scores for female instructors more than for male instructors. Let’s now write the equation for our regression lines, which we can use to compute our fitted values \\(\\widehat{y} = \\widehat{\\text{score}}\\). \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{age}} \\cdot \\text{age} + b_{\\text{male}} \\cdot \\mathbb{1}_{\\text{is male}}(x) + b_{\\text{age,male}} \\cdot \\text{age} \\cdot \\mathbb{1}_{\\text{is male}}\\\\ &amp;= 4.883 -0.018 \\cdot \\text{age} - 0.446 \\cdot \\mathbb{1}_{\\text{is male}}(x) + 0.014 \\cdot \\text{age} \\cdot \\mathbb{1}_{\\text{is male}} \\end{aligned} \\] Whoa! That’s even more daunting than the equation you saw for the life expectancy as a function of continent in Subsection 5.2.2! However, if you recall what an “indicator function” does, the equation simplifies greatly. 
In the previous equation, we have one indicator function of interest: \\[ \\mathbb{1}_{\\text{is male}}(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\text{if } \\text{instructor } x \\text{ is male} \\\\ 0 &amp; \\text{otherwise}\\end{array} \\right. \\] Next, let’s match coefficients in the previous equation with values in the estimate column in our regression table in Table 6.3: \\(b_0\\) is the intercept = 4.883 for the female instructors; \\(b_{\\text{age}}\\) is the slope for age = -0.018 for the female instructors; \\(b_{\\text{male}}\\) is the offset in intercept = -0.446 for the male instructors; and \\(b_{\\text{age,male}}\\) is the offset in slope for age = 0.014 for the male instructors. Let’s put this all together and compute the fitted value \\(\\widehat{y} = \\widehat{\\text{score}}\\) for female instructors. Since for female instructors \\(\\mathbb{1}_{\\text{is male}}(x)\\) = 0, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.883 - 0.018 \\cdot \\text{age} - 0.446 \\cdot 0 + 0.014 \\cdot \\text{age} \\cdot 0\\\\ &amp;= 4.883 - 0.018 \\cdot \\text{age} - 0 + 0\\\\ &amp;= 4.883 - 0.018 \\cdot \\text{age}\\\\ \\end{aligned} \\] which is the equation of the red regression line in Figure 6.1 corresponding to the female instructors in Table 6.4. Correspondingly, since for male instructors \\(\\mathbb{1}_{\\text{is male}}(x)\\) = 1, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.883 - 0.018 \\cdot \\text{age} - 0.446 \\cdot 1 + 0.014 \\cdot \\text{age} \\cdot 1\\\\ &amp;= (4.883 - 0.446) + (- 0.018 + 0.014) \\cdot \\text{age}\\\\ &amp;= 4.437 - 0.004 \\cdot \\text{age}\\\\ \\end{aligned} \\] which is the equation of the blue regression line in Figure 6.1 corresponding to the male instructors in Table 6.4. Phew! That was a lot of arithmetic! Don’t fret, however: this is as hard as modeling will get in this book. 
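The arithmetic above can also be checked in a few lines of base R. The following is only a sketch: the four coefficient values are typed in by hand from the estimate column of Table 6.3 (so they are rounded), and the helper names fitted_score() and line_for() are made up for this illustration and are not part of the moderndive package.

```r
# Estimates copied by hand from Table 6.3 (rounded to three decimals)
b0 <- 4.883          # intercept for female instructors
b_age <- -0.018      # slope for age for female instructors
b_male <- -0.446     # offset in intercept for male instructors
b_age_male <- 0.014  # offset in slope for age for male instructors

# Fitted score; is_male plays the role of the indicator function (0 or 1)
fitted_score <- function(age, is_male) {
  b0 + b_age * age + b_male * is_male + b_age_male * age * is_male
}

# Intercept = fitted value at age 0; slope = change per additional year of age
line_for <- function(is_male) {
  c(intercept = fitted_score(0, is_male),
    slope = fitted_score(1, is_male) - fitted_score(0, is_male))
}

group_lines <- rbind(female = line_for(0), male = line_for(1))
round(group_lines, 3)
```

Setting is_male to 0 or 1 collapses the single long equation into the two separate lines, and the rounded result reproduces the intercepts and slopes summarized in Table 6.4.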
If you’re still a little unsure about using indicator functions and using categorical explanatory variables in a regression model, we highly suggest you re-read Subsection 5.2.2. This involves only a single categorical explanatory variable and thus is much simpler. Before we end this section, we explain why we refer to this type of model as an “interaction model.” The \\(b_{\\text{age,male}}\\) term in the equation for the fitted value \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) is what’s known in statistical modeling as an “interaction effect.” The interaction term corresponds to the age:gendermale = 0.014 in the final row of the regression table in Table 6.3. We say there is an interaction effect if the associated effect of one variable depends on the value of another variable. That is to say, the two variables are “interacting” with each other. Here, the associated effect of the variable age depends on the value of the other variable gender. The difference in slopes for age of +0.014 of male instructors relative to female instructors shows this. Another way of thinking about interaction effects on teaching scores is as follows. For a given instructor at UT Austin, there might be an associated effect of their age by itself, there might be an associated effect of their gender by itself, but when age and gender are considered together there might be an additional effect above and beyond the two individual effects. 6.1.3 Parallel slopes model When creating regression models with one numerical and one categorical explanatory variable, we are not just limited to interaction models as we just saw. Another type of model we can use is known as a parallel slopes model. Unlike interaction models where the regression lines can have different intercepts and different slopes, parallel slopes models still allow for different intercepts but force all lines to have the same slope. The resulting regression lines are thus parallel. 
Let’s visualize the best-fitting parallel slopes model to evals_ch6. Unfortunately, the geom_smooth() function in the ggplot2 package does not have a convenient way to plot parallel slopes models. Evgeni Chasnovski thus created a special purpose function called geom_parallel_slopes() that is included in the moderndive package. You won’t find geom_parallel_slopes() in the ggplot2 package, but rather the moderndive package. Thus, if you want to be able to use it, you will need to load both the ggplot2 and moderndive packages. Using this function, let’s now plot the parallel slopes model for teaching score. Notice how the code is identical to the code that produced the visualization of the interaction model in Figure 6.1, but now the geom_smooth(method = &quot;lm&quot;, se = FALSE) layer is replaced with geom_parallel_slopes(se = FALSE). ggplot(evals_ch6, aes(x = age, y = score, color = gender)) + geom_point() + labs(x = &quot;Age&quot;, y = &quot;Teaching Score&quot;, color = &quot;Gender&quot;) + geom_parallel_slopes(se = FALSE) FIGURE 6.2: Parallel slopes model of score with age and gender. Observe in Figure 6.2 that we now have parallel lines corresponding to the female and male instructors, respectively: here they have the same negative slope. This is telling us that instructors who are older will tend to receive lower teaching scores than instructors who are younger. Furthermore, since the lines are parallel, the associated penalty for being older is assumed to be the same for both female and male instructors. However, observe also in Figure 6.2 that these two lines have different intercepts as evidenced by the fact that the blue line corresponding to the male instructors is higher than the red line corresponding to the female instructors. This is telling us that irrespective of age, female instructors tended to receive lower teaching scores than male instructors. 
In order to obtain the precise numerical values of the two intercepts and the single common slope, we once again “fit” the model using the lm() “linear model” function and then apply the get_regression_table() function. However, unlike the interaction model which had a model formula of the form y ~ x1 * x2, our model formula is now of the form y ~ x1 + x2. In other words, our two explanatory variables x1 and x2 are separated by a + sign: # Fit regression model: score_model_parallel_slopes &lt;- lm(score ~ age + gender, data = evals_ch6) # Get regression table: get_regression_table(score_model_parallel_slopes) TABLE 6.5: Regression table for parallel slopes model term estimate std_error statistic p_value lower_ci upper_ci intercept 4.484 0.125 35.79 0.000 4.238 4.730 age -0.009 0.003 -3.28 0.001 -0.014 -0.003 gendermale 0.191 0.052 3.63 0.000 0.087 0.294 Similarly to the regression table for the interaction model from Table 6.3, we have an intercept term corresponding to the intercept for the “baseline for comparison” female instructor group and a gendermale term corresponding to the offset in intercept for the male instructors relative to female instructors. In other words, in Figure 6.2 the red regression line corresponding to the female instructors has an intercept of 4.484 while the blue regression line corresponding to the male instructors has an intercept of 4.484 + 0.191 = 4.675. Once again, since there aren’t any instructors of age 0, the intercepts only have a mathematical interpretation but no practical one. Unlike in Table 6.3, however, we now only have a single slope for age of -0.009. This is because the model dictates that both the female and male instructors have a common slope for age. This is telling us that an instructor who is a year older than another instructor received a teaching score that is on average 0.009 units lower. This penalty for being of advanced age applies equally to both female and male instructors. 
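One way to see what the two formula forms ask lm() to estimate is to look at the columns that base R builds from each formula. The sketch below uses the base R function model.matrix() on a tiny made-up data frame called toy (four hypothetical rows, not the evals_ch6 data), so the numbers themselves carry no meaning.

```r
# Hypothetical four-row data frame for illustration only (not evals_ch6)
toy <- data.frame(
  age = c(36, 59, 51, 40),
  gender = factor(c("female", "male", "male", "female"))
)

# Interaction form: ~ x1 * x2 expands to x1 + x2 + x1:x2
X_interaction <- model.matrix(~ age * gender, data = toy)

# Parallel slopes form: ~ x1 + x2 has no x1:x2 column
X_parallel <- model.matrix(~ age + gender, data = toy)

colnames(X_interaction)
colnames(X_parallel)

# gendermale is the 0/1 indicator; age:gendermale is age times that indicator
X_interaction[, c("gendermale", "age:gendermale")]
```

The interaction form has one extra column, age:gendermale, which is what allows the male slope to differ from the female slope. Without that column, only the intercept can shift between groups, which is why the lines come out parallel.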
Let’s summarize these values in Table 6.6, noting the different intercepts but common slopes: TABLE 6.6: Comparison of intercepts and slope for parallel slopes model Gender Intercept Slope for age Female instructors 4.484 -0.009 Male instructors 4.675 -0.009 Let’s now write the equation for our regression lines, which we can use to compute our fitted values \\(\\widehat{y} = \\widehat{\\text{score}}\\). \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{age}} \\cdot \\text{age} + b_{\\text{male}} \\cdot \\mathbb{1}_{\\text{is male}}(x)\\\\ &amp;= 4.484 -0.009 \\cdot \\text{age} + 0.191 \\cdot \\mathbb{1}_{\\text{is male}}(x) \\end{aligned} \\] Let’s put this all together and compute the fitted value \\(\\widehat{y} = \\widehat{\\text{score}}\\) for female instructors. Since for female instructors the indicator function \\(\\mathbb{1}_{\\text{is male}}(x)\\) = 0, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.484 -0.009 \\cdot \\text{age} + 0.191 \\cdot 0\\\\ &amp;= 4.484 -0.009 \\cdot \\text{age} \\end{aligned} \\] which is the equation of the red regression line in Figure 6.2 corresponding to the female instructors. Correspondingly, since for male instructors the indicator function \\(\\mathbb{1}_{\\text{is male}}(x)\\) = 1, the previous equation becomes \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{score}} &amp;= 4.484 -0.009 \\cdot \\text{age} + 0.191 \\cdot 1\\\\ &amp;= (4.484 + 0.191) - 0.009 \\cdot \\text{age}\\\\ &amp;= 4.675 -0.009 \\cdot \\text{age} \\end{aligned} \\] which is the equation of the blue regression line in Figure 6.2 corresponding to the male instructors. Great! We’ve considered both an interaction model and a parallel slopes model for our data. Let’s compare the visualizations for both models side-by-side in Figure 6.3. FIGURE 6.3: Comparison of interaction and parallel slopes models. 
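As a quick numerical check of what “parallel” means here, the sketch below evaluates the parallel slopes equation in base R. The coefficients are copied by hand from Table 6.5, the function name fitted_parallel() is made up for this illustration, and the three ages are arbitrary values chosen only to show the pattern.

```r
# Estimates copied by hand from Table 6.5 (rounded to three decimals)
b0 <- 4.484      # intercept for female instructors
b_age <- -0.009  # common slope for age
b_male <- 0.191  # offset in intercept for male instructors

# Fitted score under the parallel slopes model; is_male is 0 or 1
fitted_parallel <- function(age, is_male) {
  b0 + b_age * age + b_male * is_male
}

ages <- c(30, 45, 60)  # hypothetical ages chosen for illustration

# Vertical gap between the male and female lines at each age
gap <- fitted_parallel(ages, 1) - fitted_parallel(ages, 0)
round(gap, 3)
```

The gap between the two lines equals the gendermale offset of 0.191 at every age, which is exactly what forcing a common slope implies. Under the interaction model, by contrast, the gap would change with age.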
At this point, you might be asking yourself: “Why would we ever use a parallel slopes model?”. Looking at the left-hand plot in Figure 6.3, the two lines definitely do not appear to be parallel, so why would we force them to be parallel? For this data, we agree! It can easily be argued that the interaction model on the left is more appropriate. However, in the upcoming Subsection 6.3.1 on model selection, we’ll present an example where it can be argued that the case for a parallel slopes model might be stronger. 6.1.4 Observed/fitted values and residuals For brevity’s sake, in this section we’ll only compute the observed values, fitted values, and residuals for the interaction model which we saved in score_model_interaction. You’ll have an opportunity to study the corresponding values for the parallel slopes model in the upcoming Learning check. Say, you have an instructor who identifies as female and is 36 years old. What fitted value \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) would our model yield? Say, you have another instructor who identifies as male and is 59 years old. What would their fitted value \\(\\widehat{y}\\) be? We answer this question visually first for the female instructor by finding the intersection of the red regression line and the vertical line at \\(x\\) = age = 36. We mark this value with a large red dot in Figure 6.4. Similarly, we can identify the fitted value \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) for the male instructor by finding the intersection of the blue regression line and the vertical line at \\(x\\) = age = 59. We mark this value with a large blue dot in Figure 6.4. FIGURE 6.4: Fitted values for two new professors. What are these two values of \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) precisely? 
We can use the equations of the two regression lines we computed in Subsection 6.1.2, which in turn were based on values from the regression table in Table 6.3: For all female instructors: \\(\\widehat{y} = \\widehat{\\text{score}} = 4.883 - 0.018 \\cdot \\text{age}\\) For all male instructors: \\(\\widehat{y} = \\widehat{\\text{score}} = 4.437 - 0.004 \\cdot \\text{age}\\) So our fitted values would be: \\(4.883 - 0.018 \\cdot 36 \\approx 4.24\\) and \\(4.437 - 0.004 \\cdot 59 \\approx 4.20\\), respectively. Note that these calculations use the rounded coefficients from Table 6.3; R works with the unrounded coefficients, which is why the first fitted value is reported as 4.25 in Table 6.7. Now what if we want the fitted values not just for these two instructors, but for the instructors of all 463 courses included in the evals_ch6 data frame? Doing this by hand would be long and tedious! This is where the get_regression_points() function from the moderndive package can help: it will quickly automate the above calculations for all 463 courses. We present a preview of just the first 10 rows out of 463 in Table 6.7. regression_points &lt;- get_regression_points(score_model_interaction) regression_points TABLE 6.7: Regression points (First 10 out of 463 courses) ID score age gender score_hat residual 1 4.7 36 female 4.25 0.448 2 4.1 36 female 4.25 -0.152 3 3.9 36 female 4.25 -0.352 4 4.8 36 female 4.25 0.548 5 4.6 59 male 4.20 0.399 6 4.3 59 male 4.20 0.099 7 2.8 59 male 4.20 -1.401 8 4.1 51 male 4.23 -0.133 9 3.4 51 male 4.23 -0.833 10 4.5 40 female 4.18 0.318 It turns out that the female instructor of age 36 taught the first four courses, while the male instructor of age 59 taught the next three. The resulting \\(\\widehat{y}\\) = \\(\\widehat{\\text{score}}\\) fitted values are in the score_hat column. Furthermore, the get_regression_points() function also returns the residuals \\(y-\\widehat{y}\\). Notice, for example, the first and fourth courses the female instructor of age 36 taught had positive residuals, indicating that the actual teaching scores they received from students were greater than their fitted score of 4.25. 
On the other hand, the second and third courses this instructor taught had negative residuals, indicating that the actual teaching scores they received from students were less than 4.25. Learning check (LC6.1) Compute the observed values, fitted values, and residuals not for the interaction model as we just did, but rather for the parallel slopes model we saved in score_model_parallel_slopes. 6.2 Two numerical explanatory variables Let’s now switch gears and consider multiple regression models where instead of one numerical and one categorical explanatory variable, we now have two numerical explanatory variables. The dataset we’ll use is from An Introduction to Statistical Learning with Applications in R (ISLR), an intermediate-level textbook on statistical and machine learning (James et al. 2017). Its accompanying ISLR R package contains the datasets to which the authors apply various machine learning methods. One frequently used dataset in this book is the Credit dataset, where the outcome variable of interest is the credit card debt of 400 individuals. Other variables like income, credit limit, credit rating, and age are included as well. Note that the Credit data is not based on real individuals’ financial information, but rather is a simulated dataset used for educational purposes. In this section, we’ll fit a regression model where we have: A numerical outcome variable \\(y\\), the cardholder’s credit card debt. Two explanatory variables: One numerical explanatory variable \\(x_1\\), the cardholder’s credit limit. Another numerical explanatory variable \\(x_2\\), the cardholder’s income (in thousands of dollars). 6.2.1 Exploratory data analysis Let’s load the Credit dataset. To keep things simple let’s select() the subset of the variables we’ll consider in this chapter, and save this data in the new data frame credit_ch6. Notice our slightly different use of the select() verb here than we introduced in Subsection 3.8.1. 
For example, we’ll select the Balance variable from Credit but then save it with a new variable name debt. We do this because here the term “debt” is easier to interpret than “balance.” library(ISLR) credit_ch6 &lt;- Credit %&gt;% as_tibble() %&gt;% select(ID, debt = Balance, credit_limit = Limit, income = Income, credit_rating = Rating, age = Age) You can observe the effect of our use of select() in the first common step of an exploratory data analysis: looking at the raw values either in RStudio’s spreadsheet viewer or by using glimpse(). glimpse(credit_ch6) Observations: 400 Variables: 6 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … $ debt &lt;int&gt; 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 140… $ credit_limit &lt;int&gt; 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6… $ income &lt;dbl&gt; 14.9, 106.0, 104.6, 148.9, 55.9, 80.2, 21.0, 71.4, 15.1… $ credit_rating &lt;int&gt; 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, … $ age &lt;int&gt; 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49,… Furthermore, let’s look at a random sample of five out of the 400 credit card holders in Table 6.8. Once again, note that due to the random nature of the sampling, you will likely end up with a different subset of five rows. credit_ch6 %&gt;% sample_n(size = 5) TABLE 6.8: Random sample of 5 credit card holders ID debt credit_limit income credit_rating age 272 436 4866 45.0 347 30 239 52 2910 26.5 236 58 87 815 6340 55.4 448 33 108 0 3189 39.1 263 72 149 0 2420 15.2 192 69 Now that we’ve looked at the raw values in our credit_ch6 data frame and got a sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. 
Let’s use the skim() function from the skimr package, being sure to only select() the columns of interest for our model: credit_ch6 %&gt;% select(debt, credit_limit, income) %&gt;% skim() Skim summary statistics n obs: 400 n variables: 3 ── Variable type:integer variable missing complete n mean sd p0 p25 p50 p75 p100 credit_limit 0 400 400 4735.6 2308.2 855 3088 4622.5 5872.75 13913 debt 0 400 400 520.01 459.76 0 68.75 459.5 863 1999 ── Variable type:numeric variable missing complete n mean sd p0 p25 p50 p75 p100 income 0 400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63 Observe the summary statistics for the outcome variable debt: the mean and median credit card debt are $520.01 and $459.50, respectively, and that 25% of card holders had debts of $68.75 or less. Let’s now look at one of the explanatory variables credit_limit: the mean and median credit card limit are $4735.6 and $4622.50, respectively, while 75% of card holders had incomes of $57,470 or less. Since our outcome variable debt and the explanatory variables credit_limit and income are numerical, we can compute the correlation coefficient between the different possible pairs of these variables. First, we can run the get_correlation() command as seen in Subsection 5.1.1 twice, once for each explanatory variable: credit_ch6 %&gt;% get_correlation(debt ~ credit_limit) credit_ch6 %&gt;% get_correlation(debt ~ income) Or we can simultaneously compute them by returning a correlation matrix which we display in Table 6.9. We can see the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination. 
credit_ch6 %&gt;% select(debt, credit_limit, income) %&gt;% cor() TABLE 6.9: Correlation coefficients between credit card debt, credit limit, and income debt credit_limit income debt 1.000 0.862 0.464 credit_limit 0.862 1.000 0.792 income 0.464 0.792 1.000 For example, the correlation coefficient of: debt with itself is 1 as we would expect based on the definition of the correlation coefficient. debt with credit_limit is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card debts. debt with income is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between debt and credit_limit. As an added bonus, we can read off the correlation coefficient between the two explanatory variables of credit_limit and income as 0.792. We say there is a high degree of collinearity between the credit_limit and income explanatory variables. Collinearity (or multicollinearity) is a phenomenon where one explanatory variable in a multiple regression model is highly correlated with another. So in our case since credit_limit and income are highly correlated, if we knew someone’s credit_limit, we could make pretty good guesses about their income as well. Thus, these two variables provide somewhat redundant information. However, we’ll leave discussion on how to work with collinear explanatory variables to a more intermediate-level book on regression modeling. Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots in Figure 6.5. 
ggplot(credit_ch6, aes(x = credit_limit, y = debt)) + geom_point() + labs(x = &quot;Credit limit (in $)&quot;, y = &quot;Credit card debt (in $)&quot;, title = &quot;Debt and credit limit&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) ggplot(credit_ch6, aes(x = income, y = debt)) + geom_point() + labs(x = &quot;Income (in $1000)&quot;, y = &quot;Credit card debt (in $)&quot;, title = &quot;Debt and income&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 6.5: Relationship between credit card debt and credit limit/income. Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases so does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier. In the case of income, the positive relationship doesn’t appear as strong, given the weakly positive correlation coefficient of 0.464. However, the two plots in Figure 6.5 only focus on the relationship of the outcome variable with each of the two explanatory variables separately. To visualize the joint relationship of all three variables simultaneously, we need a 3-dimensional (3D) scatterplot as seen in Figure 6.6. Each of the 400 observations in the credit_ch6 data frame is marked with a blue point where: The numerical outcome variable \\(y\\) debt is on the vertical axis. The two numerical explanatory variables, \\(x_1\\) credit_limit and \\(x_2\\) income, are on the two axes that form the bottom plane. FIGURE 6.6: 3D scatterplot and regression plane. Furthermore, we also include the regression plane. Recall from Subsection 5.3.2 that regression lines are “best-fitting” in that of all possible lines we can draw through a cloud of points, the regression line minimizes the sum of squared residuals. This concept also extends to models with two numerical explanatory variables. 
The difference is that instead of a “best-fitting” line, we now have a “best-fitting” plane that similarly minimizes the sum of squared residuals. Head to this website to open an interactive version of this plot in your browser. Learning check (LC6.2) Conduct a new exploratory data analysis with the same outcome variable \\(y\\) debt but with credit_rating and age as the new explanatory variables \\(x_1\\) and \\(x_2\\). What can you say about the relationship between a credit card holder’s debt and their credit rating and age? 6.2.2 Regression plane Let’s now fit a regression model and get the regression table corresponding to the regression plane in Figure 6.6. To keep things brief in this subsection, we won’t consider an interaction model for the two numerical explanatory variables income and credit_limit like we did in Subsection 6.1.2 using the model formula score ~ age * gender. Rather we’ll only consider a model fit with a formula of the form y ~ x1 + x2. Confusingly, however, since we now have a regression plane instead of multiple lines, the label “parallel slopes” doesn’t apply when you have two numerical explanatory variables. Just as we have done multiple times throughout Chapter 5 and this chapter, we obtain the regression table for this model using our two-step process; the result is in Table 6.10. # Fit regression model: debt_model &lt;- lm(debt ~ credit_limit + income, data = credit_ch6) # Get regression table: get_regression_table(debt_model) TABLE 6.10: Multiple regression table term estimate std_error statistic p_value lower_ci upper_ci intercept -385.179 19.465 -19.8 0 -423.446 -346.912 credit_limit 0.264 0.006 45.0 0 0.253 0.276 income -7.663 0.385 -19.9 0 -8.420 -6.906 We first “fit” the linear regression model using the lm(y ~ x1 + x2, data) function and save it in debt_model. We get the regression table by applying the get_regression_table() function from the moderndive package to debt_model. Let’s interpret the three values in the estimate column. 
First, the intercept value is -$385.179. This intercept represents the credit card debt for an individual who has credit_limit of $0 and income of $0. In our data, however, the intercept has no practical interpretation since no individuals had credit_limit or income values of $0. Rather, the intercept is used to situate the regression plane in 3D space. Second, the credit_limit value is $0.264. Taking into account all the other explanatory variables in our model, for every increase of one dollar in credit_limit, there is an associated increase of on average $0.26 in credit card debt. Just as we did in Subsection 5.1.2, we are cautious not to imply causality as we saw in Subsection 5.3.1 that “correlation is not necessarily causation.” We do this by merely stating there was an associated increase. Furthermore, we preface our interpretation with the statement, “taking into account all the other explanatory variables in our model.” Here, by all other explanatory variables we mean income. We do this to emphasize that we are now jointly interpreting the associated effect of multiple explanatory variables in the same model at the same time. Third, the income value is -$7.66. Taking into account all other explanatory variables in our model, for every increase of one unit of income ($1000 in actual income), there is an associated decrease of, on average, $7.66 in credit card debt. 
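The phrase “taking into account all the other explanatory variables” can be made concrete with a small base R sketch. It assumes the three estimates from Table 6.10 typed in by hand (rounded), a made-up helper fitted_debt(), and a purely hypothetical cardholder with a $5000 credit limit and $50,000 income; none of these names come from the moderndive package.

```r
# Estimates copied by hand from Table 6.10 (income is in thousands of dollars)
b0 <- -385.179
b_limit <- 0.264
b_income <- -7.663

# Fitted credit card debt for a given credit limit and income
fitted_debt <- function(credit_limit, income) {
  b0 + b_limit * credit_limit + b_income * income
}

# Hypothetical cardholder: $5000 credit limit, $50,000 income (income = 50)
baseline <- fitted_debt(5000, 50)

# Raise one explanatory variable by one unit while holding the other fixed
change_limit <- fitted_debt(5001, 50) - baseline   # one more dollar of limit
change_income <- fitted_debt(5000, 51) - baseline  # one more $1000 of income

round(c(baseline = baseline, change_limit = change_limit,
        change_income = change_income), 3)
```

Whatever starting values are chosen for the hypothetical cardholder, the two changes equal the credit_limit and income estimates. Each slope describes the associated change in fitted debt when the other explanatory variable stays fixed.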
Putting these results together, the equation of the regression plane that gives us fitted values \\(\\widehat{y}\\) = \\(\\widehat{\\text{debt}}\\) is: \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x_1 + b_2 \\cdot x_2\\\\ \\widehat{\\text{debt}} &amp;= b_0 + b_{\\text{limit}} \\cdot \\text{limit} + b_{\\text{income}} \\cdot \\text{income}\\\\ &amp;= -385.179 + 0.264 \\cdot\\text{limit} - 7.663 \\cdot\\text{income} \\end{aligned} \\] Recall however in the right-hand plot of Figure 6.5 that when plotting the relationship between debt and income in isolation, there appeared to be a positive relationship. In the multiple regression we just fit, however, when jointly modeling the relationship between debt, credit_limit, and income, there appears to be a negative relationship between debt and income as evidenced by the negative slope for income of -$7.663. What explains these contradictory results? A phenomenon known as Simpson’s Paradox, whereby overall trends that exist in aggregate either disappear or reverse when the data are broken down into groups. In Subsection 6.3.3 we elaborate on this idea by looking at the relationship between credit_limit and credit card debt, but split along different income brackets. Learning check (LC6.3) Fit a new multiple regression using lm(debt ~ credit_rating + age, data = credit_ch6) where credit_rating and age are the new numerical explanatory variables \\(x_1\\) and \\(x_2\\). Get information about the “best-fitting” regression plane from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your previous exploratory data analysis? 6.2.3 Observed/fitted values and residuals Let’s also compute all fitted values and residuals for our regression model using the get_regression_points() function and present only the first 10 rows of output in Table 6.11. 
Remember that the coordinates of each of the blue points in our 3D scatterplot in Figure 6.6 can be found in the income, credit_limit, and debt columns. The fitted values on the regression plane are found in the debt_hat column and are computed using our equation for the regression plane in the previous section: \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{debt}} &amp;= -385.179 + 0.264 \\cdot \\text{limit} - 7.663 \\cdot \\text{income} \\end{aligned} \\] get_regression_points(debt_model) TABLE 6.11: Regression points (First 10 credit card holders out of 400) ID debt credit_limit income debt_hat residual 1 333 3606 14.9 454 -120.8 2 903 6645 106.0 559 344.3 3 580 7075 104.6 683 -103.4 4 964 9504 148.9 986 -21.7 5 331 4897 55.9 481 -150.0 6 1151 8047 80.2 1127 23.6 7 203 3388 21.0 349 -146.4 8 872 7114 71.4 948 -76.0 9 279 3300 15.1 371 -92.2 10 1350 6819 71.1 873 477.3 6.3 Related topics 6.3.1 Model selection When should we use an interaction model versus a parallel slopes model? Recall that in Subsections 6.1.2 and 6.1.3 we fit both interaction and parallel slopes models for the outcome variable \\(y\\) (teaching score) using a numerical explanatory variable \\(x_1\\) (age) and a categorical explanatory variable \\(x_2\\) (gender recorded as a binary variable). We compared these models in Figure 6.3, which we display again now. FIGURE 6.7: Previously seen comparison of interaction and parallel slopes models. Many of you might have asked yourselves: “Why would I force the lines to have parallel slopes (as seen in the right-hand plot) when they clearly have different slopes (as seen in the left-hand plot)?” 
The answer lies in a philosophical principle known as “Occam’s Razor.” It states that, “all other things being equal, simpler solutions are more likely to be correct than complex ones.” When viewed in a modeling framework, Occam’s Razor can be restated as, “all other things being equal, simpler models are to be preferred over complex ones.” In other words, we should only favor the more complex model if the additional complexity is warranted. Let’s revisit the equations for the regression line for both the interaction and parallel slopes model: \\[ \\begin{aligned} \\text{Interaction} &amp;: \\widehat{y} = \\widehat{\\text{score}} = b_0 + b_{\\text{age}} \\cdot \\text{age} + b_{\\text{male}} \\cdot \\mathbb{1}_{\\text{is male}}(x) + \\\\ &amp; \\qquad b_{\\text{age,male}} \\cdot \\text{age} \\cdot \\mathbb{1}_{\\text{is male}}\\\\ \\text{Parallel slopes} &amp;: \\widehat{y} = \\widehat{\\text{score}} = b_0 + b_{\\text{age}} \\cdot \\text{age} + b_{\\text{male}} \\cdot \\mathbb{1}_{\\text{is male}}(x) \\end{aligned} \\] The interaction model is “more complex” in that there is an additional \\(b_{\\text{age,male}} \\cdot \\text{age} \\cdot \\mathbb{1}_{\\text{is male}}\\) interaction term in the equation not present for the parallel slopes model. Or viewed alternatively, the regression table for the interaction model in Table 6.3 has four rows, whereas the regression table for the parallel slopes model in Table 6.5 has three rows. The question becomes: “Is this additional complexity warranted?”. In this case, it can be argued that this additional complexity is warranted, as evidenced by the clear x-shaped pattern of the two regression lines in the left-hand plot of Figure 6.7. However, let’s consider an example where the additional complexity might not be warranted. Let’s consider the MA_schools data included in the moderndive package which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education. 
For more details, read the help file for this data by running ?MA_schools in the console. Let’s model the numerical outcome variable \\(y\\), average SAT math score for a given high school, as a function of two explanatory variables: A numerical explanatory variable \\(x_1\\), the percentage of that high school’s student body that are economically disadvantaged and A categorical explanatory variable \\(x_2\\), the school size as measured by enrollment: small (13-341 students), medium (342-541 students), and large (542-4264 students). Let’s create visualizations of both the interaction and parallel slopes model once again and display the output in Figure 6.8. Recall from Subsection 6.1.3 that the geom_parallel_slopes() function is a special purpose function included in the moderndive package, since the geom_smooth() method in the ggplot2 package does not have a convenient way to plot parallel slopes models. # Interaction model ggplot(MA_schools, aes(x = perc_disadvan, y = average_sat_math, color = size)) + geom_point(alpha = 0.25) + geom_smooth(method = &quot;lm&quot;, se = FALSE) + labs(x = &quot;Percent economically disadvantaged&quot;, y = &quot;Math SAT Score&quot;, color = &quot;School size&quot;, title = &quot;Interaction model&quot;) # Parallel slopes model ggplot(MA_schools, aes(x = perc_disadvan, y = average_sat_math, color = size)) + geom_point(alpha = 0.25) + geom_parallel_slopes(se = FALSE) + labs(x = &quot;Percent economically disadvantaged&quot;, y = &quot;Math SAT Score&quot;, color = &quot;School size&quot;, title = &quot;Parallel slopes model&quot;) FIGURE 6.8: Comparison of interaction and parallel slopes models for Massachusetts schools. Look closely at the left-hand plot of Figure 6.8 corresponding to an interaction model. While the slopes are indeed different, they do not differ by much and are nearly identical. Now compare the left-hand plot with the right-hand plot corresponding to a parallel slopes model. 
The two models don’t appear all that different. So in this case, it can be argued that the additional complexity of the interaction model is not warranted. Thus following Occam’s Razor, we should prefer the “simpler” parallel slopes model. Let’s explicitly define what “simpler” means in this case. Let’s compare the regression tables for the interaction and parallel slopes models in Tables 6.12 and 6.13. model_2_interaction &lt;- lm(average_sat_math ~ perc_disadvan * size, data = MA_schools) get_regression_table(model_2_interaction) TABLE 6.12: Interaction model regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 594.327 13.288 44.726 0.000 568.186 620.469 perc_disadvan -2.932 0.294 -9.961 0.000 -3.511 -2.353 sizemedium -17.764 15.827 -1.122 0.263 -48.899 13.371 sizelarge -13.293 13.813 -0.962 0.337 -40.466 13.880 perc_disadvan:sizemedium 0.146 0.371 0.393 0.694 -0.585 0.877 perc_disadvan:sizelarge 0.189 0.323 0.586 0.559 -0.446 0.824 model_2_parallel_slopes &lt;- lm(average_sat_math ~ perc_disadvan + size, data = MA_schools) get_regression_table(model_2_parallel_slopes) TABLE 6.13: Parallel slopes regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 588.19 7.607 77.325 0.000 573.23 603.15 perc_disadvan -2.78 0.106 -26.120 0.000 -2.99 -2.57 sizemedium -11.91 7.535 -1.581 0.115 -26.74 2.91 sizelarge -6.36 6.923 -0.919 0.359 -19.98 7.26 Observe how the regression table for the interaction model has 2 more rows (6 versus 4). This reflects the additional “complexity” of the interaction model over the parallel slopes model. Furthermore, note in Table 6.12 how the offsets for the slopes perc_disadvan:sizemedium being 0.146 and perc_disadvan:sizelarge being 0.189 are small relative to the slope for the baseline group of small schools of \\(-2.932\\). 
In other words, all three slopes are similarly negative: \\(-2.932\\) for small schools, \\(-2.786\\) \\((=-2.932 + 0.146)\\) for medium schools, and \\(-2.743\\) \\((=-2.932 + 0.189)\\) for large schools. These results suggest that irrespective of school size, the relationship between average math SAT scores and the percent of the student body that is economically disadvantaged is similar and, alas, quite negative. What you have just performed is a rudimentary model selection: choosing which model fits the data best among a set of candidate models. While the model selection approach we just took was visual in nature and hence somewhat qualitative, more statistically rigorous methods for model selection exist in the fields of multiple regression and statistical/machine learning. 6.3.2 Correlation coefficient Recall from Table 6.9 that the correlation coefficient between income in thousands of dollars and credit card debt was 0.464. What if instead we looked at the correlation coefficient between income and credit card debt, but where income was in dollars and not thousands of dollars? This can be done by multiplying income by 1000. credit_ch6 %&gt;% select(debt, income) %&gt;% mutate(income = income * 1000) %&gt;% cor() TABLE 6.14: Correlation between income (in dollars) and credit card debt debt income debt 1.000 0.464 income 0.464 1.000 We see it is the same! We say that the correlation coefficient is invariant to linear transformations with a positive slope. The correlation between \\(x\\) and \\(y\\) will be the same as the correlation between \\(a\\cdot x + b\\) and \\(y\\) for any positive numerical value \\(a\\) and any numerical value \\(b\\). (If \\(a\\) is negative, the correlation keeps the same magnitude but its sign flips.) 6.3.3 Simpson’s Paradox Recall in Section 6.2, we saw the two seemingly contradictory results when studying the relationship between credit card debt and income. On the one hand, the right-hand plot of Figure 6.5 suggested that the relationship between credit card debt and income was positive. We re-display this in Figure 6.9. 
FIGURE 6.9: Relationship between credit card debt and income. On the other hand, the multiple regression results in Table 6.10 suggested that the relationship between debt and income was negative. We re-display this information in Table 6.15. TABLE 6.15: Multiple regression results term estimate std_error statistic p_value lower_ci upper_ci intercept -385.179 19.465 -19.8 0 -423.446 -346.912 credit_limit 0.264 0.006 45.0 0 0.253 0.276 income -7.663 0.385 -19.9 0 -8.420 -6.906 Observe how the slope for income is \\(-7.663\\) and, most importantly for now, it is negative. This contradicts our observation in Figure 6.9 that the relationship is positive. How can this be? Recall the interpretation of the slope for income in the context of a multiple regression model: taking into account all the other explanatory variables in our model, for every increase of one unit in income (i.e., $1000), there is an associated decrease of on average $7.663 in debt. In other words, while in isolation, the relationship between debt and income may be positive, when taking into account credit_limit as well, this relationship becomes negative. These seemingly paradoxical results are due to a phenomenon aptly named Simpson’s Paradox. Simpson’s Paradox occurs when trends that exist for the data in aggregate either disappear or reverse when the data are broken down into groups. Let’s show how Simpson’s Paradox manifests itself in the credit_ch6 data. Let’s first visualize the distribution of the numerical explanatory variable credit_limit with a histogram in Figure 6.10. FIGURE 6.10: Histogram of credit limits and brackets. The vertical dashed lines are the quartiles that cut up the variable credit_limit into four equally sized groups. Let’s think of these quartiles as converting our numerical variable credit_limit into a categorical variable “credit_limit bracket” with four levels. This means that 25% of credit limits were between $0 and $3088. 
Let’s assign these 100 people to the “low” credit_limit bracket. 25% of credit limits were between $3088 and $4622. Let’s assign these 100 people to the “medium-low” credit_limit bracket. 25% of credit limits were between $4622 and $5873. Let’s assign these 100 people to the “medium-high” credit_limit bracket. 25% of credit limits were over $5873. Let’s assign these 100 people to the “high” credit_limit bracket. Now in Figure 6.11 let’s re-display two versions of the scatterplot of debt and income from Figure 6.9, but with a slight twist: The left-hand plot shows the regular scatterplot and the single regression line, just as you saw in Figure 6.9. The right-hand plot shows the colored scatterplot, where the color aesthetic is mapped to “credit_limit bracket.” Furthermore, there are now four separate regression lines. In other words, the locations of the 400 points are the same in both scatterplots, but the right-hand plot shows an additional variable of information: credit_limit bracket. FIGURE 6.11: Relationship between credit card debt and income by credit limit bracket. The left-hand plot of Figure 6.11 focuses on the relationship between debt and income in aggregate. It suggests that overall there exists a positive relationship between debt and income. However, the right-hand plot of Figure 6.11 focuses on the relationship between debt and income broken down by credit_limit bracket. In other words, we focus on four separate relationships between debt and income: one for the “low” credit_limit bracket, one for the “medium-low” credit_limit bracket, and so on. Observe in the right-hand plot that the relationship between debt and income is clearly negative for the “medium-low” and “medium-high” credit_limit brackets, while the relationship is somewhat flat for the “low” credit_limit bracket. The only credit_limit bracket where the relationship remains positive is the “high” credit_limit bracket. 
However, this relationship is less positive than the relationship in aggregate, since the slope is shallower than the slope of the regression line in the left-hand plot. In this example of Simpson’s Paradox, the credit_limit is a confounding variable of the relationship between credit card debt and income as we defined in Subsection 5.3.1. Thus, credit_limit needs to be accounted for in any appropriate model for the relationship between debt and income. 6.4 Conclusion 6.4.1 Additional resources An R script file of all R code used in this chapter is available here. 6.4.2 What’s to come? Congratulations! We’ve completed the “Data Modeling with moderndive” portion of this book. We’re ready to proceed to Part III of this book: “Statistical Inference with infer.” Statistical inference is the science of inferring about some unknown quantity using sampling. One of the most well-known examples of sampling is polling. Because asking an entire population about their opinions would be a long and arduous task, pollsters often take a smaller sample that is hopefully representative of the population. Based on the results of this sample, pollsters hope to make claims about the entire population. Once we’ve covered Chapter 7 on sampling, Chapter 8 on confidence intervals, and Chapter 9 on hypothesis testing, we’ll revisit the regression models we studied in Chapters 5 and 6 in Chapter 10 on inference for regression. So far, we’ve only studied the estimate column of all our regression tables. The next four chapters focus on what the remaining columns mean: the standard error (std_error), the test statistic, the p_value, and the lower and upper bounds of confidence intervals (lower_ci and upper_ci). Furthermore in Chapter 10, we’ll revisit the concept of residuals \\(y - \\widehat{y}\\) and discuss their importance when interpreting the results of a regression model. 
We’ll perform what is known as a residual analysis of the residual variable of all get_regression_points() outputs. Residual analyses allow you to verify what are known as the conditions for inference for regression. On to Chapter 7 on sampling in Part III as shown in Figure 6.12! FIGURE 6.12: ModernDive flowchart - on to Part III! References "],
+["7-sampling.html", "Chapter 7 Sampling 7.1 Sampling bowl activity 7.2 Virtual sampling 7.3 Sampling framework 7.4 Case study: Polls 7.5 Conclusion", " Chapter 7 Sampling In this chapter, we kick off the third portion of this book on statistical inference by learning about sampling. The concepts behind sampling form the basis of confidence intervals and hypothesis testing, which we’ll cover in Chapters 8 and 9. We will see that the tools that you learned in the data science portion of this book, in particular data visualization and data wrangling, will also play an important role in the development of your understanding. As mentioned before, the concepts throughout this text all build into a culmination allowing you to “tell your story with data.” Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) 7.1 Sampling bowl activity Let’s start with a hands-on activity. 7.1.1 What proportion of this bowl’s balls are red? Take a look at the bowl in Figure 7.1. It has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand, as there does not seem to be any coherent pattern to the spatial distribution of the red and white balls. Let’s now ask ourselves, what proportion of this bowl’s balls are red? FIGURE 7.1: A bowl with red and white balls. 
One way to answer this question would be to perform an exhaustive count: remove each ball individually, count the number of red balls and the number of white balls, and divide the number of red balls by the total number of balls. However, this would be a long and tedious process. 7.1.2 Using the shovel once Instead of performing an exhaustive count, let’s insert a shovel into the bowl as seen in Figure 7.2. Using the shovel, let’s remove \\(5 \\cdot 10 = 50\\) balls, as seen in Figure 7.3. FIGURE 7.2: Inserting a shovel into the bowl. FIGURE 7.3: Removing 50 balls from the bowl. Observe that 17 of the balls are red and thus 0.34 = 34% of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count of all the balls in the bowl, our guess of 34% took much less time and energy to make. However, say, we started this activity over from the beginning. In other words, we replace the 50 balls back into the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe? What if we repeated this activity several times following the process shown in Figure 7.4? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not. Let’s repeat this exercise several times with the help of 33 groups of friends to understand how the value differs with repetition. 7.1.3 Using the shovel 33 times Each of our 33 groups of friends will do the following: Use the shovel to remove 50 balls each. Count the number of red balls and thus compute the proportion of the 50 balls that are red. Return the balls into the bowl. Mix the contents of the bowl a little to not let a previous group’s results influence the next group’s. 
FIGURE 7.4: Repeating sampling activity 33 times. Each of our 33 groups of friends makes note of the proportion of red balls in the sample they collected. Each group then marks the proportion of their 50 balls that were red in the appropriate bin in a hand-drawn histogram as seen in Figure 7.5. FIGURE 7.5: Constructing a histogram of proportions. Recall from Section 2.5 that histograms allow us to visualize the distribution of a numerical variable. In particular, they show where the center of the values falls and how the values vary. A partially completed histogram of the first 10 out of 33 groups of friends’ results can be seen in Figure 7.6. FIGURE 7.6: Hand-drawn histogram of first 10 out of 33 proportions. Observe the following in the histogram in Figure 7.6: At the low end, one group removed 50 balls from the bowl with proportion red between 0.20 and 0.25. At the high end, another group removed 50 balls from the bowl with proportion red between 0.45 and 0.50. However, the most frequently occurring proportions red were between 0.30 and 0.35, right in the middle of the distribution. The shape of this distribution is somewhat bell-shaped. Let’s construct this same hand-drawn histogram in R using the data visualization skills that you honed in Chapter 2. We saved our 33 groups of friends’ results in the tactile_prop_red data frame included in the moderndive package. 
Run the following to display the first 10 of 33 rows: tactile_prop_red # A tibble: 33 x 4 group replicate red_balls prop_red &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 Ilyas, Yohan 1 21 0.42 2 Morgan, Terrance 2 17 0.34 3 Martin, Thomas 3 21 0.42 4 Clark, Frank 4 21 0.42 5 Riddhi, Karina 5 18 0.36 6 Andrew, Tyler 6 19 0.38 7 Julia 7 19 0.38 8 Rachel, Lauren 8 11 0.22 9 Daniel, Caroline 9 15 0.3 10 Josh, Maeve 10 17 0.34 # … with 23 more rows Observe for each group that we have their names, the number of red_balls they obtained, and the corresponding proportion out of 50 balls that were red named prop_red. We also have a replicate variable enumerating each of the 33 groups. We chose this name because each row can be viewed as one instance of a replicated (in other words repeated) activity: using the shovel to remove 50 balls and computing the proportion of those balls that are red. Let’s visualize the distribution of these 33 proportions using geom_histogram() with binwidth = 0.05 in Figure 7.7. This is a computerized and complete version of the partially completed hand-drawn histogram you saw in Figure 7.6. Note that setting boundary = 0.4 indicates that we want a binning scheme such that one of the bins’ boundary is at 0.4. This helps us to more closely align this histogram with the hand-drawn histogram in Figure 7.6. ggplot(tactile_prop_red, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Distribution of 33 proportions red&quot;) FIGURE 7.7: Distribution of 33 proportions based on 33 samples of size 50. 7.1.4 What did we just do? What we just demonstrated in this activity is the statistical concept of sampling. We would like to know the proportion of the bowl’s balls that are red. Because the bowl has a large number of balls, performing an exhaustive count of the red and white balls would be time-consuming. 
We thus extracted a sample of 50 balls using the shovel to make an estimate. Using this sample of 50 balls, we estimated the proportion of the bowl’s balls that are red to be 34%. Moreover, because we mixed the balls before each use of the shovel, the samples were randomly drawn. Because each sample was drawn at random, the samples were different from each other. Because the samples were different from each other, we obtained the different proportions red observed in Figure 7.7. This is known as the concept of sampling variation. The purpose of this sampling activity was to develop an understanding of two key concepts relating to sampling: Understanding the effect of sampling variation. Understanding the effect of sample size on sampling variation. In Section 7.2, we’ll mimic the hands-on sampling activity we just performed on a computer. This will allow us not only to repeat the sampling exercise much more than 33 times, but it will also allow us to use shovels with different numbers of slots than just 50. Afterwards, we’ll present you with definitions, terminology, and notation related to sampling in Section 7.3. As in many disciplines, such necessary background knowledge may seem inaccessible and even confusing at first. However, as with many difficult topics, if you truly understand the underlying concepts and practice, practice, practice, you’ll be able to master them. To tie the contents of this chapter to the real world, we’ll present an example of one of the most recognizable uses of sampling: polls. In Section 7.4 we’ll look at a particular case study: a 2013 poll on then U.S. President Barack Obama’s popularity among young Americans, conducted by Kennedy School’s Institute of Politics at Harvard University. To close this chapter, we’ll generalize the “sampling from a bowl” exercise to other sampling scenarios and present a theoretical result known as the Central Limit Theorem. 
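The chain of reasoning behind sampling variation described in this subsection can be previewed in a few lines of base R. This is only a rough sketch under stated assumptions: toy_bowl and shovel_prop_red() are hypothetical names invented for this illustration, and the bowl composition (1000 balls, 400 of them red) is made up rather than taken from the actual bowl. Section 7.2 carries out the virtual version of the activity properly.

```r
# Hypothetical bowl: 1000 balls, 400 of them red (made-up numbers for illustration)
set.seed(76)
toy_bowl <- c(rep('red', 400), rep('white', 600))

# One use of a shovel with `size` slots: draw balls at random without
# replacement and return the proportion of drawn balls that are red
shovel_prop_red <- function(size) {
  mean(sample(toy_bowl, size = size) == 'red')
}

# Repeat the activity 33 times with a 50-slot shovel, as our groups of friends did
props_50 <- replicate(33, shovel_prop_red(50))

range(props_50)  # smallest and largest of the 33 proportions red
sd(props_50)     # one way to quantify how much the proportions vary
```

Because each sample is drawn at random, the 33 proportions are not all equal, which is sampling variation in miniature.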
Learning check (LC7.1) Why was it important to mix the bowl before we sampled the balls? (LC7.2) Why is it that our 33 groups of friends did not all have the same numbers of balls that were red out of 50, and hence different proportions red? 7.2 Virtual sampling In the previous Section 7.1, we performed a tactile sampling activity by hand. In other words, we used a physical bowl of balls and a physical shovel. We performed this sampling activity by hand first so that we could develop a firm understanding of the root ideas behind sampling. In this section, we’ll mimic this tactile sampling activity with a virtual sampling activity using a computer. In other words, we’ll use a virtual analog to the bowl of balls and a virtual analog to the shovel. 7.2.1 Using the virtual shovel once Let’s start by performing the virtual analog of the tactile sampling exercise we performed in Section 7.1. We first need a virtual analog of the bowl seen in Figure 7.1. To this end, we included a data frame named bowl in the moderndive package. The rows of bowl correspond exactly with the contents of the actual bowl. bowl # A tibble: 2,400 x 2 ball_ID color &lt;int&gt; &lt;chr&gt; 1 1 white 2 2 white 3 3 white 4 4 red 5 5 white 6 6 white 7 7 red 8 8 white 9 9 red 10 10 white # … with 2,390 more rows Observe that bowl has 2400 rows, telling us that the bowl contains 2400 equally sized balls. The first variable ball_ID is used as an identification variable as discussed in Subsection 1.4.4; none of the balls in the actual bowl are marked with numbers. The second variable color indicates whether a particular virtual ball is red or white. View the contents of the bowl in RStudio’s data viewer and scroll through the contents to convince yourself that bowl is indeed a virtual analog of the actual bowl in Figure 7.1. Now that we have a virtual analog of our bowl, we now need a virtual analog to the shovel seen in Figure 7.2 to generate virtual samples of 50 balls. 
We’re going to use the rep_sample_n() function included in the moderndive package. This function allows us to take repeated, or replicated, samples of size n. virtual_shovel &lt;- bowl %&gt;% rep_sample_n(size = 50) virtual_shovel # A tibble: 50 x 3 # Groups: replicate [1] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 1970 white 2 1 842 red 3 1 2287 white 4 1 599 white 5 1 108 white 6 1 846 red 7 1 390 red 8 1 344 white 9 1 910 white 10 1 1485 white # … with 40 more rows Observe that virtual_shovel has 50 rows corresponding to our virtual sample of size 50. The ball_ID variable identifies which of the 2400 balls from bowl are included in our sample of 50 balls while color denotes its color. However, what does the replicate variable indicate? In virtual_shovel’s case, replicate is equal to 1 for all 50 rows. This is telling us that these 50 rows correspond to the first repeated/replicated use of the shovel, in our case our first sample. We’ll see shortly that when we “virtually” take 33 samples, replicate will take values between 1 and 33. Let’s compute the proportion of balls in our virtual sample that are red using the dplyr data wrangling verbs you learned in Chapter 3. First, for each of our 50 sampled balls, let’s identify if it is red or not using a test for equality with ==. 
Let’s create a new Boolean variable is_red using the mutate() function from Section 3.5: virtual_shovel %&gt;% mutate(is_red = (color == &quot;red&quot;)) # A tibble: 50 x 4 # Groups: replicate [1] replicate ball_ID color is_red &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;lgl&gt; 1 1 1970 white FALSE 2 1 842 red TRUE 3 1 2287 white FALSE 4 1 599 white FALSE 5 1 108 white FALSE 6 1 846 red TRUE 7 1 390 red TRUE 8 1 344 white FALSE 9 1 910 white FALSE 10 1 1485 white FALSE # … with 40 more rows Observe that for every row where color == &quot;red&quot;, the Boolean (logical) value TRUE is returned and for every row where color is not equal to &quot;red&quot;, the Boolean FALSE is returned. Second, let’s compute the number of balls out of 50 that are red using the summarize() function. Recall from Section 3.3 that summarize() takes a data frame with many rows and returns a data frame with a single row containing summary statistics, like the mean() or median(). In this case, we use the sum(): virtual_shovel %&gt;% mutate(is_red = (color == &quot;red&quot;)) %&gt;% summarize(num_red = sum(is_red)) # A tibble: 1 x 2 replicate num_red &lt;int&gt; &lt;int&gt; 1 1 12 Why does this work? Because R treats TRUE like the number 1 and FALSE like the number 0. So summing the number of TRUEs and FALSEs is equivalent to summing 1’s and 0’s. In the end, this operation counts the number of balls where color is red. In our case, 12 of the 50 balls were red. However, you might have gotten a different number red because of the randomness of the virtual sampling. Third and lastly, let’s compute the proportion of the 50 sampled balls that are red by dividing num_red by 50: virtual_shovel %&gt;% mutate(is_red = color == &quot;red&quot;) %&gt;% summarize(num_red = sum(is_red)) %&gt;% mutate(prop_red = num_red / 50) # A tibble: 1 x 3 replicate num_red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 12 0.24 In other words, 24% of this virtual sample’s balls were red. 
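To see the TRUE/FALSE arithmetic in isolation, here is a minimal sketch in base R using a small made-up vector of colors (toy_colors is a hypothetical name for this illustration, not part of the moderndive package). It also shows that taking the mean() of a logical vector returns the proportion of TRUEs directly, which is an alternative to dividing the sum by the sample size.

```r
# Hypothetical mini-sample of 5 balls (made-up data for illustration)
toy_colors <- c('white', 'red', 'white', 'red', 'red')

# A test for equality returns a logical (Boolean) vector
is_red <- toy_colors == 'red'

# TRUE counts as 1 and FALSE counts as 0, so sum() counts the red balls
num_red <- sum(is_red)

# Dividing by the number of balls gives the proportion red ...
prop_red <- num_red / length(toy_colors)

# ... which matches the mean of the logical vector
prop_red_via_mean <- mean(is_red)
```

In the dplyr pipeline above, this same behavior is what makes sum(is_red) work inside summarize().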
Let’s make this code a little more compact and succinct by combining the first mutate() and the summarize() as follows: virtual_shovel %&gt;% summarize(num_red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = num_red / 50) # A tibble: 1 x 3 replicate num_red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 12 0.24 Great! 24% of virtual_shovel’s 50 balls were red! So based on this particular sample of 50 balls, our guess at the proportion of the bowl’s balls that are red is 24%. But remember from our earlier tactile sampling activity that if we repeat this sampling, we will not necessarily obtain the same value of 24% again. There will likely be some variation. In fact, our 33 groups of friends computed 33 such proportions whose distribution we visualized in Figure 7.7. We saw that these estimates varied. Let’s now perform the virtual analog of having 33 groups of friends use the sampling shovel! 7.2.2 Using the virtual shovel 33 times Recall that in our tactile sampling exercise in Section 7.1, we had 33 groups of friends each use the shovel, yielding 33 samples of 50 balls each. We then used these 33 samples to compute 33 proportions. In other words, we repeated/replicated using the shovel 33 times. We can perform this repeated/replicated sampling virtually by once again using our virtual shovel function rep_sample_n(), but by adding the reps = 33 argument. This is telling R that we want to repeat the sampling 33 times. We’ll save these results in a data frame called virtual_samples. While we provide a preview of the first 10 rows of virtual_samples in what follows, we highly suggest you scroll through its contents using RStudio’s spreadsheet viewer by running View(virtual_samples). 
virtual_samples &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 33) virtual_samples # A tibble: 1,650 x 3 # Groups: replicate [33] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 875 white 2 1 1851 red 3 1 1548 red 4 1 1975 white 5 1 835 white 6 1 16 white 7 1 327 white 8 1 1803 red 9 1 740 red 10 1 179 red # … with 1,640 more rows Observe in the spreadsheet viewer that the first 50 rows of replicate are equal to 1 while the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 rows correspond to the second sample of 50 balls. This pattern continues for all reps = 33 replicates and thus virtual_samples has 33 \\(\\cdot\\) 50 = 1650 rows. Let’s now take virtual_samples and compute the resulting 33 proportions red. We’ll use the same dplyr verbs as before, but this time with an additional group_by() of the replicate variable. Recall from Section 3.4 that by assigning the grouping variable “meta-data” before we summarize(), we’ll obtain 33 different proportions red. We display a preview of the first 10 out of 33 rows: virtual_prop_red &lt;- virtual_samples %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) virtual_prop_red # A tibble: 33 x 3 replicate red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 23 0.46 2 2 19 0.38 3 3 18 0.36 4 4 19 0.38 5 5 15 0.3 6 6 21 0.42 7 7 21 0.42 8 8 16 0.32 9 9 24 0.48 10 10 14 0.28 # … with 23 more rows As with our 33 groups of friends’ tactile samples, there is variation in the resulting 33 virtual proportions red. Let’s visualize this variation in a histogram in Figure 7.8. Note that we add binwidth = 0.05 and boundary = 0.4 arguments as well. Recall that setting boundary = 0.4 ensures a binning scheme with one of the bins’ boundaries at 0.4. Since the binwidth = 0.05 is also set, this will create bins with boundaries at 0.30, 0.35, 0.45, 0.5, etc. as well. 
ggplot(virtual_prop_red, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Distribution of 33 proportions red&quot;) FIGURE 7.8: Distribution of 33 proportions based on 33 samples of size 50. Observe that we occasionally obtained proportions red that are less than 30%. On the other hand, we occasionally obtained proportions that are greater than 45%. However, the most frequently occurring proportions were between 35% and 40% (for 11 out of 33 samples). Why do we have these differences in proportions red? Because of sampling variation. Let’s now compare our virtual results with our tactile results from the previous section in Figure 7.9. Observe that both histograms are somewhat similar in their center and variation, although not identical. These slight differences are again due to random sampling variation. Furthermore, observe that both distributions are somewhat bell-shaped. FIGURE 7.9: Comparing 33 virtual and 33 tactile proportions red. Learning check (LC7.3) Why couldn’t we study the effects of sampling variation when we used the virtual shovel only once? Why did we need to take more than one virtual sample (in our case 33 virtual samples)? 7.2.3 Using the virtual shovel 1000 times Now say we want to study the effects of sampling variation not for 33 samples, but rather for a larger number of samples, say 1000. We have two choices at this point. We could have our groups of friends manually take 1000 samples of 50 balls and compute the corresponding 1000 proportions. However, this would be a tedious and time-consuming task. This is where computers excel: automating long and repetitive tasks while performing them quite quickly. Thus, at this point we will abandon tactile sampling in favor of only virtual sampling. 
Let’s once again use the rep_sample_n() function with the sample size set to 50, but this time with the number of replicates reps set to 1000. Be sure to scroll through the contents of virtual_samples in RStudio’s viewer. virtual_samples &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 1000) virtual_samples # A tibble: 50,000 x 3 # Groups: replicate [1,000] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 1236 red 2 1 1944 red 3 1 1939 white 4 1 780 white 5 1 1956 white 6 1 1003 white 7 1 2113 white 8 1 2213 white 9 1 782 white 10 1 898 white # … with 49,990 more rows Observe that now virtual_samples has 1000 \\(\\cdot\\) 50 = 50,000 rows, instead of the 33 \\(\\cdot\\) 50 = 1650 rows from earlier. Using the same data wrangling code as earlier, let’s take the data frame virtual_samples with 1000 \\(\\cdot\\) 50 = 50,000 rows and compute the resulting 1000 proportions of red balls. virtual_prop_red &lt;- virtual_samples %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) virtual_prop_red # A tibble: 1,000 x 3 replicate red prop_red &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 18 0.36 2 2 19 0.38 3 3 20 0.4 4 4 15 0.3 5 5 17 0.34 6 6 16 0.32 7 7 23 0.46 8 8 23 0.46 9 9 15 0.3 10 10 18 0.36 # … with 990 more rows Observe that we now have 1000 replicates of prop_red, the proportion of 50 balls that are red. Using the same code as earlier, let’s now visualize the distribution of these 1000 replicates of prop_red in a histogram in Figure 7.10. ggplot(virtual_prop_red, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Distribution of 1000 proportions red&quot;) FIGURE 7.10: Distribution of 1000 proportions based on 1000 samples of size 50. Once again, the most frequently occurring proportions of red balls are between 35% and 40%. 
Every now and then, we obtain proportions as low as between 20% and 25%, and others as high as between 55% and 60%. These are rare, however. Furthermore, observe that we now have a much more symmetric and smoother bell-shaped distribution. This distribution is, in fact, approximated well by a normal distribution. At this point we recommend you read the “Normal distribution” section (Appendix A.2) for a brief discussion on the properties of the normal distribution. Learning check (LC7.4) Why did we not take 1000 “tactile” samples of 50 balls by hand? (LC7.5) Looking at Figure 7.10, would you say that sampling 50 balls where 30% of them were red is likely or not? What about sampling 50 balls where 10% of them were red? 7.2.4 Using different shovels Now say instead of just one shovel, you have three choices of shovels to extract a sample of balls with: shovels of size 25, 50, and 100. FIGURE 7.11: Three shovels to extract three different sample sizes. If your goal is still to estimate the proportion of the bowl’s balls that are red, which shovel would you choose? In our experience, most people would choose the largest shovel with 100 slots because it would yield the “best” guess of the proportion of the bowl’s balls that are red. Let’s define some criteria for “best” in this subsection. Using our newly developed tools for virtual sampling, let’s unpack the effect of having different sample sizes! In other words, let’s use rep_sample_n() with size set to 25, 50, and 100, respectively, while keeping the number of repeated/replicated samples at 1000: Virtually use the appropriate shovel to generate 1000 samples with size balls. Compute the resulting 1000 replicates of the proportion of the shovel’s balls that are red. Visualize the distribution of these 1000 proportions red using a histogram. Run each of the following code segments individually and then compare the three resulting histograms. 
# Segment 1: sample size = 25 ------------------------------ # 1.a) Virtually use shovel 1000 times virtual_samples_25 &lt;- bowl %&gt;% rep_sample_n(size = 25, reps = 1000) # 1.b) Compute resulting 1000 replicates of proportion red virtual_prop_red_25 &lt;- virtual_samples_25 %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 25) # 1.c) Plot distribution via a histogram ggplot(virtual_prop_red_25, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 25 balls that were red&quot;, title = &quot;25&quot;) # Segment 2: sample size = 50 ------------------------------ # 2.a) Virtually use shovel 1000 times virtual_samples_50 &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 1000) # 2.b) Compute resulting 1000 replicates of proportion red virtual_prop_red_50 &lt;- virtual_samples_50 %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) # 2.c) Plot distribution via a histogram ggplot(virtual_prop_red_50, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;50&quot;) # Segment 3: sample size = 100 ------------------------------ # 3.a) Virtually using shovel with 100 slots 1000 times virtual_samples_100 &lt;- bowl %&gt;% rep_sample_n(size = 100, reps = 1000) # 3.b) Compute resulting 1000 replicates of proportion red virtual_prop_red_100 &lt;- virtual_samples_100 %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 100) # 3.c) Plot distribution via a histogram ggplot(virtual_prop_red_100, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 100 balls that were red&quot;, title = &quot;100&quot;) For easy comparison, we present the three 
resulting histograms in a single row with matching x and y axes in Figure 7.12. FIGURE 7.12: Comparing the distributions of proportion red for different sample sizes. Observe that as the sample size increases, the variation of the 1000 replicates of the proportion of red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing Figure 7.12, all three histograms appear to center around roughly 40%. We can be numerically explicit about the amount of variation in our three sets of 1000 values of prop_red using the standard deviation. A standard deviation is a summary statistic that measures the amount of variation within a numerical variable (see Appendix A.1 for a brief discussion on the properties of the standard deviation). For all three sample sizes, let’s compute the standard deviation of the 1000 proportions red by running the following data wrangling code that uses the sd() summary function. # n = 25 virtual_prop_red_25 %&gt;% summarize(sd = sd(prop_red)) # n = 50 virtual_prop_red_50 %&gt;% summarize(sd = sd(prop_red)) # n = 100 virtual_prop_red_100 %&gt;% summarize(sd = sd(prop_red)) Let’s compare these three measures of distributional variation in Table 7.1. TABLE 7.1: Comparing standard deviations of proportions red for three different shovels Number of slots in shovel Standard deviation of proportions red 25 0.094 50 0.069 100 0.045 As we observed in Figure 7.12, as the sample size increases, the variation decreases. In other words, there is less variation in the 1000 values of the proportion red. So as the sample size increases, our guesses at the true proportion of the bowl’s balls that are red get more precise. 
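The three separate sd() computations above can also be gathered into a single comparison. What follows is only a minimal sketch, assuming the dplyr and moderndive packages are installed and that bowl and rep_sample_n() behave as used earlier in this chapter. The helper name sd_prop_red() is our own invention for illustration and is not part of either package. Because the samples are random, the exact values will differ slightly from Table 7.1 on each run.

```r
library(dplyr)
library(moderndive)

# Hypothetical helper (not from any package): take `reps` virtual samples
# of size `n` from the bowl and return the standard deviation of the
# resulting proportions red.
sd_prop_red <- function(n, reps = 1000) {
  bowl %>%
    rep_sample_n(size = n, reps = reps) %>%
    group_by(replicate) %>%
    # mean(color == "red") is the same as sum(color == "red") / n
    summarize(prop_red = mean(color == "red")) %>%
    summarize(sd = sd(prop_red)) %>%
    pull(sd)
}

shovel_sizes <- c(25, 50, 100)
sd_table <- data.frame(
  n = shovel_sizes,
  sd = sapply(shovel_sizes, sd_prop_red)
)
sd_table
```

Reading down the sd column should show the same pattern as Table 7.1: the larger the shovel, the smaller the standard deviation of the proportions red.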
Learning check (LC7.6) In Figure 7.12, we used shovels to take 1000 samples each, computed the resulting 1000 proportions of the shovel’s balls that were red, and then visualized the distribution of these 1000 proportions in a histogram. We did this for shovels with 25, 50, and 100 slots in them. As the size of the shovels increased, the histograms got narrower. In other words, as the size of the shovels increased from 25 to 50 to 100, did the 1000 proportions A. vary less, B. vary by the same amount, or C. vary more? (LC7.7) What summary statistic did we use to quantify how much the 1000 proportions red varied? A. The interquartile range B. The standard deviation C. The range: the largest value minus the smallest. 7.3 Sampling framework In both our tactile and our virtual sampling activities, we used sampling for the purpose of estimation. We extracted samples in order to estimate the proportion of the bowl’s balls that are red. We used sampling as a less time-consuming approach than performing an exhaustive count of all the balls. Our virtual sampling activity built up to the results shown in Figure 7.12 and Table 7.1: comparing 1000 proportions red based on samples of size 25, 50, and 100. This was our first attempt at understanding two key concepts relating to sampling for estimation: The effect of sampling variation on our estimates. The effect of sample size on sampling variation. Let’s now introduce some terminology and notation as well as statistical definitions related to sampling. Given the number of new words you’ll need to learn, you will likely have to read this section a few times. Keep in mind, however, that all of this terminology, notation, and these definitions tie directly to the concepts underlying our tactile and virtual sampling activities. It will simply take time and practice to master them. 7.3.1 Terminology and notation Here is a list of terminology and mathematical notation relating to sampling. 
First, a population is a collection of individuals or observations we are interested in. This is also commonly denoted as a study population. We mathematically denote the population’s size using upper-case \\(N\\). In our sampling activities, the (study) population is the collection of \\(N\\) = 2400 identically sized red and white balls contained in the bowl. Second, a population parameter is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the population mean. This is mathematically denoted with the Greek letter \\(\\mu\\) pronounced “mu” (we’ll see a sampling activity involving means in the upcoming Section 8.1). In our earlier sampling from the bowl activity, however, since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the population proportion. This is mathematically denoted with the letter \\(p\\). Third, a census is an exhaustive enumeration or counting of all \\(N\\) individuals or observations in the population in order to compute the population parameter’s value exactly. In our sampling activity, this would correspond to counting the number of balls out of \\(N\\) = 2400 that are red and computing the population proportion \\(p\\) that are red exactly. When the number \\(N\\) of individuals or observations in our population is large as was the case with our bowl, a census can be quite expensive in terms of time, energy, and money. Fourth, sampling is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case \\(n\\), as opposed to upper case \\(N\\) which denotes the population’s size. Typically the sample size \\(n\\) is much smaller than the population size \\(N\\). Thus sampling is a much cheaper alternative than performing a census. 
In our sampling activities, we used shovels with 25, 50, and 100 slots to extract samples of size \\(n\\) = 25, \\(n\\) = 50, and \\(n\\) = 100. Fifth, a point estimate (AKA sample statistic) is a summary statistic computed from a sample that estimates an unknown population parameter. In our sampling activities, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with \\(p\\). Our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls that are red. We mathematically denote the sample proportion using \\(\\widehat{p}\\). The “hat” on top of the \\(p\\) indicates that it is an estimate of the unknown population proportion \\(p\\). Sixth is the idea of representative sampling. A sample is said to be a representative sample if it roughly looks like the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our sampling activity, are the samples of \\(n\\) balls extracted using our shovels representative of the bowl’s \\(N\\) = 2400 balls? Seventh is the idea of generalizability. We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, does the value of the point estimate generalize to the population? In our sampling activity, can we generalize the sample proportion from our shovels to the entire bowl? Using our mathematical notation, this is akin to asking whether \\(\\widehat{p}\\) is a “good guess” of \\(p\\). Eighth, we say biased sampling occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is unbiased if every observation in a population has an equal chance of being sampled. 
In our sampling activities, since we mixed all \\(N = 2400\\) balls prior to each group’s sampling and since each of the equally sized balls had an equal chance of being sampled, our samples were unbiased. Ninth and lastly, the idea of random sampling. We say a sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our sampling activities, this would correspond to sufficiently mixing the bowl before each use of the shovel. Phew, that’s a lot of new terminology and notation to learn! Let’s put them all together to describe the paradigm of sampling. In general: If the sampling of a sample of size \\(n\\) is done at random, then the sample is unbiased and representative of the population of size \\(N\\), thus any result based on the sample can generalize to the population, thus the point estimate is a “good guess” of the unknown population parameter, thus instead of performing a census, we can infer about the population using sampling. Specific to our sampling activity: If we extract a sample of \\(n=50\\) balls at random, in other words, we mix all of the equally sized balls before using the shovel, then the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus any result based on the shovel’s balls can generalize to the bowl, thus the sample proportion \\(\\widehat{p}\\) of the \\(n=50\\) balls in the shovel that are red is a “good guess” of the population proportion \\(p\\) of the \\(N=2400\\) balls that are red, thus instead of manually going over all 2400 balls in the bowl, we can infer about the bowl using the shovel. Note that last word we wrote in bold: infer. The act of “inferring” means to deduce or conclude information from evidence and reasoning. In our sampling activities, we wanted to infer about the proportion of the bowl’s balls that are red. 
Statistical inference is the “theory, methods, and practice of forming judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.” In other words, statistical inference is the act of inference via sampling. In the upcoming Chapter 8 on confidence intervals, we’ll introduce the infer package, which makes statistical inference “tidy” and transparent. It is why this third portion of the book is called “Statistical inference via infer.” Learning check (LC7.8) In the case of our bowl activity, what is the population parameter? Do we know its value? (LC7.9) What would performing a census in our bowl activity correspond to? Why did we not perform a census? (LC7.10) What purpose do point estimates serve in general? What is the name of the point estimate specific to our bowl activity? What is its mathematical notation? (LC7.11) How did we ensure that our tactile samples using the shovel were random? (LC7.12) Why is it important that sampling be done at random? (LC7.13) What are we inferring about the bowl based on the samples using the shovel? 7.3.2 Statistical definitions Now, for some important statistical definitions related to sampling. As a refresher of our 1000 repeated/replicated virtual samples of size \\(n\\) = 25, \\(n\\) = 50, and \\(n\\) = 100 in Section 7.2, let’s display Figure 7.12 again as Figure 7.13. FIGURE 7.13: Previously seen three distributions of the sample proportion \\(\\widehat{p}\\). These types of distributions have a special name: sampling distributions; their visualization displays the effect of sampling variation on the distribution of any point estimate, in this case, the sample proportion \\(\\widehat{p}\\). Using these sampling distributions, for a given sample size \\(n\\), we can make statements about what values we can typically expect. 
For example, observe the centers of all three sampling distributions: they are all roughly centered around \\(0.4 = 40\\%\\). Furthermore, observe that while we are somewhat likely to observe sample proportions of red balls of \\(0.2 = 20\\%\\) when using the shovel with 25 slots, we will almost never observe a proportion of 20% when using the shovel with 100 slots. Observe also the effect of sample size on the sampling variation. As the sample size \\(n\\) increases from 25 to 50 to 100, the variation of the sampling distribution decreases and thus the values cluster more and more tightly around the same center of around 40%. We quantified this variation using the standard deviation of our sample proportions in Table 7.1, which we display again as Table 7.2: TABLE 7.2: Previously seen comparing standard deviations of proportions red for three different shovels Number of slots in shovel Standard deviation of proportions red 25 0.094 50 0.069 100 0.045 So as the sample size increases, the standard deviation of the proportion of red balls decreases. This type of standard deviation has another special name: standard error. Standard errors quantify the effect of sampling variation on our estimates. In other words, they quantify how much we can expect the proportion of a shovel’s balls that are red to vary from one sample to another sample to another sample, and so on. As a general rule, as sample size increases, the standard error decreases. Unfortunately, these names confuse many people who are new to statistical inference. For example, it’s common for people who are new to statistical inference to call the “sampling distribution” the “sample distribution.” Another source of confusion is the pair of names “standard deviation” and “standard error.” Remember that a standard error is merely a kind of standard deviation: the standard deviation of any point estimate from sampling. 
In other words, all standard errors are standard deviations, but not every standard deviation is necessarily a standard error. To help reinforce these concepts, let’s re-display Figure 7.12 but using our new terminology, notation, and definitions relating to sampling in Figure 7.14. FIGURE 7.14: Three sampling distributions of the sample proportion \\(\\widehat{p}\\). Furthermore, let’s re-display Table 7.1 but using our new terminology, notation, and definitions relating to sampling in Table 7.3. TABLE 7.3: Standard errors of the sample proportion based on sample sizes of 25, 50, and 100 Sample size (n) Standard error of \\(\\widehat{p}\\) n = 25 0.094 n = 50 0.069 n = 100 0.045 Remember the key message of this last table: that as the sample size \\(n\\) goes up, the “typical” error of your point estimate will go down, as quantified by the standard error. Learning check (LC7.14) What purpose did the sampling distributions serve? (LC7.15) What does the standard error of the sample proportion \\(\\widehat{p}\\) quantify? 7.3.3 The moral of the story Let’s recap this section so far. We’ve seen that if a sample is generated at random, then the resulting point estimate is a “good guess” of the true unknown population parameter. In our sampling activities, since we made sure to mix the balls first before extracting a sample with the shovel, the resulting sample proportion \\(\\widehat{p}\\) of the shovel’s balls that were red was a “good guess” of the population proportion \\(p\\) of the bowl’s balls that were red. However, what do we mean by our point estimate being a “good guess”? Sometimes, we’ll get an estimate that is less than the true value of the population parameter, while at other times we’ll get an estimate that is greater. This is due to sampling variation. However, despite this sampling variation, our estimates will “on average” be correct and thus will be centered at the true value. 
This is because our sampling was done at random and thus in an unbiased fashion. In our sampling activities, sometimes our sample proportion \\(\\widehat{p}\\) was less than the true population proportion \\(p\\), while at other times it was greater. This was due to the sampling variability. However, despite this sampling variation, our sample proportions \\(\\widehat{p}\\) were “on average” correct and thus were centered at the true value of the population proportion \\(p\\). This is because we mixed our bowl before taking samples and thus the sampling was done at random and thus in an unbiased fashion. This is also known as having an accurate estimate. What was the value of the population proportion \\(p\\) of the \\(N\\) = 2400 balls in the actual bowl that were red? There were 900 red balls, for a proportion red of 900/2400 = 0.375 = 37.5%! How do we know this? Did the authors do an exhaustive count of all the balls? No! They were listed in the contents of the box that the bowl came in! Hence we were able to make the contents of the virtual bowl match the tactile bowl: bowl %&gt;% summarize(sum_red = sum(color == &quot;red&quot;), sum_not_red = sum(color != &quot;red&quot;)) # A tibble: 1 x 2 sum_red sum_not_red &lt;int&gt; &lt;int&gt; 1 900 1500 Let’s re-display our sampling distributions from Figures 7.12 and 7.14, but now with a vertical red line marking the true population proportion \\(p\\) of balls that are red = 37.5% in Figure 7.15. We see that while there is a certain amount of error in the sample proportions \\(\\widehat{p}\\) for all three sampling distributions, on average the \\(\\widehat{p}\\) are centered at the true population proportion red \\(p\\). FIGURE 7.15: Three sampling distributions with population proportion \\(p\\) marked by vertical line. We also saw in this section that as your sample size \\(n\\) increases, your point estimates will vary less and less and be more and more concentrated around the true population parameter. 
This variation is quantified by the decreasing standard error. In other words, the typical error of your point estimates will decrease. In our sampling exercise, as the sample size increased, the variation of our sample proportions \\(\\widehat{p}\\) decreased. You can observe this behavior in Figure 7.15. This is also known as having a precise estimate. So random sampling ensures our point estimates are accurate, while on the other hand having a large sample size ensures our point estimates are precise. While the terms “accuracy” and “precision” may sound like they mean the same thing, there is a subtle difference. Accuracy describes how “on target” our estimates are, whereas precision describes how “consistent” our estimates are. Figure 7.16 illustrates the difference. FIGURE 7.16: Comparing accuracy and precision. At this point, you might be asking yourself: “If we already knew the true proportion of the bowl’s balls that are red was 37.5%, then why did we do any sampling?”. You might also be asking: “Why did we take 1000 repeated samples of size n = 25, 50, and 100? Shouldn’t we be taking only one sample that’s as large as possible?”. If you did ask yourself these questions, your suspicion is merited! The sampling activity involving the bowl is merely an idealized version of how sampling is done in real life. We performed this exercise only to study and understand: The effect of sampling variation. The effect of sample size on sampling variation. This is not how sampling is done in real life. In a real-life scenario, we won’t know what the true value of the population parameter is. Furthermore, we wouldn’t take 1000 repeated/replicated samples, but rather a single sample that’s as large as we can afford. In the next section, let’s now study a real-life example of sampling: polls. 
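The distinction between accuracy and precision can also be checked numerically. The following is a minimal base R sketch, not the book's own code: it assumes a stand-in vector bowl_colors built from the counts reported above (900 red and 1500 white balls), and the names bowl_colors, simulate_p_hat(), and accuracy_precision are hypothetical ones chosen for illustration. The seed is arbitrary and only makes the run reproducible.

```r
# Hypothetical stand-in for the bowl: 900 red and 1500 white balls,
# matching the counts reported above (N = 2400, p = 0.375).
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))

set.seed(2019)  # arbitrary seed, only for reproducibility

# Draw `reps` samples of size n without replacement and return the
# sample proportion red (p-hat) for each sample.
simulate_p_hat <- function(n, reps = 1000) {
  replicate(reps, mean(sample(bowl_colors, size = n) == "red"))
}

sizes <- c(25, 50, 100)
p_hats <- lapply(sizes, simulate_p_hat)

accuracy_precision <- data.frame(
  n = sizes,
  mean_p_hat = sapply(p_hats, mean),  # accuracy: how close to 0.375?
  se_p_hat = sapply(p_hats, sd)       # precision: spread of the p-hats
)
accuracy_precision
```

Under these assumptions, the mean_p_hat column should sit close to 0.375 for every sample size, which reflects accuracy from random sampling, while the se_p_hat column should shrink as n grows, which reflects the gain in precision from larger samples.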
Learning check (LC7.16) The table that follows is a version of Table 7.3 matching sample sizes \\(n\\) to different standard errors of the sample proportion \\(\\widehat{p}\\), but with the rows randomly re-ordered and the sample sizes removed. Fill in the table by matching the correct sample sizes to the correct standard errors. TABLE 7.4: Standard errors of \\(\\widehat{p}\\) based on n = 25, 50, 100 Sample size Standard error of \\(\\widehat{p}\\) n = 0.094 n = 0.045 n = 0.069 For the following four Learning checks, let the estimate be the sample proportion \\(\\widehat{p}\\): the proportion of a shovel’s balls that were red. It estimates the population proportion \\(p\\): the proportion of the bowl’s balls that were red. (LC7.17) What is the difference between an accurate and a precise estimate? (LC7.18) How do we ensure that an estimate is accurate? How do we ensure that an estimate is precise? (LC7.19) In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then, what was the purpose of our exercises where we took 1000 different samples? (LC7.20) Figure 7.16 with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding sampling distributions of the sample proportion \\(\\widehat{p}\\), like the one in the leftmost plot in Figure 7.15. 7.4 Case study: Polls Let’s now switch gears to a more realistic sampling scenario than our bowl activity: a poll. In practice, pollsters do not take 1000 repeated samples as we did in our previous sampling activities, but rather take only a single sample that’s as large as possible. On December 4, 2013, National Public Radio in the US reported on a poll of President Obama’s approval rating among young Americans aged 18-29 in an article, “Poll: Support For Obama Among Young Americans Eroding.” The poll was conducted by the Kennedy School’s Institute of Politics at Harvard University. 
A quote from the article: After voting for him in large numbers in 2008 and 2012, young Americans are souring on President Obama. According to a new Harvard University Institute of Politics poll, just 41 percent of millennials — adults ages 18-29 — approve of Obama’s job performance, his lowest-ever standing among the group and an 11-point drop from April. Let’s tie elements of the real-life poll in this news article with our “tactile” and “virtual” bowl activity from Sections 7.1 and 7.2 using the terminology, notations, and definitions we learned in Section 7.3. You’ll see that our sampling activity with the bowl is an idealized version of what pollsters are trying to do in real life. First, who is the (study) population of \\(N\\) individuals or observations of interest? Bowl: \\(N\\) = 2400 identically sized red and white balls Obama poll: \\(N\\) = ? young Americans aged 18-29 Second, what is the population parameter? Bowl: The population proportion \\(p\\) of all the balls in the bowl that are red. Obama poll: The population proportion \\(p\\) of all young Americans who approve of Obama’s job performance. Third, what would a census look like? Bowl: Manually going over all \\(N\\) = 2400 balls and exactly computing the population proportion \\(p\\) of the balls that are red. Obama poll: Locating all \\(N\\) young Americans and asking them all if they approve of Obama’s job performance. In this case, we don’t even know what the population size \\(N\\) is! Fourth, how do you perform sampling to obtain a sample of size \\(n\\)? Bowl: Using a shovel with \\(n\\) slots. Obama poll: One method is to get a list of phone numbers of all young Americans and pick out \\(n\\) phone numbers. In this poll’s case, the sample size was \\(n = 2089\\) young Americans. Fifth, what is your point estimate (AKA sample statistic) of the unknown population parameter? Bowl: The sample proportion \\(\\widehat{p}\\) of the balls in the shovel that were red. 
Obama poll: The sample proportion \\(\\widehat{p}\\) of young Americans in the sample that approve of Obama’s job performance. In this poll’s case, \\(\\widehat{p} = 0.41 = 41\\%\\), the quoted percentage in the second paragraph of the article. Sixth, is the sampling procedure representative? Bowl: Are the contents of the shovel representative of the contents of the bowl? Because we mixed the bowl before sampling, we can feel confident that they are. Obama poll: Is the sample of \\(n = 2089\\) young Americans representative of all young Americans aged 18-29? This depends on whether the sampling was random. Seventh, are the samples generalizable to the greater population? Bowl: Is the sample proportion \\(\\widehat{p}\\) of the shovel’s balls that are red a “good guess” of the population proportion \\(p\\) of the bowl’s balls that are red? Given that the sample was representative, the answer is yes. Obama poll: Is the sample proportion \\(\\widehat{p} = 0.41\\) of the sample of young Americans who supported Obama a “good guess” of the population proportion \\(p\\) of all young Americans who supported Obama at this time in 2013? In other words, can we confidently say that roughly 41% of all young Americans approved of Obama at the time of the poll? Again, this depends on whether the sampling was random. Eighth, is the sampling procedure unbiased? In other words, do all observations have an equal chance of being included in the sample? Bowl: Since each ball was equally sized and we mixed the bowl before using the shovel, each ball had an equal chance of being included in a sample and hence the sampling was unbiased. Obama poll: Did all young Americans have an equal chance at being represented in this poll? Again, this depends on whether the sampling was random. Ninth and lastly, was the sampling done at random? Bowl: As long as you mixed the bowl sufficiently before sampling, your samples would be random. Obama poll: Was the sample conducted at random? 
We can’t answer this question without knowing about the sampling methodology used by Kennedy School’s Institute of Politics at Harvard University. We’ll discuss this more at the end of this section. In other words, the poll by Kennedy School’s Institute of Politics at Harvard University can be thought of as an instance of using the shovel to sample balls from the bowl. Furthermore, if another polling company conducted a similar poll of young Americans at roughly the same time, they would likely get a different estimate than 41%. This is due to sampling variation. Let’s now revisit the sampling paradigm from Subsection 7.3.1: In general: If the sampling of a sample of size \\(n\\) is done at random, then the sample is unbiased and representative of the population of size \\(N\\), thus any result based on the sample can generalize to the population, thus the point estimate is a “good guess” of the unknown population parameter, thus instead of performing a census, we can infer about the population using sampling. Specific to the bowl: If we extract a sample of \\(n = 50\\) balls at random, in other words, we mix all of the equally sized balls before using the shovel, then the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus any result based on the shovel’s balls can generalize to the bowl, thus the sample proportion \\(\\widehat{p}\\) of the \\(n = 50\\) balls in the shovel that are red is a “good guess” of the population proportion \\(p\\) of the \\(N = 2400\\) balls that are red, thus instead of manually going over all 2400 balls in the bowl, we can infer about the bowl using the shovel. 
Specific to the Obama poll: If we had a way of contacting a randomly chosen sample of 2089 young Americans and polling their approval of President Obama in 2013, then these 2089 young Americans would be an unbiased and representative sample of all young Americans in 2013, thus any results based on this sample of 2089 young Americans can generalize to the entire population of all young Americans in 2013, thus the reported sample approval rating of 41% of these 2089 young Americans is a good guess of the true approval rating among all young Americans in 2013, thus instead of performing an expensive census of all young Americans in 2013, we can infer about all young Americans in 2013 using polling. So as you can see, it was critical for the sample obtained by Kennedy School’s Institute of Politics at Harvard University to be truly random in order to infer about all young Americans’ opinions about Obama. Was their sample truly random? It’s hard to answer such questions without knowing about the sampling methodology they used. For example, if this poll was conducted using only mobile phone numbers, people without mobile phones would be left out and therefore not represented in the sample. What about if Kennedy School’s Institute of Politics at Harvard University conducted this poll on an internet news site? Then people who don’t read this particular internet news site would be left out. Ensuring that our samples were random was easy to do in our sampling bowl exercises; however, in a real-life situation like the Obama poll, this is much harder to do. Learning check Comment on the representativeness of the following sampling methodologies: (LC7.21) The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force). (LC7.22) Imagine it is 1993, a time when almost all households had landlines. 
You want to know the average number of people in each household in your city. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey. (LC7.23) You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them, “How many times did you download a pirated TV show last week?”. (LC7.24) A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of five randomly chosen graduates, contact them, and obtain their answers. 7.5 Conclusion 7.5.1 Sampling scenarios In this chapter, we performed both tactile and virtual sampling exercises to infer about an unknown proportion. We also presented a case study of sampling in real life with polls. In each case, we used the sample proportion \\(\\widehat{p}\\) to estimate the population proportion \\(p\\). However, we are not just limited to scenarios related to proportions. In other words, we can use sampling to estimate other population parameters using other point estimates as well. We present four more such scenarios in Table 7.5. 
TABLE 7.5: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Fitted regression slope \\(b_1\\) or \\(\\widehat{\\beta}_1\\) In the rest of this book, we’ll cover all the remaining scenarios as follows: In Chapter 8, we’ll cover examples of statistical inference for Scenario 2: The mean age \\(\\mu\\) of all pennies in circulation in the US. Scenario 3: The difference \\(p_1 - p_2\\) in the proportion of people who yawn when seeing someone else yawn first minus the proportion of people who yawn without seeing someone else yawn first. This is an example of two-sample inference. In Chapter 9, we’ll cover an example of statistical inference for Scenario 4: The difference \\(\\mu_1 - \\mu_2\\) in mean IMDb ratings for action and romance movies. This is another example of two-sample inference. In Chapter 10, we’ll cover an example of statistical inference for regression by revisiting the regression models for teaching score as a function of various instructor demographic variables you saw in Chapters 5 and 6. Scenario 5: The slope \\(\\beta_1\\) of the population regression line. 7.5.2 Central Limit Theorem What you visualized in Figures 7.12 and 7.14 and summarized in Tables 7.1 and 7.3 was a demonstration of a famous theorem, or mathematically proven truth, called the Central Limit Theorem. 
It loosely states that when sample means are based on larger and larger sample sizes, the sampling distribution of these sample means becomes both more and more normally shaped and more and more narrow. In other words, their sampling distribution increasingly follows a normal distribution and the variation of these sampling distributions gets smaller, as quantified by their standard errors. Shuyi Chiou, Casey Dunn, and Pathikrit Bhattacharyya created a 3-minute and 38-second video at https://youtu.be/jvoxEYmQHNM explaining this crucial statistical theorem using the average weight of wild bunny rabbits and the average wingspan of dragons as examples. Figure 7.17 shows a preview of this video. FIGURE 7.17: Preview of Central Limit Theorem video. 7.5.3 Additional resources An R script file of all R code used in this chapter is available here. 7.5.4 What’s to come? Recall from our Obama poll case study in Section 7.4 that, based on this particular sample, the best guess by Kennedy School’s Institute of Politics at Harvard University of U.S. President Obama’s approval rating among all young Americans was 41%. However, this isn’t the end of the story. If you read the article further, it states: The online survey of 2,089 adults was conducted from Oct. 30 to Nov. 11, just weeks after the federal government shutdown ended and the problems surrounding the implementation of the Affordable Care Act began to take center stage. The poll’s margin of error was plus or minus 2.1 percentage points. Note the term margin of error, which here is “plus or minus 2.1 percentage points.” Most polls won’t produce an estimate that’s perfectly right; there will always be a certain amount of error caused by sampling variation. The margin of error of plus or minus 2.1 percentage points is saying that a typical range of errors for polls of this type is about \\(\\pm\\) 2.1%, in other words, from about 2.1% too small to about 2.1% too big. 
We can restate this as the interval of \\([41\\% - 2.1\\%, 41\\% + 2.1\\%] = [38.9\\%, 43.1\\%]\\) (this notation indicates the interval contains all values between 38.9% and 43.1%, including the end points of 38.9% and 43.1%). We’ll see in the next chapter that such intervals are known as confidence intervals. "],
+["8-confidence-intervals.html", "Chapter 8 Bootstrapping and Confidence Intervals 8.1 Pennies activity 8.2 Computer simulation of resampling 8.3 Understanding confidence intervals 8.4 Constructing confidence intervals 8.5 Interpreting confidence intervals 8.6 Case study: Is yawning contagious? 8.7 Conclusion", " Chapter 8 Bootstrapping and Confidence Intervals In Chapter 7, we studied sampling. We started with a “tactile” exercise where we wanted to know the proportion of balls in the sampling bowl in Figure 7.1 that are red. While we could have performed an exhaustive count, this would have been a tedious process. So instead, we used a shovel to extract a sample of 50 balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red. We then mimicked this “tactile” sampling exercise with an equivalent “virtual” sampling exercise performed on the computer. Using our computer’s random number generator, we quickly mimicked the above sampling procedure a large number of times. In Subsection 7.2.4, we quickly repeated this sampling procedure 1000 times, using three different “virtual” shovels with 25, 50, and 100 slots. We visualized these three sets of 1000 estimates in Figure 7.15 and saw that as the sample size increased, the variation in the estimates decreased. In doing so, what we did was construct sampling distributions. The motivation for taking 1000 repeated samples and visualizing the resulting estimates was to study how these estimates varied from one sample to another; in other words, we wanted to study the effect of sampling variation. We quantified the variation of these estimates using their standard deviation, which has a special name: the standard error. 
In particular, we saw that as the sample size increased from 25 to 50 to 100, the standard error decreased and thus the sampling distributions narrowed. Larger sample sizes led to more precise estimates that varied less around the center. We then tied these sampling exercises to terminology and mathematical notation related to sampling in Subsection 7.3.1. Our study population was the large bowl with \\(N\\) = 2400 balls, while the population parameter, the unknown quantity of interest, was the population proportion \\(p\\) of the bowl’s balls that were red. Since performing a census would be expensive in terms of time and energy, we instead extracted a sample of size \\(n\\) = 50. The point estimate, also known as a sample statistic, used to estimate \\(p\\) was the sample proportion \\(\\widehat{p}\\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and representative of the population. Thus any results based on the sample could be generalized to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that are red. In other words, we used the sample to infer about the population. However, as described in Section 7.2, both the tactile and virtual sampling exercises are not what one would do in real life; this was merely an activity used to study the effects of sampling variation. In a real-life situation, we would not take 1000 samples of size \\(n\\), but rather take a single representative sample that’s as large as possible. Additionally, we knew that the true proportion of the bowl’s balls that were red was 37.5%. In a real-life situation, we will not know what this value is. Because if we did, then why would we take a sample to estimate it? An example of a realistic sampling situation would be a poll, like the Obama poll you saw in Section 7.4. 
Pollsters did not know the true proportion of all young Americans who supported President Obama in 2013, and thus they took a single sample of size \\(n\\) = 2089 young Americans to estimate this value. So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrap resampling, which will be the focus of the earlier sections of this chapter. Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a range of highly plausible values? The Obama poll article stated that the pollsters’ estimate of the proportion of all young Americans who supported President Obama was 41%. In addition, it stated that the poll’s “margin of error was plus or minus 2.1 percentage points.” This “plausible range” was [41% - 2.1%, 41% + 2.1%] = [38.9%, 43.1%]. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this chapter. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to tidy format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(infer) 8.1 Pennies activity As we did in Chapter 7, we’ll begin with a hands-on tactile activity. 8.1.1 What is the average year on US pennies in 2019? Try to imagine all the pennies being used in the United States in 2019. 
That’s a lot of pennies! Now say we’re interested in the average year of minting of all these pennies. One way to compute this value would be to gather up all pennies being used in the US, record the year, and compute the average. However, this would be near impossible! So instead, let’s collect a sample of 50 pennies from a local bank in downtown Northampton, Massachusetts, USA as seen in Figure 8.1. FIGURE 8.1: Collecting a sample of 50 US pennies from a local bank. An image of these 50 pennies can be seen in Figure 8.2. For each of the 50 pennies starting in the top left, progressing row-by-row, and ending in the bottom right, we assigned an “ID” identification variable and marked the year of minting. FIGURE 8.2: 50 US pennies labelled. The moderndive package contains this data on our 50 sampled pennies in the pennies_sample data frame: pennies_sample # A tibble: 50 x 2 ID year &lt;int&gt; &lt;dbl&gt; 1 1 2002 2 2 1986 3 3 2017 4 4 1988 5 5 2008 6 6 1983 7 7 2008 8 8 1996 9 9 2004 10 10 2000 # … with 40 more rows The pennies_sample data frame has 50 rows corresponding to each penny with two variables. The first variable ID corresponds to the ID labels in Figure 8.2, whereas the second variable year corresponds to the year of minting saved as a numeric variable, also known as a double (dbl). Based on these 50 sampled pennies, what can we say about all US pennies in 2019? Let’s study some properties of our sample by performing an exploratory data analysis. Let’s first visualize the distribution of the year of these 50 pennies using our data visualization tools from Chapter 2. Since year is a numerical variable, we use a histogram in Figure 8.3 to visualize its distribution. ggplot(pennies_sample, aes(x = year)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) FIGURE 8.3: Distribution of year on 50 US pennies. Observe a slightly left-skewed distribution, since most pennies fall between 1980 and 2010 with only a few pennies older than 1970. 
What is the average year for the 50 sampled pennies? Eyeballing the histogram, it appears to be around 1990. Let’s now compute this value exactly using our data wrangling tools from Chapter 3. pennies_sample %&gt;% summarize(mean_year = mean(year)) # A tibble: 1 x 1 mean_year &lt;dbl&gt; 1 1995.44 Thus, if we’re willing to assume that pennies_sample is a representative sample from all US pennies, a “good guess” of the average year of minting of all US pennies would be 1995.44. In other words, around 1995. This should all start sounding similar to what we did previously in Chapter 7! In Chapter 7, our study population was the bowl of \\(N\\) = 2400 balls. Our population parameter was the population proportion of these balls that were red, denoted by \\(p\\). In order to estimate \\(p\\), we extracted a sample of 50 balls using the shovel. We then computed the relevant point estimate: the sample proportion of these 50 balls that were red, denoted mathematically by \\(\\widehat{p}\\). Here our population size \\(N\\) is however many pennies are being used in the US, a value which we don’t know and probably never will. The population parameter of interest is now the population mean year of all these pennies, a value denoted mathematically by the Greek letter \\(\\mu\\) (pronounced “mu”). In order to estimate \\(\\mu\\), we went to the bank and obtained a sample of 50 pennies and computed the relevant point estimate: the sample mean year of these 50 pennies, denoted mathematically by \\(\\overline{x}\\) (pronounced “x-bar”). An alternative and more intuitive notation for the sample mean is \\(\\widehat{\\mu}\\). However, this is unfortunately not as commonly used, so in this book we’ll stick with convention and always denote the sample mean as \\(\\overline{x}\\). We summarize the correspondence between the sampling bowl exercise in Chapter 7 and our pennies exercise in Table 8.1, which consists of the first two rows of the previously seen Table 7.5. 
TABLE 8.1: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) Going back to our 50 sampled pennies in Figure 8.2, the point estimate of interest is the sample mean \\(\\overline{x}\\) of 1995.44. This quantity is an estimate of the population mean year of all US pennies \\(\\mu\\). Recall that we also saw in Chapter 7 that such estimates are prone to sampling variation. For example, in this particular sample in Figure 8.2, we observed three pennies with the year 1999. If we sampled another 50 pennies, would we observe exactly three pennies with the year 1999 again? More than likely not. We might observe none, one, two, or maybe even all 50! The same can be said for the other 26 unique years that are represented in our sample of 50 pennies. To study the effects of sampling variation in Chapter 7, we took many samples, something we could easily do with our shovel. In our case with pennies, however, how would we obtain another sample? By going to the bank and getting another roll of 50 pennies. Say we’re feeling lazy, however, and don’t want to go back to the bank. How can we study the effects of sampling variation using our single sample? We will do so using a technique known as bootstrap resampling with replacement, which we now illustrate. 8.1.2 Resampling once Step 1: Let’s print out identically sized slips of paper representing our 50 pennies as seen in Figure 8.4. FIGURE 8.4: Step 1: 50 slips of paper representing 50 US pennies. Step 2: Put the 50 slips of paper into a hat or tuque as seen in Figure 8.5. FIGURE 8.5: Step 2: Putting 50 slips of paper in a hat. Step 3: Mix the hat’s contents and draw one slip of paper at random as seen in Figure 8.6. Record the year. FIGURE 8.6: Step 3: Drawing one slip of paper at random. 
Step 4: Put the slip of paper back in the hat! In other words, replace it as seen in Figure 8.7. FIGURE 8.7: Step 4: Replacing slip of paper. Step 5: Repeat Steps 3 and 4 a total of 49 more times, resulting in 50 recorded years. What we just performed was a resampling of the original sample of 50 pennies. We are not sampling 50 pennies from the population of all US pennies as we did in our trip to the bank. Instead, we are mimicking this act by resampling 50 pennies from our original sample of 50 pennies. Now ask yourselves, why did we replace our resampled slip of paper back into the hat in Step 4? Because if we left the slip of paper out of the hat each time we performed Step 4, we would end up with the same 50 original pennies! In other words, replacing the slips of paper induces sampling variation. Being more precise with our terminology, we just performed a resampling with replacement from the original sample of 50 pennies. Had we left the slip of paper out of the hat each time we performed Step 4, this would be resampling without replacement. Let’s study our 50 resampled pennies via an exploratory data analysis. First, let’s load the data into R by manually creating a data frame pennies_resample of our 50 resampled values. We’ll do this using the tibble() command from the dplyr package. Note that the 50 values you resample will almost certainly not be the same as ours given the inherent randomness. pennies_resample &lt;- tibble( year = c(1976, 1962, 1976, 1983, 2017, 2015, 2015, 1962, 2016, 1976, 2006, 1997, 1988, 2015, 2015, 1988, 2016, 1978, 1979, 1997, 1974, 2013, 1978, 2015, 2008, 1982, 1986, 1979, 1981, 2004, 2000, 1995, 1999, 2006, 1979, 2015, 1979, 1998, 1981, 2015, 2000, 1999, 1988, 2017, 1992, 1997, 1990, 1988, 2006, 2000) ) The 50 values of year in pennies_resample represent a resample of size 50 from the original sample of 50 pennies. We display the 50 resampled pennies in Figure 8.8. FIGURE 8.8: 50 resampled US pennies labelled. 
Let’s compare the distribution of the numerical variable year of our 50 resampled pennies with the distribution of the numerical variable year of our original sample of 50 pennies in Figure 8.9. ggplot(pennies_resample, aes(x = year)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) + labs(title = &quot;Resample of 50 pennies&quot;) ggplot(pennies_sample, aes(x = year)) + geom_histogram(binwidth = 10, color = &quot;white&quot;) + labs(title = &quot;Original sample of 50 pennies&quot;) FIGURE 8.9: Comparing year in the resampled pennies_resample with the original sample pennies_sample. Observe in Figure 8.9 that while the general shapes of both distributions of year are roughly similar, they are not identical. Recall from the previous section that the sample mean of the original sample of 50 pennies from the bank was 1995.44. What about for our resample? Any guesses? Let’s have dplyr help us out as before: pennies_resample %&gt;% summarize(mean_year = mean(year)) # A tibble: 1 x 1 mean_year &lt;dbl&gt; 1 1996 We obtained a different mean year of 1996. This variation is induced by the resampling with replacement we performed earlier. What if we repeated this resampling exercise many times? Would we obtain the same mean year each time? In other words, would our guess at the mean year of all pennies in the US in 2019 be exactly 1996 every time? Just as we did in Chapter 7, let’s perform this resampling activity with the help of some of our friends: 35 friends in total. 8.1.3 Resampling 35 times Each of our 35 friends will repeat the same five steps: Start with 50 identically sized slips of paper representing the 50 pennies. Put the 50 small pieces of paper into a hat or beanie cap. Mix the hat’s contents and draw one slip of paper at random. Record the year in a spreadsheet. Replace the slip of paper back in the hat! Repeat Steps 3 and 4 a total of 49 more times, resulting in 50 recorded years. 
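The five steps above can be sketched in base R. This is only an illustrative sketch under stated assumptions, not the procedure our friends followed: the vector years reuses the 50 values saved in pennies_resample earlier as a stand-in for the 50 slips of paper, and the names years, draw_resample_mean, and friend_means are hypothetical.

```r
# Stand-in for the 50 slips of paper: the 50 years saved in pennies_resample
# earlier in this section (an assumption made so the sketch is self-contained).
years <- c(1976, 1962, 1976, 1983, 2017, 2015, 2015, 1962, 2016, 1976,
           2006, 1997, 1988, 2015, 2015, 1988, 2016, 1978, 1979, 1997,
           1974, 2013, 1978, 2015, 2008, 1982, 1986, 1979, 1981, 2004,
           2000, 1995, 1999, 2006, 1979, 2015, 1979, 1998, 1981, 2015,
           2000, 1999, 1988, 2017, 1992, 1997, 1990, 1988, 2006, 2000)

set.seed(76)  # arbitrary seed so the sketch is reproducible

# One friend's work: draw a slip, record the year, put it back, 50 times over.
draw_resample_mean <- function(slips) {
  recorded <- numeric(length(slips))
  for (i in seq_along(slips)) {
    # Every slip is available on every draw, which is what replacing it means.
    recorded[i] <- slips[sample.int(length(slips), size = 1)]
  }
  mean(recorded)
}

# 35 friends each repeat the activity, giving 35 resample means.
friend_means <- replicate(35, draw_resample_mean(years))
length(friend_means)
range(friend_means)
```

Each call to draw_resample_mean() plays the role of one friend, so friend_means holds one mean year per friend, just as the shared spreadsheet holds one column of 50 years per friend.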
Since we had 35 of our friends perform this task, we ended up with \\(35 \\cdot 50 = 1750\\) values. We recorded these values in a shared spreadsheet with 50 rows (plus a header row) and 35 columns. We display a snapshot of the first 10 rows and five columns of this shared spreadsheet in Figure 8.10. FIGURE 8.10: Snapshot of shared spreadsheet of resampled pennies. For your convenience, we’ve taken these 35 \\(\\cdot\\) 50 = 1750 values and saved them in pennies_resamples, a “tidy” data frame included in the moderndive package. We saw what it means for a data frame to be “tidy” in Subsection 4.2.1. pennies_resamples # A tibble: 1,750 x 3 # Groups: name [35] replicate name year &lt;int&gt; &lt;chr&gt; &lt;dbl&gt; 1 1 Arianna 1988 2 1 Arianna 2002 3 1 Arianna 2015 4 1 Arianna 1998 5 1 Arianna 1979 6 1 Arianna 1971 7 1 Arianna 1971 8 1 Arianna 2015 9 1 Arianna 1988 10 1 Arianna 1979 # … with 1,740 more rows What did each of our 35 friends obtain as the mean year? Once again, dplyr to the rescue! After grouping the rows by name, we summarize each group of 50 rows by their mean year: resampled_means &lt;- pennies_resamples %&gt;% group_by(name) %&gt;% summarize(mean_year = mean(year)) resampled_means # A tibble: 35 x 2 name mean_year &lt;chr&gt; &lt;dbl&gt; 1 Arianna 1992.5 2 Artemis 1996.42 3 Bea 1996.32 4 Camryn 1996.9 5 Cassandra 1991.22 6 Cindy 1995.48 7 Claire 1995.52 8 Dahlia 1998.48 9 Dan 1993.86 10 Eindra 1993.56 # … with 25 more rows Observe that resampled_means has 35 rows corresponding to the 35 means based on the 35 resamples. Furthermore, observe the variation in the 35 values in the variable mean_year. Let’s visualize this variation using a histogram in Figure 8.11. Recall that adding the argument boundary = 1990 to the geom_histogram() sets the binning structure so that one of the bin boundaries is at 1990 exactly. 
ggplot(resampled_means, aes(x = mean_year)) + geom_histogram(binwidth = 1, color = &quot;white&quot;, boundary = 1990) + labs(x = &quot;Sampled mean year&quot;) FIGURE 8.11: Distribution of 35 sample means from 35 resamples. Observe in Figure 8.11 that the distribution looks roughly normal and that we rarely observe sample mean years less than 1992 or greater than 2000. Also observe how the distribution is roughly centered at 1995, which is close to the sample mean of 1995.44 of the original sample of 50 pennies from the bank. 8.1.4 What did we just do? What we just demonstrated in this activity is the statistical procedure known as bootstrap resampling with replacement. We used resampling to mimic the sampling variation we studied in Chapter 7 on sampling. However, in this case, we did so using only a single sample from the population. In fact, the histogram of sample means from 35 resamples in Figure 8.11 is called the bootstrap distribution. It is an approximation to the sampling distribution of the sample mean, in the sense that both distributions will have a similar shape and similar spread. In fact in the upcoming Section 8.7, we’ll show you that this is the case. Using this bootstrap distribution, we can study the effect of sampling variation on our estimates. In particular, we’ll study the typical “error” of our estimates, known as the standard error. In Section 8.2 we’ll mimic our tactile resampling activity virtually on the computer, allowing us to quickly perform the resampling many more than 35 times. In Section 8.3 we’ll define the statistical concept of a confidence interval, which builds off the concept of bootstrap distributions. In Section 8.4, we’ll construct confidence intervals using the dplyr package, as well as a new package: the infer package for “tidy” and transparent statistical inference. We’ll introduce the “tidy” statistical inference framework that was the motivation for the infer package pipeline. 
The infer package will be the driving package throughout the rest of this book. As we did in Chapter 7, we’ll tie all these ideas together with a real-life case study in Section 8.6. This time we’ll look at data from an experiment about yawning from the US television show Mythbusters. 8.2 Computer simulation of resampling Let’s now mimic our tactile resampling activity virtually with a computer. 8.2.1 Virtually resampling once First, let’s perform the virtual analog of resampling once. Recall that the pennies_sample data frame included in the moderndive package contains the years of our original sample of 50 pennies from the bank. Furthermore, recall in Chapter 7 on sampling that we used the rep_sample_n() function as a virtual shovel to sample balls from our virtual bowl of 2400 balls as follows: virtual_shovel &lt;- bowl %&gt;% rep_sample_n(size = 50) Let’s modify this code to perform the resampling with replacement of the 50 slips of paper representing our original sample of 50 pennies: virtual_resample &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE) Observe how we explicitly set the replace argument to TRUE in order to tell rep_sample_n() that we would like to sample pennies with replacement. Had we not set replace = TRUE, the function would’ve assumed the default value of FALSE and hence done resampling without replacement. Additionally, since we didn’t specify the number of replicates via the reps argument, the function assumes the default of one replicate, reps = 1. Lastly, observe also that the size argument is set to match the original sample size of 50 pennies. 
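To see why the replace argument matters, here is a minimal sketch that uses base R’s sample() function as a stand-in for rep_sample_n(). It rests on assumptions: the vector years holds only the first 10 years printed from pennies_sample earlier (not the full sample of 50), and the names years, without_replacement, and with_replacement are hypothetical.

```r
# First 10 years printed from pennies_sample earlier, used as a small stand-in
# for the full sample of 50 (an assumption to keep the sketch self-contained).
years <- c(2002, 1986, 2017, 1988, 2008, 1983, 2008, 1996, 2004, 2000)

set.seed(2019)  # arbitrary seed for reproducibility

# Without replacement: a reshuffle of the same values, so the mean never changes.
without_replacement <- sample(years, size = length(years), replace = FALSE)
all.equal(mean(without_replacement), mean(years))

# With replacement: some values can repeat and others can be left out,
# so the mean typically differs from one resample to the next.
with_replacement <- sample(years, size = length(years), replace = TRUE)
mean(with_replacement)
```

Resampling without replacement at the original sample size only reorders the data, which is why the bootstrap needs replace = TRUE to induce any variation in the resample means.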
Let’s look at only the first 10 out of 50 rows of virtual_resample: virtual_resample # A tibble: 50 x 3 # Groups: replicate [1] replicate ID year &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 37 1962 2 1 1 2002 3 1 45 1997 4 1 28 2006 5 1 50 2017 6 1 10 2000 7 1 16 2015 8 1 47 1982 9 1 23 1998 10 1 44 2015 # … with 40 more rows The replicate variable only takes on the value of 1 corresponding to us only having reps = 1, the ID variable indicates which of the 50 pennies from pennies_sample was resampled, and year denotes the year of minting. Let’s now compute the mean year in our virtual resample of size 50 using data wrangling functions included in the dplyr package: virtual_resample %&gt;% summarize(resample_mean = mean(year)) # A tibble: 1 x 2 replicate resample_mean &lt;int&gt; &lt;dbl&gt; 1 1 1996 As we saw when we did our tactile resampling exercise, the resulting mean year is different from the mean year of our 50 originally sampled pennies of 1995.44. 8.2.2 Virtually resampling 35 times Let’s now perform the virtual analog of our 35 friends’ resampling. Using these results, we’ll be able to study the variability in the sample means from 35 resamples of size 50. Let’s first add a reps = 35 argument to rep_sample_n() to indicate we would like 35 replicates. Thus, we want to repeat the resampling with replacement of 50 pennies 35 times. virtual_resamples &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 35) virtual_resamples # A tibble: 1,750 x 3 # Groups: replicate [35] replicate ID year &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 1 21 1981 2 1 34 1985 3 1 4 1988 4 1 11 1994 5 1 26 1979 6 1 8 1996 7 1 19 1983 8 1 21 1981 9 1 49 2006 10 1 2 1986 # … with 1,740 more rows The resulting virtual_resamples data frame has 35 \\(\\cdot\\) 50 = 1750 rows corresponding to 35 resamples of 50 pennies. 
Let’s now compute the resulting 35 sample means using the same dplyr code as we did in the previous section, but this time adding a group_by(replicate): virtual_resampled_means &lt;- virtual_resamples %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) virtual_resampled_means # A tibble: 35 x 2 replicate mean_year &lt;int&gt; &lt;dbl&gt; 1 1 1995.58 2 2 1999.74 3 3 1993.7 4 4 1997.1 5 5 1999.42 6 6 1995.12 7 7 1994.94 8 8 1997.78 9 9 1991.26 10 10 1996.88 # … with 25 more rows Observe that virtual_resampled_means has 35 rows, corresponding to the 35 resampled means. Furthermore, observe that the values of mean_year vary. Let’s visualize this variation using a histogram in Figure 8.12. ggplot(virtual_resampled_means, aes(x = mean_year)) + geom_histogram(binwidth = 1, color = &quot;white&quot;, boundary = 1990) + labs(x = &quot;Resample mean year&quot;) FIGURE 8.12: Distribution of 35 sample means from 35 resamples. Let’s compare our virtually constructed bootstrap distribution with the one our 35 friends constructed via our tactile resampling exercise in Figure 8.13. Observe how they are somewhat similar, but not identical. FIGURE 8.13: Comparing distributions of means from resamples. Recall that in the “resampling with replacement” scenario we are illustrating here, both of these histograms have a special name: the bootstrap distribution of the sample mean. Furthermore, recall they are an approximation to the sampling distribution of the sample mean, a concept you saw in Chapter 7 on sampling. These distributions allow us to study the effect of sampling variation on our estimates of the true population mean, in this case the true mean year for all US pennies. However, unlike in Chapter 7 where we took multiple samples (something one would never do in practice), bootstrap distributions are constructed by taking multiple resamples from a single sample: in this case, the 50 original pennies from the bank. 
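The contrast between taking multiple samples from a population and taking multiple resamples from a single sample can be sketched in base R. Everything here is an assumption made for illustration: the population of minting years is made up (we never observe the real one), and the names population, sample_means, one_sample, and resample_means are hypothetical.

```r
set.seed(8)  # arbitrary seed for reproducibility

# Hypothetical population of minting years (made up purely for illustration).
population <- sample(1960:2019, size = 100000, replace = TRUE)

# Chapter 7 style: many samples of size 50 drawn from the population itself.
sample_means <- replicate(
  1000,
  mean(sample(population, size = 50))
)

# Bootstrap style: a single sample of size 50, then many resamples of it.
one_sample <- sample(population, size = 50)
resample_means <- replicate(
  1000,
  mean(sample(one_sample, size = 50, replace = TRUE))
)

# Spread of each distribution, measured by the standard deviation of the means.
sd(sample_means)
sd(resample_means)
```

The two standard deviations should be of similar size, while the centers can differ: the resample means cluster around the mean of one_sample rather than around the mean of the population.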
8.2.3 Virtually resampling 1000 times Remember that one of the goals of resampling with replacement is to construct the bootstrap distribution, which is an approximation of the sampling distribution. However, the bootstrap distribution in Figure 8.12 is based only on 35 resamples and hence looks a little coarse. Let’s increase the number of resamples to 1000, so that we can hopefully better see the shape and the variability between different resamples. # Repeat resampling 1000 times virtual_resamples &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) # Compute 1000 sample means virtual_resampled_means &lt;- virtual_resamples %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) However, in the interest of brevity, going forward let’s combine these two operations into a single chain of pipe (%&gt;%) operators: virtual_resampled_means &lt;- pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) virtual_resampled_means # A tibble: 1,000 x 2 replicate mean_year &lt;int&gt; &lt;dbl&gt; 1 1 1992.6 2 2 1994.78 3 3 1994.74 4 4 1997.88 5 5 1990 6 6 1999.48 7 7 1990.26 8 8 1993.2 9 9 1994.88 10 10 1996.3 # … with 990 more rows In Figure 8.14 let’s visualize the bootstrap distribution of these 1000 means based on 1000 virtual resamples: ggplot(virtual_resampled_means, aes(x = mean_year)) + geom_histogram(binwidth = 1, color = &quot;white&quot;, boundary = 1990) + labs(x = &quot;sample mean&quot;) FIGURE 8.14: Bootstrap resampling distribution based on 1000 resamples. Note here that the bell shape is starting to become much more apparent. We now have a general sense for the range of values that the sample mean may take on. But where is this histogram centered? 
Let’s compute the mean of the 1000 resample means: virtual_resampled_means %&gt;% summarize(mean_of_means = mean(mean_year)) # A tibble: 1 x 1 mean_of_means &lt;dbl&gt; 1 1995.36 The mean of these 1000 means is 1995.36, which is quite close to the mean of our original sample of 50 pennies of 1995.44. This is the case since each of the 1000 resamples is based on the original sample of 50 pennies. Congratulations! You’ve just constructed your first bootstrap distribution! In the next section, you’ll see how to use this bootstrap distribution to construct confidence intervals. Learning check (LC8.1) What is the chief difference between a bootstrap distribution and a sampling distribution? (LC8.2) Looking at the bootstrap distribution for the sample mean in Figure 8.14, between what two values would you say most values lie? 8.3 Understanding confidence intervals Let’s start this section with an analogy involving fishing. Say you are trying to catch a fish. On the one hand, you could use a spear, while on the other you could use a net. Using the net will probably allow you to catch more fish! Now think back to our pennies exercise where you are trying to estimate the true population mean year \\(\\mu\\) of all US pennies. Think of the value of \\(\\mu\\) as a fish. On the one hand, we could use the appropriate point estimate/sample statistic to estimate \\(\\mu\\), which we saw in Table 8.1 is the sample mean \\(\\overline{x}\\). Based on our sample of 50 pennies from the bank, the sample mean was 1995.44. Think of using this value as “fishing with a spear.” What would “fishing with a net” correspond to? Look at the bootstrap distribution in Figure 8.14 once more. Between which two years would you say that “most” sample means lie? While this question is somewhat subjective, saying that most sample means lie between 1992 and 2000 would not be unreasonable. 
Think of this interval as the “net.” What we’ve just illustrated is the concept of a confidence interval, which we’ll abbreviate with “CI” throughout this book. As opposed to a point estimate/sample statistic that estimates the value of an unknown population parameter with a single value, a confidence interval gives what can be interpreted as a range of plausible values. Going back to our analogy, point estimates/sample statistics can be thought of as spears, whereas confidence intervals can be thought of as nets. FIGURE 8.15: Analogy of difference between point estimates and confidence intervals. Our proposed interval of 1992 to 2000 was constructed by eye and was thus somewhat subjective. We now introduce two methods for constructing such intervals in a more exact fashion: the percentile method and the standard error method. Both methods for confidence interval construction share some commonalities. First, they are both constructed from a bootstrap distribution, as you constructed in Subsection 8.2.3 and visualized in Figure 8.14. Second, they both require you to specify the confidence level. Commonly used confidence levels include 90%, 95%, and 99%. All other things being equal, higher confidence levels correspond to wider confidence intervals, and lower confidence levels correspond to narrower confidence intervals. In this book, we’ll be mostly using 95% and hence constructing “95% confidence intervals for \\(\\mu\\)” for our pennies activity. 8.3.1 Percentile method One method to construct a confidence interval is to use the middle 95% of values of the bootstrap distribution. We can do this by computing the 2.5th and 97.5th percentiles, which are 1991.059 and 1999.283, respectively. This is known as the percentile method for constructing confidence intervals. For now, let’s focus only on the concepts behind a percentile method constructed confidence interval; we’ll show you the code that computes these values in the next section. 
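As a rough preview of the idea behind the percentile method, the following base-R sketch takes the middle 95% of a vector of bootstrap means using quantile(). The vector boot_means is a simulated, hypothetical stand-in for the mean_year column of virtual_resampled_means, so its endpoints will not match the values quoted above; this is not the infer code the book uses.

```r
# Sketch of the percentile method on a stand-in bootstrap distribution.
set.seed(2019)  # arbitrary seed for reproducibility

# Hypothetical stand-in for the 1000 bootstrap means (roughly bell-shaped)
boot_means <- rnorm(1000, mean = 1995.36, sd = 2.15)

level <- 0.95
alpha <- 1 - level

# Lower endpoint = 2.5th percentile, upper endpoint = 97.5th percentile
endpoints <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
endpoints

# Share of bootstrap means that fall inside the interval (about 0.95)
share_inside <- mean(boot_means >= endpoints[1] & boot_means <= endpoints[2])
share_inside
```

Changing level to 0.90 or 0.99 moves the two percentiles inward or outward, which is why higher confidence levels give wider intervals.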
Let’s mark these percentiles on the bootstrap distribution with vertical lines in Figure 8.16. About 95% of the mean_year variable values in virtual_resampled_means fall between 1991.059 and 1999.283, with 2.5% to the left of the leftmost line and 2.5% to the right of the rightmost line. FIGURE 8.16: Percentile method 95% confidence interval. Interval endpoints marked by vertical lines. 8.3.2 Standard error method Recall in Appendix A.2, we saw that if a numerical variable follows a normal distribution, or, in other words, the histogram of this variable is bell-shaped, then roughly 95% of values fall between \\(\\pm\\) 1.96 standard deviations of the mean. Given that our bootstrap distribution based on 1000 resamples with replacement in Figure 8.14 is normally shaped, let’s use this fact about normal distributions to construct a confidence interval in a different way. First, recall the bootstrap distribution has a mean equal to 1995.36. This value almost coincides exactly with the value of the sample mean \\(\\overline{x}\\) of our original 50 pennies of 1995.44. Second, let’s compute the standard deviation of the bootstrap distribution using the values of mean_year in the virtual_resampled_means data frame: virtual_resampled_means %&gt;% summarize(SE = sd(mean_year)) # A tibble: 1 x 1 SE &lt;dbl&gt; 1 2.15466 What is this value? Recall that the bootstrap distribution is an approximation to the sampling distribution. Recall also that the standard deviation of a sampling distribution has a special name: the standard error. Putting these two facts together, we can say that 2.155 is an approximation of the standard error of \\(\\overline{x}\\). 
Thus, using our 95% rule of thumb about normal distributions from Appendix A.2, we can use the following formula to determine the lower and upper endpoints of a 95% confidence interval for \\(\\mu\\): \\[ \\begin{aligned} \\overline{x} \\pm 1.96 \\cdot SE &amp;= (\\overline{x} - 1.96 \\cdot SE, \\overline{x} + 1.96 \\cdot SE)\\\\ &amp;= (1995.44 - 1.96 \\cdot 2.155, 1995.44 + 1.96 \\cdot 2.155)\\\\ &amp;= (1991.22, 1999.66) \\end{aligned} \\] Let’s now add the SE method confidence interval with dashed lines in Figure 8.17. FIGURE 8.17: Comparing two 95% confidence interval methods. We see that both methods produce nearly identical 95% confidence intervals for \\(\\mu\\) with the percentile method yielding \\((1991.06, 1999.28)\\) while the standard error method produces \\((1991.22, 1999.66)\\). However, recall that we can only use the standard error rule when the bootstrap distribution is roughly normally shaped. Now that we’ve introduced the concept of confidence intervals and laid out the intuition behind two methods for constructing them, let’s explore the code that allows us to construct them. Learning check (LC8.3) What condition about the bootstrap distribution must be met for us to be able to construct confidence intervals using the standard error method? (LC8.4) Say we wanted to construct a 68% confidence interval instead of a 95% confidence interval for \\(\\mu\\). Describe what changes are needed to make this happen. Hint: we suggest you look at Appendix A.2 on the normal distribution. 8.4 Constructing confidence intervals Recall that the process of resampling with replacement we performed by hand in Section 8.1 and virtually in Section 8.2 is known as bootstrapping. 
The term bootstrapping originates in the expression of “pulling oneself up by their bootstraps,” meaning to “succeed only by one’s own efforts or abilities.” From a statistical perspective, bootstrapping alludes to succeeding in being able to study the effects of sampling variation on estimates from the “effort” of a single sample. Or more precisely, it refers to constructing an approximation to the sampling distribution using only one sample. To perform this resampling with replacement virtually in Section 8.2, we used the rep_sample_n() function, making sure that the size of the resamples matched the original sample size of 50. In this section, we’ll build off these ideas to construct confidence intervals using a new package: the infer package for “tidy” and transparent statistical inference. 8.4.1 Original workflow Recall that in Section 8.2, we virtually performed bootstrap resampling with replacement to construct bootstrap distributions. Such distributions are approximations to the sampling distributions we saw in Chapter 7, but are constructed using only a single sample. Let’s revisit the original workflow using the %&gt;% pipe operator. First, we used the rep_sample_n() function to resample size = 50 pennies with replacement from the original sample of 50 pennies in pennies_sample by setting replace = TRUE. 
Furthermore, we repeated this resampling 1000 times by setting reps = 1000: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) Second, since for each of our 1000 resamples of size 50, we wanted to compute a separate sample mean, we used the dplyr verb group_by() to group observations/rows together by the replicate variable… pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) … followed by using summarize() to compute the sample mean() year for each replicate group: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) %&gt;% summarize(mean_year = mean(year)) For this simple case, we can get by with using the rep_sample_n() function and a couple of dplyr verbs to construct the bootstrap distribution. However, dplyr verbs alone provide us with only a limited set of tools. For more complicated situations, we’ll need a little more firepower. Let’s repeat this using the infer package. 8.4.2 infer package workflow The infer package is an R package for statistical inference. It makes efficient use of the %&gt;% pipe operator we introduced in Section 3.1 to spell out the sequence of steps necessary to perform statistical inference in a “tidy” and transparent fashion. Furthermore, just as the dplyr package provides functions with verb-like names to perform data wrangling, the infer package provides functions with intuitive verb-like names to perform statistical inference. Let’s go back to our pennies. Previously, we computed the value of the sample mean \\(\\overline{x}\\) using the dplyr function summarize(): pennies_sample %&gt;% summarize(stat = mean(year)) We’ll see that we can also do this using infer functions specify() and calculate(): pennies_sample %&gt;% specify(response = year) %&gt;% calculate(stat = &quot;mean&quot;) You might be asking yourself: “Isn’t the infer code longer? Why would I use that code?”. 
While not immediately apparent, you’ll see that there are three chief benefits to the infer workflow as opposed to the dplyr workflow. First, the infer verb names better align with the overall resampling framework you need to understand to construct confidence intervals and to conduct hypothesis tests (in Chapter 9). We’ll see flowchart diagrams of this framework in the upcoming Figure 8.23 and in Chapter 9 with Figure 9.14. Second, you can jump back and forth seamlessly between confidence intervals and hypothesis testing with minimal changes to your code. This will become apparent in Subsection 9.3.2 when we’ll compare the infer code for both of these inferential methods. Third, the infer workflow is much simpler for conducting inference when you have more than one variable. We’ll see two such situations. We’ll first see situations of two-sample inference where the sample data is collected from two groups, such as in Section 8.6 where we study the contagiousness of yawning and in Section 9.1 where we compare promotion rates of two groups at banks in the 1970s. Then in Section 10.4, we’ll see situations of inference for regression using the regression models you fit in Chapter 5. Let’s now illustrate the sequence of verbs necessary to construct a confidence interval for \\(\\mu\\), the population mean year of minting of all US pennies in 2019. 1. specify variables FIGURE 8.18: Diagram of the specify() verb. As shown in Figure 8.18, the specify() function is used to choose which variables in a data frame will be the focus of our statistical inference. We do this by specifying the response argument. 
For example, in our pennies_sample data frame of the 50 pennies sampled from the bank, the variable of interest is year: pennies_sample %&gt;% specify(response = year) Response: year (numeric) # A tibble: 50 x 1 year &lt;dbl&gt; 1 2002 2 1986 3 2017 4 1988 5 2008 6 1983 7 2008 8 1996 9 2004 10 2000 # … with 40 more rows Notice how the data itself doesn’t change, but the Response: year (numeric) meta-data does. This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Section 3.4. We can also specify which variables will be the focus of our statistical inference using a formula = y ~ x. This is the same formula notation you saw in Chapters 5 and 6 on regression models: the response variable y is separated from the explanatory variable x by a ~ (“tilde”). The following use of specify() with the formula argument yields the same result seen previously: pennies_sample %&gt;% specify(formula = year ~ NULL) Since in the case of pennies we only have a response variable and no explanatory variable of interest, we set the x on the right-hand side of the ~ to be NULL. While in the case of the pennies either specification works just fine, we’ll see examples later on where the formula specification is simpler. In particular, this comes up in the upcoming Section 8.6 on comparing two proportions and Section 10.4 on inference for regression. 2. generate replicates FIGURE 8.19: Diagram of generate() replicates. After we specify() the variables of interest, we pipe the results into the generate() function to generate replicates. Figure 8.19 shows how this is combined with specify() to start the pipeline. In other words, repeat the resampling process a large number of times. Recall in Sections 8.2.2 and 8.2.3 we did this 35 and 1000 times. The generate() function’s first argument is reps, which sets the number of replicates we would like to generate. 
Since we want to resample the 50 pennies in pennies_sample with replacement 1000 times, we set reps = 1000. The second argument type determines the type of computer simulation we’d like to perform. We set this to type = &quot;bootstrap&quot; indicating that we want to perform bootstrap resampling. You’ll see different options for type in Chapter 9. pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) Response: year (numeric) # A tibble: 50,000 x 2 # Groups: replicate [1,000] replicate year &lt;int&gt; &lt;dbl&gt; 1 1 1981 2 1 1988 3 1 2006 4 1 2016 5 1 2002 6 1 1985 7 1 1979 8 1 2000 9 1 2006 10 1 2016 # … with 49,990 more rows Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 pennies with replacement 1000 times and 50,000 = 50 \\(\\cdot\\) 1000. The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times. The default value of the type argument is &quot;bootstrap&quot; in this scenario, so if the last line was written as generate(reps = 1000), we’d obtain the same results. Comparing with original workflow: Note that the steps of the infer workflow so far produce the same results as the original workflow using the rep_sample_n() function we saw earlier. In other words, the following two code chunks produce similar results: # infer workflow: pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) # Original workflow: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) 3. calculate summary statistics FIGURE 8.20: Diagram of calculate() summary statistics. After we generate() many replicates of bootstrap resampling with replacement, we next want to summarize each of the 1000 resamples of size 50 to a single sample statistic value. As seen in the diagram, the calculate() function does this. 
In our case, we want to calculate the mean year for each bootstrap resample of size 50. To do so, we set the stat argument to &quot;mean&quot;. You can also set the stat argument to a variety of other common summary statistics, like &quot;median&quot;, &quot;sum&quot;, &quot;sd&quot; (standard deviation), and &quot;prop&quot; (proportion). To see a list of all possible summary statistics you can use, type ?calculate and read the help file. Let’s save the result in a data frame called bootstrap_distribution and explore its contents: bootstrap_distribution &lt;- pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) %&gt;% calculate(stat = &quot;mean&quot;) bootstrap_distribution # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 1995.7 2 2 1994.04 3 3 1993.62 4 4 1994.5 5 5 1994.08 6 6 1993.6 7 7 1995.26 8 8 1996.64 9 9 1994.3 10 10 1995.94 # … with 990 more rows Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 replicate values. It also has the mean year for each bootstrap resample saved in the variable stat. Comparing with original workflow: You may have recognized at this point that the calculate() step in the infer workflow produces the same output as the group_by() %&gt;% summarize() steps in the original workflow. # infer workflow: pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) %&gt;% calculate(stat = &quot;mean&quot;) # Original workflow: pennies_sample %&gt;% rep_sample_n(size = 50, replace = TRUE, reps = 1000) %&gt;% group_by(replicate) %&gt;% summarize(stat = mean(year)) 4. visualize the results FIGURE 8.21: Diagram of visualize() results. The visualize() verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical stat variable’s values. The pipeline of the main infer verbs used for exploring bootstrap distribution results is shown in Figure 8.21. 
visualize(bootstrap_distribution) FIGURE 8.22: Bootstrap distribution. Comparing with original workflow: In fact, visualize() is a wrapper function for the ggplot() function that uses a geom_histogram() layer. Recall that we illustrated the concept of a wrapper function in Figure 5.5 in Subsection 5.1.2. # infer workflow: visualize(bootstrap_distribution) # Original workflow: ggplot(bootstrap_distribution, aes(x = stat)) + geom_histogram() The visualize() function can take many other arguments which we’ll see momentarily to customize the plot further. It also works with helper functions to do the shading of the histogram values corresponding to the confidence interval values. Let’s recap the steps of the infer workflow for constructing a bootstrap distribution and then visualizing it in Figure 8.23. FIGURE 8.23: infer package workflow for confidence intervals. Recall how we introduced two different methods for constructing 95% confidence intervals for an unknown population parameter in Section 8.3: the percentile method and the standard error method. Let’s now check out the infer package code that explicitly constructs these. There are also some additional neat functions for visualizing the resulting confidence intervals built into the infer package! 8.4.3 Percentile method with infer Recall the percentile method for constructing 95% confidence intervals we introduced in Subsection 8.3.1. This method sets the lower endpoint of the confidence interval at the 2.5th percentile of the bootstrap distribution and similarly sets the upper endpoint at the 97.5th percentile. The resulting interval captures the middle 95% of the values of the sample mean in the bootstrap distribution. We can compute the 95% confidence interval by piping bootstrap_distribution into the get_confidence_interval() function from the infer package, with the confidence level set to 0.95 and the confidence interval type to be &quot;percentile&quot;. Let’s save the results in percentile_ci. 
percentile_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 1991.24 1999.42 Alternatively, we can visualize the interval (1991.24, 1999.42) by piping the bootstrap_distribution data frame into the visualize() function and adding a shade_confidence_interval() layer. We set the endpoints argument to be percentile_ci. visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = percentile_ci) FIGURE 8.24: Percentile method 95% confidence interval shaded corresponding to potential values. Observe in Figure 8.24 that 95% of the sample means stored in the stat variable in bootstrap_distribution fall between the two endpoints marked with the darker lines, with 2.5% of the sample means to the left of the shaded area and 2.5% of the sample means to the right. You also have the option to change the colors of the shading using the color and fill arguments. You can also use the shorter named function shade_ci() and the results will be the same. This is for folks who don’t want to type out all of confidence_interval and prefer to type out ci instead. Try out the following code! visualize(bootstrap_distribution) + shade_ci(endpoints = percentile_ci, color = &quot;hotpink&quot;, fill = &quot;khaki&quot;) 8.4.4 Standard error method with infer Recall the standard error method for constructing 95% confidence intervals we introduced in Subsection 8.3.2. For any distribution that is normally shaped, roughly 95% of the values lie within two standard deviations of the mean. In the case of the bootstrap distribution, the standard deviation has a special name: the standard error. So in our case, 95% of values of the bootstrap distribution will lie within \\(\\pm 1.96\\) standard errors of \\(\\overline{x}\\). 
Thus, a 95% confidence interval is \\[\\overline{x} \\pm 1.96 \\cdot SE = (\\overline{x} - 1.96 \\cdot SE, \\, \\overline{x} + 1.96 \\cdot SE).\\] Computation of the 95% confidence interval can once again be done by piping the bootstrap_distribution data frame we created into the get_confidence_interval() function. However, this time there are two changes. First, we set the type argument to be &quot;se&quot;. Second, we must specify the point_estimate argument in order to set the center of the confidence interval. We set this to be the sample mean of the original sample of 50 pennies of 1995.44. x_bar # A tibble: 1 x 1 mean_year &lt;dbl&gt; 1 1995.44 standard_error_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(type = &quot;se&quot;, point_estimate = x_bar) standard_error_ci # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 1991.35 1999.53 If we would like to visualize the interval (1991.35, 1999.53), we can once again pipe the bootstrap_distribution data frame into the visualize() function and add a shade_confidence_interval() layer to our plot. We set the endpoints argument to be standard_error_ci. The resulting standard-error-method 95% confidence interval for \\(\\mu\\) can be seen in Figure 8.25. visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = standard_error_ci) FIGURE 8.25: Standard-error-method 95% confidence interval. As noted in Section 8.3, both methods produce similar confidence intervals: Percentile method: (1991.24, 1999.42) Standard error method: (1991.35, 1999.53) Learning check (LC8.5) Construct a 95% confidence interval for the median year of minting of all US pennies. Use the percentile method and, if appropriate, then use the standard-error method. 8.5 Interpreting confidence intervals Now that we’ve shown you how to construct confidence intervals using a sample drawn from a population, let’s now focus on how to interpret their effectiveness. 
The effectiveness of a confidence interval is judged by whether or not it contains the true value of the population parameter. Going back to our fishing analogy in Section 8.3, this is like asking, “Did our net capture the fish?”. So, for example, does our percentile-based confidence interval of (1991.24, 1999.42) “capture” the true mean year \\(\\mu\\) of all US pennies? Alas, we’ll never know, because we don’t know what the true value of \\(\\mu\\) is. After all, we’re sampling to estimate it! In order to interpret a confidence interval’s effectiveness, we need to know what the value of the population parameter is. That way we can say whether or not a confidence interval “captured” this value. Let’s revisit our sampling bowl from Chapter 7. What proportion of the bowl’s 2400 balls are red? Let’s compute this: bowl %&gt;% summarize(p_red = mean(color == &quot;red&quot;)) # A tibble: 1 x 1 p_red &lt;dbl&gt; 1 0.375 In this case, we know what the value of the population parameter is: we know that the population proportion \\(p\\) is 0.375. In other words, we know that 37.5% of the bowl’s balls are red. As we stated in Subsection 7.3.3, the sampling bowl exercise doesn’t really reflect how sampling is done in real life, but rather was an idealized activity. In real life, we won’t know what the true value of the population parameter is, hence the need for estimation. Let’s now construct confidence intervals for \\(p\\) using our 33 groups of friends’ samples from the bowl in Chapter 7. We’ll then see if the confidence intervals “captured” the true value of \\(p\\), which we know to be 37.5%. That is to say, “Did the net capture the fish?”. 8.5.1 Did the net capture the fish? Recall that we had 33 groups of friends each take samples of size 50 from the bowl and then compute the sample proportion of red balls \\(\\widehat{p}\\). This resulted in 33 such estimates of \\(p\\). 
Let’s focus on Ilyas and Yohan’s sample, which is saved in the bowl_sample_1 data frame in the moderndive package: bowl_sample_1 # A tibble: 50 x 1 color &lt;chr&gt; 1 white 2 white 3 red 4 red 5 white 6 white 7 red 8 white 9 white 10 white # … with 40 more rows They observed 21 red balls out of 50 and thus their sample proportion \\(\\widehat{p}\\) was 21/50 = 0.42 = 42%. Think of this as the “spear” from our fishing analogy. Let’s now follow the infer package workflow from Subsection 8.4.2 to create a percentile-method-based 95% confidence interval for \\(p\\) using Ilyas and Yohan’s sample. Think of this as the “net.” 1. specify variables First, we specify() the response variable of interest color: bowl_sample_1 %&gt;% specify(response = color) Error: A level of the response variable `color` needs to be specified for the `success` argument in `specify()`. Whoops! We need to define which event is of interest! red or white balls? Since we are interested in the proportion red, let’s set success to be &quot;red&quot;: bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) Response: color (factor) # A tibble: 50 x 1 color &lt;fct&gt; 1 white 2 white 3 red 4 red 5 white 6 white 7 red 8 white 9 white 10 white # … with 40 more rows 2. generate replicates Second, we generate() 1000 replicates of bootstrap resampling with replacement from bowl_sample_1 by setting reps = 1000 and type = &quot;bootstrap&quot;. bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) Response: color (factor) # A tibble: 50,000 x 2 # Groups: replicate [1,000] replicate color &lt;int&gt; &lt;fct&gt; 1 1 white 2 1 white 3 1 white 4 1 white 5 1 red 6 1 white 7 1 white 8 1 white 9 1 white 10 1 red # … with 49,990 more rows Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 balls with replacement 1000 times and thus 50,000 = 50 \\(\\cdot\\) 1000. 
The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times. 3. calculate summary statistics Third, we summarize each of the 1000 resamples of size 50 with the proportion of successes, in other words, the proportion of the balls that are &quot;red&quot;. We can set the summary statistic to be calculated as the proportion by setting the stat argument to be &quot;prop&quot;. Let’s save the result as sample_1_bootstrap: sample_1_bootstrap &lt;- bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;prop&quot;) sample_1_bootstrap # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.32 2 2 0.42 3 3 0.44 4 4 0.4 5 5 0.44 6 6 0.52 7 7 0.38 8 8 0.44 9 9 0.34 10 10 0.42 # … with 990 more rows Observe there are 1000 rows in this data frame and thus 1000 values of the variable stat. These 1000 values of stat represent our 1000 replicated values of the proportion, each based on a different resample. 4. visualize the results Fourth and lastly, let’s compute the resulting 95% confidence interval. percentile_ci_1 &lt;- sample_1_bootstrap %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci_1 # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.3 0.56 Let’s visualize the bootstrap distribution along with the percentile_ci_1 percentile-based 95% confidence interval for \\(p\\) in Figure 8.26. We’ll adjust the number of bins to better see the resulting shape. Furthermore, we’ll add a dashed vertical line at Ilyas and Yohan’s observed \\(\\widehat{p}\\) = 21/50 = 0.42 = 42% using geom_vline(). sample_1_bootstrap %&gt;% visualize(bins = 15) + shade_confidence_interval(endpoints = percentile_ci_1) + geom_vline(xintercept = 0.42, linetype = &quot;dashed&quot;) FIGURE 8.26: Bootstrap distribution. 
Did Ilyas and Yohan’s net capture the fish? Did their 95% confidence interval for \\(p\\) based on their sample contain the true value of \\(p\\) of 0.375? Yes! 0.375 is between the endpoints of their confidence interval (0.3, 0.56). However, will every 95% confidence interval for \\(p\\) capture this value? In other words, if we had a different sample of 50 balls and constructed a different confidence interval, would it necessarily contain \\(p\\) = 0.375 as well? Let’s see! Let’s first take a different sample from the bowl, this time using the computer as we did in Chapter 7: bowl_sample_2 &lt;- bowl %&gt;% rep_sample_n(size = 50) bowl_sample_2 # A tibble: 50 x 3 # Groups: replicate [1] replicate ball_ID color &lt;int&gt; &lt;int&gt; &lt;chr&gt; 1 1 1665 red 2 1 1312 red 3 1 2105 red 4 1 810 white 5 1 189 white 6 1 1429 white 7 1 2294 red 8 1 1233 white 9 1 1951 white 10 1 2061 white # … with 40 more rows Let’s reapply the same infer functions on bowl_sample_2 to generate a different 95% confidence interval for \\(p\\). First, we create the new bootstrap distribution and save the results in sample_2_bootstrap: sample_2_bootstrap &lt;- bowl_sample_2 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;prop&quot;) sample_2_bootstrap # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.48 2 2 0.38 3 3 0.32 4 4 0.32 5 5 0.34 6 6 0.26 7 7 0.3 8 8 0.36 9 9 0.44 10 10 0.36 # … with 990 more rows We once again compute a percentile-based 95% confidence interval for \\(p\\): percentile_ci_2 &lt;- sample_2_bootstrap %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci_2 # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.2 0.48 Does this new net capture the fish? In other words, does the 95% confidence interval for \\(p\\) based on the new sample contain the true value of \\(p\\) of 0.375? Yes again! 
0.375 is between the endpoints of our confidence interval (0.2, 0.48). Let’s now repeat this process 100 more times: we take 100 virtual samples from the bowl and construct 100 95% confidence intervals. Let’s visualize the results in Figure 8.27 where: We mark the true value of \\(p = 0.375\\) with a vertical line. We mark each of the 100 95% confidence intervals with horizontal lines. These are the “nets.” The horizontal line is colored grey if the confidence interval “captures” the true value of \\(p\\) marked with the vertical line. The horizontal line is colored black otherwise. FIGURE 8.27: 100 percentile-based 95% confidence intervals for \\(p\\). Of the 100 95% confidence intervals, 95 of them captured the true value \\(p = 0.375\\), whereas 5 of them didn’t. In other words, 95 of our nets caught the fish, whereas 5 of our nets didn’t. This is where the “95% confidence level” we defined in Section 8.3 comes into play: for every 100 95% confidence intervals, we expect that 95 of them will capture \\(p\\) and that five of them won’t. Note that “expect” is a probabilistic statement referring to a long-run average. In other words, for every 100 confidence intervals, we will observe about 95 confidence intervals that capture \\(p\\), but not necessarily exactly 95. In Figure 8.27 for example, 95 of the confidence intervals capture \\(p\\). To further accentuate our point about confidence levels, let’s generate a figure similar to Figure 8.27, but this time constructing 80% standard-error method based confidence intervals instead. Let’s visualize the results in Figure 8.28 with the scale on the x-axis being the same as in Figure 8.27 to make comparison easy. Furthermore, since all standard-error method confidence intervals for \\(p\\) are centered at their respective point estimates \\(\\widehat{p}\\), we mark this value on each line with dots. FIGURE 8.28: 100 SE-based 80% confidence intervals for \\(p\\) with point estimate center marked with dots. 
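The repeated-sampling experiment behind Figures 8.27 and 8.28 can be imitated with a short base-R simulation. This is only a sketch under stated assumptions: the bowl is rebuilt as a plain character vector with 900 red balls out of 2400 (so that \\(p = 0.375\\)), the helper functions percentile_ci() and captures_p() are hypothetical names introduced here, the seed is arbitrary, and the percentile method is used for both confidence levels (whereas Figure 8.28 uses the standard error method).

```r
# Sketch: how often do bootstrap percentile intervals capture the true p?
set.seed(8)  # arbitrary seed for reproducibility

# Virtual bowl: 2400 balls, 900 of them red, so p = 900 / 2400 = 0.375
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))
p_true <- mean(bowl_colors == "red")

# Percentile-method CI for the proportion red, built from ONE sample of balls
percentile_ci <- function(sample_colors, level = 0.95, reps = 1000) {
  is_red <- sample_colors == "red"
  # Bootstrap: resample the sample with replacement, keep each proportion
  boot_props <- replicate(reps, mean(sample(is_red, replace = TRUE)))
  alpha <- 1 - level
  unname(quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2)))
}

# Take a fresh sample of 50 balls, build its CI, report whether it holds p_true
captures_p <- function(level) {
  ci <- percentile_ci(sample(bowl_colors, size = 50), level = level)
  ci[1] <= p_true && p_true <= ci[2]
}

n_intervals <- 100
coverage_95 <- mean(replicate(n_intervals, captures_p(0.95)))
coverage_80 <- mean(replicate(n_intervals, captures_p(0.80)))

# Proportion of the 100 "nets" that caught the fish at each confidence level
c(level_95 = coverage_95, level_80 = coverage_80)
```

The two proportions will not be exactly 0.95 and 0.80 in any single run, which mirrors the point that "expect" refers to a long-run average rather than a guarantee for a particular set of 100 intervals.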
Observe how the 80% confidence intervals are narrower than the 95% confidence intervals, reflecting our lower degree of confidence. Think of this as using a smaller “net.” We’ll explore other determinants of confidence interval width in the upcoming Subsection 8.5.3. Furthermore, observe that of the 100 80% confidence intervals, 82 of them captured the population proportion \\(p\\) = 0.375, whereas 18 of them did not. Since we lowered the confidence level from 95% to 80%, we now have a much larger number of confidence intervals that failed to “catch the fish.” 8.5.2 Precise and shorthand interpretation Let’s return our attention to 95% confidence intervals. The precise and mathematically correct interpretation of a 95% confidence interval is a little long-winded: Precise interpretation: If we repeated our sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population parameter. This is what we observed in Figure 8.27. Our confidence interval construction procedure is 95% reliable. That is to say, we can expect our confidence intervals to include the true population parameter about 95% of the time. A common but incorrect interpretation is: “There is a 95% probability that the confidence interval contains \\(p\\).” Looking at Figure 8.27, each of the confidence intervals either does or doesn’t contain \\(p\\). In other words, the probability is either a 1 or a 0. So if the 95% confidence level only relates to the reliability of the confidence interval construction procedure and not to a given confidence interval itself, what insight can be derived from a given confidence interval? For example, going back to the pennies example, we found that the percentile method 95% confidence interval for \\(\\mu\\) was (1991.24, 1999.42), whereas the standard error method 95% confidence interval was (1991.35, 1999.53). What can be said about these two intervals? 
Loosely speaking, we can think of these intervals as our “best guess” of a plausible range of values for the mean year \\(\\mu\\) of all US pennies. For the rest of this book, we’ll use the following shorthand summary of the precise interpretation. Short-hand interpretation: We are 95% “confident” that a 95% confidence interval captures the value of the population parameter. We use quotation marks around “confident” to emphasize that while 95% relates to the reliability of our confidence interval construction procedure, ultimately a constructed confidence interval is our best guess of an interval that contains the population parameter. In other words, it’s our best net. So returning to our pennies example and focusing on the percentile method, we are 95% “confident” that the true mean year of pennies in circulation in 2019 is somewhere between 1991.24 and 1999.42. 8.5.3 Width of confidence intervals Now that we know how to interpret confidence intervals, let’s go over some factors that determine their width. Impact of confidence level One factor that determines confidence interval widths is the pre-specified confidence level. For example, in Figures 8.27 and 8.28, we compared the widths of 95% and 80% confidence intervals and observed that the 95% confidence intervals were wider. The quantification of the confidence level should match what many expect of the word “confident.” In order to be more confident in our best guess of a range of values, we need to widen the range of values. To elaborate on this, imagine we want to guess the forecasted high temperature in Seoul, South Korea on August 15th. Given Seoul’s temperate climate with four distinct seasons, we could say somewhat confidently that the high temperature would be between 50°F - 95°F (10°C - 35°C). However, if we wanted a temperature range we were absolutely confident about, we would need to widen it. 
We need this wider range to allow for the possibility of anomalous weather, like a freak cold spell or an extreme heat wave. So a range of temperatures we could be near certain about would be between 32°F - 110°F (0°C - 43°C). On the other hand, if we could tolerate being a little less confident, we could narrow this range to between 70°F - 85°F (21°C - 30°C). Let’s revisit our sampling bowl from Chapter 7. Let’s compare \\(10 \\cdot 3 = 30\\) confidence intervals for \\(p\\) based on three different confidence levels: 80%, 95%, and 99%. Specifically, we’ll first take 30 different random samples of size \\(n\\) = 50 balls from the bowl. Then we’ll construct 10 percentile-based confidence intervals using each of the three different confidence levels. Finally, we’ll compare the widths of these intervals. We visualize the resulting confidence intervals in Figure 8.29 along with a vertical line marking the true value of \\(p\\) = 0.375. FIGURE 8.29: Ten 80, 95, and 99% confidence intervals for \\(p\\) based on \\(n = 50\\). Observe that as the confidence level increases from 80% to 95% to 99%, the confidence intervals tend to get wider as seen in Table 8.2 where we compare their average widths. TABLE 8.2: Average width of 80, 95, and 99% confidence intervals Confidence level Mean width 80% 0.162 95% 0.262 99% 0.338 So in order to have a higher confidence level, our confidence intervals must be wider. Ideally, we would have both a high confidence level and narrow confidence intervals. However, we cannot have it both ways. If we want to be more confident, we need to allow for wider intervals. Conversely, if we would like a narrow interval, we must tolerate a lower confidence level. The moral of the story is: Higher confidence levels tend to produce wider confidence intervals. When looking at Figure 8.29 it is important to keep in mind that we kept the sample size fixed at \\(n\\) = 50. 
Thus, all \\(10 \\cdot 3 = 30\\) random samples from the bowl had the same sample size. What happens if instead we took samples of different sizes? Recall that we did this in Subsection 7.2.4 using virtual shovels with 25, 50, and 100 slots. Impact of sample size This time, let’s fix the confidence level at 95%, but consider three different sample sizes for \\(n\\): 25, 50, and 100. Specifically, we’ll first take 10 different random samples of size 25, 10 different random samples of size 50, and 10 different random samples of size 100. We’ll then construct 95% percentile-based confidence intervals for each sample. Finally, we’ll compare the widths of these intervals. We visualize the resulting 30 confidence intervals in Figure 8.30. Note also the vertical line marking the true value of \\(p\\) = 0.375. FIGURE 8.30: Ten 95% confidence intervals for \\(p\\) with \\(n = 25, 50,\\) and \\(100\\). Observe that as the confidence intervals are constructed from larger and larger sample sizes, they tend to get narrower. Let’s compare the average widths in Table 8.3. TABLE 8.3: Average width of 95% confidence intervals based on \\(n = 25\\), \\(50\\), and \\(100\\) Sample size Mean width n = 25 0.380 n = 50 0.268 n = 100 0.189 The moral of the story is: Larger sample sizes tend to produce narrower confidence intervals. Recall that this was a key message in Subsection 7.3.3. As we used larger and larger shovels for our samples, the sample proportions red \\(\\widehat{p}\\) tended to vary less. In other words, our estimates got more and more precise. Recall that we visualized these results in Figure 7.15, where we compared the sampling distributions for \\(\\widehat{p}\\) based on samples of size \\(n\\) equal 25, 50, and 100. We also quantified the sampling variation of these sampling distributions using their standard deviation, which has that special name: the standard error. So as the sample size increases, the standard error decreases. 
In fact, the standard error is another related factor in determining confidence interval width. We’ll explore this fact in Subsection 8.7.2 when we discuss theory-based methods for constructing confidence intervals using mathematical formulas. Such methods are an alternative to the computer-based methods we’ve been using so far. 8.6 Case study: Is yawning contagious? Let’s apply our knowledge of confidence intervals to answer the question: “Is yawning contagious?”. If you see someone else yawn, are you more likely to yawn? In an episode of the US show Mythbusters, the hosts conducted an experiment to answer this question. The episode is available to view in the United States on the Discovery Network website here and more information about the episode is also available on IMDb. 8.6.1 Mythbusters study data Fifty adult participants who thought they were being considered for an appearance on the show were interviewed by a show recruiter. In the interview, the recruiter either yawned or did not. Participants then sat by themselves in a large van and were asked to wait. While in the van, the Mythbusters team watched the participants using a hidden camera to see if they yawned. The data frame containing the results of their experiment is available in the mythbusters_yawn data frame included in the moderndive package: mythbusters_yawn # A tibble: 50 x 3 subj group yawn &lt;int&gt; &lt;chr&gt; &lt;chr&gt; 1 1 seed yes 2 2 control yes 3 3 seed no 4 4 seed yes 5 5 seed no 6 6 control no 7 7 seed yes 8 8 control no 9 9 control no 10 10 seed no # … with 40 more rows The variables are: subj: The participant ID with values 1 through 50. group: A binary treatment variable indicating whether the participant was exposed to yawning. &quot;seed&quot; indicates the participant was exposed to yawning while &quot;control&quot; indicates the participant was not. yawn: A binary response variable indicating whether the participant ultimately yawned. 
Recall that you learned about treatment and response variables in Subsection 5.3.1 in our discussion on confounding variables. Let’s use some data wrangling to obtain counts of the four possible outcomes: mythbusters_yawn %&gt;% group_by(group, yawn) %&gt;% summarize(count = n()) # A tibble: 4 x 3 # Groups: group [2] group yawn count &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 control no 12 2 control yes 4 3 seed no 24 4 seed yes 10 Let’s first focus on the &quot;control&quot; group participants who were not exposed to yawning. 12 such participants did not yawn, while 4 such participants did. So out of the 16 people who were not exposed to yawning, 4/16 = 0.25 = 25% did yawn. Let’s now focus on the &quot;seed&quot; group participants who were exposed to yawning where 24 such participants did not yawn, while 10 such participants did yawn. So out of the 34 people who were exposed to yawning, 10/34 = 0.294 = 29.4% did yawn. Comparing these two percentages, the participants who were exposed to yawning yawned 29.4% - 25% = 4.4% more often than those who were not. 8.6.2 Sampling scenario Let’s review the terminology and notation related to sampling we studied in Subsection 7.3.1. In Chapter 7 our study population was the bowl of \\(N\\) = 2400 balls. Our population parameter of interest was the population proportion of these balls that were red, denoted mathematically by \\(p\\). In order to estimate \\(p\\), we extracted a sample of 50 balls using the shovel and computed the relevant point estimate: the sample proportion that were red, denoted mathematically by \\(\\widehat{p}\\). Who is the study population here? All humans? All the people who watch the show Mythbusters? It’s hard to say! This question can only be answered if we know how the show’s hosts recruited participants! In other words, what was the sampling methodology used by the Mythbusters to recruit participants? We alas are not provided with this information. 
Only for the purposes of this case study, however, we’ll assume that the 50 participants are a representative sample of all Americans given the popularity of this show. Thus, we’ll be assuming that any results of this experiment will generalize to all \\(N\\) = 327 million Americans (2018 population). Just like with our sampling bowl, the population parameter here will involve proportions. However, in this case it will be the difference in population proportions \\(p_{seed} - p_{control}\\), where \\(p_{seed}\\) is the proportion of all Americans who if exposed to yawning will yawn themselves, and \\(p_{control}\\) is the proportion of all Americans who if not exposed to yawning still yawn themselves. Correspondingly, the point estimate/sample statistic based on the Mythbusters’ sample of participants will be the difference in sample proportions \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\). Let’s extend Table 7.5 of scenarios of sampling for inference to include our latest scenario. TABLE 8.4: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) This is known as a two-sample inference situation since we have two separate samples. Based on their two samples of sizes \\(n_{seed}\\) = 34 and \\(n_{control}\\) = 16, in which 10 and 4 participants yawned respectively, the point estimate is \\[ \\widehat{p}_{seed} - \\widehat{p}_{control} = \\frac{10}{34} - \\frac{4}{16} = 0.04411765 \\approx 4.4\\% \\] However, say the Mythbusters repeated this experiment. In other words, say they recruited 50 new participants and exposed 34 of them to yawning and 16 not. Would they obtain the exact same estimated difference of 4.4%? Probably not, again, because of sampling variation. 
How does this sampling variation affect their estimate of 4.4%? In other words, what would be a plausible range of values for this difference that accounts for this sampling variation? We can answer this question with confidence intervals! Furthermore, since the Mythbusters only have a single two-sample of 50 participants, they would have to construct a 95% confidence interval for \\(p_{seed} - p_{control}\\) using bootstrap resampling with replacement. We make a couple of important notes. First, for the comparison between the &quot;seed&quot; and &quot;control&quot; groups to make sense, both groups need to be independent from each other. Otherwise, they could influence each other’s results. This means that a participant being selected for the &quot;seed&quot; or &quot;control&quot; group has no influence on another participant being assigned to one of the two groups. As an example, if there were a mother and her child as participants in the study, they wouldn’t necessarily be in the same group. They would each be assigned randomly to one of the two groups of the explanatory variable. Second, the order of the subtraction in the difference doesn’t matter so long as you are consistent and tailor your interpretations accordingly. In other words, using a point estimate of \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) or \\(\\widehat{p}_{control} - \\widehat{p}_{seed}\\) does not make a material difference; you just need to stay consistent and interpret your results accordingly. 8.6.3 Constructing the confidence interval As we did in Subsection 8.4.2, let’s first construct the bootstrap distribution for \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) and then use this to construct 95% confidence intervals for \\(p_{seed} - p_{control}\\). We’ll do this using the infer workflow again. However, since the difference in proportions is a new scenario for inference, we’ll need to use some new arguments in the infer functions along the way. 1. 
specify variables Let’s take our mythbusters_yawn data frame and specify() which variables are of interest using the y ~ x formula interface where: Our response variable is yawn: whether or not a participant yawned. It has levels &quot;yes&quot; and &quot;no&quot;. The explanatory variable is group: whether or not a participant was exposed to yawning. It has levels &quot;seed&quot; (exposed to yawning) and &quot;control&quot; (not exposed to yawning). mythbusters_yawn %&gt;% specify(formula = yawn ~ group) Error: A level of the response variable `yawn` needs to be specified for the `success` argument in `specify()`. Alas, we got an error message similar to the one from Subsection 8.5.1: infer is telling us that one of the levels of the categorical variable yawn needs to be defined as the success. Recall that we define success to be the event of interest we are trying to count and compute proportions of. Are we interested in those participants who &quot;yes&quot; yawned or those who &quot;no&quot; didn’t yawn? This isn’t clear to R or someone just picking up the code and results for the first time, so we need to set the success argument to &quot;yes&quot; as follows to improve the transparency of the code: mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) Response: yawn (factor) Explanatory: group (factor) # A tibble: 50 x 2 yawn group &lt;fct&gt; &lt;fct&gt; 1 yes seed 2 yes control 3 no seed 4 yes seed 5 no seed 6 no control 7 yes seed 8 no control 9 no control 10 no seed # … with 40 more rows 2. generate replicates Our next step is to perform bootstrap resampling with replacement like we did with the slips of paper in our pennies activity in Section 8.1. We saw how it works with both a single variable in computing bootstrap means in Section 8.4 and in computing bootstrap proportions in Section 8.5, but we haven’t yet worked with bootstrapping involving multiple variables. 
In the infer package, bootstrapping with multiple variables means that each row is potentially resampled. Let’s investigate this by focusing only on the first six rows of mythbusters_yawn: first_six_rows &lt;- head(mythbusters_yawn) first_six_rows # A tibble: 6 x 3 subj group yawn &lt;int&gt; &lt;chr&gt; &lt;chr&gt; 1 1 seed yes 2 2 control yes 3 3 seed no 4 4 seed yes 5 5 seed no 6 6 control no When we bootstrap this data, we are potentially pulling the subject’s readings multiple times. Thus, we could see the entries of &quot;seed&quot; for group and &quot;no&quot; for yawn together in a new row in a bootstrap sample. This is further seen by exploring the sample_n() function in dplyr on this smaller 6-row data frame comprised of head(mythbusters_yawn). The sample_n() function can perform this bootstrapping procedure and is similar to the rep_sample_n() function in infer, except that it is not repeated, but rather only performs one sample with or without replacement. first_six_rows %&gt;% sample_n(size = 6, replace = TRUE) # A tibble: 6 x 3 subj group yawn &lt;int&gt; &lt;chr&gt; &lt;chr&gt; 1 1 seed yes 2 6 control no 3 1 seed yes 4 5 seed no 5 4 seed yes 6 4 seed yes We can see that in this bootstrap sample generated from the first six rows of mythbusters_yawn, we have some rows repeated. The same is true when we perform the generate() step in infer as done in what follows. Using this fact, we generate 1000 replicates, or, in other words, we bootstrap resample the 50 participants with replacement 1000 times. 
mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) Response: yawn (factor) Explanatory: group (factor) # A tibble: 50,000 x 3 # Groups: replicate [1,000] replicate yawn group &lt;int&gt; &lt;fct&gt; &lt;fct&gt; 1 1 yes seed 2 1 yes control 3 1 no control 4 1 no control 5 1 yes seed 6 1 yes seed 7 1 yes seed 8 1 yes seed 9 1 no seed 10 1 yes seed # … with 49,990 more rows Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 participants with replacement 1000 times and 50,000 = 1000 \\(\\cdot\\) 50. The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times. 3. calculate summary statistics After we generate() many replicates of bootstrap resampling with replacement, we next want to summarize the bootstrap resamples of size 50 with a single summary statistic, the difference in proportions. We do this by setting the stat argument to &quot;diff in props&quot;: mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;) Error: Statistic is based on a difference; specify the `order` in which to subtract the levels of the explanatory variable. We see another error here. We need to specify the order of the subtraction. Is it \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) or \\(\\widehat{p}_{control} - \\widehat{p}_{seed}\\)? We specify it to be \\(\\widehat{p}_{seed} - \\widehat{p}_{control}\\) by setting order = c(&quot;seed&quot;, &quot;control&quot;). Note that you could’ve also set order = c(&quot;control&quot;, &quot;seed&quot;). 
As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly. Let’s save the output in a data frame bootstrap_distribution_yawning: bootstrap_distribution_yawning &lt;- mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;seed&quot;, &quot;control&quot;)) bootstrap_distribution_yawning # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.0357143 2 2 0.229167 3 3 0.00952381 4 4 0.0106952 5 5 0.00483092 6 6 0.00793651 7 7 -0.0845588 8 8 -0.00466200 9 9 0.164686 10 10 0.124777 # … with 990 more rows Observe that the resulting data frame has 1000 rows and 2 columns corresponding to the 1000 replicate ID’s and the 1000 differences in proportions for each bootstrap resample in stat. 4. visualize the results In Figure 8.31 we visualize() the resulting bootstrap resampling distribution. Let’s also add a vertical line at 0 by adding a geom_vline() layer. visualize(bootstrap_distribution_yawning) + geom_vline(xintercept = 0) FIGURE 8.31: Bootstrap distribution. First, let’s compute the 95% confidence interval for \\(p_{seed} - p_{control}\\) using the percentile method, in other words, by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped. bootstrap_distribution_yawning %&gt;% get_confidence_interval(type = &quot;percentile&quot;, level = 0.95) # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -0.238276 0.302464 Second, since the bootstrap distribution is roughly bell-shaped, we can construct a confidence interval using the standard error method as well. 
Recall that to construct a confidence interval using the standard error method, we need to specify the center of the interval using the point_estimate argument. In our case, we need to set it to be the difference in sample proportions of 4.4% that the Mythbusters observed. We can also use the infer workflow to compute this value by excluding the generate() 1000 bootstrap replicates step. In other words, do not generate replicates, but rather use only the original sample data. We can achieve this by commenting out the generate() line, telling R to ignore it: obs_diff_in_props &lt;- mythbusters_yawn %&gt;% specify(formula = yawn ~ group, success = &quot;yes&quot;) %&gt;% # generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;seed&quot;, &quot;control&quot;)) obs_diff_in_props # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.0441176 We thus plug this value in as the point_estimate argument. myth_ci_se &lt;- bootstrap_distribution_yawning %&gt;% get_confidence_interval(type = &quot;se&quot;, point_estimate = obs_diff_in_props) myth_ci_se # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 -0.227291 0.315526 Let’s visualize both confidence intervals in Figure 8.32, with the percentile-method interval marked with black lines and the standard-error-method marked with grey lines. Observe that they are both similar to each other. FIGURE 8.32: Two 95% confidence intervals: percentile method (black) and standard error method (grey). 8.6.4 Interpreting the confidence interval Given that both confidence intervals are quite similar, let’s focus our interpretation to only the percentile-method confidence interval of (-0.238, 0.302). Recall from Subsection 8.5.2 that the precise statistical interpretation of a 95% confidence interval is: if this construction procedure is repeated 100 times, then we expect about 95 of the confidence intervals to capture the true value of \\(p_{seed} - p_{control}\\). 
In other words, if we gathered 100 samples of \\(n\\) = 50 participants from a similar pool of people and constructed 100 confidence intervals each based on each of the 100 samples, about 95 of them will contain the true value of \\(p_{seed} - p_{control}\\) while about five won’t. Given that this is a little long winded, we use the shorthand interpretation: we’re 95% “confident” that the true difference in proportions \\(p_{seed} - p_{control}\\) is between (-0.238, 0.302). There is one value of particular interest that this 95% confidence interval contains: zero. If \\(p_{seed} - p_{control}\\) were equal to 0, then there would be no difference in proportion yawning between the two groups. This would suggest that there is no associated effect of being exposed to a yawning recruiter on whether you yawn yourself. In our case, since the 95% confidence interval includes 0, we cannot conclusively say if either proportion is larger. Of our 1000 bootstrap resamples with replacement, sometimes \\(\\widehat{p}_{seed}\\) was higher and thus those exposed to yawning yawned themselves more often. At other times, the reverse happened. Say, on the other hand, the 95% confidence interval was entirely above zero. This would suggest that \\(p_{seed} - p_{control} &gt; 0\\), or, in other words \\(p_{seed} &gt; p_{control}\\), and thus we’d have evidence suggesting those exposed to yawning do yawn more often. 8.7 Conclusion 8.7.1 Comparing bootstrap and sampling distributions Let’s talk more about the relationship between sampling distributions and bootstrap distributions. Recall back in Subsection 7.2.3, we took 1000 virtual samples from the bowl using a virtual shovel, computed 1000 values of the sample proportion red \\(\\widehat{p}\\), then visualized their distribution in a histogram. Recall that this distribution is called the sampling distribution of \\(\\widehat{p}\\) . Furthermore, the standard deviation of the sampling distribution has a special name: the standard error. 
We also mentioned that this sampling activity does not reflect how sampling is done in real life. Rather, it was an idealized version of sampling so that we could study the effects of sampling variation on estimates, like the proportion of the shovel’s balls that are red. In real life, however, one would take a single sample that’s as large as possible, much like in the Obama poll we saw in Section 7.4. But how can we get a sense of the effect of sampling variation on estimates if we only have one sample and thus only one estimate? Don’t we need many samples and hence many estimates? The workaround to having a single sample was to perform bootstrap resampling with replacement from the single sample. We did this in the resampling activity in Section 8.1 where we focused on the mean year of minting of pennies. We used pieces of paper representing the original sample of 50 pennies from the bank and resampled them with replacement from a hat. We had 35 of our friends perform this activity and visualized the resulting 35 sample means \\(\\overline{x}\\) in a histogram in Figure 8.11. This distribution was called the bootstrap distribution of \\(\\overline{x}\\). We stated at the time that the bootstrap distribution is an approximation to the sampling distribution of \\(\\overline{x}\\) in the sense that both distributions will have a similar shape and similar spread. Thus the standard error of the bootstrap distribution can be used as an approximation to the standard error of the sampling distribution. Let’s show you that this is the case by now comparing these two types of distributions. Specifically, we’ll compare the sampling distribution of \\(\\widehat{p}\\) based on 1000 virtual samples from the bowl from Subsection 7.2.3 to the bootstrap distribution of \\(\\widehat{p}\\) based on 1000 virtual resamples with replacement from Ilyas and Yohan’s single sample bowl_sample_1 from Subsection 8.5.1. 
Sampling distribution Here is the code you saw in Subsection 7.2.3 to construct the sampling distribution of \\(\\widehat{p}\\) shown again in Figure 8.33, with some changes to incorporate the statistical terminology relating to sampling from Subsection 7.3.1. # Take 1000 virtual samples of size 50 from the bowl: virtual_samples &lt;- bowl %&gt;% rep_sample_n(size = 50, reps = 1000) # Compute the sampling distribution of 1000 values of p-hat sampling_distribution &lt;- virtual_samples %&gt;% group_by(replicate) %&gt;% summarize(red = sum(color == &quot;red&quot;)) %&gt;% mutate(prop_red = red / 50) # Visualize sampling distribution of p-hat ggplot(sampling_distribution, aes(x = prop_red)) + geom_histogram(binwidth = 0.05, boundary = 0.4, color = &quot;white&quot;) + labs(x = &quot;Proportion of 50 balls that were red&quot;, title = &quot;Sampling distribution&quot;) FIGURE 8.33: Previously seen sampling distribution of sample proportion red based on 1000 samples of size \\(n = 50\\). An important thing to keep in mind is that the default value for replace is FALSE when using rep_sample_n(). This is because when sampling 50 balls with a shovel, we are extracting 50 balls one-by-one without replacing them. This is in contrast to bootstrap resampling with replacement, where we resample a ball and put it back, and repeat this process 50 times. Let’s quantify the variability in this sampling distribution by calculating the standard deviation of the prop_red variable representing 1000 values of the sample proportion \\(\\widehat{p}\\). Remember that the standard deviation of the sampling distribution is the standard error, frequently denoted as se. sampling_distribution %&gt;% summarize(se = sd(prop_red)) # A tibble: 1 x 1 se &lt;dbl&gt; 1 0.0673987 Bootstrap distribution Here is the code you previously saw in Subsection 8.5.1 to construct the bootstrap distribution of \\(\\widehat{p}\\) based on Ilyas and Yohan’s original sample of 50 balls saved in bowl_sample_1. 
bootstrap_distribution &lt;- bowl_sample_1 %&gt;% specify(response = color, success = &quot;red&quot;) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;prop&quot;) FIGURE 8.34: Bootstrap distribution of proportion red based on 1000 resamples of size \\(n = 50\\). bootstrap_distribution %&gt;% summarize(se = sd(stat)) # A tibble: 1 x 1 se &lt;dbl&gt; 1 0.0712212 Comparison Now that we have computed both the sampling distribution and the bootstrap distribution, let’s compare them side-by-side in Figure 8.35. We’ll make both histograms have matching scales on the x- and y-axes to make them more comparable. Furthermore, we’ll add: To the sampling distribution on the top: a solid line denoting the proportion of the bowl’s balls that are red \\(p\\) = 0.375. To the bootstrap distribution on the bottom: a dashed line at the sample proportion \\(\\widehat{p}\\) = 21/50 = 0.42 = 42% that Ilyas and Yohan observed. FIGURE 8.35: Comparing the sampling and bootstrap distributions of \\(\\widehat{p}\\). There is a lot going on in Figure 8.35, so let’s break down all the comparisons slowly. First, observe how the sampling distribution on top is centered at \\(p\\) = 0.375. This is because the sampling is done at random and in an unbiased fashion. So the estimates \\(\\widehat{p}\\) are centered at the true value of \\(p\\). However, this is not the case with the bootstrap distribution on the bottom. The bootstrap distribution is centered at 0.42, which is the proportion red of Ilyas and Yohan’s 50 sampled balls. This is because we are resampling from the same sample over and over again. Since the bootstrap distribution is centered at the original sample’s proportion, it doesn’t necessarily provide a better estimate of \\(p\\) = 0.375. This leads us to our first lesson about bootstrapping: The bootstrap distribution will likely not have the same center as the sampling distribution. In other words, bootstrapping cannot improve the quality of an estimate. 
Second, let’s now compare the spread of the two distributions: they are somewhat similar. In the previous code, we computed the standard deviations of both distributions as well. Recall that such standard deviations have a special name: standard errors. Let’s compare them in Table 8.5. TABLE 8.5: Comparing standard errors Distribution type Standard error Sampling distribution 0.067 Bootstrap distribution 0.071 Notice that the bootstrap distribution’s standard error is a rather good approximation to the sampling distribution’s standard error. This leads us to our second lesson about bootstrapping: Even if the bootstrap distribution might not have the same center as the sampling distribution, it will likely have very similar shape and spread. In other words, bootstrapping will give you a good estimate of the standard error. Thus, using the fact that the bootstrap and sampling distributions have similar spreads, we can build confidence intervals using bootstrapping as we’ve done throughout this chapter! 8.7.2 Theory-based confidence intervals So far in this chapter, we’ve constructed confidence intervals using two methods: the percentile method and the standard error method. Recall also from Subsection 8.3.2 that we can only use the standard error method if the bootstrap distribution is bell-shaped (i.e., normally distributed). In a similar vein, if the sampling distribution is normally shaped, there is another method for constructing confidence intervals that does not involve using your computer. You can use a theory-based method involving a mathematical formula! The formula uses the rule of thumb we saw in Appendix A.2 that 95% of values in a normal distribution are within \\(\\pm 1.96\\) standard deviations of the mean. In the case of sampling and bootstrap distributions, recall that the standard deviation has a special name: the standard error. 
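If you would like to see where the 1.96 multiplier comes from, here is a short base R sketch. It uses only qnorm() and pnorm() from the stats package that ships with R. The object names z and coverage are ours, chosen for illustration.

```r
# Multiplier for a 95% interval: the 97.5th percentile of the standard normal
z <- qnorm(0.975)
z

# Proportion of a normal distribution that lies within
# 1.96 standard deviations of its mean
coverage <- pnorm(1.96) - pnorm(-1.96)
coverage
```

The first value is approximately 1.96 and the second is approximately 0.95, which is the rule of thumb in numeric form.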
Theory-based method for computing standard errors There exists in many cases a formula that approximates the standard error! In the case of our bowl where we used the sample proportion red \\(\\widehat{p}\\) to estimate the proportion of the bowl’s balls that are red, the formula that approximates the standard error is: \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] For example, recall from bowl_sample_1 that Yohan and Ilyas sampled \\(n = 50\\) balls and observed a sample proportion \\(\\widehat{p}\\) of 21/50 = 0.42. So, using the formula, an approximation of the standard error of \\(\\widehat{p}\\) is \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{0.42(1-0.42)}{50}} = \\sqrt{0.004872} = 0.0698 \\approx 0.070\\] The key observation to make here is that there is an \\(n\\) in the denominator. So as the sample size \\(n\\) increases, the standard error decreases. We’ve demonstrated this fact using our virtual shovels in Subsection 7.3.3. If you don’t recall this demonstration, we highly recommend you go back and read that subsection. Let’s compare this theory-based standard error to the standard error of the sampling and bootstrap distributions you computed previously in Subsection 8.7.1 in Table 8.6. Notice how they are all similar! TABLE 8.6: Comparing standard errors Distribution type Standard error Sampling distribution 0.067 Bootstrap distribution 0.071 Formula approximation 0.070 Going back to Yohan and Ilyas’ sample proportion of \\(\\widehat{p}\\) of 21/50 = 0.42, say this were based on a sample of size \\(n\\) = 100 instead of 50. Then the standard error would be: \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{0.42(1-0.42)}{100}} = \\sqrt{0.002436} = 0.0494\\] Observe that the standard error has gone down from 0.0698 to 0.0494. In other words, the “typical” error of our estimates using \\(n\\) = 100 will go down and hence be more precise. 
Recall that we illustrated the difference between accuracy and precision of estimates in Figure 7.16. Why is this formula true? Unfortunately, we don’t have the tools at this point to prove this; you’ll need to take a more advanced course in probability and statistics. (It is related to the concepts of Bernoulli and Binomial Distributions. You can read more about its derivation here if you like.) Theory-based method for constructing confidence intervals Using these theory-based standard errors, let’s present a theory-based method for constructing 95% confidence intervals that does not involve using a computer, but rather mathematical formulas. Note that this theory-based method only holds if the sampling distribution is normally shaped, so that we can use the 95% rule of thumb about normal distributions discussed in Appendix A.2. Collect a single representative sample of size \\(n\\) that’s as large as possible. Compute the point estimate: the sample proportion \\(\\widehat{p}\\). Think of this as the center of your “net.” Compute the approximation to the standard error \\[\\text{SE}_{\\widehat{p}} \\approx \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Compute a quantity known as the margin of error (more on this later after we list the five steps): \\[\\text{MoE}_{\\widehat{p}} = 1.96 \\cdot \\text{SE}_{\\widehat{p}} = 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Compute both endpoints of the confidence interval. The lower end-point. Think of this as the left end-point of the net: \\[\\widehat{p} - \\text{MoE}_{\\widehat{p}} = \\widehat{p} - 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} - 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] The upper endpoint. 
Think of this as the right end-point of the net: \\[\\widehat{p} + \\text{MoE}_{\\widehat{p}} = \\widehat{p} + 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} + 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Alternatively, you can succinctly summarize a 95% confidence interval for \\(p\\) using the \\(\\pm\\) symbol: \\[\\widehat{p} \\pm \\text{MoE}_{\\widehat{p}} = \\widehat{p} \\pm (1.96 \\cdot \\text{SE}_{\\widehat{p}}) = \\widehat{p} \\pm \\left( 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}} \\right)\\] So going back to Yohan and Ilyas’ sample of \\(n = 50\\) balls that had 21 red balls, the 95% confidence interval for \\(p\\) is \\[ \\begin{aligned} 0.42 \\pm 1.96 \\cdot 0.0698 &amp;= 0.42 \\, \\pm \\, 0.137 \\\\ &amp;= (0.42 - 0.137, \\, 0.42 + 0.137) \\\\ &amp;= (0.283, \\, 0.557). \\end{aligned} \\] Yohan and Ilyas are 95% “confident” that the true proportion red of the bowl’s balls is between 28.3% and 55.7%. Given that the true population proportion \\(p\\) was 0.375, in this case they successfully captured the fish. In Step 4, we defined a statistical quantity known as the margin of error. You can think of this quantity as how much the net extends to the left and to the right of the center of our net. The 1.96 multiplier is rooted in the 95% rule of thumb we introduced earlier and the fact that we want the confidence level to be 95%. The value of the margin of error entirely determines the width of the confidence interval. Recall from Subsection 8.5.3 that confidence interval widths are determined by an interplay of the confidence level, the sample size \\(n\\), and the standard error. Let’s revisit the poll of President Obama’s approval rating among young Americans aged 18-29 which we introduced in Section 7.4. Pollsters found that based on a representative sample of \\(n\\) = 2089 young Americans, \\(\\widehat{p}\\) = 0.41 = 41% supported President Obama. 
If you look towards the end of the article, it also states: “The poll’s margin of error was plus or minus 2.1 percentage points.” This is precisely the \\(\\text{MoE}\\): \\[ \\begin{aligned} \\text{MoE} &amp;= 1.96 \\cdot \\text{SE} = 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}} = 1.96 \\cdot \\sqrt{\\frac{0.41(1-0.41)}{2089}} \\\\ &amp;= 1.96 \\cdot 0.0108 = 0.021 = 2.1\\% \\end{aligned} \\] Their poll results are based on a confidence level of 95% and the resulting 95% confidence interval for the proportion of all young Americans who support Obama is: \\[\\widehat{p} \\pm \\text{MoE} = 0.41 \\pm 0.021 = (0.389, \\, 0.431) = (38.9\\%, \\, 43.1\\%).\\] Confidence intervals based on 33 tactile samples Let’s revisit our 33 friends’ samples from the bowl from Subsection 7.1.3. We’ll use their 33 samples to construct 33 theory-based 95% confidence intervals for \\(p\\). Recall this data was saved in the tactile_prop_red data frame included in the moderndive package: rename() the variable prop_red to p_hat, the statistical name of the sample proportion \\(\\widehat{p}\\). mutate() a new variable n making explicit the sample size of 50. mutate() other new variables computing: The standard error SE for \\(\\widehat{p}\\) using the previous formula. 
The margin of error MoE by multiplying the SE by 1.96 The left endpoint of the confidence interval lower_ci The right endpoint of the confidence interval upper_ci conf_ints &lt;- tactile_prop_red %&gt;% rename(p_hat = prop_red) %&gt;% mutate( n = 50, SE = sqrt(p_hat * (1 - p_hat) / n), MoE = 1.96 * SE, lower_ci = p_hat - MoE, upper_ci = p_hat + MoE ) # A tibble: 33 x 9 group replicate red_balls p_hat n SE MoE lower_ci upper_ci &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Ilyas, … 1 21 0.42 50 0.0697997 0.136807 0.283193 0.556807 2 Morgan,… 2 17 0.34 50 0.0669925 0.131305 0.208695 0.471305 3 Martin,… 3 21 0.42 50 0.0697997 0.136807 0.283193 0.556807 4 Clark, … 4 21 0.42 50 0.0697997 0.136807 0.283193 0.556807 5 Riddhi,… 5 18 0.36 50 0.0678823 0.133049 0.226951 0.493049 6 Andrew,… 6 19 0.38 50 0.0686440 0.134542 0.245458 0.514542 7 Julia 7 19 0.38 50 0.0686440 0.134542 0.245458 0.514542 8 Rachel,… 8 11 0.22 50 0.0585833 0.114823 0.105177 0.334823 9 Daniel,… 9 15 0.3 50 0.0648074 0.127023 0.172977 0.427023 10 Josh, M… 10 17 0.34 50 0.0669925 0.131305 0.208695 0.471305 # … with 23 more rows In Figure 8.36, let’s plot the 33 confidence intervals for \\(p\\) saved in conf_ints along with a vertical line at \\(p\\) = 0.375 indicating the true proportion of the bowl’s balls that are red. Furthermore, let’s mark the sample proportions \\(\\widehat{p}\\) with dots since they represent the centers of these confidence intervals. FIGURE 8.36: 33 confidence intervals at the 95% level based on 33 tactile samples of size \\(n = 50\\). Observe that 31 of the 33 confidence intervals “captured” the true value of \\(p\\), for a success rate of 31 / 33 = 93.94%. While this is not quite 95%, recall that we expect about 95% of such confidence intervals to capture \\(p\\). The actual observed success rate will vary slightly. 
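The five steps of the theory-based method can also be wrapped in a small function. The following is a minimal base R sketch; theory_ci() is a hypothetical helper of our own and is not part of the moderndive or infer packages. It assumes the sampling distribution is normally shaped, so that the 1.96 multiplier applies.

```r
# Hypothetical helper (not part of moderndive or infer): theory-based
# confidence interval for a proportion, 95% level by default
theory_ci <- function(p_hat, n, multiplier = 1.96) {
  se <- sqrt(p_hat * (1 - p_hat) / n)  # approximate standard error
  moe <- multiplier * se               # margin of error
  c(lower = p_hat - moe, upper = p_hat + moe)
}

# Yohan and Ilyas' sample: 21 red balls out of 50
ci_bowl <- theory_ci(p_hat = 21 / 50, n = 50)
round(ci_bowl, 3)

# The poll of young Americans: p-hat = 0.41 with n = 2089
ci_poll <- theory_ci(p_hat = 0.41, n = 2089)
round(ci_poll, 3)

# Did the first interval capture the true proportion p = 0.375?
captured <- ci_bowl[['lower']] <= 0.375 && 0.375 <= ci_bowl[['upper']]
captured
```

The first interval matches the lower_ci and upper_ci values in the first row of conf_ints, and the second matches the interval of 38.9% to 43.1% computed for the poll.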
Theory-based methods like this have largely been used in the past because we didn’t have the computing power to perform simulation-based methods such as bootstrapping. They are still commonly used, however, and if the sampling distribution is normally distributed, we have access to an alternative method for constructing confidence intervals as well as performing hypothesis tests as we will see in Chapter 9. The kind of computer-based statistical inference we’ve seen so far has a particular name in the field of statistics: simulation-based inference. This is because we are performing statistical inference using computer simulations. In our opinion, two large benefits of simulation-based methods over theory-based methods are that (1) they are easier for people new to statistical inference to understand and (2) they also work in situations where theory-based methods and mathematical formulas don’t exist. 8.7.3 Additional resources An R script file of all R code used in this chapter is available here. If you want more examples of the infer workflow to construct confidence intervals, we suggest you check out the infer package homepage, in particular, a series of example analyses available at https://infer.netlify.com/articles/. 8.7.4 What’s to come? Now that we’ve equipped ourselves with confidence intervals, in Chapter 9 we’ll cover the other common tool for statistical inference: hypothesis testing. Just like confidence intervals, hypothesis tests are used to infer about a population using a sample. However, we’ll see that the framework for making such inferences is slightly different. "],
+["9-hypothesis-testing.html", "Chapter 9 Hypothesis Testing 9.1 Promotions activity 9.2 Understanding hypothesis tests 9.3 Conducting hypothesis tests 9.4 Interpreting hypothesis tests 9.5 Case study: Are action or romance movies rated higher? 9.6 Conclusion", " Chapter 9 Hypothesis Testing Now that we’ve studied confidence intervals in Chapter 8, let’s study another commonly used method for statistical inference: hypothesis testing. Hypothesis tests allow us to take a sample of data from a population and infer about the plausibility of competing hypotheses. For example, in the upcoming “promotions” activity in Section 9.1, you’ll study the data collected from a psychology study in the 1970s to investigate whether gender-based discrimination in promotion rates existed in the banking industry at the time of the study. The good news is we’ve already covered many of the necessary concepts to understand hypothesis testing in Chapters 7 and 8. We will expand further on these ideas here and also provide a general framework for understanding hypothesis tests. By understanding this general framework, you’ll be able to adapt it to many different scenarios. The same can be said for confidence intervals. There was one general framework that applies to all confidence intervals and the infer package was designed around this framework. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same. We believe that this approach is much better for long-term learning than focusing on specific details for specific confidence intervals using theory-based approaches. As you’ll now see, we prefer this general framework for hypothesis tests as well. If you’d like more practice or you’re curious to see how this framework applies to different scenarios, you can find fully-worked out examples for many common hypothesis tests and their corresponding confidence intervals in Appendix B. 
We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional theory-based methods like the \\(t\\)-test and normal-theory confidence intervals. You’ll see there that these traditional methods are just approximations for the computer-based methods we’ve been focusing on. However, they also require conditions to be met for their results to be valid. Computer-based methods using randomization, simulation, and bootstrapping have far fewer restrictions. Furthermore, they help develop your computational thinking, which is one big reason they are emphasized throughout this book. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(infer) library(moderndive) library(nycflights13) library(ggplot2movies) 9.1 Promotions activity Let’s start with an activity studying the effect of gender on promotions at a bank. 9.1.1 Does gender affect promotions at a bank? Say you are working at a bank in the 1970s and you are submitting your résumé to apply for a promotion. Will your gender affect your chances of getting promoted? To answer this question, we’ll focus on data from a study published in the Journal of Applied Psychology in 1974. This data is also used in the OpenIntro series of statistics textbooks. To begin the study, 48 bank supervisors were asked to assume the role of a hypothetical director of a bank with multiple branches. 
Every one of the bank supervisors was given a résumé and asked whether or not the candidate on the résumé was fit to be promoted to a new position in one of their branches. However, each of these 48 résumés were identical in all respects except one: the name of the applicant at the top of the résumé. Of the supervisors, 24 were randomly given résumés with stereotypically “male” names, while 24 of the supervisors were randomly given résumés with stereotypically “female” names. Since only (binary) gender varied from résumé to résumé, researchers could isolate the effect of this variable in promotion rates. While many people today (including us, the authors) disagree with such binary views of gender, it is important to remember that this study was conducted at a time where more nuanced views of gender were not as prevalent. Despite this imperfection, we decided to still use this example as we feel it presents ideas still relevant today about how we could study discrimination in the workplace. The moderndive package contains the data on the 48 applicants in the promotions data frame. Let’s explore this data by looking at six randomly selected rows: promotions %&gt;% sample_n(size = 6) %&gt;% arrange(id) # A tibble: 6 x 3 id decision gender &lt;int&gt; &lt;fct&gt; &lt;fct&gt; 1 11 promoted male 2 26 promoted female 3 28 promoted female 4 36 not male 5 37 not male 6 46 not female The variable id acts as an identification variable for all 48 rows, the decision variable indicates whether the applicant was selected for promotion or not, while the gender variable indicates the gender of the name used on the résumé. Recall that this data does not pertain to 24 actual men and 24 actual women, but rather 48 identical résumés of which 24 were assigned stereotypically “male” names and 24 were assigned stereotypically “female” names. Let’s perform an exploratory data analysis of the relationship between the two categorical variables decision and gender. 
Recall that we saw in Subsection 2.8.3 that one way we can visualize such a relationship is by using a stacked barplot. ggplot(promotions, aes(x = gender, fill = decision)) + geom_bar() + labs(x = &quot;Gender of name on résumé&quot;) FIGURE 9.1: Barplot relating gender to promotion decision. Observe in Figure 9.1 that it appears that résumés with female names were much less likely to be accepted for promotion. Let’s quantify these promotion rates by computing the proportion of résumés accepted for promotion for each group using the dplyr package for data wrangling. Note the use of the tally() function here which is a shortcut for summarize(n = n()) to get counts. promotions %&gt;% group_by(gender, decision) %&gt;% tally() # A tibble: 4 x 3 # Groups: gender [2] gender decision n &lt;fct&gt; &lt;fct&gt; &lt;int&gt; 1 male not 3 2 male promoted 21 3 female not 10 4 female promoted 14 So of the 24 résumés with male names, 21 were selected for promotion, for a proportion of 21/24 = 0.875 = 87.5%. On the other hand, of the 24 résumés with female names, 14 were selected for promotion, for a proportion of 14/24 = 0.583 = 58.3%. Comparing these two rates of promotion, it appears that résumés with male names were selected for promotion at a rate 0.875 - 0.583 = 0.292 = 29.2% higher than résumés with female names. This is suggestive of an advantage for résumés with a male name on it. The question is, however, does this provide conclusive evidence that there is gender discrimination in promotions at banks? Could a difference in promotion rates of 29.2% still occur by chance, even in a hypothetical world where no gender-based discrimination existed? In other words, what is the role of sampling variation in this hypothesized world? To answer this question, we’ll again rely on a computer to run simulations. 9.1.2 Shuffling once First, try to imagine a hypothetical universe where no gender discrimination in promotions existed. 
In such a hypothetical universe, the gender of an applicant would have no bearing on their chances of promotion. Bringing things back to our promotions data frame, the gender variable would thus be an irrelevant label. If these gender labels were irrelevant, then we could randomly reassign them by “shuffling” them to no consequence! To illustrate this idea, let’s narrow our focus to 6 arbitrarily chosen résumés of the 48 in Table 9.1. The decision column shows that 3 résumés resulted in promotion while 3 didn’t. The gender column shows what the original gender of the résumé name was. However, in our hypothesized universe of no gender discrimination, gender is irrelevant and thus it is of no consequence to randomly “shuffle” the values of gender. The shuffled_gender column shows one such possible random shuffling. Observe in the fourth column how the number of male and female names remains the same at 3 each, but they are now listed in a different order. TABLE 9.1: One example of shuffling gender variable résumé number decision gender shuffled gender 1 not male male 2 not female male 3 not female female 4 promoted male female 5 promoted male female 6 promoted female male Again, such random shuffling of the gender label only makes sense in our hypothesized universe of no gender discrimination. How could we extend this shuffling of the gender variable to all 48 résumés by hand? One way would be by using a standard deck of 52 playing cards, which we display in Figure 9.2. FIGURE 9.2: Standard deck of 52 playing cards. Since half the cards are red (diamonds and hearts) and the other half are black (spades and clubs), by removing two red cards and two black cards, we would end up with 24 red cards and 24 black cards. After shuffling these 48 cards as seen in Figure 9.3, we can flip the cards over one-by-one, assigning “male” for each red card and “female” for each black card. FIGURE 9.3: Shuffling a deck of cards. 
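The card shuffling can also be mimicked on a computer. Here is a minimal base R sketch of a single shuffle. It rebuilds a stand-in data frame from the counts reported earlier (21 of 24 male names promoted, 14 of 24 female names promoted) so that it runs without the moderndive package. The names promotions_sketch and diff_in_props() are hypothetical and chosen only for illustration.

```r
set.seed(2019)

# Stand-in for the promotions data, rebuilt from the reported counts
promotions_sketch <- data.frame(
  gender = rep(c('male', 'female'), each = 24),
  decision = c(rep('promoted', 21), rep('not', 3),
               rep('promoted', 14), rep('not', 10))
)

# Hypothetical helper: difference in promotion proportions (male minus female)
diff_in_props <- function(gender, decision) {
  mean(decision[gender == 'male'] == 'promoted') -
    mean(decision[gender == 'female'] == 'promoted')
}

# Difference observed in the original, unshuffled data
observed_diff <- diff_in_props(promotions_sketch$gender,
                               promotions_sketch$decision)

# One shuffle: sample() without replacement permutes the 48 gender labels,
# just like flipping over the 48 shuffled cards one by one
shuffled_gender <- sample(promotions_sketch$gender)
shuffled_diff <- diff_in_props(shuffled_gender, promotions_sketch$decision)

table(shuffled_gender)
observed_diff
shuffled_diff
```

Note that the shuffle keeps 24 labels of each kind and leaves the decision column untouched; only the pairing between labels and decisions changes. The shuffled difference you get depends on the seed.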
We’ve saved one such shuffling in the promotions_shuffled data frame of the moderndive package. If you compare the original promotions and the shuffled promotions_shuffled data frames, you’ll see that while the decision variable is identical, the gender variable has changed. Let’s repeat the same exploratory data analysis we did for the original promotions data on our promotions_shuffled data frame. Let’s create a barplot visualizing the relationship between decision and the new shuffled gender variable and compare this to the original unshuffled version in Figure 9.4. ggplot(promotions_shuffled, aes(x = gender, fill = decision)) + geom_bar() + labs(x = &quot;Gender of résumé name&quot;) FIGURE 9.4: Barplots of relationship of promotion with gender (left) and shuffled gender (right). It appears the difference in “male names” versus “female names” promotion rates is now different. Compared to the original data in the left barplot, the new “shuffled” data in the right barplot has promotion rates that are much more similar. Let’s also compute the proportion of résumés accepted for promotion for each group: promotions_shuffled %&gt;% group_by(gender, decision) %&gt;% tally() # Same as summarize(n = n()) # A tibble: 4 x 3 # Groups: gender [2] gender decision n &lt;fct&gt; &lt;fct&gt; &lt;int&gt; 1 male not 6 2 male promoted 18 3 female not 7 4 female promoted 17 So in this hypothetical universe of no discrimination, \\(18/24 = 0.75 = 75\\%\\) of “male” résumés were selected for promotion. On the other hand, \\(17/24 = 0.708 = 70.8\\%\\) of “female” résumés were selected for promotion. Let’s next compare these two values. It appears that résumés with stereotypically male names were selected for promotion at a rate that was \\(0.75 - 0.708 = 0.042 = 4.2\\%\\) different than résumés with stereotypically female names. Observe how this difference in rates is not the same as the difference in rates of 0.292 = 29.2% we originally observed. 
This is once again due to sampling variation. How can we better understand the effect of this sampling variation? By repeating this shuffling several times! 9.1.3 Shuffling 16 times We recruited 16 groups of our friends to repeat this shuffling exercise. They recorded these values in a shared spreadsheet; we display a snapshot of the first 10 rows and 5 columns in Figure 9.5. FIGURE 9.5: Snapshot of shared spreadsheet of shuffling results (m for male, f for female). For each of these 16 columns of shuffles, we computed the difference in promotion rates, and in Figure 9.6 we display their distribution in a histogram. We also mark the observed difference in promotion rate that occurred in real life of 0.292 = 29.2% with a dark line. FIGURE 9.6: Distribution of shuffled differences in promotions. Before we discuss the distribution of the histogram, we emphasize the key thing to remember: this histogram represents differences in promotion rates that one would observe in our hypothesized universe of no gender discrimination. Observe first that the histogram is roughly centered at 0. Saying that the difference in promotion rates is 0 is equivalent to saying that both genders had the same promotion rate. In other words, the center of these 16 values is consistent with what we would expect in our hypothesized universe of no gender discrimination. However, while the values are centered at 0, there is variation about 0. This is because even in a hypothesized universe of no gender discrimination, you will still likely observe small differences in promotion rates because of chance sampling variation. Looking at the histogram in Figure 9.6, such differences could even be as extreme as -0.292 or 0.208. Turning our attention to what we observed in real life: the difference of 0.292 = 29.2% is marked with a vertical dark line. Ask yourself: in a hypothesized world of no gender discrimination, how likely would it be that we observe this difference? 
While opinions here may differ, in our opinion not often! Now ask yourself: what do these results say about our hypothesized universe of no gender discrimination? 9.1.4 What did we just do? What we just demonstrated in this activity is the statistical procedure known as hypothesis testing using a permutation test. The term “permutation” is the mathematical term for “shuffling”: taking a series of values and reordering them randomly, as you did with the playing cards. In fact, permutations are another form of resampling, like the bootstrap method you performed in Chapter 8. While the bootstrap method involves resampling with replacement, permutation methods involve resampling without replacement. Think of our exercise involving the slips of paper representing pennies and the hat in Section 8.1: after sampling a penny, you put it back in the hat. Now think of our deck of cards. After drawing a card, you laid it out in front of you, recorded the color, and then you did not put it back in the deck. In our previous example, we tested the validity of the hypothesized universe of no gender discrimination. The evidence contained in our observed sample of 48 résumés was somewhat inconsistent with our hypothesized universe. Thus, we would be inclined to reject this hypothesized universe and declare that the evidence suggests there is gender discrimination. Recall our case study on whether yawning is contagious from Section 8.6. The previous example involves inference about an unknown difference of population proportions as well. This time, it will be \\(p_{m} - p_{f}\\), where \\(p_{m}\\) is the population proportion of résumés with male names being recommended for promotion and \\(p_{f}\\) is the equivalent for résumés with female names. Recall that this is one of the scenarios for inference we’ve seen so far in Table 9.2. 
TABLE 9.2: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) So, based on our sample of \\(n_m\\) = 24 “male” applicants and \\(n_f\\) = 24 “female” applicants, the point estimate for \\(p_{m} - p_{f}\\) is the difference in sample proportions \\(\\widehat{p}_{m} -\\widehat{p}_{f}\\) = 0.875 - 0.583 = 0.292 = 29.2%. This difference in favor of “male” résumés of 0.292 is greater than 0, suggesting discrimination in favor of men. However, the question we asked ourselves was “is this difference meaningfully greater than 0?”. In other words, is that difference indicative of true discrimination, or can we just attribute it to sampling variation? Hypothesis testing allows us to make such distinctions. 9.2 Understanding hypothesis tests Much like the terminology, notation, and definitions relating to sampling you saw in Section 7.3, there is a lot of terminology, notation, and definitions related to hypothesis testing as well. Learning these may seem like a very daunting task at first. However, with practice, practice, and more practice, anyone can master them. First, a hypothesis is a statement about the value of an unknown population parameter. In our résumé activity, our population parameter of interest is the difference in population proportions \\(p_{m} - p_{f}\\). Hypothesis tests can involve any of the population parameters in Table 7.5 of the five inference scenarios we’ll cover in this book and also more advanced types we won’t cover here. 
Second, a hypothesis test consists of a test between two competing hypotheses: (1) a null hypothesis \\(H_0\\) (pronounced “H-naught”) versus (2) an alternative hypothesis \\(H_A\\) (also denoted \\(H_1\\)). Generally the null hypothesis is a claim that there is “no effect” or “no difference of interest.” In many cases, the null hypothesis represents the status quo or a situation that nothing interesting is happening. Furthermore, generally the alternative hypothesis is the claim the experimenter or researcher wants to establish or find evidence to support. It is viewed as a “challenger” hypothesis to the null hypothesis \\(H_0\\). In our résumé activity, an appropriate hypothesis test would be: \\[ \\begin{aligned} H_0 &amp;: \\text{men and women are promoted at the same rate}\\\\ \\text{vs } H_A &amp;: \\text{men are promoted at a higher rate than women} \\end{aligned} \\] Note some of the choices we have made. First, we set the null hypothesis \\(H_0\\) to be that there is no difference in promotion rate and the “challenger” alternative hypothesis \\(H_A\\) to be that there is a difference. While it would not be wrong in principle to reverse the two, it is a convention in statistical inference that the null hypothesis is set to reflect a “null” situation where “nothing is going on.” As we discussed earlier, in this case, \\(H_0\\) corresponds to there being no difference in promotion rates. Furthermore, we set \\(H_A\\) to be that men are promoted at a higher rate, a subjective choice reflecting a prior suspicion we have that this is the case. We call such alternative hypotheses one-sided alternatives. If someone else however does not share such suspicions and only wants to investigate that there is a difference, whether higher or lower, they would set what is known as a two-sided alternative. 
We can re-express the formulation of our hypothesis test using the mathematical notation for our population parameter of interest, the difference in population proportions \\(p_{m} - p_{f}\\): \\[ \\begin{aligned} H_0 &amp;: p_{m} - p_{f} = 0\\\\ \\text{vs } H_A&amp;: p_{m} - p_{f} &gt; 0 \\end{aligned} \\] Observe how the alternative hypothesis \\(H_A\\) is one-sided with \\(p_{m} - p_{f} &gt; 0\\). Had we opted for a two-sided alternative, we would have set \\(p_{m} - p_{f} \\neq 0\\). To keep things simple for now, we’ll stick with the simpler one-sided alternative. We’ll present an example of a two-sided alternative in Section 9.5. Third, a test statistic is a point estimate/sample statistic formula used for hypothesis testing. Note that a sample statistic is merely a summary statistic based on a sample of observations. Recall we saw in Section 3.3 that a summary statistic takes in many values and returns only one. Here, the samples would be the \\(n_m\\) = 24 résumés with male names and the \\(n_f\\) = 24 résumés with female names. Hence, the point estimate of interest is the difference in sample proportions \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\). Fourth, the observed test statistic is the value of the test statistic that we observed in real life. In our case, we computed this value using the data saved in the promotions data frame. It was the observed difference of \\(\\widehat{p}_{m} -\\widehat{p}_{f} = 0.875 - 0.583 = 0.292 = 29.2\\%\\) in favor of résumés with male names. Fifth, the null distribution is the sampling distribution of the test statistic assuming the null hypothesis \\(H_0\\) is true. Ooof! That’s a long one! Let’s unpack it slowly. The key to understanding the null distribution is that the null hypothesis \\(H_0\\) is assumed to be true. We’re not saying that \\(H_0\\) is true at this point, we’re only assuming it to be true for hypothesis testing purposes. 
In our case, this corresponds to our hypothesized universe of no gender discrimination in promotion rates. Assuming the null hypothesis \\(H_0\\), also stated as “Under \\(H_0\\),” how does the test statistic vary due to sampling variation? In our case, how will the difference in sample proportions \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) vary due to sampling under \\(H_0\\)? Recall from Subsection 7.3.2 that distributions displaying how point estimates vary due to sampling variation are called sampling distributions. The only additional thing to keep in mind about null distributions is that they are sampling distributions assuming the null hypothesis \\(H_0\\) is true. In our case, we previously visualized a null distribution in Figure 9.6, which we re-display in Figure 9.7 using our new notation and terminology. It is the distribution of the 16 differences in sample proportions our friends computed assuming a hypothetical universe of no gender discrimination. We also mark the value of the observed test statistic of 0.292 with a vertical line. FIGURE 9.7: Null distribution and observed test statistic. Sixth, the \\(p\\)-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. Double ooof! Let’s unpack this slowly as well. You can think of the \\(p\\)-value as a quantification of “surprise”: assuming \\(H_0\\) is true, how surprised are we with what we observed? Or in our case, in our hypothesized universe of no gender discrimination, how surprised are we that we observed a difference in promotion rates of 0.292 from our collected samples assuming \\(H_0\\) is true? Very surprised? Somewhat surprised? The \\(p\\)-value quantifies this probability, or in the case of our 16 differences in sample proportions in Figure 9.7, what proportion had a more “extreme” result? 
Here, extreme is defined in terms of the alternative hypothesis \\(H_A\\) that “male” applicants are promoted at a higher rate than “female” applicants. In other words, how often was the discrimination in favor of men at least as pronounced as \\(0.875 - 0.583 = 0.292 = 29.2\\%\\)? In this case, we obtained a difference in proportions greater than or equal to the observed difference of 0.292 = 29.2% 0 times out of 16. A very rare (in fact, not occurring) outcome! Given the rarity of such a pronounced difference in promotion rates in our hypothesized universe of no gender discrimination, we’re inclined to reject our hypothesized universe. Instead, we favor the hypothesis stating there is discrimination in favor of the “male” applicants. In other words, we reject \\(H_0\\) in favor of \\(H_A\\). Seventh and lastly, in many hypothesis testing procedures, it is commonly recommended to set the significance level of the test beforehand. It is denoted by the Greek letter \\(\\alpha\\) (pronounced “alpha”). This value acts as a cutoff on the \\(p\\)-value, where if the \\(p\\)-value falls below \\(\\alpha\\), we would “reject the null hypothesis \\(H_0\\).” Alternatively, if the \\(p\\)-value does not fall below \\(\\alpha\\), we would “fail to reject \\(H_0\\).” Note the latter statement is not quite the same as saying we “accept \\(H_0\\).” This distinction is rather subtle and not immediately obvious, so we’ll revisit it later in Section 9.4. While different fields tend to use different values of \\(\\alpha\\), some commonly used values for \\(\\alpha\\) are 0.1, 0.05, and 0.01, with 0.05 being the choice people often make without putting much thought into it. We’ll talk more about \\(\\alpha\\) significance levels in Section 9.4, but first let’s fully conduct the hypothesis test corresponding to our promotions activity using the infer package. 9.3 Conducting hypothesis tests In Section 8.4, we showed you how to construct confidence intervals. 
We first illustrated how to do this using dplyr data wrangling verbs and the rep_sample_n() function from Subsection 7.2.3 which we used as a virtual shovel. In particular, we constructed confidence intervals by resampling with replacement by setting the replace = TRUE argument to the rep_sample_n() function. We then showed you how to perform the same task using the infer package workflow. While both workflows resulted in the same bootstrap distribution from which we can construct confidence intervals, the infer package workflow emphasizes each of the steps in the overall process in Figure 9.8. It does so using function names that are intuitively named with verbs: specify() the variables of interest in your data frame. generate() replicates of bootstrap resamples with replacement. calculate() the summary statistic of interest. visualize() the resulting bootstrap distribution and confidence interval. FIGURE 9.8: Confidence intervals with the infer package. In this section, we’ll now show you how to seamlessly modify the previously seen infer code for constructing confidence intervals to conduct hypothesis tests. You’ll notice that the basic outline of the workflow is almost identical, except for an additional hypothesize() step between the specify() and generate() steps, as can be seen in Figure 9.9. FIGURE 9.9: Hypothesis testing with the infer package. Furthermore, we’ll use a pre-specified significance level \\(\\alpha\\) = 0.05 for this hypothesis test. Let’s leave discussion on the choice of this \\(\\alpha\\) value until later on in Section 9.4. 9.3.1 infer package workflow 1. specify variables Recall that we use the specify() verb to specify the response variable and, if needed, any explanatory variables for our study. In this case, since we are interested in any potential effects of gender on promotion decisions, we set decision as the response variable and gender as the explanatory variable. 
We do so using formula = response ~ explanatory where response is the name of the response variable in the data frame and explanatory is the name of the explanatory variable. So in our case it is decision ~ gender. Furthermore, since we are interested in the proportion of résumés &quot;promoted&quot;, and not the proportion of résumés not promoted, we set the argument success to &quot;promoted&quot;. promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) Response: decision (factor) Explanatory: gender (factor) # A tibble: 48 x 2 decision gender &lt;fct&gt; &lt;fct&gt; 1 promoted male 2 promoted male 3 promoted male 4 promoted male 5 promoted male 6 promoted male 7 promoted male 8 promoted male 9 promoted male 10 promoted male # … with 38 more rows Again, notice how the promotions data itself doesn’t change, but the Response: decision (factor) and Explanatory: gender (factor) meta-data do. This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Section 3.4. 2. hypothesize the null In order to conduct hypothesis tests using the infer workflow, we need a new step not present for confidence intervals: hypothesize(). Recall from Section 9.2 that our hypothesis test was \\[ \\begin{aligned} H_0 &amp;: p_{m} - p_{f} = 0\\\\ \\text{vs. } H_A&amp;: p_{m} - p_{f} &gt; 0 \\end{aligned} \\] In other words, the null hypothesis \\(H_0\\) corresponding to our “hypothesized universe” stated that there was no difference in promotion rates between résumés with “male” and “female” names. We set this null hypothesis \\(H_0\\) in our infer workflow using the null argument of the hypothesize() function to either: &quot;point&quot; for hypotheses involving a single sample or &quot;independence&quot; for hypotheses involving two samples. In our case, since we have two samples (the résumés with “male” and “female” names), we set null = &quot;independence&quot;. 
promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) Response: decision (factor) Explanatory: gender (factor) Null Hypothesis: independence # A tibble: 48 x 2 decision gender &lt;fct&gt; &lt;fct&gt; 1 promoted male 2 promoted male 3 promoted male 4 promoted male 5 promoted male 6 promoted male 7 promoted male 8 promoted male 9 promoted male 10 promoted male # … with 38 more rows Again, the data has not changed yet. This will occur at the upcoming generate() step; we’re merely setting meta-data for now. Where do the terms &quot;point&quot; and &quot;independence&quot; come from? These are two technical statistical terms. The term “point” relates to the fact that for a single group of observations, you will test the value of a single point. Going back to the pennies example from Chapter 8, say we wanted to test if the mean year of all US pennies was equal to 1993 or not. We would be testing the value of a “point” \\(\\mu\\), the mean year of all US pennies, as follows: \\[ \\begin{aligned} H_0 &amp;: \\mu = 1993\\\\ \\text{vs } H_A&amp;: \\mu \\neq 1993 \\end{aligned} \\] The term “independence” relates to the fact that for two groups of observations, you are testing whether or not the response variable is independent of the explanatory variable that assigns the groups. In our case, we are testing whether the decision response variable is “independent” of the explanatory variable gender that assigns each résumé to either of the two groups. 3. generate replicates After we hypothesize() the null hypothesis, we generate() replicates of “shuffled” datasets assuming the null hypothesis is true. We do this by repeating the shuffling exercise you performed in Section 9.1 several times. Instead of merely doing it 16 times as our groups of friends did, let’s use the computer to repeat this 1000 times by setting reps = 1000 in the generate() function. 
However, unlike for confidence intervals where we generated replicates using type = &quot;bootstrap&quot; resampling with replacement, we’ll now perform shuffles/permutations by setting type = &quot;permute&quot;. Recall that shuffles/permutations are a kind of resampling, but unlike the bootstrap method, they involve resampling without replacement. promotions_generate &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) nrow(promotions_generate) [1] 48000 Observe that the resulting data frame has 48,000 rows. This is because we performed shuffles/permutations for each of the 48 rows 1000 times and \\(48,000 = 1000 \\cdot 48\\). If you explore the promotions_generate data frame with View(), you’ll notice that the variable replicate indicates which resample each row belongs to. So it has the value 1 48 times, the value 2 48 times, all the way through to the value 1000 48 times. 4. calculate summary statistics Now that we have generated 1000 replicates of “shuffles” assuming the null hypothesis is true, let’s calculate() the appropriate summary statistic for each of our 1000 shuffles. From Section 9.2, point estimates related to hypothesis testing have a specific name: test statistics. Since the unknown population parameter of interest is the difference in population proportions \\(p_{m} - p_{f}\\), the test statistic here is the difference in sample proportions \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\). For each of our 1000 shuffles, we can calculate this test statistic by setting stat = &quot;diff in props&quot;. Furthermore, since we are interested in \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) we set order = c(&quot;male&quot;, &quot;female&quot;). As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly. 
Let’s save the result in a data frame called null_distribution: null_distribution &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) null_distribution # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 -0.0416667 2 2 -0.125 3 3 -0.125 4 4 -0.0416667 5 5 -0.0416667 6 6 -0.125 7 7 -0.125 8 8 -0.125 9 9 -0.0416667 10 10 -0.0416667 # … with 990 more rows Observe that we have 1000 values of stat, each representing one instance of \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) in a hypothesized world of no gender discrimination. Observe as well that we chose the name of this data frame carefully: null_distribution. Recall once again from Section 9.2 that sampling distributions when the null hypothesis \\(H_0\\) is assumed to be true have a special name: the null distribution. What was the observed difference in promotion rates? In other words, what was the observed test statistic \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\)? Recall from Section 9.1 that we computed this observed difference by hand to be 0.875 - 0.583 = 0.292 = 29.2%. We can also compute this value using the previous infer code but with the hypothesize() and generate() steps removed. Let’s save this in obs_diff_prop: obs_diff_prop &lt;- promotions %&gt;% specify(decision ~ gender, success = &quot;promoted&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) obs_diff_prop # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.291667 5. visualize the p-value The final step is to measure how surprised we are by a promotion difference of 29.2% in a hypothesized universe of no gender discrimination. 
If the observed difference of 0.292 is highly unlikely, then we would be inclined to reject the validity of our hypothesized universe. We start by visualizing the null distribution of our 1000 values of \\(\\widehat{p}_{m} - \\widehat{p}_{f}\\) using visualize() in Figure 9.10. Recall that these are values of the difference in promotion rates assuming \\(H_0\\) is true. This corresponds to being in our hypothesized universe of no gender discrimination. visualize(null_distribution, bins = 10) FIGURE 9.10: Null distribution. Let’s now add what happened in real life to Figure 9.10, the observed difference in promotion rates of 0.875 - 0.583 = 0.292 = 29.2%. However, instead of merely adding a vertical line using geom_vline(), let’s use the shade_p_value() function with obs_stat set to the observed test statistic value we saved in obs_diff_prop. Furthermore, we’ll set the direction = &quot;right&quot; reflecting our alternative hypothesis \\(H_A: p_{m} - p_{f} &gt; 0\\). Recall our alternative hypothesis \\(H_A\\) is that \\(p_{m} - p_{f} &gt; 0\\), stating that there is a difference in promotion rates in favor of résumés with male names. “More extreme” here corresponds to differences that are “bigger” or “more positive” or “more to the right.” Hence we set the direction argument of shade_p_value() to be &quot;right&quot;. On the other hand, had our alternative hypothesis \\(H_A\\) been the other possible one-sided alternative \\(p_{m} - p_{f} &lt; 0\\), suggesting discrimination in favor of résumés with female names, we would’ve set direction = &quot;left&quot;. Had our alternative hypothesis \\(H_A\\) been two-sided \\(p_{m} - p_{f} \\neq 0\\), suggesting discrimination in either direction, we would’ve set direction = &quot;both&quot;. visualize(null_distribution, bins = 10) + shade_p_value(obs_stat = obs_diff_prop, direction = &quot;right&quot;) FIGURE 9.11: Shaded histogram to show \\(p\\)-value. 
In the resulting Figure 9.11, the solid dark line marks 0.292 = 29.2%. However, what does the shaded region correspond to? This is the \\(p\\)-value. Recall the definition of the \\(p\\)-value from Section 9.2: A \\(p\\)-value is the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. So judging by the shaded region in Figure 9.11, it seems we would somewhat rarely observe differences in promotion rates of 0.292 = 29.2% or more in a hypothesized universe of no gender discrimination. In other words, the \\(p\\)-value is somewhat small. Hence, we would be inclined to reject this hypothesized universe, or using statistical language we would “reject \\(H_0\\).” What fraction of the null distribution is shaded? In other words, what is the exact value of the \\(p\\)-value? We can compute it using the get_p_value() function with the same arguments as the previous shade_p_value() code: null_distribution %&gt;% get_p_value(obs_stat = obs_diff_prop, direction = &quot;right&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.027 Keeping the definition of a \\(p\\)-value in mind, the probability of observing a difference in promotion rates of 0.292 = 29.2% or more due to sampling variation alone in the null distribution is 0.027 = 2.7%. Since this \\(p\\)-value is smaller than our pre-specified significance level \\(\\alpha\\) = 0.05, we reject the null hypothesis \\(H_0: p_{m} - p_{f} = 0\\). In other words, this \\(p\\)-value is sufficiently small to reject our hypothesized universe of no gender discrimination. We instead have enough evidence to change our mind in favor of gender discrimination being a likely culprit here. Observe that whether we reject the null hypothesis \\(H_0\\) or not depends in large part on our choice of significance level \\(\\alpha\\). We’ll discuss this more in Subsection 9.4.3. 
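The proportion that get_p_value() reports can also be checked by hand. The following is a minimal base R sketch, not the infer implementation. It rebuilds the promotions counts from the proportions quoted above (21 of 24 “male” and 14 of 24 “female” résumés promoted), shuffles the gender labels 1000 times, and computes the fraction of shuffled differences at least as large as the observed one. The seed, the helper name diff_in_props(), and the variable names are assumptions made for illustration, so the resulting value will likely differ slightly from 0.027.

```r
# Minimal base R sketch of a permutation test (illustrative; not infer's code).
set.seed(2019)  # arbitrary seed, assumed here only for reproducibility

# Rebuild the 48 resumes from the reported proportions: 21/24 and 14/24 promoted.
gender   <- rep(c("male", "female"), each = 24)
decision <- c(rep("promoted", 21), rep("not", 3),
              rep("promoted", 14), rep("not", 10))

# Hypothetical helper: difference in sample proportions, male minus female.
diff_in_props <- function(decision, gender) {
  mean(decision[gender == "male"] == "promoted") -
    mean(decision[gender == "female"] == "promoted")
}

# Observed test statistic (about 0.292).
obs_diff <- diff_in_props(decision, gender)

# Shuffle the gender labels (resampling without replacement) 1000 times
# and recompute the test statistic for each shuffle.
null_stats <- replicate(1000, diff_in_props(decision, sample(gender)))

# p-value for direction = "right": share of shuffles at least as extreme
# as the observed statistic. The small tolerance guards against
# floating-point ties.
p_value <- mean(null_stats >= obs_diff - 1e-12)
p_value
```

Changing the comparison to `<=` would correspond to direction = &quot;left&quot;; this sketch does not attempt the two-sided case.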
9.3.2 Comparison with confidence intervals One of the great things about the infer package is that we can jump seamlessly between conducting hypothesis tests and constructing confidence intervals with minimal changes! Recall the code from the previous section that creates the null distribution, which in turn is needed to compute the \\(p\\)-value: null_distribution &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) To create the corresponding bootstrap distribution needed to construct a 95% confidence interval for \\(p_{m} - p_{f}\\), we only need to make two changes. First, we remove the hypothesize() step since we are no longer assuming a null hypothesis \\(H_0\\) is true. We can do this by deleting or commenting out the hypothesize() line of code. Second, we switch the type of resampling in the generate() step to be &quot;bootstrap&quot; instead of &quot;permute&quot;. 
bootstrap_distribution &lt;- promotions %&gt;% specify(formula = decision ~ gender, success = &quot;promoted&quot;) %&gt;% # Change 1 - Remove hypothesize(): # hypothesize(null = &quot;independence&quot;) %&gt;% # Change 2 - Switch type from &quot;permute&quot; to &quot;bootstrap&quot;: generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;male&quot;, &quot;female&quot;)) Using this bootstrap_distribution, let’s first compute the percentile-based confidence intervals, as we did in Section 8.4: percentile_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.0444444 0.538542 Using our shorthand interpretation for 95% confidence intervals from Subsection 8.5.2, we are 95% “confident” that the true difference in population proportions \\(p_{m} - p_{f}\\) is between (0.044, 0.539). Let’s visualize bootstrap_distribution and this percentile-based 95% confidence interval for \\(p_{m} - p_{f}\\) in Figure 9.12. visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = percentile_ci) FIGURE 9.12: Percentile-based 95% confidence interval. Notice a key value that is not included in the 95% confidence interval for \\(p_{m} - p_{f}\\): the value 0. In other words, a difference of 0 is not included in our net, suggesting that \\(p_{m}\\) and \\(p_{f}\\) are truly different! Furthermore, observe how the entirety of the 95% confidence interval for \\(p_{m} - p_{f}\\) lies above 0, suggesting that this difference is in favor of men. Since the bootstrap distribution appears to be roughly normally shaped, we can also use the standard error method as we did in Section 8.4. In this case, we must specify the point_estimate argument as the observed difference in promotion rates 0.292 = 29.2% saved in obs_diff_prop. This value acts as the center of the confidence interval. 
se_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;se&quot;, point_estimate = obs_diff_prop) se_ci # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 0.0514129 0.531920 Let’s visualize bootstrap_distribution again, but now with the standard error-based 95% confidence interval for \\(p_{m} - p_{f}\\) in Figure 9.13. Again, notice how the value 0 is not included in our confidence interval, again suggesting that \\(p_{m}\\) and \\(p_{f}\\) are truly different! visualize(bootstrap_distribution) + shade_confidence_interval(endpoints = se_ci) FIGURE 9.13: Standard error-based 95% confidence interval. Learning check (LC9.1) Conduct the same hypothesis test and confidence interval analysis comparing male and female promotion rates using the median rating instead of the mean rating. What was different and what was the same? (LC9.2) Why are we relatively confident that the distributions of the sample proportions will be good approximations of the population distributions of promotion proportions for the two genders? (LC9.3) Using the definition of the \\(p\\)-value, write in words what the \\(p\\)-value represents for the hypothesis test comparing the promotion rates for males and females. 9.3.3 “There is only one test” Let’s recap the steps necessary to conduct a hypothesis test using the terminology, notation, and definitions related to sampling you saw in Section 9.2 and the infer workflow from Subsection 9.3.1: specify() the variables of interest in your data frame. hypothesize() the null hypothesis \\(H_0\\). In other words, set a “model for the universe” assuming \\(H_0\\) is true. generate() shuffles assuming \\(H_0\\) is true. In other words, simulate data assuming \\(H_0\\) is true. calculate() the test statistic of interest, both for the observed data and your simulated data. visualize() the resulting null distribution and compute the \\(p\\)-value by comparing the null distribution to the observed test statistic. 
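The steps above can be condensed into one small reusable function. The sketch below is an illustration in base R, not the infer implementation: one_test() is a hypothetical helper that takes an observed test statistic and a function that simulates a single test statistic under \\(H_0\\). To show that the same recipe applies beyond shuffling, it is applied to made-up data for a “point” null hypothesis: 16 heads in 20 coin flips, testing \\(H_0: p = 0.5\\) versus \\(H_A: p &gt; 0.5\\). The data, the seed, and all names are assumptions.

```r
# Hypothetical helper sketching the general recipe (not part of infer):
# simulate the test statistic under H0 many times, then compare the
# resulting null distribution to the observed test statistic.
one_test <- function(observed_stat, simulate_null_stat, reps = 1000) {
  null_stats <- replicate(reps, simulate_null_stat())
  list(
    null_stats = null_stats,
    # One-sided ("right") p-value: share of simulated statistics that are
    # at least as large as the observed one.
    p_value = mean(null_stats >= observed_stat)
  )
}

set.seed(76)  # arbitrary seed, assumed here only for reproducibility

# Made-up data: 16 heads observed in 20 flips.
n_flips <- 20
observed_prop <- 16 / n_flips

# Model for the universe under H0: a fair coin (p = 0.5).
simulate_null_prop <- function() {
  rbinom(1, size = n_flips, prob = 0.5) / n_flips
}

result <- one_test(observed_prop, simulate_null_prop, reps = 1000)
result$p_value
```

Only simulate_null_prop() encodes the null hypothesis; swapping it for a function that shuffles group labels and recomputes a difference in proportions would recover the two-sample test from Subsection 9.3.1.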
While this is a lot to digest, especially the first time you encounter hypothesis testing, the nice thing is that once you understand this general framework, then you can understand any hypothesis test. In a famous blog post, computer scientist Allen Downey called this the “There is only one test” framework, for which he created the flowchart displayed in Figure 9.14. FIGURE 9.14: Allen Downey’s hypothesis testing framework. Notice its similarity with the “hypothesis testing with infer” diagram you saw in Figure 9.9. That’s because the infer package was explicitly designed to match the “There is only one test” framework. So if you can understand the framework, you can easily generalize these ideas for all hypothesis testing scenarios. Whether for population proportions \\(p\\), population means \\(\\mu\\), differences in population proportions \\(p_1 - p_2\\), differences in population means \\(\\mu_1 - \\mu_2\\), and as you’ll see in Chapter 10 on inference for regression, population regression slopes \\(\\beta_1\\) as well. In fact, it applies more generally even than just these examples to more complicated hypothesis tests and test statistics as well. Learning check (LC9.4) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between the promotion rate of males and females using this study. 9.4 Interpreting hypothesis tests Interpreting the results of hypothesis tests is one of the more challenging aspects of this method for statistical inference. In this section, we’ll focus on ways to help with deciphering the process and address some common misconceptions. 9.4.1 Two possible outcomes In Section 9.2, we mentioned that given a pre-specified significance level \\(\\alpha\\) there are two possible outcomes of a hypothesis test: If the \\(p\\)-value is less than \\(\\alpha\\), then we reject the null hypothesis \\(H_0\\) in favor of \\(H_A\\). 
If the \\(p\\)-value is greater than or equal to \\(\\alpha\\), we fail to reject the null hypothesis \\(H_0\\). Unfortunately, the latter result is often misinterpreted as “accepting the null hypothesis \\(H_0\\).” While at first glance it may seem that the statements “failing to reject \\(H_0\\)” and “accepting \\(H_0\\)” are equivalent, there actually is a subtle difference. Saying that we “accept the null hypothesis \\(H_0\\)” is equivalent to stating that “we think the null hypothesis \\(H_0\\) is true.” However, saying that we “fail to reject the null hypothesis \\(H_0\\)” is saying something else: “While \\(H_0\\) might still be false, we don’t have enough evidence to say so.” In other words, there is an absence of enough proof. However, the absence of proof is not proof of absence. To further shed light on this distinction, let’s use the United States criminal justice system as an analogy. A criminal trial in the United States is a similar situation to hypothesis tests whereby a choice between two contradictory claims must be made about a defendant who is on trial: The defendant is truly either “innocent” or “guilty.” The defendant is presumed “innocent until proven guilty.” The defendant is found guilty only if there is strong evidence that the defendant is guilty. The phrase “beyond a reasonable doubt” is often used as a guideline for determining a cutoff for when enough evidence exists to find the defendant guilty. The defendant is found to be either “not guilty” or “guilty” in the ultimate verdict. In other words, not guilty verdicts are not suggesting the defendant is innocent, but instead that “while the defendant may still actually be guilty, there wasn’t enough evidence to prove this fact.” Now let’s make the connection with hypothesis tests: Either the null hypothesis \\(H_0\\) or the alternative hypothesis \\(H_A\\) is true. Hypothesis tests are conducted assuming the null hypothesis \\(H_0\\) is true. 
We reject the null hypothesis \\(H_0\\) in favor of \\(H_A\\) only if the evidence found in the sample suggests that \\(H_A\\) is true. The significance level \\(\\alpha\\) is used as a guideline to set the threshold on just how strong the evidence must be. We ultimately decide to either “fail to reject \\(H_0\\)” or “reject \\(H_0\\).” So while gut instinct may suggest “failing to reject \\(H_0\\)” and “accepting \\(H_0\\)” are equivalent statements, they are not. “Accepting \\(H_0\\)” is equivalent to finding a defendant innocent. However, courts do not find defendants “innocent,” but rather they find them “not guilty.” Putting things differently, defense attorneys do not need to prove that their clients are innocent; rather, they only need to show that their clients are not “guilty beyond a reasonable doubt.” So going back to our résumés activity in Section 9.3, recall that our hypothesis test was \\(H_0: p_{m} - p_{f} = 0\\) versus \\(H_A: p_{m} - p_{f} &gt; 0\\) and that we used a pre-specified significance level of \\(\\alpha\\) = 0.05. We found a \\(p\\)-value of 0.027. Since the \\(p\\)-value was smaller than \\(\\alpha\\) = 0.05, we rejected \\(H_0\\). In other words, we found sufficient evidence in this particular sample to reject \\(H_0\\) at the \\(\\alpha\\) = 0.05 significance level. We also state this conclusion using non-statistical language: we found enough evidence in this data to suggest that there was gender discrimination at play. 9.4.2 Types of errors Unfortunately, there is some chance a jury or a judge can make an incorrect decision in a criminal trial by reaching the wrong verdict. For example, they could find a truly innocent defendant “guilty” or, on the other hand, find a truly guilty defendant “not guilty.” This can often stem from the fact that prosecutors don’t have access to all the relevant evidence, but instead are limited to whatever evidence the police can find. The same holds for hypothesis tests. 
We can make incorrect decisions about a population parameter because we only have a sample of data from the population and thus sampling variation can lead us to incorrect conclusions. There are two possible erroneous conclusions in a criminal trial: either (1) a truly innocent person is found guilty or (2) a truly guilty person is found not guilty. Similarly, there are two possible errors in a hypothesis test: either (1) rejecting \\(H_0\\) when in fact \\(H_0\\) is true, called a Type I error or (2) failing to reject \\(H_0\\) when in fact \\(H_0\\) is false, called a Type II error. Another term used for “Type I error” is “false positive,” while another term for “Type II error” is “false negative.” This risk of error is the price researchers pay for basing inference on a sample instead of performing a census on the entire population. But as we’ve seen in our numerous examples and activities so far, censuses are often very expensive and other times impossible, and thus researchers have no choice but to use a sample. Thus in any hypothesis test based on a sample, we have no choice but to tolerate some chance that a Type I error will be made and some chance that a Type II error will occur. To help understand the concepts of Type I error and Type II errors, we apply these terms to our criminal justice analogy in Figure 9.15. FIGURE 9.15: Type I and Type II errors in criminal trials. Thus a Type I error corresponds to incorrectly putting a truly innocent person in jail, whereas a Type II error corresponds to letting a truly guilty person go free. Let’s show the corresponding table in Figure 9.16 for hypothesis tests. FIGURE 9.16: Type I and Type II errors in hypothesis tests. 9.4.3 How do we choose alpha? If we are using a sample to make inferences about a population, we run the risk of making errors. For confidence intervals, a corresponding “error” would be constructing a confidence interval that does not contain the true value of the population parameter. 
For hypothesis tests, this would be making either a Type I or Type II error. Obviously, we want to minimize the probability of either error; we want a small probability of making an incorrect conclusion: The probability of a Type I Error occurring is denoted by \\(\\alpha\\). The value of \\(\\alpha\\) is called the significance level of the hypothesis test, which we defined in Section 9.2. The probability of a Type II Error is denoted by \\(\\beta\\). The value of \\(1-\\beta\\) is known as the power of the hypothesis test. In other words, \\(\\alpha\\) corresponds to the probability of incorrectly rejecting \\(H_0\\) when in fact \\(H_0\\) is true. On the other hand, \\(\\beta\\) corresponds to the probability of incorrectly failing to reject \\(H_0\\) when in fact \\(H_0\\) is false. Ideally, we want \\(\\alpha = 0\\) and \\(\\beta = 0\\), meaning that the chance of making either error is 0. However, this can never be the case in any situation where we are sampling for inference. There will always be the possibility of making either error when we use sample data. Furthermore, these two error probabilities are inversely related. As the probability of a Type I error goes down, the probability of a Type II error goes up. What is typically done in practice is to fix the probability of a Type I error by pre-specifying a significance level \\(\\alpha\\) and then try to minimize \\(\\beta\\). In other words, we will tolerate a certain fraction of incorrect rejections of the null hypothesis \\(H_0\\), and then try to minimize the fraction of incorrect non-rejections of \\(H_0\\). So for example if we used \\(\\alpha\\) = 0.01, we would be using a hypothesis testing procedure that in the long run would incorrectly reject the null hypothesis \\(H_0\\) one percent of the time. This is analogous to setting the confidence level of a confidence interval. So what value should you use for \\(\\alpha\\)? 
Different fields have different conventions, but some commonly used values include 0.10, 0.05, 0.01, and 0.001. However, it is important to keep in mind that if you use a relatively small value of \\(\\alpha\\), then all things being equal, \\(p\\)-values will have a harder time being less than \\(\\alpha\\). Thus we would reject the null hypothesis less often. In other words, we would reject the null hypothesis \\(H_0\\) only if we have very strong evidence to do so. This is known as a “conservative” test. On the other hand, if we used a relatively large value of \\(\\alpha\\), then all things being equal, \\(p\\)-values will have an easier time being less than \\(\\alpha\\). Thus we would reject the null hypothesis more often. In other words, we would reject the null hypothesis \\(H_0\\) even if we only have mild evidence to do so. This is known as a “liberal” test. Learning check (LC9.5) What is wrong about saying, “The defendant is innocent.” based on the US system of criminal trials? (LC9.6) What is the purpose of hypothesis testing? (LC9.7) What are some flaws with hypothesis testing? How could we alleviate them? (LC9.8) Consider two \\(\\alpha\\) significance levels of 0.1 and 0.01. Of the two, which would lead to a more liberal hypothesis testing procedure? In other words, one that will, all things being equal, lead to more rejections of the null hypothesis \\(H_0\\). 9.5 Case study: Are action or romance movies rated higher? Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?”. IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb. 9.5.1 IMDb ratings data The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of IMDb.com. 
movies # A tibble: 58,788 x 24 title year length budget rating votes r1 r2 r3 r4 r5 r6 &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 2 $100… 1939 71 NA 6 20 0 14.5 4.5 24.5 14.5 14.5 3 $21 … 1941 7 NA 8.200 5 0 0 0 0 0 24.5 4 $40,… 1996 70 NA 8.200 6 14.5 0 0 0 0 0 5 $50,… 1975 71 NA 3.4 17 24.5 4.5 0 14.5 14.5 4.5 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 7 $win… 2002 93 NA 5.3 200 4.5 0 4.5 4.5 24.5 24.5 8 &#39;15&#39; 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5 14.5 9 &#39;38 1987 97 NA 6.6 18 4.5 4.5 4.5 0 0 0 10 &#39;49-… 1917 61 NA 6 51 4.5 0 4.5 4.5 4.5 44.5 # … with 58,778 more rows, and 12 more variables: r7 &lt;dbl&gt;, r8 &lt;dbl&gt;, r9 &lt;dbl&gt;, # r10 &lt;dbl&gt;, mpaa &lt;chr&gt;, Action &lt;int&gt;, Animation &lt;int&gt;, Comedy &lt;int&gt;, # Drama &lt;int&gt;, Documentary &lt;int&gt;, Romance &lt;int&gt;, Short &lt;int&gt; We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original movies dataset was a little messy, we provide a pre-wrangled version of our data in the movies_sample data frame included in the moderndive package. If you’re curious, you can look at the necessary data wrangling code to do this on GitHub. 
movies_sample # A tibble: 68 x 4 title year rating genre &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; 1 Underworld 1985 3.1 Action 2 Love Affair 1932 6.3 Romance 3 Junglee 1961 6.8 Romance 4 Eversmile, New Jersey 1989 5 Romance 5 Search and Destroy 1979 4 Action 6 Secreto de Romelia, El 1988 4.9 Romance 7 Amants du Pont-Neuf, Les 1991 7.4 Romance 8 Illicit Dreams 1995 3.5 Action 9 Kabhi Kabhie 1976 7.7 Romance 10 Electric Horseman, The 1979 5.8 Romance # … with 58 more rows The variables include the title and year the movie was filmed. Furthermore, we have a numerical variable rating, which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating if the movie was an Action or Romance movie. We are interested in whether Action or Romance movies got a higher rating on average. Let’s perform an exploratory data analysis of this data. Recall from Subsection 2.7.1 that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section 2.6 would be to use a faceted histogram. However, in the interest of brevity, let’s only present the boxplot in Figure 9.17. ggplot(data = movies_sample, aes(x = genre, y = rating)) + geom_boxplot() + labs(y = &quot;IMDb rating&quot;) FIGURE 9.17: Boxplot of IMDb rating vs. genre. Eyeballing Figure 9.17, romance movies have a higher median rating. Do we have reason to believe, however, that there is a significant difference between the mean rating for action movies compared to romance movies? It’s hard to say just based on this plot. The boxplot does show that the median sample rating is higher for romance movies. However, there is a large amount of overlap between the boxes. Recall that the median isn’t necessarily the same as the mean either, depending on whether the distribution is skewed. 
Let’s calculate some summary statistics split by the binary categorical variable genre: the number of movies, the mean rating, and the standard deviation split by genre. We’ll do this using dplyr data wrangling verbs. Notice in particular how we count the number of each type of movie using the n() summary function. movies_sample %&gt;% group_by(genre) %&gt;% summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating)) # A tibble: 2 x 4 genre n mean_rating std_dev &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Action 32 5.275 1.36121 2 Romance 36 6.32222 1.60963 Observe that we have 36 movies with an average rating of 6.322 stars and 32 movies with an average rating of 5.275 stars. The difference in these average ratings is thus 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. The question is, however, are these results indicative of a true difference for all romance and action movies? Or could we attribute this difference to chance sampling variation? 9.5.2 Sampling scenario Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Subsection 7.3.1. The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset. Since this sample was randomly taken from the population movies, it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on movies_sample can generalize to the entire population. What are the relevant population parameter and point estimates? We introduce the fourth sampling scenario in Table 9.3. 
TABLE 9.3: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) So, whereas the sampling bowl exercise in Section 7.1 concerned proportions, the pennies exercise in Section 8.1 concerned means, the case study on whether yawning is contagious in Section 8.6 and the promotions activity in Section 9.1 concerned differences in proportions, we are now concerned with differences in means. In other words, the population parameter of interest is the difference in population mean ratings \\(\\mu_a - \\mu_r\\), where \\(\\mu_a\\) is the mean rating of all action movies on IMDb and similarly \\(\\mu_r\\) is the mean rating of all romance movies. Additionally the point estimate/sample statistic of interest is the difference in sample means \\(\\overline{x}_a - \\overline{x}_r\\), where \\(\\overline{x}_a\\) is the mean rating of the \\(n_a\\) = 32 movies in our sample and \\(\\overline{x}_r\\) is the mean rating of the \\(n_r\\) = 36 in our sample. Based on our earlier exploratory data analysis, our estimate \\(\\overline{x}_a - \\overline{x}_r\\) is \\(5.275 - 6.322 = -1.047\\). So there appears to be a slight difference of -1.047 in favor of romance movies. The question is, however, could this difference of -1.047 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for all romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing. 
9.5.3 Conducting the hypothesis test We’ll be testing: \\[ \\begin{aligned} H_0 &amp;: \\mu_a - \\mu_r = 0\\\\ \\text{vs } H_A&amp;: \\mu_a - \\mu_r \\neq 0 \\end{aligned} \\] In other words, the null hypothesis \\(H_0\\) suggests that both romance and action movies have the same mean rating. This is the “hypothesized universe” we’ll assume is true. On the other hand, the alternative hypothesis \\(H_A\\) suggests that there is a difference. Unlike the one-sided alternative we used in the promotions exercise \\(H_a: p_m - p_f &gt; 0\\), we are now considering a two-sided alternative of \\(H_A: \\mu_a - \\mu_r \\neq 0\\). Furthermore, we’ll pre-specify a low significance level of \\(\\alpha\\) = 0.001. By setting this value low, all things being equal, there is a lower chance that the \\(p\\)-value will be less than \\(\\alpha\\). Thus, there is a lower chance that we’ll reject the null hypothesis \\(H_0\\) in favor of the alternative hypothesis \\(H_A\\). In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies, only if we have quite strong evidence. This is known as a “conservative” hypothesis testing procedure. 1. specify variables Let’s now perform all the steps of the infer workflow. We first specify() the variables of interest in the movies_sample data frame using the formula rating ~ genre. This tells infer that the numerical variable rating is the outcome variable, while the binary variable genre is the explanatory variable. Note that unlike previously when we were interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the success argument. 
movies_sample %&gt;% specify(formula = rating ~ genre) Response: rating (numeric) Explanatory: genre (factor) # A tibble: 68 x 2 rating genre &lt;dbl&gt; &lt;fct&gt; 1 3.1 Action 2 6.3 Romance 3 6.8 Romance 4 5 Romance 5 4 Action 6 4.9 Romance 7 7.4 Romance 8 3.5 Action 9 7.7 Romance 10 5.8 Romance # … with 58 more rows Observe at this point that the data in movies_sample has not changed. The only change so far is the newly defined Response: rating (numeric) and Explanatory: genre (factor) meta-data. 2. hypothesize the null We set the null hypothesis \\(H_0: \\mu_a - \\mu_r = 0\\) by using the hypothesize() function. Since we have two samples, action and romance movies, we set null to be &quot;independence&quot; as we described in Section 9.3. movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) Response: rating (numeric) Explanatory: genre (factor) Null Hypothesis: independence # A tibble: 68 x 2 rating genre &lt;dbl&gt; &lt;fct&gt; 1 3.1 Action 2 6.3 Romance 3 6.8 Romance 4 5 Romance 5 4 Action 6 4.9 Romance 7 7.4 Romance 8 3.5 Action 9 7.7 Romance 10 5.8 Romance # … with 58 more rows 3. generate replicates After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section 9.1. We’ll repeat this resampling without replacement of type = &quot;permute&quot; a total of reps = 1000 times. Feel free to run the code below to check out what the generate() step produces. movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% View() 4. 
calculate summary statistics Now that we have 1000 replicated “shuffles” assuming the null hypothesis \\(H_0\\) that both Action and Romance movies on average have the same ratings on IMDb, let’s calculate() the appropriate summary statistic for these 1000 replicated shuffles. From Section 9.2, summary statistics relating to hypothesis testing have a specific name: test statistics. Since the unknown population parameter of interest is the difference in population means \\(\\mu_{a} - \\mu_{r}\\), the test statistic of interest here is the difference in sample means \\(\\overline{x}_{a} - \\overline{x}_{r}\\). For each of our 1000 shuffles, we can calculate this test statistic by setting stat = &quot;diff in means&quot;. Furthermore, since we are interested in \\(\\overline{x}_{a} - \\overline{x}_{r}\\), we set order = c(&quot;Action&quot;, &quot;Romance&quot;). Let’s save the results in a data frame called null_distribution_movies: null_distribution_movies &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) null_distribution_movies # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.511111 2 2 0.345833 3 3 -0.327083 4 4 -0.209028 5 5 -0.433333 6 6 -0.102778 7 7 0.387153 8 8 0.16875 9 9 0.257292 10 10 0.334028 # … with 990 more rows Observe that we have 1000 values of stat, each representing one instance of \\(\\overline{x}_{a} - \\overline{x}_{r}\\). The 1000 values form the null distribution, which is the technical term for the sampling distribution of the difference in sample means \\(\\overline{x}_{a} - \\overline{x}_{r}\\) assuming \\(H_0\\) is true. What happened in real life? What was the observed difference in mean ratings? What was the observed test statistic \\(\\overline{x}_{a} - \\overline{x}_{r}\\)? 
Recall from our earlier data wrangling, this observed difference in means was \\(5.275 - 6.322 = -1.047\\). We can also achieve this using the code that constructed the null distribution null_distribution_movies but with the hypothesize() and generate() steps removed. Let’s save this in obs_diff_means: obs_diff_means &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) obs_diff_means # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -1.04722 5. visualize the p-value Lastly, in order to compute the \\(p\\)-value, we have to assess how “extreme” the observed difference in means of -1.047 is. We do this by comparing -1.047 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings. Let’s visualize both the null distribution and the \\(p\\)-value in Figure 9.18. Unlike our example in Subsection 9.3.1 involving promotions, since we have a two-sided \\(H_A: \\mu_a - \\mu_r \\neq 0\\), we have to allow for both possibilities for more extreme, so we set direction = &quot;both&quot;. visualize(null_distribution_movies, bins = 10) + shade_p_value(obs_stat = obs_diff_means, direction = &quot;both&quot;) FIGURE 9.18: Null distribution, observed test statistic, and \\(p\\)-value. Let’s go over the elements of this plot. First, the histogram is the null distribution. Second, the solid line is the observed test statistic, or the difference in sample means we observed in real life of \\(5.275 - 6.322 = -1.047\\). Third, the two shaded areas of the histogram form the \\(p\\)-value, or the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. What proportion of the null distribution is shaded? In other words, what is the numerical value of the \\(p\\)-value? 
We use the get_p_value() function to compute this value: null_distribution_movies %&gt;% get_p_value(obs_stat = obs_diff_means, direction = &quot;both&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.004 This \\(p\\)-value of 0.004 is very small. In other words, there is a very small chance that we’d observe a difference at least as extreme as 5.275 - 6.322 = -1.047 in a hypothesized universe where there was truly no difference in ratings. But this \\(p\\)-value is larger than our (even smaller) pre-specified \\(\\alpha\\) significance level of 0.001. Thus, we are inclined to fail to reject the null hypothesis \\(H_0: \\mu_a - \\mu_r = 0\\). In non-statistical language, the conclusion is: we do not have the evidence needed in this sample of data to suggest that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies. We, thus, cannot say that a difference exists in romance and action movie ratings, on average, for all IMDb movies. Learning check (LC9.9) Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same? (LC9.10) What conclusions can you make from viewing the faceted histogram looking at rating versus genre that you couldn’t see when looking at the boxplot? (LC9.11) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies. (LC9.12) Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres? (LC9.13) Using the definition of \\(p\\)-value, write in words what the \\(p\\)-value represents for the hypothesis test comparing the mean rating of romance to action movies. (LC9.14) What is the value of the \\(p\\)-value for the hypothesis test comparing the mean rating of romance to action movies? 
(LC9.15) Test your data wrangling knowledge and EDA skills: Use dplyr and tidyr to create the necessary data frame focused on only action and romance movies (but not both) from the movies data frame in the ggplot2movies package. Make a boxplot and a faceted histogram of this population data comparing ratings of action and romance movies from IMDb. Discuss how these plots compare to the similar plots produced for the movies_sample data. 9.6 Conclusion 9.6.1 Theory-based hypothesis tests Much as we did in Subsection 8.7.2 when we showed you a theory-based method for constructing confidence intervals that involved mathematical formulas, we now present an example of a traditional theory-based method to conduct hypothesis tests. This method relies on probability models, probability distributions, and a few assumptions to construct the null distribution. This is in contrast to the approach we’ve been using throughout this book where we relied on computer simulations to construct the null distribution. These traditional theory-based methods have been used for decades mostly because researchers didn’t have access to computers that could run thousands of calculations quickly and efficiently. Now that computing power is much cheaper and more accessible, simulation-based methods are much more feasible. However, researchers in many fields continue to use theory-based methods. Hence, we make it a point to include an example here. As we’ll show in this section, any theory-based method is ultimately an approximation to the simulation-based method. The theory-based method we’ll focus on is known as the two-sample \\(t\\)-test for testing differences in sample means. However, the test statistic we’ll use won’t be the difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\), but rather the related two-sample \\(t\\)-statistic. The data we’ll use will once again be the movies_sample data of action and romance movies from Section 9.5. 
Two-sample t-statistic A common task in statistics is the process of “standardizing a variable.” By standardizing different variables, we make them more comparable. For example, say you are interested in studying the distribution of temperature recordings from Portland, Oregon, USA and comparing it to that of the temperature recordings in Montreal, Quebec, Canada. Given that US temperatures are generally recorded in degrees Fahrenheit and Canadian temperatures are generally recorded in degrees Celsius, how can we make them comparable? One approach would be to convert degrees Fahrenheit into Celsius, or vice versa. Another approach would be to convert them both to a common “standardized” scale, like the Kelvin scale. One common method for standardizing a variable from probability and statistics theory is to compute the \\(z\\)-score: \\[z = \\frac{x - \\mu}{\\sigma}\\] where \\(x\\) represents one value of a variable, \\(\\mu\\) represents the mean of that variable, and \\(\\sigma\\) represents the standard deviation of that variable. You first subtract the mean \\(\\mu\\) from each value of \\(x\\) and then divide \\(x - \\mu\\) by the standard deviation \\(\\sigma\\). These operations will have the effect of re-centering your variable around 0 and re-scaling your variable \\(x\\) so that its values have what are known as “standard units.” Thus for every value that your variable can take, it has a corresponding \\(z\\)-score that gives how many standard units away that value is from the mean \\(\\mu\\). If the variable itself is normally distributed, its \\(z\\)-scores are normally distributed with mean 0 and standard deviation 1. This curve is called a “\\(z\\)-distribution” or “standard normal” curve and has the common, bell-shaped pattern from Figure 9.19 discussed in Appendix A.2. FIGURE 9.19: Standard normal z curve. Bringing these ideas back to the difference of sample mean ratings \\(\\overline{x}_a - \\overline{x}_r\\) of action versus romance movies, how would we standardize this variable? 
By once again subtracting its mean and dividing by its standard deviation. Recall two facts from Subsection 7.3.3. First, if the sampling was done in a representative fashion, then the sampling distribution of \\(\\overline{x}_a - \\overline{x}_r\\) will be centered at the true population parameter \\(\\mu_a - \\mu_r\\). Second, the standard deviation of point estimates like \\(\\overline{x}_a - \\overline{x}_r\\) has a special name: the standard error. Applying these ideas, we present the two-sample \\(t\\)-statistic: \\[t = \\dfrac{ (\\bar{x}_a - \\bar{x}_r) - (\\mu_a - \\mu_r)}{ \\text{SE}_{\\bar{x}_a - \\bar{x}_r} } = \\dfrac{ (\\bar{x}_a - \\bar{x}_r) - (\\mu_a - \\mu_r)}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} }\\] Oofda! There is a lot to try to unpack here! Let’s go slowly. In the numerator, \\(\\bar{x}_a-\\bar{x}_r\\) is the difference in sample means, while \\(\\mu_a - \\mu_r\\) is the difference in population means. In the denominator, \\(s_a\\) and \\(s_r\\) are the sample standard deviations of the action and romance movies in our sample movies_sample. Lastly, \\(n_a\\) and \\(n_r\\) are the sample sizes of the action and romance movies. Putting this together under the square root gives us the standard error \\(\\text{SE}_{\\bar{x}_a - \\bar{x}_r}\\). Observe that the formula for \\(\\text{SE}_{\\bar{x}_a - \\bar{x}_r}\\) has the sample sizes \\(n_a\\) and \\(n_r\\) in them. So as the sample sizes increase, the standard error goes down. We’ve seen this concept numerous times now, in particular in our simulations using the three virtual shovels with \\(n\\) = 25, 50, and 100 slots in Figure 7.15 and in Subsection 8.5.3 where we studied the effect of using larger sample sizes on the widths of confidence intervals. So how can we use the two-sample \\(t\\)-statistic as a test statistic in our hypothesis test? 
First, assuming the null hypothesis \\(H_0: \\mu_a - \\mu_r = 0\\) is true, the right-hand side of the numerator (to the right of the \\(-\\) sign), \\(\\mu_a - \\mu_r\\), becomes 0. Second, similarly to how the Central Limit Theorem from Subsection 7.5.2 states that sample means follow a normal distribution, it can be mathematically proven that the two-sample \\(t\\)-statistic follows a \\(t\\) distribution with degrees of freedom “roughly equal” to \\(df = n_a + n_r - 2\\). To better understand this concept of degrees of freedom, we next display three examples of \\(t\\)-distributions in Figure 9.20 along with the standard normal \\(z\\) curve. FIGURE 9.20: Examples of t-distributions and the z curve. Begin by looking at the center of the plot at 0 on the horizontal axis. As you move up from the value of 0, follow along with the labels and note that the bottom curve corresponds to 1 degree of freedom, the curve above it is for 3 degrees of freedom, the curve above that is for 10 degrees of freedom, and lastly the dotted curve is the standard normal \\(z\\) curve. Observe that all four curves have a bell shape, are centered at 0, and that as the degrees of freedom increase, the \\(t\\)-distribution more and more resembles the standard normal \\(z\\) curve. The “degrees of freedom” measures how different the \\(t\\) distribution will be from a normal distribution. \\(t\\)-distributions tend to have more values in the tails of their distributions than the standard normal \\(z\\) curve. This “roughly equal” statement indicates that the equation \\(df = n_a + n_r - 2\\) is a “good enough” approximation to the true degrees of freedom. The true formula is a bit more complicated than this simple expression, but we’ve found the formula to be beyond the reach of those new to statistical inference and it does little to build the intuition of the \\(t\\)-test. 
The message to retain, however, is that small sample sizes lead to small degrees of freedom and thus small sample sizes lead to \\(t\\)-distributions that are different than the \\(z\\) curve. On the other hand, large sample sizes correspond to large degrees of freedom and thus produce \\(t\\) distributions that closely align with the standard normal \\(z\\)-curve. So, assuming the null hypothesis \\(H_0\\) is true, our formula for the test statistic simplifies a bit: \\[t = \\dfrac{ (\\bar{x}_a - \\bar{x}_r) - 0}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} } = \\dfrac{ \\bar{x}_a - \\bar{x}_r}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} }\\] Let’s compute the values necessary for this two-sample \\(t\\)-statistic. Recall the summary statistics we computed during our exploratory data analysis in Section 9.5.1. movies_sample %&gt;% group_by(genre) %&gt;% summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating)) # A tibble: 2 x 4 genre n mean_rating std_dev &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Action 32 5.275 1.36121 2 Romance 36 6.32222 1.60963 Using these values, the observed two-sample \\(t\\)-test statistic is \\[ \\dfrac{ \\bar{x}_a - \\bar{x}_r}{ \\sqrt{\\dfrac{{s_a}^2}{n_a} + \\dfrac{{s_r}^2}{n_r}} } = \\dfrac{5.28 - 6.32}{ \\sqrt{\\dfrac{{1.36}^2}{32} + \\dfrac{{1.61}^2}{36}} } = -2.906 \\] Great! How can we compute the \\(p\\)-value using this theory-based test statistic? We need to compare it to a null distribution, which we construct next. Null distribution Let’s revisit the null distribution for the test statistic \\(\\bar{x}_a - \\bar{x}_r\\) we constructed in Section 9.5. Let’s visualize this in the left-hand plot of Figure 9.21. 
# Construct null distribution of xbar_a - xbar_r: null_distribution_movies &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) visualize(null_distribution_movies, bins = 10) The infer package also includes some built-in theory-based test statistics. So instead of calculating the test statistic of interest as the &quot;diff in means&quot; \\(\\bar{x}_a - \\bar{x}_r\\), we can calculate the two-sample \\(t\\)-statistic we just defined by setting stat = &quot;t&quot;. Let’s visualize this in the right-hand plot of Figure 9.21. # Construct null distribution of t: null_distribution_movies_t &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% # Notice we switched stat from &quot;diff in means&quot; to &quot;t&quot; calculate(stat = &quot;t&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) visualize(null_distribution_movies_t, bins = 10) FIGURE 9.21: Comparing the null distributions of two test statistics. Observe that while the shapes of the null distributions of the difference in means \\(\\bar{x}_a - \\bar{x}_r\\) and of the two-sample \\(t\\)-statistic are similar, the scales on the x-axis are different. The two-sample \\(t\\)-statistic values are spread out over a larger range. However, a traditional theory-based \\(t\\)-test doesn’t look at the simulated histogram in null_distribution_movies_t, but instead it looks at the \\(t\\)-distribution curve with degrees of freedom equal to roughly 65.85. This calculation is based on the complicated formula referenced previously, which we approximated with \\(df = n_a + n_r - 2 = 32 + 36 - 2 = 66\\). 
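For readers curious where a value like 65.85 can come from, here is a minimal sketch in base R. It assumes that the “complicated formula” is the standard Welch-Satterthwaite approximation for two samples with unequal variances; that is our assumption, since the formula is not named here. The inputs are the summary statistics of movies_sample printed earlier, and the helper name welch_df() is hypothetical: it is not part of infer or moderndive.

```r
# Hypothetical helper (not from infer or moderndive): approximate degrees of
# freedom for a two-sample t-statistic via the Welch-Satterthwaite formula.
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1  # estimated variance of the first sample mean
  v2 <- s2^2 / n2  # estimated variance of the second sample mean
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Summary statistics of movies_sample, copied from the output shown earlier
# (Action: n = 32, sd = 1.36121; Romance: n = 36, sd = 1.60963)
df_welch <- welch_df(s1 = 1.36121, n1 = 32, s2 = 1.60963, n2 = 36)

# The simpler approximation used in the text
df_simple <- 32 + 36 - 2

round(df_welch, 2)
df_simple
```

Under these assumptions the two values land very close together, which is one way to see why the simpler expression is a “good enough” approximation for this sample.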
Let’s overlay this \\(t\\)-distribution curve over the top of our simulated two-sample \\(t\\)-statistics using the method = &quot;both&quot; argument in visualize(). visualize(null_distribution_movies_t, bins = 10, method = &quot;both&quot;) FIGURE 9.22: Null distribution using t-statistic and t-distribution. Observe that the curve does a good job of approximating the histogram here. To calculate the \\(p\\)-value in this case, we need to figure out how much of the total area under the \\(t\\)-distribution curve is at or “more extreme” than our observed two-sample \\(t\\)-statistic. Since \\(H_A: \\mu_a - \\mu_r \\neq 0\\) is a two-sided alternative, we need to add up the areas in both tails. We first compute the observed two-sample \\(t\\)-statistic using infer verbs. This shortcut calculation further assumes that the null hypothesis is true: that the populations of action and romance movies have an equal average rating. obs_two_sample_t &lt;- movies_sample %&gt;% specify(formula = rating ~ genre) %&gt;% calculate(stat = &quot;t&quot;, order = c(&quot;Action&quot;, &quot;Romance&quot;)) obs_two_sample_t # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -2.90589 We want to find the percentage of values that are at or below obs_two_sample_t \\(= -2.906\\) or at or above -obs_two_sample_t \\(= 2.906\\). We use the shade_p_value() function with the direction argument set to &quot;both&quot; to do this: visualize(null_distribution_movies_t, method = &quot;both&quot;) + shade_p_value(obs_stat = obs_two_sample_t, direction = &quot;both&quot;) Warning: Check to make sure the conditions have been met for the theoretical method. {infer} currently does not check these for you. FIGURE 9.23: Null distribution using t-statistic and t-distribution with \\(p\\)-value shaded. (We’ll discuss this warning message shortly.) What is the \\(p\\)-value? 
We apply get_p_value() to our null distribution saved in null_distribution_movies_t: null_distribution_movies_t %&gt;% get_p_value(obs_stat = obs_two_sample_t, direction = &quot;both&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.002 We have a very small \\(p\\)-value of 0.002: in a hypothesized universe of no difference in mean ratings, a two-sample \\(t\\)-statistic at least as extreme as the one we observed would be very unlikely. As with the \\(p\\)-value of 0.004 in Section 9.5, however, it is still larger than our pre-specified significance level of \\(\\alpha\\) = 0.001, so strictly speaking we again fail to reject \\(H_0\\); at a more liberal level such as \\(\\alpha\\) = 0.01, we would reject it. Let’s come back to that earlier warning message: Check to make sure the conditions have been met for the theoretical method. {infer} currently does not check these for you. To be able to use the \\(t\\)-test and other such theoretical methods, there are always a few conditions to check. The infer package does not automatically check these conditions, hence the warning message we received. These conditions are necessary so that the underlying mathematical theory holds. In order for the results of our two-sample \\(t\\)-test to be valid, three conditions must be met: (1) Nearly normal populations or large sample sizes. A general rule of thumb that works in many (but not all) situations is that the sample size \\(n\\) should be greater than 30. (2) Both samples are selected independently of each other. (3) All observations are independent from each other. Let’s see if these conditions hold for our movies_sample data: (1) This is met since \\(n_a\\) = 32 and \\(n_r\\) = 36 are both larger than 30, satisfying our rule of thumb. (2) This is met since we sampled the action and romance movies at random and in an unbiased fashion from the database of all IMDb movies. (3) Unfortunately, we don’t know how IMDb computes the ratings. For example, if the same person rated multiple movies, then those observations would be related and hence not independent. Assuming all three conditions are roughly met, we can be reasonably certain that the theory-based \\(t\\)-test results are valid. If any of the conditions were clearly not met, we couldn’t put as much trust into any conclusions reached. 
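Returning briefly to the \\(p\\)-value itself: get_p_value() reports the proportion of the 1000 simulated \\(t\\)-statistics that are at least as extreme as the observed one. As a cross-check, the following minimal sketch computes the corresponding area under the theoretical \\(t\\) curve with base R’s pt() function. It assumes the observed statistic of -2.90589 and the degrees of freedom of roughly 65.85 quoted earlier; the variable names are ours and purely illustrative.

```r
# Observed two-sample t-statistic, copied from the obs_two_sample_t output
t_obs <- -2.90589

# Degrees of freedom quoted in the text for the t-distribution curve
df_t <- 65.85

# Two-sided p-value: total area in both tails at or beyond |t_obs|
p_theory <- 2 * pt(-abs(t_obs), df = df_t)
p_theory
```

The result need not match the simulated value exactly, since one is an area under a smooth curve and the other is a proportion of 1000 shuffles. Either way, the decision comes from comparing it with the pre-specified \\(\\alpha\\).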
On the other hand, in most scenarios, the only assumption that needs to be met in the simulation-based method is that the sample is selected at random. Thus, we prefer simulation-based methods: they have fewer assumptions, are conceptually easier to understand, and, since computing power has recently become easily accessible, can be run quickly. That being said, since much of the world’s research still relies on traditional theory-based methods, we also believe it is important to understand them. You may be wondering why we chose reps = 1000 for these simulation-based methods. We’ve noticed that, for most problems, around 1000 replicates for the null distribution and the bootstrap distribution are enough to get a general sense of how the statistic behaves. You can increase reps to something like 10,000 if you would like even finer detail, but this will take more time to compute. Feel free to iterate on this to get an even better idea about the shape of the null and bootstrap distributions. 9.6.2 When inference is not needed We’ve now walked through several different examples of how to use the infer package to perform statistical inference: constructing confidence intervals and conducting hypothesis tests. For each of these examples, we made it a point to always perform an exploratory data analysis (EDA) first; specifically, by looking at the raw data values, by using data visualization with ggplot2, and by data wrangling with dplyr beforehand. We highly encourage you to always do the same. As a beginner to statistics, EDA helps you develop intuition as to what statistical methods like confidence intervals and hypothesis tests can tell us. Even as a seasoned practitioner of statistics, EDA helps guide your statistical investigations. In particular, is statistical inference even needed? Let’s consider an example. 
Say we’re interested in the following question: Of all flights leaving a New York City airport, are Hawaiian Airlines flights in the air for longer than Alaska Airlines flights? Furthermore, let’s assume that 2013 flights are a representative sample of all such flights. Then we can use the flights data frame in the nycflights13 package we introduced in Section 1.4 to answer our question. Let’s filter this data frame to only include Hawaiian and Alaska Airlines using their carrier codes HA and AS: flights_sample &lt;- flights %&gt;% filter(carrier %in% c(&quot;HA&quot;, &quot;AS&quot;)) There are two possible statistical inference methods we could use to answer such questions. First, we could construct a 95% confidence interval for the difference in population means \\(\\mu_{HA} - \\mu_{AS}\\), where \\(\\mu_{HA}\\) is the mean air time of all Hawaiian Airlines flights and \\(\\mu_{AS}\\) is the mean air time of all Alaska Airlines flights. We could then check if the entirety of the interval is greater than 0, suggesting that \\(\\mu_{HA} - \\mu_{AS} &gt; 0\\), or, in other words, suggesting that \\(\\mu_{HA} &gt; \\mu_{AS}\\). Second, we could perform a hypothesis test of the null hypothesis \\(H_0: \\mu_{HA} - \\mu_{AS} = 0\\) versus the alternative hypothesis \\(H_A: \\mu_{HA} - \\mu_{AS} &gt; 0\\). However, let’s first construct an exploratory visualization as we suggested earlier. Since air_time is numerical and carrier is categorical, a boxplot can display the relationship between these two variables, which we display in Figure 9.24. ggplot(data = flights_sample, mapping = aes(x = carrier, y = air_time)) + geom_boxplot() + labs(x = &quot;Carrier&quot;, y = &quot;Air Time&quot;) FIGURE 9.24: Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013. This is what we like to call a “no PhD in Statistics needed” moment. 
You don’t have to be an expert in statistics to know that Alaska Airlines and Hawaiian Airlines have significantly different air times. The two boxplots don’t even overlap! Constructing a confidence interval or conducting a hypothesis test would frankly not provide much more insight than Figure 9.24. Let’s investigate why we observe such a clear-cut difference between these two airlines using data wrangling. Let’s first group the rows of flights_sample not only by carrier but also by destination dest. Subsequently, we’ll compute two summary statistics: the number of observations using n() and the mean air time: flights_sample %&gt;% group_by(carrier, dest) %&gt;% summarize(n = n(), mean_time = mean(air_time, na.rm = TRUE)) # A tibble: 2 x 4 # Groups: carrier [2] carrier dest n mean_time &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; 1 AS SEA 714 325.618 2 HA HNL 342 623.088 It turns out that in 2013, Alaska only flew to SEA (Seattle) from New York City (NYC) while Hawaiian only flew to HNL (Honolulu) from NYC. Given the clear difference in distance from New York City to Seattle versus New York City to Honolulu, it is not surprising that we observe such different (statistically significantly different, in fact) air times in flights. This is a clear example of not needing to do anything more than a simple exploratory data analysis using data visualization and descriptive statistics to reach an appropriate conclusion. This is why we highly recommend you perform an EDA of any sample data before running statistical inference methods like confidence intervals and hypothesis tests. 
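The observation that the two boxplots do not overlap can also be checked numerically. The following is a minimal sketch, assuming the dplyr and nycflights13 packages are installed; the names air_time_ranges, min_time, and max_time are our own labels for illustration and do not appear elsewhere in this chapter.

```r
library(dplyr)
library(nycflights13)

# Keep only Hawaiian (HA) and Alaska (AS) Airlines flights
flights_sample <- flights %>%
  filter(carrier %in% c('HA', 'AS'))

# Smallest and largest observed air time for each carrier,
# ignoring flights with a missing air_time
air_time_ranges <- flights_sample %>%
  group_by(carrier) %>%
  summarize(
    min_time = min(air_time, na.rm = TRUE),
    max_time = max(air_time, na.rm = TRUE)
  )
air_time_ranges
```

If the largest air time for one carrier is smaller than the smallest air time for the other, the two groups of flights do not overlap at all, which is what the boxplots suggest.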
9.6.3 Problems with p-values On top of the many common misunderstandings about hypothesis testing and \\(p\\)-values we listed in Section 9.4, another unfortunate consequence of the expanded use of \\(p\\)-values and hypothesis testing is a phenomenon known as “p-hacking.” p-hacking is the act of “cherry-picking” only results that are “statistically significant” while dismissing those that aren’t, even if at the expense of the scientific ideas. There are lots of articles written recently about misunderstandings and the problems with \\(p\\)-values. We encourage you to check some of them out: Misunderstandings of \\(p\\)-values What a nerdy debate about \\(p\\)-values shows about science - and how to fix it Statisticians issue warning over misuse of \\(P\\) values You Can’t Trust What You Read About Nutrition A Litany of Problems with p-values Such issues were getting so problematic that the American Statistical Association (ASA) put out a statement in 2016 titled, “The ASA Statement on Statistical Significance and \\(P\\)-Values,” with six principles underlying the proper use and interpretation of \\(p\\)-values. The ASA released this guidance on \\(p\\)-values to improve the conduct and interpretation of quantitative science and to inform the growing emphasis on reproducibility of science research. We as authors much prefer the use of confidence intervals for statistical inference, since in our opinion they are much less prone to large misinterpretation. However, many fields still exclusively use \\(p\\)-values for statistical inference and this is one reason for including them in this text. We encourage you to learn more about “p-hacking” as well and its implication for science. 9.6.4 Additional resources An R script file of all R code used in this chapter is available here. 
If you want more examples of the infer workflow for conducting hypothesis tests, we suggest you check out the infer package homepage, in particular, a series of example analyses available at https://infer.netlify.com/articles/. 9.6.5 What’s to come We conclude with the infer pipeline for hypothesis testing in Figure 9.25. FIGURE 9.25: infer package workflow for hypothesis testing. Now that we’ve armed ourselves with an understanding of confidence intervals from Chapter 8 and hypothesis tests from this chapter, we’ll now study inference for regression in the upcoming Chapter 10. We’ll revisit the regression models we studied in Chapter 5 on basic regression and Chapter 6 on multiple regression. For example, recall Table 5.2 (shown again here in Table 9.4), corresponding to our regression model for an instructor’s teaching score as a function of their “beauty” score. # Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals) # Get regression table: get_regression_table(score_model) TABLE 9.4: Linear regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 We previously saw in Subsection 5.1.2 that the values in the estimate column are the fitted intercept \\(b_0\\) and fitted slope for beauty score \\(b_1\\). In Chapter 10, we’ll unpack the remaining columns: std_error which is the standard error, statistic which is the observed standardized test statistic to compute the p_value, and the 95% confidence intervals as given by lower_ci and upper_ci. "],
+["10-inference-for-regression.html", "Chapter 10 Inference for Regression 10.1 Regression refresher 10.2 Interpreting regression tables 10.3 Conditions for inference for regression 10.4 Simulation-based inference for regression 10.5 Conclusion", " Chapter 10 Inference for Regression In our penultimate chapter, we’ll revisit the regression models we first studied in Chapters 5 and 6. Armed with our knowledge of confidence intervals and hypothesis tests from Chapters 8 and 9, we’ll be able to apply statistical inference to further our understanding of relationships between outcome and explanatory variables. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once: ggplot2 for data visualization dplyr for data wrangling tidyr for converting data to “tidy” format readr for importing spreadsheet data into R As well as the more advanced purrr, tibble, stringr, and forcats packages If needed, read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(infer) 10.1 Regression refresher Before jumping into inference for regression, let’s remind ourselves of the University of Texas Austin teaching evaluations analysis in Section 5.1. 10.1.1 Teaching evaluations analysis Recall using simple linear regression we modeled the relationship between A numerical outcome variable \\(y\\) (the instructor’s teaching score) and A single numerical explanatory variable \\(x\\) (the instructor’s “beauty” score). We first created an evals_ch5 data frame that selected a subset of variables from the evals data frame included in the moderndive package. 
This evals_ch5 data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching score and the “beauty” rating bty_avg: evals_ch5 &lt;- evals %&gt;% select(ID, score, bty_avg, age) glimpse(evals_ch5) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4… $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4… In Subsection 5.1.1, we performed an exploratory data analysis of the relationship between these two variables of score and bty_avg. We saw there that a weakly positive correlation of 0.187 existed between the two variables. This was evidenced in Figure 10.1 of the scatterplot along with the “best-fitting” regression line that summarizes the linear relationship between the two variables of score and bty_avg. Recall in Subsection 5.3.2 that we defined a “best-fitting” line as the line that minimizes the sum of squared residuals. ggplot(evals_ch5, aes(x = bty_avg, y = score)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Teaching Score&quot;, title = &quot;Relationship between teaching and beauty scores&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) FIGURE 10.1: Relationship with regression line. Looking at this plot again, you might be asking, “Does that line really have all that positive of a slope?”. It does increase from left to right as the bty_avg variable increases, but by how much? To get to this information, recall that we followed a two-step procedure: We first “fit” the linear regression model using the lm() function with the formula score ~ bty_avg. We saved this model in score_model. We get the regression table by applying the get_regression_table() function from the moderndive package to score_model. 
# Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch5) # Get regression table: get_regression_table(score_model) TABLE 10.1: Previously seen linear regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Using the values in the estimate column of the resulting regression table in Table 10.1, we could then obtain the equation of the “best-fitting” regression line in Figure 10.1: \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x\\\\ \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{bty}\\_\\text{avg}} \\cdot\\text{bty}\\_\\text{avg}\\\\ &amp;= 3.880 + 0.067\\cdot\\text{bty}\\_\\text{avg} \\end{aligned} \\] where \\(b_0\\) is the fitted intercept and \\(b_1\\) is the fitted slope for bty_avg. Recall the interpretation of the \\(b_1\\) = 0.067 value of the fitted slope: For every increase of one unit in “beauty” rating, there is an associated increase, on average, of 0.067 units of evaluation score. Thus, the slope value quantifies the relationship between the \\(y\\) variable score and the \\(x\\) variable bty_avg. We also discussed the intercept value of \\(b_0\\) = 3.88 and its lack of practical interpretation, since the range of possible “beauty” scores does not include 0. 10.1.2 Sampling scenario Let’s now revisit this study in terms of the terminology and notation related to sampling we studied in Subsection 7.3.1. First, let’s view the instructors for these 463 courses as a representative sample from a greater study population. In our case, let’s assume that the study population is all instructors at UT Austin and that the sample of instructors who taught these 463 courses is a representative sample. Unfortunately, we can only assume these two facts without more knowledge of the sampling methodology used by the researchers. 
Since we are viewing these \\(n\\) = 463 courses as a sample, we can view our fitted slope \\(b_1\\) = 0.067 as a point estimate of the population slope \\(\\beta_1\\). In other words, \\(\\beta_1\\) quantifies the relationship between teaching score and “beauty” average bty_avg for all instructors at UT Austin. Similarly, we can view our fitted intercept \\(b_0\\) = 3.88 as a point estimate of the population intercept \\(\\beta_0\\) for all instructors at UT Austin. Putting these two ideas together, we can view the equation of the fitted line \\(\\widehat{y}\\) = \\(b_0 + b_1 \\cdot x\\) = \\(3.880 + 0.067 \\cdot \\text{bty}\\_\\text{avg}\\) as an estimate of some true and unknown population line \\(y = \\beta_0 + \\beta_1 \\cdot x\\). Thus we can draw parallels between our teaching evaluations analysis and all the sampling scenarios we’ve seen previously. In this chapter, we’ll focus on the final scenario of regression slopes as shown in Table 10.2. TABLE 10.2: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Fitted regression slope \\(b_1\\) or \\(\\widehat{\\beta}_1\\) Since we are now viewing our fitted slope \\(b_1\\) and fitted intercept \\(b_0\\) as point estimates based on a sample, these estimates will again be subject to sampling variability. In other words, if we collected a new sample of data on a different set of \\(n\\) = 463 courses and their instructors, the new fitted slope \\(b_1\\) will likely differ from 0.067. 
The same goes for the new fitted intercept \\(b_0\\). But by how much will these estimates vary? This information is in the remaining columns of the regression table in Table 10.1. Our knowledge of sampling from Chapter 7, confidence intervals from Chapter 8, and hypothesis tests from Chapter 9 will help us interpret these remaining columns. 10.2 Interpreting regression tables We’ve so far focused only on the two leftmost columns of the regression table in Table 10.1: term and estimate. Let’s now shift our attention to the remaining columns: std_error, statistic, p_value, lower_ci and upper_ci in Table 10.3. TABLE 10.3: Previously seen regression table term estimate std_error statistic p_value lower_ci upper_ci intercept 3.880 0.076 50.96 0 3.731 4.030 bty_avg 0.067 0.016 4.09 0 0.035 0.099 Given the lack of practical interpretation for the fitted intercept \\(b_0\\), in this section we’ll focus only on the second row of the table corresponding to the fitted slope \\(b_1\\). We’ll first interpret the std_error, statistic, p_value, lower_ci and upper_ci columns. Afterwards in the upcoming Subsection 10.2.5, we’ll discuss how R computes these values. 10.2.1 Standard error The third column of the regression table in Table 10.1 std_error corresponds to the standard error of our estimates. Recall the definition of standard error we saw in Subsection 7.3.2: The standard error is the standard deviation of any point estimate computed from a sample. So what does this mean in terms of the fitted slope \\(b_1\\) = 0.067? This value is just one possible value of the fitted slope resulting from this particular sample of \\(n\\) = 463 pairs of teaching and beauty scores. However, if we collected a different sample of \\(n\\) = 463 pairs of teaching and beauty scores, we will almost certainly obtain a different fitted slope \\(b_1\\). This is due to sampling variability. 
Say we hypothetically collected 1000 such samples of pairs of teaching and beauty scores, computed the 1000 resulting values of the fitted slope \\(b_1\\), and visualized them in a histogram. This would be a visualization of the sampling distribution of \\(b_1\\), which we defined in Subsection 7.3.2. Further recall that the standard deviation of the sampling distribution of \\(b_1\\) has a special name: the standard error. Recall that we constructed three sampling distributions for the sample proportion \\(\\widehat{p}\\) using shovels of size 25, 50, and 100 in Figure 7.12. We observed that as the sample size increased, the standard error decreased as evidenced by the narrowing sampling distribution. The standard error of \\(b_1\\) similarly quantifies how much variation in the fitted slope \\(b_1\\) one would expect between different samples. So in our case, we can expect about 0.016 units of variation in the fitted slope for bty_avg from sample to sample. Recall that the estimate and std_error values play a key role in inferring the value of the unknown population slope \\(\\beta_1\\) relating to all instructors. In Section 10.4, we’ll perform a simulation using the infer package to construct the bootstrap distribution for \\(b_1\\) in this case. Recall from Subsection 8.7.1 that the bootstrap distribution is an approximation to the sampling distribution in that they have a similar shape. Since they have a similar shape, they have similar standard errors. However, unlike the sampling distribution, the bootstrap distribution is constructed from a single sample, which is a practice more aligned with what’s done in real life. 10.2.2 Test statistic The fourth column of the regression table in Table 10.1 statistic corresponds to a test statistic relating to the following hypothesis test: \\[ \\begin{aligned} H_0 &amp;: \\beta_1 = 0\\\\ \\text{vs } H_A&amp;: \\beta_1 \\neq 0. 
\\end{aligned} \\] Recall our terminology, notation, and definitions related to hypothesis tests we introduced in Section 9.2. A hypothesis test consists of a test between two competing hypotheses: (1) a null hypothesis \\(H_0\\) versus (2) an alternative hypothesis \\(H_A\\). A test statistic is a point estimate/sample statistic formula used for hypothesis testing. Here, our null hypothesis \\(H_0\\) assumes that the population slope \\(\\beta_1\\) is 0. If the population slope \\(\\beta_1\\) is truly 0, then this is saying that there is no true relationship between teaching and “beauty” scores for all the instructors in our population. In other words, \\(x\\) = “beauty” score would have no associated effect on \\(y\\) = teaching score. The alternative hypothesis \\(H_A\\), on the other hand, assumes that the population slope \\(\\beta_1\\) is not 0, meaning it could be either positive or negative. This suggests either a positive or negative relationship between teaching and “beauty” scores. Recall we called such alternative hypotheses two-sided. By convention, all hypothesis testing for regression assumes two-sided alternatives. Recall our “hypothesized universe” of no gender discrimination we assumed in our promotions activity in Section 9.1. Similarly here when conducting this hypothesis test, we’ll assume a “hypothesized universe” where there is no relationship between teaching and “beauty” scores. In other words, we’ll assume the null hypothesis \\(H_0: \\beta_1 = 0\\) is true. The statistic column in the regression table is a tricky one, however. It corresponds to a standardized t-test statistic, much like the two-sample \\(t\\) statistic we saw in Subsection 9.6.1 where we used a theory-based method for conducting hypothesis tests. In both these cases, the null distribution can be mathematically proven to be a \\(t\\)-distribution. 
Since such test statistics are tricky for individuals new to statistical inference to study, we’ll skip this and jump into interpreting the \\(p\\)-value. If you’re curious, we have included a discussion of this standardized t-test statistic in Subsection 10.5.1. 10.2.3 p-value The fifth column of the regression table in Table 10.1 p_value corresponds to the p-value of the hypothesis test \\(H_0: \\beta_1 = 0\\) versus \\(H_A: \\beta_1 \\neq 0\\). Again recalling our terminology, notation, and definitions related to hypothesis tests we introduced in Section 9.2, let’s focus on the definition of the \\(p\\)-value: A p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \\(H_0\\) is true. Recall that you can intuitively think of the \\(p\\)-value as quantifying how “extreme” the observed fitted slope of \\(b_1\\) = 0.067 is in a “hypothesized universe” where there is no relationship between teaching and “beauty” scores. Following the hypothesis testing procedure we outlined in Section 9.4, since the \\(p\\)-value in this case is reported as 0 (a very small value that is rounded to 0 in the table), we would reject \\(H_0\\) in favor of \\(H_A\\) for any commonly used significance level \\(\\alpha\\). Using non-statistical language, this is saying: we reject the hypothesis that there is no relationship between teaching and “beauty” scores in favor of the hypothesis that there is. That is to say, the evidence suggests there is a significant relationship, one that is positive. More precisely, however, the \\(p\\)-value corresponds to how extreme the observed test statistic of 4.09 is when compared to the appropriate null distribution. In Section 10.4, we’ll perform a simulation using the infer package to construct the null distribution in this case. An extra caveat here is that the results of this hypothesis test are only valid if certain “conditions for inference for regression” are met, which we’ll introduce shortly in Section 10.3. 
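As a rough check on the p_value column, the observed test statistic of 4.09 can be compared to a \\(t\\)-distribution by hand. The sketch below uses only base R. It assumes the null distribution has \\(n - 2\\) = 461 degrees of freedom, the usual choice for simple linear regression; that value is not stated in this section, so treat it as an assumption. The names t_obs, df_assumed, and p_value are ours.

```r
# Observed standardized test statistic for bty_avg from Table 10.1
t_obs <- 4.09

# Assumed degrees of freedom for simple linear regression: n - 2
df_assumed <- 463 - 2

# Two-sided p-value: total area in both tails of the t-distribution
# that is at least as extreme as the observed statistic
p_value <- 2 * pt(-abs(t_obs), df = df_assumed)
p_value
```

The result should be a very small positive number, which is consistent with the regression table displaying it as 0 after rounding.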
10.2.4 Confidence interval The two rightmost columns of the regression table in Table 10.1 (lower_ci and upper_ci) correspond to the endpoints of the 95% confidence interval for the population slope \\(\\beta_1\\). Recall our analogy of “nets are to fish” what “confidence intervals are to population parameters” from Section 8.3. The resulting 95% confidence interval for \\(\\beta_1\\) of (0.035, 0.099) can be thought of as a range of plausible values for the population slope \\(\\beta_1\\) of the linear relationship between teaching and “beauty” scores. As we introduced in Subsection 8.5.2 on the precise and shorthand interpretation of confidence intervals, the statistically precise interpretation of this confidence interval is: “if we repeated this sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population slope \\(\\beta_1\\).” However, we’ll summarize this using our shorthand interpretation that “we’re 95% ‘confident’ that the true population slope \\(\\beta_1\\) lies between 0.035 and 0.099.” Notice in this case that the resulting 95% confidence interval for \\(\\beta_1\\) of \\((0.035, \\, 0.099)\\) does not contain a very particular value: \\(\\beta_1\\) equals 0. Recall we mentioned that if the population regression slope \\(\\beta_1\\) is 0, this is equivalent to saying there is no relationship between teaching and “beauty” scores. Since \\(\\beta_1\\) = 0 is not in our plausible range of values for \\(\\beta_1\\), we are inclined to believe that there, in fact, is a relationship between teaching and “beauty” scores and a positive one at that. So in this case, the conclusion about the population slope \\(\\beta_1\\) from the 95% confidence interval matches the conclusion from the hypothesis test: evidence suggests that there is a meaningful relationship between teaching and “beauty” scores. 
Recall from Subsection 8.5.3, however, that the confidence level is one of many factors that determine confidence interval widths. So for example, say we used a higher confidence level of 99% instead of 95%. The resulting confidence interval for \\(\\beta_1\\) would be wider and thus might now include 0. The lesson to remember here is that any confidence-interval-based conclusion depends highly on the confidence level used. What are the calculations that went into computing the two endpoints of the 95% confidence interval for \\(\\beta_1\\)? Recall our sampling bowl example from Subsection 8.7.2 discussing lower_ci and upper_ci. Since the sampling and bootstrap distributions of the sample proportion \\(\\widehat{p}\\) were roughly normal, we could use the rule of thumb for bell-shaped distributions from Appendix A.2 to create a 95% confidence interval for \\(p\\) with the following equation: \\[\\widehat{p} \\pm \\text{MoE}_{\\widehat{p}} = \\widehat{p} \\pm 1.96 \\cdot \\text{SE}_{\\widehat{p}} = \\widehat{p} \\pm 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] We can generalize this to other point estimates that have roughly normally shaped sampling and/or bootstrap distributions: \\[\\text{point estimate} \\pm \\text{MoE} = \\text{point estimate} \\pm 1.96 \\cdot \\text{SE}.\\] We’ll show in Section 10.4 that the sampling/bootstrap distribution for the fitted slope \\(b_1\\) is in fact bell-shaped as well. Thus we can construct a 95% confidence interval for \\(\\beta_1\\) with the following equation: \\[b_1 \\pm \\text{MoE}_{b_1} = b_1 \\pm 1.96 \\cdot \\text{SE}_{b_1}.\\] What is the value of the standard error \\(\\text{SE}_{b_1}\\)? It is in fact in the third column of the regression table in Table 10.1: 0.016. 
Thus \\[ \\begin{aligned} b_1 \\pm 1.96 \\cdot \\text{SE}_{b_1} &amp;= 0.067 \\pm 1.96 \\cdot 0.016 = 0.067 \\pm 0.031\\\\ &amp;= (0.036, 0.098) \\end{aligned} \\] This closely matches the \\((0.035, 0.099)\\) confidence interval in the last two columns of Table 10.1. Much like hypothesis tests, however, the results of this confidence interval also are only valid if the “conditions for inference for regression” to be discussed in Section 10.3 are met. 10.2.5 How does R compute the table? Since we didn’t perform the simulation to get the values of the standard error, test statistic, \\(p\\)-value, and endpoints of the 95% confidence interval in Table 10.1, you might be wondering how were these values computed. What did R do behind the scenes? Does R run simulations like we did using the infer package in Chapters 8 and 9 on confidence intervals and hypothesis testing? The answer is no! Much like the theory-based method for constructing confidence intervals you saw in Subsection 8.7.2 and the theory-based hypothesis test you saw in Subsection 9.6.1, there exist mathematical formulas that allow you to construct confidence intervals and conduct hypothesis tests for inference for regression. These formulas were derived in a time when computers didn’t exist, so it would’ve been impossible to run the extensive computer simulations we have in this book. We present these formulas in Subsection 10.5.1 on “theory-based inference for regression.” In Section 10.4, we’ll go over a simulation-based approach to constructing confidence intervals and conducting hypothesis tests using the infer package. In particular, we’ll convince you that the bootstrap distribution of the fitted slope \\(b_1\\) is indeed bell-shaped. 10.3 Conditions for inference for regression Recall in Subsection 8.3.2 we stated that we could only use the standard-error-based method for constructing confidence intervals if the bootstrap distribution was bell shaped. 
Similarly, there are certain conditions that need to be met in order for the results of our hypothesis tests and confidence intervals we described in Section 10.2 to have valid meaning. These conditions must be met for the assumed underlying mathematical and probability theory to hold true. For inference for regression, there are four conditions that need to be met. Note that the first letters of these four conditions spell the acronym LINE. This can serve as a nice reminder of what to check for whenever you perform linear regression. (L) Linearity of relationship between variables; (I) Independence of the residuals; (N) Normality of the residuals; (E) Equality of variance of the residuals. Conditions L, N, and E can be verified through what is known as a residual analysis. Condition I can only be verified through an understanding of how the data was collected. In this section, we’ll go over a refresher on residuals, verify whether each of the four LINE conditions holds true, and then discuss the implications. 10.3.1 Residuals refresher Recall our definition of a residual from Subsection 5.1.3: it is the observed value minus the fitted value denoted by \\(y - \\widehat{y}\\). Recall that residuals can be thought of as the error or the “lack-of-fit” between the observed value \\(y\\) and the fitted value \\(\\widehat{y}\\) on the regression line in Figure 10.1. In Figure 10.2, we illustrate one particular residual out of 463 using an arrow, as well as its corresponding observed and fitted values using a circle and a square, respectively. FIGURE 10.2: Example of observed value, fitted value, and residual. Furthermore, we can automate the calculation of all \\(n\\) = 463 residuals by applying the get_regression_points() function to our saved regression model in score_model. Observe how the resulting values of residual are roughly equal to score - score_hat (there is potentially a slight difference due to rounding error). 
# Fit regression model: score_model &lt;- lm(score ~ bty_avg, data = evals_ch5) # Get regression points: regression_points &lt;- get_regression_points(score_model) regression_points # A tibble: 463 x 5 ID score bty_avg score_hat residual &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 4.7 5 4.214 0.486 2 2 4.100 5 4.214 -0.114 3 3 3.9 5 4.214 -0.314 4 4 4.8 5 4.214 0.586 5 5 4.600 3 4.08 0.52 6 6 4.3 3 4.08 0.22 7 7 2.8 3 4.08 -1.28 8 8 4.100 3.333 4.102 -0.002 9 9 3.4 3.333 4.102 -0.702 10 10 4.5 3.16700 4.091 0.40900 # … with 453 more rows A residual analysis is used to verify conditions L, N, and E and can be performed using appropriate data visualizations. While there are more sophisticated statistical approaches that can also be done, we’ll focus on the much simpler approach of looking at plots. 10.3.2 Linearity of relationship The first condition is that the relationship between the outcome variable \\(y\\) and the explanatory variable \\(x\\) must be Linear. Recall the scatterplot in Figure 10.1 where we had the explanatory variable \\(x\\) as “beauty” score and the outcome variable \\(y\\) as teaching score. Would you say that the relationship between \\(x\\) and \\(y\\) is linear? It’s hard to say because of the scatter of the points about the line. In the authors’ opinions, we feel this relationship is “linear enough.” Let’s present an example where the relationship between \\(x\\) and \\(y\\) is clearly not linear in Figure 10.3. In this case, the points clearly do not form a line, but rather a U-shaped polynomial curve. In this case, any results from an inference for regression would not be valid. FIGURE 10.3: Example of a clearly non-linear relationship. 10.3.3 Independence of residuals The second condition is that the residuals must be Independent. In other words, the different observations in our data must be independent of one another. 
For our UT Austin data, while there is data on 463 courses, these 463 courses were actually taught by 94 unique instructors. In other words, the same professor is often included more than once in our data. The original evals data frame that we used to construct the evals_ch5 data frame has a variable prof_ID, which is an anonymized identification variable for the professor: evals %&gt;% select(ID, prof_ID, score, bty_avg) # A tibble: 463 x 4 ID prof_ID score bty_avg &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 1 4.7 5 2 2 1 4.100 5 3 3 1 3.9 5 4 4 1 4.8 5 5 5 2 4.600 3 6 6 2 4.3 3 7 7 2 2.8 3 8 8 3 4.100 3.333 9 9 3 3.4 3.333 10 10 4 4.5 3.16700 # … with 453 more rows For example, the professor with prof_ID equal to 1 taught the first 4 courses in the data, the professor with prof_ID equal to 2 taught the next 3, and so on. Given that the same professor taught these first four courses, it is reasonable to expect that these four teaching scores are related to each other. If a professor gets a high score in one class, chances are fairly good they’ll get a high score in another. This dataset thus provides different information than if we had 463 unique instructors teaching the 463 courses. In this case, we say there exists dependence between observations. The first four courses taught by professor 1 are dependent, the next 3 courses taught by professor 2 are related, and so on. Any proper analysis of this data needs to take into account that we have repeated measures for the same profs. So in this case, the independence condition is not met. What does this mean for our analysis? We’ll address this in Subsection 10.3.6 coming up, after we check the remaining two conditions. 10.3.4 Normality of residuals The third condition is that the residuals should follow a Normal distribution. Furthermore, the center of this distribution should be 0. In other words, sometimes the regression model will make positive errors: \\(y - \\widehat{y} &gt; 0\\). 
Other times, the regression model will make equally negative errors: \\(y - \\widehat{y} &lt; 0\\). However, on average the errors should equal 0 and their shape should be similar to that of a bell. The simplest way to check the normality of the residuals is to look at a histogram, which we visualize in Figure 10.4. ggplot(regression_points, aes(x = residual)) + geom_histogram(binwidth = 0.25, color = &quot;white&quot;) + labs(x = &quot;Residual&quot;) FIGURE 10.4: Histogram of residuals. This histogram shows that we have more positive residuals than negative. Since the residual \\(y-\\widehat{y}\\) is positive when \\(y &gt; \\widehat{y}\\), it seems our regression model’s fitted teaching scores \\(\\widehat{y}\\) tend to underestimate the true teaching scores \\(y\\). Furthermore, this histogram has a slight left-skew in that there is a tail on the left. This is another way to say the residuals exhibit a negative skew. Is this a problem? Again, there is a certain amount of subjectivity in the response. In the authors’ opinion, while there is a slight skew to the residuals, we feel it isn’t drastic. On the other hand, others might disagree with our assessment. Let’s present examples where the residuals clearly do and don’t follow a normal distribution in Figure 10.5. In this case of the model yielding the clearly non-normal residuals on the right, any results from an inference for regression would not be valid. FIGURE 10.5: Example of clearly normal and clearly not normal residuals. 10.3.5 Equality of variance The fourth and final condition is that the residuals should exhibit Equal variance across all values of the explanatory variable \\(x\\). In other words, the value and spread of the residuals should not depend on the value of the explanatory variable \\(x\\). Recall the scatterplot in Figure 10.1: we had the explanatory variable \\(x\\) of “beauty” score on the x-axis and the outcome variable \\(y\\) of teaching score on the y-axis. 
Instead, let’s create a scatterplot that has the same values on the x-axis, but now with the residual \\(y-\\widehat{y}\\) on the y-axis as seen in Figure 10.6. ggplot(regression_points, aes(x = bty_avg, y = residual)) + geom_point() + labs(x = &quot;Beauty Score&quot;, y = &quot;Residual&quot;) + geom_hline(yintercept = 0, col = &quot;blue&quot;, size = 1) FIGURE 10.6: Plot of residuals over beauty score. You can think of Figure 10.6 as a modified version of the plot with the regression line in Figure 10.1, but with the regression line flattened out to \\(y=0\\). Looking at this plot, would you say that the spread of the residuals around the line at \\(y=0\\) is constant across all values of the explanatory variable \\(x\\) of “beauty” score? This question is rather qualitative and subjective in nature, thus different people may respond with different answers. For example, some people might say that there is slightly more variation in the residuals for smaller values of \\(x\\) than for higher ones. However, it can be argued that there isn’t a drastic non-constancy. In Figure 10.7 let’s present an example where the residuals clearly do not have equal variance across all values of the explanatory variable \\(x\\). FIGURE 10.7: Example of clearly non-equal variance. Observe how the spread of the residuals increases as the value of \\(x\\) increases. This is a situation known as heteroskedasticity. Any inference for regression based on a model yielding such a pattern in the residuals would not be valid. 10.3.6 What’s the conclusion? Let’s list our four conditions for inference for regression again and indicate whether or not they were satisfied in our analysis: Linearity of relationship between variables: Yes Independence of residuals: No Normality of residuals: Somewhat Equality of variance: Yes So what does this mean for the results of our confidence intervals and hypothesis tests in Section 10.2? First, the Independence condition. 
The fact that there exist dependencies between different rows in evals_ch5 must be addressed. In more advanced statistics courses, you’ll learn how to incorporate such dependencies into your regression models. One such technique is called hierarchical/multilevel modeling. Second, when conditions L, N, and E are not met, it often means there is a shortcoming in our model. For example, it may be the case that using only a single explanatory variable is insufficient, as we did with “beauty” score. We may need to incorporate more explanatory variables in a multiple regression model as we did in Chapter 6. In our case, the best we can do is view the results suggested by our confidence intervals and hypothesis tests as preliminary. While a preliminary analysis suggests that there is a significant relationship between teaching and “beauty” scores, further investigation is warranted; in particular, by improving the preliminary score ~ bty_avg model so that the four conditions are met. When the four conditions are roughly met, then we can put more faith into our confidence intervals and \\(p\\)-values. The conditions for inference in regression are a key part of regression analysis and are of vital importance to the processes of constructing confidence intervals and conducting hypothesis tests. However, it is often the case with regression analysis in the real world that not all the conditions are completely met. Furthermore, as you saw, there is a level of subjectivity in the residual analyses to verify the L, N, and E conditions. So what can you do? We as authors advocate for transparency in communicating all results. This lets the stakeholders of any analysis know about a model’s shortcomings or whether the model is “good enough.” So while this checking of assumptions has led to some fuzzy “it depends” results, we decided as authors to show you these scenarios to help prepare you for difficult statistical decisions you may need to make down the road. 
Learning check (LC10.1) Continue with our regression using age as the explanatory variable and teaching score as the outcome variable. Use the get_regression_points() function to get the observed values, fitted values, and residuals for all 463 courses. Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern, but comment on what you find here. 10.4 Simulation-based inference for regression Recall that in Subsection 10.2.5, when we interpreted the third through seventh columns of a regression table, we stated that R doesn’t do simulations to compute these values. Rather, R uses theory-based methods that involve mathematical formulas. In this section, we’ll use the simulation-based methods you previously learned in Chapters 8 and 9 to recreate the values in the regression table in Table 10.1. In particular, we’ll use the infer package workflow to Construct a 95% confidence interval for the population slope \\(\\beta_1\\) using bootstrap resampling with replacement. We did this previously in Sections 8.4 with the pennies data and 8.6 with the mythbusters_yawn data. Conduct a hypothesis test of \\(H_0: \\beta_1 = 0\\) versus \\(H_A: \\beta_1 \\neq 0\\) using a permutation test. We did this previously in Sections 9.3 with the promotions data and 9.5 with the movies_sample IMDb data. 10.4.1 Confidence interval for slope We’ll construct a 95% confidence interval for \\(\\beta_1\\) using the infer workflow outlined in Subsection 8.4.2. Specifically, we’ll first construct the bootstrap distribution for the fitted slope \\(b_1\\) using our single sample of 463 courses: specify() the variables of interest in evals_ch5 with the formula: score ~ bty_avg. generate() replicates by using bootstrap resampling with replacement from the original sample of 463 courses. We generate reps = 1000 replicates using type = &quot;bootstrap&quot;. calculate() the summary statistic of interest: the fitted slope \\(b_1\\). 
Using this bootstrap distribution, we’ll construct the 95% confidence interval using the percentile method and (if appropriate) the standard error method as well. It is important to note in this case that the bootstrapping with replacement is done row-by-row. Thus, the original pairs of score and bty_avg values are always kept together, but different pairs of score and bty_avg values may be resampled multiple times. The resulting confidence interval will denote a range of plausible values for the unknown population slope \\(\\beta_1\\) quantifying the relationship between teaching and “beauty” scores for all professors at UT Austin. Let’s first construct the bootstrap distribution for the fitted slope \\(b_1\\): bootstrap_distn_slope &lt;- evals_ch5 %&gt;% specify(formula = score ~ bty_avg) %&gt;% generate(reps = 1000, type = &quot;bootstrap&quot;) %&gt;% calculate(stat = &quot;slope&quot;) bootstrap_distn_slope # A tibble: 1,000 x 2 replicate stat &lt;int&gt; &lt;dbl&gt; 1 1 0.0651055 2 2 0.0382313 3 3 0.108056 4 4 0.0666601 5 5 0.0715932 6 6 0.0854565 7 7 0.0624868 8 8 0.0412859 9 9 0.0796269 10 10 0.0761299 # … with 990 more rows Observe how we have 1000 values of the bootstrapped slope \\(b_1\\) in the stat column. Let’s visualize the 1000 bootstrapped values in Figure 10.8. visualize(bootstrap_distn_slope) FIGURE 10.8: Bootstrap distribution of slope. Observe how the bootstrap distribution is roughly bell-shaped. Recall from Subsection 8.7.1 that the shape of the bootstrap distribution of \\(b_1\\) closely approximates the shape of the sampling distribution of \\(b_1\\). Percentile-method First, let’s compute the 95% confidence interval for \\(\\beta_1\\) using the percentile method. We’ll do so by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped. 
percentile_ci &lt;- bootstrap_distn_slope %&gt;% get_confidence_interval(type = &quot;percentile&quot;, level = 0.95) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.0323411 0.0990027 The resulting percentile-based 95% confidence interval for \\(\\beta_1\\) of (0.032, 0.099) is similar to the confidence interval in the regression Table 10.1 of (0.035, 0.099). Standard error method Since the bootstrap distribution in Figure 10.8 appears to be roughly bell-shaped, we can also construct a 95% confidence interval for \\(\\beta_1\\) using the standard error method. In order to do this, we need to first compute the fitted slope \\(b_1\\), which will act as the center of our standard error-based confidence interval. While we saw in the regression table in Table 10.1 that this was \\(b_1\\) = 0.067, we can also use the infer pipeline with the generate() step removed to calculate it: observed_slope &lt;- evals %&gt;% specify(score ~ bty_avg) %&gt;% calculate(stat = &quot;slope&quot;) observed_slope # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.0666370 We then use the get_ci() function with level = 0.95 to compute the 95% confidence interval for \\(\\beta_1\\). Note that setting the point_estimate argument to the observed_slope of 0.067 sets the center of the confidence interval. se_ci &lt;- bootstrap_distn_slope %&gt;% get_ci(level = 0.95, type = &quot;se&quot;, point_estimate = observed_slope) se_ci # A tibble: 1 x 2 lower upper &lt;dbl&gt; &lt;dbl&gt; 1 0.0333767 0.0998974 The resulting standard error-based 95% confidence interval for \\(\\beta_1\\) of \\((0.033, 0.1)\\) is slightly different than the confidence interval in the regression Table 10.1 of \\((0.035, 0.099)\\). 
Comparing all three Let’s compare all three confidence intervals in Figure 10.9, where the percentile-based confidence interval is marked with solid lines, the standard error-based confidence interval is marked with dashed lines, and the theory-based confidence interval (0.035, 0.099) from the regression table in Table 10.1 is marked with dotted lines. visualize(bootstrap_distn_slope) + shade_confidence_interval(endpoints = percentile_ci, fill = NULL, linetype = &quot;solid&quot;, color = &quot;grey90&quot;) + shade_confidence_interval(endpoints = se_ci, fill = NULL, linetype = &quot;dashed&quot;, color = &quot;grey60&quot;) + shade_confidence_interval(endpoints = c(0.035, 0.099), fill = NULL, linetype = &quot;dotted&quot;, color = &quot;black&quot;) FIGURE 10.9: Comparing three confidence intervals for the slope. Observe that all three are quite similar! Furthermore, none of the three confidence intervals for \\(\\beta_1\\) contains 0; all three lie entirely above 0. This suggests that there is in fact a meaningful positive relationship between teaching and “beauty” scores. 10.4.2 Hypothesis test for slope Let’s now conduct a hypothesis test of \\(H_0: \\beta_1 = 0\\) vs. \\(H_A: \\beta_1 \\neq 0\\). We will use the infer package, which follows the hypothesis testing paradigm in the “There is only one test” diagram in Figure 9.14. Let’s first think about what it means for \\(\\beta_1\\) to be zero as assumed in the null hypothesis \\(H_0\\). Recall we said if \\(\\beta_1 = 0\\), then this is saying there is no relationship between the teaching and “beauty” scores. Thus assuming this particular null hypothesis \\(H_0\\) means that in our “hypothesized universe” there is no relationship between score and bty_avg. We can therefore shuffle/permute the bty_avg variable to no consequence. We construct the null distribution of the fitted slope \\(b_1\\) by performing the steps that follow. 
Recall from Section 9.2 on terminology, notation, and definitions related to hypothesis testing where we defined the null distribution: the sampling distribution of our test statistic \\(b_1\\) assuming the null hypothesis \\(H_0\\) is true. specify() the variables of interest in evals_ch5 with the formula: score ~ bty_avg. hypothesize() the null hypothesis of independence. Recall from Section 9.3 that this is an additional step that needs to be added for hypothesis testing. generate() replicates by permuting/shuffling values from the original sample of 463 courses. We generate reps = 1000 replicates using type = &quot;permute&quot; here. calculate() the test statistic of interest: the fitted slope \\(b_1\\). In this case, we permute the values of bty_avg across the values of score 1000 times. We can do this shuffling/permuting since we assumed a “hypothesized universe” of no relationship between these two variables. Then we calculate the &quot;slope&quot; coefficient for each of these 1000 generated samples. null_distn_slope &lt;- evals %&gt;% specify(score ~ bty_avg) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 1000, type = &quot;permute&quot;) %&gt;% calculate(stat = &quot;slope&quot;) Observe the resulting null distribution for the fitted slope \\(b_1\\) in Figure 10.10. FIGURE 10.10: Null distribution of slopes. Notice how it is centered at \\(b_1\\) = 0. This is because in our hypothesized universe, there is no relationship between score and bty_avg and so \\(\\beta_1 = 0\\). Thus, the most typical fitted slope \\(b_1\\) we observe across our simulations is 0. Observe, furthermore, how there is variation around this central value of 0. Let’s visualize the \\(p\\)-value in the null distribution by comparing it to the observed test statistic of \\(b_1\\) = 0.067 in Figure 10.11. We’ll do this by adding a shade_p_value() layer to the previous visualize() code. FIGURE 10.11: Null distribution and \\(p\\)-value. 
Since the observed fitted slope 0.067 falls far to the right of this null distribution and thus the shaded region doesn’t overlap it, we’ll have a \\(p\\)-value of 0. For completeness, however, let’s compute the numerical value of the \\(p\\)-value anyways using the get_p_value() function. Recall that it takes the same inputs as the shade_p_value() function: null_distn_slope %&gt;% get_p_value(obs_stat = observed_slope, direction = &quot;both&quot;) # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0 This matches the \\(p\\)-value of 0 in the regression table in Table 10.1. We therefore reject the null hypothesis \\(H_0: \\beta_1 = 0\\) in favor of the alternative hypothesis \\(H_A: \\beta_1 \\neq 0\\). We thus have evidence that suggests there is a significant relationship between teaching and “beauty” scores for all instructors at UT Austin. When the conditions for inference for regression are met and the null distribution has a bell shape, we are likely to see similar results between the simulation-based results we just demonstrated and the theory-based results shown in the regression table in Table 10.1. Learning check (LC10.2) Repeat the inference but this time for the correlation coefficient instead of the slope. Note the implementation of stat = &quot;correlation&quot; in the calculate() function of the infer package. 10.5 Conclusion 10.5.1 Theory-based inference for regression Recall in Subsection 10.2.5 when we interpreted the regression table in Table 10.1, we mentioned that R does not compute its values using simulation-based methods for constructing confidence intervals and conducting hypothesis tests as we did in Chapters 8 and 9 using the infer package. Rather, R uses a theory-based approach using mathematical formulas, much like the theory-based confidence intervals you saw in Subsection 8.7.2 and the theory-based hypothesis tests you saw in Subsection 9.6.1. 
These formulas were derived in a time when computers didn’t exist, so it would’ve been incredibly labor intensive to run extensive simulations. In particular, there is a formula for the standard error of the fitted slope \\(b_1\\): \\[\\text{SE}_{b_1} = \\dfrac{\\dfrac{s_y}{s_x} \\cdot \\sqrt{1-r^2}}{\\sqrt{n-2}}\\] As with many formulas in statistics, there’s a lot going on here, so let’s first break down what each symbol represents. First \\(s_x\\) and \\(s_y\\) are the sample standard deviations of the explanatory variable bty_avg and the response variable score, respectively. Second, \\(r\\) is the sample correlation coefficient between score and bty_avg. This was computed as 0.187 in Chapter 5. Lastly, \\(n\\) is the number of pairs of points in the evals_ch5 data frame, here 463. To put this formula into words, the standard error of \\(b_1\\) depends on the relationship between the variability of the response variable and the variability of the explanatory variable as measured in the \\(s_y / s_x\\) term. Next, it looks into how the two variables relate to each other in the \\(\\sqrt{1-r^2}\\) term. However, the most important observation to make in the previous formula is that there is an \\(n - 2\\) in the denominator. In other words, as the sample size \\(n\\) increases, the standard error \\(\\text{SE}_{b_1}\\) decreases. Just as we demonstrated in Subsection 7.3.3 when we used shovels with \\(n\\) = 25, 50, and 100 slots, the amount of sampling variation of the fitted slope \\(b_1\\) will depend on the sample size \\(n\\). In particular, as the sample size increases, both the sampling and bootstrap distributions narrow and the standard error \\(\\text{SE}_{b_1}\\) decreases. Hence, our estimates of \\(b_1\\) for the true population slope \\(\\beta_1\\) get more and more precise. R then uses this formula for the standard error of \\(b_1\\) in the third column of the regression table and subsequently to construct 95% confidence intervals. 
But what about the hypothesis test? Much like with our theory-based hypothesis test in Subsection 9.6.1, R uses the following \\(t\\)-statistic as the test statistic for hypothesis testing: \\[ t = \\dfrac{ b_1 - \\beta_1}{ \\text{SE}_{b_1}} \\] And since the null hypothesis \\(H_0: \\beta_1 = 0\\) is assumed during the hypothesis test, the \\(t\\)-statistic becomes \\[ t = \\dfrac{ b_1 - 0}{ \\text{SE}_{b_1}} = \\dfrac{ b_1 }{ \\text{SE}_{b_1}} \\] What are the values of \\(b_1\\) and \\(\\text{SE}_{b_1}\\)? They are in the estimate and std_error columns of the regression table in Table 10.1. Thus the value of 4.09 in the table corresponds to 0.067/0.016 = 4.188. The two values differ because the estimate and standard error displayed in the table are themselves rounded. Lastly, to compute the \\(p\\)-value, we need to compare the observed test statistic of 4.09 to the appropriate null distribution. Recall from Section 9.2 that a null distribution is the sampling distribution of the test statistic assuming the null hypothesis \\(H_0\\) is true. Much like in our theory-based hypothesis test in Subsection 9.6.1, it can be mathematically proven that this distribution is a \\(t\\)-distribution with degrees of freedom equal to \\(df = n - 2 = 463 - 2 = 461\\). Don’t worry if you’re feeling a little overwhelmed at this point. There is a lot of background theory to understand before you can fully make sense of the equations for theory-based methods. That being said, theory-based methods and simulation-based methods for constructing confidence intervals and conducting hypothesis tests often yield consistent results. As mentioned before, in our opinion, two large benefits of simulation-based methods over theory-based are that (1) they are easier for people new to statistical inference to understand, and (2) they also work in situations where theory-based methods and mathematical formulas don’t exist. 
10.5.2 Summary of statistical inference We’ve finished the last two scenarios from the “Scenarios of sampling for inference” table in Subsection 7.5.1, which we re-display in Table 10.4. TABLE 10.4: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Symbol(s) 1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\overline{x}\\) or \\(\\widehat{\\mu}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Fitted regression slope \\(b_1\\) or \\(\\widehat{\\beta}_1\\) Armed with the regression modeling techniques you learned in Chapters 5 and 6, your understanding of sampling for inference in Chapter 7, and the tools for statistical inference like confidence intervals and hypothesis tests in Chapters 8 and 9, you’re now equipped to study the significance of relationships between variables in a wide array of data! Many of the ideas presented here can be extended into multiple regression and other more advanced modeling techniques. 10.5.3 Additional resources An R script file of all R code used in this chapter is available here. 10.5.4 What’s to come You’ve now concluded the last major part of the book on “Statistical Inference with infer.” The closing Chapter 11 concludes this book with various short case studies involving real data, such as house prices in the city of Seattle, Washington in the US. You’ll see how the principles in this book can help you become a great storyteller with data! "],
+["11-thinking-with-data.html", "Chapter 11 Tell Your Story with Data 11.1 Review 11.2 Case study: Seattle house prices 11.3 Case study: Effective data storytelling Concluding remarks", " Chapter 11 Tell Your Story with Data Recall in the Preface and at the end of chapters throughout this book, we displayed the “ModernDive flowchart” mapping your journey through this book. FIGURE 11.1: ModernDive flowchart. 11.1 Review Let’s go over a refresher of what you’ve covered so far. You first got started with data in Chapter 1 where you learned about the difference between R and RStudio, started coding in R, installed and loaded your first R packages, and explored your first dataset: all domestic departure flights from a major New York City airport in 2013. Then you covered the following three parts of this book (Parts 2 and 4 are combined into a single portion): Data science with tidyverse. You assembled your data science toolbox using tidyverse packages. In particular, you Ch.2: Visualized data using the ggplot2 package. Ch.3: Wrangled data using the dplyr package. Ch.4: Learned about the concept of “tidy” data as a standardized data frame input and output format for all packages in the tidyverse. Furthermore, you learned how to import spreadsheet files into R using the readr package. Data modeling with moderndive. Using these data science tools and helper functions from the moderndive package, you fit your first data models. In particular, you Ch.5: Discovered basic regression models with only one explanatory variable. Ch.6: Examined multiple regression models with more than one explanatory variable. Statistical inference with infer. Once again using your newly acquired data science tools, you unpacked statistical inference using the infer package. In particular, you Ch.7: Learned about the role that sampling variability plays in statistical inference and the role that sample size plays in this sampling variability. 
Ch.8: Constructed confidence intervals using bootstrapping. Ch.9: Conducted hypothesis tests using permutation. Data modeling with moderndive (revisited): Armed with your understanding of statistical inference, you revisited and reviewed the models you constructed in Ch.5 and Ch.6. In particular, you Ch.10: Interpreted confidence intervals and hypothesis tests in a regression setting. We’ve guided you through your first experiences of “thinking with data,” an expression originally coined by Dr. Diane Lambert. The philosophy underlying this expression guided your path in the flowchart in Figure 11.1. This philosophy is also well-summarized in “Practical Data Science for Stats”: a collection of pre-prints focusing on the practical side of data science workflows and statistical analysis curated by Dr. Jennifer Bryan and Dr. Hadley Wickham. They quote: There are many aspects of day-to-day analytical work that are almost absent from the conventional statistics literature and curriculum. And yet these activities account for a considerable share of the time and effort of data analysts and applied statisticians. The goal of this collection is to increase the visibility and adoption of modern data analytical workflows. We aim to facilitate the transfer of tools and frameworks between industry and academia, between software engineering and statistics and computer science, and across different domains. In other words, to be equipped to “think with data” in the 21st century, analysts need practice going through the “data/science pipeline” we saw in the Preface (re-displayed in Figure 11.2). It is our opinion that, for too long, statistics education has only focused on parts of this pipeline, instead of going through it in its entirety. FIGURE 11.2: Data/science pipeline. To conclude this book, we’ll present you with some additional case studies of working with data. 
In Section 11.2 we’ll take you through a full-pass of the “Data/Science Pipeline” in order to analyze the sale price of houses in Seattle, WA, USA. In Section 11.3, we’ll present you with some examples of effective data storytelling drawn from the data journalism website, FiveThirtyEight.com. We present these case studies to you because we believe that you should not only be able to “think with data,” but also be able to “tell your story with data.” Let’s explore how to do this! Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 1.3 for information on how to install and load R packages. library(tidyverse) library(moderndive) library(skimr) library(fivethirtyeight) 11.2 Case study: Seattle house prices Kaggle.com is a machine learning and predictive modeling competition website that hosts datasets uploaded by companies, governmental organizations, and other individuals. One of their datasets is the “House Sales in King County, USA”. It consists of sale prices of homes sold between May 2014 and May 2015 in King County, Washington, USA, which includes the greater Seattle metropolitan area. This dataset is in the house_prices data frame included in the moderndive package. The dataset consists of 21,613 houses and 21 variables describing these houses (for a full list and description of these variables, see the help file by running ?house_prices in the console). In this case study, we’ll create a multiple regression model where: The outcome variable \\(y\\) is the sale price of houses. Two explanatory variables: A numerical explanatory variable \\(x_1\\): house size sqft_living as measured in square feet of living space. Note that 1 square foot is about 0.09 square meters. 
A categorical explanatory variable \\(x_2\\): house condition, a categorical variable with five levels where 1 indicates “poor” and 5 indicates “excellent.” 11.2.1 Exploratory data analysis: Part I As we’ve said numerous times throughout this book, a crucial first step when presented with data is to perform an exploratory data analysis (EDA). Exploratory data analysis can give you a sense of your data, help identify issues with your data, bring to light any outliers, and help inform model construction. Recall the three common steps in an exploratory data analysis we introduced in Subsection 5.1.1: Looking at the raw data values. Computing summary statistics. Creating data visualizations. First, let’s look at the raw data using View() to bring up RStudio’s spreadsheet viewer and the glimpse() function from the dplyr package: View(house_prices) glimpse(house_prices) Observations: 21,613 Variables: 21 $ id &lt;chr&gt; &quot;7129300520&quot;, &quot;6414100192&quot;, &quot;5631500400&quot;, &quot;2487200875&quot;,… $ date &lt;date&gt; 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-0… $ price &lt;dbl&gt; 221900, 538000, 180000, 604000, 510000, 1225000, 257500… $ bedrooms &lt;int&gt; 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2… $ bathrooms &lt;dbl&gt; 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2… $ sqft_living &lt;int&gt; 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18… $ sqft_lot &lt;int&gt; 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470… $ floors &lt;dbl&gt; 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, … $ waterfront &lt;lgl&gt; FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… $ view &lt;int&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0… $ condition &lt;fct&gt; 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4… $ grade &lt;fct&gt; 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, … $ sqft_above &lt;int&gt; 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18… $ sqft_basement 
&lt;int&gt; 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,… $ yr_built &lt;int&gt; 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2… $ yr_renovated &lt;int&gt; 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ zipcode &lt;fct&gt; 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,… $ lat &lt;dbl&gt; 47.5, 47.7, 47.7, 47.5, 47.6, 47.7, 47.3, 47.4, 47.5, 4… $ long &lt;dbl&gt; -122, -122, -122, -122, -122, -122, -122, -122, -122, -… $ sqft_living15 &lt;int&gt; 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2… $ sqft_lot15 &lt;int&gt; 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,… Here are some questions you can ask yourself at this stage of an EDA: Which variables are numerical? Which are categorical? For the categorical variables, what are their levels? Besides the variables we’ll be using in our regression model, what other variables do you think would be useful to use in a model for house price? Observe, for example, that while the condition variable has values 1 through 5, these are saved in R as fct standing for “factors.” This is one of R’s ways of saving categorical variables. So you should think of these as the “labels” 1 through 5 and not the numerical values 1 through 5. Let’s now perform the second step in an EDA: computing summary statistics. Recall from Section 3.3 that summary statistics are single numerical values that summarize a large number of values. Examples of summary statistics include the mean, the median, the standard deviation, and various percentiles. We could do this using the summarize() function in the dplyr package along with R’s built-in summary functions, like mean() and median(). 
However, recall that in Section 3.5 we saw the following code that computes a variety of summary statistics of the variable gain, which is the amount of time that a flight makes up mid-air: gain_summary &lt;- flights %&gt;% summarize( min = min(gain, na.rm = TRUE), q1 = quantile(gain, 0.25, na.rm = TRUE), median = quantile(gain, 0.5, na.rm = TRUE), q3 = quantile(gain, 0.75, na.rm = TRUE), max = max(gain, na.rm = TRUE), mean = mean(gain, na.rm = TRUE), sd = sd(gain, na.rm = TRUE), missing = sum(is.na(gain)) ) To repeat this for all three of the price, sqft_living, and condition variables would be tedious to code up. So instead, let’s use the convenient skim() function from the skimr package we first used in Subsection 6.1.1, being sure to only select() the variables of interest for our model: house_prices %&gt;% select(price, sqft_living, condition) %&gt;% skim() Skim summary statistics n obs: 21613 n variables: 3 ── Variable type:factor variable missing complete n n_unique top_counts ordered condition 0 21613 21613 5 3: 14031, 4: 5679, 5: 1701, 2: 172 FALSE ── Variable type:integer variable missing complete n mean sd p0 p25 p50 p75 p100 sqft_living 0 21613 21613 2079.9 918.44 290 1427 1910 2550 13540 ── Variable type:numeric variable missing complete n mean sd p0 p25 p50 p75 p100 price 0 21613 21613 540088.14 367127.2 75000 321950 450000 645000 7700000 Observe that the mean price of $540,088 is larger than the median of $450,000. This is because a small number of very expensive houses are inflating the average. In other words, there are “outlier” house prices in our dataset. (This fact will become even more apparent when we create our visualizations next.) However, the median is not as sensitive to such outlier house prices. This is why news reports about the real estate market generally cite median house prices and not mean/average house prices. We say here that the median is more robust to outliers than the mean. 
Similarly, while the standard deviation and the interquartile range (IQR) are both measures of spread and variability, the IQR is more robust to outliers. Let’s now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s first create univariate visualizations. These are plots focusing on a single variable at a time. Since price and sqft_living are numerical variables, we can visualize their distributions using a geom_histogram() as seen in Section 2.5 on histograms. On the other hand, since condition is categorical, we can visualize its distribution using a geom_bar(). Recall from Section 2.8 on barplots that since condition is not “pre-counted”, we use a geom_bar() and not a geom_col(). # Histogram of house price: ggplot(house_prices, aes(x = price)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;price (USD)&quot;, title = &quot;House price&quot;) # Histogram of sqft_living: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;living space (square feet)&quot;, title = &quot;House size&quot;) # Barplot of condition: ggplot(house_prices, aes(x = condition)) + geom_bar() + labs(x = &quot;condition&quot;, title = &quot;House condition&quot;) In Figure 11.3, we display all three of these visualizations at once. FIGURE 11.3: Exploratory visualizations of Seattle house prices data. First, observe in the bottom plot that most houses are of condition “3”, with a few more of conditions “4” and “5”, and almost none that are “1” or “2”. Next, observe in the histogram for price in the top-left plot that a majority of houses cost less than two million dollars. Observe also that the x-axis stretches out to 8 million dollars, even though there do not appear to be any houses close to that price. This is because there are a very small number of houses with prices closer to 8 million. These are the outlier house prices we mentioned earlier. 
We say that the variable price is right-skewed as exhibited by the long right tail. Further, observe in the histogram of sqft_living in the middle plot that most houses appear to have less than 5000 square feet of living space. For comparison, a football field in the US is about 57,600 square feet, whereas a standard soccer/association football field is about 64,000 square feet. Observe that this variable is also right-skewed, although not as drastically as the price variable. For both the price and sqft_living variables, the right-skew makes distinguishing houses at the lower end of the x-axis hard. This is because the scale of the x-axis is compressed by the small number of quite expensive and immensely-sized houses. So what can we do about this skew? Let’s apply a log10 transformation to these variables. If you are unfamiliar with such transformations, we highly recommend you read Appendix A.3 on logarithmic (log) transformations. In summary, log transformations allow us to alter the scale of a variable to focus on multiplicative changes instead of additive changes. In other words, they shift the view to be on relative changes instead of absolute changes. Such multiplicative/relative changes are also called changes in orders of magnitude. Let’s create new log10 transformed versions of the right-skewed variables price and sqft_living using the mutate() function from Section 3.5, but we’ll give the latter the name log10_size, which is shorter and easier to understand than the name log10_sqft_living. 
house_prices &lt;- house_prices %&gt;% mutate( log10_price = log10(price), log10_size = log10(sqft_living) ) Let’s display the before and after effects of this transformation on these variables for only the first 10 rows of house_prices: house_prices %&gt;% select(price, log10_price, sqft_living, log10_size) # A tibble: 21,613 x 4 price log10_price sqft_living log10_size &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; 1 221900 5.34616 1180 3.07188 2 538000 5.73078 2570 3.40993 3 180000 5.25527 770 2.88649 4 604000 5.78104 1960 3.29226 5 510000 5.70757 1680 3.22531 6 1225000 6.08814 5420 3.73400 7 257500 5.41078 1715 3.23426 8 291850 5.46516 1060 3.02531 9 229500 5.36078 1780 3.25042 10 323000 5.50920 1890 3.27646 # … with 21,603 more rows Observe in particular the houses in the sixth and third rows. The house in the sixth row has price $1,225,000, which is just above one million dollars. Since \\(10^6\\) is one million, its log10_price is around 6.09. Contrast this with the other nine houses shown, which all have log10_price less than six, since they all have price less than $1,000,000. The house in the third row is the only one of these 10 houses with sqft_living less than 1000. Since \\(1000 = 10^3\\), it’s the lone house among them with log10_size less than 3. Let’s now visualize the before and after effects of this transformation for price in Figure 11.4. # Before log10 transformation: ggplot(house_prices, aes(x = price)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;price (USD)&quot;, title = &quot;House price: Before&quot;) # After log10 transformation: ggplot(house_prices, aes(x = log10_price)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;log10 price (USD)&quot;, title = &quot;House price: After&quot;) FIGURE 11.4: House price before and after log10 transformation. Observe that after the transformation, the distribution is much less skewed, and in this case, more symmetric and more bell-shaped. Now you can more easily distinguish the lower priced houses. 
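To convince yourself that the log10 transformation can be undone, here is a minimal sketch using only base R and the prices from rows 3 and 6 of the preview above. The vector names prices and log10_prices are ours, chosen for illustration; they are not part of the house_prices data frame.

```r
# Prices (USD) copied from rows 3 and 6 of the preview above
prices <- c(180000, 1225000)

# Apply the log10 transformation
log10_prices <- log10(prices)
round(log10_prices, 2)

# Undo the transformation by raising 10 to the logged values
10^log10_prices
```

The rounded logged values should agree with the log10_price column in the preview (about 5.26 and 6.09), and raising 10 to those values should return the original prices up to floating-point error.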
Let’s do the same for house size, where the variable sqft_living was log10 transformed to log10_size. # Before log10 transformation: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;living space (square feet)&quot;, title = &quot;House size: Before&quot;) # After log10 transformation: ggplot(house_prices, aes(x = log10_size)) + geom_histogram(color = &quot;white&quot;) + labs(x = &quot;log10 living space (square feet)&quot;, title = &quot;House size: After&quot;) FIGURE 11.5: House size before and after log10 transformation. Observe in Figure 11.5 that the log10 transformation has a similar effect of unskewing the variable. We emphasize that while in these two cases the resulting distributions are more symmetric and bell-shaped, this is not always necessarily the case. Given the now symmetric nature of log10_price and log10_size, we are going to revise our multiple regression model to use our new variables: The outcome variable \\(y\\) is the sale log10_price of houses. Two explanatory variables: A numerical explanatory variable \\(x_1\\): house size log10_size as measured in log base 10 square feet of living space. A categorical explanatory variable \\(x_2\\): house condition, a categorical variable with five levels where 1 indicates “poor” and 5 indicates “excellent.” 11.2.2 Exploratory data analysis: Part II Let’s now continue our EDA by creating multivariate visualizations. Unlike the univariate histograms and barplot in the earlier Figures 11.3, 11.4, and 11.5, multivariate visualizations show relationships between more than one variable. This is an important step of an EDA to perform since the goal of modeling is to explore relationships between variables. Since our model involves a numerical outcome variable, a numerical explanatory variable, and a categorical explanatory variable, we are in a similar regression modeling situation as in Section 6.1 where we studied the UT Austin teaching scores dataset. 
Recall in that case the numerical outcome variable was teaching score, the numerical explanatory variable was instructor age, and the categorical explanatory variable was (binary) gender. We thus have two choices of models we can fit: either (1) an interaction model where the regression line for each condition level will have both a different slope and a different intercept or (2) a parallel slopes model where the regression line for each condition level will have the same slope but different intercepts. Recall from Subsection 6.1.3 that the geom_parallel_slopes() function is a special purpose function that Evgeni Chasnovski created and included in the moderndive package, since the geom_smooth() method in the ggplot2 package does not have a convenient way to plot parallel slopes models. We plot both resulting models in Figure 11.6, with the interaction model on the left. # Plot interaction model ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.05) + geom_smooth(method = &quot;lm&quot;, se = FALSE) + labs(y = &quot;log10 price&quot;, x = &quot;log10 size&quot;, title = &quot;House prices in Seattle&quot;) # Plot parallel slopes model ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.05) + geom_parallel_slopes(se = FALSE) + labs(y = &quot;log10 price&quot;, x = &quot;log10 size&quot;, title = &quot;House prices in Seattle&quot;) FIGURE 11.6: Interaction and parallel slopes models. In both cases, we see there is a positive relationship between house price and size, meaning as houses are larger, they tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes as evidenced by the fact that the line for condition 5 is highest, followed by conditions 4 and 3. As for conditions 1 and 2, this pattern isn’t as clear. 
Recall from the univariate barplot of condition in Figure 11.3, there are only a few houses of condition 1 or 2. Let’s also show a faceted version of just the interaction model in Figure 11.7. It is now much more apparent just how few houses are of condition 1 or 2. ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.4) + geom_smooth(method = &quot;lm&quot;, se = FALSE) + labs(y = &quot;log10 price&quot;, x = &quot;log10 size&quot;, title = &quot;House prices in Seattle&quot;) + facet_wrap(~ condition) FIGURE 11.7: Faceted plot of interaction model. Which exploratory visualization of the interaction model is better, the one in the left-hand plot of Figure 11.6 or the faceted version in Figure 11.7? There is no universal right answer. You need to make a choice depending on what you want to convey, and own that choice. Including and discussing both is also an option if needed. 11.2.3 Regression modeling Which of the two models in Figure 11.6 is “better”? The interaction model in the left-hand plot or the parallel slopes model in the right-hand plot? We had a similar discussion in Subsection 6.3.1 on model selection. Recall that we stated that we should only favor more complex models if the additional complexity is warranted. In this case, the more complex model is the interaction model since it considers five intercepts and five slopes total. This is in contrast to the parallel slopes model which considers five intercepts but only one common slope. Is the additional complexity of the interaction model warranted? Looking at the left-hand plot in Figure 11.6, we’re of the opinion that it is, as evidenced by the slight x-like pattern to some of the lines. Therefore, we’ll focus the rest of this analysis only on the interaction model. This visual approach is somewhat subjective, however, so feel free to disagree! What are the five different slopes and five different intercepts for the interaction model? 
We can obtain these values from the regression table. Recall our two-step process for getting the regression table: # Fit regression model: price_interaction &lt;- lm(log10_price ~ log10_size * condition, data = house_prices) # Get regression table: get_regression_table(price_interaction) TABLE 11.1: Regression table for interaction model term estimate std_error statistic p_value lower_ci upper_ci intercept 3.330 0.451 7.380 0.000 2.446 4.215 log10_size 0.690 0.148 4.652 0.000 0.399 0.980 condition2 0.047 0.498 0.094 0.925 -0.930 1.024 condition3 -0.367 0.452 -0.812 0.417 -1.253 0.519 condition4 -0.398 0.453 -0.879 0.380 -1.286 0.490 condition5 -0.883 0.457 -1.931 0.053 -1.779 0.013 log10_size:condition2 -0.024 0.163 -0.148 0.882 -0.344 0.295 log10_size:condition3 0.133 0.148 0.893 0.372 -0.158 0.424 log10_size:condition4 0.146 0.149 0.979 0.328 -0.146 0.437 log10_size:condition5 0.310 0.150 2.067 0.039 0.016 0.604 Recall we saw in Subsection 6.1.2 how to interpret a regression table when there are both numerical and categorical explanatory variables. Let’s now do the same for all 10 values in the estimate column of Table 11.1. In this case, the “baseline for comparison” group for the categorical variable condition are the condition 1 houses, since “1” comes first alphanumerically. Thus, the intercept and log10_size values are the intercept and slope for log10_size for this baseline group. Next, the condition2 through condition5 terms are the offsets in intercepts relative to the condition 1 intercept. Finally, the log10_size:condition2 through log10_size:condition5 are the offsets in slopes for log10_size relative to the condition 1 slope for log10_size. Let’s simplify this by writing out the equation of each of the five regression lines using these 10 estimate values. 
We’ll write out each line in the following format: \\[ \\widehat{\\log10(\\text{price})} = \\hat{\\beta}_0 + \\hat{\\beta}_{\\text{size}} \\cdot \\log10(\\text{size}) \\] Condition 1: \\[\\widehat{\\log10(\\text{price})} = 3.33 + 0.69 \\cdot \\log10(\\text{size})\\] Condition 2: \\[ \\begin{aligned} \\widehat{\\log10(\\text{price})} &amp;= (3.33 + 0.047) + (0.69 - 0.024) \\cdot \\log10(\\text{size}) \\\\ &amp;= 3.377 + 0.666 \\cdot \\log10(\\text{size}) \\end{aligned} \\] Condition 3: \\[ \\begin{aligned} \\widehat{\\log10(\\text{price})} &amp;= (3.33 - 0.367) + (0.69 + 0.133) \\cdot \\log10(\\text{size}) \\\\ &amp;= 2.963 + 0.823 \\cdot \\log10(\\text{size}) \\end{aligned} \\] Condition 4: \\[ \\begin{aligned} \\widehat{\\log10(\\text{price})} &amp;= (3.33 - 0.398) + (0.69 + 0.146) \\cdot \\log10(\\text{size}) \\\\ &amp;= 2.932 + 0.836 \\cdot \\log10(\\text{size}) \\end{aligned} \\] Condition 5: \\[ \\begin{aligned} \\widehat{\\log10(\\text{price})} &amp;= (3.33 - 0.883) + (0.69 + 0.31) \\cdot \\log10(\\text{size}) \\\\ &amp;= 2.447 + 1 \\cdot \\log10(\\text{size}) \\end{aligned} \\] These correspond to the regression lines in the left-hand plot of Figure 11.6 and the faceted plot in Figure 11.7. For homes of all five condition types, as the size of the house increases, the price increases. This is what most would expect. However, the rate of increase of price with size is fastest for the homes with conditions 3, 4, and 5 of 0.823, 0.836, and 1, respectively. These are the three largest slopes out of the five. 11.2.4 Making predictions Say you’re a realtor and someone calls you asking you how much their home will sell for. They tell you that it’s in condition = 5 and is sized 1900 square feet. What do you tell them? Let’s use the interaction model we fit to make predictions! We first make this prediction visually in Figure 11.8. The predicted log10_price of this house is marked with a black dot. 
This is where the following two lines intersect: The regression line for the condition = 5 homes and The vertical dashed black line at log10_size equals 3.28, since our predictor variable is the log10 transformed square feet of living space of \\(\\log10(1900) = 3.28\\). FIGURE 11.8: Interaction model with prediction. Eyeballing it, the predicted log10_price seems to be around 5.75. Let’s now obtain the exact numerical value for the prediction using the equation of the regression line for the condition = 5 houses, being sure to log10() the square footage first. 2.45 + 1 * log10(1900) [1] 5.73 This value is very close to our earlier visually made prediction of 5.75. But wait! Is our prediction for the price of this house $5.75? No! Remember that we are using log10_price as our outcome variable! So, if we want a prediction in dollar units of price, we need to unlog this by taking a power of 10 as described in Appendix A.3. 10^(2.45 + 1 * log10(1900)) [1] 535493 So our predicted price for this home of condition 5 and of size 1900 square feet is $535,493. Learning check (LC11.1) Repeat the regression modeling in Subsection 11.2.3 and the prediction making you just did on the house of condition 5 and size 1900 square feet in Subsection 11.2.4, but using the parallel slopes model you visualized in Figure 11.6. Show that it’s $524,807! 11.3 Case study: Effective data storytelling As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, you’ve seen the value of statistical inference as a process to come to conclusions about a population by using sampling. 
Lastly, you’ve explored how to fit linear regression models and the importance of checking the conditions required so that all confidence intervals and hypothesis tests have valid interpretations. All throughout, you’ve learned many computational techniques and focused on writing R code that’s reproducible. We now present another set of case studies, but this time on the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather immerse them in understanding the importance that data plays in our lives through storytelling. 11.3.1 Bechdel test for Hollywood gender representation We recommend you read and analyze Walt Hickey’s FiveThirtyEight.com article, “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.” In it, Walt completed a multidecade study of how many movies pass the Bechdel test, an informal test of gender representation in a movie that was created by Alison Bechdel. As you read over the article, think carefully about how Walt Hickey is using data, graphics, and analyses to tell the reader a story. In the spirit of reproducibility, FiveThirtyEight have also shared the data and R code that they used for this article. You can also find the data used in many more of their articles on their GitHub page. ModernDive co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the fivethirtyeight package which provides access to these datasets more easily in R. For a complete list of all 127 datasets included in the fivethirtyeight package, check out the package webpage at https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html. Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using dplyr, ggplot2, and other packages in the tidyverse are available here. 
For example, a vignette showing how to reproduce one of the plots at the end of the article on the Bechdel test is available here. 11.3.2 US Births in 1999 The US_births_1994_2003 data frame included in the fivethirtyeight package provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame including a link to the original article on FiveThirtyEight.com, check out the help file by running ?US_births_1994_2003 in the console. It’s always a good idea to preview your data, either by using RStudio’s spreadsheet View() function or using glimpse() from the dplyr package: glimpse(US_births_1994_2003) Observations: 3,652 Variables: 6 $ year &lt;int&gt; 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1… $ month &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ date_of_month &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … $ date &lt;date&gt; 1994-01-01, 1994-01-02, 1994-01-03, 1994-01-04, 1994-0… $ day_of_week &lt;ord&gt; Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu… $ births &lt;int&gt; 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79… We’ll focus on the number of births for each date, but only for births that occurred in 1999. Recall from Section 3.2 we can do this using the filter() function from the dplyr package: US_births_1999 &lt;- US_births_1994_2003 %&gt;% filter(year == 1999) As discussed in Section 2.4, since date is a notion of time and thus has sequential ordering to it, a linegraph would be a more appropriate visualization to use than a scatterplot. In other words, we should use a geom_line() instead of geom_point(). Recall that such plots are called time series plots. ggplot(US_births_1999, aes(x = date, y = births)) + geom_line() + labs(x = &quot;Date&quot;, y = &quot;Number of births&quot;, title = &quot;US Births in 1999&quot;) FIGURE 11.9: Number of births in the US in 1999. 
We see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike? Let’s sort the rows of US_births_1999 in descending order of the number of births. Recall from Section 3.6 that we can use the arrange() function from the dplyr package to do this, making sure to sort births in descending order: US_births_1999 %&gt;% arrange(desc(births)) # A tibble: 365 x 6 year month date_of_month date day_of_week births &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;date&gt; &lt;ord&gt; &lt;int&gt; 1 1999 9 9 1999-09-09 Thurs 14540 2 1999 12 21 1999-12-21 Tues 13508 3 1999 9 8 1999-09-08 Wed 13437 4 1999 9 21 1999-09-21 Tues 13384 5 1999 9 28 1999-09-28 Tues 13358 6 1999 7 7 1999-07-07 Wed 13343 7 1999 7 8 1999-07-08 Thurs 13245 8 1999 8 17 1999-08-17 Tues 13201 9 1999 9 10 1999-09-10 Fri 13181 10 1999 12 28 1999-12-28 Tues 13158 # … with 355 more rows The date with the highest number of births (14,540) is in fact 1999-09-09. If we write down this date in month/day/year format (a standard format in the US), the date with the highest number of births is 9/9/99! All nines! Could it be that parents deliberately induced labor at a higher rate on this date? Maybe? Whatever the cause may be, this fact makes a fun story! Learning check (LC11.2) What date between 1994 and 2003 has the fewest births in the US? What story could you tell about why this is the case? Time to think with data and further tell your story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? What assumptions did you make in this analysis? We leave these questions to you as the reader to explore and examine. Remember to get in touch with us via our contact info in the Preface. 
We’d love to see what you come up with! Please check out additional problem sets and labs at https://moderndive.com/labs as well. 11.3.3 Scripts of R code An R script file of all R code used in this chapter is available here. R code files saved as *.R files for all relevant chapters throughout the entire book are in the following table. chapter link 1 https://moderndive.com/scripts/01-getting-started.R 2 https://moderndive.com/scripts/02-visualization.R 3 https://moderndive.com/scripts/03-wrangling.R 4 https://moderndive.com/scripts/04-tidy.R 5 https://moderndive.com/scripts/05-regression.R 6 https://moderndive.com/scripts/06-multiple-regression.R 7 https://moderndive.com/scripts/07-sampling.R 8 https://moderndive.com/scripts/08-confidence-intervals.R 9 https://moderndive.com/scripts/09-hypothesis-testing.R 10 https://moderndive.com/scripts/10-inference-for-regression.R 11 https://moderndive.com/scripts/11-tell-your-story-with-data.R Concluding remarks Now that you’ve made it to this point in the book, we suspect that you know a thing or two about how to work with data in R! You’ve also gained a lot of knowledge about how to use simulation-based techniques for statistical inference and how these techniques help build intuition about traditional theory-based inferential methods like the \\(t\\)-test. The hope is that you’ve come to appreciate the power of data in all respects, such as data wrangling, tidying datasets, data visualization, data modeling, and statistical inference. In our opinion, while each of these is important, data visualization may be the most important tool for a citizen or professional data scientist to have in their toolbox. If you can create truly beautiful graphics that display information in ways that the reader can clearly understand, you have great power to tell your tale with data. Let’s hope that these skills help you tell great stories with data into the future. 
Thanks for coming along this journey as we dove into modern data analysis using R and the tidyverse! "],
+["A-appendixA.html", "A Statistical Background A.1 Basic statistical terms A.2 Normal distribution A.3 log10 transformations", " A Statistical Background A.1 Basic statistical terms Note that all the following statistical terms apply only to numerical variables, except the distribution which can exist for both numerical and categorical variables. A.1.1 Mean The mean is the most commonly reported measure of center. It is commonly called the average though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have \\(n\\) data points, the mean is given by: \\[Mean = \\frac{x_1 + x_2 + \\cdots + x_n}{n}\\] A.1.2 Median The median is calculated by first sorting a variable’s data from smallest to largest. After sorting the data, the middle element in the list is the median. If the middle falls between two values, then the median is the mean of those two middle values. A.1.3 Standard deviation We will next discuss the standard deviation (\\(sd\\)) of a variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far we expect a given data value will be from its mean: \\[sd = \\sqrt{\\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \\cdots + (x_n - Mean)^2}{n - 1}}\\] A.1.4 Five-number summary The five-number summary consists of five summary statistics: the minimum, the first quartile AKA 25th percentile, the second quartile AKA median or 50th percentile, the third quartile AKA 75th percentile, and the maximum. The five-number summary of a variable is used when constructing boxplots, as seen in Section 2.7. The quartiles are calculated as first quartile (\\(Q_1\\)): the median of the first half of the sorted data third quartile (\\(Q_3\\)): the median of the second half of the sorted data The interquartile range (IQR) is defined as \\(Q_3 - Q_1\\) and is a measure of how spread out the middle 50% of values are. 
The IQR corresponds to the length of the box in a boxplot. The median and the IQR are not influenced by the presence of outliers in the ways that the mean and standard deviation are. They are, thus, recommended for skewed datasets. We say in this case that the median and IQR are more robust to outliers. A.1.5 Distribution The distribution of a variable shows how frequently different values of a variable occur. Looking at the visualization of a distribution can show where the values are centered, show how the values vary, and give some information about where a typical value might fall. It can also alert you to the presence of outliers. Recall from Chapter 2 that we can visualize the distribution of a numerical variable using binning in a histogram and that we can visualize the distribution of a categorical variable using a barplot. A.1.6 Outliers Outliers correspond to values in the dataset that fall far outside the range of “ordinary” values. In the context of a boxplot, by default they correspond to values below \\(Q_1 - (1.5 \\cdot IQR)\\) or above \\(Q_3 + (1.5 \\cdot IQR)\\). A.2 Normal distribution Let’s next discuss one particular kind of distribution: normal distributions. Such bell-shaped distributions are defined by two values: (1) the mean \\(\\mu\\) (“mu”) which locates the center of the distribution and (2) the standard deviation \\(\\sigma\\) (“sigma”) which determines the variation of the distribution. In Figure A.1, we plot three normal distributions where: The solid normal curve has mean \\(\\mu = 5\\) &amp; standard deviation \\(\\sigma = 2\\). The dotted normal curve has mean \\(\\mu = 5\\) &amp; standard deviation \\(\\sigma = 5\\). The dashed normal curve has mean \\(\\mu = 15\\) &amp; standard deviation \\(\\sigma = 2\\). FIGURE A.1: Three normal distributions. Notice how the solid and dotted line normal curves have the same center due to their common mean \\(\\mu\\) = 5. 
However, the dotted line normal curve is wider due to its larger standard deviation of \\(\\sigma\\) = 5. On the other hand, the solid and dashed line normal curves have the same variation due to their common standard deviation \\(\\sigma\\) = 2. However, they are centered at different locations. When the mean \\(\\mu\\) = 0 and the standard deviation \\(\\sigma\\) = 1, the normal distribution has a special name. It’s called the standard normal distribution or the \\(z\\)-curve. Furthermore, if a variable follows a normal curve, there are three rules of thumb we can use: 68% of values will lie within \\(\\pm\\) 1 standard deviation of the mean. 95% of values will lie within \\(\\pm\\) 1.96 \\(\\approx\\) 2 standard deviations of the mean. 99.7% of values will lie within \\(\\pm\\) 3 standard deviations of the mean. Let’s illustrate this on a standard normal curve in Figure A.2. The dashed lines are at -3, -1.96, -1, 0, 1, 1.96, and 3. These 7 lines cut up the x-axis into 8 segments. The areas under the normal curve for each of the 8 segments are marked and add up to 100%. For example: The middle two segments represent the interval -1 to 1. The shaded area above this interval represents 34% + 34% = 68% of the area under the curve. In other words, 68% of values. The middle four segments represent the interval -1.96 to 1.96. The shaded area above this interval represents 13.5% + 34% + 34% + 13.5%= 95% of the area under the curve. In other words, 95% of values. The middle six segments represent the interval -3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values. FIGURE A.2: Rules of thumb about areas under normal curves. Learning check Say you have a normal distribution with mean \\(\\mu = 6\\) and standard deviation \\(\\sigma = 3\\). (LCA.1) What proportion of the area under the normal curve is less than 3? Greater than 12? Between 0 and 12? 
(LCA.2) What is the 2.5th percentile of the area under the normal curve? The 95th percentile? The 100th percentile? A.3 log10 transformations At its simplest, log10 transformations return base 10 logarithms. For example, since \\(1000 = 10^3\\), running log10(1000) returns 3 in R. To undo a log10 transformation, we raise 10 to this value. For example, to undo the previous log10 transformation and return the original value of 1000, we raise 10 to the power of 3 by running 10^(3) = 1000 in R. Log transformations allow us to focus on changes in orders of magnitude. In other words, they allow us to focus on multiplicative changes instead of additive ones. Let’s illustrate this idea in Table A.1 with examples of prices of consumer goods in 2019 US dollars. TABLE A.1: log10 transformed prices, orders of magnitude, and examples Price log10(Price) Order of magnitude Examples $1 0 Singles Cups of coffee $10 1 Tens Books $100 2 Hundreds Mobile phones $1,000 3 Thousands High definition TVs $10,000 4 Tens of thousands Cars $100,000 5 Hundreds of thousands Luxury cars and houses $1,000,000 6 Millions Luxury houses Let’s make some remarks about log10 transformations based on Table A.1: When purchasing a cup of coffee, we tend to think of prices ranging in single dollars, such as $2 or $3. However, when purchasing a mobile phone, we don’t tend to think of their prices in units of single dollars such as $313 or $727. Instead, we tend to think of their prices in units of hundreds of dollars like $300 or $700. Thus, cups of coffee and mobile phones are of different orders of magnitude in price. Let’s say we want to know the log10 transformed value of $76. This would be hard to compute exactly without a calculator. However, since $76 is between $10 and $100 and since log10(10) = 1 and log10(100) = 2, we know log10(76) will be between 1 and 2. In fact, log10(76) is 1.880814. log10 transformations are monotonic, meaning they preserve orders. 
So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B). Most importantly, increments of one in log10-scale correspond to relative multiplicative changes in the original scale and not absolute additive changes. For example, increasing a log10(Price) from 3 to 4 corresponds to a multiplicative increase by a factor of 10: $1,000 to $10,000. "],
+["B-appendixB.html", "B Inference Examples Needed packages B.1 Inference mind map B.2 One mean B.3 One proportion B.4 Two proportions B.5 Two means (independent samples) B.6 Two means (paired samples)", " B Inference Examples This appendix is designed to provide you with examples of the five basic hypothesis tests and their corresponding confidence intervals. Traditional theory-based methods as well as computational-based methods are presented. Note: This appendix is still under construction. If you would like to contribute, please check us out on GitHub at https://github.com/moderndive/moderndive_book. Needed packages library(dplyr) library(ggplot2) library(infer) library(knitr) library(kableExtra) library(readr) library(janitor) B.1 Inference mind map To help you better navigate and choose the appropriate analysis, we’ve created a mind map on http://coggle.it available here and below. FIGURE B.1: Mind map for Inference. B.2 One mean B.2.1 Problem statement The National Survey of Family Growth conducted by the Centers for Disease Control gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. One of the variables collected on this survey is the age at first marriage. 5,534 randomly sampled US women between 2006 and 2010 completed the survey. The women sampled here had been married at least once. Do we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years? (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 4]) B.2.2 Competing hypotheses In words Null hypothesis: The mean age of first marriage for all US women from 2006 to 2010 is equal to 23 years. Alternative hypothesis: The mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years. 
In symbols (with annotations) \\(H_0: \\mu = \\mu_{0}\\), where \\(\\mu\\) represents the mean age of first marriage for all US women from 2006 to 2010 and \\(\\mu_0\\) is 23. \\(H_A: \\mu &gt; 23\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.2.3 Exploring the sample data age_at_marriage &lt;- read_csv(&quot;https://moderndive.com/data/ageAtMar.csv&quot;) age_summ &lt;- age_at_marriage %&gt;% summarize(sample_size = n(), mean = mean(age), sd = sd(age), minimum = min(age), lower_quartile = quantile(age, 0.25), median = median(age), upper_quartile = quantile(age, 0.75), max = max(age)) kable(age_summ) %&gt;% kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), latex_options = c(&quot;hold_position&quot;)) sample_size mean sd minimum lower_quartile median upper_quartile max 5534 23.4 4.72 10 20 23 26 43 The histogram below also shows the distribution of age. ggplot(data = age_at_marriage, mapping = aes(x = age)) + geom_histogram(binwidth = 3, color = &quot;white&quot;) The observed statistic of interest here is the sample mean: x_bar &lt;- age_at_marriage %&gt;% specify(response = age) %&gt;% calculate(stat = &quot;mean&quot;) x_bar # A tibble: 1 x 1 stat &lt;dbl&gt; 1 23.4402 Guess about statistical significance We are looking to see if the observed sample mean of 23.44 is statistically greater than \\(\\mu_0 = 23\\). They seem to be quite close, but we have a large sample size here. Let’s guess that the large sample size will lead us to reject this practically small difference. B.2.4 Non-traditional methods Bootstrapping for hypothesis test In order to look to see if the observed sample mean of 23.44 is statistically greater than \\(\\mu_0 = 23\\), we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 5534 was selected. 
We can use the idea of bootstrapping to simulate the population from which the sample came and then generate samples from that simulated population to account for sampling variability. Recall how bootstrapping would apply in this context: Sample with replacement from our original sample of 5534 women and repeat this process 10,000 times, calculate the mean for each of the 10,000 bootstrap samples created in Step 1., combine all of these bootstrap statistics calculated in Step 2 into a boot_distn object, and shift the center of this distribution over to the null value of 23. (This is needed since it will be centered at 23.44 via the process of bootstrapping.) set.seed(2018) null_distn_one_mean &lt;- age_at_marriage %&gt;% specify(response = age) %&gt;% hypothesize(null = &quot;point&quot;, mu = 23) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) null_distn_one_mean %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a right-tailed test so we will be looking for values that are greater than or equal to 23.44 for our \\(p\\)-value. null_distn_one_mean %&gt;% visualize(obs_stat = x_bar, direction = &quot;greater&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_one_mean %&gt;% get_pvalue(obs_stat = x_bar, direction = &quot;greater&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0 So our \\(p\\)-value is 0 and we reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are far into the tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu\\) using our sample data using bootstrapping. Note that we don’t need to shift this distribution since we want the center of our confidence interval to be our point estimate \\(\\bar{x}_{obs} = 23.44\\). 
boot_distn_one_mean &lt;- age_at_marriage %&gt;% specify(response = age) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) ci &lt;- boot_distn_one_mean %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 23.3148 23.5669 boot_distn_one_mean %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 23 is not contained in this confidence interval as a plausible value of \\(\\mu\\) (the unknown population mean) and the entire interval is larger than 23. This matches with our hypothesis test results of rejecting the null hypothesis in favor of the alternative (\\(\\mu &gt; 23\\)). Interpretation: We are 95% confident the true mean age of first marriage for all US women from 2006 to 2010 is between 23.315 and 23.567. B.2.5 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are collected independently. The cases are selected independently through random sampling so this condition is met. Approximately normal: The distribution of the response variable should be normal or the sample size should be at least 30. The histogram for the sample above does show some skew. The Q-Q plot below also shows some skew. ggplot(data = age_at_marriage, mapping = aes(sample = age)) + stat_qq() The sample size here is quite large though (\\(n = 5534\\)) so both conditions are met. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean \\(\\mu\\). A good guess is the sample mean \\(\\bar{X}\\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. 
We are looking to see how likely it is for us to have observed a sample mean of \\(\\bar{x}_{obs} = 23.44\\) or larger assuming that the population mean is 23 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can “standardize” this original test statistic of \\(\\bar{X}\\) into a \\(T\\) statistic that follows a \\(t\\) distribution with degrees of freedom equal to \\(df = n - 1\\): \\[ T =\\dfrac{ \\bar{X} - \\mu_0}{ S / \\sqrt{n} } \\sim t (df = n - 1) \\] where \\(S\\) represents the standard deviation of the sample and \\(n\\) is the sample size. Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t_test() function to perform this analysis for us. t_test_results &lt;- age_at_marriage %&gt;% infer::t_test(formula = age ~ NULL, alternative = &quot;greater&quot;, mu = 23) t_test_results # A tibble: 1 x 6 statistic t_df p_value alternative lower_ci upper_ci &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; 1 6.93570 5533 2.25216e-12 greater 23.3358 Inf We see here that the \\(t_{obs}\\) value is 6.936. Compute \\(p\\)-value The \\(p\\)-value—the probability of observing a \\(t_{obs}\\) value of 6.936 or more in our null distribution of a \\(t\\) with 5533 degrees of freedom—is essentially 0. State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean was statistically greater than the hypothesized mean has supporting evidence here. Based on this sample, we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years. 
Confidence interval t.test(x = age_at_marriage$age, alternative = &quot;two.sided&quot;, mu = 23)$conf [1] 23.3 23.6 attr(,&quot;conf.level&quot;) [1] 0.95 B.2.6 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met (the large sample size was the driver here) leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.3 One proportion B.3.1 Problem statement The CEO of a large electric utility claims that 80 percent of his 1,000,000 customers are satisfied with the service they receive. To test this claim, the local newspaper surveyed 100 customers, using simple random sampling. 73 were satisfied and the remaining were unsatisfied. Based on these findings from the sample, can we reject the CEO’s hypothesis that 80% of the customers are satisfied? [Tweaked a bit from http://stattrek.com/hypothesis-test/proportion.aspx?Tutorial=AP] B.3.2 Competing hypotheses In words Null hypothesis: The proportion of all customers of the large electric utility satisfied with service they receive is equal to 0.80. Alternative hypothesis: The proportion of all customers of the large electric utility satisfied with service they receive is different from 0.80. In symbols (with annotations) \\(H_0: \\pi = p_{0}\\), where \\(\\pi\\) represents the proportion of all customers of the large electric utility satisfied with service they receive and \\(p_0\\) is 0.8. \\(H_A: \\pi \\ne 0.8\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. 
B.3.3 Exploring the sample data elec &lt;- c(rep(&quot;satisfied&quot;, 73), rep(&quot;unsatisfied&quot;, 27)) %&gt;% as_data_frame() %&gt;% rename(satisfy = value) The bar graph below also shows the distribution of satisfy. ggplot(data = elec, aes(x = satisfy)) + geom_bar() The observed statistic is computed as p_hat &lt;- elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% calculate(stat = &quot;prop&quot;) p_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 0.73 Guess about statistical significance We are looking to see if the sample proportion of 0.73 is statistically different from \\(p_0 = 0.8\\) based on this sample. They seem to be quite close, and our sample size is not huge here (\\(n = 100\\)). Let’s guess that we do not have evidence to reject the null hypothesis. B.3.4 Non-traditional methods Simulation for hypothesis test In order to look to see if 0.73 is statistically different from 0.8, we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 100 was selected. We can use the idea of an unfair coin to simulate this process. We will simulate flipping an unfair coin (with probability of success 0.8 matching the null hypothesis) 100 times. Then we will keep track of how many heads come up in those 100 flips. Our simulated statistic matches with how we calculated the original statistic \\(\\hat{p}\\): the number of heads (satisfied) out of our total sample of 100. We then repeat this process many times (say 10,000) to create the null distribution looking at the simulated proportions of successes: set.seed(2018) null_distn_one_prop &lt;- elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% hypothesize(null = &quot;point&quot;, p = 0.8) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;prop&quot;) null_distn_one_prop %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. 
Recall this is a two-tailed test so we will be looking for values that are 0.8 - 0.73 = 0.07 away from 0.8 in BOTH directions for our \\(p\\)-value: null_distn_one_prop %&gt;% visualize(obs_stat = p_hat, direction = &quot;both&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_one_prop %&gt;% get_pvalue(obs_stat = p_hat, direction = &quot;both&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.1136 So our \\(p\\)-value is 0.114 and we fail to reject the null hypothesis at the 5% level. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\pi\\) using our sample data. To do so, we use bootstrapping, which involves sampling with replacement from our original sample of 100 survey respondents and repeating this process 10,000 times, calculating the proportion of successes for each of the 10,000 bootstrap samples created in Step 1., combining all of these bootstrap statistics calculated in Step 2 into a boot_distn object, identifying the 2.5th and 97.5th percentiles of this distribution (corresponding to the 5% significance level chosen) to find a 95% confidence interval for \\(\\pi\\), and interpret this confidence interval in the context of the problem. boot_distn_one_prop &lt;- elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;prop&quot;) Just as we use the mean function for calculating the mean over a numerical variable, we can also use it to compute the proportion of successes for a categorical variable where we specify what we are calling a “success” after the ==. (Think about the formula for calculating a mean and how R handles logical statements such as satisfy == &quot;satisfied&quot; for why this must be true.) 
ci &lt;- boot_distn_one_prop %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 0.64 0.81 boot_distn_one_prop %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0.80 is contained in this confidence interval as a plausible value of \\(\\pi\\) (the unknown population proportion). This matches with our hypothesis test results of failing to reject the null hypothesis. Interpretation: We are 95% confident the true proportion of customers who are satisfied with the service they receive is between 0.64 and 0.81. B.3.5 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are collected independently. The cases are selected independently through random sampling so this condition is met. Approximately normal: The number of expected successes and expected failures is at least 10. This condition is met since 73 and 27 are both greater than 10. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population proportion \\(\\pi\\). A good guess is the sample proportion \\(\\hat{P}\\). Recall that this sample proportion is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely it is for us to have observed a sample proportion as extreme as \\(\\hat{p}_{obs} = 0.73\\) (in either direction) assuming that the population proportion is 0.80 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can standardize this original test statistic of \\(\\hat{P}\\) into a \\(Z\\) statistic that follows a \\(N(0, 1)\\) distribution. 
\\[ Z =\\dfrac{ \\hat{P} - p_0}{\\sqrt{\\dfrac{p_0(1 - p_0)}{n} }} \\sim N(0, 1) \\] Observed test statistic While one could compute this observed test statistic by “hand” by plugging the observed values into the formula, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. The calculation has been done in R below for completeness though: p_hat &lt;- 0.73 p0 &lt;- 0.8 n &lt;- 100 (z_obs &lt;- (p_hat - p0) / sqrt( (p0 * (1 - p0)) / n)) [1] -1.75 We see here that the \\(z_{obs}\\) value is around -1.75. Our observed sample proportion of 0.73 is 1.75 standard errors below the hypothesized parameter value of 0.8. Visualize and compute \\(p\\)-value elec %&gt;% specify(response = satisfy, success = &quot;satisfied&quot;) %&gt;% hypothesize(null = &quot;point&quot;, p = 0.8) %&gt;% calculate(stat = &quot;z&quot;) %&gt;% visualize(method = &quot;theoretical&quot;, obs_stat = z_obs, direction = &quot;both&quot;) 2 * pnorm(z_obs) [1] 0.0801 The \\(p\\)-value—the probability of observing a \\(z_{obs}\\) value of -1.75 or more extreme (in both directions) in our null distribution—is around 8%. Note that we could also do this test directly using the prop.test function. stats::prop.test(x = table(elec$satisfy), n = length(elec$satisfy), alternative = &quot;two.sided&quot;, p = 0.8, correct = FALSE) 1-sample proportions test without continuity correction data: table(elec$satisfy), null probability 0.8 X-squared = 3, df = 1, p-value = 0.08 alternative hypothesis: true p is not equal to 0.8 95 percent confidence interval: 0.636 0.807 sample estimates: p 0.73 prop.test does a \\(\\chi^2\\) test here but this matches up exactly with what we would expect: \\(\\chi^2_{obs} = 3.06 = (-1.75)^2 = (z_{obs})^2\\) and the \\(p\\)-values are the same because we are focusing on a two-tailed test. Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping. 
State conclusion We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample proportion was not statistically different from the hypothesized proportion has not been invalidated. Based on this sample, we do not have evidence that the proportion of all customers of the large electric utility satisfied with service they receive is different from 0.80, at the 5% level. B.3.6 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.4 Two proportions B.4.1 Problem statement A 2010 survey asked 827 randomly sampled registered voters in California “Do you support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Conduct a hypothesis test to determine if the data provide strong evidence that the proportion of college graduates who do not have an opinion on this issue is different than that of non-college graduates. (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 6]) B.4.2 Competing hypotheses In words Null hypothesis: There is no association between having an opinion on drilling and having a college degree for all registered California voters in 2010. Alternative hypothesis: There is an association between having an opinion on drilling and having a college degree for all registered California voters in 2010. 
Another way in words Null hypothesis: The probability that a California voter in 2010 who is a college graduate has no opinion on drilling is the same as that for a non-college graduate. Alternative hypothesis: These parameter probabilities are different. In symbols (with annotations) \\(H_0: \\pi_{college} = \\pi_{no\\_college}\\) or \\(H_0: \\pi_{college} - \\pi_{no\\_college} = 0\\), where \\(\\pi\\) represents the probability of not having an opinion on drilling. \\(H_A: \\pi_{college} - \\pi_{no\\_college} \\ne 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.4.3 Exploring the sample data offshore &lt;- read_csv(&quot;https://moderndive.com/data/offshore.csv&quot;) offshore %&gt;% tabyl(college_grad, response) college_grad no opinion opinion no 131 258 yes 104 334 off_summ &lt;- offshore %&gt;% group_by(college_grad) %&gt;% summarize(prop_no_opinion = mean(response == &quot;no opinion&quot;), sample_size = n()) ggplot(offshore, aes(x = college_grad, fill = response)) + geom_bar(position = &quot;fill&quot;) + coord_flip() Guess about statistical significance We are looking to see if a difference exists in the size of the bars corresponding to no opinion for the plot. Based solely on the plot, we have little reason to believe that a difference exists since the bars seem to be about the same size, BUT…it’s important to use statistics to see if that difference is actually statistically significant! 
B.4.4 Non-traditional methods Collecting summary info The observed statistic is d_hat &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) d_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -0.0993180 Randomization for hypothesis test In order to look to see if the observed sample proportion of no opinion for non-college graduates of 0.337 is statistically different than that for college graduates of 0.237, we need to account for the sample sizes. Note that this is the same as looking to see if \\(\\hat{p}_{grad} - \\hat{p}_{nograd}\\) is statistically different than 0. We also need to determine a process that replicates how the original group sizes of 389 and 438 were selected. We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability. set.seed(2018) null_distn_two_props &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) null_distn_two_props %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a two-tailed test so we will be looking for values that are less than or equal to -0.099 or greater than or equal to 0.099 for our \\(p\\)-value. 
null_distn_two_props %&gt;% visualize(obs_stat = d_hat, direction = &quot;two_sided&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_two_props %&gt;% get_pvalue(obs_stat = d_hat, direction = &quot;two_sided&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.00240000 So our \\(p\\)-value is 0.002 and we reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are far into the tails of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\pi_{college} - \\pi_{no\\_college}\\) using our sample data with bootstrapping. boot_distn_two_props &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in props&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) ci &lt;- boot_distn_two_props %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -0.160030 -0.0379112 boot_distn_two_props %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0 is not contained in this confidence interval as a plausible value of \\(\\pi_{college} - \\pi_{no\\_college}\\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. Since zero is not a plausible value of the population parameter, we have evidence that the proportion of college graduates in California with no opinion on drilling is different than that of non-college graduates. Interpretation: We are 95% confident the true proportion of college graduates with no opinion on offshore drilling in California is between 0.038 and 0.160 smaller than that for non-college graduates. B.4.5 Traditional methods B.4.6 Check conditions Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met. 
Independent observations: Each case that was selected must be independent of all the other cases selected. This condition is met since cases were selected at random to observe. Sample size: The number of pooled successes and pooled failures must be at least 10 for each group. We need to first figure out the pooled success rate: \\[\\hat{p}_{obs} = \\dfrac{131 + 104}{827} = 0.28.\\] We now determine expected (pooled) success and failure counts: \\(0.28 \\cdot (131 + 258) = 108.92\\), \\(0.72 \\cdot (131 + 258) = 280.08\\) \\(0.28 \\cdot (104 + 334) = 122.64\\), \\(0.72 \\cdot (104 + 334) = 315.36\\) Independent selection of samples: The cases are not paired in any meaningful way. We have no reason to suspect that a college graduate selected would have any relationship to a non-college graduate selected. B.4.7 Test statistic The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample proportions corresponding to no opinion on drilling (\\(\\hat{p}_{college, obs} - \\hat{p}_{no\\_college, obs}\\) = -0.099) is statistically different than 0. Assuming that conditions are met and the null hypothesis is true, we can use the standard normal distribution to standardize the difference in sample proportions (\\(\\hat{P}_{college} - \\hat{P}_{no\\_college}\\)) using the standard error of \\(\\hat{P}_{college} - \\hat{P}_{no\\_college}\\) and the pooled estimate: \\[ Z =\\dfrac{ (\\hat{P}_1 - \\hat{P}_2) - 0}{\\sqrt{\\dfrac{\\hat{P}(1 - \\hat{P})}{n_1} + \\dfrac{\\hat{P}(1 - \\hat{P})}{n_2} }} \\sim N(0, 1) \\] where \\(\\hat{P} = \\dfrac{\\text{total number of successes} }{ \\text{total number of cases}}.\\) Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the specify() and calculate() functions from the infer package to compute the observed test statistic for us. 
z_hat &lt;- offshore %&gt;% specify(response ~ college_grad, success = &quot;no opinion&quot;) %&gt;% calculate(stat = &quot;z&quot;, order = c(&quot;yes&quot;, &quot;no&quot;)) z_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -3.16081 The observed difference in sample proportions is 3.16 standard errors smaller than 0. The \\(p\\)-value (the probability of observing a \\(Z\\) value of -3.16 or more extreme in our null distribution) is 0.0016. This can also be calculated in R directly: 2 * pnorm(-3.16, lower.tail = TRUE) [1] 0.00158 B.4.8 State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference did not exist in the proportions of no opinion on offshore drilling between college educated and non-college educated Californians was not validated. We do have evidence to suggest that there is a dependency between college graduation and position on offshore drilling for Californians. B.4.9 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions were also met, since all expected success and failure counts were well above 10. Using any of the methods, whether traditional (formula-based) or non-traditional (computational-based), leads to similar results. B.5 Two means (independent samples) B.5.1 Problem statement Average income varies from one region of the country to another, and it often reflects both lifestyles and regional living expenses. Suppose a new graduate is considering a job in two locations, Cleveland, OH and Sacramento, CA, and he wants to see whether the average income in one of these cities is higher than in the other. 
He would like to conduct a hypothesis test based on two randomly selected samples from the 2000 Census. (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 5]) B.5.2 Competing hypotheses In words Null hypothesis: There is no association between income and location (Cleveland, OH and Sacramento, CA). Alternative hypothesis: There is an association between income and location (Cleveland, OH and Sacramento, CA). Another way in words Null hypothesis: The mean income is the same for both cities. Alternative hypothesis: The mean income is different for the two cities. In symbols (with annotations) \\(H_0: \\mu_{sac} = \\mu_{cle}\\) or \\(H_0: \\mu_{sac} - \\mu_{cle} = 0\\), where \\(\\mu\\) represents the average income. \\(H_A: \\mu_{sac} - \\mu_{cle} \\ne 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.5.3 Exploring the sample data cle_sac &lt;- read.delim(&quot;https://moderndive.com/data/cleSac.txt&quot;) %&gt;% rename(metro_area = Metropolitan_area_Detailed, income = Total_personal_income) %&gt;% na.omit() inc_summ &lt;- cle_sac %&gt;% group_by(metro_area) %&gt;% summarize(sample_size = n(), mean = mean(income), sd = sd(income), minimum = min(income), lower_quartile = quantile(income, 0.25), median = median(income), upper_quartile = quantile(income, 0.75), max = max(income)) kable(inc_summ) %&gt;% kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), latex_options = c(&quot;hold_position&quot;)) metro_area sample_size mean sd minimum lower_quartile median upper_quartile max Cleveland_ OH 212 27467 27681 0 8475 21000 35275 152400 Sacramento_ CA 175 32428 35774 0 8050 20000 49350 206900 The boxplot below also shows the mean for each group highlighted by the red dots. 
ggplot(cle_sac, aes(x = metro_area, y = income)) + geom_boxplot() + stat_summary(fun.y = &quot;mean&quot;, geom = &quot;point&quot;, color = &quot;red&quot;) Guess about statistical significance We are looking to see if a difference exists in the mean income of the two levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that no difference exists. The distributions of income seem similar and the means fall in roughly the same place. B.5.4 Non-traditional methods Collecting summary info We now compute the observed statistic: d_hat &lt;- cle_sac %&gt;% specify(income ~ metro_area) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Sacramento_ CA&quot;, &quot;Cleveland_ OH&quot;)) d_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 4960.48 Randomization for hypothesis test In order to look to see if the observed sample mean for Sacramento of 32427.543 is statistically different from that for Cleveland of 27467.066, we need to account for the sample sizes. Note that this is the same as looking to see if \\(\\bar{x}_{sac} - \\bar{x}_{cle}\\) is statistically different from 0. We also need to determine a process that replicates how the original group sizes of 212 and 175 were selected. We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability. set.seed(2018) null_distn_two_means &lt;- cle_sac %&gt;% specify(income ~ metro_area) %&gt;% hypothesize(null = &quot;independence&quot;) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Sacramento_ CA&quot;, &quot;Cleveland_ OH&quot;)) null_distn_two_means %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. 
Recall this is a two-tailed test so we will be looking for values that are greater than or equal to 4960.477 or less than or equal to -4960.477 for our \\(p\\)-value. null_distn_two_means %&gt;% visualize(obs_stat = d_hat, direction = &quot;both&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_two_means %&gt;% get_pvalue(obs_stat = d_hat, direction = &quot;both&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0.1262 So our \\(p\\)-value is 0.126 and we fail to reject the null hypothesis at the 5% level. You can also see from the histogram above that we are not very far into the tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu_{sac} - \\mu_{cle}\\) using our sample data with bootstrapping. Here we will bootstrap by resampling with replacement instead of shuffling. Because hypothesize() is not called in the pipeline below, generate() draws bootstrap resamples of the observed rows; the original group sizes were 175 for Sacramento and 212 for Cleveland. boot_distn_two_means &lt;- cle_sac %&gt;% specify(income ~ metro_area) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;diff in means&quot;, order = c(&quot;Sacramento_ CA&quot;, &quot;Cleveland_ OH&quot;)) ci &lt;- boot_distn_two_means %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -1359.50 11499.7 boot_distn_two_means %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0 is contained in this confidence interval as a plausible value of \\(\\mu_{sac} - \\mu_{cle}\\) (the unknown population parameter). This matches with our hypothesis test results of failing to reject the null hypothesis. Since zero is a plausible value of the population parameter, we do not have evidence that Sacramento incomes are different from Cleveland incomes. 
Interpretation: We are 95% confident the true mean yearly income for those living in Sacramento is between 1359.50 dollars lower and 11499.69 dollars higher than for those living in Cleveland. Note: You could also use the null distribution based on randomization with a shift to have its center at \\(\\bar{x}_{sac} - \\bar{x}_{cle} = \\$4960.48\\) instead of at 0 and calculate its percentiles. The confidence interval produced via this method should be comparable to the one done using bootstrapping above. B.5.5 Traditional methods Check conditions Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are independent in both groups. This condition is met since the cases are randomly selected from each city. Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30. ggplot(cle_sac, aes(x = income)) + geom_histogram(color = &quot;white&quot;, binwidth = 20000) + facet_wrap(~ metro_area) We have some reason to doubt the normality assumption here since both histograms deviate from what a normal model would predict for each group. The sample sizes for each group are greater than 100 though so the assumptions should still apply. Independent samples: The samples should be collected without any natural pairing. There is no mention of there being a relationship between those selected in Cleveland and in Sacramento. B.5.6 Test statistic The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample means (\\(\\bar{x}_{sac, obs} - \\bar{x}_{cle, obs}\\) = 4960.477) is statistically different from 0. 
Assuming that conditions are met and the null hypothesis is true, we can use the \\(t\\) distribution to standardize the difference in sample means (\\(\\bar{X}_{sac} - \\bar{X}_{cle}\\)) using the approximate standard error of \\(\\bar{X}_{sac} - \\bar{X}_{cle}\\) (invoking \\(S_{sac}\\) and \\(S_{cle}\\) as estimates of unknown \\(\\sigma_{sac}\\) and \\(\\sigma_{cle}\\)). \\[ T =\\dfrac{ (\\bar{X}_1 - \\bar{X}_2) - 0}{ \\sqrt{\\dfrac{S_1^2}{n_1} + \\dfrac{S_2^2}{n_2}} } \\sim t (df = \\min(n_1 - 1, n_2 - 1)) \\] where 1 = Sacramento and 2 = Cleveland with \\(S_1^2\\) and \\(S_2^2\\) the sample variances of the incomes of the two cities, respectively, and \\(n_1 = 175\\) for Sacramento and \\(n_2 = 212\\) for Cleveland. Observed test statistic Note that we could also do (ALMOST) this test directly using the t.test function. The x and y arguments are expected to both be numeric vectors here so we’ll need to appropriately filter our datasets. cle_sac %&gt;% specify(income ~ metro_area) %&gt;% calculate(stat = &quot;t&quot;, order = c(&quot;Cleveland_ OH&quot;, &quot;Sacramento_ CA&quot;)) # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -1.50062 We see here that the observed test statistic value is around -1.5. Because the order argument lists Cleveland first, the statistic is computed as Cleveland minus Sacramento and is therefore negative; the two-sided \\(p\\)-value is unaffected by this choice. While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. B.5.7 Compute \\(p\\)-value The \\(p\\)-value (the probability of observing a \\(t_{174}\\) value of -1.501 or more extreme, in both directions, in our null distribution) is about 0.135. This can also be calculated in R directly: 2 * pt(-1.501, df = min(212 - 1, 175 - 1), lower.tail = TRUE) [1] 0.135 We can also approximate by using the standard normal curve: 2 * pnorm(-1.501) [1] 0.133 Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping. B.5.8 State conclusion We, therefore, do not have sufficient evidence to reject the null hypothesis. 
Our initial guess that no statistically significant difference exists in the means was backed by this statistical analysis. We do not have evidence to suggest that the true mean income differs between Cleveland, OH and Sacramento, CA based on this data. B.5.9 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. Since the conditions were also met, we can expect that any of the methods, whether traditional (formula-based) or non-traditional (computational-based), will lead to similar results. B.6 Two means (paired samples) Problem statement Trace metals in drinking water affect the flavor and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water at 10 randomly selected locations on a stretch of river. Do the data suggest that the true average concentration in the surface water is smaller than that of bottom water? (Note that units are not given.) [Tweaked a bit from https://onlinecourses.science.psu.edu/stat500/node/51] B.6.1 Competing hypotheses In words Null hypothesis: The mean concentration in the bottom water is the same as that of the surface water at different paired locations. Alternative hypothesis: The mean concentration in the surface water is smaller than that of the bottom water at different paired locations. In symbols (with annotations) \\(H_0: \\mu_{diff} = 0\\), where \\(\\mu_{diff}\\) represents the mean difference in concentration for surface water minus bottom water. \\(H_A: \\mu_{diff} &lt; 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. 
B.6.2 Exploring the sample data zinc_tidy &lt;- read_csv(&quot;https://moderndive.com/data/zinc_tidy.csv&quot;) We want to look at the differences in surface - bottom for each location: zinc_diff &lt;- zinc_tidy %&gt;% group_by(loc_id) %&gt;% summarize(pair_diff = diff(concentration)) %&gt;% ungroup() Next we calculate the mean difference as our observed statistic: d_hat &lt;- zinc_diff %&gt;% specify(response = pair_diff) %&gt;% calculate(stat = &quot;mean&quot;) d_hat # A tibble: 1 x 1 stat &lt;dbl&gt; 1 -0.0804 The histogram below also shows the distribution of pair_diff. ggplot(zinc_diff, aes(x = pair_diff)) + geom_histogram(binwidth = 0.04, color = &quot;white&quot;) Guess about statistical significance We are looking to see if the sample paired mean difference of -0.08 is statistically less than 0. They seem to be quite close, but we have a small number of pairs here. Let’s guess that we will fail to reject the null hypothesis. B.6.3 Non-traditional methods Bootstrapping for hypothesis test In order to look to see if the observed sample mean difference \\(\\bar{x}_{diff} = -0.0804\\) is statistically less than 0, we need to account for the number of pairs. We also need to determine a process that replicates how the paired data was selected in a way similar to how we calculated our original difference in sample means. Treating the differences as our data of interest, we next use the process of bootstrapping to build other simulated samples and then calculate the mean of the bootstrap samples. We hypothesize that the mean difference is zero. This process is similar to the One Mean example seen above, but using the differences between the two groups as a single sample with a hypothesized mean difference of 0. 
set.seed(2018) null_distn_paired_means &lt;- zinc_diff %&gt;% specify(response = pair_diff) %&gt;% hypothesize(null = &quot;point&quot;, mu = 0) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) null_distn_paired_means %&gt;% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a left-tailed test so we will be looking for values that are less than or equal to -0.0804 for our \\(p\\)-value. null_distn_paired_means %&gt;% visualize(obs_stat = d_hat, direction = &quot;less&quot;) Calculate \\(p\\)-value pvalue &lt;- null_distn_paired_means %&gt;% get_pvalue(obs_stat = d_hat, direction = &quot;less&quot;) pvalue # A tibble: 1 x 1 p_value &lt;dbl&gt; 1 0 So our \\(p\\)-value is essentially 0 and we reject the null hypothesis at the 5% level. You can also see from the histogram above that we are far into the left tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu_{diff}\\) using our sample data (the calculated differences) with bootstrapping. This is similar to the bootstrapping done in a one sample mean case, except now our data is differences instead of raw numerical data. Note that this code is identical to the pipeline shown in the hypothesis test above except the hypothesize() function is not called. boot_distn_paired_means &lt;- zinc_diff %&gt;% specify(response = pair_diff) %&gt;% generate(reps = 10000) %&gt;% calculate(stat = &quot;mean&quot;) ci &lt;- boot_distn_paired_means %&gt;% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 -0.111600 -0.0501975 boot_distn_paired_means %&gt;% visualize(endpoints = ci, direction = &quot;between&quot;) We see that 0 is not contained in this confidence interval as a plausible value of \\(\\mu_{diff}\\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. 
Since zero is not a plausible value of the population parameter and since the entire confidence interval falls below zero, we have evidence that surface zinc concentration levels are lower, on average, than bottom level zinc concentrations. Interpretation: We are 95% confident the true mean zinc concentration on the surface is between 0.05 and 0.11 units smaller than on the bottom. B.6.4 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations among pairs are independent. The locations are selected independently through random sampling so this condition is met. Approximately normal: The distribution of population of differences is normal or the number of pairs is at least 30. The histogram above does show some skew so we have reason to doubt the population being normal based on this sample. We also only have 10 pairs which is fewer than the 30 needed. A theory-based test may not be valid here. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean difference \\(\\mu_{diff}\\). A good guess is the sample mean difference \\(\\bar{X}_{diff}\\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely it is for us to have observed a sample mean of \\(\\bar{x}_{diff, obs} = -0.0804\\) or smaller assuming that the population mean difference is 0 (assuming the null hypothesis is true). 
If the conditions are met and assuming \\(H_0\\) is true, we can “standardize” this original test statistic of \\(\\bar{X}_{diff}\\) into a \\(T\\) statistic that follows a \\(t\\) distribution with degrees of freedom equal to \\(df = n - 1\\): \\[ T =\\dfrac{ \\bar{X}_{diff} - 0}{ S_{diff} / \\sqrt{n} } \\sim t (df = n - 1) \\] where \\(S_{diff}\\) represents the standard deviation of the sample differences and \\(n\\) is the number of pairs. Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t_test function on the differences to perform this analysis for us. t_test_results &lt;- zinc_diff %&gt;% infer::t_test(formula = pair_diff ~ NULL, alternative = &quot;less&quot;, mu = 0) t_test_results # A tibble: 1 x 6 statistic t_df p_value alternative lower_ci upper_ci &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; 1 -4.86381 9 0.000445558 less -Inf -0.0500982 We see here that the \\(t_{obs}\\) value is -4.864. Compute \\(p\\)-value The \\(p\\)-value (the probability of observing a \\(t_{obs}\\) value of -4.864 or less in our null distribution of a \\(t\\) with 9 degrees of freedom) is about 0.0004. This can also be calculated in R directly: pt(-4.8638, df = nrow(zinc_diff) - 1, lower.tail = TRUE) [1] 0.000446 State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean difference was not statistically less than the hypothesized mean of 0 has been invalidated here. Based on this sample, we have evidence that the mean concentration in the bottom water is greater than that of the surface water at different paired locations. 
B.6.5 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions were not met since the number of pairs was small, but the sample data was not highly skewed. Using any of the methods, whether traditional (formula-based) or non-traditional (computational-based), leads to similar results here. References "],
 ["C-appendixC.html", "C Reach for the Stars Needed packages C.1 Sorted barplots C.2 Interactive graphics", " C Reach for the Stars Needed packages library(dplyr) library(ggplot2) library(knitr) library(dygraphs) library(nycflights13) C.1 Sorted barplots Building upon the example in Section 2.8: flights_table &lt;- table(flights$carrier) flights_table 9E AA AS B6 DL EV F9 FL HA MQ OO UA US 18460 32729 714 54635 48110 54173 685 3260 342 26397 32 58665 20536 VX WN YV 5162 12275 601 We can sort this table from highest to lowest counts by using the sort function: sorted_flights &lt;- sort(flights_table, decreasing = TRUE) names(sorted_flights) [1] &quot;UA&quot; &quot;B6&quot; &quot;EV&quot; &quot;DL&quot; &quot;AA&quot; &quot;MQ&quot; &quot;US&quot; &quot;9E&quot; &quot;WN&quot; &quot;VX&quot; &quot;FL&quot; &quot;AS&quot; &quot;F9&quot; &quot;YV&quot; &quot;HA&quot; [16] &quot;OO&quot; It is often preferred for barplots to be ordered corresponding to the heights of the bars. This allows the reader to more easily compare the ordering of different airlines in terms of departed flights (Robbins 2013). We can also much more easily answer questions like “How many airlines have more departing flights than Southwest Airlines?”. We can use the sorted table giving the number of flights defined as sorted_flights to reorder the carrier. ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() + scale_x_discrete(limits = names(sorted_flights)) FIGURE C.1: Number of flights departing NYC in 2013 by airline - Descending numbers. The last addition here specifies the values of the horizontal x axis on a discrete scale to correspond to those given by the entries of sorted_flights. C.2 Interactive graphics C.2.1 Interactive linegraphs Another useful tool for viewing linegraphs such as this is the dygraph function in the dygraphs package in combination with the dyRangeSelector function. 
This allows us to zoom in on a selected range and get an interactive plot for us to work with: library(dygraphs) flights_day &lt;- mutate(flights, date = as.Date(time_hour)) flights_summarized &lt;- flights_day %&gt;% group_by(date) %&gt;% summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) rownames(flights_summarized) &lt;- flights_summarized$date flights_summarized &lt;- select(flights_summarized, -date) dyRangeSelector(dygraph(flights_summarized)) The syntax here is a little different than what we have covered so far. The dygraph function is expecting for the dates to be given as the rownames of the object. We then remove the date variable from the flights_summarized data frame since it is accounted for in the rownames. Lastly, we run the dygraph function on the new data frame that only contains the median arrival delay as a column and then provide the ability to have a selector to zoom in on the interactive plot via dyRangeSelector. (Note that this plot will only be interactive in the HTML version of this book.) References "],
-["D-appendixD.html", "D Learning Check Solutions D.1 Chapter 2 Solutions D.2 Chapter 3 Solutions D.3 Chapter 4 Solutions D.4 Chapter 5 Solutions D.5 Chapter 6 Solutions", " D Learning Check Solutions D.1 Chapter 2 Solutions library(dplyr) library(ggplot2) library(nycflights13) (LC2.1) Repeat the above installing steps, but for the dplyr, nycflights13, and knitr packages. This will install the earlier mentioned dplyr package, the nycflights13 package containing data on all domestic flights leaving a NYC airport in 2013, and the knitr package for writing reports in R. (LC2.2) “Load” the dplyr, nycflights13, and knitr packages as well by repeating the above steps. Solution: If the following code runs with no errors, you’ve succeeded! library(dplyr) library(nycflights13) library(knitr) (LC2.3) What does any ONE row in this flights dataset refer to? A. Data on an airline B. Data on a flight C. Data on an airport D. Data on multiple flights Solution: This is data on a flight. Not a flight path! Example: a flight path would be United 1545 to Houston a flight would be United 1545 to Houston at a specific date/time. For example: 2013/1/1 at 5:15am. (LC2.4) What are some examples in this dataset of categorical variables? What makes them different than quantitative variables? Solution: Hint: Type ?flights in the console to see what all the variables mean! Categorical: carrier the company dest the destination flight the flight number. Even though this is a number, its simply a label. Example United 1545 is not less than United 1714 Quantitative: distance the distance in miles time_hour time (LC2.5) What properties of the observational unit do each of lat, lon, alt, tz, dst, and tzone describe for the airports data frame? Note that you may want to use ?airports to get more information. 
Solution: lat and lon represent the airport geographic coordinates, alt is the altitude above sea level of the airport (Run airports %&gt;% filter(faa == &quot;DEN&quot;) to see the altitude of Denver International Airport), tz is the time zone difference with respect to GMT in London UK, dst is the daylight savings time zone, and tzone is the time zone label. (LC2.6) Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy dataset that matches these conditions. Solution: In the weather example in LC3.8, the combination of origin, year, month, day, hour are identification variables as they identify the observation in question. Anything else pertains to observations: temp, humid, wind_speed, etc. D.2 Chapter 3 Solutions library(nycflights13) library(ggplot2) library(dplyr) (LC3.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights) in the console. In what respect do these data frames differ? For example, think about the number of rows in each dataset. Solution: flights contains all flight data, while alaska_flights contains only data from Alaskan carrier “AS”. We can see that flights has 336776 rows while alaska_flights has only 714. (LC3.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship? Solution: The later a plane departs, typically the later it will arrive. (LC3.3) What variables in the weather data frame would you expect to have a negative correlation (i.e. a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here. Hint: Explore the weather dataset by using the View() function. Solution: An example in the weather dataset is visibility, which measures visibility in miles. As visibility increases, we would expect departure delays to decrease. 
(LC3.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights? Solution: The point (0,0) means no delay in departure nor arrival. From the point of view of Alaska airlines, this means the flight was on time. It seems most flights are at least close to being on time. (LC3.5) What are some other features of the plot that stand out to you? Solution: Different people will answer this one differently. One answer is most flights depart and arrive less than an hour late. (LC3.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example above. Solution: Many possibilities for this one, see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seem to be only two blocks of time where flights depart. ggplot(data = alaska_flights, mapping = aes(x = dep_time, y = dep_delay)) + geom_point() (LC3.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot? Solution: It thins out the points so we address overplotting. But more importantly it hints at the (statistical) density and distribution of the points: where are the points concentrated, where do they occur. (LC3.8) After viewing the Figure 2.4 above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 2.2? 
Solution: The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time. (LC3.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ? Solution: The rows of early_january_weather are a subset of weather. (LC3.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement whereas the hour variable does not? Solution: Because to uniquely identify an hour, we need the year/month/day/hour sequence, whereas there are only 24 possible hours. (LC3.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis? Solution: Because lines suggest connectedness and ordering. (LC3.12) Why are linegraphs frequently used when time is the explanatory variable? Solution: Because time is sequential: subsequent observations are closely related to each other. (LC3.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013. Solution: Humidity is a good one to look at, since it is very closely related to the cycles of a day. 
ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = humid)) + geom_line() (LC3.14) What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures? Solution: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the temp variable by View(weather), we see that the precision of each temperature recording is 2 decimal places. (LC3.15) Would you classify the distribution of temperatures as symmetric or skewed? Solution: It is rather symmetric, i.e. there are no long tails on only one side of the distribution. (LC3.16) What would you guess is the “center” value in this distribution? Why did you make that choice? Solution: The center is around 55.26°F. By running the summary() command, we see that the mean and median are very similar. In fact, when the distribution is symmetric the mean equals the median. (LC3.17) Is this data spread out greatly from the center or is it close? Why? Solution: This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures: FIGURE D.1: Annual temperatures at SEATAC Airport. While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F for a range of about 40°F. Seattle temperatures are much less spread out than New York’s, i.e. much more consistent over the year. New York on the other hand has much colder days in the winter and much hotter days in the summer. Expressed differently, the middle 50% of values, as delineated by the interquartile range is 30°F: (LC3.18) What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables? 
Solution: Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons. Because we see temp recordings split by month, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher. (LC3.19) What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100? Solution: They correspond to the month of the flight. While month is technically a number between 1-12, we’re viewing it as a categorical variable here. Specifically, this is an ordinal categorical variable since there is an ordering to the categories. 25, 50, 75, 100 are temperatures (LC3.20) For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics. Solution: It would not work if we had a very large number of facets. For example, if we facetted by individual days rather than months, as we would have 365 facets to look at. When considering all days in 2013, it could be argued that we shouldn’t care about day-to-day fluctuation in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends. (LC3.21) Does the temp variable in the weather data-set have a lot of variability? Why do you say that? Solution: Again, like in LC (LC3.17), this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain! (LC3.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point. Solution: It appears to be an outlier. Let’s revisit the use of the filter command to hone in on it. 
We want all data points where the month is 5 and temp&lt;25 weather %&gt;% filter(month == 5 &amp; temp &lt; 25) # A tibble: 1 x 16 origin year month day hour temp dewp humid wind_dir wind_speed wind_gust &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 JFK 2013 5 8 22 13.1 12.02 95.34 80 8.05546 NA # … with 5 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, # time_hour &lt;dttm&gt;, temp_in_C &lt;dbl&gt; There appears to be only one hour and only at JFK that recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (La Guardia)? (LC3.23) Which months have the highest variability in temperature? What reasons do you think this is? Solution: We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR): The distance from the 1st to the 3rd quartiles i.e. the length of the boxes You can also think of this as the spread of the middle 50% of the data Just from eyeballing it, it seems November has the biggest IQR, i.e. the widest box, so has the most variation in temperature August has the smallest IQR, i.e. the narrowest box, so is the most consistent temperature-wise Here’s how we compute the exact IQR values for each month (we’ll see this more in depth Chapter 3 of the text): group the observations by month then for each group, i.e. 
month, summarize it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE then arrange the table in descending order of IQR weather %&gt;% group_by(month) %&gt;% summarize(IQR = IQR(temp, na.rm=TRUE)) %&gt;% arrange(desc(IQR)) month IQR 11 16.02 12 14.04 1 13.77 9 12.06 4 12.06 5 11.88 6 10.98 10 10.98 2 10.08 7 9.18 3 9.00 8 7.02 (LC3.24) We looked at the distribution of the numerical variable temp split by the numerical variable month that we converted to a categorical variable using the factor() function. Why would a boxplot of temp split by the numerical variable pressure similarly converted to a categorical variable using the factor() not be informative? Solution: Because there are 12 unique values of month yielding only 12 boxes in our boxplot. There are many more unique values of pressure (469 unique values in fact), because values are to the first decimal place. This would lead to 469 boxes, which is too many for people to digest. (LC3.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram? Solution: In a histogram, the bin corresponding to where an outlier lies may not by high enough for us to see. In a boxplot, they are explicitly labelled separately. (LC3.26) Why are histograms inappropriate for visualizing categorical variables? Solution: Histograms are for numerical variables i.e. the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable. (LC3.27) What is the difference between histograms and barplots? Solution: See above. (LC3.28) How many Envoy Air flights departed NYC in 2013? Solution: Envoy Air is carrier code MQ and thus 26397 flights departed NYC in 2013. (LC3.29) What was the seventh highest airline in terms of departed flights from NYC in 2013? 
How could we better present the table to get this answer quickly? Solution: The answer is US, AKA U.S. Airways, with 20536 flights. However, picking out the seventh highest airline when the rows are sorted alphabetically by carrier code is difficult. This would be easier to do if the rows were sorted by number. We’ll learn how to do this in Chapter 3 on data wrangling. (LC3.30) Why should pie charts be avoided and replaced by barplots? Solution: In our opinion, comparisons using horizontal lines are easier than comparing angles and areas of circles. (LC3.31) What is your opinion as to why pie charts continue to be used? Solution: Legacy? (LC3.32) What kinds of questions are not easily answered by looking at the above figure? Solution: Because the red, green, and blue bars don’t all start at 0 (only red does), it makes comparing counts hard. (LC3.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights? Solution: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.} (LC3.34) Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case? Solution: We can easily compare the different aiports for a given carrier using a single comparison line i.e. things are lined up (LC3.35) What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general? Solution: It is hard to get totals for each airline. (LC3.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case? Solution: Not that different than using side-by-side; depends on how you want to organize your presentation. (LC3.37) What information about the different carriers at different airports is more easily seen in the faceted barplot? 
Solution: Now we can also compare the different carriers within a particular airport easily too. For example, we can read off who the top carrier for each airport is easily using a single horizontal line. D.3 Chapter 4 Solutions library(dplyr) library(ggplot2) library(nycflights13) (LC4.1) What’s another way using the “not” operator ! to filter only the rows that are not going to Burlington, VT nor Seattle, WA in the flights data frame? Test this out using the code above. Solution: # Original in book not_BTV_SEA &lt;- flights %&gt;% filter(!(dest == &quot;BTV&quot; | dest == &quot;SEA&quot;)) # Alternative way not_BTV_SEA &lt;- flights %&gt;% filter(!dest == &quot;BTV&quot; &amp; !dest == &quot;SEA&quot;) # Yet another way not_BTV_SEA &lt;- flights %&gt;% filter(dest != &quot;BTV&quot; &amp; dest != &quot;SEA&quot;) (LC4.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach? Solution: The missing patients may have died of lung cancer! So to ignore them might seriously bias your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself: There is a systematic reasons why certain values are missing? If so, you might be biasing your results! If there isn’t, then it might be ok to “sweep missing values under the rug.” (LC4.3) Modify the above summarize function to create summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to? Solution: It corresponds to a count of the number of observations/rows: weather %&gt;% summarize(count = n()) # A tibble: 1 x 1 count &lt;int&gt; 1 26115 (LC4.4) Why doesn’t the following code work? 
Run the code line by line instead of all at once, and then look at the data. In other words, run summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) first. summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) %&gt;% summarize(std_dev = sd(temp, na.rm = TRUE)) Solution: Consider the output of only running the first two lines: weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) # A tibble: 1 x 1 mean &lt;dbl&gt; 1 55.2604 Because after the first summarize(), the variable temp disappears as it has been collapsed to the value mean. So when we try to run the second summarize(), it can’t find the variable temp to compute the standard deviation of. (LC4.5) Recall from Chapter 2 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year? Solution: The standard deviation is a quantification of spread and variability. We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days. (LC4.6) What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC? Solution: Note: group_by(day) is not enough, because day is a value between 1-31. We need to group_by(year, month, day) library(dplyr) library(nycflights13) summary_temp_by_month &lt;- weather %&gt;% group_by(month) %&gt;% summarize( mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE) ) (LC4.7) Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting dataset? Solution: by_monthly_origin In by_monthly_origin the month column is now first and the rows are sorted by month instead of origin. 
If you compare the values of count in by_origin_monthly and by_monthly_origin using the View() function, you’ll see that the values are actually the same, just presented in a different order. (LC4.8) How could we identify how many flights left each of the three airports for each carrier? Solution: We could summarize the count from each airport using the n() function, which counts rows. All remarkably similar! Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) funciton sums all values of a certain numerical variable VARIABLE_NAME. (LC4.9) How does the filter operation differ from a group_by followed by a summarize? Solution: filter picks out rows from the original dataset without modifying them, whereas group_by %&gt;% summarize computes summaries of numerical variables, and hence reports new values. (LC4.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value? Solution: Say a flight departed 20 minutes late, i.e. dep_delay = 20 Then arrived 10 minutes late, i.e. arr_delay = 10. Then gain = dep_delay - arr_delay = 20 - 10 = 10 is positive, so it “made up/gained time in the air.” 0 means the departure and arrival time were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes. I never understood this. If the pilot says “we’re going make up time in the air” because of delay by flying faster, why don’t you always just fly faster to begin with? (LC4.11) Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights. Solution: No because you can’t do direct arithmetic on times. The difference in time between 12:03 and 11:59 is 4 minutes, but 1203-1159 = 44 (LC4.12) What can we say about the distribution of gain? 
Describe it in a few sentences using the plot and the gain_summary data frame values. Solution: Most of the time the gain is a little under zero, most of the time the gain is between -50 and 50 minutes. There are some extreme cases however! (LC4.13) Looking at Figure 3.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour? Solution: Because hour is simply a value between 0 and 23; to identify a specific hour, we need to know which year, month, day and at which airport. (LC4.14) What surprises you about the top 10 destinations from NYC in 2013? Solution: This question is subjective! What surprises me is the high number of flights to Boston. Wouldn’t it be easier and quicker to take the train? (LC4.15) What are some advantages of data in normal forms? What are some disadvantages? Solution: When datasets are in normal form, we can easily _join them with other datasets! For example, we can join the flights data with the planes data. (LC4.16) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways. Solution: (LC4.17) How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains. Solution: (LC4.18) Why might we want to use the select() function on a data frame? Solution: To narrow down the data frame, to make it easier to look at. Using View() for example. (LC4.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013. Solution: (LC4.20) Using the datasets included in the nycflights13 package, compute the available seat miles for each airline sorted in descending order. 
After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints: Crucial: Unless you are very confident in what you are doing, it is worthwhile to not starting coding right away, but rather first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code). Take a close look at all the datasets using the View() function: flights, weather, planes, airports, and airlines to identify which variables are necessary to compute available seat miles. Figure 3.7 above showing how the various datasets can be joined will also be useful. Consider the data wrangling verbs in Table 3.2 as your toolbox! Solution: Here are some examples of student-written pseudocode. Based on our own pseudocode, let’s first display the entire solution. Let’s now break this down step-by-step. To compute the available seat miles for a given flight, we need the distance variable from the flights data frame and the seats variable from the planes data frame, necessitating a join by the key variable tailnum as illustrated in Figure 3.7. To keep the resulting data frame easy to view, we’ll select() only these two variables and carrier: Now for each flight we can compute the available seat miles ASM by multiplying the number of seats by the distance via a mutate(): Next we want to sum the ASM for each carrier. We achieve this by first grouping by carrier and then summarizing using the sum() function: However, because for certain carriers certain flights have missing NA values, the resulting table also returns NA’s. 
We can eliminate these by adding a na.rm = TRUE argument to sum(), telling R that we want to remove the NA’s in the sum. We saw this in Section 3.3: Finally, we arrange() the data in desc()ending order of ASM. While the above data frame is correct, the IATA carrier code is not always useful. For example, what carrier is WN? We can address this by joining with the airlines dataset using carrier is the key variable. While this step is not absolutely required, it goes a long way to making the table easier to make sense of. It is important to be empathetic with the ultimate consumers of your presented data! D.4 Chapter 5 Solutions library(dplyr) library(ggplot2) library(nycflights13) library(tidyr) library(readr) (LC5.1) What are common characteristics of “tidy” datasets? Solution: Rows correspond to observations, while columns correspond to variables. (LC5.2) What makes “tidy” datasets useful for organizing data? Solution: Tidy datasets are an organized way of viewing data. This format is required for the ggplot2 and dplyr packages for data visualization and wrangling. (LC5.3) Take a look the airline_safety data frame included in the fivethirtyeight data. Run the following: airline_safety After reading the help file by running ?airline_safety, we see that airline_safety is a data frame containing information on different airlines companies’ safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver’s article “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”. 
Let’s ignore the incl_reg_subsidiaries and avail_seat_km_per_week variables for simplicity: airline_safety_smaller &lt;- airline_safety %&gt;% select(-c(incl_reg_subsidiaries, avail_seat_km_per_week)) airline_safety_smaller # A tibble: 56 x 7 airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14 &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 Aer Li… 2 0 0 0 2 Aerofl… 76 14 128 6 3 Aeroli… 6 0 0 1 4 Aerome… 3 1 64 5 5 Air Ca… 2 0 0 2 6 Air Fr… 14 4 79 6 7 Air In… 2 1 329 4 8 Air Ne… 3 0 0 5 9 Alaska… 5 0 0 5 10 Alital… 7 2 50 4 # … with 46 more rows, and 2 more variables: fatal_accidents_00_14 &lt;int&gt;, # fatalities_00_14 &lt;int&gt; This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable incident_type_years indicating the indicent type/year and a variable count of the counts? Solution: Using the gather() function from the tidyr package: airline_safety_smaller_tidy &lt;- airline_safety_smaller %&gt;% gather(key = incident_type_years, value = count, -airline) airline_safety_smaller_tidy # A tibble: 336 x 3 airline incident_type_years count &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 Aer Lingus incidents_85_99 2 2 Aeroflot incidents_85_99 76 3 Aerolineas Argentinas incidents_85_99 6 4 Aeromexico incidents_85_99 3 5 Air Canada incidents_85_99 2 6 Air France incidents_85_99 14 7 Air India incidents_85_99 2 8 Air New Zealand incidents_85_99 3 9 Alaska Airlines incidents_85_99 5 10 Alitalia incidents_85_99 7 # … with 326 more rows If you look at the resulting airline_safety_smaller_tidy data frame in the spreadsheet viewer, you’ll see that the variable incident_type_years has 6 possible values: &quot;incidents_85_99&quot;, &quot;fatal_accidents_85_99&quot;, &quot;fatalities_85_99&quot;, &quot;incidents_00_14&quot;, &quot;fatal_accidents_00_14&quot;, &quot;fatalities_00_14&quot; corresponding to the 6 columns of airline_safety_smaller we tidied. 
(LC5.4) Convert the dem_score data frame into a tidy data frame and assign the name of dem_score_tidy to the resulting long-formatted data frame. Solution: Running the following in the console: Let’s now compare the dem_score and dem_score_tidy. dem_score has democracy score information for each year in columns, whereas in dem_score_tidy there are explicit variables year and democracy_score. While both representations of the data contain the same information, we can only use ggplot() to create plots using the dem_score_tidy data frame. (LC5.5) Read in the life expectancy data stored at https://moderndive.com/data/le_mess.csv and convert it to a tidy data frame. Solution: The code is similar We observe the same construct structure with respect to year in life_expectancy vs life_expectancy_tidy as we did in dem_score vs dem_score_tidy: D.5 Chapter 6 Solutions To come! library(ggplot2) library(dplyr) library(moderndive) library(gapminder) #library(skimr) "],
-["E-appendixE.html", "E Information about R Packages Used", " E Information about R Packages Used This book uses the following versions of R packages (and their dependent packages). If you are seeing results slightly different than what is shown in the book and you want to get a closer match, we recommend you install the particular version of the package we used. This can be done by first installing the remotes package via install.packages(&quot;remotes&quot;) and then the particular version of a package using syntax similar to the following replacing the package argument with the name of the package in quotes and the version argument with the particular number of the version to install. remotes::install_version(package = &quot;moderndive&quot;, version = &quot;0.3.0&quot;) package version askpass 1.1 assertthat 0.2.1 backports 1.1.4 base64enc 0.1-3 BH 1.69.0-1 brew 1.0-6 broom 0.5.2 callr 3.3.0 cellranger 1.1.0 cli 1.1.0 clipr 0.6.0 clisymbols 1.2.0 colorspace 1.4-1 commonmark 1.7 crayon 1.3.4 curl 4.0 DBI 1.0.0 dbplyr 1.4.2 desc 1.2.0 devtools 2.1.0 digest 0.6.20 dplyr 0.8.3 dygraphs 1.1.1.6 ellipsis 0.2.0.1 evaluate 0.14 fansi 0.4.0 fivethirtyeight 0.4.0 forcats 0.4.0 formula.tools 1.7.1 fs 1.3.1 gapminder 0.3.0 generics 0.0.2 ggplot2 3.2.1 ggplot2movies 0.0.1 ggrepel 0.8.1 gh 1.0.1 git2r 0.26.1 glue 1.3.1 gridExtra 2.3 gtable 0.3.0 haven 2.1.0 highr 0.8 hms 0.4.2 htmltools 0.3.6 htmlwidgets 1.3 httr 1.4.0 infer 0.4.1 ini 0.3.1 ISLR 1.2 janitor 1.2.0 jsonlite 1.6 kableExtra 1.1.0 knitr 1.23 labeling 0.3 lattice 0.20-38 lazyeval 0.2.2 lubridate 1.7.4 magrittr 1.5 markdown 1.0 MASS 7.3-51.4 Matrix 1.2-17 memoise 1.1.0 mgcv 1.8-28 mime 0.7 modelr 0.1.4 moderndive 0.3.0 munsell 0.5.0 mvtnorm 1.0-11 nlme 3.1-139 nycflights13 1.0.0 openssl 1.4.1 operator.tools 1.6.3 pander 0.6.3 patchwork 0.0.1 pillar 1.4.2 pkgbuild 1.0.3 pkgconfig 2.0.2 pkgload 1.0.2 plogr 0.2.0 plyr 1.8.4 praise 1.0.0 prettyunits 1.0.2 processx 3.4.0 progress 1.2.2 ps 1.3.0 purrr 0.3.2 R6 2.4.0 
rcmdcheck 1.3.3 RColorBrewer 1.1-2 Rcpp 1.0.2 readr 1.3.1 readxl 1.3.1 rematch 1.0.1 remotes 2.1.0 reprex 0.3.0 reshape2 1.4.3 rlang 0.4.0 rmarkdown 1.14 roxygen2 6.1.1 rprojroot 1.3-2 rstudioapi 0.10 rvest 0.3.4 scales 1.0.0 selectr 0.4-1 sessioninfo 1.1.1 skimr 1.0.7 snakecase 0.11.0 stringi 1.4.3 stringr 1.4.0 sys 3.2 testthat 2.1.1 tibble 2.1.3 tidyr 0.8.3 tidyselect 0.2.5 tidyverse 1.2.1 tinytex 0.14 usethis 1.5.1 utf8 1.1.4 vctrs 0.2.0 viridis 0.5.1 viridisLite 0.3.0 webshot 0.5.1 whisker 0.3-2 withr 2.1.2 xfun 0.8 xml2 1.2.2 xopen 1.0.0 xts 0.11-2 yaml 2.2.0 zeallot 0.1.0 zoo 1.8-6 "],
+["D-appendixD.html", "D Learning Check Solutions D.1 Chapter 1 Solutions D.2 Chapter 2 Solutions D.3 Chapter 3 Solutions D.4 Chapter 4 Solutions D.5 Chapter 5 Solutions D.6 Chapter 6 Solutions D.7 Chapter 7 Solutions D.8 Chapter 8 Solutions D.9 Chapter 9 Solutions D.10 Chapter 10 Solutions D.11 Chapter 11 Solutions", " D Learning Check Solutions D.1 Chapter 1 Solutions library(dplyr) library(ggplot2) library(nycflights13) (LC1.1) Repeat the above installing steps, but for the dplyr, nycflights13, and knitr packages. This will install the earlier mentioned dplyr package, the nycflights13 package containing data on all domestic flights leaving a NYC airport in 2013, and the knitr package for writing reports in R. (LC1.2) “Load” the dplyr, nycflights13, and knitr packages as well by repeating the above steps. Solution: If the following code runs with no errors, you’ve succeeded! library(dplyr) library(nycflights13) library(knitr) (LC1.3) What does any ONE row in this flights dataset refer to? A. Data on an airline B. Data on a flight C. Data on an airport D. Data on multiple flights Solution: This is data on a flight. Not a flight path! Example: a flight path would be United 1545 to Houston; a flight would be United 1545 to Houston at a specific date/time. For example: 2013/1/1 at 5:15am. (LC1.4) What are some examples in this dataset of categorical variables? What makes them different than quantitative variables? Solution: Hint: Type ?flights in the console to see what all the variables mean! Categorical: carrier the company dest the destination flight the flight number. Even though this is a number, it’s simply a label. Example: United 1545 is not less than United 1714. Quantitative: distance the distance in miles time_hour time (LC1.5) What properties of the observational unit do each of lat, lon, alt, tz, dst, and tzone describe for the airports data frame? Note that you may want to use ?airports to get more information. 
Solution: lat and lon represent the airport geographic coordinates, alt is the altitude above sea level of the airport (Run airports %&gt;% filter(faa == &quot;DEN&quot;) to see the altitude of Denver International Airport), tz is the time zone difference with respect to GMT in London UK, dst is the daylight savings time zone, and tzone is the time zone label. (LC1.6) Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy dataset that matches these conditions. Solution: In the weather example in LC2.8, the combination of origin, year, month, day, hour are identification variables as they identify the observation in question. Anything else pertains to observations: temp, humid, wind_speed, etc. D.2 Chapter 2 Solutions library(nycflights13) library(ggplot2) library(dplyr) (LC2.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights) in the console. In what respect do these data frames differ? For example, think about the number of rows in each dataset. Solution: flights contains all flight data, while alaska_flights contains only data from Alaskan carrier “AS”. We can see that flights has 336776 rows while alaska_flights has only 714. (LC2.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship? Solution: The later a plane departs, typically the later it will arrive. (LC2.3) What variables in the weather data frame would you expect to have a negative correlation (i.e. a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here. Hint: Explore the weather dataset by using the View() function. Solution: An example in the weather dataset is visibility, which measures visibility in miles. As visibility increases, we would expect departure delays to decrease. 
(LC2.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights? Solution: The point (0,0) means no delay in departure nor arrival. From the point of view of Alaska Airlines, this means the flight was on time. It seems most flights are at least close to being on time. (LC2.5) What are some other features of the plot that stand out to you? Solution: Different people will answer this one differently. One answer is that most flights depart and arrive less than an hour late. (LC2.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example above. Solution: Many possibilities for this one, see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seem to be only two blocks of time where flights depart. ggplot(data = alaska_flights, mapping = aes(x = dep_time, y = dep_delay)) + geom_point() (LC2.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot? Solution: It thins out the points so we address overplotting. But more importantly it hints at the (statistical) density and distribution of the points: where are the points concentrated, where do they occur. (LC2.8) After viewing the Figure 2.4 above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 2.2? Solution: The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time. (LC2.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ? 
Solution: The rows of early_january_weather are a subset of weather. (LC2.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement whereas the hour variable does not? Solution: Because to uniquely identify an hour, we need the year/month/day/hour sequence, whereas there are only 24 possible hours. (LC2.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis? Solution: Because lines suggest connectedness and ordering. (LC2.12) Why are linegraphs frequently used when time is the explanatory variable? Solution: Because time is sequential: subsequent observations are closely related to each other. (LC2.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013. Solution: Humidity is a good one to look at, since it is very closely related to the cycles of a day. ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = humid)) + geom_line() (LC2.14) What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures? Solution: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the temp variable by View(weather), we see that the precision of each temperature recording is 2 decimal places. (LC2.15) Would you classify the distribution of temperatures as symmetric or skewed? Solution: It is rather symmetric, i.e. there are no long tails on only one side of the distribution. (LC2.16) What would you guess is the “center” value in this distribution? Why did you make that choice? Solution: The center is around 55.26°F. By running the summary() command, we see that the mean and median are very similar. In fact, when the distribution is symmetric the mean equals the median. (LC2.17) Is this data spread out greatly from the center or is it close? Why? 
Solution: This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures: FIGURE D.1: Annual temperatures at SEATAC Airport. While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F for a range of about 40°F. Seattle temperatures are much less spread out than New York’s, i.e. much more consistent over the year. New York on the other hand has much colder days in the winter and much hotter days in the summer. Expressed differently, the middle 50% of values, as delineated by the interquartile range, spans 30°F: (LC2.18) What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables? Solution: Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons. Because we see temp recordings split by month, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher. (LC2.19) What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100? Solution: They correspond to the month of the weather recording. While month is technically a number between 1-12, we’re viewing it as a categorical variable here. Specifically, this is an ordinal categorical variable since there is an ordering to the categories. 25, 50, 75, 100 are temperatures. (LC2.20) For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics. Solution: It would not work if we had a very large number of facets. For example, if we faceted by individual days rather than months, we would have 365 facets to look at. 
When considering all days in 2013, it could be argued that we shouldn’t care about day-to-day fluctuation in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends. (LC2.21) Does the temp variable in the weather dataset have a lot of variability? Why do you say that? Solution: Again, as in (LC2.17), this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather, whereas in Seattle, WA and Portland, OR, you have two seasons: summer and rain! (LC2.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point. Solution: It appears to be an outlier. Let’s revisit the use of the filter command to home in on it. We want all data points where the month is 5 and temp &lt; 25: weather %&gt;% filter(month == 5 &amp; temp &lt; 25) # A tibble: 1 x 16 origin year month day hour temp dewp humid wind_dir wind_speed wind_gust &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 JFK 2013 5 8 22 13.1 12.02 95.34 80 8.05546 NA # … with 5 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, # time_hour &lt;dttm&gt;, temp_in_C &lt;dbl&gt; There appears to be only one hour, and only at JFK, that recorded 13.1°F (-10.5°C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (LaGuardia)? (LC2.23) Which months have the highest variability in temperature? What do you think the reasons for this are? Solution: We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR): The distance from the 1st to the 3rd quartiles i.e. 
the length of the boxes. You can also think of this as the spread of the middle 50% of the data. Just from eyeballing it, it seems November has the biggest IQR, i.e. the widest box, so it has the most variation in temperature. August has the smallest IQR, i.e. the narrowest box, so it is the most consistent temperature-wise. Here’s how we compute the exact IQR values for each month (we’ll see this in more depth in Chapter 3 of the text): group the observations by month; then for each group, i.e. month, summarize it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE; then arrange the table in descending order of IQR. weather %&gt;% group_by(month) %&gt;% summarize(IQR = IQR(temp, na.rm=TRUE)) %&gt;% arrange(desc(IQR)) month IQR 11 16.02 12 14.04 1 13.77 9 12.06 4 12.06 5 11.88 6 10.98 10 10.98 2 10.08 7 9.18 3 9.00 8 7.02 (LC2.24) We looked at the distribution of the numerical variable temp split by the numerical variable month that we converted to a categorical variable using the factor() function. Why would a boxplot of temp split by the numerical variable pressure similarly converted to a categorical variable using the factor() not be informative? Solution: Because there are 12 unique values of month yielding only 12 boxes in our boxplot. There are many more unique values of pressure (469 unique values in fact), because values are to the first decimal place. This would lead to 469 boxes, which is too many for people to digest. (LC2.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram? Solution: In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, they are explicitly labelled separately. (LC2.26) Why are histograms inappropriate for visualizing categorical variables? Solution: Histograms are for numerical variables i.e. 
the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable. (LC2.27) What is the difference between histograms and barplots? Solution: See above. (LC2.28) How many Envoy Air flights departed NYC in 2013? Solution: Envoy Air has carrier code MQ, and the table shows that 26397 of its flights departed NYC in 2013. (LC2.29) What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly? Solution: The answer is US, AKA U.S. Airways, with 20536 flights. However, picking out the seventh highest airline when the rows are sorted alphabetically by carrier code is difficult. This would be easier to do if the rows were sorted by number of flights. We’ll learn how to do this in Chapter 3 on data wrangling. (LC2.30) Why should pie charts be avoided and replaced by barplots? Solution: In our opinion, comparisons using horizontal lines are easier than comparing angles and areas of circles. (LC2.31) What is your opinion as to why pie charts continue to be used? Solution: In our opinion, pie charts are generally considered a poorer method for communicating data than bar charts. People’s brains are not as good at comparing the size of angles because there is no scale, and in comparison, it is much easier to compare the heights of bars in a bar chart. However, in some circumstances a pie chart can be easier to read. For example, when representing 25% and 75% of a sample with 2 bars, where the taller bar is three times the height of the other, it is difficult to tell the scale of their comparison without labels. But in a pie chart, it is easy to see that a circle is divided into 75% and 25%. (Read more at: https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/) (LC2.32) What kinds of questions are not easily answered by looking at the above figure? 
Solution: Because the red, green, and blue bars don’t all start at 0 (only red does), comparing counts is hard. (LC2.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 with regard to the number of departing flights? Solution: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar. (LC2.34) Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case? Solution: We can easily compare the different airports for a given carrier using a single comparison line, i.e. things are lined up. (LC2.35) What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general? Solution: It is hard to get totals for each airline. (LC2.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case? Solution: It is not that different from using side-by-side; it depends on how you want to organize your presentation. (LC2.37) What information about the different carriers at different airports is more easily seen in the faceted barplot? Solution: Now we can also compare the different carriers within a particular airport easily too. For example, we can easily read off the top carrier for each airport using a single horizontal line. D.3 Chapter 3 Solutions library(dplyr) library(ggplot2) library(nycflights13) (LC3.1) What’s another way using the “not” operator ! to filter only the rows that are not going to Burlington, VT nor Seattle, WA in the flights data frame? Test this out using the code above. 
Solution: # Original in book not_BTV_SEA &lt;- flights %&gt;% filter(!(dest == &quot;BTV&quot; | dest == &quot;SEA&quot;)) # Alternative way not_BTV_SEA &lt;- flights %&gt;% filter(!dest == &quot;BTV&quot; &amp; !dest == &quot;SEA&quot;) # Yet another way not_BTV_SEA &lt;- flights %&gt;% filter(dest != &quot;BTV&quot; &amp; dest != &quot;SEA&quot;) (LC3.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach? Solution: The missing patients may have died of lung cancer! So to ignore them might seriously bias your results! It is very important to think about the consequences for your analysis of ignoring missing data! Ask yourself: Is there a systematic reason why certain values are missing? If so, you might be biasing your results! If there isn’t, then it might be ok to “sweep missing values under the rug.” (LC3.3) Modify the above summarize function to create summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to? Solution: It corresponds to a count of the number of observations/rows: weather %&gt;% summarize(count = n()) # A tibble: 1 x 1 count &lt;int&gt; 1 26115 (LC3.4) Why doesn’t the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) first. 
summary_temp &lt;- weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) %&gt;% summarize(std_dev = sd(temp, na.rm = TRUE)) Solution: Consider the output of only running the first two lines: weather %&gt;% summarize(mean = mean(temp, na.rm = TRUE)) # A tibble: 1 x 1 mean &lt;dbl&gt; 1 55.2604 After the first summarize(), the variable temp disappears because it has been collapsed into the single value mean. So when we try to run the second summarize(), it can’t find the variable temp to compute the standard deviation. Computing both summaries inside a single summarize() call avoids this. (LC3.5) Recall from Chapter 2 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year? Solution: month mean std_dev 1 35.6 10.22 2 34.3 6.98 3 39.9 6.25 4 51.7 8.79 5 61.8 9.68 6 72.2 7.55 7 80.1 7.12 8 74.5 5.19 9 67.4 8.47 10 60.1 8.85 11 45.0 10.44 12 38.4 9.98 The standard deviation is a quantification of spread and variability. We see that the period in November, December, and January has the most variation in temperature, so you can expect very different temperatures on different days. (LC3.6) What code would be required to get the mean and standard deviation of temperature for each day in 2013 for NYC? Solution: summary_temp_by_day &lt;- weather %&gt;% group_by(year, month, day) %&gt;% summarize( mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE) ) summary_temp_by_day # A tibble: 364 x 5 # Groups: year, month [12] year month day mean std_dev &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 2013 1 1 36.9997 4.00117 2 2013 1 2 28.7025 3.45205 3 2013 1 3 29.9725 2.58472 4 2013 1 4 34.94 2.45283 5 2013 1 5 37.205 4.00500 6 2013 1 6 40.0518 4.39562 7 2013 1 7 40.5825 3.68319 8 2013 1 8 40.1175 5.77457 9 2013 1 9 43.225 5.39724 10 2013 1 10 43.85 2.95214 # … with 354 more rows Note: group_by(day) is not enough, because day is a value between 1-31. 
We need to group_by(year, month, day) library(dplyr) library(nycflights13) summary_temp_by_month &lt;- weather %&gt;% group_by(month) %&gt;% summarize( mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE) ) (LC3.7) Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting dataset? Solution: by_monthly_origin &lt;- flights %&gt;% group_by(month, origin) %&gt;% summarize(count = n()) by_monthly_origin month origin count 1 EWR 9893 1 JFK 9161 1 LGA 7950 2 EWR 9107 2 JFK 8421 2 LGA 7423 3 EWR 10420 3 JFK 9697 3 LGA 8717 4 EWR 10531 4 JFK 9218 4 LGA 8581 5 EWR 10592 5 JFK 9397 5 LGA 8807 6 EWR 10175 6 JFK 9472 6 LGA 8596 7 EWR 10475 7 JFK 10023 7 LGA 8927 8 EWR 10359 8 JFK 9983 8 LGA 8985 9 EWR 9550 9 JFK 8908 9 LGA 9116 10 EWR 10104 10 JFK 9143 10 LGA 9642 11 EWR 9707 11 JFK 8710 11 LGA 8851 12 EWR 9922 12 JFK 9146 12 LGA 9067 In by_monthly_origin the month column is now first and the rows are sorted by month instead of origin. If you compare the values of count in by_origin_monthly and by_monthly_origin using the View() function, you’ll see that the values are actually the same, just presented in a different order. (LC3.8) How could we identify how many flights left each of the three airports for each carrier? Solution: We could summarize the count from each airport using the n() function, which counts rows. 
count_flights_by_airport &lt;- flights %&gt;% group_by(origin, carrier) %&gt;% summarize(count=n()) count_flights_by_airport origin carrier count EWR 9E 1268 EWR AA 3487 EWR AS 714 EWR B6 6557 EWR DL 4342 EWR EV 43939 EWR MQ 2276 EWR OO 6 EWR UA 46087 EWR US 4405 EWR VX 1566 EWR WN 6188 JFK 9E 14651 JFK AA 13783 JFK B6 42076 JFK DL 20701 JFK EV 1408 JFK HA 342 JFK MQ 7193 JFK UA 4534 JFK US 2995 JFK VX 3596 LGA 9E 2541 LGA AA 15459 LGA B6 6002 LGA DL 23067 LGA EV 8826 LGA F9 685 LGA FL 3260 LGA MQ 16928 LGA OO 26 LGA UA 8044 LGA US 13136 LGA WN 6087 LGA YV 601 All remarkably similar! Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain numerical variable VARIABLE_NAME. (LC3.9) How does the filter operation differ from a group_by followed by a summarize? Solution: filter picks out rows from the original dataset without modifying them, whereas group_by %&gt;% summarize computes summaries of numerical variables, and hence reports new values. (LC3.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value? Solution: Say a flight departed 20 minutes late, i.e. dep_delay = 20, then arrived 10 minutes late, i.e. arr_delay = 10. Then gain = dep_delay - arr_delay = 20 - 10 = 10 is positive, so it “made up/gained time in the air.” A negative gain means the flight lost time in the air, and 0 means the departure and arrival delays were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes. I never understood this. If the pilot says “we’re going to make up time in the air” by flying faster because of a delay, why not always just fly faster to begin with? (LC3.11) Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights. Solution: No because you can’t do direct arithmetic on times. 
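To see why direct subtraction goes wrong, here is a minimal base R sketch. It assumes times are stored as HHMM integers, as dep_time and sched_dep_time are in flights; the helper hhmm_to_minutes() is a hypothetical name introduced only for this illustration, and the sketch ignores flights that cross midnight.

```r
# Hypothetical helper: convert an HHMM integer (e.g. 1203 for 12:03)
# to the number of minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

dep_time <- 1203        # actual departure at 12:03
sched_dep_time <- 1159  # scheduled departure at 11:59

# Naive subtraction treats the times as ordinary base-10 numbers.
naive_diff <- dep_time - sched_dep_time

# Converting to minutes first respects the 60 minutes in each hour.
true_diff <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)

naive_diff
true_diff
```

The naive difference and the minutes-based difference disagree whenever the two times fall in different hours, which is why the delay columns cannot be recovered by subtracting the raw time columns.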
The difference in time between 12:03 and 11:59 is 4 minutes, but 1203-1159 = 44. (LC3.12) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values. Solution: The gain is most often a little under zero, and it usually falls between -50 and 50 minutes. There are some extreme cases however! (LC3.13) Looking at Figure 3.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour? Solution: Because hour is simply a value between 0 and 23; to identify a specific hour, we need to know which year, month, day and at which airport. (LC3.14) What surprises you about the top 10 destinations from NYC in 2013? Solution: This question is subjective! What surprises me is the high number of flights to Boston. Wouldn’t it be easier and quicker to take the train? (LC3.15) What are some advantages of data in normal forms? What are some disadvantages? Solution: When datasets are in normal form, we can easily join them with other datasets! For example, we can join the flights data with the planes data. (LC3.16) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways. 
Solution: # The regular way: flights %&gt;% select(dest, air_time, distance) # A tibble: 336,776 x 3 dest air_time distance &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; 1 IAH 227 1400 2 IAH 227 1416 3 MIA 160 1089 4 BQN 183 1576 5 ATL 116 762 6 ORD 150 719 7 FLL 158 1065 8 IAD 53 229 9 MCO 140 944 10 ORD 138 733 # … with 336,766 more rows # Since they are sequential columns in the dataset flights %&gt;% select(dest:distance) # A tibble: 336,776 x 3 dest air_time distance &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; 1 IAH 227 1400 2 IAH 227 1416 3 MIA 160 1089 4 BQN 183 1576 5 ATL 116 762 6 ORD 150 719 7 FLL 158 1065 8 IAD 53 229 9 MCO 140 944 10 ORD 138 733 # … with 336,766 more rows # Not as effective, by removing everything else flights %&gt;% select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time, -sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin, -hour, -minute, -time_hour) # A tibble: 336,776 x 6 dest air_time distance gain hours gain_per_hour &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 IAH 227 1400 -9 3.78333 -2.37885 2 IAH 227 1416 -16 3.78333 -4.22907 3 MIA 160 1089 -31 2.66667 -11.625 4 BQN 183 1576 17 3.05 5.57377 5 ATL 116 762 19 1.93333 9.82759 6 ORD 150 719 -16 2.5 -6.4 7 FLL 158 1065 -24 2.63333 -9.11392 8 IAD 53 229 11 0.883333 12.4528 9 MCO 140 944 5 2.33333 2.14286 10 ORD 138 733 -10 2.300 -4.34783 # … with 336,766 more rows (LC3.17) How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains. 
Solution: # Anything that starts with &quot;d&quot; flights %&gt;% select(starts_with(&quot;d&quot;)) # A tibble: 336,776 x 5 day dep_time dep_delay dest distance &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; 1 1 517 2 IAH 1400 2 1 533 4 IAH 1416 3 1 542 2 MIA 1089 4 1 544 -1 BQN 1576 5 1 554 -6 ATL 762 6 1 554 -4 ORD 719 7 1 555 -5 FLL 1065 8 1 557 -3 IAD 229 9 1 557 -3 MCO 944 10 1 558 -2 ORD 733 # … with 336,766 more rows # Anything related to delays: flights %&gt;% select(ends_with(&quot;delay&quot;)) # A tibble: 336,776 x 2 dep_delay arr_delay &lt;dbl&gt; &lt;dbl&gt; 1 2 11 2 4 20 3 2 33 4 -1 -18 5 -6 -25 6 -4 12 7 -5 19 8 -3 -14 9 -3 -8 10 -2 8 # … with 336,766 more rows # Anything related to departures: flights %&gt;% select(contains(&quot;dep&quot;)) # A tibble: 336,776 x 3 dep_time sched_dep_time dep_delay &lt;int&gt; &lt;int&gt; &lt;dbl&gt; 1 517 515 2 2 533 529 4 3 542 540 2 4 544 545 -1 5 554 600 -6 6 554 558 -4 7 555 600 -5 8 557 600 -3 9 557 600 -3 10 558 600 -2 # … with 336,766 more rows (LC3.18) Why might we want to use the select() function on a data frame? Solution: To narrow down the data frame, to make it easier to look at. Using View() for example. (LC3.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013. Solution: top_five &lt;- flights %&gt;% group_by(dest) %&gt;% summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %&gt;% arrange(desc(avg_delay)) %&gt;% top_n(n = 5) top_five # A tibble: 5 x 2 dest avg_delay &lt;chr&gt; &lt;dbl&gt; 1 CAE 41.7642 2 TUL 33.6599 3 OKC 30.6190 4 JAC 28.0952 5 TYS 24.0692 (LC3.20) Using the datasets included in the nycflights13 package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). 
Here are some hints: Crucial: Unless you are very confident in what you are doing, it is worthwhile not to start coding right away, but rather to first sketch out on paper all the necessary data wrangling steps, not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code). Take a close look at all the datasets using the View() function: flights, weather, planes, airports, and airlines to identify which variables are necessary to compute available seat miles. Figure 3.7 above showing how the various datasets can be joined will also be useful. Consider the data wrangling verbs in Table 3.2 as your toolbox! Solution: Here are some examples of student-written pseudocode. Based on our own pseudocode, let’s first display the entire solution. flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) %&gt;% mutate(ASM = seats * distance) %&gt;% group_by(carrier) %&gt;% summarize(ASM = sum(ASM, na.rm = TRUE)) %&gt;% arrange(desc(ASM)) # A tibble: 16 x 2 carrier ASM &lt;chr&gt; &lt;dbl&gt; 1 UA 15516377526 2 DL 10532885801 3 B6 9618222135 4 AA 3677292231 5 US 2533505829 6 VX 2296680778 7 EV 1817236275 8 WN 1718116857 9 9E 776970310 10 HA 642478122 11 AS 314104736 12 FL 219628520 13 F9 184832280 14 YV 20163632 15 MQ 7162420 16 OO 1299835 Let’s now break this down step-by-step. To compute the available seat miles for a given flight, we need the distance variable from the flights data frame and the seats variable from the planes data frame, necessitating a join by the key variable tailnum as illustrated in Figure 3.7. 
To keep the resulting data frame easy to view, we’ll select() only these two variables and carrier: flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) # A tibble: 284,170 x 3 carrier seats distance &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; 1 UA 149 1400 2 UA 149 1416 3 AA 178 1089 4 B6 200 1576 5 DL 178 762 6 UA 191 719 7 B6 200 1065 8 EV 55 229 9 B6 200 944 10 B6 200 1028 # … with 284,160 more rows Now for each flight we can compute the available seat miles ASM by multiplying the number of seats by the distance via a mutate(): flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) %&gt;% # Added: mutate(ASM = seats * distance) # A tibble: 284,170 x 4 carrier seats distance ASM &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 UA 149 1400 208600 2 UA 149 1416 210984 3 AA 178 1089 193842 4 B6 200 1576 315200 5 DL 178 762 135636 6 UA 191 719 137329 7 B6 200 1065 213000 8 EV 55 229 12595 9 B6 200 944 188800 10 B6 200 1028 205600 # … with 284,160 more rows Next we want to sum the ASM for each carrier. We achieve this by first grouping by carrier and then summarizing using the sum() function: flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) %&gt;% mutate(ASM = seats * distance) %&gt;% # Added: group_by(carrier) %&gt;% summarize(ASM = sum(ASM)) # A tibble: 16 x 2 carrier ASM &lt;chr&gt; &lt;dbl&gt; 1 9E 776970310 2 AA 3677292231 3 AS 314104736 4 B6 9618222135 5 DL 10532885801 6 EV 1817236275 7 F9 184832280 8 FL 219628520 9 HA 642478122 10 MQ 7162420 11 OO 1299835 12 UA 15516377526 13 US 2533505829 14 VX 2296680778 15 WN 1718116857 16 YV 20163632 However, because for certain carriers certain flights have missing NA values, the resulting table also returns NA’s. We can eliminate these by adding a na.rm = TRUE argument to sum(), telling R that we want to remove the NA’s in the sum. 
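To make the effect of na.rm concrete, here is a minimal base R sketch. The vector asm_values is a hypothetical stand-in for one carrier’s per-flight ASM values with one value missing; it is made up for illustration and is not taken from the flights data.

```r
# Hypothetical per-flight ASM values for one carrier; the middle one is missing.
asm_values <- c(208600, NA, 193842)

# By default a single NA makes the whole sum NA.
total_with_na <- sum(asm_values)

# With na.rm = TRUE the NA is dropped before summing.
total_without_na <- sum(asm_values, na.rm = TRUE)

total_with_na
total_without_na
```

The same argument works inside summarize(), since summarize() simply calls sum() on the values within each group.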
We saw this in Section 3.3: flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) %&gt;% mutate(ASM = seats * distance) %&gt;% group_by(carrier) %&gt;% # Modified: summarize(ASM = sum(ASM, na.rm = TRUE)) # A tibble: 16 x 2 carrier ASM &lt;chr&gt; &lt;dbl&gt; 1 9E 776970310 2 AA 3677292231 3 AS 314104736 4 B6 9618222135 5 DL 10532885801 6 EV 1817236275 7 F9 184832280 8 FL 219628520 9 HA 642478122 10 MQ 7162420 11 OO 1299835 12 UA 15516377526 13 US 2533505829 14 VX 2296680778 15 WN 1718116857 16 YV 20163632 Finally, we arrange() the data in desc()ending order of ASM. flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) %&gt;% mutate(ASM = seats * distance) %&gt;% group_by(carrier) %&gt;% summarize(ASM = sum(ASM, na.rm = TRUE)) %&gt;% # Added: arrange(desc(ASM)) # A tibble: 16 x 2 carrier ASM &lt;chr&gt; &lt;dbl&gt; 1 UA 15516377526 2 DL 10532885801 3 B6 9618222135 4 AA 3677292231 5 US 2533505829 6 VX 2296680778 7 EV 1817236275 8 WN 1718116857 9 9E 776970310 10 HA 642478122 11 AS 314104736 12 FL 219628520 13 F9 184832280 14 YV 20163632 15 MQ 7162420 16 OO 1299835 While the above data frame is correct, the IATA carrier code is not always useful. For example, what carrier is WN? We can address this by joining with the airlines dataset using carrier as the key variable. While this step is not absolutely required, it goes a long way to making the table easier to make sense of. It is important to be empathetic with the ultimate consumers of your presented data! flights %&gt;% inner_join(planes, by = &quot;tailnum&quot;) %&gt;% select(carrier, seats, distance) %&gt;% mutate(ASM = seats * distance) %&gt;% group_by(carrier) %&gt;% summarize(ASM = sum(ASM, na.rm = TRUE)) %&gt;% arrange(desc(ASM)) %&gt;% # Added: inner_join(airlines, by = &quot;carrier&quot;) # A tibble: 16 x 3 carrier ASM name &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; 1 UA 15516377526 United Air Lines Inc. 
2 DL 10532885801 Delta Air Lines Inc. 3 B6 9618222135 JetBlue Airways 4 AA 3677292231 American Airlines Inc. 5 US 2533505829 US Airways Inc. 6 VX 2296680778 Virgin America 7 EV 1817236275 ExpressJet Airlines Inc. 8 WN 1718116857 Southwest Airlines Co. 9 9E 776970310 Endeavor Air Inc. 10 HA 642478122 Hawaiian Airlines Inc. 11 AS 314104736 Alaska Airlines Inc. 12 FL 219628520 AirTran Airways Corporation 13 F9 184832280 Frontier Airlines Inc. 14 YV 20163632 Mesa Airlines Inc. 15 MQ 7162420 Envoy Air 16 OO 1299835 SkyWest Airlines Inc. D.4 Chapter 4 Solutions library(dplyr) library(ggplot2) library(readr) library(tidyr) library(nycflights13) library(fivethirtyeight) (LC4.1) What are common characteristics of “tidy” datasets? Solution: Rows correspond to observations, while columns correspond to variables. (LC4.2) What makes “tidy” datasets useful for organizing data? Solution: Tidy datasets are an organized way of viewing data. This format is required for the ggplot2 and dplyr packages for data visualization and wrangling. (LC4.3) Take a look at the airline_safety data frame included in the fivethirtyeight package. Run the following: airline_safety After reading the help file by running ?airline_safety, we see that airline_safety is a data frame containing information on different airline companies’ safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver’s article “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”. 
Let’s ignore the incl_reg_subsidiaries and avail_seat_km_per_week variables for simplicity: airline_safety_smaller &lt;- airline_safety %&gt;% select(-c(incl_reg_subsidiaries, avail_seat_km_per_week)) airline_safety_smaller # A tibble: 56 x 7 airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14 &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 Aer Li… 2 0 0 0 2 Aerofl… 76 14 128 6 3 Aeroli… 6 0 0 1 4 Aerome… 3 1 64 5 5 Air Ca… 2 0 0 2 6 Air Fr… 14 4 79 6 7 Air In… 2 1 329 4 8 Air Ne… 3 0 0 5 9 Alaska… 5 0 0 5 10 Alital… 7 2 50 4 # … with 46 more rows, and 2 more variables: fatal_accidents_00_14 &lt;int&gt;, # fatalities_00_14 &lt;int&gt; This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable incident_type_years indicating the incident type/year and a variable count of the counts? Solution: This can be done using the pivot_longer() function from the tidyr package: airline_safety_smaller_tidy &lt;- airline_safety_smaller %&gt;% pivot_longer(names_to = &quot;incident_type_years&quot;, values_to = &quot;count&quot;, cols = -airline) airline_safety_smaller_tidy # A tibble: 336 x 3 airline incident_type_years count &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 Aer Lingus incidents_85_99 2 2 Aer Lingus fatal_accidents_85_99 0 3 Aer Lingus fatalities_85_99 0 4 Aer Lingus incidents_00_14 0 5 Aer Lingus fatal_accidents_00_14 0 6 Aer Lingus fatalities_00_14 0 7 Aeroflot incidents_85_99 76 8 Aeroflot fatal_accidents_85_99 14 9 Aeroflot fatalities_85_99 128 10 Aeroflot incidents_00_14 6 # … with 326 more rows If you look at the resulting airline_safety_smaller_tidy data frame in the spreadsheet viewer, you’ll see that the variable incident_type_years has 6 possible values: &quot;incidents_85_99&quot;, &quot;fatal_accidents_85_99&quot;, &quot;fatalities_85_99&quot;, &quot;incidents_00_14&quot;, &quot;fatal_accidents_00_14&quot;, &quot;fatalities_00_14&quot; 
corresponding to the 6 columns of airline_safety_smaller we tidied. Note that prior to tidyr version 1.0.0 released to CRAN in September 2019, this could also have been done using the gather() function from the tidyr package: airline_safety_smaller_tidy &lt;- airline_safety_smaller %&gt;% gather(key = incident_type_years, value = count, -airline) airline_safety_smaller_tidy # A tibble: 336 x 3 airline incident_type_years count &lt;chr&gt; &lt;chr&gt; &lt;int&gt; 1 Aer Lingus incidents_85_99 2 2 Aeroflot incidents_85_99 76 3 Aerolineas Argentinas incidents_85_99 6 4 Aeromexico incidents_85_99 3 5 Air Canada incidents_85_99 2 6 Air France incidents_85_99 14 7 Air India incidents_85_99 2 8 Air New Zealand incidents_85_99 3 9 Alaska Airlines incidents_85_99 5 10 Alitalia incidents_85_99 7 # … with 326 more rows (LC4.4) Convert the dem_score data frame into a tidy data frame and assign the name of dem_score_tidy to the resulting long-formatted data frame. Solution: Running the following in the console: dem_score_tidy &lt;- dem_score %&gt;% pivot_longer(names_to = &quot;year&quot;, values_to = &quot;democracy_score&quot;, cols = -country) # gather(key = year, value = democracy_score, - country) Let’s now compare the dem_score and dem_score_tidy. dem_score has democracy score information for each year in columns, whereas in dem_score_tidy there are explicit variables year and democracy_score. While both representations of the data contain the same information, we can only use ggplot() to create plots using the dem_score_tidy data frame. 
dem_score # A tibble: 96 x 10 country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Albania -9 -9 -9 -9 -9 -9 -9 -9 5 2 Argentina -9 -1 -1 -9 -9 -9 -8 8 7 3 Armenia -9 -7 -7 -7 -7 -7 -7 -7 7 4 Australia 10 10 10 10 10 10 10 10 10 5 Austria 10 10 10 10 10 10 10 10 10 6 Azerbaijan -9 -7 -7 -7 -7 -7 -7 -7 1 7 Belarus -9 -7 -7 -7 -7 -7 -7 -7 7 8 Belgium 10 10 10 10 10 10 10 10 10 9 Bhutan -10 -10 -10 -10 -10 -10 -10 -10 -10 10 Bolivia -4 -3 -3 -4 -7 -7 8 9 9 # … with 86 more rows dem_score_tidy # A tibble: 864 x 3 country year democracy_score &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; 1 Albania 1952 -9 2 Albania 1957 -9 3 Albania 1962 -9 4 Albania 1967 -9 5 Albania 1972 -9 6 Albania 1977 -9 7 Albania 1982 -9 8 Albania 1987 -9 9 Albania 1992 5 10 Argentina 1952 -9 # … with 854 more rows (LC4.5) Read in the life expectancy data stored at https://moderndive.com/data/le_mess.csv and convert it to a tidy data frame. 
Solution: The code is similar: life_expectancy &lt;- read_csv(&quot;https://moderndive.com/data/le_mess.csv&quot;) life_expectancy_tidy &lt;- life_expectancy %&gt;% pivot_longer(names_to = &quot;year&quot;, values_to = &quot;life_expectancy&quot;, cols = -country) # gather(key = year, value = life_expectancy, -country) We observe the same structure with respect to year in life_expectancy vs life_expectancy_tidy as we did in dem_score vs dem_score_tidy: life_expectancy # A tibble: 202 x 67 country `1951` `1952` `1953` `1954` `1955` `1956` `1957` `1958` `1959` &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 Afghan… 27.13 27.67 28.19 28.73 29.27 29.8 30.34 30.86 31.4 2 Albania 54.72 55.23 55.85 56.59 57.45 58.42 59.48 60.6 61.75 3 Algeria 43.03 43.5 43.96 44.44 44.93 45.44 45.94 46.45 46.97 4 Angola 31.05 31.59 32.14 32.6900 33.24 33.78 34.33 34.88 35.43 5 Antigu… 58.26 58.8 59.34 59.87 60.41 60.93 61.45 61.97 62.48 6 Argent… 61.93 62.54 63.1 63.59 64.03 64.41 64.73 65 65.22 7 Armenia 62.67 63.13 63.6 64.0700 64.54 65 65.45 65.92 66.39 8 Aruba 58.96 60.01 60.98 61.87 62.69 63.42 64.09 64.68 65.2 9 Austra… 68.710 69.11 69.69 69.84 70.16 70.03 70.31 70.86 70.43 10 Austria 65.2400 66.78 67.27 67.3 67.58 67.7 67.460 68.460 68.39 # … with 192 more rows, and 57 more variables: `1960` &lt;dbl&gt;, `1961` &lt;dbl&gt;, # `1962` &lt;dbl&gt;, `1963` &lt;dbl&gt;, `1964` &lt;dbl&gt;, `1965` &lt;dbl&gt;, `1966` &lt;dbl&gt;, # `1967` &lt;dbl&gt;, `1968` &lt;dbl&gt;, `1969` &lt;dbl&gt;, `1970` &lt;dbl&gt;, `1971` &lt;dbl&gt;, # `1972` &lt;dbl&gt;, `1973` &lt;dbl&gt;, `1974` &lt;dbl&gt;, `1975` &lt;dbl&gt;, `1976` &lt;dbl&gt;, # `1977` &lt;dbl&gt;, `1978` &lt;dbl&gt;, `1979` &lt;dbl&gt;, `1980` &lt;dbl&gt;, `1981` &lt;dbl&gt;, # `1982` &lt;dbl&gt;, `1983` &lt;dbl&gt;, `1984` &lt;dbl&gt;, `1985` &lt;dbl&gt;, `1986` &lt;dbl&gt;, # `1987` &lt;dbl&gt;, `1988` &lt;dbl&gt;, `1989` &lt;dbl&gt;, `1990` 
&lt;dbl&gt;, `1991` &lt;dbl&gt;, # `1992` &lt;dbl&gt;, `1993` &lt;dbl&gt;, `1994` &lt;dbl&gt;, `1995` &lt;dbl&gt;, `1996` &lt;dbl&gt;, # `1997` &lt;dbl&gt;, `1998` &lt;dbl&gt;, `1999` &lt;dbl&gt;, `2000` &lt;dbl&gt;, `2001` &lt;dbl&gt;, # `2002` &lt;dbl&gt;, `2003` &lt;dbl&gt;, `2004` &lt;dbl&gt;, `2005` &lt;dbl&gt;, `2006` &lt;dbl&gt;, # `2007` &lt;dbl&gt;, `2008` &lt;dbl&gt;, `2009` &lt;dbl&gt;, `2010` &lt;dbl&gt;, `2011` &lt;dbl&gt;, # `2012` &lt;dbl&gt;, `2013` &lt;dbl&gt;, `2014` &lt;dbl&gt;, `2015` &lt;dbl&gt;, `2016` &lt;dbl&gt; life_expectancy_tidy # A tibble: 13,332 x 3 country year life_expectancy &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; 1 Afghanistan 1951 27.13 2 Afghanistan 1952 27.67 3 Afghanistan 1953 28.19 4 Afghanistan 1954 28.73 5 Afghanistan 1955 29.27 6 Afghanistan 1956 29.8 7 Afghanistan 1957 30.34 8 Afghanistan 1958 30.86 9 Afghanistan 1959 31.4 10 Afghanistan 1960 31.94 # … with 13,322 more rows D.5 Chapter 5 Solutions library(tidyverse) library(moderndive) library(skimr) library(gapminder) (LC5.1) Conduct a new exploratory data analysis with the same outcome variable \\(y\\) being score but with age as the new explanatory variable \\(x\\). Remember, this involves three things: Looking at the raw data values. Computing summary statistics. Creating data visualizations. What can you say about the relationship between age and teaching scores based on this exploration? 
Solution: Looking at the raw data values: glimpse(evals_ch5) Observations: 463 Variables: 4 $ ID &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… $ score &lt;dbl&gt; 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4… $ bty_avg &lt;dbl&gt; 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3… $ age &lt;int&gt; 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4… Computing summary statistics: skim_with(numeric = list(hist = NULL), integer = list(hist = NULL)) evals_ch5 %&gt;% select(score, age) %&gt;% skim() Skim summary statistics n obs: 463 n variables: 2 ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 age 0 463 463 48.37 9.8 29 42 48 57 73 ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 score 0 463 463 4.17 0.54 2.3 3.8 4.3 4.6 5 (Note that for formatting purposes, the inline histogram that is usually printed with skim() has been removed. This can be done by running skim_with(numeric = list(hist = NULL), integer = list(hist = NULL)) prior to using the skim() function as well.) Creating data visualizations: ggplot(evals_ch5, aes(x = age, y = score)) + geom_point() + labs(x = &quot;Age&quot;, y = &quot;Teaching Score&quot;, title = &quot;Scatterplot of relationship of teaching score and age&quot;) Based on the scatterplot visualization, there seems to be a weak negative relationship between age and teaching score. As age increases, the teaching score seems to decrease slightly. (LC5.2) Fit a new simple linear regression using lm(score ~ age, data = evals_ch5) where age is the new explanatory variable \\(x\\). 
Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your earlier exploratory data analysis? Solution: # Fit regression model: score_age_model &lt;- lm(score ~ age, data = evals_ch5) # Get regression table: get_regression_table(score_age_model) # A tibble: 2 x 7 term estimate std_error statistic p_value lower_ci upper_ci &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 intercept 4.462 0.127 35.195 0 4.213 4.711 2 age -0.006 0.003 -2.311 0.021 -0.011 -0.001 \\[ \\begin{aligned} \\widehat{y} &amp;= b_0 + b_1 \\cdot x\\\\ \\widehat{\\text{score}} &amp;= b_0 + b_{\\text{age}} \\cdot\\text{age}\\\\ &amp;= 4.462 - 0.006\\cdot\\text{age} \\end{aligned} \\] For every increase of 1 unit in age, there is an associated decrease of, on average, 0.006 units of score. It matches with the results from our earlier exploratory data analysis. (LC5.3) Generate a data frame of the residuals of the model where you used age as the explanatory \\(x\\) variable. Solution: score_age_regression_points &lt;- get_regression_points(score_age_model) score_age_regression_points # A tibble: 463 x 5 ID score age score_hat residual &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 4.7 36 4.248 0.452 2 2 4.100 36 4.248 -0.148 3 3 3.9 36 4.248 -0.34800 4 4 4.8 36 4.248 0.552 5 5 4.600 59 4.112 0.488 6 6 4.3 59 4.112 0.188 7 7 2.8 59 4.112 -1.312 8 8 4.100 51 4.159 -0.059 9 9 3.4 51 4.159 -0.759 10 10 4.5 40 4.224 0.276 # … with 453 more rows (LC5.4) Conduct a new exploratory data analysis with the same explanatory variable \\(x\\) being continent but with gdpPercap as the new outcome variable \\(y\\). Remember, this involves three things: Most crucially: Looking at the raw data values. Computing summary statistics, such as means, medians, and interquartile ranges. Creating data visualizations. 
What can you say about the differences in GDP per capita between continents based on this exploration? Solution: Looking at the raw data values: glimpse(gapminder2007) Observations: 142 Variables: 4 $ country &lt;fct&gt; Afghanistan, Albania, Algeria, Angola, Argentina, Australia… $ lifeExp &lt;dbl&gt; 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,… $ continent &lt;fct&gt; Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As… $ gdpPercap &lt;dbl&gt; 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33… Computing summary statistics, such as means, medians, and interquartile ranges: gapminder2007 %&gt;% select(gdpPercap, continent) %&gt;% skim() Skim summary statistics n obs: 142 n variables: 2 ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── variable missing complete n n_unique top_counts continent 0 142 142 5 Afr: 52, Asi: 33, Eur: 30, Ame: 25 ordered FALSE ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 gdpPercap 0 142 142 11680.07 12859.94 277.55 1624.84 6124.37 p75 p100 18008.84 49357.19 Creating data visualizations: ggplot(gapminder2007, aes(x = continent, y = gdpPercap)) + geom_boxplot() + labs(x = &quot;Continent&quot;, y = &quot;GDP per capita&quot;, title = &quot;GDP by continent&quot;) Based on this exploration, it seems that GDP per capita differs considerably across continents, which suggests that continent might be a statistically significant predictor of an area’s GDP. (LC5.5) Fit a new linear regression using lm(gdpPercap ~ continent, data = gapminder2007) where gdpPercap is the new outcome variable \\(y\\). Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. 
How do the regression results match up with the results from your previous exploratory data analysis? Solution: # Fit regression model: gdp_model &lt;- lm(gdpPercap ~ continent, data = gapminder2007) # Get regression table: get_regression_table(gdp_model) # A tibble: 5 x 7 term estimate std_error statistic p_value lower_ci upper_ci &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 intercept 3089.03 1372.74 2.25 0.026 374.538 5803.53 2 continentAmericas 7914.00 2409.14 3.285 0.001 3150.08 12677.9 3 continentAsia 9383.99 2203.13 4.259 0 5027.46 13740.5 4 continentEurope 21965.4 2269.52 9.678 0 17477.6 26453.3 5 continentOceania 26721.2 7132.96 3.746 0 12616.2 40826.1 \\[ \\begin{aligned} \\widehat{y} = \\widehat{\\text{gdpPercap}} &amp;= b_0 + b_{\\text{Amer}}\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + b_{\\text{Asia}}\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + \\\\ &amp; \\qquad b_{\\text{Euro}}\\cdot\\mathbb{1}_{\\mbox{Euro}}(x) + b_{\\text{Ocean}}\\cdot\\mathbb{1}_{\\mbox{Ocean}}(x)\\\\ &amp;= 3089 + 7914\\cdot\\mathbb{1}_{\\mbox{Amer}}(x) + 9384\\cdot\\mathbb{1}_{\\mbox{Asia}}(x) + \\\\ &amp; \\qquad 21965\\cdot\\mathbb{1}_{\\mbox{Euro}}(x) + 26721\\cdot\\mathbb{1}_{\\mbox{Ocean}}(x) \\end{aligned} \\] In our previous exploratory data analysis, it seemed that continent is a statistically significant predictor for an area’s GDP. Here, by fitting a new linear regression using lm(gdpPercap ~ continent, data = gapminder2007) where gdpPercap is the new outcome variable \\(y\\), we are able to write an equation that predicts gdpPercap using continent as a statistically significant predictor. Therefore, the regression results match the results from our previous exploratory data analysis. (LC5.6) Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter 3, identify the five countries with the five smallest (most negative) residuals. 
What do these negative residuals say about their life expectancy relative to their continents? Solution: Using the sorting functionality of RStudio’s spreadsheet viewer, we can identify that the five countries with the five smallest (most negative) residuals are: Afghanistan, Swaziland, Mozambique, Haiti, and Zambia. These negative residuals indicate that these data points have the biggest negative deviations from their group means. This means that these five countries’ life expectancies are the lowest compared to their respective continents’ average life expectancies. For example, the residual for Afghanistan is \\(-26.900\\) and it is the smallest residual. This means that the life expectancy of Afghanistan is \\(26.900\\) years lower than the average life expectancy of its continent, Asia. (LC5.7) Repeat this process, but identify the five countries with the five largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents? Solution: Using the sorting functionality of RStudio’s spreadsheet viewer, we can identify that the five countries with the five largest (most positive) residuals are: Reunion, Libya, Tunisia, Mauritius, and Algeria. These positive residuals indicate that these data points have the biggest positive deviations from their group means. This means that these five countries’ life expectancies are the highest compared to their respective continents’ average life expectancies. For example, the residual for Reunion is \\(21.636\\) and it is the largest residual. This means that the life expectancy of Reunion is \\(21.636\\) years higher than the average life expectancy of its continent, Africa. (LC5.8) Note in the following plot there are 3 points marked with dots along with: The “best” fitting solid regression line in blue An arbitrarily chosen dotted red line Another arbitrarily chosen dashed green line FIGURE D.2: Regression line and two others. 
Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value. Solution: The “best” fitting solid regression line in blue: \\[ \\sum_{i=1}^{n}(y_i - \\widehat{y}_i)^2 = (2.0-1.5)^2+(0.50-2.0)^2+(3.0-2.5)^2=2.75 \\] An arbitrarily chosen dotted red line: \\[ \\sum_{i=1}^{n}(y_i - \\widehat{y}_i)^2 = (2.0-2.5)^2+(0.50-2.5)^2+(3.0-2.5)^2=4.5 \\] Another arbitrarily chosen dashed green line: \\[ \\sum_{i=1}^{n}(y_i - \\widehat{y}_i)^2 = (2.0-2.0)^2+(0.50-1.5)^2+(3.0-1.0)^2=5 \\] As calculated, \\(2.75&lt;4.5&lt;5\\). Therefore, the regression line in blue has the smallest sum of squared residuals of the three lines. D.6 Chapter 6 Solutions library(tidyverse) library(moderndive) library(skimr) library(ISLR) (LC6.1) Compute the observed values, fitted values, and residuals not for the interaction model as we just did, but rather for the parallel slopes model we saved in score_model_parallel_slopes. Solution: regression_points_parallel &lt;- get_regression_points(score_model_parallel_slopes) regression_points_parallel # A tibble: 463 x 6 ID score age gender score_hat residual &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; 1 1 4.7 36 female 4.172 0.528 2 2 4.100 36 female 4.172 -0.072000 3 3 3.9 36 female 4.172 -0.272 4 4 4.8 36 female 4.172 0.628 5 5 4.600 59 male 4.163 0.437 6 6 4.3 59 male 4.163 0.137 7 7 2.8 59 male 4.163 -1.363 8 8 4.100 51 male 4.232 -0.132 9 9 3.4 51 male 4.232 -0.832 10 10 4.5 40 female 4.13700 0.363 # … with 453 more rows (LC6.2) Conduct a new exploratory data analysis with the same outcome variable \\(y\\) being debt but with credit_rating and age as the new explanatory variables \\(x_1\\) and \\(x_2\\). Remember, this involves three things: Most crucially: Looking at the raw data values. Computing summary statistics, such as means, medians, and interquartile ranges. Creating data visualizations. 
What can you say about the relationship between a credit card holder’s debt and their credit rating and age? Solution: Most crucially: Looking at the raw data values. credit_ch6 %&gt;% select(debt, credit_rating, age) %&gt;% head() # A tibble: 6 x 3 debt credit_rating age &lt;int&gt; &lt;int&gt; &lt;int&gt; 1 333 283 34 2 903 483 82 3 580 514 71 4 964 681 36 5 331 357 68 6 1151 569 77 Computing summary statistics, such as means, medians, and interquartile ranges. skim_with(numeric = list(hist = NULL), integer = list(hist = NULL)) credit_ch6 %&gt;% select(debt, credit_rating, age) %&gt;% skim() Skim summary statistics n obs: 400 n variables: 3 ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 age 0 400 400 55.67 17.25 23 41.75 56 70 98 credit_rating 0 400 400 354.94 154.72 93 247.25 344 437.25 982 debt 0 400 400 520.01 459.76 0 68.75 459.5 863 1999 Creating data visualizations. ggplot(credit_ch6, aes(x = credit_rating, y = debt)) + geom_point() + labs(x = &quot;Credit rating&quot;, y = &quot;Credit card debt (in $)&quot;, title = &quot;Debt and credit rating&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) ggplot(credit_ch6, aes(x = age, y = debt)) + geom_point() + labs(x = &quot;Age (in years)&quot;, y = &quot;Credit card debt (in $)&quot;, title = &quot;Debt and age&quot;) + geom_smooth(method = &quot;lm&quot;, se = FALSE) It seems that there is a positive relationship between one’s credit rating and their debt, and a slight negative relationship between one’s age and their debt. (LC6.3) Fit a new multiple linear regression using lm(debt ~ credit_rating + age, data = credit_ch6) where credit_rating and age are the new numerical explanatory variables \\(x_1\\) and \\(x_2\\). Get information about the “best-fitting” regression plane from the regression table by applying the get_regression_table() function. 
How do the regression results match up with the results from your previous exploratory data analysis? Solution: # Fit regression model: debt_model_2 &lt;- lm(debt ~ credit_rating + age, data = credit_ch6) # Get regression table: get_regression_table(debt_model_2) # A tibble: 3 x 7 term estimate std_error statistic p_value lower_ci upper_ci &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; 1 intercept -269.581 44.806 -6.017 0 -357.668 -181.494 2 credit_rating 2.593 0.074 34.84 0 2.447 2.74 3 age -2.351 0.668 -3.521 0 -3.663 -1.038 The coefficients for the two new numerical explanatory variables \\(x_1\\) and \\(x_2\\), credit_rating and age, are \\(2.59\\) and \\(-2.35\\) respectively, which means that debt is positively associated with credit_rating and negatively associated with age. This matches up with the results from our previous exploratory data analysis. D.7 Chapter 7 Solutions library(ggplot2) library(dplyr) library(moderndive) library(gapminder) library(skimr) (LC7.1) Why was it important to mix the bowl before we sampled the balls? Solution: So that we make sure the sampled balls are randomized. (LC7.2) Why is it that our 33 groups of friends did not all have the same numbers of balls that were red out of 50, and hence different proportions red? Solution: Because each group scooped up a different random subset of the bowl’s balls, so each group’s sample had a different color composition. (LC7.3) Why couldn’t we study the effects of sampling variation when we used the virtual shovel only once? Why did we need to take more than one virtual sample (in our case 33 virtual samples)? Solution: If we use the virtual shovel only once, we only get one sample of the population. We need to take more than one virtual sample to get a range of proportions. (LC7.4) Why did we not take 1000 “tactile” samples of 50 balls by hand? Solution: That would be way too much repeated work. 
(LC7.5) Looking at Figure 7.10, would you say that sampling 50 balls where 30% of them were red is likely or not? What about sampling 50 balls where 10% of them were red? Solution: According to the figure, fewer than 150 out of the 1000 samples were 30% red. So I would say that sampling 50 balls where 30% of them were red is not very likely. Almost no sample was only 10% red, so sampling 50 balls where 10% of them were red is extremely unlikely. (LC7.6) In Figure 7.12, we used shovels to take 1000 samples each, computed the resulting 1000 proportions of the shovel’s balls that were red, and then visualized the distribution of these 1000 proportions in a histogram. We did this for shovels with 25, 50, and 100 slots in them. As the size of the shovels increased, the histograms got narrower. In other words, as the size of the shovels increased from 25 to 50 to 100, did the 1000 proportions A. vary less, B. vary by the same amount, or C. vary more? Solution: A. As the histograms got narrower, the 1000 proportions varied less. (LC7.7) What summary statistic did we use to quantify how much the 1000 proportions red varied? A. The inter-quartile range B. The standard deviation C. The range: the largest value minus the smallest. Solution: B. The standard deviation is used to quantify how much a set of data varies. (LC7.8) In the case of our bowl activity, what is the population parameter? Do we know its value? Solution: The population parameter in the case of our bowl activity is the population proportion \\(p\\): the proportion of the bowl’s balls that are red. We would know its value exactly only by performing a census of all the balls. (LC7.9) What would performing a census in our bowl activity correspond to? Why did we not perform a census? Solution: Performing a census in our bowl activity corresponds to counting all of the bowl’s balls and determining how many of them are red. We did not perform a census because it would be too much repetitive work and it is unnecessary. (LC7.10) What purpose do point estimates serve in general? What is the name of the point estimate specific to our bowl activity? 
What is its mathematical notation? Solution: Point estimates serve to estimate an unknown population parameter using a sample. In our bowl activity, our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. We mathematically denote the sample proportion using \\(\\widehat{p}\\). (LC7.11) How did we ensure that our tactile samples using the shovel were random? Solution: We virtually shuffle the sample each time. (LC7.12) Why is it important that sampling be done at random? Solution: So that we get different samples each time to estimate the total population. (LC7.13) What are we inferring about the bowl based on the samples using the shovel? Solution: We are inferring that the samples are representative of the total population of balls in the bowl. (LC7.14) What purpose did the sampling distributions serve? Solution: Using the sampling distributions, for a given sample size \\(n\\), we can make statements about what values we can typically expect. (LC7.15) What does the standard error of the sample proportion \\(\\widehat{p}\\) quantify? Solution: Standard errors quantify the effect of sampling variation induced on our estimates. (LC7.16) The table that follows is a version of Table 7.3 matching sample sizes \\(n\\) to different standard errors of the sample proportion \\(\\widehat{p}\\), but with the rows randomly re-ordered and the sample sizes removed. Fill in the table by matching the correct sample sizes to the correct standard errors. Sample size Standard error of \\(\\widehat{p}\\) n = 0.094 n = 0.045 n = 0.069 Solution: \\(n\\) = \\(25\\), \\(100\\), \\(50\\) respectively. For the following four learning checks, let the estimate be the sample proportion \\(\\widehat{p}\\): the proportion of a shovel’s balls that were red. It estimates the population proportion \\(p\\): the proportion of the bowl’s balls that were red. (LC7.17) What is the difference between an accurate estimate and a precise estimate? 
Solution: An accurate estimate is one that is “on target”: on average, it is centered on the true value of the population parameter. A precise estimate is one that is consistent: it varies little from sample to sample. An estimate can be accurate without being precise, and vice versa. (LC7.18) How do we ensure that an estimate is accurate? How do we ensure that an estimate is precise? Solution: To ensure that an estimate is accurate, we need to sample at random so that the sample is representative of the population and the estimate is unbiased. To ensure that an estimate is precise, we need a large enough sample size, since larger samples yield smaller standard errors. (LC7.19) In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then, what was the purpose of our exercises where we took 1000 different samples? Solution: To get a narrower range of the estimates. (LC7.20) Figure 7.16 with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding sampling distributions of the sample proportion \\(\\widehat{p}\\), like the one in the left-most plot in Figure 7.15. Solution: Comment on the representativeness of the following sampling methodologies: (LC7.21) The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force). Solution: The airplanes on the tarmac after an air battle against the Luftwaffe are not a good representation of all airplanes, because the airplanes that were hit in less resistant areas did not make it back to the tarmac. This is called survivorship bias: the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to false conclusions in several different ways. It is a form of selection bias. (LC7.22) Imagine it is 1993, a time when almost all households had landlines. 
You want to know the average number of people in each household in your city. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey. Solution: This is not a good representation, because: (1) adults are more likely to pick up phone calls; (2) households with more people are more likely to have someone available to pick up phone calls; (3) we are not certain whether all households are in the phone book. (LC7.23) You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them, “How many times did you download a pirated TV show last week?”. Solution: This is not a good representation, because it is very likely that students will lie in this survey to stay out of trouble. So we may not get honest data. This is called volunteer bias: systematic error due to differences between those who choose to participate in studies and those who do not. (LC7.24) A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of five randomly chosen graduates, contact them, and obtain their answers. Solution: This is not a good representation, because the sample size is too small. The sample is representative but not precise. D.8 Chapter 8 Solutions library(tidyverse) library(moderndive) library(infer) (LC8.1) What is the chief difference between a bootstrap distribution and a sampling distribution? Solution: A bootstrap sample is a smaller sample that is “bootstrapped” from a larger sample. Bootstrapping is a type of resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample. (LC8.2) Looking at the bootstrap distribution for the sample mean in Figure 8.14, between what two values would you say most values lie? Solution: Most values lie between 1990 and 2000. 
(LC8.3) What condition about the bootstrap distribution must be met for us to be able to construct confidence intervals using the standard error method? Solution: We can only use the standard error method when the bootstrap distribution is roughly normally distributed. (LC8.4) Say we wanted to construct a 68% confidence interval instead of a 95% confidence interval for \\(\\mu\\). Describe what changes are needed to make this happen. Hint: we suggest you look at Appendix A.2 on the normal distribution. Solution: Using our 68% rule of thumb about normal distributions from Appendix A.2, we can use the following formula to determine the lower and upper endpoints of a 68% confidence interval for \\(\\mu\\): \\[\\overline{x} \\pm 1 \\cdot SE = (\\overline{x} - 1 \\cdot SE, \\overline{x} + 1 \\cdot SE)\\] (LC8.5) Construct a 95% confidence interval for the median year of minting of all US pennies. Use the percentile method and, if appropriate, then use the standard-error method. Solution: Using the percentile method: bootstrap_distribution &lt;- pennies_sample %&gt;% specify(response = year) %&gt;% generate(reps = 1000) %&gt;% calculate(stat = &quot;median&quot;) percentile_ci &lt;- bootstrap_distribution %&gt;% get_confidence_interval(level = 0.95, type = &quot;percentile&quot;) percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` &lt;dbl&gt; &lt;dbl&gt; 1 1988 2000 D.9 Chapter 9 Solutions library(tidyverse) library(infer) library(moderndive) library(nycflights13) library(ggplot2movies) D.10 Chapter 10 Solutions library(tidyverse) library(moderndive) library(infer) D.11 Chapter 11 Solutions library(tidyverse) library(moderndive) library(skimr) library(fivethirtyeight) "],
+["E-appendixE.html", "E Versions of R Packages Used", " E Versions of R Packages Used If you are seeing different results than what is in the book, we recommend installing the exact version of the packages we used. This can be done by first installing the remotes package via install.packages(&quot;remotes&quot;). Then, use install_version() replacing the package argument with the package name in quotes and the version argument with the particular version number to install.2 remotes::install_version(package = &quot;skimr&quot;, version = &quot;1.0.6&quot;) package version bookdown 0.16 broom 0.5.2 dplyr 0.8.3 dygraphs 1.1.1.6 fivethirtyeight 0.5.0 forcats 0.4.0 gapminder 0.3.0 ggplot2 3.2.1 ggplot2movies 0.0.1 infer 0.5.1 ISLR 1.2 janitor 1.2.0 kableExtra 1.1.0 knitr 1.26 moderndive 0.4.0 mvtnorm 1.0-11 nycflights13 1.0.1 patchwork 0.0.1 purrr 0.3.3 readr 1.3.1 scales 1.1.0 skimr 1.0.6 stringr 1.4.0 tibble 2.1.3 tidyr 1.0.0 tidyverse 1.3.0 viridis 0.5.1 viridisLite 0.3.0 As of November 2019, the patchwork package is not on CRAN and needs to be installed via remotes::install_github(&quot;thomasp85/patchwork&quot;) instead of using install_version().↩ "],
 ["references.html", "References", " References "]
 ]