There are many ways to get a feel for the data contained in a data frame such as flights
. We present three functions that take as their “argument” (their input) the data frame in question. We also include a fourth method for exploring one particular column of a data frame:
Learning check
@@ -853,11 +877,12 @@
Exploring data frames
+
By running View(flights)
, we can explore the different variables listed in the columns. Observe that there are many different types of variables. Some of the variables like distance
, day
, and arr_delay
are what we will call quantitative variables. These variables are numerical in nature. Other variables here are categorical.
-
Note that if you look in the leftmost column of the View(flights)
output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row is representing. In other words, this will allow you to identify what object is being described in a given row. This is often called the observational unit. The observational unit in this example is an individual flight departing from New York City in 2013. You can identify the observational unit by determining what “thing” is being measured or described by each of the variables. We’ll talk more about observational units in Section 1.4.4 on identification and measurement variables.
+
Note that if you look in the leftmost column of the View(flights)
output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row is representing. This will allow you to identify what object is being described in a given row by taking note of the values of the columns in that specific row. This is often called the observational unit. The observational unit in this example is an individual flight departing from New York City in 2013. You can identify the observational unit by determining what “thing” is being measured or described by each of the variables. We’ll talk more about observational units in Subsection 1.4.4 on identification and measurement variables.
2. glimpse()
:
-
The second way to explore a data frame is using the glimpse()
function included in the dplyr
package. Thus, you can only use the glimpse()
function after you’ve loaded the dplyr
package by running library(dplyr)
. This function provides us with an alternative perspective for exploring a data frame than the View()
function:
-
glimpse(flights)
+
The second way we’ll cover to explore a data frame is using the glimpse()
function included in the dplyr
package. Thus, you can only use the glimpse()
function after you’ve loaded the dplyr
package by running library(dplyr)
. This function gives us a different perspective on a data frame than the View()
function:
+
Observations: 336,776
Variables: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
@@ -879,7 +904,9 @@ Exploring data frames
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
-
Observe that glimpse()
will give you the first few entries of each variable in a row after the variable name. In addition, the data type (see Subsection 1.2.1) of the variable is given immediately after each variable’s name inside < >
. Here, int
and dbl
refer to “integer” and “double”, which are computer coding terminology for quantitative/numerical variables. In contrast, chr
refers to “character”, which is computer terminology for text data. Text data, such as the carrier
or origin
of a flight, are categorical variables. The time_hour
variable is another data type: dttm
. These types of variables represent date and time combinations. However, we won’t work with dates and times in this book, we leave this topic for a more advanced data science book.
+
Observe that glimpse()
will give you the first few entries of each variable in a row after the variable name. In addition, the data type (see Subsection 1.2.1) of the variable is given immediately after each variable’s name inside < >
. Here, int
and dbl
refer to “integer” and “double”, which are computer coding terminology for quantitative/numerical variables. “Doubles” take up twice as much storage on a computer as integers do.
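As a rough illustration of the difference between the two numeric types, the following base R sketch checks the type of a whole number written with and without the L suffix, then compares the memory used by an integer vector and a double vector of the same length. The names int_vec and dbl_vec are made up for this example, and exact byte counts can vary by platform, so none are shown.

```r
# In R, a trailing L marks a number as an integer; without it, the number is a double
typeof(5L)  # "integer"
typeof(5)   # "double"

# Two hypothetical vectors of 1,000 zeros, one of each type
int_vec <- integer(1000)
dbl_vec <- numeric(1000)

# Compare the memory footprint of each vector; the double vector is the larger one
object.size(int_vec)
object.size(dbl_vec)
```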
+
In contrast, chr
refers to “character”, which is computer terminology for text data. In most forms, text data, such as the carrier
or origin
of a flight, are categorical variables. The time_hour
variable is another data type: dttm
. These types of variables represent date and time combinations. However, we won’t work with dates and times in this book; we leave this topic for other data science books like Introduction to Data Science by Tiffany-Anne Timbers, Melissa Lee, and Trevor Campbell or R for Data Science (Grolemund and Wickham 2017).
+
Learning check
@@ -889,39 +916,43 @@
Exploring data frames
+
+
3. kable()
:
The final way to explore the entirety of a data frame is using the kable()
function from the knitr
package. Let’s explore the different carrier codes for all the airlines in our dataset in two ways. Run both of these lines of code in the console:
-
airlines
-kable(airlines)
-
At first glance, it may not appear that there is much difference in the outputs. However when using tools for producing reproducible reports such as R Markdown, the latter code produces output that is much more legible and reader-friendly.
+
+
At first glance, it may not appear that there is much difference in the outputs. However, when using tools for producing reproducible reports such as R Markdown, the latter code produces output that is much more legible and reader-friendly. You’ll see us use this reader-friendly style in many places in the book when we want to print a data frame as a nice table.
4. $
operator
Lastly, the $
operator allows us to extract and then explore a single variable within a data frame. For example, run the following in your console:
-
airlines$name
+
We used the $
operator to extract only the name
variable and return it as a vector of length 16. We’ll only occasionally explore data frames using the $
operator, instead favoring the View()
and glimpse()
functions.
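To see the $ operator in isolation, here is a minimal sketch on a small data frame built by hand. The data frame toy_airlines and its three rows are made up for illustration (they imitate the carrier and name columns of airlines), so the code runs without loading any packages.

```r
# A hypothetical three-row data frame imitating the structure of airlines
toy_airlines <- data.frame(
  carrier = c("AA", "DL", "UA"),
  name = c("American Airlines Inc.", "Delta Air Lines Inc.", "United Air Lines Inc."),
  stringsAsFactors = FALSE  # keep the text columns as character vectors
)

# Extract the name column as a vector
carrier_names <- toy_airlines$name

# The vector has one element per row of the data frame
length(carrier_names)  # 3
```

Running the same pattern on the real airlines data frame returns a longer vector, one element per airline.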
-
Identification & measurement variables
-
There is a subtle difference between the kinds of variables that you will encounter in data frames: identification variables and measurement variables. For example, let’s explore the airports
data frame by showing the output of glimpse(airports)
:
-
glimpse(airports)
+
Identification and measurement variables
+
There is a subtle difference between the kinds of variables that you will encounter in data frames. There are identification variables and measurement variables. For example, let’s explore the airports
data frame by showing the output of glimpse(airports)
:
+
Observations: 1,458
Variables: 8
$ faa <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", …
$ name <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbu…
$ lat <dbl> 41.1, 32.5, 42.0, 41.4, 31.1, 36.4, 41.5, 42.9, 39.8, 48.1, 39.…
$ lon <dbl> -80.6, -85.7, -88.1, -74.4, -81.4, -82.2, -84.5, -76.8, -76.6, …
-$ alt <int> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1…
+$ alt <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1…
$ tz <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5,…
$ dst <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A"…
$ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Amer…
The variables faa
and name
are what we will call identification variables, variables that uniquely identify each observational unit. In this case, the identification variables uniquely identify airports. Such variables are mainly used in practice to uniquely identify each row in a data frame. faa
gives the unique code provided by the FAA for that airport, while the name
variable gives the longer official name of the airport. The remaining variables (lat
, lon
, alt
, tz
, dst
, tzone
) are often called measurement or characteristic variables: variables that describe properties of each observational unit. For example, lat
and lon
describe the latitude and longitude of each airport.
-
Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the left-most columns of your data frame.
+
Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the leftmost columns of your data frame.
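One way to check whether a variable, or a combination of variables, uniquely identifies each row is to look for repeated values. The sketch below uses base R's anyDuplicated() on a made-up data frame, toy_flights, whose column names echo those in flights but whose values are invented for illustration.

```r
# A hypothetical data frame in which flight numbers repeat across days
toy_flights <- data.frame(
  day = c(1, 1, 2, 2),
  flight = c(1545, 1714, 1545, 1714),
  distance = c(1400, 1416, 1400, 1416)
)

# anyDuplicated() returns 0 when no value (or row) is repeated,
# and the position of the first repeat otherwise

# Non-zero result: flight alone does not uniquely identify a row
anyDuplicated(toy_flights$flight)

# Result of 0: day and flight together uniquely identify each row
anyDuplicated(toy_flights[, c("day", "flight")])
```

Here neither day nor flight is an identification variable on its own, but the pair is.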
(LC1.5) What properties of each airport do the variables lat
, lon
, alt
, tz
, dst
, and tzone
describe in the airports
data frame? Take your best guess.
-
(LC1.6) Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy data frame that matches these conditions.
+
(LC1.6) Provide the names of variables in a data frame with at least three variables where one of them is an identification variable and the other two are not. Further, create your own tidy data frame that matches these conditions.
@@ -929,14 +960,14 @@
Identification & measur
Help files
Another nice feature of R is its help files, which provide documentation for various functions and datasets. You can bring up a help file by adding a ?
before the name of a function or data frame and then running this in the console. You will then be presented with a page showing the corresponding documentation if it exists. For example, let’s look at the help file for the flights
data frame.
-
?flights
-
The help file should pop-up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away.
+
+
The help file should pop up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away.
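The same mechanism works for anything that has documentation. As a small sketch using only base R (so no packages need to be loaded), the two lines below are equivalent ways to open the help file for the mean() function; the choice of mean() is purely for illustration.

```r
# Prefix a name with ? to open its help file
?mean

# help() with the name in quotes does the same thing
help("mean")
```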
-
(LC1.7) Look at the help file for the airports
data frame. Revise your earlier guesses about what the variables lat
, lon
, alt
, tz
, dst
, and tzone
each describe. How good were your guesses?
+
(LC1.7) Look at the help file for the airports
data frame. Revise your earlier guesses about what the variables lat
, lon
, alt
, tz
, dst
, and tzone
each describe.
@@ -944,27 +975,30 @@
Help files
Conclusion
-
We’ve given you what we feel is a minimally viable set of tools to explore data in R. Does this chapter contain everything you need to know? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn’t be useful! As we said earlier, the best way to further add to your toolbox is to learn by doing.
+
We’ve given you what we feel is a minimally viable set of tools to explore data in R. Does this chapter contain everything you need to know? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn’t be useful! As we said earlier, the best way to add to your toolbox is to get into RStudio and run and write code as much as possible.
Additional resources
-
If you are completely new to the world of coding, R, and RStudio and feel you could benefit from a more detailed introduction, we suggest you check out ModernDive co-author Chester Ismay’s short book “Getting used to R, RStudio, and R Markdown” (Ismay 2016), which includes screencast recordings that you can follow along and pause as you learn. Furthermore, this book contains an introduction to R Markdown, a tool used for reproducible research in R.
-
What’s to come?
-
As we stated earlier, however, the best way to learn R is to learn by doing. We’re now going to start the “Data Science with tidyverse” portion of this book in Chapter 2 with what we feel is the most important tool in a data scientist’s toolbox: data visualization. We’ll continue to explore the data included in the nycflights13
package using the ggplot2
package for data visualization. You’ll see that data visualization is a powerful addition to your toolbox for exploring data, providing insight beyond what the View()
and glimpse()
functions can provide.
-
@@ -974,13 +1008,19 @@
What’s to come?
References
-
Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Ben Baumer, and Mine Cetinkaya-Rundel. 2019. Infer: Tidy Statistical Inference. https://github.com/tidymodels/infer.
+
Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Ben Baumer, and Mine Cetinkaya-Rundel. 2019. Infer: Tidy Statistical Inference.
+
+
+
Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. First. Sebastopol, CA: O’Reilly Media. https://r4ds.had.co.nz/.
+
+
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
-
+
@@ -74,7 +77,6 @@
a.sourceLine:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
-div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
@@ -145,25 +147,28 @@