R for reproducible scientific analysis
Data structures

Learning Objectives {.objectives}

  • To be aware of the different types of data
  • To be aware of the different basic data structures commonly encountered in R
  • To be able to ask R questions about the type, class, and structure of an object

Data Types

Before we can analyse the gapminder data, we'll need a strong understanding of the basic data types and data structures. It is very important to understand these, because they are the things you will manipulate on a day-to-day basis in R, and they are the source of most of the frustration encountered by beginners.

R has 5 basic atomic types (meaning they can't be broken down into anything smaller):

  • logical (e.g., TRUE, FALSE)
  • numeric
    • integer (e.g., 2L, as.integer(3))
    • double (i.e. decimal) (e.g., -24.57, 2.0, pi)
  • complex (i.e. complex numbers) (e.g., 1 + 0i, 1 + 4i)
  • character (e.g., "a", "swc", 'This is a cat')

There are a few functions we can use to interrogate data in R to determine its type:

typeof() # what is its atomic type?
is.logical() # is it TRUE/FALSE data?
is.numeric() # is it numeric?
is.complex() # is it complex number data?
is.character() # is it character data?
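
As a minimal sketch of how these functions behave, the snippet below applies them to a few literal values. The variable names `whole` and `decimal` are illustrative only and are not part of the lesson data.

```r
whole <- 3L         # the L suffix asks R for an integer
decimal <- 3        # a bare number is stored as a double

typeof(whole)       # "integer"
typeof(decimal)     # "double"
is.numeric(whole)   # TRUE: integers and doubles both count as numeric
is.numeric(decimal) # TRUE
is.logical(TRUE)    # TRUE
is.complex(1 + 4i)  # TRUE
is.character("3")   # TRUE: the quotes make this text, not a number
```

Note that `typeof()` reports the storage type, while the `is.*()` functions answer a yes/no question about a broader category.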

Challenge 1: Data types {.challenge}

Use your knowledge of how to assign a value to a variable, to create examples of data with the following characteristics:

    1. Variable name: 'answer', Type: logical
    2. Variable name: 'height', Type: numeric
    3. Variable name: 'dog_name', Type: character

For each variable you've created, test that it has the data type you intended. Do you find anything unexpected?

Data Structures

There are five data structures you will commonly encounter in R. These include:

  • vector
  • factor
  • list
  • matrix
  • data.frame

For now, let's focus on vectors in more detail, to discover more about data types.


A vector is the most common and basic data structure in R, and is pretty much the workhorse of the language. Vectors are sometimes referred to as atomic vectors because, importantly, they can only contain one data type. They are the building blocks of every other data structure.

We've seen vectors already when we retrieved the row names and column names of the gapminder dataset, and when accessing its individual columns!

A vector can contain any of the five types we introduced before:

  • logical (e.g., TRUE, FALSE)
  • integer (e.g., 2L, as.integer(3))
  • numeric (real or decimal) (e.g., 2, 2.0, pi)
  • complex (e.g., 1 + 0i, 1 + 4i)
  • character (e.g., "a", "swc")

Tip: "Character Vectors" {.callout}

You will sometimes hear the term "character vector", especially in warning or error messages. This is a somewhat confusing and unfortunate name. Remember that the type "character" really means some text wrapped in quotation symbols.

Create an empty vector with vector() or by using the concatenate function, c().

x <- vector()
x
logical(0)

By default, vector() creates an empty vector (i.e. of length 0) of type "logical".

x <- vector(length = 10) # with a predefined length
x
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If we count the number of FALSEs, there should be 10.

x <- vector("character", length = 10)  # with a predefined length and type
x
 [1] "" "" "" "" "" "" "" "" "" ""

Or we can use the concatenate function to combine any values we like into a vector (so long as they're the same atomic type!).

x <- c(10, 12, 45, 33)
x
[1] 10 12 45 33

You can also create vectors as sequences of numbers:

series <- 1:10
series
 [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, by = 0.1)
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
 [16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
 [31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
 [46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
 [61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
 [76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
 [91] 10.0

Tip: Creating integers {.callout}

When you combine numbers using the concatenate function, c(), the type will automatically become "numeric", that is, real/decimal numbers. If you specifically want to create a vector of integers (whole numbers only), you need to append an L to each number, i.e. c(10L, 12L, 45L, 33L).

You can also use the concatenate function to add elements to a vector:

x <- c(x, 57)
x
[1] 10 12 45 33 57

Challenge 2 {.challenge}

Vectors can only contain one atomic type. If you try to combine different types, R will create a vector that is the least common denominator: the type that is easiest to coerce to.

Guess what the following do without running them first:

xx <- c(1.7, "a") 
xx <- c(TRUE, 2) 
xx <- c("a", TRUE)

This is called implicit coercion.

The coercion rule goes logical -> integer -> numeric -> complex -> character.
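
The chain above can be sketched by combining neighbouring types and asking for the type of the result. The variable names here are hypothetical, and typeof() reports "double" for what the chain calls numeric.

```r
# Each pair mixes two neighbouring types; the result takes the later type.
lgl_int  <- c(TRUE, 2L)
int_dbl  <- c(2L, 3.5)
dbl_cplx <- c(3.5, 1 + 2i)
cplx_chr <- c(1 + 2i, "a")

typeof(lgl_int)   # "integer"
typeof(int_dbl)   # "double"
typeof(dbl_cplx)  # "complex"
typeof(cplx_chr)  # "character"

# The values themselves are converted along the way.
lgl_int           # TRUE has become 1
c(3.5, "a")       # the number has become the text "3.5"
```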

You can also coerce vectors explicitly using the as.<class_name> functions. R will try to do whatever makes the most sense for each value:

x <- 0:6
as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
as.complex(x)
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i
as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

This is behaviour you will find in many programming languages: 0 is FALSE, while every other number is treated as TRUE. Sometimes coercions, especially nonsensical ones, won't work.

In some cases, R won't be able to do anything sensible:

x <- c("a", "b", "c")
as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion 
as.logical(x)
[1] NA NA NA

In both cases, a vector of NAs was returned, and in the first case so was a warning.

Tip: Special Objects {.callout}

"NA" is a special object in R which denotes a missing value. NA can occur in any type of vector. There are a few other special objects: Inf denotes infinity (it can be positive or negative), while NaN means "Not a Number", an undefined value (e.g. 0/0). NULL denotes that the data structure doesn't exist (but it can occur in list elements).
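
A short sketch of how these special objects show up in practice. The vector name `measurements` is made up for illustration.

```r
measurements <- c(1, NA, 3)

is.na(measurements)               # FALSE TRUE FALSE
mean(measurements)                # NA: a missing value propagates
mean(measurements, na.rm = TRUE)  # 2: drop missing values first

1 / 0            # Inf
-1 / 0           # -Inf
0 / 0            # NaN
is.nan(0 / 0)    # TRUE
is.na(0 / 0)     # TRUE: NaN is also treated as missing

is.null(NULL)    # TRUE
length(NULL)     # 0
```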

You can ask questions about the structure of vectors:

x <- c(10, 12, 45, 33)
tail(x, n=2) # get the last 'n' elements
[1] 45 33
head(x, n=1) # get the first 'n' elements
[1] 10
length(x)
[1] 4
str(x)
 num [1:4] 10 12 45 33

Like data.frames, vectors can also be named:

names(x) <- c("a", "b", "c", "d")
x
 a  b  c  d 
10 12 45 33 

Advanced Tip for Programmers {.callout}

If you're coming from other programming languages you might recognise this as a useful tool akin to dictionaries and hash tables. This is true for small vectors, but for true hash table functionality, you should use the environment object. See ?new.env.
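
As a rough sketch of the difference, the snippet below uses a named vector as a small lookup table and then an environment as a hash table. The names `heights` and `lookup` and the keys are invented for the example.

```r
# A named vector works as a small lookup table.
heights <- c(alice = 165, bob = 180)
unname(heights["bob"])                # 180

# An environment created with new.env() behaves like a hash table.
lookup <- new.env()
assign("alice", 165, envir = lookup)  # insert a key/value pair
lookup[["bob"]] <- 180                # alternative insertion syntax

get("alice", envir = lookup)          # 165
exists("carol", envir = lookup, inherits = FALSE)  # FALSE: no such key
sort(ls(lookup))                      # "alice" "bob"
```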


Another data structure you'll likely encounter is the matrix. Underneath the hood, matrices are really just atomic vectors with added dimension attributes.

We can create one with the matrix function. Let's generate some random data:

set.seed(1) # make sure the random numbers are the same for each run
x <- matrix(rnorm(18), ncol=6, nrow=3)
x
           [,1]       [,2]      [,3]       [,4]       [,5]        [,6]
[1,] -0.6264538  1.5952808 0.4874291 -0.3053884 -0.6212406 -0.04493361
[2,]  0.1836433  0.3295078 0.7383247  1.5117812 -2.2146999 -0.01619026
[3,] -0.8356286 -0.8204684 0.5757814  0.3898432  1.1249309  0.94383621
str(x)
 num [1:3, 1:6] -0.626 0.184 -0.836 1.595 0.33 ...

You can use rownames, colnames, and dimnames to set or retrieve the column and rownames of a matrix. The functions nrow and ncol will tell you the number of rows and columns (this also applies to data frames!), while length will tell you the number of elements.
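
A minimal sketch of these functions on a small matrix. The object `m` and its row and column labels are hypothetical.

```r
m <- matrix(0, nrow = 2, ncol = 3)   # a 2 x 3 matrix of zeros

rownames(m) <- c("r1", "r2")         # set the row names
colnames(m) <- c("a", "b", "c")      # set the column names

nrow(m)        # 2
ncol(m)        # 3
dim(m)         # 2 3
rownames(m)    # "r1" "r2"
dimnames(m)    # a list: row names first, then column names
```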

Challenge 3 {.challenge}

What do you think will be the result of length(x)? Try it. Were you right? Why / why not?

Challenge 4 {.challenge}

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)


Factors are special vectors that represent categorical data. Factors can be ordered or unordered, and are important for modelling functions such as aov(), lm() and glm(), and also in plot methods.

Factors can only contain pre-defined values, and we can create one with the factor function:

x <- factor(c("yes", "no", "no", "yes", "yes"))
x
[1] yes no  no  yes yes
Levels: no yes

So we can see that the output is very similar to a character vector, but with an attached levels component. This becomes clearer when we look at its structure:

str(x)
 Factor w/ 2 levels "no","yes": 2 1 1 2 2

This reveals something important: while factors look (and often behave) like character vectors, they are actually integers under the hood, and here, we can see that "no" is represented by a 1, and "yes" a 2.
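
One way to see the integer codes directly is to strip the factor back to its parts. This is a small sketch; the variable name `answers` is illustrative.

```r
answers <- factor(c("yes", "no", "no", "yes", "yes"))

levels(answers)        # "no" "yes": the set of allowed values
as.integer(answers)    # 2 1 1 2 2: the codes stored under the hood
typeof(answers)        # "integer"
class(answers)         # "factor"
as.character(answers)  # "yes" "no" "no" "yes" "yes": back to text
```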

In modelling functions, it is important to know what the baseline level is. This is assumed to be the first level, but by default the ordering of the levels is determined by the alphabetical order of the values entered. You can change this by specifying the levels:

x <- factor(c("case", "control", "control", "case"), levels = c("control", "case"))
str(x)
 Factor w/ 2 levels "control","case": 2 1 1 2

In this case, we've explicitly told R that "control" should be represented by 1, and "case" by 2. This designation can be very important for interpreting the results of statistical models!


If you want to combine different types of data, you will need to use lists. Lists act as containers, and can contain any type of data structure, even themselves!

Lists can be created using list or coerced from other objects using as.list():

x <- list(1, "a", TRUE, 1+4i)
x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

Each element of the list is denoted by a [[ in the output. Inside each list element is an atomic vector of length one.

Lists can contain more complex objects:

xlist <- list(a = "Research Bazaar", b = 1:10, data = head(iris))
xlist
$a
[1] "Research Bazaar"

$b
 [1]  1  2  3  4  5  6  7  8  9 10

$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In this case our list contains a character vector of length one, a numeric vector with 10 entries, and a small data frame from one of R's many preloaded datasets (see ?data). We've also given each list element a name, which is why you see $a instead of [[1]].

Lists can also contain other lists, nested to any depth.
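
A minimal sketch of nesting, assuming nothing beyond base R. The names `nested`, `outer`, and `inner` are invented for the example.

```r
# A list whose element is a list, whose element is again a list.
nested <- list(outer = list(inner = list(1, "a")))

str(nested)                # shows the nesting level by level
is.list(nested$outer)      # TRUE
nested$outer$inner[[2]]    # "a": drill down one level at a time
```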


Challenge 5 {.challenge}

Create a list containing two character vectors, one for each of the sections in this part of the workshop:

  • Data types
  • Data structures

Populate each character vector with the names of the data types and data structures we've seen so far.

Lists are extremely useful inside functions. You can "staple" together lots of different kinds of results into a single object that a function can return. In fact many R functions which return complex output store their results in a list.
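
The "stapling" idea can be sketched with a small function that returns several results at once. The function `summarise_vector` is hypothetical and written only for this illustration.

```r
# Return several different kinds of result in one list.
summarise_vector <- function(v) {
  list(
    n = length(v),      # a single integer
    mean = mean(v),     # a single double
    range = range(v)    # a vector of length two
  )
}

res <- summarise_vector(c(10, 12, 45, 33))
res$n       # 4
res$mean    # 25
res$range   # 10 45
```

The caller can then pick out whichever pieces it needs by name.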

Data frames

We've already encountered a data frame in the previous lesson - the gapminder dataset. Let's return to them now that we know a little bit more about data structures.

Data frames are similar to matrices, except they can contain multiple atomic types. Underneath the hood, data frames are really lists, where each element is an atomic vector, with the added restriction that they're all the same length.

Data frames can be created manually with the data.frame function:

df <- data.frame(id = c('a', 'b', 'c', 'd', 'e', 'f'), x = 1:6, y = c(214:219))
  id x   y
1  a 1 214
2  b 2 215
3  c 3 216
4  d 4 217
5  e 5 218
6  f 6 219

Challenge 6: Dataframes {.challenge}

Try using the length function to query your dataframe df. Does it give the result you expect?

Each column in the data frame is simply a list element, which is why when you ask for the length of the data frame, it tells you the number of columns. If you actually want the number of rows, you can use the nrow function.
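
The list-like nature of a data frame can be checked directly. This is a small sketch using a made-up data frame called `scores`.

```r
scores <- data.frame(id = c("a", "b", "c"), x = 1:3)  # hypothetical example data

is.list(scores)          # TRUE
typeof(scores)           # "list"
length(scores)           # 2: one list element per column
nrow(scores)             # 3
sapply(scores, length)   # every column has the same length, 3

# Columns whose lengths cannot be reconciled are rejected with an error.
bad <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```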

We can add rows or columns to a data.frame using rbind or cbind (these are the two-dimensional equivalents of the c function):

df <- rbind(df, list("g", 11, 42))

This doesn't work as expected! What does this warning message tell us?

Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 11) :
  invalid factor level, NA generated

It sounds like it was trying to generate a factor level. Why? Perhaps our first column (containing characters) is to blame... We can access a column in a data.frame by using the $ operator:

class(df$id)
[1] "factor"

Indeed, R automatically made this first column a factor, not a character vector (this is the default behaviour of data.frame in R versions before 4.0.0). We can change this in place by converting the type of this column.

df$id <- as.character(df$id)

Okay, now let's try adding that row again.

df <- rbind(df, list("g", 11, 42))
tail(df, n=3)
  id  x   y
5  e  5 218
6  f  6 219
7  g 11  42

Note that to add a row, we need to use a list, because each column is a different type! If you want to add multiple rows to a data.frame, you will need to separate the new columns in the list:

df <- rbind(df, list(c("l", "m"), c(12, 13), c(534, -20)))
tail(df, n=3)
  id  x   y
7  g 11  42
8  l 12 534
9  m 13 -20

You can also row-bind data.frames together:

rbind(df, df)
   id  x   y
1   a  1 214
2   b  2 215
3   c  3 216
4   d  4 217
5   e  5 218
6   f  6 219
7   g 11  42
8   l 12 534
9   m 13 -20
10  a  1 214
11  b  2 215
12  c  3 216
13  d  4 217
14  e  5 218
15  f  6 219
16  g 11  42
17  l 12 534
18  m 13 -20

To add a column we can use cbind:

df <- cbind(df, 9:1)
df
  id  x   y 9:1
1  a  1 214   9
2  b  2 215   8
3  c  3 216   7
4  d  4 217   6
5  e  5 218   5
6  f  6 219   4
7  g 11  42   3
8  l 12 534   2
9  m 13 -20   1

Challenge 7 {.challenge}

Create a dataframe that holds the following information for yourself:

  • First name
  • Last name
  • Age

Then use rbind to add the same information for the people sitting near you.

Now use cbind to add a column of logicals answering the question, "Is there anything in this workshop you're finding confusing?"

Using dataframes: the gapminder dataset

To recap what we've just learnt, let's have a look at our example data (life expectancy in various countries for various years).

Remember, there are a few functions we can use to interrogate data structures in R:

class() # what is the data structure?
typeof() # what is its atomic type?
length() # how long is it? What about two dimensional objects?
attributes() # does it have any metadata?
str() # A full summary of the entire object

Let's use them to explore the gapminder dataset.

typeof(gapminder)
[1] "list"

Remember, data frames are lists 'under the hood'.

class(gapminder)
[1] "data.frame"

The gapminder data is stored in a "data.frame". This is the default data structure when you read in data, and (as we've heard) is useful for storing data with mixed types of columns.

Let's look at some of the columns.

Challenge 8: Data types in a real dataset {.challenge}

Look at the first 6 rows of the gapminder dataframe we loaded before:

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Write down what data type you think is in each column.

typeof(gapminder$year)
[1] "integer"
typeof(gapminder$lifeExp)
[1] "double"

Can anyone guess what we should expect the type of the continent column to be?

typeof(gapminder$continent)
[1] "integer"

If you were expecting the answer to be "character", you would rightly be surprised by the answer. Let's take a look at the class of this column:

class(gapminder$continent)
[1] "factor"

In R versions before 4.0.0, one of the default behaviours is to treat any text columns as "factors" when reading in data. The reason for this is that text columns often represent categorical data, which need to be factors to be handled appropriately by the statistical modelling functions in R.

However it's not obvious behaviour, and something that trips novices up. We can disable this behaviour and read in the data again.

gapminder <- read.table(
  file="data/gapminder-FiveYearData.csv", header=TRUE, row.names=1,
  sep=",", stringsAsFactors=FALSE # comma-separated file; keep text columns as character
)

This is a personal preference of mine. There are a few reasons I like to turn off this behaviour:

  1. I'm often tripped up by this behaviour.
  2. It's better to explicitly convert the variables into factors when running statistical models. This forces you to think about the question you're asking, and makes it easier to specify the ordering of the categories (this can be important!).

However there are many in the R community who find it more sensible to leave this as the default behaviour.

Tip: Changing options {.callout}

When R starts, the first thing it does is run any code in the file .Rprofile in the project directory. Any permanent changes to default behaviour you want to make should be stored in that file.

The first thing you should do when reading data in is check that it matches what you expect, even if the command ran without warnings or errors. The str function, short for "structure", is really useful for this:

str(gapminder)
'data.frame': 1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

We can see that the object is a data.frame with 1,704 observations (rows), and 6 variables (columns). Below that, we see the name of each column, followed by a ":", followed by the type of variable in that column, along with the first few entries.

We can also retrieve or modify the column or rownames of the data.frame:

colnames(gapminder)
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Printing rownames(gapminder) gives a much longer output. See those numbers in the square brackets on the left? Each one tells you the index of the first entry in that row of output. So in the last line of that output, we see that the "[1701]" element has "1701" stored in it. The rownames in this case are simply the row numbers.

We can also modify this information:

copy <- gapminder # let's create a copy so we don't mess up the original
colnames(copy) <- c("a", "b", "c", "d", "e", "f")
head(copy)
            a    b        c    d      e        f
1 Afghanistan 1952  8425333 Asia 28.801 779.4453
2 Afghanistan 1957  9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134

There are a few related ways of retrieving and modifying this information. attributes will give you both the row and column names, along with the class information, while dimnames will give you just the row names and column names.

In both cases, the output object is stored in a list:

str(dimnames(gapminder))
List of 2
 $ : chr [1:1704] "1" "2" "3" "4" ...
 $ : chr [1:6] "country" "year" "pop" "continent" ...
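
The same comparison can be sketched on a tiny data frame, so the whole output fits on screen. The object `small` and its columns are hypothetical.

```r
small <- data.frame(x = 1:2, y = c("u", "v"))

attrs <- attributes(small)
is.list(attrs)     # TRUE
names(attrs)       # includes "names", "class" and "row.names"
attrs$class        # "data.frame"

dn <- dimnames(small)
is.list(dn)        # TRUE
dn[[1]]            # "1" "2": the row names
dn[[2]]            # "x" "y": the column names
```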

Understanding how lists are used in function output

Let's run a basic linear regression on the gapminder dataset:

# What is the relationship between life expectancy and year?
l1 <- lm(lifeExp ~ year, data=gapminder)

We won't go into too much detail of what I just wrote, but briefly: the "~" denotes a formula, which means treat the variable on the left of the "~" as the left hand side of the equation (or response in this case), and everything on the right as the right hand side. By telling the linear model function to use the gapminder data frame, it knows to look for those variable names as its columns.

Let's look at the output:


l1

Call:
lm(formula = lifeExp ~ year, data = gapminder)

Coefficients:
(Intercept)         year  
  -585.6522       0.3259  

Not much there right? But if we look at the structure...

str(l1)
List of 12
 $ coefficients : Named num [1:2] -585.652 0.326
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "year"
 $ residuals    : Named num [1:1704] -21.7 -21.8 -21.8 -21.4 -20.9 ...
  ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
 $ effects      : Named num [1:1704] -2455.1 232.2 -20.8 -20.5 -20.2 ...
  ..- attr(*, "names")= chr [1:1704] "(Intercept)" "year" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:1704] 50.5 52.1 53.8 55.4 57 ...
  ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "year"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.02 1.03
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 1702
 $ xlevels      : Named list()
 $ call         : language lm(formula = lifeExp ~ year, data = gapminder)
 $ terms        :Classes 'terms', 'formula' length 3 lifeExp ~ year
  .. ..- attr(*, "variables")= language list(lifeExp, year)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "lifeExp" "year"
  .. .. .. ..$ : chr "year"
  .. ..- attr(*, "term.labels")= chr "year"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(lifeExp, year)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
 $ model        :'data.frame':  1704 obs. of  2 variables:
  ..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
  ..$ year   : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 lifeExp ~ year
  .. .. ..- attr(*, "variables")= language list(lifeExp, year)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "lifeExp" "year"
  .. .. .. .. ..$ : chr "year"
  .. .. ..- attr(*, "term.labels")= chr "year"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(lifeExp, year)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
 - attr(*, "class")= chr "lm"

There's a lot of stuff, stored in nested lists! This is why the str function is really useful: it allows you to see all the data returned by a function.
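
Because the model object is a list, its pieces can be pulled out by name. The sketch below uses the built-in cars dataset rather than gapminder, purely so that it runs without any data file; the name `fit` is illustrative.

```r
# Fit a simple linear model on a dataset that ships with R.
fit <- lm(dist ~ speed, data = cars)

is.list(fit)          # TRUE
names(fit)            # component names, e.g. "coefficients", "residuals"
fit$coefficients      # named numeric vector: intercept and slope
fit$df.residual       # 48: 50 observations minus 2 estimated coefficients
coef(fit)["speed"]    # the slope again, via the coef() accessor
```

Accessor functions such as coef() are usually preferable to reaching into the list directly, but seeing the list makes it clear where they get their values.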

For now, we can look at the summary:

summary(l1)

Call:
lm(formula = lifeExp ~ year, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.949  -9.651   1.697  10.335  22.158 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -585.65219   32.31396  -18.12   <2e-16 ***
year           0.32590    0.01632   19.96   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.63 on 1702 degrees of freedom
Multiple R-squared:  0.1898,  Adjusted R-squared:  0.1893 
F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16

As you might expect, life expectancy has slowly been increasing over time, so we see a significant positive association!