---
title: "Chapter 2 - Data Structures in R"
author: "Government Analysis Function and ONS Data Science Campus"
engine: knitr
execute:
  echo: true
  eval: true
  freeze: auto # re-render only when source changes
---
> To switch between light and dark modes, use the toggle in the top left
# Learning Objectives
* Be familiar with data structures in R.
* Understand how vectors operate.
* Be familiar with lists.
* Be familiar with data frames and tibbles.
# Data Structures
We rarely work with single data values; more often we work with a combination or collection of data.
R organises these collections into structures and stores them so that we can manipulate and work with them.
# Vectors
In the previous chapter we looked at variables, where we said we could assign a name to a value.
Crucially, each of these values (and by extension the variable itself) has an associated datatype.
In the below example, we create a variable with the **character** data type.
```{r}
# To assign a variable
my_friends <- "ian"
```
Suppose you wanted to store more than one value in "my_friends", e.g. "hannah", "mike", "almas".
The following seem like possible solutions to this problem, but unfortunately are not the correct way to store multiple data points.
```{r, eval=FALSE}
# This code will throw an error
my_friends <- "ian", "hannah", "mike", "almas"
my_friends <- ("ian", "hannah", "mike", "almas")
```
Running the first line produces the error below,
>**Error: unexpected ',' in "my_friends <- "ian","**
because R does not accept a bare comma-separated list of values in an assignment.
This brings us to the **c()** function.
## The c() Function
The "c()" function is used for creating a vector, which is R's simplest data structure. The c stands for **combine**.
```{r, eval=FALSE}
#To access the R help documentation
?c()
# or
help(c)
```
We can use this function to store multiple elements (data points) in a single variable, with each element **separated by a comma**.
```{r}
# Creating a vector
my_friends <- c("ian", "hannah", "mike", "almas")
# To display the data
my_friends
```
Let's check the datatype of our new vector.
```{r}
# Check datatype of my_friends
typeof(my_friends)
```
We see that this is a character vector, that is, a vector containing character data. This is because every element contained within the vector is of character type.
## Vector Definition
A vector is a collection of values that are **all of the same type** (doubles, characters, etc.).
Because R insists that all elements in a vector are of the same type, vectors themselves come in different types. These include:
* Logical vector - contains only logical values.
* Numeric vector - contains only numeric values.
* Character vector - contains only character values.
There are more types of vectors, but for the purpose of our learning these are sufficient.
Visually:
![](Images/vector.png){fig-alt="Diagram of a vector as a column (collection) of the same red rectangle."}
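As a small sketch of the three vector types listed above, we can build one of each and ask R for its type. The vector names and values here are made up purely for illustration. Note that typeof() reports a numeric vector with decimal values as "double".

```{r}
# Hypothetical example vectors, one per vector type
example_logical <- c(TRUE, FALSE, TRUE)
example_numeric <- c(1.5, 2, 10)
example_character <- c("red", "green", "blue")

# typeof() reports the single type shared by every element
typeof(example_logical)
typeof(example_numeric)
typeof(example_character)
```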
## Creating Vectors
There are several different ways of creating a vector. You can create a vector by using:
- The combine function **c()** that we saw previously.
- The sequence function **seq()** to create any sequence of numbers.
- The **colon** operator to create a vector of consecutive numbers.
### The c() function (combine)
Let's create a numeric vector with c() this time!
```{r}
# Creating a vector using the combine function
sample_vector <- c(1, 2, 3, 4, 5)
# To display the created vector
sample_vector
```
### The seq() function (sequence)
A sequence of values has:
* A starting point
* An ending point
* A "step", i.e. the value we jump by from number to number.
> For example, if we start at 2, end at 10 and step by 2 we will have (2, 4, 6, 8, 10) as our final sequence.
The seq() function takes the arguments:
* from - the start point
* to - the end point
* by - the step value
```{r}
# Creating a vector using the sequence function,
sequence_vector <- seq(from = 2, to = 6, by = 2)
# To display the created vector
sequence_vector
```
Earlier we saw the number [1] next to our output in the console. It gives the position in the vector of the element immediately to its right.
Thus, 2 is element 1 of our vector, 4 is element 2 and so on. This is essential information for later.
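To tie this back to the earlier description of a sequence, here is a short sketch that reproduces the (2, 4, 6, 8, 10) example and then uses a step that is not a whole number. The variable names are hypothetical and chosen only for this illustration.

```{r}
# Start at 2, end at 10, step by 2
example_sequence <- seq(from = 2, to = 10, by = 2)
example_sequence

# The step does not need to be a whole number
example_half_steps <- seq(from = 0, to = 2, by = 0.5)
example_half_steps
```

In both cases the console shows [1] at the start of the output, marking the first element of the vector.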
### The Colon Operator
A range of consecutive values can be generated as a vector using the colon operator **:** in R.
Because the values are always consecutive, the colon operator cannot skip values in the way that seq() can.
```{r}
# Creating a vector using a colon
colon_vector <- 6:10
# To display the created vector
colon_vector
```
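As a quick sketch of how the colon operator relates to seq(), the chunk below checks that both produce the same values when the step is 1, and shows that the colon operator counts downwards when the first number is the larger one. The variable names are hypothetical.

```{r}
# Counting up: both lines produce the values 6, 7, 8, 9, 10
example_colon <- 6:10
example_seq <- seq(from = 6, to = 10, by = 1)

# Compare the two vectors element by element
all(example_colon == example_seq)

# Counting down: put the larger number first
example_countdown <- 10:6
example_countdown
```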
### Exercise
::: {.panel-tabset}
### **Exercise** {-}
1. Create two vectors, one with numeric data and one with character data and name them appropriately. For example favourite_movies, lucky_numbers.
2. Use the c() function to create a new vector with your vectors as inputs. This combines the smaller vectors into a larger vector, further showcasing their versatility.
3. Find the type of your new vector. Why do you think this is the case?
### **Show Answer**{-}
1.
```{r}
# Creating a character vector
favourite_movies <- c("Frozen", "The Lion King", "Moana", "The Dark Knight")
# To display the data
favourite_movies
# Creating a numeric vector
lucky_numbers <- c(7, 12, 15, 1)
# To display the data
lucky_numbers
```
2.
```{r}
# Combining both vectors into one new vector
movies_and_numbers <- c(favourite_movies, lucky_numbers)
movies_and_numbers
```
3.
```{r}
# Checking the type of vector
typeof(movies_and_numbers)
```
:::
We end up with a character vector as opposed to an error for incompatible types (numeric and character).
As vectors must hold a single data type, R converts the values so that they all share the same datatype. This is called **coercion**, and it follows a hierarchy:
* Any datatype can be converted to a character by wrapping it in quotation marks **" "**.
* Any integer can be converted to a double by adding in the decimal place, i.e. 1L --> 1.0
* Any logical can be converted to an integer or double as they have numeric representation as 1 and 0 (or 1.0, 0.0) in the background.
However, a character such as "hello" cannot be converted to any other type.
As such, we have:
>**Logical < Integer < Double < Character**, so character is the most all-encompassing type, while logical sits at the bottom: logicals can be converted to the other types, but values of the other types are not converted to logical.
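
The coercion order can be checked directly. The sketch below mixes two types at a time inside c() and asks for the type of the result. All of the variable names and values are hypothetical.

```{r}
# Logical + integer -> integer (TRUE is stored as 1L)
example_lgl_int <- c(TRUE, 5L)
typeof(example_lgl_int)

# Integer + double -> double
example_int_dbl <- c(5L, 2.5)
typeof(example_int_dbl)

# Double + character -> character
example_dbl_chr <- c(2.5, "hello")
typeof(example_dbl_chr)

# Logical + character -> character (TRUE is stored as "TRUE")
example_lgl_chr <- c(TRUE, "hello")
example_lgl_chr
```

In each case the result takes the type that sits further to the right in the hierarchy.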
## Vectorised Language
Vectors aren't just containers for homogeneous data (data of the same type).
> **As R is a vectorised language, this means operations are applied to each element of the vector automatically, without the need to loop through the vector.**
This is not a common behaviour among programming languages and is a key advantage of R's nature as a vectorised language.
This also means that every variable we have created so far has also been a vector, just with a single element, and the operations on them (such as adding 5) have been vectorised operations.
Let's see this in practice!
```{r}
# Add sample and colon vector
vector_addition <- sample_vector + colon_vector
vector_addition
```
We see that the first elements are added together, the second elements are added together, and so on. This is an element-wise operation (element by element).
```{r}
# Multiplying vectors (element wise)
vector_multiplication <- sample_vector * colon_vector
vector_multiplication
```
We see that the first elements are multiplied together, the second elements are multiplied together, and so on.
```{r}
# Multiply vector by a value
sample_vector <- sample_vector * 3
sample_vector
```
We see that each element is multiplied by 3, like expanding a bracket!
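To see what "without the need to loop" saves us, here is a minimal sketch comparing a vectorised multiplication with the same calculation written as an explicit for loop. Loops have not been covered in this course so far, so treat the loop purely as an illustration. The variable names are hypothetical.

```{r}
# Hypothetical example vector
example_values <- c(2, 4, 6)

# Vectorised: one expression multiplies every element
example_vectorised <- example_values * 3

# Explicit loop: the same result, built one element at a time
example_looped <- numeric(length(example_values))
for (i in seq_along(example_values)) {
  example_looped[i] <- example_values[i] * 3
}

# Both approaches give the same vector
identical(example_vectorised, example_looped)
```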
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
1. Try and add "sample_vector" and "sequence_vector".
2. Can you figure out what has happened?
### **Show Answer**{-}
1.
```{r}
# Adding two vectors of different lengths
vector_after_addition <- sample_vector + sequence_vector
vector_after_addition
```
2. Here we have one vector with 3 elements and one with 5 elements. The number of elements in a vector is referred to as the **length** of the vector, so sample_vector has length 5.
When we add vectors whose lengths do not match in this way, R gives a warning message that the **longer object length is not a multiple of the shorter object length**.
As we can see, R has recycled the elements in the shorter vector, wrapping them around to reach the length of the longer one, before adding them together.
When applying arithmetic to two vectors, their lengths should either be equal or the length of the longer one should be a multiple of the length of the shorter one, e.g. adding a vector of length 3 to a vector of length 6.
:::
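As a sketch of recycling when the lengths do line up, the chunk below adds a vector of length 3 to a vector of length 6. The shorter vector is reused exactly twice, so R gives no warning. The variable names and values are hypothetical.

```{r}
# Hypothetical vectors of length 6 and length 3
example_long <- c(1, 2, 3, 4, 5, 6)
example_short <- c(10, 20, 30)

# (10, 20, 30) is reused twice: 1+10, 2+20, 3+30, 4+10, 5+20, 6+30
example_recycled <- example_long + example_short
example_recycled
```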
## Indexing a Vector
Often, we want to select specific elements or ranges of them to work with going forward.
Elements in a vector can be selected using square brackets **[ ]**.
> **We are going to use indexes, which are the position of each element within the vector.**
Note that indexes in R start at 1, not at 0 as in many other languages.
## **Example - Indexing a single element**{-}
Take the favourite_movies vector from the earlier exercise.
```{r}
# To display the vector
favourite_movies
```
If we want to select "The Lion King" from this vector, which is at position 2, we need to use the number 2 in square brackets.
```{r}
# Selecting the second element in the vector
favourite_movies[2]
```
Reading this left to right we are:
* From the favourite_movies vector
* Select [ ]
* The element at position 2
## **Example - Index Sequentially**{-}
We can also index multiple elements using the colon operator. Remember that this creates a sequence of consecutive numbers, so allows us to index sequentially.
For example, if we wanted to select from "The Lion King" to "The Dark Knight", we select from index 2 to index 4, i.e. **2:4**.
```{r}
# Selecting sequentially
favourite_movies[2:4]
```
## **Example - Index out of sequence**{-}
We can even index **out of sequence**, such as obtaining the elements at indexes 1, 3 and 4.
This is convenient: some other programming languages need external packages or explicit loops to do this, whereas R does it out of the box.
To do this, we need to create a vector of values inside the square brackets, using the c() function.
```{r}
# Selecting first, third and fourth items
favourite_movies[c(1,3,4)]
```
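Since the index is itself just a vector, it can be stored in a variable first, and the order of the positions decides the order of the result. The sketch below uses a made-up vector of colours to show both ideas.

```{r}
# Hypothetical example vector
example_colours <- c("red", "green", "blue", "yellow")

# Store the positions in their own vector, then use it as the index
example_positions <- c(1, 3, 4)
example_colours[example_positions]

# The order of the positions controls the order of the result
example_colours[c(4, 1)]
```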
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
Use the vector below containing days of the week (Monday - Sunday), for the exercises.
```{r}
# Days of the week vector
days_of_the_week <- c("Monday",
                      "Tuesday",
                      "Wednesday",
                      "Thursday",
                      "Friday",
                      "Saturday",
                      "Sunday")
```
1. Select Wednesday only.
2. Select the week days.
3. Select Tuesday and Thursday.
### **Show Answer**{-}
1.
```{r}
# Selecting Wednesday
days_of_the_week[3]
```
2.
```{r}
# Selecting the week days
days_of_the_week[1:5]
```
3.
```{r}
# Selecting Tuesday and Thursday
days_of_the_week[c(2,4)]
```
### **Extension Exercise**{-}
Run the following lines of code one at a time and try to determine what is happening.
```{r, eval=FALSE}
# Find out what the code below does
days_of_the_week[3] <- "Wensday"
# Find out what the code below does
days_of_the_week[c(-6,-7)]
# Find out what the code below does
days_of_the_week[8]
```
### **Extension Answer**{-}
* The first line of code overwrites the element at index position 3 with a new value (notice the assignment operator).
* Using the minus sign when selecting elements reverses the process, i.e. tells R **not** to select that value. So here we are saying select all except index positions 6 and 7.
* Giving an index position outside of the range of vector elements returns **NA (Not Available)** as there is no 8th value in the vector.
:::
# Lists
Lists are a similar data structure to vectors in that they are an ordered collection of elements.
They differ from vectors because their elements can be of any type, including vectors and even other lists!
Combining multiple vectors with c() creates one single vector. In a list, by contrast, each element can itself be a collection of elements, and no coercion or combination takes place when the list is created.
>**So why do you need lists?**
Lists enable you to gather a variety of objects with different contents and lengths under one name in an ordered way.
![](Images/list.png){fig-alt="Visual of a list, with each element being a collection of boxes with different colours."}
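To make the contrast with vectors concrete, here is a small sketch that passes mixed values to c() and then to list(). The names and values are hypothetical, and str() is a base R function that prints a compact summary of an object's structure.

```{r}
# c() coerces everything to one type: all three values become character
example_vector <- c(1, "a", TRUE)
typeof(example_vector)

# list() keeps each element as it was given, including a whole vector
example_list <- list(c(1, 2, 3), "a", TRUE)

# The list has 3 elements, even though the first one holds 3 numbers
length(example_list)

# Show the type of each element
str(example_list)
```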
To create a list we will use the **list()** function.
```{r}
# Creating a list
movies_numbers_friends <- list(favourite_movies, lucky_numbers, my_friends)
# To display the list
movies_numbers_friends
```
Each of the contents will appear on a new line:
* Starting with double square brackets, [[1]] denotes the first element of the list. Since a list is a **container of containers**, there are 2 levels to select from.
* Next is the single square bracket, [1], denoting the first element of the vector stored inside that list element (here, the character vector of movies).
This matters because indexing from a list is a little more complicated.
## Indexing from a list
We can select at two levels from lists, and this differs by what we actually **return** as a result.
Previously when we indexed from vectors:
* If we selected one value, we return the value itself as a single data point.
* If we selected a few values, we return a **vector** of those chosen values, as it is still a collection after all.
With lists:
* If we select with double square brackets, **[[ ]]**, we return the collection stored at that position in the list. This could be a vector, another list, or whatever else is stored there.
* If we select with single square brackets, **[ ]**, we return a smaller list just containing the element we asked for.
>**Let's see this in practice**
```{r}
# Index to return the collection at position 1
movies_numbers_friends[[1]]
```
```{r}
# Index to return a smaller/sub-list with element 1 only
movies_numbers_friends[1]
```
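One way to confirm the difference between the two kinds of brackets is to ask R whether each result is still a list. This is only a sketch using a small made-up list, and is.list() is a base R function that returns TRUE or FALSE.

```{r}
# Hypothetical list holding a character vector and a numeric vector
example_pair_list <- list(c("a", "b"), c(1, 2, 3))

# Double brackets return the vector itself
example_inner <- example_pair_list[[2]]
is.list(example_inner)

# Single brackets return a smaller list that contains the vector
example_sublist <- example_pair_list[2]
is.list(example_sublist)
length(example_sublist)
```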
We can even **double-select** to select the collection with double brackets, and then a specific element or collection of elements from there.
```{r}
# Access "The Dark Knight" from the list
movies_numbers_friends[[1]][4]
```
So we access the first element of the list, which is a character vector, and then the 4th element of that vector.
# Data Frames
Data frames are like the tables we are used to from Excel and other programs, with columns and rows that form a 2 dimensional object.
They are collections of vectors, where:
* Each vector (column) contains values of a single data type
* Different vectors can contain different data types (i.e. column 1 could be numeric, column 2 could be character)
As a whole, data frames have the following features:
* Columns are variables.
* Rows are observations (i.e. an entry for each variable forms an observation/row).
* They can hold variables of different types.
![](Images/dataframe.png){fig-alt="Visual of a dataframe where each column has a name."}
To create one, we can use the **data.frame()** function on the vectors you would like to be your columns. These vectors must all be the same length.
>**Using data.frame()**
Within this function, the name we give each argument becomes the column name for the vector assigned to it, which saves us from naming the columns in a separate step.
```{r}
# Create a dataframe with favourite movies and numbers
about_me_df <- data.frame(numbers = lucky_numbers,
                          movies = favourite_movies)
about_me_df
```
You see that this just prints the column names and content in the console. We would need to investigate the data further to obtain other information about it.
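As a sketch of what "investigating further" could look like, base R has functions for the size, column names and structure of a data frame. The data frame below is made up for illustration, and its column names are hypothetical.

```{r}
# Hypothetical data frame with one character and one numeric column
example_df <- data.frame(letter = c("a", "b", "c"),
                         value = c(10, 20, 30))

# Number of rows (observations) and columns (variables)
nrow(example_df)
ncol(example_df)

# Column names
names(example_df)

# Compact summary showing the type of each column
str(example_df)
```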
When it comes to working with data, base data frames are somewhat outdated. The **tidyverse**, which we will be introduced to in the next chapter, provides an excellent upgrade to the data frame, known as the **tibble**.
# Summary
We have covered a lot of material in R and yet there is still so much more to cover in terms of functionality, as R has so much to offer.
By no means are you expected to remember all of the above. What matters more is that you understand the problems you want to solve and can then use the references or material provided (or that you find yourself) to go about solving them.
Next up, we will introduce packages, importing and exporting data.