# Reproducible analytical pipelines with Docker
If the book ended at the end of the previous chapter, it would have been titled
"Building analytical pipelines with R", because we have not ensured that the
pipeline we built is reproducible. We did our best though:
- we used functional and literate programming;
- we documented, tested and versioned the code;
- we used `{renv}` to record the dependencies of the project;
- the project was rewritten as a `{targets}` pipeline and re-running it is as easy as it can possibly get.
But there are still many variables that we need to consider. If we go back to
the reproducibility iceberg, you will notice that we can still go deeper. As the
code stands now, we did our best using programming paradigms and libraries, but
now we need to consider other aspects.
As already mentioned in the introduction and Chapter 10, `{renv}` *only*
restores package versions. The R version used for the analysis only gets
recorded. So to make sure that the pipeline reproduces the same results, you'd
need to install the same R version that was used to build the pipeline
originally. But installing the right R version can be difficult; it depends on
the operating system you want to install it on, and on how old a version we’re
talking about. Installing R 4.0 on Windows should be quite easy, but I
wouldn’t be very keen on trying to install R 2.15.0 (released in March 2012) on
a current Linux distribution (and it might be problematic even on Windows).
Next comes the operating system on which the pipeline was developed. In
practice, this rarely matters, but there have been cases where the same code
produced different results on different operating systems, sometimes even on
different versions of the same operating system! For example, @neupane2019
discuss their attempt at reproducing a paper from 2014. The original scripts and
data were available, and yet they were not able to reproduce the results, even
though they made sure to use the same version of Python that the original
authors from 2014 were using. The reason was the operating system: they were
conducting their replication exercise on a different operating system, and this
was causing the results to be different. What was going on? The original script
relied on how the operating system ordered the files for analysis. If the files
were ordered in a different way, the results would be different. And file
ordering is operating system dependent! The table below, from @neupane2019,
shows how the results vary depending on which operating system the script runs:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/neupane_table1.png"
alt="Different operating systems yield different results."></img>
<figcaption>Different operating systems yield different results.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F}
#| fig-cap: "Different operating systems yield different results."
knitr::include_graphics("images/neupane_table1.png")
```
:::
and this table shows how Windows and Ubuntu (Linux) sort files:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/neupane_table2.png"
alt="Different OS order files differently!"></img>
<figcaption>Different OS order files differently!</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F}
#| fig-cap: "Different OS order files differently!"
knitr::include_graphics("images/neupane_table2.png")
```
:::
So the operating system can have an impact, and often an unexpected impact,
on our pipeline!
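One defensive habit that follows from this example is to never rely on the order in which the operating system happens to hand files to a script, and to sort them explicitly under a fixed locale instead. The sketch below uses made-up file names (they are not the ones from @neupane2019) and the standard `sort` utility; setting `LC_ALL=C` asks `sort` to compare plain bytes, which should give the same order on any machine:

```bash
# Hypothetical file names, deliberately listed in an arbitrary order
file_names='b_run.csv
A_run.csv
a_run.csv
B_run.csv'

# LC_ALL=C makes sort compare bytes, so uppercase letters come before
# lowercase ones regardless of the language settings of the machine
sorted_names=$(printf '%s\n' "$file_names" | LC_ALL=C sort)

printf '%s\n' "$sorted_names"
```

With the order fixed inside the script, the result no longer depends on how a given file system or locale lists the files.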
And finally, I believe that we are in a transition period when it comes to
hardware architecture. Apple will very likely completely switch over to an ARM
architecture with their Apple silicon CPUs (as of writing, the Mac Pro is the
only computer manufactured by Apple that doesn't use an Apple silicon CPU and
only because it was released in 2019) and it wouldn't surprise me if other
manufacturers follow suit and develop their own ARM CPUs. This means that
projects written today may not run anymore in the future, because of these
architecture changes. Libraries compiled for current architectures would need to
be recompiled for ARM, and that may be difficult.
So, as I explained in the previous chapter, we want our pipeline to be the
composition of pure functions. Nothing in the global environment (apart from
`{targets}`-specific options) should influence the runs of the pipeline. But,
what about the environment R is running in? The R engine is itself running in
some kind of environment. This is what I've explained above: operating system
(and all the math libraries that are included in the OS that R relies on to run
code) and hardware are variables that need to be recorded and/or frozen as much
as possible.
Think about it this way: when you run a pure function `f()` of one argument,
you think you do this:
```
f(1)
```
but actually what you're doing is:
```
f(1, "windows 10 - build 22H2 - patch 10.0.19045.2075",
"intel x86_64 cpu i9-13900F",
"R version 4.2.2")
```
and so on. `f()` is only pure as far as the R version currently running `f()` is
concerned. But everything else should also be taken into account! Remember, in
technical terms, this means that our function is not referentially transparent.
This is exactly what happened in the paper from @neupane2019 that I described
before. The authors relied on a *hidden* state (the order of the files) to
program their script; in other words, their pipeline was not referentially
transparent.
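To make these hidden arguments visible, here is a minimal sketch, assuming a shell where `uname` is available (Git Bash on Windows provides one as well). The function name `describe_environment` is made up for this illustration; it simply prints the kernel, the CPU architecture and the R version found on the machine:

```bash
# Print the hidden "arguments" of a run: kernel, CPU architecture, R version
describe_environment() {
  echo "kernel: $(uname -s) $(uname -r)"
  echo "architecture: $(uname -m)"
  if command -v R >/dev/null 2>&1; then
    # R --version prints several lines; the first one holds the version
    echo "r_version: $(R --version | head -n 1)"
  else
    echo "r_version: R not found on PATH"
  fi
}

describe_environment
```

Storing this output next to the results documents the environment of a run, but it does not freeze that environment.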
To deal with this, I will now teach you how to use Docker. Docker will
essentially allow you to make your pipeline referentially transparent, by
freezing R's and the operating system's versions (and the CPU architecture as
well).
Before continuing, let me warn you: if you’re using an Apple computer with an
Apple Silicon CPU (M1 or M2), then you may have issues following along. I don’t
own such a machine so I cannot test if the code below works flawlessly. What I
can say is that I’ve read that some users of these computers have had trouble using
Docker in the past. These issues might have been solved in the meantime. It
seems that enabling the option "use Rosetta for x86/amd64 emulation on Apple
Silicon" in Docker Desktop (I will discuss Docker Desktop briefly in the
following sections) may solve the issue.
## What is Docker?
In very simple (and technically wrong) terms, Docker makes it easy to run a
Linux virtual machine (VM) on your computer. A VM is a computer within a
computer: the idea is that you turn on your computer, start Windows (the
operating system you use every day), but then start Ubuntu (a very popular Linux
distribution) as if it were any other app installed on your computer and use it
(almost) as you would normally. This is what a classic VM solution like
*VirtualBox* offers you. You can start and use Ubuntu interactively from within
Windows. This can be quite useful for testing purposes, for example.
The way Docker differs from VirtualBox (or VMware) is that it strips down
the VM to its bare essentials. There’s no graphical user interface for example,
and you will not (typically) use a Docker VM interactively. What you will do
instead is write down in a text file the specifications of the VM you want.
Let’s call this text file a *Dockerfile*. For example, you want the VM to be
based on Ubuntu. So that would be the first line in the Dockerfile. You then
want it to have R installed. So that would be the second line. Then you need to
install R packages, so you add those lines as well. Maybe you need to add some
system dependencies? Add them. Finally, you add the code of the pipeline that
you want to make reproducible as well.
Once you’re done, you have this text file, the Dockerfile, with a complete
recipe to generate a Docker VM. That VM is called an *image* (as I said
previously it’s technically not a true VM, but let’s not discuss this). So you
have a text file, and this file helps you define and generate an image. Here,
you should already see a first advantage of using Docker over a more traditional
VM solution like VirtualBox: you can very easily write these Dockerfiles and
version them. You can easily start off from another Dockerfile from another
project and adapt it to your current pipeline. And most importantly, because
everything is written down, it’s reproducible (but more on that at the end of
this chapter...).
Ok, so you have this image. This image will be based on some Linux distribution,
very often Ubuntu. It comes with a specific version of Ubuntu, and you can add
to it a specific version of R. You can also download a specific version of all
the packages required for your pipeline. You end up with an environment that is
tailor-made for your pipeline. You can then run the pipeline with this Docker
image, and *always get exactly the same results, ever*. This is because,
regardless of how, where or when you will run this *dockerized* pipeline, the
same version of R, with the same version of R packages, on the same Linux
distribution will be used to reproduce the results of your pipeline. By the way,
when you run a Docker image, that is, when you execute your pipeline using that
image definition, the running instance is referred to as a Docker container.
So: a Dockerfile defines a Docker image, from which you can then run containers.
I hope that the pictures below will help. The first picture shows what happens
when you run the same pipeline on two different R versions and two different
operating systems:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/without_docker.png"
alt="Running a pipeline without Docker results (potentially) in different outputs."></img>
<figcaption>Running a pipeline without Docker results (potentially) in different outputs.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F}
#| fig-cap: "Running a pipeline without Docker results (potentially) in different outputs."
knitr::include_graphics("images/without_docker.png")
```
:::
Take a close look at the output: you will notice it’s different!
Now, you run the same pipeline, which is now *dockerized*:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/with_docker.png"
alt="Running a pipeline with Docker results in the same outputs."></img>
<figcaption>Running a pipeline with Docker results in the same outputs.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F}
#| fig-cap: "Running a pipeline with Docker results in the same outputs."
knitr::include_graphics("images/with_docker.png")
```
:::
Another way of looking at a Docker image: it’s an immutable sandbox, where the
rules of the game are always the same. It doesn’t matter where or when you run
this sandbox, the pipeline will always be executed in this same, well-defined
space. Because the pipeline runs on the same versions of R (and packages)
and on the same operating system defined within the Docker image, our pipeline
is now truly reproducible.
But why Linux though; why don’t we base our Docker images on Windows or macOS?
(Windows-based images do exist, but they only run on Windows hosts.) Remember
in the introduction, where I explained what reproducibility is? I wrote:
> Open source is a hard requirement for reproducibility.
Open source is not just a requirement for the programming language used for
building the pipeline but extends to the operating system that the pipeline
runs on as well. So the reason Docker uses Linux is that you can use Linux
distributions like Ubuntu for free and without restrictions. There aren’t any
licenses that you need to buy or activate, and Linux distributions can be
customized for any use case imaginable. Thus Linux distributions are the only
practical option for this task.
## A primer on Linux
Up until this point, you could have followed along using any operating system.
Most of the code shown in this book is R code, so it doesn’t matter on what
operating system you’re running it. But there was some code that I ran in the
Linux console (for example, I’ve used `ls` to list files). These commands should
also work on macOS, but on Windows, I told you to run them in the Git Bash
terminal instead. This is because `ls` (and other such commands) don’t work in
the default Windows command prompt (although they should work in PowerShell).
Instead of using the terminal (or Git Bash) to navigate your computer’s file
system, you could have followed along using the user interface of your operating
system as well. For example, in [Chapter -@sec-packages], I list the contents
of the `dev/` directory using the following command:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ ls dev/
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ ls dev/
```
:::
but you could have just opened the `dev/` folder in the file explorer of your
operating system of choice. But to use Docker, you will need to get to know
Linux and the Linux ecosystem and concepts a little bit. No worries, it’s not as
difficult as it sounds, and I think that you likely aren’t afraid of difficult
things, or else you would have stopped reading this book much earlier.
*Linux* is not the name of any one specific operating system, but of an
operating system kernel. A kernel is an important component of an operating
system. Linux is free and open source, and among the most successful free and
open-source projects ever. Because its license allows (and encourages) re-use,
anyone can take that kernel, and add all the other components needed to build a
complete operating system and release the finished product. This is why there
are many *Linux distributions*: a Linux distribution is a complete operating
system that uses Linux as its kernel. The most popular Linux distribution is
called Ubuntu, and if you ever googled something along the lines of "easy
linux os for beginners", the answer that came out on top was likely Ubuntu, or
one of the other variants of Ubuntu (yes, because Ubuntu itself is also
open-source and free software, it is possible to build a variant using Ubuntu as
a basis, like Linux Mint).
To define our Docker images, we will be using Ubuntu as a base. The Ubuntu
operating system has two releases a year, one in April and one in October. On
even years, the April release is a long-term support (LTS) release. LTS releases
get security updates for 5 years, and Docker images generally use an LTS release
as a base. As of writing (May 2023), the current LTS is Ubuntu 22.04 *Jammy
Jellyfish* (Ubuntu releases are named with a number of the form YY.MM and then a
code name based on some animal).
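The naming rule can be written down as a tiny check. This is only a sketch of the rule as stated above (an April release in an even year is an LTS), and the function name `ubuntu_release_kind` is made up for the illustration:

```bash
# Classify an Ubuntu version string of the form YY.MM using the rule above:
# April (04) releases in even years are long-term support releases
ubuntu_release_kind() {
  local year="${1%%.*}"   # part before the dot, e.g. 22
  local month="${1##*.}"  # part after the dot, e.g. 04
  # 10# forces base 10, so that a year like 08 is not read as octal
  if [ "$month" = "04" ] && [ $(( 10#$year % 2 )) -eq 0 ]; then
    echo "LTS"
  else
    echo "interim"
  fi
}

ubuntu_release_kind 22.04   # Jammy Jellyfish
ubuntu_release_kind 22.10
ubuntu_release_kind 23.04
```

Only the first call prints `LTS`; the other two releases are regular (interim) releases.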
If you want, you can install Ubuntu on your computer. But there’s no need for
this, since you can use Docker to ship your projects!
A major difference between Ubuntu (and other Linux distributions) and macOS and
Windows is how you install software. In short, software for Linux distributions
is distributed as packages. If you want to install, say, the Emacs text editor,
you can run the following command in the terminal:
```bash
sudo apt-get install emacs-gtk
```
Let’s break this down: `sudo` makes the command that follows run as root. *root* is
Linux jargon for the administrator account. So if I type `sudo xyz`, the command
`xyz` will run with administrator privileges. Then comes `apt-get install`.
`apt-get` is Ubuntu’s package manager, and `install` is the command that
installs `emacs-gtk`. `emacs-gtk` is the name of the Emacs package. Because
you’re an R user, this should be somewhat familiar: after all, extensions for R
are also installed using a package manager and a command:
`install.packages("package_name")`. Just like in R, where the packages get
downloaded from CRAN, Ubuntu downloads packages from a repository which you can
browse
[here](https://packages.ubuntu.com/jammy/)^[https://packages.ubuntu.com/jammy/].
Of course, because using the command line is intimidating for beginners, it is
also possible to install packages using a software store, just like on macOS or
Windows. But remember, Docker only uses what it absolutely needs to function, so
there’s no interactive user interface. This is not because Docker’s developers
don’t like user interfaces, but because the point of Docker is not to use Docker
images interactively, so there’s no need for the user interface. So you need to
know how to install Ubuntu packages with the command line.
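Since there is nobody in front of the terminal while an image is being built, it also helps to know how to check from the command line whether a package is already there. The sketch below relies on `dpkg-query`, the query tool of the package database that `apt-get` works with; the helper name `package_installed` is made up, and on a system without `dpkg-query` it simply reports that the package is not installed:

```bash
# Return success only if the package database reports the package as
# fully installed; fails quietly when dpkg-query is not available
package_installed() {
  command -v dpkg-query >/dev/null 2>&1 || return 1
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q "install ok installed"
}

# "emacs-gtk" is the package from the example above
if package_installed emacs-gtk; then
  echo "emacs-gtk is installed"
else
  echo "emacs-gtk is not installed (or this is not a Debian-based system)"
fi
```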
Just like for R, it is possible to install software from different sources. It
is possible to add different repositories, and install software from there. We
are not going to use this here, but just as an aside, if you are using Ubuntu on
your computer as your daily driver operating system, you really should check out
[r2u](https://github.com/eddelbuettel/r2u)^[https://github.com/eddelbuettel/r2u],
an Ubuntu repository that comes with pre-compiled R packages that can get
installed very, very quickly. Even though we will not be using this here (and
I’ll explain why later in this chapter), you should definitely consider `r2u` to
provide binary R packages if you use Ubuntu as your daily operating system.
Let’s suppose that you are using Ubuntu on your machine, and are using R. If you want
to install the `{dplyr}` R package, something interesting happens when you type:
```{r, eval = F}
install.packages("dplyr")
```
On Windows and macOS, a compiled binary gets downloaded from CRAN and installed
on your computer. A "binary" is the compiled source code of the package. Many R
packages come with C++ or Fortran code, and this code cannot be used as is by R.
So these bits of C++ and Fortran code need to be compiled to be used. Think of
it this way: if the source code is the ingredients of a recipe, the compiled
binary is the cooked meal. Now imagine that each time you want to eat
Bouillabaisse, you have to cook it yourself... or you could get it delivered to
your home. You’d probably go for the delivery (especially if it would be free)
instead of cooking it each time. But this supposes that there are people out there
cooking Bouillabaisse for you. CRAN essentially cooks the package source codes
into binaries for Windows and macOS, as shown below:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/tidyverse_packages.png"
alt="Download links to pre-compiled tidyverse binaries."></img>
<figcaption>Download links to pre-compiled tidyverse binaries.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F}
#| fig-cap: "Download links to pre-compiled tidyverse binaries."
knitr::include_graphics("images/tidyverse_packages.png")
```
:::
In the image above, you can see links to compiled binaries of the `{tidyverse}`
package for Windows and macOS, but none for any Linux distribution. This is
because, as stated in the introduction, there are many, many, many Linux
distributions. So at best, CRAN could offer binaries for Ubuntu, but Ubuntu is
not the only Linux distribution, and Ubuntu has two releases a year, which means
that every CRAN package (that needs compilation) would need to get compiled
twice a year. This would be a huge undertaking, unless CRAN decided to only
offer binaries for LTS releases. But even then, every package would still need
to be recompiled every two years.
So instead, the burden of compilation is pushed to the user. Every time you
type `install.packages("package_name")` and that package requires compilation,
it gets compiled on your machine, which not only takes some time, but can also
fail. This is because R packages that require compilation very often need
additional system-level dependencies to be installed first. For example, here
are the Ubuntu dependencies that
need to be installed for the installation of the `{tidyverse}` package to
succeed:
```
libicu-dev
zlib1g-dev
make
libcurl4-openssl-dev
libssl-dev
libfontconfig1-dev
libfreetype6-dev
libfribidi-dev
libharfbuzz-dev
libjpeg-dev
libpng-dev
libtiff-dev
pandoc
libxml2-dev
```
This is why r2u is so useful: by adding this repository, what you’re essentially
doing is telling R to not fetch the packages from CRAN, but from the r2u
repository. And this repository contains compiled R packages for Ubuntu. So the
required system-level dependencies get installed automatically and the R package
doesn’t need compilation. So installation of the `{tidyverse}` package takes
less than half a minute on a modern machine.
But if r2u is so nice, why did I say above that we would not be using it?
Unfortunately, this is because r2u does not archive compiled binaries of older
packages, and this is exactly what we need for reproducibility. So when we
build a Docker image to make a project reproducible, because that image will
be based on Ubuntu, we will need to make sure that it contains the right
system-level dependencies so that compilation of the R packages doesn’t
fail. Thankfully, you’re reading the right book.
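As a rough sketch of where this is heading, the list of system-level dependencies of `{tidyverse}` shown earlier can be turned into one non-interactive install command, of the kind that could later go into a Docker image definition. The helper `apt_install_line` is made up for this illustration and only prints the command; it does not install anything:

```bash
# System-level dependencies of {tidyverse}, as listed earlier
deps="libicu-dev zlib1g-dev make libcurl4-openssl-dev libssl-dev
libfontconfig1-dev libfreetype6-dev libfribidi-dev libharfbuzz-dev
libjpeg-dev libpng-dev libtiff-dev pandoc libxml2-dev"

# Print (not run) one apt-get command that installs every package passed in;
# -y answers the confirmation prompt, which nobody can do during a build
apt_install_line() {
  echo "apt-get update && apt-get install -y $*"
}

# $deps is left unquoted on purpose so that it splits into one word per package
install_line=$(apt_install_line $deps)
echo "$install_line"
```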
## First steps with Docker
Let’s start by creating a "Hello World" Docker image. As I explained in the
beginning, to define a Docker image, we need to create a Dockerfile with some
instructions. But first, you need of course to install Docker. To install Docker
on any operating system (Windows, macOS or Ubuntu or other Linuxes), you can
install [Docker
Desktop](https://docs.docker.com/desktop/)^[https://docs.docker.com/desktop/].
If you’re running Ubuntu (or another Linux distribution) and don’t want the GUI,
you could install the [Docker
engine](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository)^[https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository]
and then follow the post-installation [steps for
Linux](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user)^[https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user]
instead.
In any case, whatever operating system you’re using, we will be using the
command line to interact with Docker. Once you’re done with installing Docker,
create a folder somewhere on your computer, and create inside of this folder an
empty text file with the name "Dockerfile". This can be tricky on Windows,
because you have to remove the `.txt` extension that gets added by default.
You might need to turn on the option "File name extensions" in the
`View` pane of the Windows file explorer to make this process easier. Then, open
this file with your favourite text editor, and add the following lines:
```
FROM ubuntu:jammy
RUN uname -a
```
This very simple Dockerfile does two things: it starts by stating that it’s
based on the Ubuntu Jammy (so version 22.04) operating system, and then runs the
`uname -a` command. This command, which gets run inside the Ubuntu command line,
prints information about the Linux kernel that the container runs on. Because
containers share the kernel of the machine that hosts them, this is the kernel
of your host (or of the Linux VM that Docker Desktop manages on Windows and
macOS), not a kernel that ships with that particular Ubuntu release. `FROM` and
`RUN` are Dockerfile instructions; there are a couple of others that we will
discover a bit later. Now, what do you do with this Dockerfile? Remember, a
Dockerfile defines an image. So now, we need to build this image to run a
container. Open a
terminal/command prompt in the folder where the Dockerfile is and type the
following:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ docker build -t raps_hello .
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ docker build -t raps_hello .
```
:::
The `docker build` command builds an image from the Dockerfile that is in the
path `.` (a single `.` means "this current working directory"). The `-t` option
tags that image with the name `raps_hello`. If everything goes well, you
should see this output:
```bash
Sending build context to Docker daemon 2.048kB
Step 1/2 : FROM ubuntu:jammy
---> 08d22c0ceb15
Step 2/2 : RUN uname -a
---> Running in 697194b9a519
Linux 697194b9a519 6.2.6-1-default #1 SMP PREEMPT_DYNAMIC
Mon Mar 13 18:57:27 UTC 2023 (fa1a4c6) x86_64 x86_64 x86_64 GNU/Linux
Removing intermediate container 697194b9a519
---> a0ea59f23d01
Successfully built a0ea59f23d01
Successfully tagged raps_hello:latest
```
Look at `Step 2/2`: you should see the output of the `uname -a` command:
```bash
Linux 697194b9a519 6.2.6-1-default #1 SMP PREEMPT_DYNAMIC
Mon Mar 13 18:57:27 UTC 2023 (fa1a4c6) x86_64 x86_64 x86_64 GNU/Linux
```
Every `RUN` statement in the Dockerfile gets executed at build time: so this is
what we will use to install R and needed packages. This way, once the image
is built, we end up with an image that contains all the software we need.
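For readers who prefer to script these first steps (which also sidesteps the `.txt` extension problem on Windows when run from Git Bash), here is a minimal sketch. The folder name `raps_hello_demo` is an arbitrary choice made for this example, and the build step only runs if a `docker` command is found on the machine:

```bash
# Create a scratch folder and write the two-line Dockerfile into it
demo_dir="raps_hello_demo"
mkdir -p "$demo_dir"
cat > "$demo_dir/Dockerfile" <<'EOF'
FROM ubuntu:jammy
RUN uname -a
EOF

# Build only if Docker is installed; otherwise just report what was written
if command -v docker >/dev/null 2>&1; then
  docker build -t raps_hello "$demo_dir" || echo "build failed (is Docker running?)"
else
  echo "docker not found, wrote $demo_dir/Dockerfile only"
fi
```

Passing the folder as the last argument of `docker build` plays the same role as the `.` used earlier, which pointed at the current working directory.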
Now, we would like to be able to use this image. Using a built image, we can
start one or several containers that we can use for whatever we want. Let's now
create a more realistic example. Let's build a Docker image that comes with R
pre-installed. But for this, we need to go back to our Dockerfile and change it
a bit:
```
FROM ubuntu:jammy
ENV TZ=Europe/Luxembourg
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update && apt-get install -y r-base
CMD ["R"]
```
First we define a variable using `ENV`, called `TZ` and we set that to the
`Europe/Luxembourg` time zone (you can change this to your own time zone). We
then run a rather complex looking command that sets the defined time zone
system-wide. We have to do all this because when we later install R, a
system-level dependency called `tzdata` gets installed alongside it. This tool
asks the user to enter a time zone interactively. But we cannot
interact with the image as it's being built, so the build process
gets stuck at this prompt. By using these two commands, we set the correct
time zone beforehand, so once `tzdata` gets installed, it doesn't ask for the
time zone anymore and the build process can continue. This is a rather
well-known issue when building Docker images based on Ubuntu, so the fix is
easily found with a Google search (but I'm giving it to you, dear reader, for
free).
Then comes another `RUN` statement. It first uses Ubuntu's package manager to
refresh the repositories (this ensures that our local Ubuntu installation
repositories are in sync with the latest software updates that were pushed to
the central Ubuntu repos), and then uses it to install
`r-base`. `r-base` is the package that installs R. We then finish this
Dockerfile with `CMD ["R"]`. This is the command that will be executed
when we run the container. Remember: `RUN` commands get executed at build-time,
`CMD` commands at run-time. This distinction will be important later on.
Let's build the image (this will take some time, because a lot of software gets
installed):
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ docker build -t raps_ubuntu_r .
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ docker build -t raps_ubuntu_r .
```
:::
This builds an image called `raps_ubuntu_r`. This image is based on Ubuntu
22.04 Jammy Jellyfish and comes with R pre-installed. But the version of R
that gets installed is the one made available through the Ubuntu repositories,
and as of writing that is version 4.1.2, while the latest version available is
R version 4.2.3. So the version available through the Ubuntu repositories lags
behind the actual release. But no matter, we will deal with that later.
We can now start a container with the following command:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ docker run raps_ubuntu_r
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ docker run raps_ubuntu_r
```
:::
And this is the output we get:
```r
Fatal error: you must specify '--save', '--no-save' or '--vanilla'
```
What is going on here? When you run a container, the command specified by `CMD`
gets executed, and then the container quits. So here, the container ran the
command `R`, but because no terminal was attached to the container, R started in
non-interactive mode. In that mode, R refuses to start unless it is told
whether or not to save the workspace when it quits.
This is what the message above is telling us. So, how can we use this? Is there
a way to use this R version interactively?
Yes, there is a way to use this R version boxed inside our Docker image
interactively, even though that's not really what we want to achieve. What we
want instead is that our pipeline gets executed when we run the container. We
don't want to mess with the container interactively. But let me show you how we
can interact with this dockerized R version. First, you need to let the
container run in the background. You can achieve this by running the following
command:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ docker run -d -it --name ubuntu_r_1 raps_ubuntu_r
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ docker run -d -it --name ubuntu_r_1 raps_ubuntu_r
```
:::
This runs the container that we name "ubuntu_r_1" from the image "raps_ubuntu_r"
(remember that we can run many containers from one single image definition).
Thanks to the option `-d`, the container runs in the background, and the option
`-it` keeps standard input open and attaches a terminal to the container. So the
R session started by `CMD` keeps running in the background, waiting for input,
instead of quitting immediately. You can now start a second, interactive R
session inside the running container using:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ docker exec -it ubuntu_r_1 R
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ docker exec -it ubuntu_r_1 R
```
:::
You should now see the familiar R prompt:
```r
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
```
Welcome to a dockerized version of R. Now, all of this might have felt overly
complicated to you. And of course if this is the first time that you have played
around with Docker, it is tricky indeed. However, you shouldn't worry too much
about it, for several reasons:
- we are not going to use Docker containers interactively, that's not really the point, but it can be useful to log in to the running container to check if things are working as expected;
- we will build our images on top of pre-built images from the [Rocker project](https://rocker-project.org/)^[https://rocker-project.org/] and these images come with a lot of software pre-installed and configuration taken care of.
What you should take away from this section is that you need to write a
Dockerfile which then allows you to build an image. This image can then be used
to run one (or several) containers. These containers, at run-time, will execute
our pipeline in an environment that is frozen, such that the output of this run
will stay constant, forever.
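The container that we started in the background earlier keeps running until we stop it. Here is a minimal sketch of the clean-up, assuming the container is still called `ubuntu_r_1` and the image `raps_ubuntu_r`. The guard simply skips everything if Docker or the container cannot be found.

```bash
container="ubuntu_r_1"  # the name given with --name earlier

if command -v docker >/dev/null 2>&1 && docker container inspect "$container" >/dev/null 2>&1; then
  docker ps --filter "name=$container"  # list running containers matching the name
  docker stop "$container"              # stop the background container
  docker rm "$container"                # remove the stopped container
  docker images raps_ubuntu_r           # the image itself is still there
else
  echo "Skipping: Docker or the container '$container' is not available."
fi
```

Removing a container does not touch the image it was started from, so a fresh container can be started from `raps_ubuntu_r` at any time.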
## The Rocker project
The Rocker project offers a very large collection of "R-ready" Docker images
that you can use as starting points for building your own Docker images. Before
using these images though, I still need to explain one very important Docker
concept. Let's go back to our "Hello World" Docker image:
```
FROM ubuntu:jammy
RUN uname -a
```
The very first line, `FROM ubuntu:jammy`, downloads an Ubuntu Jammy image, but
from where? All these images get downloaded from *Docker Hub*, which you can
browse [here](https://hub.docker.com/)^[https://hub.docker.com/]. If you create
an account you can even push your own images there. For example, we could push
the image we built before, which we called `raps_ubuntu_r`, to Docker Hub. Then,
if we wanted to create a new Docker image that builds upon `raps_ubuntu_r` we
could simply type `FROM username/raps_ubuntu_r` (or something similar).
It's also possible to not use Docker Hub at all, and share the image you built
as a file. I’ll explain how later.
The Rocker project offers many different images, which are described
[here](https://rocker-project.org/images/)^[https://rocker-project.org/images/].
We are going to be using the *versioned* images. These are images that ship
specific versions of R. This way, it doesn't matter when the image gets built:
the same version of R gets installed, because it is built from source. Let me
explain why building R from source is important. When we build the image from
the Dockerfile we wrote before, R gets installed from the Ubuntu repositories.
For this we use Ubuntu's package manager and the following command:
`apt-get install -y r-base`. As of writing, the version of R that gets installed
is version 4.1.2. There are two problems with installing R from Ubuntu's
repositories. First, we have to use whatever gets installed, which can be a
problem with reproducibility. If we ran our analysis using R version 4.2.1, then
we would like to dockerize that version of R. The second problem is that when we
build the image today we get version 4.1.2. But it is not impossible that if we
build the image in 6 months, we get R version 4.2.0, because it is likely that
the version that ships in Ubuntu's repositories will get updated at some point.
This means that depending on *when* we build the Docker image, we might get a
different version of R. There are only two ways of avoiding this problem. Either
we build the image once, archive it, and keep shipping that copy for as long as
we want the pipeline to be reproducible, just as we would ship the data, code
and any documentation required to make the pipeline reproducible. Or we write
the Dockerfile in such a way that it always produces the same image, regardless
of *when* it gets built. I very strongly advise you to go for the second option,
but to *also* archive the image. But of course, this also depends on how
critical your project is. Maybe
you don't need to start archiving images, or maybe you don't even need to make
sure that the Dockerfile always produces the same image. But I would still
highly recommend that you write your Dockerfiles in such a way that they always
output the same image. It is safer, and it doesn't really mean extra work,
thanks to the Rocker project.
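For the archiving option, here is a minimal sketch using `docker save` and `docker load`, assuming the image is called `raps_ubuntu_r`. The archive name is simply one I picked for the example. As before, the guard skips the commands if Docker or the image is missing.

```bash
image="raps_ubuntu_r"
archive="${image}.tar.gz"  # hypothetical file name for the archive

if command -v docker >/dev/null 2>&1 && docker image inspect "$image" >/dev/null 2>&1; then
  # Write the image, with all its layers, to a compressed tar archive
  docker save "$image" | gzip > "$archive"
  # Later, possibly on another machine: load the image back from the archive
  docker load < "$archive"
else
  echo "Skipping: Docker or the image '$image' is not available."
fi
```

The resulting file can be stored next to the data and the code of the project, and loading it gives back the exact image that was saved, whatever has happened to the repositories in the meantime.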
So, let's go back to the Rocker project, and specifically their *versioned*
images which you can find
[here](https://rocker-project.org/images/versioned/r-ver.html)^[https://rocker-project.org/images/versioned/r-ver.html].
When you use one of the versioned images as a base for your project, you get the
following guarantees:
- a fixed version of R that gets built from source. It doesn't matter *when* you build the image, it will always ship with the same version of R;
- the operating system will be the LTS release that was current when that specific version of R was current;
- the R repositories are set to the Posit Public Package Manager (PPPM) at a specific date. This ensures that R packages don't need to be compiled as PPPM serves binary packages for the amd64 architecture (which is the architecture that virtually all non-Apple computers use these days).
This last point requires some more explanation. You should know that versioned
Rocker images use the PPPM set at a specific date. This is a very neat feature
of the PPPM. For example, the versioned Rocker image that comes with R 4.2.2 has
the repos set at the 14th of March 2023, as you can see for yourself
[here](https://github.com/rocker-org/rocker-versioned2/blob/fb1d32e70061b0f978b7e35f9c68e2b79bafb69a/dockerfiles/r-ver_4.2.2.Dockerfile#L16)^[https://is.gd/fdrq4p].
This means that if you use `install.packages("dplyr")` inside a container
running from that image, then the version of `{dplyr}` that will get installed
is the one that was available on the 14th of March.
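If you want to see this for yourself, the sketch below asks R, inside a throw-away container, which repositories it is configured to use. It assumes that Docker is running and that the `rocker/r-ver:4.2.2` image can be pulled (this is a fairly large download if you don't have it yet). I would expect the printed URL to contain the snapshot date, but check the output rather than take my word for it.

```bash
image="rocker/r-ver:4.2.2"

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  # Print the package repositories configured inside the image
  docker run --rm "$image" R -e "getOption('repos')"
else
  echo "Skipping: Docker is not available."
fi
```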
This can be convenient in certain situations, and you may want, depending on
your needs, to use the PPPM set at a specific date to define Docker images, as
the Rocker project does. You could even set the PPPM at a specific date for your
main development machine (just follow the instructions
[here](https://packagemanager.rstudio.com/client/#/repos/2/overview)^[https://is.gd/jbdTKC]).
But keep in mind that you will not be getting any updates to packages, so if you
want to install a fresh version of a package that may introduce some nice new
features, you'll need to change the repos again. This is why I highly advise you
to stay with your default repositories (or use r2u if you are on Ubuntu) and
manage your projects' package libraries using `{renv}`. This way, you don't have
to mess with anything, and have the flexibility to have a separate package library
per project. The other added benefit is that you can then use the project's `renv.lock`
file to install the exact same package library inside the Docker image.
As a quick introduction to using Rocker images, let's grab our pipeline's
`renv.lock` file which you can download from
[here](https://raw.githubusercontent.com/rap4all/housing/980a1b0cd20c60a85322dbd4c6da45fbfcebd931/renv.lock)^[https://is.gd/5UcuxW].
This is the latest `renv.lock` file that we generated for our pipeline. It
contains all the packages needed to run our pipeline, including the right
versions of the `{targets}` package and the `{housing}` package that we
developed. An important remark: it doesn't matter if the `renv.lock` file
contains packages that were released after the 14th of March. Even if the
repositories inside the Rocker image that we will be using are set to that date,
the lock file also specifies the URL of the right repository to download the
packages from. So that URL will be used instead of the one defined for the
Rocker image.
Another useful aspect of the `renv.lock` file is that it also records the R
version that was used to originally develop the pipeline, in this case, R
version 4.2.2. So that's the version we will be using in our Dockerfile. Next,
we need to check the version of `{renv}` that we used to build the `renv.lock`
file. You don't necessarily need to install the same version, but I recommend
you do. For example, as I'm writing these lines, `{renv}` version 0.17.1 is
available, but the `renv.lock` file was written by `{renv}` version 0.16.0. So
to avoid any compatibility issues, we will also install the exact same version.
Thankfully, that is quite easy to do (to check the version of `{renv}` that was
used to write the lock file simply look for the word "renv" in the lock file).
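As a rough sketch of that check, the snippet below looks up both version numbers from the command line. The file `renv_example.lock` is a made-up, heavily shortened stand-in that the snippet writes itself. With the real file, only the two `grep` pipelines are needed. They assume the usual layout of a lock file, in which the `"Version"` field follows closely after the `"R"` and `"renv"` keys.

```bash
# Hypothetical, heavily shortened stand-in for a real renv.lock file
cat > renv_example.lock <<'EOF'
{
  "R": {
    "Version": "4.2.2",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cloud.r-project.org" }
    ]
  },
  "Packages": {
    "renv": {
      "Package": "renv",
      "Version": "0.16.0",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
EOF

# R version: the "Version" line right after the "R" key
r_version=$(grep -A 1 '"R": {' renv_example.lock | grep '"Version"' |
  sed 's/.*"Version": "\([^"]*\)".*/\1/')

# renv version: the "Version" line within two lines of the "renv" key
renv_version=$(grep -A 2 '"renv": {' renv_example.lock | grep '"Version"' |
  sed 's/.*"Version": "\([^"]*\)".*/\1/')

echo "R ${r_version}, renv ${renv_version}"
```

These two numbers are the ones that end up in the `FROM` line and in the call that installs `{renv}` in the Dockerfile.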
While `{renv}` takes care of installing the right R packages, it doesn’t take
care of installing the right system-level dependencies. So that’s why we need to
install these system-level dependencies ourselves. I will give you a list of
system-level dependencies that you can install to avoid any issues below, and I
will also explain to you how I was able to come up with this list. It is quite
easy thanks to Posit and their PPPM. For example,
[here](https://packagemanager.rstudio.com/client/#/repos/2/packages/tidyverse)^[https://is.gd/ZaXHwa]
is the summary page for the `{tidyverse}` package. If you select "Ubuntu 22.04
(Jammy)" on the top right, and then scroll down, you will see a list of
dependencies that you can simply copy and paste into your Dockerfile:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/tidyverse_jammy_deps.png"
alt="System-level dependencies for the {tidyverse} package on Ubuntu."></img>
<figcaption>System-level dependencies for the {tidyverse} package on Ubuntu.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F}
#| fig-cap: "System-level dependencies for the {tidyverse} package on Ubuntu."
knitr::include_graphics("images/tidyverse_jammy_deps.png")
```
:::
We will use this list to install the required dependencies for our pipeline.
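If you collect dependencies for several packages, you will quickly end up with duplicates. Here is a minimal sketch for tidying up such a list. It assumes that the copied lines look like `apt-get install -y <package>`, and the four sample lines are made up for the example.

```bash
# Hypothetical sample of copied install lines (with a duplicate on purpose)
deps_raw='apt-get install -y libcurl4-openssl-dev
apt-get install -y libxml2-dev
apt-get install -y libcurl4-openssl-dev
apt-get install -y libpng-dev'

# Keep only the package name (the last field), sort, and drop duplicates
deps=$(printf '%s\n' "$deps_raw" | awk '{print $NF}' | sort -u)

# Print one RUN instruction with one package per line
printf 'RUN apt-get update && apt-get install -y'
printf ' \\\n    %s' $deps
printf '\n'
```

The printed `RUN` instruction can be pasted as is into a Dockerfile. Leaving `$deps` unquoted in the second `printf` is deliberate: the shell splits it into one argument per package, and `printf` repeats its format for each of them.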
Create a new folder and call it whatever you want and save the `renv.lock` file
linked above inside of it. Then, create an empty text file and call it
Dockerfile. Add the following lines:
```
FROM rocker/r-ver:4.2.2
RUN apt-get update && apt-get install -y \
libglpk-dev \
libxml2-dev \
libcairo2-dev \
libgit2-dev \
default-libmysqlclient-dev \
libpq-dev \
libsasl2-dev \
libsqlite3-dev \
libssh2-1-dev \
libxtst6 \
libcurl4-openssl-dev \
libharfbuzz-dev \
libfribidi-dev \
libfreetype6-dev \
libpng-dev \
libtiff5-dev \
libjpeg-dev \
libxt-dev \
unixodbc-dev \
wget \
pandoc
RUN R -e "install.packages('remotes')"
RUN R -e "remotes::install_github('rstudio/renv@0.16.0')"
RUN mkdir /home/housing
COPY renv.lock /home/housing/renv.lock
RUN R -e "setwd('/home/housing');renv::init();renv::restore()"
```
The first line states that we will be basing our image on the image from the
Rocker project that ships with R version 4.2.2, which is the right version that
we need. Then, we install the required system-level dependencies using Ubuntu’s
package manager, as previously explained. Then comes the `{remotes}` package.
This will allow us to download a specific version of `{renv}` from GitHub,
which is what we do in the next line. I want to stress again that I do this
simply because the original `renv.lock` file was generated using `{renv}`
version 0.16.0 and so to avoid any potential compatibility issues, I also use
this one to restore the required packages for the pipeline. But it is very
likely that I could have installed the current version of `{renv}` to restore
the packages, and that it would have worked without problems. (Note that for
later versions of `{renv}`, you may need to insert a 'v' before the version
number that follows the `@`.) But just to be on
the safe side, I install the right version of `{renv}`. By the way, I knew how
to do this because I read [this
vignette](https://rstudio.github.io/renv/articles/docker.html)^[https://rstudio.github.io/renv/articles/docker.html]
that explains all these steps (but I’ve only kept the absolute essential lines
of code to make it work). Next comes the line `RUN mkdir /home/housing`, which
creates a folder (`mkdir` stands for *make directory*), inside the Docker image,
in `/home/housing`. On Linux distributions, `/home/` is the directory that users
use to store their files. It already exists in the image, so I only create a new
folder inside of it, `housing`, which will contain the files for my project. It
doesn’t really matter if you keep that structure or not, you could put the
project folder somewhere else if you wanted. What matters is that you put the
files where you can find them.
Next comes `COPY renv.lock /home/housing/renv.lock`. This copies the `renv.lock`
file from our computer (remember, I told you to save this file next to the
Dockerfile) to `/home/housing/renv.lock`. By doing this, we include the
`renv.lock` file inside of the Docker image which will be crucial for the next
and final step:
`RUN R -e "setwd('/home/housing');renv::init();renv::restore()"`.
This runs the `R` program from the Linux command line with the option `-e`. This
option allows you to pass an `R` expression to the command line, which needs to
be written between `""`. Using `R -e` will quickly become a habit, because this
is how you can run R non-interactively, from the command line. The expression we
pass sets the working directory to `/home/housing`, and then we use
`renv::init()` and `renv::restore()` to restore the packages from the
`renv.lock` file that we copied before. Using this Dockerfile, we can now build
an image that will come with R version 4.2.2 pre-installed as well as all the
same packages that we used to develop the `housing` pipeline.
Build the image using `docker build -t housing_image .` (don’t forget the `.` at
the end).
The build process will take some time, so I would advise you to go get a hot
beverage in the meantime. Now, we have done half the work: we have an environment that
contains the required software for our pipeline, but the pipeline files
themselves are missing. But before adding the pipeline itself, let’s see if the
Docker image we built is working. For this, log in to a command line inside a
running Docker container started from this image with this single command:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ docker run --rm -it --name housing_container housing_image bash
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ docker run --rm -it --name housing_container housing_image bash
```
:::
This starts `bash` (Ubuntu’s command line) inside the `housing_container` that
gets started from the `housing_image` image. We add the `--rm` flag to `docker
run`, this way the Docker container gets removed once we log out (if not, the
stopped container would stay on disk until we remove it ourselves). Once logged
in, we can move to the project’s folder using:
::: {.content-hidden when-format="pdf"}
```bash
user@docker ➤ cd /home/housing
```
:::
::: {.content-visible when-format="pdf"}
```bash
user@docker $ cd /home/housing
```
:::
and then start the R interpreter:
::: {.content-hidden when-format="pdf"}
```bash
user@docker ➤ R
```
:::
::: {.content-visible when-format="pdf"}
```bash
user@docker $ R
```
:::
If everything goes well, you should see the familiar R prompt with a message
from `{renv}` at the end:
```r
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
* Project '/home/housing' loaded. [renv 0.16.0]
```
Try to load the `{housing}` package with `library("housing")`. This should work
flawlessly!
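The same check can be done without logging in at all. Here is a minimal sketch, assuming the image is called `housing_image` and the project lives in `/home/housing` as above. The `-w` flag of `docker run` sets the working directory, so that R starts inside the project folder and picks up the `.Rprofile` that `renv::init()` wrote there.

```bash
image="housing_image"

if command -v docker >/dev/null 2>&1 && docker image inspect "$image" >/dev/null 2>&1; then
  # Start R in the project folder, load the package, then exit;
  # --rm removes the container once R is done
  docker run --rm -w /home/housing "$image" R -e 'library("housing")'
else
  echo "Skipping: Docker or the image '$image' is not available."
fi
```

If the package cannot be loaded, R stops with an error and `docker run` returns a non-zero exit status, which makes this kind of one-liner handy in scripts.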
## Dockerizing projects
So we now have a Docker image that has the right environment for our project. We
can now dockerize the project itself. There are two ways to do this: either we
simply add the required lines to our Dockerfile, meaning copying the
`_targets.R` script to the Docker image at build time and then using
`targets::tar_make()` to run the pipeline, or we create a new Dockerfile
that will build upon this image and add the required lines there. In this
section, we will use the first approach, and in the next section, we will use
the second. The advantage of the first approach is that we have a single
Dockerfile, and everything we need is right there. Also, each Docker image is
completely tailor-made for each project. The issue is that building takes some
time, so if for every project we restart from scratch it can be tedious to have
to wait for the build process to be done (especially if you use continuous
integration, as we shall see in the next chapter).
The advantage of the second approach is that we have a base that we can keep
using for as long as we want. You will only need to wait once for R and the
required packages to get installed. Then, you can use this base for any project
that requires the same version of R and packages. This is especially useful if
you don’t update your development environment very often, and develop a lot of
projects with it.
In summary, the first approach is "dockerize pipelines", and the second approach
is "dockerize the dev environment and use it for many pipelines". It all depends
on how you work: in research, you might want to go for the first approach, as