%----------------------------------------------------------------------------------------
% PACKAGES AND OTHER DOCUMENT CONFIGURATIONS
%----------------------------------------------------------------------------------------
\documentclass[12pt]{report}
% Default font size is 12pt, it can be changed here
\usepackage[none]{hyphenat}
\usepackage{geometry} % Required to change the page size to A4
\geometry{a4paper} % Set the page size to be A4 as opposed to the default US Letter
\usepackage{graphicx} % Required for including pictures
\usepackage{listings}
\usepackage{cleveref}
\crefname{subsection}{subsection}{subsections}
\usepackage{tikz}
\usepackage{float} % Allows putting an [H] in \begin{figure} to specify the exact location of the figure
\usepackage{wrapfig} % Allows in-line images such as the example fish picture
\usepackage{lipsum} % Used for inserting dummy 'Lorem ipsum' text into the template
\linespread{1.2} % Line spacing
%\setlength\parindent{0pt} % Uncomment to remove all indentation from paragraphs
\graphicspath{{Pictures/}} % Specifies the directory where pictures are stored
\setlength\parindent{0pt}
\begin{document}
%----------------------------------------------------------------------------------------
% TITLE PAGE
%----------------------------------------------------------------------------------------
\begin{titlepage}
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}} % Defines a new command for the horizontal lines, change thickness here
\center % Center everything on the page
\textsc{\LARGE Vrije Universiteit Brussel}\\[1.5cm]
\textsc{\Large Operating Systems and Security}\\[0.5cm]
\textsc{\large Mini-project essay}\\[0.5cm]
\HRule \\[0.4cm]
{ \huge \bfseries High availability and scalability on GNU/Linux clusters}\\[0.4cm] % Title of your document
\HRule \\[1.5cm]
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
Pieter Libin % Your name
\end{flushleft}
\end{minipage}
~
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Professor:} \\
Prof. Dr. Timmerman\\
\emph{Assistant:} \\
Mr. Fayyad-Kazan
% Supervisor's Name
\end{flushright}
\end{minipage}\\[4cm]
{\large \today}\\[3cm] % Date, change the \today to a set date if you want to be precise
%\includegraphics{Logo}\\[1cm] % Include a department/university logo - this will require the graphicx package
\vfill % Fill the rest of the page with whitespace
\end{titlepage}
%----------------------------------------------------------------------------------------
% TABLE OF CONTENTS
%----------------------------------------------------------------------------------------
\tableofcontents % Include a table of contents
\newpage % Begins the essay on a new page instead of on the same page
% as the table of contents
\chapter*{Preface}
\section{Source code}
When I reference source code files, they will always reside in the src
directory. This directory can be found in this public github.com
project: https://github.com/plibin-vub/OSSEC-mini-project.
\section{Videos}
For some demonstrations I created videos; I've put these videos on
YouTube and reference them in this document via their YouTube URLs.
\section{Naming conventions}
Server names are always expressed using single quotes (e.g.: 'node1').
\section{Presentation}
Before the presentation, I will hand in the following deliverables:
\begin{itemize}
\item this final report
\item the final presentation
\item a directory with all the edited videos
\item a directory with all pictures that I used in the report
\item source code (the src directory on github)
\item the base Ubuntu virtual machine, which I used to derive all
other virtual machines from (root-password=``empty'', root-user=``plibin'')
\end{itemize}
During the presentation, I will present some videos that depict
experiments that I executed (most of the videos in the deliverable).
\chapter{Introduction} % Major section
\section{Abstract}
Nowadays, several web applications exist (Facebook, Wikipedia, Twitter, ...) that serve millions of users concurrently. To achieve this, huge clusters are in place, where each cluster node serves a large number of users. In order to assure the availability of such applications, it is required that whenever a cluster node crashes, the system remains responsive and handles subsequent requests via another cluster node.\\
The number of users of such web applications can vary greatly over different time intervals (e.g.: Twitter tends to be under heavy load upon the arrival of significant events). Also, when an application gains in popularity, the number of users grows quickly in a short timespan. This requires an application to be able to scale the number of users it can serve.\\
This project situates these concepts in the general ICT landscape and provides a detailed analysis of the importance of availability and scalability provisions on GNU/Linux server clusters.\\
The project identifies a set of technologies that are available on the GNU/Linux platform to set up a system that can assure high availability and support proper scalability for both computational and storage nodes. The selected technologies will be tested in a simulated cluster setting.\\
In the last phase of the project, a high-availability and scalable
cluster simulation is constructed that mimics the functionality of
Wikipedia.
\section{Interest}
A large share of today's software applications runs on remote servers, and
this share will most likely grow even more in the near future. In
order to allow users to have a good experience, it is vital that the
application remains available at all times. For applications that are only
used by a handful of users, this might be an easy task by just
providing a redundant server, however, for applications that are used
by thousands (or millions) of concurrent users, this becomes
quite complex.\\
Another aspect is that such applications need to be able to scale; a
sudden rise in popularity may cause an application to be used by an
x-fold of users from one day to the next.\\
These considerations of high availability and scalability interest me
greatly because they have a huge impact on how applications are
hosted, which in turn determines how such applications are to be
implemented.\\
In my opinion, having an overview and understanding of the technologies and
architectures that high-availability and scalability setups use,
is a big asset for a computer scientist.\\
I have always been fascinated by large data centers;
while the hardware of such setups is no longer my main interest (which
leans more towards software these days), I remain intrigued by how
the different components in such setups interact.
\section{Relevance}
Many huge companies (Google, Facebook, Microsoft, ...) and
organisations (Wikipedia, ...) depend on available and scalable
applications \footnote{And the number of companies that operate in the
``Cloud'' seems to grow every day}. This makes
high-availability and scalability solutions economically relevant.
Also in the High Performance Computing domain, high-availability and
scalability solutions become more and more important for operating huge
research clusters. Such research clusters are essential for scientists
and companies to do research.
\chapter{General scope of high-availability and scalability}
This project will focus on GNU/Linux clusters, however,
the aspect of high-availability and scalability are not limited to
this domain. In this chapter, I present a short overview of the
importance of these aspects in other domains.
\section{Airplane/car/spacecraft software}
\subsection{On-board computers}
Cars and planes that are released today are operated by millions of lines of code
\cite{ieee_cars_and_planes}. It is of course vital in such devices
that there is a high level of availability; important systems (brakes,
steering, ...) should be available in a redundant way. \\
Another important aspect is that failure should be detected as soon as
possible; if certain parts of a car break down without the vehicle being aware of this, a very dangerous
situation can arise.\\
For spacecraft, the same rules apply when considering passenger
safety. However, unmanned flights too, such as the Mars exploration missions,
need to consider an extra dimension of availability. These vehicles
can only be controlled by on-board computers or via a satellite
up-link (where signals take 1 to 1.5 hours \cite{mars_rover} to go
from one end to the other). These vehicles have a
huge development and production cost, which is an extra motivation to
keep the system running as long as physically possible.
\subsection{Computer Aided Design}
The design of vehicles can be assisted by several software
solutions; these software packages require a lot of computational
power and can benefit from the use of High Performance Clusters
\cite{hpc_cars}.
\section{Healthcare}
\subsection{Patient monitoring}
For patient monitoring equipment, it is vital that these devices
report problems as soon as possible. Usually these devices can be
quickly replaced with a working unit, but not knowing that a device is
malfunctioning might cost patients' lives.\\
\subsection{Drug development}
For drug development both
academic researchers and pharmaceutical company staff use High
Performance Clusters intensively to study the effects of drugs
\cite{deforche2007bayesian} \cite{lengauer2006bioinformatics}.
\section{Desktop systems}
The stability of desktop computers has improved a lot over the last decade,
mainly through the stabilisation of desktop operating systems. A major source of system
crashes in the past was buggy device drivers that made the entire
operating system crash.
Current desktop operating systems
(e.g.: Windows NT) are built to recover from such crashes
\cite{windows_nt_kernel}.
\section{Home entertainment systems}
These appliances (TVs, game consoles, media centers) are used via a limited interface (e.g.:
remote controls) and users expect them to work without any problems
(as they expected of their VCRs a couple of decades ago).\\
While system crashes might still be acceptable on the desktop, this is
definitely not the case with these kinds of devices.
\chapter{Different aspects in cluster availability}
\label{chap:aspects}
\section{Hardware}
\subsection{Servers}
Servers are the processing nodes in a cluster; they contain a CPU
(often more than one), memory (often a significant amount), one or
more network interfaces, a power supply (often 2 for redundancy) and
usually one or more hard disks.\\
Several server cases exist:
\subsubsection{Tower server}
A tower server is comparable to a desktop PC tower. It's cheap, but it is
an inconvenient format to use in cluster rooms.
\subsubsection{Rack server}
These servers are built as thin boxes that can be stacked on top of each other in a rack.
\subsubsection{Blade server}
A blade server is a large computer that is able to contain multiple
blades. A blade is essentially a slice containing memory, a CPU and a
disk. The power and networking are managed by the surrounding case.
Blade servers are convenient for cluster setups since they
allow (broken) blades to be replaced very easily.
\subsection{Network}
The network infrastructure connects the different elements in the
cluster and allows the application
that runs on the cluster to connect to the internet.\\
Several components are used to set up cluster networks:
\begin{itemize}
\item switches
\item routers
\item firewalls
\end{itemize}
\subsection{Storage}
Storage can be integrated in a server. Solutions exist for implementing storage that can be
integrated directly in the network:
\begin{itemize}
\item network attached storage
\item storage area network
\end{itemize}
There are also separate units that can be connected to a server
directly (with no network in between): direct-attached storage.
\subsubsection{Disk types}
Currently, two types of disks exist: hard disk drives (HDDs) and
solid state disks (SSDs). An HDD contains rotating platters and
stores data by magnetising parts of the disk. An SSD is fully
implemented in an integrated electronic circuit; it contains no moving
parts at all (comparable to a USB flash drive).\\
Overall, SSDs (which are only ``recently'' on the market) have much better
performance. However, traditional HDDs still offer a much better
price/MB ratio than SSDs.\\
Therefore the choice between HDDs and SSDs usually depends on the
problem you want to solve: quickly accessing a limited amount of data
vs storing vast amounts of data that are only accessed
periodically \footnote{The two systems are often used in combination,
so the technologies can complement each other}.
\section{Infrastructure}
\subsection{Sites}
Big companies such as Google, Facebook, Microsoft, ... have their own
sites to host their clusters. Smaller companies often start by hosting
their applications at a third party's site. Some of these third party
providers offer servers that can be completely administered by their
users (e.g.: rackspace \cite{rackspace}, Amazon EC2 \cite{amazon_ec2}), others provide cloud
platforms that allow the users to simply install an application
(e.g.: Google App Engine \cite{google_app_engine}).
\subsection{Location}
An important aspect to consider is the distance of
the cluster to the end users. When this distance is too big, users will suffer from the
network connection's latency. For example: between the US
and Europe, a latency of $\approx 90ms$ can be expected
\cite{verizon_latency}.
This is a significant number when we consider that a latency of 1/10th
of a second (100ms) already starts to feel slow for an end user
\cite{web_app_latency}.\\
Therefore it is beneficial to have the data center as close to the end
user as possible.\\
Another important aspect is the climate of the region where the data
center is located. Data centers tend to get hot, and cooling down the
data center can become an expensive (and carbon-unfriendly) issue.
A solution can be to move the data center to a colder region
\cite{datacenter_cold}.\\
Proximity to an energy provider is another factor that needs to be
considered (see \cref{energy_provider}).
\subsection{Buildings}
The buildings that host a data center need to take a set of precautions to
avoid or handle natural disasters such as floods, fires, earthquakes
and extreme weather.\\
The buildings should be well secured (physically), since quite often,
private data of end-users will reside on the servers in the
building.\\
Applying passive architectural techniques can allow
the data center to consume dramatically less power \cite{passive_architecture}.
\subsection{Energy providers}
\label{energy_provider}
Data centers might produce their own power on-site by
using solar panels or even building a dedicated
power plant.\\
It should be taken into consideration whether a country/region has
trustworthy energy providers (which can be a problem in developing
countries).\\
A company will negotiate a contract that guarantees the availability of power
within a reasonable degree of certainty.\\
The data center should be able to handle short power outages by using
UPSs. To deal with longer power outages, diesel-powered generators can be used.\\
Considering climate change, the cost of energy could rise
significantly over the coming years, rendering these aspects even more important.
\section{Internet Availability}
Most companies will work with a third party provider to set up a
connection to the Internet. \\
A company will negotiate a contract that guarantees the availability of Internet
within a reasonable degree of certainty.\\
\section{Storage}
A proper backup strategy should be in place. This strategy will depend
on the company's resources and the importance of the different
data sources.
Some general guidelines for backups (based on \cite{ha_book}):
\subsubsection{Mirroring does not replace backups}
If a file
is removed by human error, a mirrored solution will not be able to
restore the data.
Additionally, it is still possible that these disks will
fail at the same time
(for example when the power is interrupted).
\subsubsection{Backups are more often used to fix human errors
than to fix a disaster}
For this reason, it is important that restoring data can happen
quickly.
\subsubsection{Regularly test the restore procedure}
It is always possible that there is a hardware problem or some backup
rule is invalidated. To make sure this is not discovered in the case
of a disaster, it is important to regularly test both the hardware and
the restore procedures.
\subsubsection{Handle hardware (and tapes) with care}
A lot of backups are made on tapes; such cartridges are particularly
sensitive to dust and moisture, so it is important to store them
appropriately.
\subsubsection{Only use a tape as many times as the producer
guarantees it will work}
Tapes have an expiration date; make sure to stop using them after this
date. If you dispose of them, make sure to destroy them thoroughly,
since they might contain private data.
\subsubsection{Multiple locations}
If you can afford it, it is always better to distribute backups over
multiple physical locations. In case anything goes seriously wrong
in one of the data centers, it is still possible to recover from
this.
\section{Database systems}
The following two database technologies are mainly used in cluster
settings: relational database management systems and NoSQL systems.
\subsection{Relational database management systems}
This technology is based on the relational model \cite{codd}. It
allows data to be modelled as tables that represent entities. These
entities are then connected using relations (basically pointers from
one table to another).\\
It is possible to query the information in the model using the
declarative language SQL.\\
RDBMSs implement ACID guarantees
and transactions, two features which are very
useful for application developers, but which make scaling RDBMSs difficult.
\subsection{NoSQL databases}
The concept of NoSQL databases was introduced as a reaction to the
difficulties of scaling out RDBMSs. While RDBMSs store data using a relational
model, most NoSQL solutions simply act as a key-value store ($\approx$
a distributed dictionary data structure).\\
NoSQL databases are usually simpler in design and allow for easier horizontal
scaling \cite{wikipedia_nosql}.\\
However, since data is simply stored in a key-value store, joining
over multiple datasets (which can easily be done with RDBMSs) has to be
implemented on top of the NoSQL database (doing this might still
be slow \cite{nosql_problems}). \\
\section{Application crashes}
Software will never be perfect; therefore there will always be the possibility
that a program crashes. It is important that developers try to
create defensive software that shows an
appropriate error message rather than crashing.\\
That said, since we cannot build a high-availability cluster assuming
our software will not crash, we need to try to contain the crashes as
well as possible.\\
When software is implemented as a process that runs in the user space,
it cannot crash the entire operating system, but only that specific process (if
there are no bugs in the operating system). In my opinion this is
essential to ensure stability; if not, one error can ultimately
take down the entire cluster.\\
But still, if we have a web server that is written in C/C++, an
error in one of the application's threads can potentially
bring down the entire application server. Java applications can cope
better with these kinds of errors: when one application thread throws
an exception, the application server (Tomcat, JBoss, ...) can continue
to serve the other users.\\
In order to ensure that an error in a C/C++ program does not crash the entire
application (and all sessions that are tied to it), a process can be
assigned to each application session; however, running the application
like this will require more computational resources.\\
When one session of the application throws an error and the
application server can properly recover from it (by using one of the
techniques described in the previous paragraphs), we can show the
user an appropriate error message or try to continue this user's
session in another thread/process.\\
Another very important remark: when a crash happens, it is important
to properly log what went wrong and to inform an administrator about the
problem. This way the problem can be investigated and hopefully
fixed.
Because it is not always possible to inspect what is
going on on the production server, it is important for the logging to
be as complete as possible. This way developers and administrators can try to
reproduce the problem without the need to access the production system.
\section{Human error}
A lot of problems are introduced by human errors: bugs in
applications, bugs in operating systems, errors in architectures and
their implementation, ...\\
While it makes sense to have redundancy in hardware, it makes equal
sense to have redundancy in developers and system administrators.\\
For developers, it is vital to have peer reviews, or even pair
development techniques for the more complicated parts of a system.\\
System administrators often take decisions that can impact huge parts
of a cluster, so it would probably be a good idea to double-check such
decisions before actually applying them. This can be done by first
applying the changes on a test setup before pushing them to the
production cluster \footnote{It's vital for system administrators to
have a sandbox that allows them to perform realistic tests}.\\
It is also important that the knowledge of systems and processes is
well distributed over employees, so that in case an employee leaves,
the system can still be operated. Therefore it is also important
that architectures, procedures and system layouts are thoroughly
documented, and that this documentation is kept up to date.
\section{Operating systems}
GNU/Linux is very popular in cluster settings
\cite{server_market_share}.
The GNU/Linux operating system is known for its speed and stability.
It is however important to note that the kernel has a monolithic
design; while this makes it a fast kernel, it also allows device drivers
to crash the entire kernel (in extremis). \\
Since clusters will usually use reputable, well-tested hardware that has
good driver support for the Linux kernel, this is usually not a big
problem.\\
When an operating system does crash for some reason (a bug or a
hardware failure), this should be dealt with appropriately, for example
by restarting the cluster node that has crashed or (if the crashed
node cannot be started again) by starting a new node that will take
the crashed node's place.\\
When a crash happens, it is essential to investigate what went wrong,
so lessons can be learned and actions can be taken to avoid the
problem from happening again.
\section{Monitoring}
It is important to monitor the status of a cluster, so that actions can
be taken when something is wrong. On the other hand, we should also take
care not to overload the network by constantly polling nodes.\\
Two aspects are particularly important when monitoring:
\begin{itemize}
\item gathering usage statistics (CPU usage, network load, ...)
\item diagnosing problems in time, and acting on them
\end{itemize}
\section{Virtualization}
Virtualization is a technique that is used a lot in cluster
environments. It allows us to deal with server resources in a more
abstract fashion.\\
This allows certain actions to be automated more easily;
for example: restarting a node only requires invoking a program.\\
Furthermore, it allows server resources to be split up more easily; one
virtualised node can use part of the physical memory of an
underlying server, or only one of the set of CPUs the underlying server
has installed.\\
Virtualising hardware has some costs in terms of performance, however,
these days, several hardware provisions are in place to further
improve the performance of virtual machines (e.g.: Intel's Hardware-Assisted
Virtualization Technology \cite{intel_havt}, AMD Virtualization
\cite{amd_virt}).
\section{Different kinds of outages}
I will present an overview of the different kinds of outages that are
possible (based on \cite{ha_book}):
\subsection{Hardware}
Hardware can fail whether it is new or old. Especially prone to
defects are the moving parts in a computer: the HDDs, tape drives,
tape libraries and fans.\\
To extend the computer's lifetime, the machine needs to be properly
cooled \footnote{A sufficient amount of fans is required and the room needs to
be air conditioned}. \\
There should be as little dust in the room as possible, to avoid the
fans getting dirty (and eventually failing).\\
The lifetime of servers that constantly run under heavy load is
considered to be 3 years \cite{ms_cloud_cost}.
\subsection{Environmental and physical failure}
\subsubsection{Natural disasters}
\begin{itemize}
\item earthquakes
\item fires
\item floods
\item extreme weather
\end{itemize}
\subsubsection{Infrastructure failure}
\begin{itemize}
\item cables that break
\item air conditioning that breaks down
\item UPSs or diesel generators that fail
\end{itemize}
\subsubsection{Third party delivery failure}
\begin{itemize}
\item internet delivery failures
\item power delivery failures
\end{itemize}
\subsection{Network failures}
It is possible that network components fail (this can be related to
errors in hardware and/or
software). Also, problems with the DNS or DHCP servers might occur.\\
Distributed denial-of-service (DDoS) or other security
attacks can also shut down an entire network.
\subsection{Application failures}
All layers of an application can experience problems: the web/application
server, the database or the actual application. It is important to be
aware of this and write applications in a defensive way.
\section{Definition of high availability, expectations and realistic
goals}
\subsection{Definition of high availability}
I use the definition used in \cite{ha_book}:\\
``Availability is a measure of the time that a server is functioning
normally''.\\
Availability can be calculated with this formula \cite{ha_book}:
$A=\frac{MTBF}{MTBF +
MTTR}$ \footnote{MTBF = mean time between failures, MTTR = mean time to repair}.\\
An example: to achieve 99.9999\% availability, one is permitted 6
minutes of downtime in 11.4 years! This is highly improbable to
achieve \cite{ha_book}.\\
A more realistic goal could be 10 minutes of downtime per year; this is
expressed as an availability of 99.998\% \cite{ha_book}.
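As a quick sanity check on that number: a year has $365.25 \times 24
\times 60 \approx 525\,960$ minutes, so a single yearly outage of 10
minutes gives $MTTR = 10$ and $MTBF \approx 525\,950$ minutes, hence
\[ A = \frac{525\,950}{525\,950 + 10} \approx 0.99998 = 99.998\,\%. \]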
\subsection{Expectations and realistic goals}
Unavailability can be costly: customers of your online web shop
might leave and look for another shop; imagine the stock market not
being able to respond, ...\\
We should always try to be realistic in our expectations; if 50\%
of a company's data centers fail, there is a good chance not everyone
will still be able to be served. However, this is such a far-fetched
scenario that it is probably not cost-effective to try to preempt
it.\\
No matter how well an infrastructure is set up, there are always
aspects that are beyond our control \cite{black_swan}, so it is
important to be careful promising availability percentages to
customers.
\chapter{Scalability and its effects on economy and environment}
\section{Power consumption}
The power consumed by large web applications is huge. As an
example, in 2011 Facebook consumed 532 million kilowatt hours of
energy, and Google consumed 2 billion kilowatt hours of
energy in 2010 \cite{datacenter_power} \footnote{As a comparison:
Belgium used 84 billion kilowatt hours in 2008}.
\section{Software efficiency}
For some companies (such as Facebook) one of the reasons why this
power consumption is so high is that it takes a lot of computation to
render PHP pages \cite{facebook_php}. Facebook in particular is trying
to fix this by creating a PHP virtual machine \cite{php_vm}.\\
Note that for such projects, using more computationally efficient
programming languages (such as C++ and Java) from the start could have
prevented these efforts.
\section{Climate change}
Needless to say, power usage in the range of Google's or Facebook's
contributes significantly to climate change. Since it can
be expected that the demand for cloud solutions will only go up, my
opinion is that green data centers \footnote{And applications!} will become more
important in the near future.\\
\chapter{Identifying high availability and scalability solutions}
\section{Cluster nodes}
\subsection{Cluster node monitoring}
The two major monitoring tools on GNU/Linux platforms are Nagios and
Ganglia. Both tools implement several monitoring aspects, but the
focus of each of the tools is different. Nagios is more oriented
towards detecting system problems and sending out notifications for
such events, while Ganglia provides a long term overview of the status
of cluster nodes. To ensure high availability, detecting problems is
the most important aspect. The sooner problems are revealed, the
sooner actions can be undertaken to avoid any impact on end users. It
is important that problems are properly communicated to system
administrators, since they might be able to further investigate the
problem. In the event of any hardware failure the intervention of
an administrator will be required. However, it is also necessary to
execute programs to repair the state of the cluster
upon the detection of failure. Nagios specialises in both these
issues \cite{nagios:2013}.
In order to make new nodes available when necessary, we need to
monitor the resource availability on all of the online nodes.
Monitoring statistics also allow system engineers to gain insight into
the performance needs of an application over the course of time.
Such statistics can also be used to set up models that can be used to
predict system load \cite{andreolini:2006}. Ganglia collects these
kinds of statistics and stores them in a round-robin database, namely RRDtool
\cite{ganglia:2013} \cite{rrdt:2013}.
\section{Computational cluster nodes}
\subsection{Application monitoring}
Monitoring a cluster node is not enough to guarantee that it is
properly servicing clients. It is possible that the application (for
example: the web application server) has gotten into an infinite loop,
and as such is using practically all of the CPU's power, while in
reality processing no new client requests.
One possibility for monitoring applications is to keep track of the number of
client requests per second (or another appropriate time unit) that get
processed and compare it with the amount of computational resources
that is used.
This number of client requests is very application-specific; a web
server might process thousands of client requests per second, while a
server responsible for rendering a scene in an animated movie might only process
one request per hour \cite{apm:2013}.
This information can usually be extracted from the log files generated
by the server application (for example: the request log file for
Apache httpd). This information can then be passed to Ganglia to be
stored and monitored together with the other server statistics
\cite{ganglia:2013}.
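As an illustration, the following sketch (the log path and metric name
are assumptions, not taken from a concrete setup) counts the requests
Apache logged during the current minute and publishes that number to
Ganglia with the gmetric tool:
\begin{lstlisting}[language=bash]
# Count the requests logged during the current minute
# (approximate: the minute is still in progress).
REQS=$(grep -c "$(date '+%d/%b/%Y:%H:%M')" \
  /var/log/apache2/access.log)
# Publish the value to Ganglia (requires a running gmond).
gmetric --name apache_requests_per_min \
  --value "$REQS" --type uint32 --units "req/min"
\end{lstlisting}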
\subsection{Cluster node management}
In large cluster environments, it is not possible to wait for
human intervention when a problem occurs. Therefore, when nodes or
applications fail, the monitoring server should restart them.
There are several systems that allow for programs to restart
physical servers that reside in a rack (for example Dell's DRAC).
Since I will only use a simulated cluster in this mini-project, I will
only deal with starting and stopping virtualised nodes \footnote{Which I think should
be a valuable contribution, taking into account the large number of
virtualised servers that are used in data centers these days}.
I will use VirtualBox to set up my simulated cluster; this
virtualization software supports starting and stopping virtual
machines from the command line.
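For example (the node name 'node1' is an assumption), a virtual
machine can be started and stopped like this:
\begin{lstlisting}[language=bash]
# Boot a node without opening a GUI window.
VBoxManage startvm "node1" --type headless
# Hard power-off, comparable to pulling the plug.
VBoxManage controlvm "node1" poweroff
\end{lstlisting}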
When an application stalls but the operating system is still fully
functioning, the monitoring server can decide to restart the server application, or
(if the application has support for this) simply kill the thread that
is responsible for the problems.
\subsection{Load balancing proxies}
\label{sec:load_balancing_proxies}
Load balancing is a method for distributing workloads across multiple
computing nodes. It can be used to improve the reliability of a system through
redundancy and to provide scalability by sending requests to the most
available node.
Load balancing has several applications such as High Performance
Clusters, database servers, and HTTP servers. In this section I will
focus on HTTP load balancing.
A prominent TCP/HTTP load balancing proxy is HAProxy
\cite{haproxy:2013}. This software supports a set of
balancing algorithms \cite{tarreau:2006}, most notably:
\begin{itemize}
\item round robin: each server is used in turns, taking into account
the different server's weights (these weights can be configured
per server and can be adjusted on the fly)
\item balance leastconn: the server with the lowest number of connections receives the connection
\item source/uri: this algorithm hashes respectively the source IP
or URI of the request; this hash is then divided by the total
weight of the running servers to determine which server should
receive the request
\end{itemize}
HAProxy also supports Access Control List (ACL) configurations that allow
the load balancer to select a server based on a header in the HTTP
request. When multiple server clusters exist in different countries
and/or continents, this feature allows us to connect a user to a
server that is located near their own location.
Another important aspect to consider when setting up a load balancing
infrastructure for HTTP servers is that many dynamic web applications
rely on session context \cite{tarreau:2006}. Since this session context is usually stored
on the web server, it is necessary for the load balancer to pass the
client's request to the same web server: this is called persistence.
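To make this concrete, here is a minimal sketch of an HAProxy
configuration excerpt that combines the round robin algorithm from the
list above with cookie-based persistence (the node names and IP
addresses are assumptions):
\begin{lstlisting}
frontend http-in
    bind *:80
    default_backend webnodes

backend webnodes
    balance roundrobin
    # insert a cookie so a client sticks to one server
    cookie SERVERID insert indirect nocache
    server node1 192.168.15.4:80 cookie node1 check
    server node2 192.168.15.5:80 cookie node2 check
\end{lstlisting}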
\section{Storage and backup}
\subsection{Data replication}
Data replication implies that the same data is stored on multiple
data storage systems. This should be transparent to the end user: the
user should not be aware of this when using the data (note that when
working with a multi-tier architecture, it is possible that the
`end-user' is represented by another tier in the system).
Data replication is relevant for both:
\begin{itemize}
\item high availability systems: to ensure a smooth transition from a
failing data storage node to a working node
\item systems that require scalability: to make load balancing possible
\end{itemize}
We can distinguish 3 types of data replication
\cite{datareplication:2013}:
\subsubsection{Database replication}
\paragraph*{Master-slave replication}
In this setup, there exists one master database; this database is the only instance
that accepts updates. When an update statement is received by the
master database, it is applied to the database and appended to the log. The statements
in this log are then propagated to the slave databases.
Most database systems support this replication strategy (Postgres \cite{postgres_db:2013},
MySQL \cite{mysql_db:2013}, ...).
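As a sketch of what this looks like in practice with MySQL (the
account name, password and network range are assumptions): after
enabling the binary log on the master (server-id and log-bin in
my.cnf), an account is created that the slaves use to pull the log:
\begin{lstlisting}[language=bash]
# On the master: create an account the slaves replicate with.
mysql -u root -p -e "GRANT REPLICATION SLAVE ON *.* \
  TO 'repl'@'192.168.15.%' IDENTIFIED BY 'secret';"
\end{lstlisting}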
\paragraph*{Multi-master replication}
In this setup, multiple master instances exist, and each of these master
instances accepts updates. These updates are then communicated to the
other master instances.
This has the following advantages:
\begin{itemize}
\item the master node can fail without interrupting the system
\item the update load can be distributed over multiple nodes
\end{itemize}
There are also some significant disadvantages related to this technique:
\begin{itemize}
\item increased complexity
\item most implementations violate the ACID constraints
\item to fix conflicts that may arise between different database
instances, resources of the database nodes are required, and
communication between the conflicting nodes will increase the network
traffic
\end{itemize}
Note that there are other techniques to accommodate the advantage of
load balancing mentioned above: database shards and NoSQL databases (see
\cref{sec:no_sql}).
\subsubsection{Disk storage replication}
Disk storage replication collects updates to a block device and
applies these updates collectively to multiple devices.
This can be implemented in hardware, in which case the functionality is
embedded in the disk array controller.
DRBD \cite{drbd_soft:2013} implements this functionality in software.
This software allows users to mirror disks within a system or over a
network (the DRBD website explains this as follows: ``DRBD can be
understood as network based RAID-1'' \cite{drbd_soft:2013}).
\subsubsection{File-based replication}
Disk storage replication replicates entire block devices; however, for
some applications, it might be necessary to only replicate parts of
the logical file system.
\paragraph*{Batch replication}
To replicate file systems in batch, synchronisation tools such as
rsync \cite{rsync_software:2013} can be
used. The rsync Unix tool can synchronise two directories from one location to the
other (this can be done over a network). The advantage compared to a
simple ``scp -r'' is that rsync will only transfer the changes, by using
delta encoding.
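For example (the directory and host names are assumptions), mirroring
a web root to a second node over SSH:
\begin{lstlisting}[language=bash]
# -a preserves permissions/times, -z compresses the transfer,
# --delete removes remote files that no longer exist locally.
rsync -avz --delete /srv/www/ node2:/srv/www/
\end{lstlisting}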
\paragraph*{Real-time replication}
I was not able to find any tools on GNU/Linux that allow real-time
file-based replication with default filesystems such as ext4.
It is however possible to achieve this kind of file-based replication
by using one of the distributed file systems, which I will discuss in
the next section.
\subsection{Backup}
In order to set up a reliable backup system, some technologies
are required (example commands follow the list):
\begin{itemize}
\item the ability to take a snapshot of a file system: this can be
achieved by using the Logical Volume Manager, which is part of the
mainline Linux kernel \cite{linux_kernel_soft:2013}
\item the ability to take snapshots and incremental backups of an RDBMS:
most systems support this; some examples:
\begin{itemize}
\item Postgres: via barman \cite{barman_software:2013}
\item MySQL: via mysqldump (and by enabling the binary log)
\end{itemize}
\item the ability to make incremental filesystem backups; this can be
done using rsync \cite{rsync_software:2013}
\end{itemize}
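Hedged examples of these three building blocks (all volume, directory
and file names are assumptions):
\begin{lstlisting}[language=bash]
# 1. Snapshot a logical volume with LVM.
sudo lvcreate --size 1G --snapshot \
  --name www-snap /dev/vg0/www
# 2. Dump all MySQL databases in one consistent transaction.
mysqldump --single-transaction --all-databases \
  > /backup/mysql-$(date +%F).sql
# 3. Incremental rsync backup: files that did not change
#    become hard links into yesterday's backup.
rsync -a --delete --link-dest=/backup/daily.1 \
  /srv/data/ /backup/daily.0/
\end{lstlisting}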
\subsection{Distributed file systems}
\label{ceph}
Distributed file systems share storage using a network protocol; they
deliver scalability and failure
correction in such a way that it is transparent to the client.
Such filesystems can restrict access to certain files based on access lists
and/or file system quotas. Distributed file systems allow clients to
access files in the same way as they would be able to do on their
local file system \footnote{A client might refer to an end user,
but it might as well refer to one of the tiers in a multi-tier
system}.
The most interesting free software implementations I encountered were
GlusterFS \cite{glusterfs_soft:2013} and Ceph \cite{ceph_soft:2013}.
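As an illustration of this transparency (the server and volume names
are assumptions): once a GlusterFS volume exists, a client simply
mounts it and uses it like a local file system:
\begin{lstlisting}[language=bash]
# Requires the glusterfs-client package.
sudo mount -t glusterfs node1:/testvol /mnt/gluster
\end{lstlisting}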
\subsection{Databases}
\subsubsection{RDBMS sharding}
For some applications, the amount of data that needs to be stored can be so large
that it is impossible to fit it on a single cluster node. In order to
accommodate such a large database in a relational context, database
sharding can be used. Database sharding allows databases to be
partitioned horizontally, so that different rows of the database can
reside on different cluster nodes. It is advisable that rows that are
closely connected are located on the same cluster node, to improve
query performance; however, it still remains possible to execute
queries that involve rows distributed over multiple cluster nodes.
For example: an application where the data that is stored is mostly
private to a user (storage of emails, storage of Skype contacts, ...)
could be sharded by the user name.
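A minimal sketch of such user-name-based routing (the checksum hash,
the user name and the number of shards are arbitrary choices for
illustration):
\begin{lstlisting}[language=bash]
# Map a user name to one of 4 shards via a checksum.
SHARD=$(( $(echo -n "alice" | cksum | cut -d' ' -f1) % 4 ))
echo "user alice is stored on shard node$SHARD"
\end{lstlisting}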
\subsubsection{Document-oriented NoSQL databases}
\label{sec:no_sql}
When the data that is stored is private to one user and performing
queries over multiple users is a rare event (storage of emails,
storage of Skype contacts, ...), it can be interesting to keep a
database per user. This can be achieved by using so-called
document-oriented NoSQL solutions.
These solutions are able to store a document
containing all information related to one user. This document can for
example be structured using the JSON format. Contrary to RDBMS
systems, document-oriented NoSQL solutions have no schemas that define
how the data should be structured; instead, fields can be freely added
to the JSON documents.
Although this provides greater flexibility, it also makes it very
difficult to perform queries over multiple documents that require data
to be joined together.
When data is grouped together in a document per entity (e.g.: a user)
and this document is stored in one file, it becomes much easier to
scale the data over multiple cluster nodes.
With the boom of the ``cloud'', several document-oriented NoSQL
solutions have been developed, most notably CouchBase and MongoDB.
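For example, storing one user's data as a single JSON document with
the MongoDB shell (the collection and field names are assumptions):
\begin{lstlisting}[language=bash]
# Insert one self-contained document for the user 'alice'.
mongo --eval 'db.users.insert({name: "alice",
  emails: [], contacts: ["bob", "carol"]})'
\end{lstlisting}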
\chapter{Experimenting with HA and scalability solutions}
\section{Open source technologies}
In this project, I focus on open source (mainly free software, as in
GPL-licensed) technologies. In my opinion,
open source technologies are ideal for setting up large clusters.
All the source code of the building blocks of the cluster is available
and thus can be reviewed. Each cluster setup can be very
specific to the problem the company or research institution is
trying to solve; therefore, being able to adapt the source code to
meet specific needs is definitely an advantage.
\section{Test environment setup}
I will simulate a mini-cluster using different virtual machines. I
will create these virtual machines with VirtualBox
\cite{virtualbox_soft:2013}. The virtual machines will have
``Ubuntu Linux 13.10 (AMD64) Server Edition''
\cite{ubuntu_server_13_10:2013} installed as base operating system.\\
I want all the virtual machines to be able to contact each other,
while it should also be possible for them to access
the internet.
To make this possible, I opted to use VirtualBox's virtualised NAT
feature (a new feature, available since VirtualBox
4.3).
This feature emulates a NAT environment on the host machine.
The advantage of this is that the test setup will also work when no
network is available.\\
I created a base virtual machine with ``Ubuntu Linux 13.10 (AMD64)
Server Edition'' and the VirtualBox guest additions
installed. I will use this base virtual machine as a starting point
for all the virtual machines that I will setup to execute
experiments.\\
I will include this base virtual machine in my final deliverable,
since all other virtual machines can be derived from it by following
the instructions in this essay.
I also created a shared filesystem; this allows me to share files between the
different virtual machines and my laptop.
\noindent First we need to create the NAT network:
\begin{lstlisting}[language=bash]
VBoxManage natnetwork add -t nat-int-network \
  -n "192.168.15.0/24" \
  -e -h on
\end{lstlisting}
This command creates the NAT network 'nat-int-network' with the IP
address range 192.168.15.0/24.
After creating the NAT network, virtual machine guests can be
configured as shown in~\cref{fig:vbox_network_config}.
\begin{figure}[h!]
\caption{VirtualBox network configuration.}
\label{fig:vbox_network_config}
\centering
\includegraphics[scale=0.3]{pics/vbox_network_config.png}
\end{figure}
\section{Monitoring}
\subsection{Experiment overview}
To test Nagios \cite{nagios:2013}, I will test the detection of
application and system failures. To test application failure, I
will write an example application that fails after
2 minutes. I will test system failure by inducing a kernel panic.
When an application failure is detected, I will execute a script on the
cluster node to restart the application.
When a system failure is detected, I will restart the virtual
machine.
\subsection{Nagios experiment}
I cloned the base virtual machine as described earlier and changed its
hostname to 'monitor' (by editing $/etc/hostname$ and $/etc/hosts$ and
rebooting).
On this machine, I installed Nagios:
\begin{lstlisting}[language=bash]
sudo apt-get install nagios3
\end{lstlisting}
During this installation a couple of questions were asked:
\begin{itemize}
\item email configuration: since I will not use this in my experiment
I did not provide a configuration
\item the nagios web administration password
\end{itemize}
In order to be able to access the web server, I opened port 80 of the
machine's firewall.
I first enabled the UFW firewall:
\begin{lstlisting}[language=bash]
sudo ufw enable
\end{lstlisting}
Then I opened port 80:
\begin{lstlisting}[language=bash]
sudo ufw allow 80/tcp
\end{lstlisting}
Since the NAT network setup only allows virtual machines to access
other virtual machines, I started my Windows virtual machine and used
its web browser to access the Nagios system. After providing the
username and password, the browser showed the unconfigured website as
depicted in \cref{fig:nagios_1}.
\begin{figure}[h!]
\caption{Nagios start page, immediately after installation.}
\label{fig:nagios_1}
\centering
\includegraphics[scale=0.3]{pics/nagios_1.png}
\end{figure}
I cloned another pair of servers, 'node1' and 'node2', which will serve as
cluster nodes (I performed the same steps as on 'monitor' to change
the hostnames of these nodes).\\
To test the detection of application failure, I wrote a program that
fails after 2 minutes. \\
The source code of this program can be found in src/nagios/nagios\_app.cpp.
I install it on 'node1' simply by building the source code:
\begin{lstlisting}[language=bash]
g++ nagios_app.cpp -o nagios_app
\end{lstlisting}
After building the application, I copy it to $/usr/sbin/$
\footnote{In order to build this C++ code, we need gcc, which is included
in the package build-essential that I installed earlier}.
In order to start, stop and restart the service, I create an init.d
script.
I based this script on the code I found on this website: \\
http://koo.fi/blog/2013/03/09/init-script-for-daemonizing-non-forking-processes/ \\
I included the source code of the nagios\_app init.d config script in
\\ src/nagios/init.d/nagios\_app.
Now we still need to give the script the correct permissions:
\begin{lstlisting}[language=bash]
sudo chmod 755 /etc/init.d/nagios_app
\end{lstlisting}
And we need to include it in the startup list:
\begin{lstlisting}[language=bash]
sudo update-rc.d nagios_app defaults
\end{lstlisting}
We can now start and stop the program very easily:
\begin{lstlisting}[language=bash]
sudo /etc/init.d/nagios_app start
sudo /etc/init.d/nagios_app stop
\end{lstlisting}
To monitor a remote server, I use the NRPE plugin for Nagios.
It can easily be installed on the server using this command:
\begin{lstlisting}[language=bash]
sudo apt-get install nagios-nrpe-plugin
\end{lstlisting}
To allow Nagios to detect the installation of the plugin, we need to
restart Apache:
\begin{lstlisting}[language=bash]
sudo /etc/init.d/apache2 restart