
Tutorial


The following tutorial shows how to use hadoop-benchmark on a single machine (a local laptop) using VirtualBox on OSX and Linux. Options for other platforms are available by changing the driver used by docker-machine (cf. below).

All the steps have been performed on a MacBook Pro 3.1 GHz Intel Core i7, 16GB RAM, running OSX 10.12.2. A screencast of all these steps on this system is available as an asciicast.

We will do the following:

  1. Create a 4-node cluster
  • one node for distributed service discovery (running the Consul kv-store) with 512MB of RAM
  • one node for the Hadoop controller (running ResourceManager, NameNode) with 2048MB of RAM
  • two nodes for the Hadoop workers (running NodeManager, DataNode) with 2048MB of RAM each
  2. Start vanilla Hadoop
  3. Run benchmarks
  4. Start a customized Hadoop with a feedback control loop self-balancing job parallelism and throughput (cf. Zhang et al.)
  5. Run benchmarks
  6. Compare the results
  7. Destroy the cluster

Please note that we do not run the SWIM benchmarks as they are computationally expensive: each run would either take a significant amount of time or simply would not run on a cluster created on a single laptop. The demonstration of running hadoop-benchmark on a cluster in Grid5000 shows, besides what is presented here, how to run a SWIM benchmark, extract the results and make a basic comparison.

Prerequisites

  • Currently, hadoop-benchmark has been tested on Linux and OSX
  • Install docker, docker-machine and VirtualBox
  • Check versions
    • docker version >= 1.12
      $ docker --version
      Docker version 1.12.6, build 78d1802
    • docker-machine >= 0.8
      $ docker-machine --version
      docker-machine version 0.8.2, build e18a919
    • VirtualBox >= 5.1
      $ VBoxManage --version
      5.1.12r112440
  • Install the latest version of hadoop-benchmark from GitHub
    $ git clone https://github.com/Spirals-Team/hadoop-benchmark.git
    Cloning into 'hadoop-benchmark'...
    remote: Counting objects: 1042, done.
    remote: Compressing objects: 100% (28/28), done.
    remote: Total 1042 (delta 13), reused 0 (delta 0), pack-reused 1012
    Receiving objects: 100% (1042/1042), 1.60 MiB | 0 bytes/s, done.
    Resolving deltas: 100% (554/554), done.
  • (optional) R >= 3.3.2
    $ R --version
    R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-apple-darwin16.1.0 (64-bit)
    
    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under the terms of the
    GNU General Public License versions 2 or 3.
    For more information about these matters see
    http://www.gnu.org/licenses/.
  • (optional) R packages:
    • tidyverse
      Rscript -e 'library(tidyverse, quiet=T)'
    • Hmisc
      Rscript -e 'library(Hmisc, quiet=T)'
    if you do not have them, you can install them with the install.packages(c("tidyverse", "Hmisc")) R command
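
If either package is missing, both can also be installed non-interactively from the shell (a sketch; any CRAN mirror works):

Rscript -e 'install.packages(c("tidyverse", "Hmisc"), repos="https://cloud.r-project.org")'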

Create a Hadoop Cluster

First, we create a cluster on top of which we can then provision Hadoop.

Create a new configuration

hadoop-benchmark uses a simple configuration file to specify its settings. We start with a new configuration created from the configuration template used for vanilla Hadoop in a local cluster.

cd hadoop-benchmark
cp scenarios/vanilla-hadoop/local_cluster mycluster

The configuration contains a number of options:

# the name of the cluster
CLUSTER_NAME_PREFIX='local-hadoop'

# the driver to be used to create the cluster
# for list of available drivers go to https://docs.docker.com/machine/drivers/
DRIVER='virtualbox'

# the extra driver options for controller and compute nodes
# for list of available options `docker-machine create -d virtualbox --help | grep "\-\-virtualbox"`
DRIVER_OPTS='--virtualbox-memory 2048'

# override the memory for consul
DRIVER_OPTS_CONSUL='--virtualbox-memory 512'

# number of compute nodes
NUM_COMPUTE_NODES=1

# docker images to be used
HADOOP_IMAGE='hadoop-benchmark/hadoop'
HADOOP_IMAGE_DIR='scenarios/vanilla-hadoop/images/hadoop'

We will leave all the defaults but change the number of compute nodes to 2:

# number of compute nodes
NUM_COMPUTE_NODES=2
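
If you prefer not to open an editor, the same change can be made with a quick in-place substitution (a sketch; editing the mycluster file by hand works just as well):

sed -i.bak 's/^NUM_COMPUTE_NODES=1/NUM_COMPUTE_NODES=2/' mycluster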

If you have less memory in your system, you can leave the default. If you create a cluster that is too big, you might run into problems with not having enough memory: some processes might segfault because malloc could not allocate the required memory. Please note that this usually only happens in local environments and not in real clusters.

The driver is set to virtualbox, but other drivers are available: docker-machine supports Microsoft Azure, G5K, Amazon EC2, GCE, VMware, etc. The complete list is available in the docker-machine documentation. The DRIVER_OPTS variable specifies additional driver options. In this tutorial we set the memory for the VirtualBox VMs (concretely, for the controller and compute nodes): we use 2GB instead of the 1GB default. There are many other options (they can be listed using docker-machine create -d virtualbox --help | grep virtualbox) to further customize the nodes. The driver options can also be overridden for individual machines by appending _<NODE_NAME_IN_CAPITALS> (e.g. DRIVER_OPTS_CONTROLLER).
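
The per-node override can be used, for example, to give the controller more memory than the compute nodes (a hypothetical snippet; the values are illustrative):

# shared driver options for controller and compute nodes
DRIVER_OPTS='--virtualbox-memory 2048'

# override only for the controller node
DRIVER_OPTS_CONTROLLER='--virtualbox-memory 4096'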

Create the cluster on which Hadoop will be provisioned

In this step we create the cluster. The nodes will be created using the specified DRIVER. In our case, all machines will be created as VirtualBox VMs.

In total, we need to create 4 nodes:

  • one Hadoop controller node running YARN ResourceManager and HDFS NameNode services
  • two Hadoop compute nodes (derived from NUM_COMPUTE_NODES) running YARN NodeManager and HDFS DataNode services
  • one node that acts as the cluster guardian. It runs Consul, a distributed service discovery used by docker to keep track of the nodes in the cluster

First, we make sure there are no docker machines:

$ docker-machine ls
NAME   ACTIVE   DRIVER   STATE   URL   SWARM   DOCKER   ERRORS

Create the cluster:

CONFIG=mycluster ./cluster.sh create-cluster

Instead of prefixing each command with CONFIG=mycluster, we can simply export it:

export CONFIG=mycluster
./cluster.sh create-cluster

After the command has run, we can check the status of the docker machines:

$ docker-machine ls
NAME                      ACTIVE   DRIVER       STATE     URL                         SWARM                              DOCKER    ERRORS
local-hadoop-compute-1    -        virtualbox   Running   tcp://192.168.99.102:2376   local-hadoop-controller            v1.12.6
local-hadoop-compute-2    -        virtualbox   Running   tcp://192.168.99.103:2376   local-hadoop-controller            v1.12.6
local-hadoop-consul       -        virtualbox   Running   tcp://192.168.99.100:2376                                      v1.12.6
local-hadoop-controller   -        virtualbox   Running   tcp://192.168.99.101:2376   local-hadoop-controller (master)   v1.12.6
  • the first two nodes local-hadoop-compute-{1,2} are for the Hadoop workers
  • the local-hadoop-consul runs the distributed service discovery needed by docker
  • the local-hadoop-controller is designated for the resource manager

The local-hadoop- prefix is derived from the CLUSTER_NAME_PREFIX option in the configuration file. The reason for the prefix is that multiple clusters can run at the same time, and the prefix allows us to differentiate between them.
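
For example, to list only the machines that belong to this cluster, the prefix can be used as a name filter (a sketch, assuming the default prefix):

docker-machine ls --filter name=local-hadoop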

Graphically, excluding the local-hadoop-consul (since it is really just an internal docker swarm thing), we have the following:

docker machines

We can also check this in the VirtualBox UI:

VirtualBox UI running the cluster

If we want to get the IP address of a node, we can simply run:

$ docker-machine ip local-hadoop-controller
192.168.99.101

or connect using SSH in case we need to debug something:

$ docker-machine ssh local-hadoop-controller
                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \         __/
              \____\_______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
Boot2Docker version 1.12.6, build HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017
Docker version 1.12.6, build 78d1802
docker@local-hadoop-controller:~$ exit
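
If we need the IP addresses of all the nodes at once, a small loop over docker-machine works as well (a sketch, assuming the default local-hadoop prefix):

for node in consul controller compute-1 compute-2; do
  echo "$node: $(docker-machine ip local-hadoop-$node)"
done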

The big advantage of using docker-machine is that all of this works transparently across different drivers, so the same commands will work regardless of whether the node is deployed in VirtualBox or on Amazon EC2.

Now the cluster is ready and we can start deploying docker containers.

Benchmarking a vanilla Hadoop

The created cluster is ready to be provisioned with any Hadoop distribution. We start by deploying an Apache vanilla Hadoop distribution (version 2.7.1) and running some benchmarks. Then we will deploy a new Hadoop based on the same vanilla distribution but with some self-adaptive capabilities and compare the results.

Provisioning Hadoop

The following command starts Hadoop on the cluster. First, it builds the specified docker image (HADOOP_IMAGE from HADOOP_IMAGE_DIR) and then it starts the corresponding containers on the cluster nodes.

$ ./cluster.sh start-hadoop

... ... ...
... ... ...

Hadoop should be ready
To connect docker run: 'eval $(docker-machine env --swarm local-hadoop-controller)'
To connect a bash console to the cluster use: 'console' option
To connect to Graphite (WEB console visualizing collectd data), visit http://192.168.99.101
To connect to YARN ResourceManager WEB UI, visit http://192.168.99.101:8088
To connect to HDFS NameNode WEB UI, visit http://192.168.99.101:50070
To connect to YARN NodeManager WEB UI, visit:
http://192.168.99.102:8042 for compute-1
http://192.168.99.103:8042 for compute-2
To connect to HDFS DataNode WEB UI, visit:
http://192.168.99.102:50075 for compute-1
http://192.168.99.103:50075 for compute-2

If you plan to use YARN WEB UI more extensively, consider to add the following records to your /etc/hosts:
192.168.99.101 controller
192.168.99.102 compute-1
192.168.99.103 compute-2

To see what has happened, we can explore the running containers. First, we need to tell the docker command-line client which docker node it should communicate with. The easiest way is to use the shell-init command.

$ ./cluster.sh shell-init
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.101:3376"
export DOCKER_CERT_PATH="/Users/krikava/.docker/machine/machines/local-hadoop-controller"
export DOCKER_MACHINE_NAME="local-hadoop-controller"
# Run this command to configure your shell:
# eval $(docker-machine env --swarm local-hadoop-controller)

For this to take effect, it has to be run in an eval:

eval $(./cluster.sh shell-init)
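
A quick way to verify that the docker client now talks to the swarm manager (the exact value will differ on your machine):

echo $DOCKER_HOST
# tcp://192.168.99.101:3376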

Now we can get an overview of what is running in the docker cluster:

$ docker info
Containers: 8
 Running: 8
 Paused: 0
 Stopped: 0
Images: 10
Server Version: swarm/1.2.5
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 3
 local-hadoop-compute-1: 192.168.99.102:2376
  └ ID: PBW3:TL6Z:COEV:NRMG:JQCH:7VES:GVHA:6BG6:NJ3U:AELV:MTLU:SHHV
  └ Status: Healthy
  └ Containers: 2 (2 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.053 GiB
  └ Labels: kernelversion=4.4.41-boot2docker, operatingsystem=Boot2Docker 1.12.6 (TCL 7.2); HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017, provider=virtualbox, storagedriver=aufs, type=compute
  └ UpdatedAt: 2017-01-17T17:17:46Z
  └ ServerVersion: 1.12.6
 local-hadoop-compute-2: 192.168.99.103:2376
  └ ID: WLCB:IH2R:ZQT6:VXWQ:RPHE:IUJL:2HEG:3EIX:Q7XZ:XUYO:5TWN:RVJT
  └ Status: Healthy
  └ Containers: 2 (2 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.053 GiB
  └ Labels: kernelversion=4.4.41-boot2docker, operatingsystem=Boot2Docker 1.12.6 (TCL 7.2); HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017, provider=virtualbox, storagedriver=aufs, type=compute
  └ UpdatedAt: 2017-01-17T17:17:53Z
  └ ServerVersion: 1.12.6
 local-hadoop-controller: 192.168.99.101:2376
  └ ID: SW6Y:AJ2G:LNMN:XYWN:ML4U:YDEX:FZ4L:6A6Q:RTYJ:HNJQ:QVKY:7DGY
  └ Status: Healthy
  └ Containers: 4 (4 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.053 GiB
  └ Labels: kernelversion=4.4.41-boot2docker, operatingsystem=Boot2Docker 1.12.6 (TCL 7.2); HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017, provider=virtualbox, storagedriver=aufs, type=controller
  └ UpdatedAt: 2017-01-17T17:18:01Z
  └ ServerVersion: 1.12.6
Plugins:
 Volume:
 Network:
Swarm:
 NodeID:
 Is Manager: false
 Node Address:
Security Options:
Kernel Version: 4.4.41-boot2docker
Operating System: linux
Architecture: amd64
CPUs: 3
Total Memory: 6.159 GiB
Name: a214f4308980
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
WARNING: No kernel memory limit support

We can see the three nodes running 8 containers. To see which containers are running:

$ docker ps
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                                                               NAMES
0f4bfb7e6a33        hadoop-benchmark/hadoop   "/entrypoint.sh compu"   19 minutes ago      Up 19 minutes       2122/tcp, 8020/tcp, 8030-8033/tcp, 8040/tcp, 8088/tcp, 9000/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 192.168.99.103:8042->8042/tcp, 50070/tcp, 50090/tcp, 192.168.99.103:50075->50075/tcp   local-hadoop-compute-2/compute-2
ec42e2942a39        hadoop-benchmark/hadoop   "/entrypoint.sh compu"   19 minutes ago      Up 19 minutes       2122/tcp, 8020/tcp, 8030-8033/tcp, 8040/tcp, 8088/tcp, 9000/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 192.168.99.102:8042->8042/tcp, 50070/tcp, 50090/tcp, 192.168.99.102:50075->50075/tcp   local-hadoop-compute-1/compute-1
061dbb4e2f0e        hadoop-benchmark/hadoop   "/entrypoint.sh contr"   20 minutes ago      Up 20 minutes       2122/tcp, 8020/tcp, 8030-8033/tcp, 8040/tcp, 8042/tcp, 9000/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 192.168.99.101:8088->8088/tcp, 50020/tcp, 50075/tcp, 50090/tcp, 192.168.99.101:50070->50070/tcp   local-hadoop-controller/controller
3e5345d10b11        hopsoft/graphite-statsd   "/sbin/my_init"          20 minutes ago      Up 20 minutes       192.168.99.101:80->80/tcp, 2004/tcp, 192.168.99.101:2003->2003/tcp, 192.168.99.101:8126->8126/tcp, 2023-2024/tcp, 192.168.99.101:8125->8125/udp                                                     local-hadoop-controller/graphite

This only shows 4 containers as the listing omits the ones that participate in forming the docker swarm.

The following schema presents a graphical overview of what has been deployed:

hadoop cluster

Once the ./cluster.sh start-hadoop command finishes, it prints the information on how to access the cluster:

Hadoop should be ready
To connect docker run: 'eval $(docker-machine env --swarm local-hadoop-controller)'
To connect a bash console to the cluster use: 'console' option
To connect to Graphite (WEB console visualizing collectd data), visit http://192.168.99.101
To connect to YARN ResourceManager WEB UI, visit http://192.168.99.101:8088
To connect to HDFS NameNode WEB UI, visit http://192.168.99.101:50070
To connect to YARN NodeManager WEB UI, visit:
http://192.168.99.102:8042 for compute-1
http://192.168.99.103:8042 for compute-2
To connect to HDFS DataNode WEB UI, visit:
http://192.168.99.102:50075 for compute-1
http://192.168.99.103:50075 for compute-2

If you plan to use YARN WEB UI more extensively, consider to add the following records to your /etc/hosts:
192.168.99.101 controller
192.168.99.102 compute-1
192.168.99.103 compute-2

If you plan to use these WEB UIs, it is recommended to temporarily alter your /etc/hosts file, since the UIs use hostnames instead of IP addresses and without these entries some of the links might not be reachable. There are a number of available user interfaces:
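
Instead of editing /etc/hosts by hand, the entries can also be generated from docker-machine (a sketch, assuming the default local-hadoop prefix; adjust if you changed CLUSTER_NAME_PREFIX):

for node in controller compute-1 compute-2; do
  echo "$(docker-machine ip local-hadoop-$node) $node" | sudo tee -a /etc/hosts
done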

  • YARN Resource Manager UI

    YARN Resource Manager UI

  • Graphite - monitoring console dashboard (WEB console visualizing collectd data)

    Graphite

  • HDFS NameNode UI

    HDFS NameNode UI

  • compute-1 YARN NodeManager UI

    YARN NodeManager UI for compute-1

  • compute-2 YARN NodeManager UI

    YARN NodeManager UI for compute-2

The Hadoop cluster is fully functional and ready to be used. To test it, we can connect to it:

./cluster.sh console

This in turn starts a new container in the cluster with the same base image as all the other nodes. We can access the other nodes in the cluster using their hostnames (i.e. controller, compute-1, compute-2):

root@hadoop-console:/# ping controller
PING controller (10.0.0.3) 56(84) bytes of data.
64 bytes from controller.hadoop-net (10.0.0.3): icmp_seq=1 ttl=64 time=0.114 ms
64 bytes from controller.hadoop-net (10.0.0.3): icmp_seq=2 ttl=64 time=0.055 ms
^C
--- controller ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.055/0.084/0.114/0.030 ms
root@hadoop-console:/# ping compute-1
PING compute-1 (10.0.0.4) 56(84) bytes of data.
64 bytes from compute-1.hadoop-net (10.0.0.4): icmp_seq=1 ttl=64 time=0.490 ms
64 bytes from compute-1.hadoop-net (10.0.0.4): icmp_seq=2 ttl=64 time=0.420 ms
^C
--- compute-1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.420/0.455/0.490/0.035 ms
root@hadoop-console:/# ping compute-2
PING compute-2 (10.0.0.5) 56(84) bytes of data.
64 bytes from compute-2.hadoop-net (10.0.0.5): icmp_seq=1 ttl=64 time=0.536 ms
64 bytes from compute-2.hadoop-net (10.0.0.5): icmp_seq=2 ttl=64 time=0.389 ms
^C
--- compute-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.389/0.462/0.536/0.076 ms

and run Hadoop commands:

  • list the compute nodes in the cluster:

    root@hadoop-console:/# yarn node -list
    17/01/17 17:20:59 INFO client.RMProxy: Connecting to ResourceManager at controller/10.0.0.3:8032
    Total Nodes:2
             Node-Id       Node-State Node-Http-Address Number-of-Running-Containers
     compute-2:45454          RUNNING    compute-2:8042                            0
     compute-1:45454          RUNNING    compute-1:8042                            0
  • list the applications in the cluster:

    root@hadoop-console:/# yarn application -list
    17/01/17 17:21:09 INFO client.RMProxy: Connecting to ResourceManager at controller/10.0.0.3:8032
    Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
                    Application-Id      Application-Name      Application-Type        User       Queue               State         Final-State         Progress                        Tracking-URL
  • get the report from HDFS:

    root@hadoop-console:/# hdfs dfsadmin -report
    Configured Capacity: 38390448128 (35.75 GB)
    Present Capacity: 33287757824 (31.00 GB)
    DFS Remaining: 33287708672 (31.00 GB)
    DFS Used: 49152 (48 KB)
    DFS Used%: 0.00%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Missing blocks (with replication factor 1): 0
    
    -------------------------------------------------
    Live datanodes (2):
    
    
    Name: 10.0.0.4:50010 (compute-1.hadoop-net)
    Hostname: compute-1
    Decommission Status : Normal
    Configured Capacity: 19195224064 (17.88 GB)
    DFS Used: 24576 (24 KB)
    Non DFS Used: 2551345152 (2.38 GB)
    DFS Remaining: 16643854336 (15.50 GB)
    DFS Used%: 0.00%
    DFS Remaining%: 86.71%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 1
    Last contact: Tue Jan 17 02:01:48 UTC 2017
    
    
    Name: 10.0.0.5:50010 (compute-2.hadoop-net)
    Hostname: compute-2
    Decommission Status : Normal
    Configured Capacity: 19195224064 (17.88 GB)
    DFS Used: 24576 (24 KB)
    Non DFS Used: 2551345152 (2.38 GB)
    DFS Remaining: 16643854336 (15.50 GB)
    DFS Used%: 0.00%
    DFS Remaining%: 86.71%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 1
    Last contact: Tue Jan 17 02:01:50 UTC 2017

To quit the console, simply exit:

root@hadoop-console:/# exit
exit

The console is useful for checking the status of the cluster and for debugging. You can think of it as logging into the cluster with all the Hadoop commands available. It is also possible to execute commands directly at the controller using ./cluster.sh run-controller <command>, which runs the given command in the controller docker container.
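
For example, the HDFS report shown above can also be obtained without opening a console:

./cluster.sh run-controller hdfs dfsadmin -report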

Preparing to run benchmarks

Now we are ready to run some benchmarks. We start with a simple one to check that the cluster works. A benchmark is simply a MapReduce job (or any other job running on YARN). So far, we have collected three main benchmarks in the benchmarks directory:

  • Hadoop MapReduce Examples that contain the canonical Hadoop benchmarks such as wordcount or terasort (the complete list can be obtained from the console by running ./benchmarks/hadoop-mapreduce-examples/run.sh).
  • Intel HiBench benchmarks.
  • SWIM benchmarks (not shown in this tutorial).

A new benchmark can be created by building a new docker image that contains all the corresponding files and a shell script that runs it. For details, please have a look at the documentation.

Run Hadoop canonical benchmark

We start with the Hadoop examples to make sure the cluster is operational. The following runs a MapReduce program that estimates Pi using a quasi-Monte Carlo method. We parameterize the job with 20 map tasks, each computing 1000 samples:

$ ./benchmarks/hadoop-mapreduce-examples/run.sh pi 20 1000
... ... ...
Number of Maps  = 20
Samples per Map = 1000
... ... ...

At the end, the benchmark outputs the results:

... ... ...
Job Finished in 73.413 seconds
Estimated value of Pi is 3.14280000000000000000

Run Intel HiBench

The cluster seems to be operational, so we can proceed to run a full benchmark. For this we will use the Intel HiBench suite, which includes:

  • wordcount,
  • sort,
  • terasort,
  • and sleep benchmarks

By default these benchmarks are very small. This allows one to quickly execute them to check that everything works and then to scale up. For more information on how to use HiBench in hadoop-benchmark, please check the guide.

The following command will first build the benchmark image and then start a new container on the local-hadoop-controller node that submits the jobs. At the end, it will print out the statistics and upload them to HDFS.

./benchmarks/hibench/run.sh

At the end the benchmark outputs the statistics:

Benchmarks finished

Type         Date       Time     Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
WORDCOUNT    2017-01-17 17:37:35 2746                 37.770               72                   36
SORT         2017-01-17 17:38:56 2582                 38.263               67                   33
TERASORT     2017-01-17 17:40:05 30000                25.937               1156                 578
SLEEP        2017-01-17 17:40:35 0                    25.182               0                    0

The report has been uploaded to HDFS: /hibench-20170114-2200.report
To download, run ./cluster.sh hdfs-download '/hibench-20170114-2200.report'

Optional - run multiple times

Running the benchmark only once will not give us much confidence about the performance. Instead, we want to run it multiple times and use the measurements to build a confidence interval. The idea is that we run the benchmark n times, then download the reports from HDFS and run some analysis. All of this can be done easily with hadoop-benchmark. It will take about 40 minutes to complete. The screencast is available as an asciicast.

  1. Remove the existing report(s)
./cluster.sh hdfs dfs -rm "/hibench-*.report"
  2. Make sure no other jobs are running in the cluster
$ ./cluster.sh run-controller yarn application -list
... ... ...
... ... ...
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                         Tracking-URL
  3. Run the benchmark 10 times
for i in $(seq 1 10); do ./benchmarks/hibench/run.sh; done
  4. Download all the files
./cluster.sh hdfs-download "/hibench-*.report"
  5. Move the files to a specific directory
$ mkdir -p results/hibench/vanilla
$ mv hibench* results/hibench/vanilla
$ ls -1 results/hibench/vanilla
hibench-20170114-1923.report
hibench-20170114-1927.report
hibench-20170114-1932.report
hibench-20170114-1936.report
hibench-20170114-1940.report
hibench-20170114-1944.report
hibench-20170114-1948.report
hibench-20170114-1952.report
hibench-20170114-1957.report
hibench-20170114-1901.report  
  6. We have provided a sample R script to visualize the results of the HiBench benchmarks.
$ ./benchmarks/hibench/analysis/hibench-report.R results/hibench
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
pdf
  2
$ ls -1 *.pdf
hibench-duration.pdf
hibench-throughput.pdf
Benchmark durations
(hibench-duration.pdf)
Benchmark throughputs
(hibench-throughput.pdf)
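
As a rough complement to the R script, a quick aggregate can be computed directly from the downloaded reports with standard shell tools (a sketch, assuming each .report file contains the same table as printed by the benchmark, with the duration in the fifth column):

awk '$1 ~ /^(WORDCOUNT|SORT|TERASORT|SLEEP)$/ { sum[$1] += $5; n[$1]++ }
     END { for (t in sum) printf "%-10s mean duration: %.2f s over %d runs\n", t, sum[t]/n[t], n[t] }' \
    results/hibench/vanilla/hibench-*.report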

Running benchmarks on a custom Hadoop

We have run some benchmarks on a vanilla Hadoop distribution, so we have a baseline to compare our self-adaptation to. Please note that this is by no means a rigorous way to run benchmarks and evaluate the performance of a computer system. Here we simplify the steps for the sake of brevity so that the complete tutorial can be finished within 30 minutes.

Essentially, we are going to redo the steps we have done earlier, this time with a different Hadoop distribution. We will use the one we published in:

Zhang et al., Self-Balancing Job Parallelism and Throughput in Hadoop, 16th International Conference on Distributed Applications and Interoperable Systems (DAIS), Jun 2016. PDF

It is based on the Apache vanilla Hadoop version 2.7.1 (the same as above) with a feedback control loop that dynamically reconfigures the Hadoop capacity scheduler to better balance job parallelism and throughput.

Stop the currently provisioned Hadoop

Since we have only one cluster, we need to stop the current Hadoop services so we can provision the new one that we want to test. The following command gracefully stops the currently running Hadoop services:

./cluster.sh stop-hadoop

No containers should be running after this point:

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Now, we essentially repeat the steps we issued earlier, only with a different CONFIG.

Create a new configuration

cp ./scenarios/self-balancing-example/local_cluster mycluster-sa

The files are almost the same except for the Hadoop image configuration:

HADOOP_IMAGE='hadoop-benchmark/self-balancing-example'
HADOOP_IMAGE_DIR='scenarios/self-balancing-example/image'

This custom image is based on the vanilla Hadoop 2.7.1 and additionally includes our Java-based feedback control loop that dynamically adjusts the runtime settings of the capacity scheduler. To see how to create such an image, please see the guide.

Update the number of compute nodes to 2

NUM_COMPUTE_NODES=2

Start new Hadoop

export CONFIG=mycluster-sa
./cluster.sh start-hadoop

Again, at the end, the command should output connection settings:

Hadoop should be ready
To connect docker run: 'eval $(docker-machine env --swarm local-hadoop-controller)'
To connect a bash console to the cluster use: 'console' option
To connect to Graphite (WEB console visualizing collectd data), visit http://192.168.99.101
To connect to YARN ResourceManager WEB UI, visit http://192.168.99.101:8088
To connect to HDFS NameNode WEB UI, visit http://192.168.99.101:50070
To connect to YARN NodeManager WEB UI, visit:
http://192.168.99.102:8042 for compute-1
http://192.168.99.103:8042 for compute-2
To connect to HDFS DataNode WEB UI, visit:
http://192.168.99.102:50075 for compute-1
http://192.168.99.103:50075 for compute-2

If you plan to use YARN WEB UI more extensively, consider to add the following records to your /etc/hosts:
192.168.99.101 controller
192.168.99.102 compute-1
192.168.99.103 compute-2

Since the number of nodes is the same, the addresses should be exactly the same as before.

Run a smoke test

./benchmarks/hadoop-mapreduce-examples/run.sh pi 20 1000
... ... ...
... ... ...
Job Finished in 79.147 seconds
Estimated value of Pi is 3.14280000000000000000

Run HiBench benchmark

./benchmarks/hibench/run.sh

At the end the benchmark outputs the statistics:

Benchmarks finished

Type         Date       Time     Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
WORDCOUNT    2017-01-17 18:21:56 2028                 37.171               48                   24
SORT         2017-01-17 18:23:08 2543                 37.044               71                   35
TERASORT     2017-01-17 18:24:24 30000                25.952               1155                 577
SLEEP        2017-01-17 18:24:55 0                    26.140               0                    0

The report has been uploaded to HDFS: /hibench-20170117-1824.report
To download, run ./cluster.sh hdfs-download "/hibench-20170117-1824.report"

Optional - run multiple times

Again, we should run the benchmark multiple times so we can get better confidence in the results. The steps are pretty much the same as in the first case. This will take about 40 minutes to complete. The screencast is available as an asciicast.

  1. Remove the existing report(s)
./cluster.sh hdfs dfs -rm "/hibench-*.report"
  2. Make sure no other jobs are running in the cluster
$ ./cluster.sh run-controller yarn application -list
... ... ...
... ... ...
Total jobs:0
                  JobId      State       StartTime      UserName         Queue    Priority   UsedContainers  RsvdContainers  UsedMem   RsvdMem   NeededMem     AM info
  3. Run the benchmark 10 times
for i in $(seq 1 10); do ./benchmarks/hibench/run.sh; done
  4. Download all the files
./cluster.sh hdfs-download "/hibench-*.report"
  5. Move the files to a specific directory
$ mkdir results/hibench/sa
$ mv hibench* results/hibench/sa
$ ls -1 results/hibench/sa
hibench-20170117-2023.report
hibench-20170117-2027.report
hibench-20170117-2030.report
hibench-20170117-2034.report
hibench-20170117-2038.report
hibench-20170117-2042.report
hibench-20170117-2046.report
hibench-20170117-2050.report
hibench-20170117-2054.report
hibench-20170117-2058.report
  6. The provided script now allows us to show the difference.
./benchmarks/hibench/analysis/hibench-report.R results/hibench

It will create a new dataset for each subdirectory of the directory given as an argument. It will then compute the bootstrap confidence intervals and visualize them for both the duration and the throughput.
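
The bootstrap confidence intervals mentioned here are nonparametric intervals of the mean; one way to compute such an interval by hand is Hmisc's smean.cl.boot (a toy sketch with illustrative durations, not the script itself):

Rscript -e 'library(Hmisc, quiet=T); durations <- c(37.8, 38.3, 36.9, 37.2); print(smean.cl.boot(durations))'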

Benchmark durations
(hibench-duration.pdf)
Benchmark throughputs
(hibench-throughput.pdf)

We can see that with the feedback control loop the benchmarks run slightly faster. However, the difference is very small. The reason is that the nodes in the cluster are very small: they have only 2GB of RAM, which in YARN translates into the possibility of running only two containers. The feedback controller is designed to balance the number of containers between application masters (in the MapReduce case, the process responsible for coordinating the complete MapReduce job) and individual application tasks (in the MapReduce case, the individual map and reduce tasks). For more details about how this particular feedback control loop works and for results at a larger scale, please refer to the paper.

Stop the cluster

First, we stop the Hadoop services:

./cluster.sh stop-hadoop

And then we can stop the virtual machines:

./cluster.sh stop-cluster

If you want to free all the space taken, you can also destroy the cluster, which will remove all the created VMs:

./cluster.sh destroy-cluster
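
After destroying the cluster, the list of docker machines should be empty again (assuming there were no other machines on the system):

$ docker-machine ls
NAME   ACTIVE   DRIVER   STATE   URL   SWARM   DOCKER   ERRORS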

Troubleshooting

Running a distributed system is never easy. Even though hadoop-benchmark tries to simplify all the steps, one may encounter some problems. The ./cluster.sh commands are meant to be as idempotent as possible, so running something twice should produce the same result. If something does not work as expected, it is a good idea to retry the command. If that does not work, one can try to undo the command if possible (stopping Hadoop and starting it again, stopping the cluster and starting it again). If that does not work either, one can always start fresh by running ./cluster.sh destroy-cluster. Finally, please submit an issue on GitHub.

Resources