This repo contains Dockerfiles for Apache Spark running in standalone mode. Standalone mode is the easiest to set up and provides almost all the same features as the other cluster managers if you are only running Spark. The Apache Spark Docker image is available directly from Docker Hub.
Copy the `docker-compose.yml` file and run the following command:

```sh
docker-compose up
```
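If you want the cluster in the background, or more than one worker, the standard Compose flags apply. A short sketch; the `worker` service name here is an assumption about how the compose file is written and may differ in your copy:

```sh
# Run the cluster detached (in the background)
docker-compose up -d

# Scale out to two workers ("worker" is the assumed service name)
docker-compose up -d --scale worker=2

# Tear the cluster down when finished
docker-compose down
```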
This should run a Spark cluster on your host machine at `localhost:7077`. You can connect to it remotely from any Spark shell. A short PySpark example is provided below that will work with the Jupyter notebook running at `localhost:8888`.
```python
import random

from pyspark import SparkConf, SparkContext

# Connect to the standalone master started by docker-compose
conf = SparkConf().setAppName('test').setMaster('spark://master:7077')
sc = SparkContext(conf=conf)

NUM_SAMPLES = 100000

def inside(p):
    # Sample a point in the unit square and test whether it
    # falls inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)) \
          .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
```
Be sure that your worker is using the desired number of cores and amount of memory. These can be set directly in the `docker-compose.yml` file:

```yaml
SPARK_WORKER_CORES: 4
SPARK_WORKER_MEMORY: 2g
```
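For orientation, here is a minimal sketch of what such a compose file might look like. This is an assumption for illustration, not the repository's exact file; the service names, image tag, and commands may differ:

```yaml
version: '3'
services:
  master:
    image: p7hb/docker-spark:2.2.0
    hostname: master
    # Run the master in the foreground so the container stays up
    command: spark-class org.apache.spark.deploy.master.Master -h master
    ports:
      - "7077:7077"   # master RPC port
      - "8080:8080"   # master web UI
  worker:
    image: p7hb/docker-spark:2.2.0
    command: spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    environment:
      SPARK_WORKER_CORES: 4
      SPARK_WORKER_MEMORY: 2g
    ports:
      - "8081:8081"   # worker web UI
    depends_on:
      - master
```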
There are two ways of getting this image:

- Build the image using the `Dockerfile`, OR
- Pull the image directly from Docker Hub.

To build, clone this repository and invoke the following commands:

```sh
git clone https://github.com/Ouwen/docker-spark.git && cd docker-spark
docker build -t p7hb/docker-spark .
```

To pull directly from Docker Hub:

```sh
docker pull p7hb/docker-spark
```
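Either way, you can confirm the image is available locally before running it:

```sh
# Lists the local tags of the image, if any
docker images p7hb/docker-spark
```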
The latest Spark version as of 11th July, 2017 is `2.2.0`. So, `:latest` and `:2.2.0` both refer to the same image.
```sh
docker run -it -p 7077:7077 -p 4040:4040 -p 8080:8080 -p 8081:8081 p7hb/docker-spark
```

The above step launches the latest image and drops you into its bash shell. We preset a couple of ports for the following purposes:

- `7077` is the port bound to the Spark master process
- `8080` is the port bound to the Spark master web UI
- `8081` is the port bound to the Spark worker web UI
- `4040` is the port bound to the Spark application web UI
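As a quick sanity check from the host, the web UIs should respond over these mapped ports once the master and worker below have been started:

```sh
# Each should return the HTML of the corresponding web UI
curl -s http://localhost:8080 | head -n 5   # master web UI
curl -s http://localhost:8081 | head -n 5   # worker web UI
```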
All the required binaries have been added to the `PATH`. Run the following in a running container to start a master and a worker:

```sh
start-master.sh
start-slave.sh spark://0.0.0.0:7077
```
Submit the bundled SparkPi example to the cluster:

```sh
spark-submit --class org.apache.spark.examples.SparkPi --master spark://0.0.0.0:7077 $SPARK_HOME/examples/jars/spark-examples*.jar 100
.......
.......
Pi is roughly 3.140495114049511
```
You can also connect an interactive shell to the cluster:

```sh
spark-shell --master spark://0.0.0.0:7077
```

Note that the Spark application web UI on port `4040` is only available for the duration of the application.
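To verify the cluster without staying in the interactive shell, you can pipe a one-line expression into `spark-shell`; this is just a sketch, and any simple expression works:

```sh
# Runs one line of Scala on the cluster and exits;
# should print "res0: Double = 5050.0"
echo 'sc.parallelize(1 to 100).sum' | spark-shell --master spark://0.0.0.0:7077
```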
If you are using Docker Machine, the following command prints the IP address at which all the exposed ports of our Docker container can be reached:

```sh
docker-machine ip default
```

Some useful commands for inspecting containers:

```sh
docker ps            # list running containers
docker ps -a         # list all containers, including stopped ones
docker stats --all   # live stream of resource usage for all containers
docker inspect <<Container_Name>> | grep IPAddress   # find a container's IP address
```
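Rather than grepping, `docker inspect` can also extract the address directly with a Go template (the container name is a placeholder):

```sh
# Prints only the container's IP address on the default bridge network
docker inspect --format '{{.NetworkSettings.IPAddress}}' <<Container_Name>>
```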
We can open a new terminal with a new instance of the container's shell with either of the following commands:

```sh
docker exec -it <<Container_ID>> /bin/bash     # by Container ID
docker exec -it <<Container_Name>> /bin/bash   # by Container Name
```
Depending on the version of the Spark image you want, run the corresponding command. The latest image is always the most recent version of Apache Spark available; as of 11th July, 2017 it is v2.2.0.

- Dockerfile for Apache Spark v2.2.0 (latest): `docker pull p7hb/docker-spark`
- Dockerfile for Apache Spark v2.2.0: `docker pull p7hb/docker-spark:2.2.0`
- Dockerfile for Apache Spark v2.1.1: `docker pull p7hb/docker-spark:2.1.1`
- Dockerfile for Apache Spark v2.1.0: `docker pull p7hb/docker-spark:2.1.0`
- Dockerfile for Apache Spark v2.0.2: `docker pull p7hb/docker-spark:2.0.2`
- Dockerfile for Apache Spark v2.0.1: `docker pull p7hb/docker-spark:2.0.1`
- Dockerfile for Apache Spark v2.0.0: `docker pull p7hb/docker-spark:2.0.0`
- Dockerfile for Apache Spark v1.6.3: `docker pull p7hb/docker-spark:1.6.3`
- Dockerfile for Apache Spark v1.6.2: `docker pull p7hb/docker-spark:1.6.2`
Other Spark image versions of this repository can be booted by suffixing the image with the Spark version. Valid suffixes are `2.2.0`, `2.1.1`, `2.1.0`, `2.0.2`, `2.0.1`, `2.0.0`, `1.6.3` and `1.6.2`:
```sh
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:2.2.0
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:2.1.1
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:2.1.0
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:2.0.2
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:2.0.1
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:2.0.0
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:1.6.3
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:1.6.2
```
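Note that each of these commands uses the same fixed container name, so a second run will fail until the previous container is removed:

```sh
# Remove the old container named "spark" before starting a new one
docker rm -f spark
```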
This image contains the following software:

- OpenJDK 64-Bit v1.8.0_131
- Scala v2.12.2
- SBT v0.13.15
- Apache Spark v2.2.0
```sh
root@spark:~# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

root@spark:~# scala -version
Scala code runner version 2.12.2 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc.
```
Running `sbt about` will download and set up SBT on the image.
```sh
root@spark:~# spark-shell
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1483032227786).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
If you find any issues or would like to discuss further, please ping me on my Twitter handle @P7h or drop me an email.
Copyright © 2016 Prashanth Babu.
Modified work Copyright © 2018 Ouwen Huang.
Licensed under the Apache License, Version 2.0.