This is Apache Spark with modifications to run security-sensitive code inside Intel SGX enclaves. The implementation leverages sgx-lkl, a library OS that makes it possible to run Java-based applications inside SGX enclaves.
This guide shows how to run Sgx-Spark in a few simple steps using Docker. Most parts of the setup and deployment are wrapped inside Docker containers, so compilation and deployment should be smooth.
- Clone this Sgx-Spark repository.
- Build the Sgx-Spark base image. The name of the resulting Docker image is sgxspark. This process might take a while (30-60 mins):
  sgx-spark/dockerfiles$ docker build -t sgxspark .
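  To verify that the image was built (an optional check, not part of the original steps), you can list it:
  $ docker images sgxspark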
- Prepare the disk image required by sgx-lkl. Due to Docker restrictions, this step currently cannot be part of the Docker build above and is therefore platform-dependent. It has been tested successfully on Ubuntu 16.04 and Arch Linux:
  sgx-spark/lkl$ make prepare-image
- Create a Docker network that will be used for communication between the Docker containers. Note that for a user-defined network, Docker provides an embedded DNS server, so workers can find the Spark master by name:
  sgx-spark$ docker network create sgxsparknet
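  Optionally (not part of the original steps), you can inspect the network once the containers below are running and confirm that they are attached to it:
  sgx-spark$ docker network inspect sgxsparknet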
From within the directory sgx-spark/dockerfiles, run the Sgx-Spark master node, the Sgx-Spark worker node, and the actual Sgx-Spark job as follows.
- Run the Sgx-Spark master node:
  sgx-spark/dockerfiles$ docker run \
      --user user \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --name sgxspark-docker-master \
      -p 7077:7077 \
      -p 8082:8082 \
      -ti sgxspark /sgx-spark/master.sh
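  To check that the master came up (an optional step, not part of the original instructions), you can follow the container's log output from another shell:
  $ docker logs -f sgxspark-docker-master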
- Run the Sgx-Spark worker node:
  sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -ti sgxspark /sgx-spark/worker-and-enclave.sh
- Run the Sgx-Spark job. As of writing, the three jobs below are known to be fully supported:
- WordCount
  sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -e SPARK_JOB_CLASS=org.apache.spark.examples.MyWordCount \
      -e SPARK_JOB_NAME=WordCount \
      -e SPARK_JOB_ARG0=README.md \
      -e SPARK_JOB_ARG1=output \
      -ti sgxspark /sgx-spark/driver-and-enclave.sh
- KMeans
  sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -e SPARK_JOB_CLASS=org.apache.spark.examples.mllib.KMeansExample \
      -e SPARK_JOB_NAME=KMeans \
      -e SPARK_JOB_ARG0=data/mllib/kmeans_data.txt \
      -ti sgxspark /sgx-spark/driver-and-enclave.sh
- LineCount
  sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -e SPARK_JOB_CLASS=org.apache.spark.examples.LineCount \
      -e SPARK_JOB_NAME=LineCount \
      -e SPARK_JOB_ARG0=SgxREADME.md \
      -ti sgxspark /sgx-spark/driver-and-enclave.sh
To run Sgx-Spark natively, proceed as follows.
Install all required dependencies. For Ubuntu 16.04, these can be installed as follows:
$ sudo apt-get update
$ sudo apt-get install -y --no-install-recommends scala libtool autoconf curl xutils-dev git build-essential maven openjdk-8-jdk ssh bc python autogen wget autotools-dev sudo automake
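As an optional sanity check (not part of the original steps), confirm that the JDK 8 and Maven installed above are picked up:
$ java -version     # should report a 1.8.x JVM
$ mvn -version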
Hadoop, and thus Spark, depends on Google Protocol Buffers (GPB) version 2.5.0:
- Make sure to uninstall any other versions of GPB.
- Install GPB v2.5.0. Instructions for Ubuntu 16.04 are as follows:
  $ cd /tmp
  /tmp$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
  /tmp$ tar xvf protobuf-2.5.0.tar.gz
  /tmp$ cd protobuf-2.5.0
  /tmp/protobuf-2.5.0$ ./autogen.sh && ./configure && make && sudo make install
  /tmp/protobuf-2.5.0$ sudo apt-get install -y --no-install-recommends libprotoc-dev
Instructions for Arch Linux are available at https://stackoverflow.com/a/29799354/2273470.
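As an optional check (not part of the original steps), verify that the expected GPB version is now active:
$ protoc --version     # should report libprotoc 2.5.0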
As Sgx-Spark uses sgx-lkl, the latter must have been downloaded and compiled successfully. As of writing (June 14, 2018), sgx-lkl should be compiled using branch cleanup-musl. Please follow the sgx-lkl documentation and ensure that your sgx-lkl installation executes simple Java applications successfully.
Compile Sgx-Spark:
sgx-spark$ build/mvn -DskipTests package
- As part of this compilation process, a modified Hadoop library has been compiled. Copy the Hadoop JAR file into the Sgx-Spark jars directory:
  sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
- Sgx-Spark ships with a native C library (libringbuff.so) that enables shared-memory-based communication between two JVMs. Compile it as follows:
  sgx-spark/C$ make install
- Adjust the file sgx-spark/lkl/Makefile for your environment: variable SGX_LKL must point to your sgx-lkl directory (see Prerequisites), as in the sketch below.
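  A minimal sketch of the resulting Makefile line, assuming sgx-lkl was cloned to /home/user/sgx-lkl (both the path and the exact assignment syntax in the Makefile may differ in your checkout):
  SGX_LKL := /home/user/sgx-lkl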
- Build the Sgx-Spark disk image required for sgx-lkl:
  sgx-spark/lkl$ make clean all
Finally, we are ready to run (i) the Sgx-Spark master node, (ii) the Sgx-Spark worker node, (iii) the worker's enclave, (iv) the Sgx-Spark client, and (v) the client's enclave. In the following commands, replace <hostname> with the master node's actual hostname, and <sgx-lkl> with the path to your sgx-lkl installation.
Note: After running each example, make sure to (i) restart all processes and (ii) delete all shared memory files in /dev/shm.
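A minimal cleanup sketch, assuming no other application on the machine keeps files in /dev/shm that you need to preserve:
$ rm -f /dev/shm/*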
- If you run all the nodes locally, you need to add the following line to variables.sh:
  export SPARK_LOCAL_IP=127.0.0.1
- Run the Master node:
  sgx-spark$ ./master.sh
- Run the Worker node:
  sgx-spark$ ./worker.sh
- Run the enclave for the Worker node:
  sgx-spark$ ./worker-enclave.sh
- Run the enclave for the driver program. This is the process that will output the job results!
  sgx-spark$ ./driver-enclave.sh
- Finally, submit a Spark job. The result will be printed by the driver-enclave process started in the previous step.
- WordCount
  sgx-spark$ ./submitwordcount.sh
- KMeans
  sgx-spark$ ./submitkmeans.sh
- LineCount
  sgx-spark$ ./submitlinecount.sh
In order to run the above installation without SGX, start your environment as follows:
- Start the Master node as above.
- Start the Worker node as above, but change environment variable SGX_ENABLED=true to SGX_ENABLED=false.
- Do not start the enclaves.
- Submit the Spark job as above, but change environment variable SGX_ENABLED=true to SGX_ENABLED=false (see the sketch below).
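A minimal sketch of such a run, assuming SGX_ENABLED is picked up from the environment (where the variable is actually set, e.g. in variables.sh, depends on your setup):
sgx-spark$ SGX_ENABLED=false ./worker.sh
sgx-spark$ SGX_ENABLED=false ./submitwordcount.sh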
There are a few important things to keep in mind when developing Sgx-Spark:
- Whenever you change parts of the code, you must recompile the Spark code:
  sgx-spark$ mvn package -DskipTests
  There have been situations (hard to pin down precisely) in which the above command did not recompile all of the changed files. In that case, issue:
  sgx-spark$ mvn clean package -DskipTests
- After making changes to the Sgx-Spark code and compiling the Java/Scala code (see above), you always need to rebuild the lkl image used by sgx-lkl:
  sgx-spark/lkl$ make clean all
- If you changed parts of the Hadoop code (in directory hadoop-2.6.5-src), you will also need to copy the resulting *.jar file:
  sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
- Lastly, do not forget to remove all related shared memory files in /dev/shm/ before running your next experiment!
Development with sgx-lkl can be tedious. For development purposes, a special flag makes it possible to run the enclave side of Sgx-Spark in a regular JVM rather than on top of sgx-lkl. To make use of this feature, run the enclave JVMs using the scripts worker-enclave-nosgx.sh and driver-enclave-nosgx.sh.
Under the hood, these scripts set the environment variable DEBUG_IS_ENCLAVE_REAL=false (which defaults to true) and provide the JVM with a value for the environment variable SGXLKL_SHMEM_FILE. Note that the value of SGXLKL_SHMEM_FILE must be the same as the one provided to the corresponding Worker (worker.sh) and Driver (driver.sh).
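As an illustration of this matching requirement, a hypothetical sketch (the shared-memory file name is made up, and whether worker.sh takes SGXLKL_SHMEM_FILE from the environment depends on your setup):
# Both sides must refer to the same shared-memory file (name is hypothetical).
sgx-spark$ SGXLKL_SHMEM_FILE=sgx-lkl-shmem-worker ./worker.sh
sgx-spark$ SGXLKL_SHMEM_FILE=sgx-lkl-shmem-worker ./worker-enclave-nosgx.sh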