Running on EC2
A guide to running eXpress-D on Amazon's Elastic Compute Cloud (EC2).
Set up an EC2 account on the Amazon Web Services (AWS) website.
The express-d/ec2-scripts directory contains a copy of Spark's EC2 scripts, which launch a cluster and set up Spark and Hadoop HDFS on it. These EC2 scripts are described in detail in Running Spark on EC2. What follows in this wiki page is a summary of the key points.
To launch a fresh cluster, do:
$ cd express-d/ec2-scripts
$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> -a ami-0b6d2662 launch <cluster-name>
Where ...

- <keypair> is the name of your EC2 keypair (i.e., the keypair filename without the .pem suffix).
- <key-file> is the private key file for the keypair.
- <num-slaves> is the number of slave instances to launch.
- <cluster-name> is the name to give to your cluster. This will be shown on the Spark WebUI.

and ...

- -a ami-0b6d2662 specifies a pre-built machine image that includes Spark, Hadoop, and the eXpress-D sources. This image is used to launch each slave instance.
Another useful optional flag is -t <instance-type>, which lets you specify the type of instances to launch (see Instance Types and Instance Costs).
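For example, a five-slave cluster of m1.large instances could be launched with something like the following; the keypair name, key file, instance type, and cluster name here are only placeholders:

$ cd express-d/ec2-scripts
$ ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 -t m1.large -a ami-0b6d2662 launch express-d-cluster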
The spark-ec2 script will take a while to complete setup. If the setup process is interrupted at any point (e.g., you accidentally close a terminal window while it is running), you can resume setup by passing the --resume flag to the command above.
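For example, re-running the launch command with --resume added should pick up the setup where it left off (same placeholder names as above):

$ ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 -a ami-0b6d2662 --resume launch express-d-cluster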
spark-ec2 can also be used to stop, resume, or terminate a cluster (see spark-ec2 --help).
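As a rough illustration, and assuming the action names reported by spark-ec2 --help are stop, start, and destroy, managing the cluster's lifecycle looks something like:

$ ./spark-ec2 -k my-keypair -i ~/my-keypair.pem stop express-d-cluster
$ ./spark-ec2 -k my-keypair -i ~/my-keypair.pem start express-d-cluster
$ ./spark-ec2 -k my-keypair -i ~/my-keypair.pem destroy express-d-cluster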
Once spark-ec2 has finished setting up the cluster, SSH into the master:
$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
To set up eXpress-D on the master, follow the steps on the Setting Up and Running eXpress-D page. The bin/build scripts will detect the cluster's slave instances and copy the express-d directory to each one.
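As a sketch only: assuming the express-d sources sit under /root on the master (as on the AMI above) and that bin/build can be invoked directly, the build step might look like the following; check the Setting Up and Running eXpress-D page for the exact invocation:

$ cd /root/express-d
$ ./bin/build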
After the express-d/config/config.py file has been configured and eXpress-D and Spark have been packaged, eXpress-D is ready to be run (but read the next section before actually running).
There is a small, simulated dataset in /root/sample-datasets to play with. Before running eXpress-D, however, this dataset must be loaded onto HDFS or Amazon S3. To load the targets and alignments files into HDFS, run:
$ /root/bin/hadoop dfs -put <path/to/targets.pb> /targets.pb
$ /root/bin/hadoop dfs -put <path/to/hits.pb> /hits.pb
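To confirm the upload, listing the HDFS root (with the same hadoop binary assumed above) should show both .pb files:

$ /root/bin/hadoop dfs -ls /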
HDFS is sufficient for most EC2 use cases. If a dataset will be accessed multiple times by different clusters running eXpress-D, then it might be more convenient to use Amazon S3 (i.e., storing sporadically accessed data on S3 is cheaper than keeping it in an HDFS running on a managed EC2 cluster). However, loading data into Amazon S3 is tricky; when we did so for the eXpress-D paper, it involved using Spark to load the dataset from HDFS as an RDD and then calling RDD#saveAsTextFile(s3n://...) on it.
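As a very rough sketch of that approach, and not the exact code used for the paper, something like the following could be run from the Spark shell on the master. The master hostname, bucket name, and AWS credentials are placeholders, and sc.textFile stands in for however the dataset is actually read (the .pb files themselves are binary and are normally parsed by eXpress-D's own readers):

// Hedged sketch: copy a dataset from HDFS out to an S3 bucket via Spark.
// <master>, <bucket>, and the embedded AWS credentials are placeholders;
// s3n can also pick up credentials from the Hadoop configuration instead.
val data = sc.textFile("hdfs://<master>:9000/some-dataset")
data.saveAsTextFile("s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@<bucket>/some-dataset")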
After the datasets have been loaded onto HDFS, make sure that config.py contains the HDFS path for each file before calling bin/run.
In config.py:
...
EXPRESS_RUNTIME_LOCAL_OPTS = [
    OptionSet("hits-file-path", ["%s:9000/hits.1M.pb" % SPARK_CLUSTER_URL]),
    OptionSet("targets-file-path", ["%s:9000/targets.pb" % SPARK_CLUSTER_URL]),
    ...