Sparkling Water integrates H2O's fast, scalable machine learning engine with Spark.
Requirements:

- Linux or OS X (Windows support is coming)
- Java 7
- Spark 1.1.0
The `SPARK_HOME` shell variable should point to your local Spark installation.
Use the provided `gradlew` script to build the project:

```
./gradlew build
```
To avoid running tests, use the `-x test` option, for example `./gradlew build -x test`.
To build a package that can be submitted to a Spark cluster:

```
./gradlew assemble
```
Set the configuration of the demo Spark cluster, for example `local-cluster[3,2,1024]`:

```
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"
```
In this example, the value `local-cluster[3,2,1024]` creates an embedded cluster of 3 worker nodes, each with 2 cores and 1 GB of memory.
And run the example:

```
bin/run-example.sh
```
This runs the Deep Learning demo. It is finished when you see the following line (press Ctrl-C to stop the demo cluster):

```
===> Model predictions: 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
```
For more details about the demo, please see the README.md file in the examples directory.
The Sparkling shell provides a regular Spark shell extended with support for creating an H2O cloud and executing H2O algorithms.
First, build a package containing Sparkling Water:

```
./gradlew assemble
```
Configure the location of the Spark cluster:

```
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"
```
In this case, `local-cluster[3,2,1024]` points to an embedded cluster of 3 worker nodes, each with 2 cores and 1 GB of memory.
And run Sparkling Shell:

```
bin/sparkling-shell
```
Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, pass the `spark.executor.memory` parameter:

```
bin/sparkling-shell --conf "spark.executor.memory=4g"
```
- Run Sparkling shell with an embedded cluster:

  ```
  export SPARK_HOME="/path/to/spark/installation"
  export MASTER="local-cluster[3,2,1024]"
  bin/sparkling-shell
  ```
- You can go to http://localhost:4040/ to see the Sparkling shell (i.e., Spark driver) status.
- Now you can launch H2O inside the Spark cluster:

  ```scala
  import org.apache.spark.h2o._
  val h2oContext = new H2OContext(sc).start()
  import h2oContext._
  ```

  Note: The `H2OContext#start` API call figures out the number of Spark workers and launches a corresponding number of H2O instances inside the Spark cluster.
- Import the provided airlines data and parse it via the H2O parser:

  ```scala
  import java.io.File
  val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
  val airlinesData = new DataFrame(new File(dataFile))
  ```
- Use the data via the RDD API:

  ```scala
  import org.apache.spark.examples.h2o._
  val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
  ```
- Compute the number of rows inside the RDD:

  ```scala
  airlinesTable.count
  ```

  or compute the number of rows via the H2O API:

  ```scala
  airlinesData.numRows()
  ```
- Select only flights with destination in SFO with the help of Spark SQL:

  ```scala
  import org.apache.spark.sql.SQLContext
  val sqlContext = new SQLContext(sc)
  import sqlContext._
  airlinesTable.registerTempTable("airlinesTable")

  // Select only interesting columns and flights with destination in SFO
  val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO'"
  val result = sql(query)
  ```
- Launch the H2O Deep Learning algorithm on the result of the SQL query:

  ```scala
  import hex.deeplearning._
  import hex.deeplearning.DeepLearningModel.DeepLearningParameters

  val dlParams = new DeepLearningParameters()
  dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
    'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
    'Distance, 'IsDepDelayed)
  dlParams._response_column = 'IsDepDelayed

  // Launch computation
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.trainModel.get
  ```
- Use the model for prediction:

  ```scala
  val predictionH2OFrame = dlModel.score(result)('predict)
  val predictionsFromModel = toRDD[DoubleHolder](predictionH2OFrame).collect.map(_.result.getOrElse(Double.NaN))
  ```
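As a quick sanity check of the walkthrough above, you can inspect what it produced. The following minimal sketch is illustrative only (it is not part of the original demo) and assumes the `result` and `predictionsFromModel` values defined in the previous steps:

```scala
// Number of flights with destination SFO returned by the SQL query
println("SFO flights: " + result.count)

// Print the first few predicted values produced by the Deep Learning model
predictionsFromModel.take(10).foreach(println)
```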
The `docker` directory contains basic support for running Sparkling Water on top of a Docker container:
```
$ cd docker
$ ./build.sh
$ docker run -i -t --rm sparkling-test-base bin/run-example.sh
$ docker run -i -t --rm sparkling-test-base bin/sparkling-shell
```