Sparkling Water integrates H2O's fast, scalable machine learning engine with Spark.
Requirements:

- Linux or OS X (Windows support is coming)
- Java 7
- Spark 1.1.0
The `SPARK_HOME` shell variable should point to your local Spark installation.
Use the provided `gradlew` script to build the project:

```
./gradlew build
```
To avoid running tests, use the `-x test` option, for example `./gradlew build -x test`.
To build a package that can be submitted to a Spark cluster:

```
./gradlew assemble
```
Set the configuration of the demo Spark cluster, for example `local-cluster[3,2,1024]`:

```
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"
```
In this example, the value `local-cluster[3,2,1024]` creates an embedded cluster of 3 worker nodes, each with 2 cores and 1 GB of memory.
And run the example:

```
bin/run-example.sh
```
This runs the Deep Learning demo. It is finished when you see the following line (press Ctrl-C to stop the demo cluster):

```
===> Model predictions: 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
```
For more details about the demo, please see the README.md file in the examples directory.
The Sparkling shell provides a regular Spark shell extended with support for creating an H2O cloud and executing H2O algorithms.
First, build a package containing Sparkling Water:

```
./gradlew assemble
```
Configure the location of the Spark cluster:

```
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"
```
In this case, `local-cluster[3,2,1024]` points to an embedded cluster of 3 worker nodes, each with 2 cores and 1 GB of memory.
And run Sparkling Shell:

```
bin/sparkling-shell
```
Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, pass the `spark.executor.memory` parameter:

```
bin/sparkling-shell --conf "spark.executor.memory=4g"
```
- Run Sparkling shell with an embedded cluster:

  ```
  export SPARK_HOME="/path/to/spark/installation"
  export MASTER="local-cluster[3,2,1024]"
  bin/sparkling-shell
  ```
- You can go to http://localhost:4040/ to see the Sparkling shell (i.e., Spark driver) status.
- Now you can launch H2O inside the Spark cluster:

  ```scala
  import org.apache.spark.h2o._
  val h2oContext = new H2OContext(sc).start()
  import h2oContext._
  ```

  Note: The `H2OContext#start` API call figures out the number of Spark workers and launches a corresponding number of H2O instances inside the Spark cluster.
- Import the provided airlines data and parse it via the H2O parser:

  ```scala
  import java.io.File
  val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
  val airlinesData = new DataFrame(new File(dataFile))
  ```
- Use the data via the RDD API:

  ```scala
  import org.apache.spark.examples.h2o._
  val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
  ```
- Compute the number of rows inside the RDD:

  ```scala
  airlinesTable.count
  ```

  or compute the number of rows via the H2O API:

  ```scala
  airlinesData.numRows()
  ```
- Select only flights with destination in SFO with the help of Spark SQL:

  ```scala
  import org.apache.spark.sql.SQLContext
  val sqlContext = new SQLContext(sc)
  import sqlContext._
  airlinesTable.registerTempTable("airlinesTable")

  // Select only interesting columns and flights with destination in SFO
  val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO'"
  val result = sql(query)
  ```
- Launch the H2O Deep Learning algorithm on the result of the SQL query:

  ```scala
  import hex.deeplearning._
  import hex.deeplearning.DeepLearningModel.DeepLearningParameters

  val dlParams = new DeepLearningParameters()
  dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
    'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
    'Distance, 'IsDepDelayed)
  dlParams._response_column = 'IsDepDelayed

  // Launch computation
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.trainModel.get
  ```
- Use the model for prediction:

  ```scala
  val predictionH2OFrame = dlModel.score(result)('predict)
  val predictionsFromModel = toRDD[DoubleHolder](predictionH2OFrame).collect.map(_.result.getOrElse(Double.NaN))
  ```
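As a quick sanity check of the walkthrough above, you can inspect what it produced. The following minimal sketch is illustrative only (it is not part of the original demo) and assumes the `result` and `predictionsFromModel` values defined in the previous steps:

```scala
// Number of flights with destination SFO returned by the SQL query
println("SFO flights: " + result.count)

// Print the first few predicted values produced by the Deep Learning model
predictionsFromModel.take(10).foreach(println)
```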
The `docker` directory contains basic support for running Sparkling Water on top of a Docker container:
```
$ cd docker
$ ./build.sh
$ docker run -i -t --rm sparkling-test-base bin/run-example.sh
$ docker run -i -t --rm sparkling-test-base bin/sparkling-shell
```