A Really Quick Spark Demo

Shows how to join a Cassandra (C*) table to a CSV file in Spark. What could be more fun than that? Lots of things, actually, but stop complaining. This is still kind of sick if you think about it.

Initializing the C* Data

Do these things in cqlsh and live:

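SOURCE 'schema.cql';
COPY movies FROM 'movies.csv';
-- Unquoted CQL names cannot contain hyphens; this assumes schema.cql names the table movie_genres
COPY movie_genres FROM 'movie-genres.csv';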

In Spark Shell

Do actual Spark things now:

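// Bring in the connector's implicits for sc.cassandraTable (skip this if your shell already imports them)
import com.datastax.spark.connector._

// Read the movies as a Pair RDD keyed by movie_id, with (title, year) as the value
val movies = sc.cassandraTable("killr_video","movies").as( (i:String,y:Int,t:String) => (java.util.UUID.fromString(i),(t,y)) )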

// Get the CSV into an RDD as text (adjust the path to wherever your copy lives)
val ratings = sc.textFile("file:///Users/tlberglund/workshops/spark/movie-ratings.csv")

// Parse the CSV, making a Pair RDD keyed by movie ID with the rating as the value
val ratingsByID = ratings.map { line =>
  val fields = line.split(',')
  (java.util.UUID.fromString(fields(0)), fields(1).toFloat)
}
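
The parse step above throws on the first header row, blank line, or bad UUID it meets. Here is a minimal sketch of a more forgiving version in plain Scala, no Spark required. The helper name `parseRating` and the sample lines are made up for illustration; the real contents of movie-ratings.csv are not shown in this README.

```scala
import java.util.UUID
import scala.util.Try

// Hypothetical helper, not part of the original demo: parse one CSV line
// into (movieId, rating), returning None for anything malformed.
def parseRating(line: String): Option[(UUID, Float)] =
  line.split(',') match {
    case Array(id, rating, _*) =>
      Try((UUID.fromString(id.trim), rating.trim.toFloat)).toOption
    case _ => None
  }

// Made-up sample lines: one good row, one header-like row, one blank line.
val sampleLines = Seq(
  "123e4567-e89b-12d3-a456-426614174000,4.5",
  "movie_id,rating",
  ""
)

// Only the well-formed row survives.
val parsed = sampleLines.flatMap(line => parseRating(line))
```

In the Spark shell the same helper should slot in as `ratings.flatMap(line => parseRating(line))` in place of the `map` call, assuming you would rather drop bad rows than fail the job.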

// Redefine movies to carry only the title, so the join below yields (title, averageRating)
val movies = sc.cassandraTable("killr_video","movies").as( (i:String,y:Int,t:String) => (java.util.UUID.fromString(i),t))

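// Number of ratings per movie
val ratingCount = ratingsByID.mapValues(v => 1).reduceByKey(_+_)
// Sum of ratings per movie
val ratingSums = ratingsByID.reduceByKey(_ + _)
// Average rating per movie: sum divided by count
val averageRatings = ratingCount.join(ratingSums).mapValues(r => r._2 / r._1)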
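
If the count, sum, and join dance feels abstract, here is a rough sketch of the same arithmetic on ordinary Scala collections. The sample data is invented, and short string keys stand in for the UUIDs to keep it readable.

```scala
// Made-up (movie, rating) pairs standing in for ratingsByID.
val sampleRatings = Seq(("a", 4.0f), ("a", 2.0f), ("b", 5.0f))

// Count of ratings per key, like mapValues(v => 1).reduceByKey(_ + _)
val counts: Map[String, Int] =
  sampleRatings.groupBy(_._1).map { case (k, vs) => (k, vs.size) }

// Sum of ratings per key, like reduceByKey(_ + _)
val sums: Map[String, Float] =
  sampleRatings.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// Pair each key's count with its sum and divide, like join + mapValues
val averages: Map[String, Float] =
  counts.map { case (k, n) => (k, sums(k) / n) }
```

The Spark version makes two passes over `ratingsByID` plus a join. A single `aggregateByKey` carrying a (sum, count) pair would probably do it in one pass, but the three-line version is easier to read aloud in a demo.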

val ratedMovies = movies.join(averageRatings)

// Pull the results back to the driver (fine for a demo-sized data set)
ratedMovies.collect
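
One thing worth knowing about the result: `join` on pair RDDs is an inner join, so a movie with no ratings never shows up in `ratedMovies`. A small sketch of that behavior with plain Scala maps follows; the ids, titles, and numbers are invented for illustration.

```scala
// Made-up stand-ins for the two pair RDDs, keyed by a short string id.
val sampleMovies = Map("m1" -> "Alien", "m2" -> "Brazil")
val sampleAverages = Map("m1" -> 4.5f, "m3" -> 2.0f)

// Inner join: keep only the keys present on both sides.
val joined: Map[String, (String, Float)] =
  sampleMovies.collect {
    case (id, title) if sampleAverages.contains(id) =>
      (id, (title, sampleAverages(id)))
  }

// "m2" has no ratings and "m3" has no movie row, so both are dropped.
```

If you want unrated movies to stick around, `movies.leftOuterJoin(averageRatings)` keeps them and wraps the rating in an `Option`.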

