
Spark Workshop

"Getting started with Spark Dataframes"

Goals

  1. Motivation for parallel processing?

  2. Background on Spark's development; relate it to Hadoop and the Apache Software Foundation and explain how the pieces interoperate. The goal is to give people context on which tools they might be interested in. Show the Hadoop ecosystem list (gently, not as a checklist of things to know).

  3. Install and configure Spark on a single computer.

  4. Explanation of which syntax we're using (RDD operations vs. parsed SQL vs. Hive SQL vs. DataFrame methods); a short sketch follows this list.

  5. Explanation of how to read the documentation: which scope a function already lives in according to pydoc, and how to structure imports to match it.

  6. Example use cases (no word count!): histogram generation, data cubes, machine learning, user-defined aggregates?

  7. Pointers to where to learn about executor and memory tuning (they will need it later). Good transition to running on a cluster with larger data?

  8. Demo cluster submission. Make the code runnable on the cluster with only file name changes.
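
A minimal sketch of the syntax comparison in goal 4 (DataFrame methods vs. parsed SQL over the same data). This assumes PySpark on Spark 2.x or later; the file name occurrences.csv and the genus column are hypothetical placeholders, not from the workshop materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-workshop").getOrCreate()

# Read a CSV into a DataFrame (hypothetical file and column names)
df = spark.read.csv("occurrences.csv", header=True, inferSchema=True)

# DataFrame-method syntax
by_genus = df.groupBy("genus").agg(F.count("*").alias("n"))

# The same query in parsed SQL against a temporary view
df.createOrReplaceTempView("occurrences")
by_genus_sql = spark.sql(
    "SELECT genus, COUNT(*) AS n FROM occurrences GROUP BY genus")

by_genus.orderBy(F.desc("n")).show(10)
```

For the cluster-submission demo (goal 8), a script like this could be handed to spark-submit unchanged, with only the input path swapped for one on the cluster's filesystem.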

Story

TF-IDF-based cosine clustering? A case where we could do the (n choose 2) pairwise comparisons either in a plain loop or with Spark and compare the times? Contrived and inefficient, but it would be a good opportunity to talk about dimensionality reduction (though that drifts away from the core of the workshop; algorithms are not what we're teaching). A rough sketch follows.
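
A rough sketch of what the TF-IDF / cosine idea could look like in PySpark, assuming Spark 2.x or later (crossJoin needs 2.1+). The sample documents are made up; the all-pairs cross join is the (n choose 2) step that could be timed against a plain Python loop.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, Normalizer

spark = SparkSession.builder.appName("tfidf-cosine-sketch").getOrCreate()

# Made-up documents standing in for whatever corpus the workshop uses
docs = spark.createDataFrame(
    [(0, "spark makes parallel processing simple"),
     (1, "parallel processing with spark dataframes"),
     (2, "a plain python loop compares documents one pair at a time")],
    ["id", "text"])

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 12).transform(words)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# L2-normalize so that a dot product of two rows is their cosine similarity
normed = Normalizer(inputCol="tfidf", outputCol="vec").transform(tfidf).select("id", "vec")

dot = F.udf(lambda u, v: float(u.dot(v)), DoubleType())

# The (n choose 2) all-pairs step: every document against every other
pairs = (normed.alias("a").crossJoin(normed.alias("b"))
         .where(F.col("a.id") < F.col("b.id"))
         .select(F.col("a.id").alias("i"),
                 F.col("b.id").alias("j"),
                 dot(F.col("a.vec"), F.col("b.vec")).alias("cosine")))
pairs.show()
```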

Refs

http://www.infoworld.com/article/3045593/it-jobs/stack-overflow-survey-javascript-reigns-female-developers-mia.html?google_editors_picks=true

"The second biggest gainer is Apache Spark, the in-memory data processing framework commonly used with Hadoop. (Biggest loser: Windows Phone. Who's shocked?) Spark also cracked the top of another list: Technology that garners the highest-paid talent. A good Spark dev can command as much as $125,000 a year, thanks to its wide use in finance. But JavaScript devs didn't do too badly, with the average pay for a full-stack JS developer just south of $100,000 a year."
