
Spark Workshop

"Getting started with Spark Dataframes"

Goals

  1. Motivation for parallel processing?

  2. Background on Spark's development; relate it to Hadoop and the Apache Software Foundation and explain how the pieces interoperate. The goal is to give people context on which tools they might be interested in. Show the Hadoop ecosystem list (gently, not as a checklist of things to know).

  3. Install and configure Spark on a single computer.

  4. Explanation of which syntax we're using (RDD operations vs. parsed SQL vs. Hive SQL vs. DataFrame methods); a short sketch follows this list.

  5. Explanation of how to read the documentation: which scope a function already lives in according to pydoc, and how to structure imports to match it.

  6. Example use cases (no word count!): histogram generation, data cubes, machine learning, user-defined aggregates?

  7. Pointers to where to learn about executor and memory tuning (they will need it later). Good transition to running on a cluster with larger data?

  8. Demo cluster submission. Make the code runnable on the cluster with only file name changes.
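
A minimal sketch of the syntax comparison in goal 4 (DataFrame methods vs. parsed SQL over the same data). This assumes PySpark on Spark 2.x or later; the file name occurrences.csv and the genus column are hypothetical placeholders, not from the workshop materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-workshop").getOrCreate()

# Read a CSV into a DataFrame (hypothetical file and column names)
df = spark.read.csv("occurrences.csv", header=True, inferSchema=True)

# DataFrame-method syntax
by_genus = df.groupBy("genus").agg(F.count("*").alias("n"))

# The same query in parsed SQL against a temporary view
df.createOrReplaceTempView("occurrences")
by_genus_sql = spark.sql(
    "SELECT genus, COUNT(*) AS n FROM occurrences GROUP BY genus")

by_genus.orderBy(F.desc("n")).show(10)
```

For the cluster-submission demo (goal 8), a script like this could be handed to spark-submit unchanged, with only the input path swapped for one on the cluster's filesystem.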

Story

TF-IDF-based cosine clustering? A case where we could do the (n choose 2) pairwise comparisons either in a plain loop or with Spark and compare the times? Contrived and inefficient, but it would be a good opportunity to talk about dimensionality reduction (though that drifts away from the core of the workshop; algorithms are not what we're teaching). A rough sketch follows.
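
A rough sketch of what the TF-IDF / cosine idea could look like in PySpark, assuming Spark 2.x or later (crossJoin needs 2.1+). The sample documents are made up; the all-pairs cross join is the (n choose 2) step that could be timed against a plain Python loop.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, Normalizer

spark = SparkSession.builder.appName("tfidf-cosine-sketch").getOrCreate()

# Made-up documents standing in for whatever corpus the workshop uses
docs = spark.createDataFrame(
    [(0, "spark makes parallel processing simple"),
     (1, "parallel processing with spark dataframes"),
     (2, "a plain python loop compares documents one pair at a time")],
    ["id", "text"])

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 12).transform(words)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# L2-normalize so that a dot product of two rows is their cosine similarity
normed = Normalizer(inputCol="tfidf", outputCol="vec").transform(tfidf).select("id", "vec")

dot = F.udf(lambda u, v: float(u.dot(v)), DoubleType())

# The (n choose 2) all-pairs step: every document against every other
pairs = (normed.alias("a").crossJoin(normed.alias("b"))
         .where(F.col("a.id") < F.col("b.id"))
         .select(F.col("a.id").alias("i"),
                 F.col("b.id").alias("j"),
                 dot(F.col("a.vec"), F.col("b.vec")).alias("cosine")))
pairs.show()
```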

Refs

http://www.infoworld.com/article/3045593/it-jobs/stack-overflow-survey-javascript-reigns-female-developers-mia.html?google_editors_picks=true

"The second biggest gainer is Apache Spark, the in-memory data processing framework commonly used with Hadoop. (Biggest loser: Windows Phone. Who's shocked?) Spark also cracked the top of another list: Technology that garners the highest-paid talent. A good Spark dev can command as much as $125,000 a year, thanks to its wide use in finance. But JavaScript devs didn't do too badly, with the average pay for a full-stack JS developer just south of $100,000 a year."
