Tips and Tricks

This repo contains an assorted collection of Spark code, written mostly in Python (using the PySpark API), along with some code and scripts in Scala and SparkR. Feel free to copy and use it as-is. Let me know if you have any questions or feedback regarding any of the code.

Zeppelin Notebook Hub (can be used to view Zeppelin notebooks exported in JSON format): https://www.zeppelinhub.com/viewer/
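
If you would rather inspect a notebook locally than through the viewer, the JSON export can be read with plain Python. The snippet below is a minimal sketch, not part of this repo: it assumes the common Zeppelin note layout in which a top-level `paragraphs` list holds entries with a `text` field, and the helper name `extract_paragraph_sources` and the sample note are made up for illustration.

```python
import json
from typing import List


def extract_paragraph_sources(note_json: str) -> List[str]:
    """Return the source text of each non-empty paragraph in a Zeppelin note."""
    note = json.loads(note_json)
    # Assumption: paragraphs sit under a top-level "paragraphs" list and each
    # paragraph stores its source (including the %interpreter line) in "text".
    return [p["text"] for p in note.get("paragraphs", []) if p.get("text")]


# Hand-made sample note, for illustration only (not taken from this repo).
sample_note = json.dumps({
    "name": "example_note",
    "paragraphs": [
        {"text": "%pyspark\nprint(sc.version)"},
        {"text": "%md\n## Notes"},
        {},  # paragraphs without source text are skipped
    ],
})

sources = extract_paragraph_sources(sample_note)
for i, src in enumerate(sources, start=1):
    print(f"--- paragraph {i} ---")
    print(src)
```

To use it on a real export, read the file contents into a string and pass that string in place of `sample_note`; if your notebook version uses a different layout, adjust the key names accordingly.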

Spark Tuning & Best Practices Reference: https://github.com/zaratsian/HDP_Tuning_Unofficial
Spark Tuning Tool: https://github.com/zaratsian/Spark/blob/master/spark_tuning_tool.py

Machine Learning Cheatsheets:
    • SKLearn - Choosing the right estimator
    • Keras Cheatsheet
    • SAS - ML Algorithms
    • MS Azure - ML Algorithms
    • Kaggle ML Solutions

References:
    • Apache Spark Quickstart
    • Spark PySpark (Python) API
    • Databricks - Guide
    • Databricks - Developer Resources
    • Spark Tuning Guide
    • Spark Tuning - Garbage Collection
    • Hortonworks - Spark Reference
    • Anaconda Hortonworks Management Packs
    • Apache Spark - Best Practices & Tuning
    • PySpark Cheatsheet