Materials for the Advanced Data Processing course of the Big Data Analytics Master at the Universitat Politècnica de València.
This course gives a 30-hour overview of many concepts, techniques and tools in data processing using Spark, including some key concepts from Apache Beam. We assume you're familiar with Python, but all the exercises can easily be followed in Java and Scala. We've included a Vagrant definition and Docker images for both Spark and Beam.
If you find a bug or want to contribute some comments, please file an issue in this repository or simply write to us. You're free to reuse the course materials; please follow the details in the license section.
- Brief intro to functional programming (a short Python sketch follows this list)
- Spark basics
- PySpark: transformations, actions and basic IO
- Spark SQL
- MLlib
- Graphs
- GraphX (Scala)
- GraphFrames (Python)
- Spark cluster deployment
- Single node
- Vagrant box playground
- Clustering
- Docker
- Kubernetes
- Cloud Dataproc - Start Tutorial (in Spanish)
- Apache Beam
- Rationale
- Docker container using Python SDK
- Slides (coming soon)
- Minio
- Apache Airflow: coordinating jobs
- Basic setup
- DAGs
- Cloud Composer
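To give a taste of the functional style the course starts from, here is a minimal Python sketch (illustrative only, not one of the course files) showing how map, filter and reduce compose; these are the same ideas behind Spark's transformations and actions:

```python
from functools import reduce

# map/filter build lazy pipelines, much like Spark transformations
numbers = range(1, 11)
squares = map(lambda n: n * n, numbers)
evens = filter(lambda n: n % 2 == 0, squares)

# reduce forces evaluation, much like a Spark action
total = reduce(lambda a, b: a + b, evens)
print(total)  # 220
```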
Team work using Aronson's jigsaw technique. We present a set of real case studies to solve, and teams have to design and develop a solution using any technology available in the market today.
In the first phase, the teams split up with the goal of becoming experts in a particular area, digging into the proposed tools and framework specifics. In the second phase, they return to their peers to design a system that covers the use case requirements. Each team gives a 15-minute presentation to share the results.
To be added soon, stay tuned!
- Functional programming (coming soon)
- Why you don't need big data tools
- poors_man_routes.sh - bash superpowers
- Basic data processing using PySpark (see the first sketch after this list)
- compras_con_mas_de_un_descuento.py
- compras_importe_total_agrupado_por_tx_id.py
- compras_conversion_a_dolares.py
- compras_top_ten_countries.py
- helpers.py - basic parse functions to get started quickly
- Spark SQL
- Spark Streaming
- MLlib (an ALS sketch follows this list)
- peliculas_0_ml.py - ALS intro
- peliculas_1_ml.py - Predictions
- GraphFrames (a shortest-paths sketch follows this list)
- friends.py - Classic graph sample
- ship_routes.py - Shortest paths for ship routes
- Apache Beam
- Apache Airflow (a minimal DAG sketch follows this list)
- Standalone Docker Image
- Tutorial for Composer in Cloud Shell [English / Spanish]
- hello_dags.py
- hello_python_operator.py
- hello_simple.py
- spark_ondemand.py
- spark_simple.py
- Deployment
- Single Node
- Vagrant
- Ansible
- Spark on Docker
- Beam on Docker
- Spark on Kubernetes
- Spark on Google Cloud Dataproc
- Tutorial for Dataproc in Cloud Shell [English / Spanish]
- PySpark Jupyter Notebook
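The compras_*.py scripts listed above each solve one small exercise over a purchases dataset. As a rough sketch of the style (the record layout below is assumed; the real parsing helpers live in helpers.py), the aggregation in compras_importe_total_agrupado_por_tx_id.py could look like:

```python
from pyspark import SparkContext

sc = SparkContext(appName="compras_demo")

# Assumed record layout: (tx_id, country, amount)
compras = sc.parallelize([
    ("tx1", "ES", 30.0),
    ("tx2", "FR", 10.5),
    ("tx1", "ES", 12.0),
])

# Transformations are lazy: build (tx_id, amount) pairs, then sum per key
totales = compras.map(lambda c: (c[0], c[2])) \
                 .reduceByKey(lambda a, b: a + b)

# collect() is an action: it triggers the actual computation
print(totales.collect())  # tx1 -> 42.0, tx2 -> 10.5
```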
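The MLlib samples revolve around ALS recommendations. Here is a self-contained sketch with toy data (column names and parameters are assumptions, not the actual peliculas_*.py code):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als_demo").getOrCreate()

# Toy ratings: (user, movie, rating)
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 0, 5.0)],
    ["user", "movie", "rating"])

als = ALS(userCol="user", itemCol="movie", ratingCol="rating",
          rank=5, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Top 2 movie recommendations per user
model.recommendForAllUsers(2).show(truncate=False)
```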
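For the GraphFrames exercises, the gist of a shortest-paths computation like ship_routes.py is shown below (the port ids and names are invented, and the graphframes package must be added, e.g. via --packages):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("routes_demo").getOrCreate()

# Toy ports (vertices need an "id" column) and routes ("src"/"dst" columns)
ports = spark.createDataFrame(
    [("VLC", "Valencia"), ("BCN", "Barcelona"), ("GEN", "Genova")],
    ["id", "name"])
routes = spark.createDataFrame(
    [("VLC", "BCN"), ("BCN", "GEN")],
    ["src", "dst"])

g = GraphFrame(ports, routes)

# Hop count from every port to each landmark
g.shortestPaths(landmarks=["GEN"]).show()
```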
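Finally, the hello_*.py files are small Airflow DAG definitions. The basic pattern looks roughly like this (a sketch, not the repo's actual code; in Airflow 1.x the operator import path is airflow.operators.python_operator instead):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def saludar():
    print("hello from Airflow")

# A single-task DAG scheduled to run daily
with DAG(
    dag_id="hello_simple",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="saludar", python_callable=saludar)
```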
Final course assignments can be found in this document. They are in Spanish; they will be translated to English at some point.
I'm not publishing the solutions, to avoid remaking the exercises every year. There's a test suite using py.test to help you validate the results. If you're really interested in them, please write me at [email protected].
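A validation test looks roughly like the sketch below; the helper name and expected values here are hypothetical, the real assertions live in the course test suite:

```python
# test_compras.py (hypothetical names), run with: py.test test_compras.py
from compras_importe_total_agrupado_por_tx_id import importe_total_por_tx  # assumed helper

def test_importe_total_agrupado_por_tx_id():
    compras = [("tx1", "ES", 30.0), ("tx1", "ES", 12.0), ("tx2", "FR", 10.5)]
    assert importe_total_por_tx(compras) == {"tx1": 42.0, "tx2": 10.5}
```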
Self-sufficiency is the state of not requiring any aid, support, or interaction, for survival; it is therefore a type of personal or collective autonomy - Wikipedia.
We follow a self-sufficiency principle for students to drive the course goals. At the end of the course, students should have enough knowledge and tools to develop small data processing solutions on their own.
- Student understands the underlying concepts behind Spark, and is able to write data processing scripts using PySpark, Spark SQL and MLlib.
- Student is capable of identifying common data processing libraries and frameworks and their applications.
- Student is capable of working in a team to design a system that covers a simple data processing scenario, understanding the basic implications of the choices made on systems, languages, libraries and platforms.
We recommend the following papers to expand knowledge on Spark and other data processing techniques:
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
- Spark SQL: Relational Data Processing in Spark
- MLlib: Machine Learning in Apache Spark
- GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
- Tachyon: Memory Throughput I/O for Cluster Computing Frameworks
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- The world beyond batch: Streaming 101 & 102
- Apache Flink™: Stream and Batch Processing in a Single Engine
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- Pig Latin: A Not-So-Foreign Language for Data Processing
- Interpreting the Data: Parallel Analysis with Sawzall
- Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
- Above the Clouds: A Berkeley View of Cloud Computing
- Cloud Programming Simplified: A Berkeley View on Serverless Computing (item 8.2 on MapReduce in particular also applies to Spark)
Some ideas we might add in forthcoming course editions:
- Code samples in python notebooks
- Apache Flink and Apache Beam (2017)
- Add Tachyon content and exercises
- Add Kafka source to the streaming sample
- Introduce samples with Minio / InfiniSpan (2018)
- Improve deployment scenarios and tools: Mesos, Chef, etc. (2017)
- Monitoring using Prometheus and Grafana, provide ready-to-use docker containers
- Profiling of Spark applications (Scala only)
- Translate all content to English and Spanish
- Cloud Dataproc (2019)
- Apache Airflow (2019)
- Tensorflow training and model execution at scale
Advanced Data Processing course materials. Copyright (C) 2016, Luis Belloch
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Luis Belloch, course materials for Advanced Data Processing, Spring 2016. Master on Big Data Analytics (http://bigdata.inf.upv.es), Universitat Politècnica de València. Downloaded on [DD Month YYYY].