We will practice Apache Spark with simple activities:
- write Spark programs
- submit Spark programs
- understand the relationship between developers and platform providers through the tools and support provided
Note: there are many tutorials about Apache Spark on the Internet that you can take a look at, e.g., from CSC. You should also check the Hadoop tutorial again, which is related to MapReduce/Spark.
It is important to learn how to set up and operate a Spark deployment (from the platform viewpoint), so here are three suggested ways:
- Use Google (or other providers), which provides a feature for creating a Spark deployment for you. In this tutorial we can use Dataproc.
- Use a small setup of Spark on your local machine. We recommend this because it is not easy (and it is expensive) to get access to a real, production Hadoop/Spark system. With your own system, you can combine Spark and Hadoop in one setup for learning both Hadoop and Spark.
- If you just want to do programming, you can use an existing Spark deployment, e.g., Databricks, and write and submit your programs there.
You can download Spark with Hadoop, or if you already have Hadoop, just install Spark. Follow the instructions here.
Note: you can set up one master and one worker node on the same machine. This is enough to practice some basic tasks with Spark.
Check if it works:
$sbin/start-master.sh
then
bin/pyspark
or
bin/pyspark --master spark://[SPARK_MASTER_HOST]:7077
Spark provides a web UI where you can see the nodes:
- http://MASTER_NODE:8080
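If the master is running, you should see its web UI at the address above. To give the shell some executors, you can also start a worker and attach it to the master (a minimal sketch; in older Spark versions the script is sbin/start-slave.sh instead of sbin/start-worker.sh):
$sbin/start-worker.sh spark://[SPARK_MASTER_HOST]:7077
The worker should then show up on the master's web UI.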
Log in to our Google Spark test environment:
ssh mysimbdp@
Now you are in a private environment and you can see that you have a Spark cluster whose master is yarn.
You can then test:
$pyspark
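Inside the pyspark shell, sc (the SparkContext) and spark (the SparkSession) are already defined, so a quick sanity check could look like this (a trivial example, not tied to any dataset):
sc.parallelize(range(100)).sum()   # should print 4950
spark.range(10).show()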
Assume that you have data in Hadoop or local file systems. You can write some programs to practice Spark programming. You should try to start with simple ones.
We have some very simple programs for processing NY Taxi data. These programs use the NY Taxi dataset, which is stored in our test Hadoop cluster.
Take a look at the PySpark cheatsheet for spark functions.
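The exact content of the example programs may differ, but a minimal trip-count program could look like the sketch below. This is a hypothetical version: the --input_file/--output_dir argument names follow the spark-submit examples further down, and the VendorID column and header=True option are assumptions about the NY Taxi CSV.
# simple_taxi_tripcount.py -- a minimal sketch, not necessarily the actual course version
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", required=True)
parser.add_argument("--output_dir", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName("simple_taxi_tripcount").getOrCreate()

# read the taxi CSV (header=True and inferSchema are assumptions about the file)
df = spark.read.csv(args.input_file, header=True, inferSchema=True)

# count trips per vendor (VendorID is an assumed column in the NY Taxi data)
counts = df.groupBy("VendorID").count()

# write a single CSV part file to the output directory
counts.coalesce(1).write.csv(args.output_dir, header=True)

spark.stop()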
- How would you manage input for Spark programs?
In our exercises, data for Spark programs is stored in the Hadoop file system (HDFS). Check HDFS to make sure that the files are available, e.g.:
$hdfs dfs -ls hdfs:///user/mybdp
hdfs dfs -ls hdfs:///user/mybdp/nytaxi2019.csv
Note: check the Hadoop tutorial again. When you run programs with your local file system, you can also use a path with file:///... to indicate input data.
To speed up the exercises, you can also create small datasets of taxi data (or other data).
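One simple way to create such a small dataset is to take only the first rows of the full file from the pyspark shell (a sketch; the output path nytaxi2019_small is just an example name you choose yourself, and header=True is an assumption about the file):
df = spark.read.csv("hdfs:///user/mybdp/nytaxi2019.csv", header=True)
df.limit(10000).write.csv("hdfs:///user/mybdp/nytaxi2019_small", header=True)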
Use spark-submit to submit the job. Note that in the Google test environment, the master/job scheduler is yarn or local[*]. Use yarn and avoid local[*]: if many people run programs on the same machine (with local), the server becomes overloaded. Example of counting the trips from the taxi file:
ssh [email protected]
spark-submit --master yarn --deploy-mode cluster simple_taxi_tripcount.py --input_file hdfs:///user/mybdp/nytaxi2019.csv --output_dir hdfs:///user/mybdp/taxiresult01
or
spark-submit --master local[*] simple_taxi_tripcount.py --input_file hdfs:///user/mybdp/nytaxi2019.csv --output_dir hdfs:///user/mybdp/taxiresult01
where hdfs:///user/mybdp/taxiresult01 is the directory in which the result will be stored. You can define your own output directory.
Then check if the result is OK by checking the output directory:
mybdp@cluster-bdp-m:~/code$ hdfs dfs -ls hdfs:///user/mybdp/taxiresult01
Found 2 items
-rw-r--r-- 2 mybdp hadoop 0 2019-10-25 19:25 hdfs:///user/mybdp/taxiresult01/_SUCCESS
-rw-r--r-- 2 mybdp hadoop 139 2019-10-25 19:25 hdfs:///user/mybdp/taxiresult01/part-00000-b6a9824f-1b23-463d-a8c4-651074b6f9a5-c000.csv
The _SUCCESS file tells us the job was successful, and you can check the content of the output:
$hdfs dfs -cat hdfs:///user/mybdp/taxiresult01/part-00000-b6a9824f-1b23-463d-a8c4-651074b6f9a5-c000.csv
Other simple examples are:
export PYSPARK_PYTHON=python3
mybdp@cluster-bdp-m:~/code$ spark-submit --master yarn broadcast-ex.py
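If you want to see what a broadcast example typically does, the sketch below shows the usual pattern of sharing a small, read-only lookup table with all executors. This is a generic illustration with made-up data, not necessarily the content of broadcast-ex.py.
# broadcast a small lookup table to all executors once, instead of shipping it with every task
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-example").getOrCreate()
sc = spark.sparkContext

# small lookup table kept on the driver (example data, not from the taxi dataset)
rate_codes = {1: "Standard", 2: "JFK", 3: "Newark"}
bc_rate_codes = sc.broadcast(rate_codes)

rdd = sc.parallelize([1, 2, 3, 1, 2])
# tasks read bc_rate_codes.value locally on each executor
print(rdd.map(lambda c: bc_rate_codes.value.get(c, "Unknown")).collect())

spark.stop()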
- How do we specify complex input data?
- How do we get back the result?
- How do I know when my program has finished?
See jobs:
yarn top
Kill jobs (find the application ID with yarn application -list or from yarn top):
yarn application -kill <APPLICATION_ID>
- Check /etc/spark/spark-defaults.conf
Check some important performance configuration settings and understand them, e.g. (see the snippet after this list for how to inspect their current values):
- spark.sql.shuffle.partitions
- spark.sql.files.maxPartitionBytes
- sc.defaultParallelism
- spark.dynamicAllocation.enabled
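Besides reading spark-defaults.conf, you can check the values your running session actually uses from the pyspark shell (a small sketch; spark and sc are the predefined SparkSession and SparkContext, and the fallback "not set" is used because spark.dynamicAllocation.enabled may not be defined):
print(spark.conf.get("spark.sql.shuffle.partitions"))            # default is 200
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.conf.get("spark.dynamicAllocation.enabled", "not set"))
print(sc.defaultParallelism)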
You can also use Spark with Jupyter, for example with the free version of Databricks or CSC Rahti. However, since you are learning the platform, we suggest you use the command line and also check the services of Spark, not just write Spark programs.
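If you go the notebook route, a minimal way to get a SparkSession inside a Jupyter notebook is sketched below, assuming pyspark is installed in the same Python environment as the notebook (managed services such as Databricks give you a preconfigured spark object instead):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")   # local mode; point this at your cluster master if you have one
         .appName("notebook-test")
         .getOrCreate())

spark.range(10).show()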