This document assumes you already have Python installed and that the `SPARK_HOME` environment variable points to your Spark installation.
To run PySpark, we first need to add it to `PYTHONPATH`:

```bash
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH"
```
Make sure that the py4j version in the filename under `${SPARK_HOME}/python/lib/` matches the version in your `PYTHONPATH`, or you will encounter `ModuleNotFoundError: No module named 'py4j'` when executing `import pyspark`.
For example, if the file under `${SPARK_HOME}/python/lib/` is `py4j-0.10.9.3-src.zip`, then the second `export PYTHONPATH` statement above should be changed to:

```bash
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
```
Now you can run Jupyter or IPython to check that everything works. Go to some other directory, e.g. `~/tmp`.
Download a CSV file that we'll use for testing:

```bash
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```
Now let's run `ipython` (or `jupyter notebook`) and execute:
```python
import pyspark
from pyspark.sql import SparkSession

# Create a local SparkSession using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

# Read the CSV file, treating the first row as the header
df = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

df.show()
```
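If the read worked, `df.show()` prints the first rows of the lookup table. As an optional sanity check (not part of the original steps), you can also confirm that the header row was picked up as column names:

```python
# With header=true the column names come from the first row instead of the default _c0, _c1, ...
df.printSchema()
```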
Test that writing works as well:
```python
df.write.parquet('zones')
```
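This creates a `zones` directory containing the data in Parquet format. As one more check (my own addition, not in the original test), you can read the result back:

```python
# Re-read the Parquet output to confirm the round trip worked
spark.read.parquet('zones').show()
```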