This repo holds some examples to help you start familiarizing yourself with Spark.
The idea is to quickly create a Spark cluster on your machine and then run some jobs against it. In these examples we're going to use Spark's Python API, PySpark.
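If you haven't used PySpark before, a minimal job looks roughly like the sketch below. This is illustrative only and is not the actual contents of `apps/intro.py` or `apps/python_app.py`:

```python
# Illustrative PySpark example: count words in a tiny in-memory dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-example").getOrCreate()

words = spark.createDataFrame(
    [("spark",), ("pyspark",), ("spark",)],
    ["word"],
)
words.groupBy("word").count().show()

spark.stop()
```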
The repository is organized as follows:

```text
.
├── .dockerignore
├── .gitignore
├── .markdownlint.json
├── .pre-commit-config.yaml
├── .python-version
├── .vscode
│   ├── extensions.json
│   └── settings.json
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── apps
│   ├── intro.py
│   └── python_app.py
├── conf
│   ├── spark-env.sh
│   └── workers
├── mypy.ini
├── noxfile.py
├── poetry.lock
└── pyproject.toml

4 directories, 19 files
```
You'll need the following tools installed on your machine:
Also, it's recommended to have pyenv installed and working.
Please keep in mind that even a small Spark cluster like this one requires at least 2 GB of RAM and 1 CPU core.
The first thing you'll need to do is build the required Docker image:
```sh
make build
```
The built image is tagged as `spark-local`.
You can start the `spark-local` container with:
```sh
make run
```
In the `conf/spark-env.sh` file you'll find some settings to configure the Spark cluster.
This example uses the following settings:
```sh
# conf/spark-env.sh
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=512M
SPARK_DRIVER_MEMORY=512M
SPARK_MASTER_HOST=localhost
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=4040
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1G
SPARK_WORKER_INSTANCES=3
SPARK_WORKER_PORT=9000
SPARK_WORKER_WEBUI_PORT=4041
SPARK_DAEMON_MEMORY=512M
```
All these settings are fairly self-explanatory; for example, you can change the number of worker nodes by modifying the `SPARK_WORKER_INSTANCES` variable.
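These values also tell you where a PySpark application should point when it connects to the cluster. As a rough, illustrative sketch (not the actual contents of the apps in this repo), a job would target the master at `spark://localhost:7077`, matching `SPARK_MASTER_HOST` and `SPARK_MASTER_PORT` above:

```python
# Illustrative only: a PySpark job connecting to the local standalone master
# defined by SPARK_MASTER_HOST/SPARK_MASTER_PORT in conf/spark-env.sh.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("connect-to-local-cluster")
    .master("spark://localhost:7077")  # matches the settings above
    .getOrCreate()
)

# A trivial job just to confirm the executors are reachable.
print(spark.range(1000).count())

spark.stop()
```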
You can start the Spark cluster with:
```sh
make spark-start
```
Wait a few seconds, then go to `localhost:4040` in your browser; you'll see a UI like this: