This repo holds some examples to help you start familiarizing yourself with Spark.
The idea is to quickly create a Spark cluster on your machine and then run some jobs against it. In these examples we're going to use Spark's Python API, PySpark.
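If you haven't used PySpark before, a minimal job looks roughly like the sketch below. This is illustrative only and is not the actual contents of `apps/intro.py` or `apps/python_app.py`:

```python
# Illustrative PySpark example: count words in a tiny in-memory dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-example").getOrCreate()

words = spark.createDataFrame(
    [("spark",), ("pyspark",), ("spark",)],
    ["word"],
)
words.groupBy("word").count().show()

spark.stop()
```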
The repository is organized as follows:

```text
.
├── .dockerignore
├── .gitignore
├── .markdownlint.json
├── .pre-commit-config.yaml
├── .python-version
├── .vscode
│   ├── extensions.json
│   └── settings.json
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── apps
│   ├── intro.py
│   └── python_app.py
├── conf
│   ├── spark-env.sh
│   └── workers
├── mypy.ini
├── noxfile.py
├── poetry.lock
└── pyproject.toml

4 directories, 19 files
```
You'll need the following tools installed on your machine:
Also, it's recommended to have pyenv installed and working.
Please keep in mind that even a small Spark cluster like this one requires at least 2 GB of RAM and 1 CPU core.
The first thing you'll need to do is build the required Docker image:
```sh
make build
```
The built image is tagged as `spark-local`.
You can start the `spark-local` container with:
```sh
make run
```
In the `conf/spark-env.sh` file you'll find some settings to configure the Spark cluster.
This example uses the following settings:
```sh
# conf/spark-env.sh
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=512M
SPARK_DRIVER_MEMORY=512M
SPARK_MASTER_HOST=localhost
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=4040
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1G
SPARK_WORKER_INSTANCES=3
SPARK_WORKER_PORT=9000
SPARK_WORKER_WEBUI_PORT=4041
SPARK_DAEMON_MEMORY=512M
```
All these settings are fairly self-explanatory; for example, you can change the number of worker nodes by modifying the `SPARK_WORKER_INSTANCES` variable.
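These values also tell you where a PySpark application should point when it connects to the cluster. As a rough, illustrative sketch (not the actual contents of the apps in this repo), a job would target the master at `spark://localhost:7077`, matching `SPARK_MASTER_HOST` and `SPARK_MASTER_PORT` above:

```python
# Illustrative only: a PySpark job connecting to the local standalone master
# defined by SPARK_MASTER_HOST/SPARK_MASTER_PORT in conf/spark-env.sh.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("connect-to-local-cluster")
    .master("spark://localhost:7077")  # matches the settings above
    .getOrCreate()
)

# A trivial job just to confirm the executors are reachable.
print(spark.range(1000).count())

spark.stop()
```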
You can start the Spark cluster with:
```sh
make spark-start
```
Wait a few seconds, then go to `localhost:4040` in your browser; you'll see a UI like this: