PawMark is a big data and AI platform built on Apache Spark and Kubernetes. It is designed to be scalable and easy to use, and it provides tools for data processing, machine learning, and data visualization.
- Start docker-compose
  `docker-compose up -d`
- Access platform UI
- Use notebook
  - Access http://localhost:8888
  - A Spark session is created automatically
  - Run `spark` in a cell to check the Spark session
  - Run the following code in the notebook to test the Spark session:
    `spark.range(0, 5).write.format("delta").mode("overwrite").saveAsTable("test")`
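To confirm that the write succeeded, the table can be read back in the same notebook. The following is a minimal sketch, assuming it runs in a notebook where `spark` is already defined; the helper name `check_test_table` is hypothetical and not part of PawMark.

```python
def check_test_table(spark, table_name="test"):
    """Read a saved table back and return its `id` values in sorted order.

    `spark` is the session the notebook creates automatically;
    `spark.range` names its single column `id`.
    """
    df = spark.table(table_name)  # look up the table saved with saveAsTable
    return sorted(row["id"] for row in df.collect())
```

Calling `check_test_table(spark)` after the write above should return the five ids 0 through 4.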
- Check the history server
  - Access http://localhost:18080
  - Spark application history / progress can be viewed here
- Delta tables
  - Use `/opt/data/delta-table/` as the root directory for Delta tables
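One way to keep path-based Delta tables under that root is a small helper that builds each table's path. This is only a sketch: `DELTA_ROOT`, `delta_path`, `write_delta`, and the table name `events` are hypothetical names introduced for illustration, not part of PawMark.

```python
import posixpath

# Root directory for Delta tables, as stated in this README.
DELTA_ROOT = "/opt/data/delta-table/"


def delta_path(name, root=DELTA_ROOT):
    """Build the storage path of a Delta table under the root directory."""
    if not name or "/" in name or name in (".", ".."):
        raise ValueError(f"invalid table name: {name!r}")
    return posixpath.join(root, name)


def write_delta(df, name, mode="overwrite"):
    """Write a Spark DataFrame as a path-based Delta table and return its path."""
    path = delta_path(name)
    df.write.format("delta").mode(mode).save(path)
    return path
```

In the notebook, `write_delta(spark.range(0, 5), "events")` would write to `/opt/data/delta-table/events`, and `spark.read.format("delta").load(delta_path("events"))` would read it back.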
- Schedule with Airflow
  - Access http://localhost:8090
  - Log in with the default username and password
  - Create a new DAG to schedule the Spark job
  - Or use the example DAGs in the `./dags` folder
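Once the stack is running, the three URLs mentioned in the steps above can be probed from the host to see which services are reachable. The sketch below uses only the Python standard library; the service labels and the function names `is_up` and `check_services` are hypothetical, and the check only tells whether something answers HTTP on each port, not whether the service is healthy.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the steps in this README; the labels are only descriptive.
SERVICES = {
    "notebook": "http://localhost:8888",
    "spark-history": "http://localhost:18080",
    "airflow": "http://localhost:8090",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is listening.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: not reachable.
        return False


def check_services(services=SERVICES, probe=is_up):
    """Probe every service and return a {label: reachable} mapping."""
    return {name: probe(url) for name, url in services.items()}
```

Calling `check_services()` shortly after `docker-compose up -d` returns a dictionary with one boolean per service; containers that are still starting will show up as `False` until they begin accepting connections.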
- TODO
- Singapore Resale Flat Prices Analysis
- TODO
Notebook
- Dockerfile
- Includes
  - Jupyter Notebook
  - Spark
  - Google Cloud SDK
  - GCS Connector
  - PySpark Startup Script
  - Notebook Save Hook Function
Details
Component | Version |
---|---|
Scala | 2.12 |
Java | 17 |
Python | 3.11 |
IPython | 8.16.1 |
Apache Spark | 3.5.0 |
Delta Lake | 3.0.0 |
Airflow | 2.9.1 |
Postgres | 13 |
React | 18.3.1 |
This project is licensed under the terms of the Apache-2.0 license.