# pyspark-plaso

(c) 2018-2020 Marek Rychly ([email protected]) and Radek Burget ([email protected])

A tool for the distributed extraction of timestamps from various files, using extractors adapted from the Plaso engine to run on Apache Spark.

## Usage

PySpark Plaso runs in a Docker container and is accessible as a web service via a REST API.

See the project Wiki Pages for details.
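As a minimal sketch of how such a REST service could be queried from the command line (the host, port, and endpoint paths below are illustrative assumptions only, not the documented API; consult the Wiki Pages for the actual endpoints):

```bash
# Hypothetical example: upload a file for timestamp extraction and fetch the results.
# The port 8080 and the /upload and /extract paths are assumptions for illustration only.
curl -X POST --data-binary @sample.docx http://localhost:8080/upload/sample.docx
curl http://localhost:8080/extract
```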

## Deployment

### Use a Prebuilt Docker Image

A prebuilt Docker image is available; see the `webapp-prebuilt.yml` docker-compose file.
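A minimal sketch of starting the prebuilt image with docker-compose (assuming the command is run from the directory containing `webapp-prebuilt.yml`; adjust the path otherwise):

```bash
# Start the prebuilt PySpark Plaso web application in the background
docker-compose -f webapp-prebuilt.yml up -d
# Follow the logs, and stop the stack when it is no longer needed
docker-compose -f webapp-prebuilt.yml logs -f
docker-compose -f webapp-prebuilt.yml down
```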

### Build and Deploy

```bash
cd ./deployment
# create a Python virtual environment including the required Python packages
./010-make-python-virtualenv.sh
# pack the Python packages into a ZIP file ready to use in PySpark
./020-make-site-packages-zip.sh
# create JAR packages for Java dependencies
./030-make-java-helpers.sh
# run the PySpark Plaso infrastructure as Docker containers via docker-compose
./040-run-docker-webapp.sh
```

See the project Wiki Pages for details, and also the `webapp.yml` docker-compose file.
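Once the stack is running, standard docker-compose commands against `webapp.yml` can be used to inspect or stop it (a sketch, assuming the commands are run from the directory containing `webapp.yml`):

```bash
# List the containers of the PySpark Plaso stack
docker-compose -f webapp.yml ps
# Stop and remove the stack when it is no longer needed
docker-compose -f webapp.yml down
```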

### Kubernetes

For deployment on Kubernetes, see the Kubernetes resource files in the repository.
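A minimal sketch of applying such resource files with kubectl (the `kubernetes/` directory name is an assumption for illustration; use the actual location of the resource files in the repository):

```bash
# Apply all Kubernetes resource files from an assumed kubernetes/ directory
kubectl apply -f kubernetes/
# Check that the PySpark Plaso pods start up
kubectl get pods
```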

## Dependencies

PySpark Plaso is built on extractors adapted from the Plaso engine and on Apache Spark (PySpark) for distributed processing.

## Acknowledgements

This work was supported by the Ministry of the Interior of the Czech Republic as a part of the project "Integrated platform for analysis of digital data from security incidents" (VI20172020062).