# Installation Guide
The PySpark Plaso is
- utilizing extractors adapted from the Plaso Project
- running in Docker containers from the TARZAN Docker Infrastructure Project
The Plaso extractors were adapted to the Apache Spark infrastructure, and the extraction process is controlled by a Web service via a REST API. The following instructions describe how to build all required components, build the software artefacts for the Web service, and run the service in the Spark infrastructure in Docker containers.
```sh
cd ./deployment
# create a Python virtual environment including the required Python packages
./010-make-python-virtualenv.sh
# pack the Python packages into a ZIP file ready to use in PySpark
./020-make-site-packages-zip.sh
# create JAR packages for Java dependencies
./030-make-java-helpers.sh
```
The build process consists of several steps:
- to create a Python virtual environment including the required Python packages: all the packages listed in `misc/dependencies.txt` will be installed by the script
- to pack the Python packages into a ZIP file ready to use in PySpark: binary platform-dependent builds (`*.so` files) are not included in the ZIP file, so it can be distributed to nodes of various platforms; to use the binary parts of the Python packages, install them by a package manager on each of the target nodes (e.g., by the `INSTALL_PKGS` environment variable of the Spark Docker image); see the usage sketch after this list
- to create JAR packages for the Java dependencies, which requires:
  - Java JDK version 8 or above
  - Gradle Build Tool

The Java dependencies are implemented by PySpark Plaso Java Helpers, which is included as a sub-module in the PySpark Plaso repository.
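As a side note to the packing step, the resulting archive is the kind of ZIP that PySpark can distribute to its executors. The sketch below is not taken from the deployment scripts (which handle the distribution themselves) and the job script name `your_job.py` is purely hypothetical; it only illustrates the standard `--py-files` mechanism such an archive targets:

```sh
# Illustration only: ship a ZIP of pure-Python packages to the PySpark
# executors via the standard --py-files option; "your_job.py" is hypothetical.
spark-submit --py-files ./deployment/build/site-packages.zip your_job.py

# Binary (*.so) parts are excluded from the ZIP; install them on every node,
# e.g. through the INSTALL_PKGS environment variable of the Spark Docker image.
```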
After the build process, there should be the following files in `deployment/build`:

- `site-packages.zip` -- the Python-based components
- `pyspark-java-helpers.jar` and `timeline-analyzer-spark.jar` -- the Java-based components
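A quick check of the build output can be done by listing that directory; the expected listing below is only an assumption derived from the artefact names above:

```sh
ls ./deployment/build
# expected to contain (at least):
#   pyspark-java-helpers.jar  site-packages.zip  timeline-analyzer-spark.jar
```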
```sh
cd ./deployment
# run the PySpark Plaso infrastructure as Docker containers by docker-compose
./040-run-docker-webapp.sh
# reinitialise the PySpark Plaso Docker container with new build artefacts
./050-redeploy-docker-webapp.sh
```
To deploy PySpark Plaso, the script creates and runs an infrastructure of Docker containers as defined in `deployment/docker-compose/webapp.yml`. The PySpark Plaso application is deployed into the Docker Spark Application container by copying the build artefacts into its `/app/lib` directory and the initialisation Python script `deployment/src/webapp_main.py` into the container's `/app` directory. See the docker-compose file for details.

If the build artefacts are modified (re-built), the Docker Spark Application container can be reinitialised with the new artefacts by the corresponding script (`050-redeploy-docker-webapp.sh`).
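For orientation only, the two scripts roughly correspond to standard docker-compose and docker commands of the following kind. This is a simplified sketch: the container name `spark-app` is an assumption (the real names are defined in `docker-compose/webapp.yml`), so use the provided scripts rather than these commands:

```sh
# Rough sketch, run from ./deployment; "spark-app" is an assumed container name.

# bring the Docker infrastructure up
docker-compose -f docker-compose/webapp.yml up -d

# copy the (re-)built artefacts and the initialisation script into the
# Spark Application container
docker cp build/site-packages.zip           spark-app:/app/lib/
docker cp build/pyspark-java-helpers.jar    spark-app:/app/lib/
docker cp build/timeline-analyzer-spark.jar spark-app:/app/lib/
docker cp src/webapp_main.py                spark-app:/app/
```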