This project, called DLProv, is part of my DSc research in the Program of Systems and Computer Science (PESC), COPPE, at the Federal University of Rio de Janeiro (UFRJ).
DLProv is a service that evolved from DNNProv, expanding its scope and capabilities to accommodate the broader domain of Deep Learning (DL).
DNNProv began as a provenance service designed to support online hyperparameter analysis in DL, integrating retrospective provenance data (r-prov) with typical DNN software data such as hyperparameters and DNN architecture attributes.
A DL life cycle involves several data transformations, such as pre-processing the data, defining the datasets used to train and test a deep neural network (DNN), and training and evaluating the DL model. Choosing a final model requires model selection, which involves analyzing data from several training configurations (e.g., hyperparameters and DNN architectures). We observed that tracing training data back to pre-processing operations can provide insights during model selection. However, integrating the provenance of these different steps is challenging, so we decided to integrate them. DLProv is a prototype for provenance data integration that uses different capture solutions while maintaining DNNProv's capabilities.
DLProv is developed on top of DfAnalyzer provenance services. It uses the columnar DBMS MonetDB to support online provenance data analysis and to generate W3C PROV-compliant documents. In addition, these provenance documents can be analyzed through graph DBMS such as Neo4j.
This repository provides a Docker container for DLProv; see the section Running an Example in a Docker Environment.
The following software has to be configured/installed to run a DL model training that collects provenance with DLProv.
- Java
- MonetDB and pymonetdb
- neo4j and neo4j python
- prov, pydot, and provdbconnector
- DfAnalyzer
- dfa-lib-python
Due to Git LFS (Large File Storage) restrictions, this repository includes a file that cannot be tracked by Git. Please follow these steps to download and add the file manually:
- Download the file from this Google Drive link.
- After downloading the file, move it to the dlprov/DfAnalyzer/target folder in the repository.
- Once the file is in the correct folder, you can continue with the setup or use the repository as intended.
Note: If you are using the provided Docker container, this step is not required, as the necessary files will be automatically handled within the container.
DLProv has a few predefined hyperparameters (e.g., optimizer, learning rate, number of epochs, number of layers) and metrics (e.g., loss, accuracy, elapsed time) to be captured. If these hyperparameters and metrics are enough, the user only has to set the attribute predefined to True and the library takes care of the rest. It is important to set a tag that identifies the workflow (e.g., mnist) and associates it with the provenance data. This method captures provenance data as the deep learning workflow executes and sends it to the provenance database managed by MonetDB. As the data reaches the database, it can be analyzed through the Dataflow Viewer (DfViewer), Query Interface (QI), and Query Dashboard (QD). The data received by the provenance method are defined by the user in the source code of the DNN application, as follows:
from dfa_lib_python.dataflow import Dataflow  # import path assumes dfa-lib-python's module layout

dataflow_tag = "mnist"  # tag identifying this workflow
df = Dataflow(dataflow_tag, predefined=True)
df.save()
To capture the retrospective provenance, the user should add the following code:
from dfa_lib_python.task import Task  # import paths assume dfa-lib-python's module layout
from dfa_lib_python.dataset import DataSet
from dfa_lib_python.element import Element
from datetime import datetime

t1 = Task(1, dataflow_tag, "TrainModel")  # assumed: a task created once for the TrainModel transformation
tf1_input = DataSet("iTrainModel", [Element([opt.get_config()['name'], opt.get_config()['learning_rate'], epochs, len(model.layers)])])
t1.add_dataset(tf1_input)
t1.begin()

## Data manipulation (the training step itself happens here)

tf1_output = DataSet("oTrainModel", [Element([datetime.now().strftime('%Y-%m-%d %H:%M:%S'), elapsed_time, loss, accuracy, val_loss, val_accuracy, epoch])])
t1.add_dataset(tf1_output)
if epoch == final_epoch:
    t1.end()
else:
    t1.save()
Here, t1.save() sends the current epoch's data while keeping the task open, and t1.end() closes the task at the final epoch.
If the hyperparameters are adapted during training (e.g., the learning rate is updated through a method such as LearningRateScheduler offered by Keras), their values change, and the adaptation should be registered for further analysis. To capture these data, the user should add code for this specific transformation, as sketched below.
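The following is a minimal sketch of what such code could look like, assuming a hypothetical Adaptation transformation (with input set iAdaptation and output set oAdaptation) has been declared for the dataflow; these tags and the helper function are illustrative, not part of DLProv's predefined set:

from dfa_lib_python.task import Task
from dfa_lib_python.dataset import DataSet
from dfa_lib_python.element import Element

def register_lr_adaptation(dataflow_tag, task_id, epoch, old_lr, new_lr):
    # Hypothetical tags: "Adaptation", "iAdaptation", and "oAdaptation" must have
    # been declared for this dataflow; they are not predefined by DLProv.
    t_adapt = Task(task_id, dataflow_tag, "Adaptation")
    t_adapt.add_dataset(DataSet("iAdaptation", [Element([epoch, old_lr])]))
    t_adapt.begin()
    t_adapt.add_dataset(DataSet("oAdaptation", [Element([epoch, new_lr])]))
    t_adapt.end()

A LearningRateScheduler callback could call this helper whenever the rate it returns differs from the current one.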
We provide a pre-built Docker container image that includes all necessary dependencies and data from this repository, ensuring a consistent and reproducible environment for running the example.
- Pull the Docker Image
To get started, pull the pre-built Docker image from the container registry:
docker pull dbpina/dlprov
- Run the Container
Once the image is downloaded, run the container with:
docker run -p 7474:7474 -p 7687:7687 -p 22000:22000 -d \
-e NEO4J_dbms_default__listen__address=0.0.0.0 \
-e NEO4J_dbms_connector_http_listen__address=0.0.0.0 \
--name dlprov-container dbpina/dlprov
Then, open a shell inside the container:
docker exec -it dlprov-container /bin/bash
- Run the example
Once you are in the container shell, the first step is to initialize the MonetDB database. This initialization is required only once, before the first experiment. (The experiments script will only stop and start MonetDB as needed. Note that restore-database.sh deletes all existing data, so use it with caution.)
To start the database, run the following commands:
cd /opt/dlprov/DfAnalyzer
./restore-database.sh
After that, navigate to the folder /opt/dlprov/, where you will find a script named run_experiment.sh. This script:
- Starts the database and the server.
- Runs an experiment that trains a DL model on the MNIST dataset (with only a few epochs; you can adjust the epoch count as needed).
- Generates the provenance document.
- Inserts the provenance data into Neo4j for analysis.
To execute the script, use:
cd /opt/dlprov/
./run_experiment.sh
- Submit a query
To submit queries to MonetDB, connect to the database using the following command:
mclient -u monetdb -d dataflow_analyzer
The default password is monetdb. Once connected, you can submit queries such as:
SELECT * FROM dataflow;
SELECT * FROM dataflow_execution; (This will show the execution identifier.)
To analyze data related to the training process, switch to the schema with:
SET SCHEMA "mnist";
Then, to view available tables, use:
\d
For specific data, you can submit queries like:
- SELECT * FROM itrainmodel; to see the hyperparameters.
- SELECT * FROM otrainmodel; to view training metrics.
- SELECT * FROM otestmodel; to see test metrics.
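These queries can also be submitted programmatically. Below is a minimal sketch using pymonetdb (already among the requirements), assuming the default credentials and database name described above:

import pymonetdb

# Connect with the default credentials (monetdb/monetdb) to the dataflow_analyzer database.
conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="dataflow_analyzer")
cur = conn.cursor()
cur.execute('SET SCHEMA "mnist"')
cur.execute("SELECT * FROM otrainmodel")  # training metrics
for row in cur.fetchall():
    print(row)
conn.close()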
To interact with Neo4j, open the following address in your browser:
http://localhost:7474
Note: This is why the docker run command includes the -p (publish) flag to make ports available externally.
You may need to enter your credentials to access Neo4j. The default configuration is set with the following:
- Username: neo4j
- Password: neo4jneo4j
In Neo4j, you can submit queries such as:
MATCH (n) RETURN n LIMIT 25;
This query will display the complete graph of an execution, allowing you to analyze the relationships and data flow visually.
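You can also run the same query from Python with the neo4j driver (listed among the requirements); a minimal sketch, assuming the default credentials and Bolt port above:

from neo4j import GraphDatabase

# Default credentials as configured above (neo4j/neo4jneo4j), Bolt port 7687.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4jneo4j"))
with driver.session() as session:
    for record in session.run("MATCH (n) RETURN n LIMIT 25"):
        print(record["n"])
driver.close()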
- Generate provenance graph for several executions
If you would like to generate a W3C PROV document for multiple executions of the same DL model (for example, after running two training executions), you can do so by running the following script:
./run_df_experiment.sh
This script first restores the Neo4j database, as the current Neo4j version only supports one active database. After the restoration, it generates the provenance document and inserts it into Neo4j, allowing you to analyze the provenance data using the commands previously provided.
That's it - you are all set! Now, you can check the folder /opt/dlprov/generate-prov/output, where you will find the provenance document for your experiment, named something like mnist-<timestamp>. You can compare it with the example file, mnist-example, provided in the directory /opt/dlprov/output/. There are .json, .provn, and .png files for review and analysis.
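Since the prov Python package is among the dependencies, the PROV-JSON serialization can also be inspected programmatically. A minimal sketch, assuming the example file's .json name follows the pattern above:

from prov.model import ProvDocument

# Load the PROV-JSON serialization of the example document (adjust the path if needed).
doc = ProvDocument.deserialize("/opt/dlprov/output/mnist-example.json", format="json")
print(doc.get_provn())  # render the document in PROV-N notation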
To visualize the PNG file, follow these steps:
- Use the following command to copy the file from the Docker container to your host system:
docker cp dlprov-container:/opt/dlprov/output/<insert_file_name.png> </host/path/target>
Replace <insert_file_name.png> with the actual name of your PNG file. Replace </host/path/target> with the desired destination path on your host system where you want to save the file.
- After executing the command, navigate to the specified target directory on your host to view the PNG file.
- Average Training Loss Query
This query calculates the average loss for the training activity, providing insights into model performance over training iterations. Other metrics, such as elapsed time, can also be used in place of loss to analyze different aspects of the training process.
MATCH (b:Entity)-[:wasGeneratedBy]->(c:Activity)
RETURN avg(b.`dlprov:loss`)
For example, the same pattern applied to elapsed time:
MATCH (b:Entity)-[:wasGeneratedBy]->(c:Activity)
RETURN avg(toFloat(b.`dlprov:elapsed_time`)) AS avg_elapsed_time
- Shortest Path Queries
These queries find the shortest paths from the resulting test metrics to key components in the workflow:
- (i) to the data used for model input, tracking data lineage,
- (ii) to the activity responsible for generating these metrics, helping trace back to the source of the results.
MATCH p = shortestPath(
(a:Entity {`dlprov:ds_tag`: 'otestmodel'})-[*]-
(b:Entity {`dlprov:ds_tag`: 'oloaddata'})
)
RETURN p
MATCH p = shortestPath(
(a:Entity {`dlprov:ds_tag`: 'otestmodel'})-[:wasGeneratedBy]-
(b:Activity {`dlprov:dt_tag`: 'testmodel'})
)
RETURN p
- Complete Path Query
This query presents the full path from the resulting test metrics to the original input dataset, detailing each step in the data processing pipeline. Information about the dataset source and intermediate transformations is included to support data traceability.
MATCH p = (a:Entity {`dlprov:ds_tag`: 'otestmodel'})-[*]-(b:Entity {`dlprov:ds_tag`: 'iinputdataset'})
RETURN p
This project is a work in progress. If you encounter any issues, errors, or have suggestions for improvements, please feel free to contact us. We appreciate your feedback as we continue to refine and expand this project.