A 14-minute video of a live demo can be found here
This repo shows how to integrate run:ai with Tensorboard, using a Tensorflow ResNet model as the example.
It consists of 4 basic steps:
- create a persistent directory on the NFS
  - called 'tensorboard_logs' in our example
  - holds the records written by Tensorboard callbacks during Tensorflow training
- create a first docker image with Tensorboard and jupyter-server-proxy installed
  - jupyter-server-proxy is used to access the Tensorboard UI
- create a second docker image with Tensorflow 2.9 and cudatoolkit 11.7 installed
- run python scripts by submitting jobs to the scheduler using the second docker image, then use the first image to launch the Tensorboard UI and view progress
The first docker image is public and can be found here:
jonathancosme/tensorboard-ui
The source for the first docker image can be found here: /tensorboard-ui.
The second docker image is public and can be found here:
jonathancosme/keras-nb
The source for the second docker image can be found here: /keras-nb.
Example notebook and python scripts can be found here: /tensorboard_resnet_demo.
One thing is needed for Tensorboard:
- A logs folder to store objects related to the runs
We can also choose to specify the host IP and port.
tensorboard \
    --logdir=/abs/path/to/logs_folder \
    --host=0.0.0.0 \
    --port=6006
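The same flags can also be assembled from Python, for example when starting the server from a notebook or a helper script. The following is a minimal sketch: the helper names `build_tensorboard_cmd` and `launch_tensorboard` are hypothetical, and only the `--logdir`, `--host`, and `--port` flags shown above are used.

```python
import subprocess


def build_tensorboard_cmd(logdir, host="0.0.0.0", port=6006):
    """Return the tensorboard command as an argument list (hypothetical helper)."""
    return [
        "tensorboard",
        f"--logdir={logdir}",
        f"--host={host}",
        f"--port={port}",
    ]


def launch_tensorboard(logdir, host="0.0.0.0", port=6006):
    """Start the server in the background; requires tensorboard on PATH."""
    return subprocess.Popen(build_tensorboard_cmd(logdir, host, port))


if __name__ == "__main__":
    # Only print the command here; call launch_tensorboard() to actually start it.
    print(" ".join(build_tensorboard_cmd("/abs/path/to/logs_folder")))
```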
To write records to the Tensorboard logs folder, create a Tensorboard callback and pass it to model.fit().
Note:
It is not necessary to start the Tensorboard server in order to write records.
Starting the server is only needed to view the UI.
import tensorflow as tf

tb_dir = "/abs/path/to/logs_folder"

"""
code to build and compile your model
"""

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=tb_dir)

history = model.fit(train_ds,
                    epochs=5,
                    callbacks=[tensorboard_callback])
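When several training jobs write to the same logs folder, a common convention is to give each run its own subdirectory, which Tensorboard then lists as a separate run. The sketch below assumes that convention; the helper name `make_run_log_dir` and the timestamp format are illustrative choices and are not taken from the example scripts.

```python
import os
from datetime import datetime


def make_run_log_dir(base_dir, run_name="run", now=None):
    """Create and return a timestamped subdirectory of base_dir (hypothetical helper)."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    run_dir = os.path.join(base_dir, f"{run_name}-{stamp}")
    os.makedirs(run_dir, exist_ok=True)
    return run_dir


# Usage with the callback shown above (assumes tensorflow is installed):
#   tb_dir = make_run_log_dir("/abs/path/to/logs_folder", "resnet50")
#   tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=tb_dir)
```

Pointing `--logdir` at the parent folder then shows every run side by side in the UI.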
- A persistent directory to keep
  - Tensorboard logs folder
- A first docker image with the following installed
  - Tensorboard
  - jupyterlab*
  - jupyter-server-proxy*
- A second docker image with the following installed
  - Tensorflow**
  - Keras**
*needed in order to access the Tensorboard UI
**needed in order to train Tensorflow models (ResNet in our example)
The first docker image we will use is:
jonathancosme/tensorboard-ui
This is what is in the dockerfile:
In order to access the Tensorboard UI, we need to add this entry to the jupyter_server_config.py file and replace the existing file in the image.
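The exact entry used in the image is not shown here. As a rough sketch of what such an entry can look like, jupyter-server-proxy reads a `c.ServerProxy.servers` dictionary from `jupyter_server_config.py`. The helper name, log directory path, timeout, and launcher title below are assumptions for illustration and may differ from what jonathancosme/tensorboard-ui actually contains.

```python
def tensorboard_proxy_entry(logdir):
    """Build a jupyter-server-proxy server entry for Tensorboard (illustrative values)."""
    return {
        "tensorboard": {
            # jupyter-server-proxy replaces {port} with a free port it selects
            "command": ["tensorboard", "--logdir", logdir, "--port", "{port}"],
            "timeout": 60,  # seconds to wait for the server to come up (assumed value)
            "launcher_entry": {"title": "Tensorboard"},  # launcher tile label (assumed)
        }
    }


# Inside jupyter_server_config.py, where `c = get_config()` is available:
#   c.ServerProxy.servers = tensorboard_proxy_entry("/home/jovyan/work/tensorboard_logs")
```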
The second docker image we will use is:
jonathancosme/keras-nb
This is what is in the dockerfile:
Create a jupyter interactive job with:
- image jonathancosme/tensorboard-ui
- mounted NFS folder (with 'tensorboard_logs' folder) in default jupyter work directory
A new tab should appear with the Tensorboard UI
Select ‘Reload data’ under settings.
Note: The first time you access the UI, there will be no data available
Our example scripts are located here:
So our CLI command would look like this:
runai submit \
    --project testproj \
    --gpu 1 \
    --job-name-prefix tb-renset-demo \
    --image jonathancosme/keras-nb \
    --volume /home/jonathan_cosme/jcosme:/home/jovyan/work \
    -- python work/projects/tensorboard_resnet_demo/train_resnet50.py
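Because `--volume` maps a host directory onto a container directory, the training script has to write its Tensorboard records under the container-side path for them to persist in the mounted folder. The sketch below illustrates that mapping with a hypothetical helper, `host_to_container_path`; placing `tensorboard_logs` directly under the mounted folder is an assumption.

```python
from pathlib import PurePosixPath


def host_to_container_path(host_path, volume_spec):
    """Translate a host path to its container path for a 'host_dir:container_dir' volume."""
    parts = volume_spec.split(":")
    host_dir, container_dir = parts[0], parts[1]
    # relative_to raises ValueError if host_path is not inside the mounted directory
    relative = PurePosixPath(host_path).relative_to(host_dir)
    return str(PurePosixPath(container_dir) / relative)


volume = "/home/jonathan_cosme/jcosme:/home/jovyan/work"
print(host_to_container_path("/home/jonathan_cosme/jcosme/tensorboard_logs", volume))
# prints /home/jovyan/work/tensorboard_logs
```

The printed path is the value the training script would pass as `log_dir` to the Tensorboard callback under these assumptions.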