Allegro ClearML is a full-system, open source ML/DL experiment manager and MLOps solution. It is composed of a server, a Python SDK, and a web UI. Allegro ClearML enables data scientists and data engineers to effortlessly track, manage, compare, and collaborate on their experiments, as well as easily manage their training workloads on remote machines.
Install clearml and the project requirements by executing the following commands:
pip install --upgrade clearml
pip install -r requirements.txt
We also need to install NVIDIA/Apex and the OpenCV libraries. Important: please check the content of experiments/setup_opencv.sh before running the script.
sh experiments/setup_apex.sh
sh experiments/setup_opencv.sh
Since October 2019, you need to register an account in order to download the dataset. To download the dataset, use the following form: http://www.image-net.org/download.php
To configure the path to an already existing ImageNet dataset, set the DATASET_PATH environment variable:
export DATASET_PATH=/path/to/imagenet
# export DATASET_PATH=$PWD/input/imagenet
In ClearML, when you run the experiment code, clearml stores the experiment in clearml-server. By default, clearml works with the demo ClearML Server (https://demoapp.trains.allegro.ai/dashboard), which is open to anyone (although it is refreshed once a week and all data is deleted). You can also set up your own self-hosted ClearML Server. After the experiment code runs once, you can reproduce the experiment using the ClearML Web-App (UI), which is part of clearml-server. You only need to run the code once to store it in clearml-server.
This setup is specific to this code and is not required for general usage of ClearML. We set up an output path as local storage:
export CLEARML_OUTPUT_PATH=/path/to/output/clearml
# e.g. export CLEARML_OUTPUT_PATH=$PWD/output/clearml
This environment variable selects ClearML as the experiment tracking system among the supported ones.
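For illustration, here is a minimal sketch of how this variable could drive the choice of tracking backend; the actual selection logic lives in the training scripts under code/scripts/ and may differ:
import os

from clearml import Task

# Hypothetical sketch: enable ClearML tracking only when CLEARML_OUTPUT_PATH is set.
if "CLEARML_OUTPUT_PATH" in os.environ:
    task = Task.init(project_name="ImageNet Training", task_name="baseline_resnet50")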
The configuration to upload artifacts must be done by modifying the clearml configuration file ~/clearml.conf generated by clearml-init. According to the documentation, the output_uri argument can be configured via sdk.development.default_output_uri to point to the fileserver URI. If the server is self-hosted, the ClearML fileserver URI is http://localhost:8081.
For more details, see https://allegro.ai/docs/examples/reporting/artifacts/
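For example, the relevant part of ~/clearml.conf could look like the following sketch (assuming a self-hosted ClearML Server with the default fileserver port; adjust the host and port to your deployment):
sdk {
    development {
        # Upload artifacts and model snapshots to the self-hosted fileserver
        default_output_uri: "http://localhost:8081"
    }
}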
Please make sure to adapt the training data loader batch size to your GPU type. By default, the batch size is 64 per process.
Execute the following commands:
export CLEARML_OUTPUT_PATH=/path/to/output/clearml
# e.g. export CLEARML_OUTPUT_PATH=$PWD/output/clearml
export PYTHONPATH=$PWD/code:$PYTHONPATH
py_config_runner ./code/scripts/training.py ./configs/train/baseline_resnet50.py
Please make sure to adapt the training data loader batch size to your GPU type. By default, the batch size is 64 per process.
export CLEARML_OUTPUT_PATH=/path/to/output/clearml
# e.g. export CLEARML_OUTPUT_PATH=$PWD/output/clearml
export PYTHONPATH=$PWD/code:$PYTHONPATH
python -m torch.distributed.launch --nproc_per_node=2 --use_env -m py_config_runner ./code/scripts/training.py ./configs/train/baseline_resnet50.py
In the ClearML Web-App, a new project named "ImageNet Training" will be created, with an experiment named "baseline_resnet50" inside it.
In your local environment, the console output includes the URL of the experiment's RESULTS page.
You can now view your experiment in ClearML by clicking the link or copying the URL into your browser. It opens the results in the experiment's details pane, in the ClearML Web-App (UI).
When the experiment code runs, ClearML automatically logs your environment, code, and outputs, which means that you don't need to change your code.
All you need are 2 lines of integration at the top of your main script:
from clearml import Task
Task.init(project_name="ImageNet Training", task_name="baseline_resnet50")
Once it's there, the following will be automatically logged by ClearML:
- Resource Monitoring: CPU/GPU utilization, temperature, IO, network, etc.
- Development Environment: Python environment, Git (repo, branch, commit), including uncommitted changes
- Configuration: including configuration files, command line arguments (ArgParser), and general dictionaries
- Full stdout and stderr automatic logging
- Model snapshots, with optional automatic upload to central storage
- Artifacts log & store, including shared folders, S3, GS, Azure, and HTTP/S
- Matplotlib / Seaborn / TensorBoard / TensorBoardX scalars, metrics, histograms, images, audio, video, etc.
Additionally, ClearML supports explicit logging by adding calls to the ClearML Python client Logger class methods in the code. For more information, see Explicit Reporting in the ClearML documentation.
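As a minimal sketch, an explicit scalar report could look like this (the title, series, value, and iteration below are illustrative placeholders):
from clearml import Logger

# Report a custom scalar to the experiment's RESULTS section;
# title, series, value, and iteration are illustrative placeholders.
Logger.current_logger().report_scalar(title="accuracy", series="top-1", value=0.76, iteration=1000)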
In the ClearML Web-App (UI), track the experiment and visualize results in the experiment's details pane, which is organized in tabs and provides the following information:
- Source code, uncommitted changes, Python packages and versions, and other information, in the EXECUTION tab
- Hyperparameters in the HYPERPARAMETERS tab
- Input model, Configuration, Output model, and other artifacts in the ARTIFACTS tab
- Experiment Comments and General experiment information in the INFO tab
- Results in the RESULTS tab, including the log, scalar metric plots, plots of any data, and debug samples
In ClearML, you can reproduce experiments using clearml-agent for remote execution: rerun the same experiment by making an exact copy of it (a clone) and remotely executing the cloned experiment.
First, install clearml-agent and then configure it to work with your self-hosted ClearML Server. Once clearml-agent is installed and configured, run clearml-agent daemon.
In ClearML, these are called workers; they pop experiments from a job execution queue and execute them. Every machine running a clearml-agent daemon becomes a registered worker in your clearml-server cluster. Using the ClearML Web-App, you can easily send experiments to be executed remotely on one of these machines.
More details can be found in the clearml-agent GitHub repository.
- Install clearml-agent:
  pip install clearml-agent
- Configure clearml-agent by running the setup wizard:
  clearml-agent init
- Start a ClearML worker. Run a clearml-agent daemon listening to a queue. For example, run a clearml-agent daemon listening to the default queue and using multiple GPUs:
  clearml-agent daemon --gpus 0,1 --queue default
- Locate the experiment. In the ClearML Web-App (UI), Projects page, click on the project card.
- Make a copy of the experiment:
  - In the experiment table, right-click the experiment
  - On the sub-menu, select Clone
  - Select the project, type a name for the copy, and type a description, or accept the defaults
  - Click the CLONE button
  The copy of the experiment is created and its details pane opens.
- Send the experiment for remote execution by enqueuing it in one of the job execution queues (a programmatic alternative using the ClearML SDK is sketched at the end of this section):
  - In the experiment table, right-click the experiment
  - On the sub-menu, select Enqueue
  - Select the default queue
  - Click the ENQUEUE button
  The experiment's status changes to Pending.
When the experiment reaches the top of the job execution queue, the clearml-agent daemon fetches it, its status changes to Running, and clearml-agent executes it while logging and monitoring. You can track the experiment while it is in progress, and at any time afterwards.
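For reference, the same clone-and-enqueue flow can also be driven from the ClearML Python SDK. A minimal sketch, assuming you know the id of the stored experiment (the id below is a placeholder to be copied from the Web-App):
from clearml import Task

# Placeholder id: copy the real one from the experiment's page in the Web-App.
base_task = Task.get_task(task_id="<experiment-id>")

# Clone the stored experiment and enqueue the copy for remote execution
# on a worker listening to the "default" queue.
cloned_task = Task.clone(source_task=base_task, name="baseline_resnet50 (clone)")
Task.enqueue(task=cloned_task, queue_name="default")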