multi-gpu tensorboard handlers initialization #520

Open
wyli opened this issue Oct 12, 2023 · 2 comments
Labels: bug (Something isn't working)

Comments

wyli (Collaborator) commented Oct 12, 2023

"train#trainer#train_handlers": "$@train#handlers[: -2 if dist.get_rank() > 0 else None]",

The multi-GPU override essentially sets the trainer handlers to $@train#handlers[:-2] on the worker ranks (rank > 0), but because of the @train#handlers reference, the config parser still triggers the handler constructor calls on all ranks.

For the TensorBoard handlers this is an issue, because each constructor call creates a new event log file; as a result, a multi-node run ends up with unnecessary event files. See https://github.com/Project-MONAI/MONAI/blob/e36982b87bf87fb9559fc4d124e132b67f177d23/monai/handlers/tensorboard_handlers.py#L52-L55
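
To illustrate, a minimal sketch (not the bundle code, assuming torch.utils.tensorboard is installed): the @train#handlers reference is resolved first, so every handler, including the TensorBoard ones, is constructed on every rank before the slice runs, and constructing a SummaryWriter is already enough to create an event file.

import os
from torch.utils.tensorboard import SummaryWriter  # what the MONAI handlers wrap

rank = int(os.environ.get("LOCAL_RANK", "0"))

# illustrative stand-ins for the handlers defined in train.json
handlers = [
    object(),                        # e.g. a StatsHandler, harmless on all ranks
    SummaryWriter(log_dir="eval"),   # each constructor call writes an
    SummaryWriter(log_dir="eval"),   # events.out.tfevents.* file under ./eval
]

# the multi_gpu_train.json override only trims the already-built list
train_handlers = handlers[: -2 if rank > 0 else None]
# every rank has already created its own event files at this point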

wyli added the bug label on Oct 12, 2023
wyli (Collaborator, Author) commented Oct 17, 2023

A possible fix is to introduce a flag that disables the TensorBoard handlers on the worker ranks:

diff --git a/configs/multi_gpu_train.json b/configs/multi_gpu_train.json
index ea41b9f..f323b02 100644
--- a/configs/multi_gpu_train.json
+++ b/configs/multi_gpu_train.json
@@ -1,5 +1,6 @@
 {
     "device": "$torch.device('cuda:' + os.environ['LOCAL_RANK'])",
+    "use_tensorboard": "$dist.get_rank() == 0",
     "network": {
         "_target_": "torch.nn.parallel.DistributedDataParallel",
         "module": "$@network_def.to(@device)",
diff --git a/configs/train.json b/configs/train.json
index 7c866fe..80f15d3 100644
--- a/configs/train.json
+++ b/configs/train.json
@@ -10,6 +10,7 @@
     "output_dir": "$@bundle_root + '/eval'",
     "data_list_file_path": "$@bundle_root + '/msd_task09_spleen_folds.json'",
     "dataset_dir": "/data/Task09_Spleen",
+    "use_tensorboard": true,
     "finetune": false,
     "finetune_model_path": "$@bundle_root + '/models/model.pt'",
     "early_stop": false,
@@ -191,6 +192,7 @@
             },
             {
                 "_target_": "TensorBoardStatsHandler",
+                "_disabled_": "$not @use_tensorboard",
                 "log_dir": "@output_dir",
                 "tag_name": "train_loss",
                 "output_transform": "$monai.handlers.from_engine(['loss'], first=True)"
@@ -279,6 +281,7 @@
             },
             {
                 "_target_": "TensorBoardStatsHandler",
+                "_disabled_": "$not @use_tensorboard",
                 "log_dir": "@output_dir",
                 "iteration_log": false
             },
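
For reference, a rough plain-Python sketch of what the flag is meant to achieve (assuming torch.distributed has already been initialized by the multi-GPU launcher): only rank 0 constructs a writer, so only rank 0 produces event files.

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

use_tensorboard = dist.get_rank() == 0  # mirrors "use_tensorboard" above

# "_disabled_": "$not @use_tensorboard" tells the config parser to skip the
# TensorBoardStatsHandler constructor on non-zero ranks; the plain-Python
# equivalent is simply not creating a writer there.
writer = SummaryWriter(log_dir="eval") if use_tensorboard else None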

yiheng-wang-nv (Collaborator) commented

Thanks @wyli, I will take a look at this issue and your suggestion. Or @KumoLiu, if you have time, could you please help to address it? We can check with the deepedit bundle first.

cc @Nic-Ma
