
Enable multi-GPU in object detection #33561

Draft · wants to merge 2 commits into base: main
Conversation

@SangbumChoi (Contributor) commented Sep 18, 2024

What does this PR do?

Fixes #33525

This is a concept PR that fixes the root cause of the multi-GPU errors described below. I call it a concept because torchmetrics does not handle the multi-GPU case, and I also want to clean up some of the code.

The main point is that the Trainer class gathers predictions and labels across devices into a flattened tensor (e.g. batch size 4 on 2 GPUs -> length 8). However, to compute a proper evaluation metric, they should be reshaped back to batch_size. We should also always encourage using accelerate as the default.
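A minimal sketch of the reshaping problem described above, using plain Python lists (the function name `split_gathered` is illustrative, not part of the PR): after the Trainer gathers predictions across devices, the leading dimension becomes batch_size * num_gpus, and a metric that expects per-batch inputs needs the flat sequence split back into batches.

```python
def split_gathered(predictions, batch_size):
    """Split a flat gathered sequence back into per-device batches."""
    return [predictions[i:i + batch_size]
            for i in range(0, len(predictions), batch_size)]

# batch size 4 on each of 2 GPUs -> gathered length 8
gathered = list(range(8))
batches = split_gathered(gathered, batch_size=4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The same idea applies to the gathered prediction tensors: each chunk of `batch_size` entries corresponds to one device's batch.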

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SangbumChoi SangbumChoi marked this pull request as draft September 18, 2024 12:44
@SangbumChoi (Contributor, Author)

@qubvel Hi Pavel, even though this is a draft (the work is finished, but I want to clean up some code), I'd like you to review the overall idea :) cc @amyeroberts

@amyeroberts (Collaborator)

Thanks for opening this PR @SangbumChoi!

What I would propose is creating a new script, e.g. run_object_detection_multigpu.py, which contains these changes. The reason is that all the other run_xxx scripts follow a standard pattern with the Trainer, and using accelerator to prepare things isn't a common pattern. This way, it highlights the difference between the two methods and enables easier comparison / pattern matching across scripts in the single-GPU case.

Otherwise, I think the changes here all look great :D

@SangbumChoi (Contributor, Author)

@amyeroberts I found that this flattening issue happens all over the detection pipeline (e.g. detr, groundingdino, deformable-detr, etc.). I think it would be better to add an additional argument to Trainer!

BTW, as you suggested, I will separate the changes into a multigpu.py file 👍🏼

@amyeroberts (Collaborator)

> @amyeroberts I found that this flattening issue happens all over the detection pipeline (e.g. detr, groundingdino, deformable-detr, etc.). I think it would be better to add an additional argument to Trainer!

Agreed! I think @qubvel was working on something to enable this in trainer

@daniel-bogdoll commented Oct 22, 2024

Is there any progress on this issue? I would be very interested :)

@SangbumChoi (Contributor, Author) commented Oct 23, 2024

@daniel-bogdoll Thanks for the interest. I think you can use the current script, but I plan to support Amy's suggestion.

@parakh08 commented Oct 24, 2024

[rank0]: File "examples/pytorch/object-detection/run_object_detection_multi_gpu.py", line 216, in compute_metrics
[rank0]: logits=torch.tensor(batch_logits).squeeze(1), pred_boxes=torch.tensor(batch_boxes).squeeze(1)
[rank0]: RuntimeError: Could not infer dtype of dict

This happens because the loss is present in batch_logits as a dict.

I am facing this error when trying to evaluate. Did you also face any issues when evaluating?

I was able to fix this by modifying compute_metrics as follows:

# Infer how many devices contributed and how the flat tuple is interleaved
batch_size = batch[1].shape[0]
num_devices = len(target_sizes) // batch_size
nested_integer = len(batch) // num_devices
# Stride over the flat tuple to skip the loss dicts
batch_logits, batch_boxes = torch.tensor(batch[1::nested_integer]), torch.tensor(batch[2::nested_integer])
# Collapse the per-device leading dimension back into a single batch axis
output = ModelOutput(
    logits=batch_logits.view(1, -1, *batch_logits.size()[2:]).squeeze(0),
    pred_boxes=batch_boxes.view(1, -1, *batch_boxes.size()[2:]).squeeze(0),
)
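A sketch of the slicing logic above, using plain Python lists instead of tensors (all names here are illustrative): if each device contributes a (loss_dict, logits, boxes) triple, the flattened eval tuple interleaves the three, and stride slicing with `[1::nested_integer]` / `[2::nested_integer]` skips the loss dicts that `torch.tensor` cannot convert.

```python
# Simulate the flattened eval tuple: two devices, each contributing
# (loss_dict, logits, boxes), so entries are interleaved.
per_device = [({"loss": 0.1}, "logits0", "boxes0"),
              ({"loss": 0.2}, "logits1", "boxes1")]
flat = [x for triple in per_device for x in triple]  # length 6

nested = len(flat) // len(per_device)  # 3 entries per device
logits = flat[1::nested]               # every 3rd entry, starting at 1
boxes = flat[2::nested]                # every 3rd entry, starting at 2

print(logits)  # ['logits0', 'logits1']
print(boxes)   # ['boxes0', 'boxes1']
```

The dicts at positions 0 and 3 (the per-device losses) are never touched, which is exactly why the stride-sliced version avoids the "Could not infer dtype of dict" error.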

Successfully merging this pull request may close these issues.

Multi-GPU Training Object Detection