Add autocast for nms, roi_align on CPU #8049
Conversation
Thanks @chunyuan-w , PR looks good, I made a few comments below. LMK what you think!
test/test_ops.py (outdated)

    @pytest.mark.parametrize("dtype", (torch.float, torch.bfloat16))
    def test_autocast_cpu(self, iou, dtype):
        with torch.cpu.amp.autocast():
            self.test_nms_cpu(iou=iou, dtype=dtype)
Would it be enough to just call test_nms_ref() here (perhaps with slight modifications)? If we really need a specific test as in test_nms_cpu, then maybe we can just inline it below instead of having test_nms_cpu as a standalone.
I modified test_nms_ref to check the case where the dtype is torch.bfloat16 on CPU. Could you please check if it looks good to you?
test/test_ops.py (outdated)

    if dtype == torch.bfloat16:
        keep_ref_float = ops.nms(boxes.to(dtype).float(), scores.to(dtype).float(), iou)
        keep_dtype = ops.nms(boxes.to(dtype), scores.to(dtype), iou)
        torch.testing.assert_close(keep_ref_float, keep_dtype)
Thanks for the update and sorry for the noise - shouldn't we instead assert that we get equivalent results when autocast is on and when it's off? e.g. something like
@pytest.mark.parametrize("iou", (0.2, 0.5, 0.8))
@pytest.mark.parametrize("seed", range(10))
def test_autocast_cpu(self, iou, seed):
torch.random.manual_seed(seed)
boxes, scores = self._create_tensors_with_iou(1000, iou)
keep = ops.nms(boxes, scores, iou)
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
keep_autocast = ops.nms(boxes, scores, iou)
torch.testing.assert_close(keep, keep_autocast)
The test for roi_align calls test_forward(), which has a slightly different logic: there we check our implementation against a reference, both with autocast. Unfortunately that same test seems to fail for nms, so maybe we can just use the snippet above. WDYT?
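For context, a test_forward-style check for roi_align under CPU autocast might look roughly like the hypothetical sketch below. Shapes, ROI values, and tolerances are illustrative, and the op is simply compared against its own FP32-input run inside the same autocast region, rather than against the separate reference implementation that the real test_forward uses:

    import torch
    from torchvision import ops

    # Hypothetical sketch: compare roi_align fed reduced-precision inputs
    # against the same op fed FP32 inputs, both inside a CPU autocast region.
    x = torch.rand(1, 3, 16, 16)
    rois = torch.tensor([[0.0, 1.0, 1.0, 10.0, 10.0]])  # (batch_idx, x1, y1, x2, y2)

    with torch.cpu.amp.autocast(dtype=torch.bfloat16):
        out_bf16_in = ops.roi_align(x.to(torch.bfloat16), rois.to(torch.bfloat16),
                                    output_size=(5, 5), spatial_scale=1.0)
        out_fp32_in = ops.roi_align(x, rois, output_size=(5, 5), spatial_scale=1.0)

    # Loose tolerances: the BF16 round trip already perturbs the inputs.
    torch.testing.assert_close(out_fp32_in.float(), out_bf16_in.float(),
                               rtol=1e-2, atol=1e-2)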
Thanks for the suggestion. Here are two concerns:

1. Since nms is not in the white list of autocast, simply running the above UT won't expose the issue that this PR is trying to fix, because the input to nms would still be FP32 in the above snippet. Instead, we need to feed nms a BF16 input under autocast: without this PR it throws RuntimeError: "nms_kernel" not implemented for 'BFloat16', while with this PR autocast converts the input back to FP32 so that it runs successfully (a minimal sketch of this mechanism follows at the end of this reply).
2. A modified version following item 1 would look like this:
@pytest.mark.parametrize("iou", (0.2, 0.5, 0.8))
@pytest.mark.parametrize("seed", range(10))
def test_autocast_cpu(self, iou, seed):
torch.random.manual_seed(seed)
boxes, scores = self._create_tensors_with_iou(1000, iou)
keep = ops.nms(boxes, scores, iou)
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
keep_autocast = ops.nms(boxes.to(torch.bfloat16), scores.to(torch.bfloat16), iou)
torch.testing.assert_close(keep, keep_autocast)
The issue with this version is that boxes.to(torch.bfloat16) and scores.to(torch.bfloat16) have converted the data from high precision to low precision, so directly comparing keep_autocast with keep fails with:

AssertionError: The values for attribute 'shape' do not match: torch.Size([434]) != torch.Size([436]).

That's why I used boxes.to(dtype).float() in the current implementation: it simulates the round trip from high precision to low precision and back to float, and that result serves as the reference to compare against the autocast result.
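To make the mechanism described in concern 1 concrete, here is a minimal, purely illustrative Python sketch of what a CPU autocast wrapper for nms effectively does; the helper name is made up, and the real change is presumably registered through the dispatcher rather than written like this:

    import torch
    from torchvision import ops

    # Illustrative helper (hypothetical name): leave the autocast region for
    # the inner call and hand FP32 copies of the (possibly BF16) inputs to
    # the FP32-only nms kernel.
    def nms_fp32_under_autocast(boxes, scores, iou_threshold):
        with torch.autocast(device_type="cpu", enabled=False):
            return ops.nms(boxes.float(), scores.float(), iou_threshold)

    boxes = torch.rand(100, 4) * 100
    boxes[:, 2:] += boxes[:, :2]  # make x2 >= x1 and y2 >= y1
    scores = torch.rand(100)

    # BF16 inputs inside CPU autocast run fine once they are cast back to FP32.
    with torch.cpu.amp.autocast(dtype=torch.bfloat16):
        keep = nms_fp32_under_autocast(boxes.to(torch.bfloat16),
                                       scores.to(torch.bfloat16), 0.5)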
Thanks for the PR and for your explanations @chunyuan-w . LGTM, I'll merge when green
@NicolasHug I've updated the branch to merge the latest changes from the main branch. Could you please help approve the CI workflow runs?
@NicolasHug there's one failure in the CI which is marked as unrelated to this PR. May I know if it's fine to land it? I can update the branch to merge the latest changes from the main branch if needed.
Reviewed By: vmoens
Differential Revision: D50789096
fbshipit-source-id: ca847eded3b90409b0cd25e849034cb5dc83b6b8
Co-authored-by: Nicolas Hug <[email protected]>
nms and roi_align only support FP32. The current code only handled these ops for CUDA, using autocast to convert all the inputs to FP32. This PR adds the same autocast support on CPU. Fixes bfloat16 on CPU for the below models (pytorch/pytorch#110841):