This repository has been archived by the owner on Aug 7, 2024. It is now read-only.

Changes on top of upstream to get rid of type errors #248

Closed
wants to merge 3 commits into main from fnuz_typing

Conversation

alugorey
Contributor

@alugorey alugorey commented Apr 1, 2024

Fixes the class of unit tests in test_base.py that fail on ROCm with the internal assertion Cannot convert ScalarType Float8_e4m3fn to hipDataType.

Note: We are aware of the outstanding numerical issues and are looking into them internally.
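
For context, a minimal sketch of what backend-dependent float8 dtype helpers along the lines of the fp8_e4m3_t()/fp8_e5m2_t() calls in the diff below could look like; this is an illustration only, not necessarily the exact implementation in this PR, and it requires a torch version that defines the fnuz float8 dtypes:

import torch

# Same backend check used in the diff below: a HIP build of torch with a visible GPU.
IS_AMD = torch.cuda.is_available() and torch.version.hip is not None


def fp8_e4m3_t() -> torch.dtype:
    # ROCm currently supports only the 'fnuz' float8 variants, so pick the
    # e4m3 flavor that matches the backend.
    return torch.float8_e4m3fnuz if IS_AMD else torch.float8_e4m3fn


def fp8_e5m2_t() -> torch.dtype:
    # Same idea for the e5m2 flavor.
    return torch.float8_e5m2fnuz if IS_AMD else torch.float8_e5m2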

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 1, 2024
@alugorey
Contributor Author

alugorey commented Apr 3, 2024

@drisspg :)

@drisspg
Contributor

drisspg commented Apr 4, 2024

Awesome! I will take a look at this tomorrow

@@ -28,12 +28,25 @@
IS_AMD = torch.cuda.is_available() and torch.version.hip is not None


# Helper functions to get individual F8 types based on backend architecture
Contributor

Would it be possible to put this into configuration instead of setting it dynamically? It can be unexpected for numerics to change based on the environment. It would also be good to support numerical emulation of all of these types regardless of whether the user's machine supports a float8 matmul.

Contributor Author

I'm afraid I don't understand your question. These helper functions are simply intended to grab the "right" version of the prebuilt torch F8 types. Could you elaborate on the change you'd like to see?

Contributor

Sure, it's just about encoding the dtype flavors in configuration instead of making them environment dependent. Having this in configuration would make it easier to debug numerics without having the target hardware.

# float8 dtypes have a default which can be changed explicitly
config = ...
config.float8_flavors = 'nuz'
do_float8_things(..., config)

versus

# float8 dtypes magically change based on the environment
do_float8_things(...)

That said, my comment is not high pri, feel free to land and we can adjust this later if it becomes important.

Contributor Author

@vkuzo Sorry, I got wrapped up in other work recently and just circled back to this. Okay, I will add an option in float8_experimental/float8_experimental/config.py, and instead of checking the backend architecture, the code will check against this user-settable config variable.
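
A rough sketch of the config-driven approach described above, assuming a use_fnuz_dtype flag in config.py; the module path and helper shape here are illustrative, based on the snippets later in this thread:

# float8_experimental/config.py
# If True, use the 'fnuz' float8 dtype variants instead of the default ones.
use_fnuz_dtype = False

# Helpers elsewhere would then consult the config flag rather than the backend:
import torch

import float8_experimental.config as config


def fp8_e4m3_t() -> torch.dtype:
    return torch.float8_e4m3fnuz if config.use_fnuz_dtype else torch.float8_e4m3fn


def fp8_e5m2_t() -> torch.dtype:
    return torch.float8_e5m2fnuz if config.use_fnuz_dtype else torch.float8_e5m2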

@@ -350,7 +374,7 @@ def test_scaled_mm_vs_emulated(self, base_dtype):


class TestNumerics:
-    @pytest.mark.parametrize("float8_dtype", [torch.float8_e4m3fn, torch.float8_e5m2])
+    @pytest.mark.parametrize("float8_dtype", [fp8_e4m3_t(), fp8_e5m2_t()])
Contributor

I would recommend testing all cases on all hardware types instead. For things not requiring a matmul, it should just work. For things requiring a matmul, we have an emulation mode to at least help approximate it.
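
For illustration, testing every float8 flavor regardless of hardware might look roughly like the sketch below; the test name and shapes are made up, and only a non-matmul case is shown, since matmul-based tests would additionally enable the emulation mode mentioned above:

import pytest
import torch

ALL_FP8_DTYPES = [
    torch.float8_e4m3fn,
    torch.float8_e5m2,
    torch.float8_e4m3fnuz,
    torch.float8_e5m2fnuz,
]


@pytest.mark.parametrize("float8_dtype", ALL_FP8_DTYPES)
def test_cast_roundtrip(float8_dtype):
    # Plain casts do not require a float8 matmul, so they should run on any
    # hardware where torch defines the dtype.
    x = torch.randn(16, 16)
    x_fp8 = x.to(float8_dtype)
    assert x_fp8.dtype == float8_dtype
    assert x_fp8.float().shape == x.shape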

@@ -47,7 +51,10 @@ class TestFloat8Tensor(unittest.TestCase):
def test_preserves_dtype(self) -> None:
# hp means high precision, lp means low precision
hp_dtypes = (torch.float32, torch.float16, torch.bfloat16)
- lp_dtypes = (torch.float8_e4m3fn, torch.float8_e5m2)
+ fp8_dtypes = (
+     FP8Dtypes()
Contributor

all dtypes would be nice, there should not be anything in Float8Tensor which is hardware dependent

m = nn.Linear(32, 16, device="cuda", dtype=linear_dtype)
- m = get_float8_linear(linear_type, m, emulate, False)
+ m = get_float8_linear(linear_type, m, emulate, False, fp8_dtypes)
Contributor

you can enable emulation here if your hardware doesn't support the dtype under test

@alugorey
Contributor Author

@vkuzo Ready for another review. Also, I wanted to ask: is there an ETA or roadmap for when this functionality will be pulled into PyTorch proper?

@vkuzo
Contributor

vkuzo commented Apr 25, 2024

So sorry, I am on holiday right now; I will take a look late next week when I return, unless @drisspg wants to get to it sooner.

@drisspg
Contributor

drisspg commented Apr 26, 2024

Yeah will review tomorrow

@drisspg drisspg left a comment

What was the output of test_everything.sh?

@alugorey
Contributor Author

@drisspg
test_everything.log

Fails in test_compile. However, I was aware of this failure and found that it is unrelated to my changes; rather, it is an issue with torch.compile on ROCm. This failure was next on my TODOs to address. I will upload a follow-up PR.

# If True, use 'fnuz' float8 types for calculations. If the backend
# hardware does not support a particular type, the emulated implementation
# of the dtype will be used. Currently, ROCm only supports the fnuz variants.
use_fnuz_dtype = True
Contributor

default to False?

Contributor Author

Ah, yes, that's an oversight; it was easier for me to test that way. Will change.

@@ -128,7 +128,7 @@ def test_dtensor_fp8_autograd(mesh: DeviceMesh, size=16):
)

out = torch.nn.functional.linear(dist_x_fp8, dist_weight_fp8)
- out = NoopFwToFloat8E5M2Bw.apply(out, False)
+ out = NoopFwToFloat8E5M2Bw.apply(out, False, fp8_e5m2_t())
Contributor

is the new last arg here expected?

Contributor Author

Oversight, fixed in latest commit

@drisspg
Contributor

drisspg commented Jun 5, 2024

Just an update here: the base PR should have landed yesterday.

@alugorey alugorey force-pushed the fnuz_typing branch 2 times, most recently from 57120fa to 4997c19 on June 14, 2024 18:39
@alugorey alugorey changed the base branch from amd-support to main June 14, 2024 18:40
@alugorey
Contributor Author

@vkuzo @drisspg
Revised/rebased drop of AMD support after the amd-support branch was merged to main

@@ -19,3 +19,8 @@
# implements pre/post-all-gather methods to do fp8 all-gather with FSDP2.
# Only dynamic scaling is supported for now.
enable_fsdp_fp8_all_gather = False

# If True, use 'fnuz' float8 types for calculations. If the backend
# hardware does not support a particular dtype, the emulated implementation
Contributor

nit: currently the user is responsible for toggling the emulation setting, we don't do that automatically

Contributor Author

I thought, per previous comments, that we wanted to go the emulated route if the backend hardware didn't support the type? In general the user can force emulation, but I was under the impression that when a dtype isn't supported on the underlying hardware, we wanted to fall back to the emulated route automatically?

Contributor

Yep, that is correct! This is just currently done by the user explicitly, and there is no support to handle that automatically.

Contributor Author

oh, you're saying the comment is wrong

Contributor Author

fixed!
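
For reference, a paraphrase of what the corrected comment presumably conveys (not necessarily the exact committed text):

# If True, use the 'fnuz' float8 dtypes for calculations. If the backend
# hardware does not support a particular dtype, the user is responsible for
# enabling emulation explicitly; it is not done automatically.
use_fnuz_dtype = False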

@vkuzo vkuzo left a comment

looks great! thanks for helping!

@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@alugorey
Contributor Author

@drisspg Fixed lint, but I'm not sure what the ufmt errors are about.

@drisspg
Contributor

drisspg commented Jun 17, 2024

@alugorey I think if you apply this patch it should work: https://gist.github.com/drisspg/2a87d54521a0b2312ac44d070f63350d

@alugorey
Contributor Author

@drisspg Looks like you beat me to it. Still failing on 2 files, though. Is there documentation on what ufmt expects? Some of those changes seem purely cosmetic.

@drisspg
Contributor

drisspg commented Jun 17, 2024

The reason for this formatting is some internal notes on code styling. TBH I just run ufmt format . and don't care about the actual format.

I also have some 'pre-commit' hooks for the repo that should be set and forget. There do seem to be 2 more lint fixes needed.

There seems to be 1 real test failure:

./test/test_everything.sh: line 5: rocm-smi: command not found

I found that this patch passes test_everything:

https://gist.github.com/drisspg/47a29d6bf3fcca2a2c48d09b74c564aa

@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@drisspg merged this pull request in 0bd374d.
