This repository has been archived by the owner on Aug 7, 2024. It is now read-only.

add unit tests for FSDP2 + torch.compile(transformer block) #321

Closed
wants to merge 6 commits

Conversation

weifengpy (Contributor)

@weifengpy commented Jul 17, 2024

TorchTitan complains about FSDP2 + float8 + torch.compile(transformer block).

there is a mismatch in the float8 scale's shape, so the dynamo guard assertion failed: torch._C._dynamo.guards.assert_size_stride(new_inputs[3], (), ())

  • in the 1st iteration, we calculate the float8 scale through cast_to_float8_e4m3_dynamic (code). The scale is a scalar tensor, e.g. tensor(4674.8633)
  • in the 2nd iteration, we calculate the float8 scale through precompute_float8_dynamic_scale, but the scale is NOT a scalar tensor, e.g. tensor([[4674.8633]])
  • this PR calls .squeeze to make sure scales are always scalar tensors, so the dynamo guard assertion always holds (see the sketch below)

added a unit test so we can catch the issue at PR time

TODO: add fp8 + torch.compile to CI in torchtitan
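For illustration, here is a minimal standalone sketch of the shape mismatch and the fix (the values mirror the example above; this is not the repo code):

```python
import torch

# 1st iteration: the scale from cast_to_float8_e4m3_dynamic is a 0-dim (scalar) tensor
scale_iter1 = torch.tensor(4674.8633)
print(scale_iter1.size(), scale_iter1.stride())  # torch.Size([]) ()

# 2nd iteration: the precomputed scale carried an extra [[...]] nesting
scale_iter2 = torch.tensor([[4674.8633]])
print(scale_iter2.size(), scale_iter2.stride())  # torch.Size([1, 1]) (1, 1)

# dynamo guarded on the 1st-iteration size/stride, i.e. assert_size_stride(scale, (), ()),
# so the 2nd-iteration shape trips the guard. .squeeze() removes the size-1 dims:
assert scale_iter2.squeeze().size() == scale_iter1.size()
assert scale_iter2.squeeze().stride() == scale_iter1.stride()
```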

weifengpy and others added 3 commits July 17, 2024 15:11
@facebook-github-bot added the CLA Signed label Jul 17, 2024
@@ -301,7 +303,7 @@ def __tensor_flatten__(self):
],
{
"mm_config": self._mm_config,
"is_amax_initialized": is_amax_initialized,
weifengpy (Contributor Author):

pre-commit run --all-files complains about undefined is_amax_initialized in trunk. fixing it so I can commit without bypassing the linter

@@ -46,7 +47,10 @@ def check_parity_no_mp(
):
precompute_float8_dynamic_scale_for_fsdp(model)

test_cls.assertEqual(losses[0], losses[1])
if compile_transformer_block:
torch.testing.assert_close(losses[0], losses[1], atol=9.5e-2, rtol=9.5e-2)
Contributor:

This seems kind of high 🤔 I wonder how this value was determined. Can we instead have the ref also compile each transformer block (but without FSDP applied)?

weifengpy (Contributor Author):

will try to switch the ref model to Float8Linear + torch.compile

weifengpy (Contributor Author):

after applying torch.compile to the ref model, we can achieve atol/rtol=1e-4. I can dig more as a follow-up if we want to reach higher numeric parity like 1e-5
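For context, a hedged sketch of the tightened check (variable names are illustrative assumptions; the ref model's transformer blocks are assumed to be compiled the same way as the FSDP2 model's):

```python
import torch

def assert_losses_match(ref_losses, fsdp_losses):
    # with both the ref and FSDP2 models compiling their transformer blocks,
    # atol/rtol ~= 1e-4 was achievable, versus the earlier 9.5e-2 tolerance
    for ref_loss, fsdp_loss in zip(ref_losses, fsdp_losses):
        torch.testing.assert_close(fsdp_loss, ref_loss, atol=1e-4, rtol=1e-4)
```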

@@ -64,7 +64,9 @@ def precompute_float8_dynamic_scale_for_fsdp(module: nn.Module) -> None:
scale_tensor = torch.clamp(scale_tensor, max=torch.finfo(torch.float16).max)
scales = torch.split(scale_tensor, 1) # Replicate
for scale, float8_linear in zip(scales, float8_linears):
float8_linear.weight._local_tensor._precomputed_scale = scale._local_tensor
float8_linear.weight._local_tensor._precomputed_scale = (
    scale._local_tensor.squeeze()
)
weifengpy (Contributor Author) commented Jul 17, 2024:

make sure the tensor is like tensor(4674.8633) instead of tensor([[4674.8633]])

otherwise torch.compile errors out in guards: torch._C._dynamo.guards.assert_size_stride(new_inputs[3], (), ())

Contributor:

Can we check the traces? I want to make sure there is no CPU sync point introduced from making this tensor a scalar tensor.

Contributor:

do we know the reasoning for why the current behavior is not supported with compile? This might not scale long term as we add other scaling granularities like rowwise or blockwise.

weifengpy (Contributor Author):

> Can we check the traces? I want to make sure there is no CPU sync point introduced from making this tensor a scalar tensor.

sure. I checked the trace and it seems to be purely executed on the CPU: no kernel launches, and no cudaStreamSynchronize, if that's what you refer to as a "CPU sync point"
(screenshot: profiler trace, 2024-07-17 2:26 PM)

Contributor:

I mean when the scalar is used later downstream.

weifengpy (Contributor Author) commented Jul 17, 2024:

> do we know the reasoning for why the current behavior is not supported with compile? This might not scale long term as we add other scaling granularities like rowwise or blockwise.

TL;DR: this is more of a bug in my implementation of precompute_float8_dynamic_scale_for_fsdp

for the 1st iteration, self._precomputed_scale is None, so we calculate the scale through cast_to_float8_e4m3_dynamic (code), where the scale is a tensor(4674.8633). Dynamo generates a guard assertion on tensor(4674.8633).size() and tensor(4674.8633).stride(), so it expects the same input shape in the 2nd iteration

for the 2nd iteration after precompute_float8_dynamic_scale_for_fsdp, we have self._precomputed_scale=tensor([[4674.8633]]) because I only called torch.split(scale_tensor, 1) without .squeeze. The guard assertion finds that .size() and .stride() changed and throws the error

does it make sense to say this is a bug in user code, instead of a malfunction in dynamo?
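To make the shape bookkeeping concrete, here is a small sketch of the split-then-squeeze step (assuming the stacked scale tensor has shape (num_linears, 1), consistent with the tensor([[...]]) shape above; this is not the actual fsdp_utils code):

```python
import torch

num_linears = 3
# stacked per-weight scales, one row per Float8Linear (assumed shape: (num_linears, 1))
scale_tensor = torch.full((num_linears, 1), 4674.8633)

chunks = torch.split(scale_tensor, 1)        # each chunk keeps shape (1, 1)
print(chunks[0].size(), chunks[0].stride())  # torch.Size([1, 1]) (1, 1) -> breaks the () guard

scales = [c.squeeze() for c in chunks]       # .squeeze() yields 0-dim scalar tensors
print(scales[0].size(), scales[0].stride())  # torch.Size([]) ()
```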

weifengpy (Contributor Author):

> I mean when the scalar is used later downstream.

ah, I see. I should be looking for cudaStreamSynchronize, right?

Contributor:

I think it would be like cudaDeviceSynchronize if I understand correctly (but basically you would see the CPU thread blocked).

weifengpy (Contributor Author) commented Jul 17, 2024:

> I mean when the scalar is used later downstream.

_precomputed_scale will be used inside fsdp_pre_all_gather when calling the following code: https://github.com/pytorch-labs/float8_experimental/blob/main/float8_experimental/fsdp_utils.py#L167

float8_tensor = Float8Tensor.to_float8(
    self._tensor,
    self._precomputed_scale,
    torch.float8_e4m3fn,
    mm_config=self._mm_config,
)

I annotated the function with record_function("Float8Tensor.to_float8"). Here are the snapshots for the CPU thread and the CUDA stream.

(screenshots: CPU thread and CUDA stream profiler traces, 2024-07-17 4:33 PM and 4:34 PM)

in both cases, I do not see cudaStreamSynchronize, and the CUDA stream stays ahead of the CPU thread

any worries?
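For reference, a minimal sketch of that kind of annotation using torch.profiler.record_function (a toy stand-in for the real float8 cast, not the repo code; float8_e4m3fn needs a recent PyTorch):

```python
import torch
from torch.profiler import profile, record_function

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
scale = torch.tensor(4674.8633, device=device)  # 0-dim scale, as in the fix above

with profile() as prof:
    # name the region so it shows up as a labeled block in the trace,
    # next to whatever kernels it launches
    with record_function("Float8Tensor.to_float8"):
        y = (x * scale).to(torch.float8_e4m3fn)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```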

Contributor:

sounds good! should be fine

@@ -4,8 +4,6 @@
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.

from typing import Any, Optional, Tuple
weifengpy (Contributor Author):

fix linter errors from the trunk

facebook-github-bot (Contributor):

@weifengpy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

fully_shard(submodule)
for layer_id, transformer_block in module.layers.named_children():
    if compile_transformer_block:
        transformer_block = torch.compile(transformer_block, dynamic=False)
Contributor:

is compiling the transformer block instead of the entire model related to this issue, or are we just trying to match torchtitan behavior?

optionally, if possible, it would be good to compile the whole model here instead, as long as that can catch the issues relevant to us, and keep the more advanced "how to apply compile" logic localized to torchtitan.

weifengpy (Contributor Author) commented Jul 18, 2024:

> is compiling the transformer block instead of the entire model related to this issue, or are we just trying to match torchtitan behavior?

This is just trying to match torchtitan's behavior. The .squeeze is needed regardless of compiling transformer blocks or compiling the whole model.

> optionally, if possible, would be good to compile the whole model here instead as long as that can catch the issues relevant to us

I want to check at PR time that float8_experimental is compatible with torchtitan (thus compiling transformer blocks)

for float8_experimental, I am with you that it's good to also cover compiling the full model.
For FSDP2, it should work. For FSDP + TP, I remember there is some problem with compiling the full model. Will see if I can follow up
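For completeness, the whole-model alternative mentioned above would look roughly like this (a sketch, not what the test currently does):

```python
import torch

def compile_whole_model(model: torch.nn.Module) -> torch.nn.Module:
    # one compile region over the full model, instead of one per transformer block
    return torch.compile(model)
```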

vkuzo (Contributor) left a comment:

looks great, thanks for fixing this!

bdhirsh (Contributor) left a comment:

nice catch!

facebook-github-bot (Contributor):

@weifengpy merged this pull request in 7f0d6bb.

weifengpy added a commit to pytorch/torchtitan that referenced this pull request Jul 19, 2024
fixed my bug in float8_experimental. Now we can torch.compile
transformer blocks with FSDP float8 all-gather:
pytorch-labs/float8_experimental#321

local test: `CONFIG_FILE="./train_configs/debug_model.toml"
./run_llama_train.sh --training.enable_float8_linear
--training.enable_fsdp_float8_all_gather
--training.precompute_float8_dynamic_scale_for_fsdp --training.compile`

profiler traces: I can see the compiled region in the CPU thread and the float8
matmul `sm90_xmma_gemm_e4m3bf16...` in the CUDA stream
(screenshot: profiler trace, 2024-07-18 4:22 PM)