
Add option for recomputing the casted weight during backwards #186

Open

drisspg wants to merge 12 commits into main

Conversation

drisspg (Contributor) commented on Jan 13, 2024

Summary

See #185 for more detail.

Disclaimer

Ugh, I don't know; PT2 doesn't let me control what gets recomputed, and I am having trouble interpreting the tea leaves.

For now, ignore all the performance numbers below except for max memory usage. The min-cut partitioner is actually undoing the recompute for backwards and saving the casted weight tensor. cc @Chillee
See: pytorch/pytorch#117901
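For context, here is a minimal sketch of the recompute-in-backward idea. This is not this PR's actual implementation: the cast helper below is a simplified, unscaled round-trip through fp8 rather than the real scaled Float8 cast, and the matmuls are plain high-precision matmuls rather than scaled fp8 mms. The point is only that the original high-precision weight is what gets saved for backward, and the cast is redone when gradients are computed:

```python
import torch

def _cast_to_fp8_and_back(w: torch.Tensor) -> torch.Tensor:
    # Simplified stand-in for the real cast: round-trip through float8_e4m3fn
    # with no scaling. The real code would produce a scaled Float8 tensor.
    return w.to(torch.float8_e4m3fn).to(w.dtype)

class RecomputeCastedWeightLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        w_cast = _cast_to_fp8_and_back(weight)
        # Save the original weight for backward instead of the casted copy,
        # so the casted tensor does not need to be kept alive.
        ctx.save_for_backward(x, weight)
        return x @ w_cast.t()

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        x, weight = ctx.saved_tensors
        # Recompute the cast here, trading a small amount of compute for memory.
        w_cast = _cast_to_fp8_and_back(weight)
        grad_x = grad_out @ w_cast
        grad_w = grad_out.t() @ x
        return grad_x, grad_w
```

Usage inside a linear's forward would look like `y = RecomputeCastedWeightLinear.apply(x, weight)`. Under torch.compile, whether the cast is actually recomputed is ultimately up to the partitioner, which is the issue referenced above.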

Single GPU Linear numbers:

| name      | shape               | ref_dtype      | compiled   | recompute_weight_cast   |   ref_time_sec |   pt_fp8_time_sec |   pt_fp8_speedup |
|:----------|:--------------------|:---------------|:-----------|:------------------------|---------------:|------------------:|-----------------:|
| attn.wqkv | (16384, 8192, 1280) | torch.bfloat16 | True       | True                    |     0.00211272 |        0.00207579 |          1.01779 |
| attn.wqkv | (16384, 8192, 1280) | torch.bfloat16 | True       | False                   |     0.00211962 |        0.00208095 |          1.01858 |
| attn.w0   | (16384, 1024, 8192) | torch.bfloat16 | True       | True                    |     0.00187907 |        0.00149616 |          1.25593 |
| attn.w0   | (16384, 1024, 8192) | torch.bfloat16 | True       | False                   |     0.00187947 |        0.00149665 |          1.25578 |
| ffn.w13   | (16384, 8192, 7168) | torch.bfloat16 | True       | True                    |     0.0102547  |        0.00680098 |          1.50782 |
| ffn.w13   | (16384, 8192, 7168) | torch.bfloat16 | True       | False                   |     0.0102781  |        0.00680872 |          1.50954 |
| ffn.w2    | (16384, 3584, 8192) | torch.bfloat16 | True       | True                    |     0.00538504 |        0.00370726 |          1.45257 |
| ffn.w2    | (16384, 3584, 8192) | torch.bfloat16 | True       | False                   |     0.00539845 |        0.00370568 |          1.4568  |
| attn.wqkv | (16384, 8192, 1280) | torch.float16  | True       | True                    |     0.0021861  |        0.0020997  |          1.04115 |
| attn.wqkv | (16384, 8192, 1280) | torch.float16  | True       | False                   |     0.00217873 |        0.00210146 |          1.03677 |
| attn.w0   | (16384, 1024, 8192) | torch.float16  | True       | True                    |     0.00188072 |        0.00147959 |          1.27111 |
| attn.w0   | (16384, 1024, 8192) | torch.float16  | True       | False                   |     0.00188136 |        0.00148019 |          1.27103 |
| ffn.w13   | (16384, 8192, 7168) | torch.float16  | True       | True                    |     0.0101473  |        0.00671181 |          1.51186 |
| ffn.w13   | (16384, 8192, 7168) | torch.float16  | True       | False                   |     0.0101678  |        0.00670741 |          1.51591 |
| ffn.w2    | (16384, 3584, 8192) | torch.float16  | True       | True                    |     0.00545398 |        0.00362562 |          1.50429 |
| ffn.w2    | (16384, 3584, 8192) | torch.float16  | True       | False                   |     0.00544952 |        0.00362146 |          1.50478 |

FSDP Memory Usage

Verified on single-node 8-GPU FSDP that the memory usage is no longer scaling:

| Configuration | Max Memory Used Before this PR | Max Memory Used After this PR |
|:--------------------------------------|---------:|---------:|
| bf16                                   | 31.12 GiB | 31.12 GiB |
| dynamic_linear cache casted weight     | 36.63 GiB | 36.06 GiB |
| dynamic_linear recompute casted weight | N/A       | 29.86 GiB |

FSDP Performance

Using a single-node 8-GPU FSDP setup with compile:

| Configuration | Before this PR (it/s) | After this PR (it/s) |
|:--------------------------------------|----------:|----------:|
| bf16                                   | 2.01 it/s | 1.99 it/s |
| dynamic_linear cache casted weight     | 2.35 it/s | 2.30 it/s |
| dynamic_linear recompute casted weight | N/A       | 2.30 it/s |
| delayed_linear cache casted weight     | 2.15 it/s | 2.09 it/s |
| delayed_linear recompute casted weight | N/A       | 2.08 it/s |

Single GPU Memory usage

In eager mode, using this test script: https://gist.github.com/drisspg/75a792f97f5b8fa77f32af7f5280bae5

I am seeing the following max memory used:

- Recompute = False: Max CUDA Memory Used: 1.8438 GiB
- Recompute = True: Max CUDA Memory Used: 1.7032 GiB

A difference of ~0.14 GiB. We should expect a memory saving of (4096**2) * (1 byte) * 10 (layers) / 1024**3 (bytes per GiB) = 0.15625 GiB.

This is also verified by the memory traces in the gist.
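As a quick sanity check on that estimate (assuming a 4096x4096 weight per layer, one byte per fp8 element, and 10 layers, as in the calculation above):

```python
# Expected saving from not caching the casted weights between forward and backward.
expected_saving_gib = (4096 ** 2) * 1 * 10 / 1024 ** 3  # elements * bytes/elem * layers / bytes per GiB
print(f"{expected_saving_gib:.5f} GiB")  # 0.15625 GiB, in line with the observed ~0.14 GiB
```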

Questions

This is kind of a meaty PR that depends on a PyTorch PR (pytorch/pytorch#117667), but I am curious if people have strong feelings on the "UX".

I chose not to make "recompute weight cast" a config setting, and instead made it a module attribute. swap_linear will set this for every linear it swaps; in theory, from_float is granular enough to do this on a per-linear basis.

Is there any reason why having it as a global config would be better (or even a global config setting that alters the swap function's behavior)?
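For illustration, a rough sketch of the module-attribute approach being described. The import path, the from_float signature, and the attribute name here are assumptions for the sketch, not necessarily what this PR does:

```python
import torch.nn as nn

# Assumed import path; Float8DynamicLinear and from_float are names from this
# repo, but the exact module location and signature are not verified here.
from float8_experimental.float8_dynamic_linear import Float8DynamicLinear


def swap_linear_with_float8(module: nn.Module, recompute_weight_cast: bool = False) -> nn.Module:
    """Replace every nn.Linear with a Float8 linear and record the recompute
    preference as an attribute on the swapped module, so the same choice could
    in principle also be made per-linear when calling from_float directly."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new_child = Float8DynamicLinear.from_float(child)
            new_child.recompute_weight_cast = recompute_weight_cast
            setattr(module, name, new_child)
        else:
            swap_linear_with_float8(child, recompute_weight_cast)
    return module
```

Usage would be something like `swap_linear_with_float8(model, recompute_weight_cast=True)`.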

@drisspg requested a review from vkuzo on Jan 13, 2024 02:16
@facebook-github-bot added the CLA Signed label on Jan 13, 2024
@drisspg removed the request for review from vkuzo on Jan 13, 2024 02:16
@drisspg changed the title from "Add option for recomputing the casted weight during backwards" to "[WIP] Add option for recomputing the casted weight during backwards" on Jan 13, 2024
@drisspg marked this pull request as a draft on Jan 13, 2024 02:17
drisspg (Contributor, Author) commented on Jan 13, 2024

ed(f"call_method {self} {name} {args} {kwargs}")
[2024-01-12 18:38:24,732] [8/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/drisspg/miniconda3/envs/nightly/lib/python3.10/site-packages/torch/_dynamo/exc.py", line 193, in unimplemented
[2024-01-12 18:38:24,732] [8/0] torch._dynamo.variables.higher_order_ops: [ERROR]     raise Unsupported(msg)
[2024-01-12 18:38:24,732] [8/0] torch._dynamo.variables.higher_order_ops: [ERROR] torch._dynamo.exc.Unsupported: call_method GetAttrVariable(TensorVariable(), _data) stride [] {}

You love to see it! Why can't I call shape? Avoiding calling _data is tough.

drisspg (Contributor, Author) commented on Jan 13, 2024

cc @bdhirsh As far as I can tell, this is erroring because of these calls to the tensor attributes: https://github.com/pytorch-labs/float8_experimental/pull/186/files#diff-00f68398c8aad5a3e946cccd7211a80841da9403d6c664452a45e04101bea6d6R84-R93

I know that in the past, any time we try to access the subclass's attributes outside of the __torch_dispatch__ code, this errors. I don't have any idea how to work around this, since I think we need this autograd function and hence can't use __torch_dispatch__.
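For reference, a stripped-down illustration of the pattern in question. The subclass below is a toy stand-in with `_data`/`_scale` attributes, not the repo's Float8Tensor; it is only meant to show where the attribute access happens relative to __torch_dispatch__:

```python
import torch


class ToyScaledTensor(torch.Tensor):
    """Toy wrapper subclass holding a raw payload plus a scale as attributes."""

    @staticmethod
    def __new__(cls, data: torch.Tensor, scale: torch.Tensor):
        t = torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device
        )
        t._data = data
        t._scale = scale
        return t


class MatmulWithScaledWeight(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, w: "ToyScaledTensor") -> torch.Tensor:
        # These reads of w._data / w._scale happen inside an autograd.Function,
        # i.e. outside __torch_dispatch__, which appears to be the kind of
        # subclass attribute access the Unsupported error above is about.
        w_dense = w._data.to(x.dtype) * w._scale
        ctx.save_for_backward(x, w_dense)
        return x @ w_dense.t()

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        x, w_dense = ctx.saved_tensors
        return grad_out @ w_dense, None
```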

@drisspg marked this pull request as ready for review on Jan 18, 2024 02:29
@drisspg changed the title from "[WIP] Add option for recomputing the casted weight during backwards" to "Add option for recomputing the casted weight during backwards" on Jan 18, 2024
vkuzo (Contributor) commented on Jan 18, 2024

> I chose not to make "recompute weight cast" a config setting, and instead made it a module attribute. swap_linear will set this for every linear it swaps; in theory, from_float is granular enough to do this on a per-linear basis.

The above makes sense to me for this particular setting, if we choose to have a setting. It would be nice to not have a setting at all unless we need it. I feel like FSDP is unusable for real workloads without this, so if the recomputation is fast enough, why not just have it as the only path?

vkuzo (Contributor) commented on Jan 18, 2024

> Verified on single-node 8-GPU FSDP that the memory usage is no longer scaling:

Great! Can we also post throughput metrics on 8-GPU FSDP? If there is a slowdown, having a smaller benchmark to capture and debug it would be useful.
