
we should ensure activation checkpointing with Float8Linear behaves optimally #893

@vkuzo

When AC is on for Float8Linear, what I would expect is:

  1. the forward gemm is recomputed in the backward (currently it is not being recomputed)
  2. max(abs(activation)) and max(abs(weight)) are NOT recomputed; since these results are tiny, it is much better to always reuse the saved values (it seems like one of them is currently being recomputed)

Let's figure out why this is not what happens today and what we should do about it. Note: the reproductions below require #892.
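For context, here is a minimal sketch of what "activation checkpointing on" means for this discussion, assuming a non-reentrant torch.utils.checkpoint wrapper around the linear (the benchmark script may wire this up differently; shapes/dtypes below are illustrative only):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# hypothetical shapes/dtypes, just for illustration
lin = nn.Linear(4096, 4096, bias=False, device="cuda", dtype=torch.bfloat16)
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)

# checkpointing discards intermediates in the forward and re-runs the wrapped
# region during the backward, so the forward gemm should show up again in the
# backward portion of the trace
y = checkpoint(lin, x, use_reentrant=False)
y.sum().backward()
```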

bfloat16 linear fwd/bwd with activation checkpointing on

repro command

python benchmarks/float8/profile_linear_float8.py ~/local/tmp/20240916_act_chk_on --dtype_filter bfloat16 --enable_activation_checkpointing True

trace snippet

[trace screenshot]

we see 1 gemm in the forward and 3 in the backward (the recomputed forward gemm plus the grad_input and grad_weight gemms), as expected
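For reference, a quick sketch of where those gemm counts come from for a linear y = x @ w.t() (illustrative shapes, not the benchmark's):

```python
import torch

X = torch.randn(16, 32)
W = torch.randn(64, 32)
dY = torch.randn(16, 64)

Y = X @ W.t()      # forward gemm; with AC on it is re-run inside the backward
dX = dY @ W        # backward gemm: grad_input
dW = dY.t() @ X    # backward gemm: grad_weight
# => 1 gemm in the forward pass, 3 gemms executed during the backward pass
```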

Float8Linear fwd/bwd with activation checkpointing on

repro command

python benchmarks/float8/profile_linear_float8.py ~/local/tmp/20240916_act_chk_on --dtype_filter float8 --enable_activation_checkpointing True

trace snippet

[trace screenshot]

issue 1: there are only two gemms in the backward instead of three, which means the forward gemm is not being recomputed
issue 2: there are extra kernels in the backward recomputing max(abs(activation)) and max(abs(weight))
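One possible direction for issue 2 (a sketch, not a tested fix): use selective activation checkpointing to force the tiny amax results to be saved while still allowing the gemms to be recomputed. This assumes a recent PyTorch with torch.utils.checkpoint.create_selective_checkpoint_contexts, and that max(abs(tensor)) dispatches to aten.max in eager mode:

```python
import functools
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# ops whose outputs we always want to save: the amax reductions produce a
# single scalar per tensor, so saving them is essentially free and avoids
# recomputing them in the backward
_save_list = {
    torch.ops.aten.max.default,
}

def _policy_fn(ctx, op, *args, **kwargs):
    if op in _save_list:
        return CheckpointPolicy.MUST_SAVE
    # everything else (including the gemms) may be recomputed in the backward
    return CheckpointPolicy.PREFER_RECOMPUTE

context_fn = functools.partial(create_selective_checkpoint_contexts, _policy_fn)

def checkpointed_forward(float8_linear, x):
    # float8_linear / x are placeholders for the benchmark's module and input
    return checkpoint(float8_linear, x, use_reentrant=False, context_fn=context_fn)
```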
