Adding Float8 Linear variants supporting inference-only with lower overhead #283

cyang49 · 2024-06-14T13:39:22Z

The changes include two new Float8 Linear implementations that removes some extra wiring in Float8Linear unnecessary for inference-only use cases to result in lower latency.

Float8SWLinear supports direct fp8 type direct downcast for activation, and Static per-tensor scale for Weight. Our analysis shows that using this results in no loss of accuracy in Llama models.
Float8DASWLinear supports Dynamic per-tensor scale for Activation, and Static per-tensor scale for Weight. This is used when activation tensor requires dynamic scaling. Compared to Float8SWLinear, this has higher overhead introduced by dynamic activation tensor scale calculation. The overhead can be mitigated when used with torch.compile.

cc: @ani300

Co-authored-by: Mauricio Serrano <[email protected]>

vkuzo · 2024-06-14T18:35:21Z

Hi folks, thanks for the PR and we definitely want this functionality. We have some changes coming up to set up the inference path, let me give a preview and @drisspg is planning to publish a more detailed RFC related to float8 + inference UX soon.

we plan to unify all training logic in Float8Linear, Float8DynamicLinear would fold into Float8Linear and each tensor (act/weight/grad) will be configurable on whether to scale dynamically or delayed
we plan to add a separate construct for inference, since the weights are frozen. It will be something like Float8InferenceLinear, although we might also provide a tensor subclass based version. The behavior of the scaling of the input activation would be configurable at construction time.

In the framing above, this PR would be adding additional options to (2). Are you ok to wait a little bit for us to put out the official plan?

vkuzo · 2024-06-14T18:36:11Z

float8_experimental/float8_linear.py

+        super(Float8SWLinear, self).__init__(in_features=in_features, out_features=out_features, bias=bias)
+        self.w_inv_s = None
+        self.dtype = torch.float8_e4m3fn
+        self.use_triton = use_triton


is this used anywhere?

vkuzo · 2024-06-14T18:37:03Z

float8_experimental/float8_linear.py

+        x_f8 = x.to(self.dtype)
+        ishape= list(x_f8.shape)
+
+        if ishape[0] == 0: # special case handling for mixtral


is this not supported by scaled_mm today? cc @drisspg

cyang49 · 2024-07-29T17:34:38Z

no longer needed

Adding Float8 Linear variants

860527e

Co-authored-by: Mauricio Serrano <[email protected]>

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 14, 2024

vkuzo reviewed Jun 14, 2024

View reviewed changes

cyang49 closed this Jul 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding Float8 Linear variants supporting inference-only with lower overhead #283

Adding Float8 Linear variants supporting inference-only with lower overhead #283

Uh oh!

cyang49 commented Jun 14, 2024

Uh oh!

vkuzo commented Jun 14, 2024

Uh oh!

vkuzo Jun 14, 2024

Uh oh!

vkuzo Jun 14, 2024

Uh oh!

cyang49 commented Jul 29, 2024

Uh oh!

Uh oh!

Adding Float8 Linear variants supporting inference-only with lower overhead #283

Adding Float8 Linear variants supporting inference-only with lower overhead #283

Uh oh!

Conversation

cyang49 commented Jun 14, 2024

Uh oh!

vkuzo commented Jun 14, 2024

Uh oh!

vkuzo Jun 14, 2024

Choose a reason for hiding this comment

Uh oh!

vkuzo Jun 14, 2024

Choose a reason for hiding this comment

Uh oh!

cyang49 commented Jul 29, 2024

Uh oh!

Uh oh!