[Operator] Add batch_norm #362
base: master
Conversation
@triton.heuristics(
    {
        "BLOCK_SIZE_BATCH": lambda args: next_power_of_2(args["batch_dim"]),
        "BLOCK_SIZE_SPATIAL": BLOCK_SIZE_SPATIAL_heuristic,
we are working on changing the one-tile algorithm into loop tiling. please refer to max/min/log_softmax and update the strategy. ;)
Assuming the input to batch_norm has the shape (batch, channel, spatial).
The implementation of batch_norm is already based on loop tiling: tiling occurs along the spatial dimension while the batch dimension is fully loaded. This differs from the operators mentioned, such as max and min, which only compute along a specified dimension without involving the others; in batch_norm, both the batch and spatial dimensions need to be fully loaded.
My earlier reasoning was that tiling along the batch dimension would introduce a nested loop structure, which may not pay off when the batch is small.
You might be suggesting that loop tiling along the batch dimension could indeed improve performance when the batch size is large, since it allows more contiguous memory access along the spatial dimension. For instance, if the batch size is 16,384 and only one spatial element is loaded per loop iteration, the memory access pattern becomes inefficient. I will try implementing loop tiling along both the batch and spatial dimensions.
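A minimal sketch of that direction, assuming a contiguous (batch, feat, spatial) layout; the kernel name, pointer arguments, and block sizes here are illustrative, not final code:

import triton
import triton.language as tl

@triton.jit
def batch_norm_sum_kernel(  # illustrative name
    input_ptr,   # input viewed as (batch, feat, spatial), contiguous
    sum_ptr,     # per-feature sums, shape (feat,)
    batch_dim, spatial_dim, feat_dim,
    BLOCK_SIZE_BATCH: tl.constexpr,
    BLOCK_SIZE_SPATIAL: tl.constexpr,
):
    # One program per feature (channel), matching the 1D grid.
    feat_pid = tl.program_id(0)
    acc = tl.zeros((BLOCK_SIZE_BATCH, BLOCK_SIZE_SPATIAL), dtype=tl.float32)

    # The outer loop tiles the batch dimension and the inner loop tiles the
    # spatial dimension, so neither has to fit into a single tile.
    for batch_start in range(0, batch_dim, BLOCK_SIZE_BATCH):
        batch_idx = batch_start + tl.arange(0, BLOCK_SIZE_BATCH)
        batch_mask = batch_idx < batch_dim
        for spatial_start in range(0, spatial_dim, BLOCK_SIZE_SPATIAL):
            spatial_idx = spatial_start + tl.arange(0, BLOCK_SIZE_SPATIAL)
            mask = batch_mask[:, None] & (spatial_idx < spatial_dim)[None, :]
            ptrs = (input_ptr
                    + batch_idx[:, None] * feat_dim * spatial_dim
                    + feat_pid * spatial_dim
                    + spatial_idx[None, :])
            # other=0.0 zero-fills masked lanes so they vanish in the sum.
            acc += tl.load(ptrs, mask=mask, other=0.0).to(tl.float32)

    tl.store(sum_ptr + feat_pid, tl.sum(acc))

Loading a (BLOCK_SIZE_BATCH, BLOCK_SIZE_SPATIAL) tile per iteration keeps the spatial accesses contiguous even when the spatial tile is narrow, which is the case the 16,384-batch example above worries about.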
got it
        (curr_input - mean) * (curr_input - prev_mean),
        0.0,
    )
    var += tl.sum(deltas)
this method cannot fully utilize vectorization/tensorization and leads to sequential computation.
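One alternative, sketched under the same (batch, feat, spatial) layout assumptions as above: accumulate a sum and a sum of squares with dense block reductions and combine them once at the end via var = E[x^2] - E[x]^2. This removes the loop-carried mean update, though it can cancel badly when the mean is large relative to the standard deviation; a parallel Welford combine is safer if that matters. Illustrative code, not from this PR:

import triton
import triton.language as tl

@triton.jit
def var_kernel(  # illustrative; assumes the whole batch fits one tile
    input_ptr, var_ptr, batch_dim, spatial_dim, feat_dim,
    BLOCK_SIZE_BATCH: tl.constexpr, BLOCK_SIZE_SPATIAL: tl.constexpr,
):
    feat_pid = tl.program_id(0)
    batch_idx = tl.arange(0, BLOCK_SIZE_BATCH)
    batch_mask = batch_idx < batch_dim
    sum_acc = 0.0
    sq_acc = 0.0
    for start in range(0, spatial_dim, BLOCK_SIZE_SPATIAL):
        spatial_idx = start + tl.arange(0, BLOCK_SIZE_SPATIAL)
        mask = batch_mask[:, None] & (spatial_idx < spatial_dim)[None, :]
        ptrs = (input_ptr
                + batch_idx[:, None] * feat_dim * spatial_dim
                + feat_pid * spatial_dim
                + spatial_idx[None, :])
        block = tl.load(ptrs, mask=mask, other=0.0).to(tl.float32)
        sum_acc += tl.sum(block)         # dense block reduction
        sq_acc += tl.sum(block * block)  # dense block reduction
    count = batch_dim * spatial_dim
    mean = sum_acc / count
    tl.store(var_ptr + feat_pid, sq_acc / count - mean * mean)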
        BLOCK_SIZE_SPATIAL, spatial_dim - block_ind * BLOCK_SIZE_SPATIAL
    )
    curr_count = spatial_count * batch_dim
    count += curr_count
we can set a reasonable default value in tl.load (its other= argument) to avoid keeping a counter.
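For illustration, a minimal 1D mean kernel (hypothetical, not this PR's code): with other=0.0, out-of-bounds lanes read as zero and drop out of the sum, so the element count is known up front and no per-iteration counter is needed.

import triton
import triton.language as tl

@triton.jit
def mean_kernel(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    acc = 0.0
    for start in range(0, n, BLOCK_SIZE):
        offs = start + tl.arange(0, BLOCK_SIZE)
        # other=0.0 zero-fills out-of-bounds lanes in the final tile.
        acc += tl.sum(tl.load(x_ptr + offs, mask=offs < n, other=0.0))
    tl.store(out_ptr, acc / n)  # n is known up front; no running counter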
    )

    if affine:
        weight_grad += tl.sum(curr_pre_lin * curr_output_grad)
ditto
    weight_grad = bias_grad = None

# Launches a 1D grid where each program operates over one feature.
grid = lambda _: (feat_dim,)
grid doesn't have to be a lambda expression; initialize it as a tuple.
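For example:

# feat_dim is known on the host at launch time, so the grid can be a
# plain tuple instead of a callable:
grid = (feat_dim,)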
PR Category
Operator
Type of Change
New Feature
Description
Implement the batch_norm operator.
Issue
Progress