
[PERF] [FLUX] Flux.1 Dev Transformer perf tracker #19751

Open
PhaneeshB opened this issue Jan 21, 2025 · 5 comments
Labels
codegen/rocm ROCm code generation compiler backend (HIP/HSA) codegen Shared code generation infrastructure and dialects performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@PhaneeshB (Contributor)

PhaneeshB commented Jan 21, 2025

What happened?

Flux.1 Dev Transformer MLIR and weights (real model)
Artifacts: tracy capture, *dispatch.mlir, *benchmark.mlir

TOP 6 Dispatches (01/21):

[Image: table of the top 6 dispatch timings]

Attention: dispatch_535, dispatch_37
MatVec_like: dispatch_19, dispatch_526
matmul_transpose_b: dispatch_528, dispatch_538
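A ranking like the one above can be pulled out of a CSV export of the tracy capture. The sketch below is a hypothetical helper, not an IREE tool; the column names ("name", "total_ns") are assumptions about the export format.

```python
# Hypothetical helper: rank dispatches by total time from a CSV export
# of the tracy capture. Column names ("name", "total_ns") are assumed.
import csv
from collections import defaultdict

def top_dispatches(csv_path, n=6):
    """Sum per-dispatch times and return the n slowest dispatches."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["name"]] += float(row["total_ns"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```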

Compile:

../iree-build-trace/tools/iree-compile \
  black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.mlir \
  -o black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-trace.vmfb \
  --iree-hal-executable-debug-level=3 \
  --iree-hal-dump-executable-files-to=dump_real \
  --iree-hal-target-device=hip \
  --iree-hip-target=gfx942 \
  --iree-opt-const-eval=false \
  --iree-opt-strip-assertions=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-dispatch-creation-enable-fuse-horizontal-contractions=true \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-codegen-llvmgpu-use-vector-distribution \
  --iree-hip-waves-per-eu=2 \
  --iree-execution-model=async-external \
  "--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)"

Run (with all 1s input):

../iree-build-trace/tools/iree-run-module \
  --device=hip \
  --module=black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-trace.vmfb \
  --parameters=model=black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.irpa \
  --function=forward_bs1 \
  --input=1x4096x64xbf16=1 \
  --input=1x4096x3xf32=1 \
  --input=1x512x4096xbf16=1 \
  --input=1x512x3xf32=1 \
  --input=1xbf16=1 \
  --input=1x768xbf16=1 \
  --input=1xbf16=1

Benchmark (01/21):

[Image: benchmark results]

Compile:

../iree-build/tools/iree-compile \
  flux1-dev-data/black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.mlir \
  -o black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-benchmark.vmfb \
  --iree-hal-target-device=hip \
  --iree-hip-target=gfx942 \
  --iree-opt-const-eval=false \
  --iree-opt-strip-assertions=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-dispatch-creation-enable-fuse-horizontal-contractions=true \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-codegen-llvmgpu-use-vector-distribution \
  --iree-hip-waves-per-eu=2 \
  --iree-execution-model=async-external \
  "--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)"

Run:

../iree-build/tools/iree-benchmark-module \
  --device=hip \
  --module=black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-benchmark.vmfb \
  --parameters=model=/home/phaneesh/NOD/flux1-dev-data/black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.irpa \
  --function=forward_bs1 \
  --input=1x4096x64xbf16=1 \
  --input=1x4096x3xf32=1 \
  --input=1x512x4096xbf16=1 \
  --input=1x512x3xf32=1 \
  --input=1xbf16=1 \
  --input=1x768xbf16=1 \
  --input=1xbf16=1 \
  --benchmark_repetitions=10


@PhaneeshB PhaneeshB added the bug 🐞 Something isn't working label Jan 21, 2025
@ScottTodd ScottTodd added codegen Shared code generation infrastructure and dialects performance ⚡ Performance/optimization related work across the compiler and runtime codegen/rocm ROCm code generation compiler backend (HIP/HSA) and removed bug 🐞 Something isn't working labels Jan 21, 2025
@sogartar (Contributor)

Having all inputs set to 1s may deviate from a real-world run. We know that zeros cause faster execution, and we don't know how these inputs influence the intermediate values.

  --input=1x4096x64xbf16=1 \
  --input=1x4096x3xf32=1 \
  --input=1x512x4096xbf16=1 \
  --input=1x512x3xf32=1 \
  --input=1xbf16=1 \
  --input=1x768xbf16=1 \
  --input=1xbf16=1 \

The model is equipped with a function that draws a plausible sample from the input distribution. Its output should be fed as the input arguments.
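As a stopgap between all-1 splats and properly sampled inputs, the `--input=` flags can at least be randomized. This is a stdlib-only sketch; iree-run-module accepts a space-separated element list after the shape/dtype prefix, and `random.random()` here is just a stand-in for the model's own input-sampling function.

```python
# Sketch: build randomized --input flags for iree-run-module instead of
# splatting every tensor with 1s. The values here come from
# random.random(), a stand-in for the model's own sampling function.
import random

def input_flag(shape, dtype, rng=random.random):
    """Return one --input flag with independently sampled values."""
    count = 1
    for dim in shape:
        count *= dim
    values = " ".join(f"{rng():.4f}" for _ in range(count))
    dims = "x".join(str(d) for d in shape)
    return f"--input={dims}x{dtype}={values}"

# Example: a small f32 tensor. The full 1x4096x3xf32 input would be
# impractical on a command line; large tensors are better passed as
# files, e.g. --input=@values.npy.
flag = input_flag((1, 3), "f32")
```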

@sogartar (Contributor)

The benchmark real time with real inputs is about 10 ms higher.

@IanWood1 (Contributor)

Have you tried --iree-opt-generalize-matmul? I used it a while ago and it seemed to improve the time.

@sogartar (Contributor)

@IanWood1 your suggestion of the --iree-opt-generalize-matmul flag bore fruit: together with some other modifications to the RoPE kernel, it dropped the mean real time to 451 ms.

@sogartar (Contributor)

Aside from the large problem with the custom IREE RoPE kernel, which is being addressed in #19822, there are several elementwise dispatches taking 10 us to 120 us each that it may be possible to fuse.

This is before attention in the double block MMDiT.
[Image: trace of the double block dispatches before attention]

The single block looks a bit better in that respect, but there is one slow memcpy dispatch.
[Image: trace of the single block dispatches]
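The reason those small elementwise dispatches are fusion candidates is memory traffic: each unfused dispatch reads and writes the whole buffer, while a fused one touches each element once. A toy illustration in plain Python (not IREE code):

```python
# Toy illustration of elementwise fusion (not IREE code): three
# separate passes over the data vs. one combined pass. Both compute
# (2x + 1)^2 per element; the fused form reads/writes each element once.
def unfused(xs):
    # Three "dispatches": each one traverses the whole buffer.
    a = [x * 2.0 for x in xs]
    b = [x + 1.0 for x in a]
    return [x * x for x in b]

def fused(xs):
    # One "dispatch": a single traversal with the ops composed.
    return [(x * 2.0 + 1.0) ** 2 for x in xs]
```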
