
[PERF] [FLUX] Flux.1 Dev Transformer perf tracker #19751

Open
PhaneeshB opened this issue Jan 21, 2025 · 5 comments
Labels
codegen/rocm ROCm code generation compiler backend (HIP/HSA) codegen Shared code generation infrastructure and dialects performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@PhaneeshB (Contributor)

PhaneeshB commented Jan 21, 2025

What happened?

Flux.1 Dev Transformer MLIR and weights (real model)
Artifacts: tracy capture, *dispatch.mlir, *benchmark.mlir

TOP 6 Dispatches (01/21):

[Image: table of the top 6 dispatch timings]

Attention: dispatch_535, dispatch_37
MatVec_like: dispatch_19, dispatch_526
matmul_transpose_b: dispatch_528, dispatch_538
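A ranking like the one above can be pulled out of a CSV export of the tracy capture. The sketch below is a hypothetical helper, not an IREE tool; the column names ("name", "total_ns") are assumptions about the export format.

```python
# Hypothetical helper: rank dispatches by total time from a CSV export
# of the tracy capture. Column names ("name", "total_ns") are assumed.
import csv
from collections import defaultdict

def top_dispatches(csv_path, n=6):
    """Sum per-dispatch times and return the n slowest dispatches."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["name"]] += float(row["total_ns"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```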

Compile:

../iree-build-trace/tools/iree-compile \
  black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.mlir \
  -o black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-trace.vmfb \
  --iree-hal-executable-debug-level=3 \
  --iree-hal-dump-executable-files-to=dump_real \
  --iree-hal-target-device=hip \
  --iree-hip-target=gfx942 \
  --iree-opt-const-eval=false \
  --iree-opt-strip-assertions=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-dispatch-creation-enable-fuse-horizontal-contractions=true \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-codegen-llvmgpu-use-vector-distribution \
  --iree-hip-waves-per-eu=2 \
  --iree-execution-model=async-external \
  "--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)"

Run (with all 1s input):

../iree-build-trace/tools/iree-run-module \
  --device=hip \
  --module=black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-trace.vmfb \
  --parameters=model=black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.irpa \
  --function=forward_bs1 \
  --input=1x4096x64xbf16=1 \
  --input=1x4096x3xf32=1 \
  --input=1x512x4096xbf16=1 \
  --input=1x512x3xf32=1 \
  --input=1xbf16=1 \
  --input=1x768xbf16=1 \
  --input=1xbf16=1

Benchmark (01/21):

[Image: benchmark results]

Compile:

../iree-build/tools/iree-compile \
  flux1-dev-data/black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.mlir \
  -o black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-benchmark.vmfb \
  --iree-hal-target-device=hip \
  --iree-hip-target=gfx942 \
  --iree-opt-const-eval=false \
  --iree-opt-strip-assertions=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-dispatch-creation-enable-fuse-horizontal-contractions=true \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-codegen-llvmgpu-use-vector-distribution \
  --iree-hip-waves-per-eu=2 \
  --iree-execution-model=async-external \
  "--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)"

Run:

../iree-build/tools/iree-benchmark-module \
  --device=hip \
  --module=black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16-benchmark.vmfb \
  --parameters=model=/home/phaneesh/NOD/flux1-dev-data/black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16.irpa \
  --function=forward_bs1 \
  --input=1x4096x64xbf16=1 \
  --input=1x4096x3xf32=1 \
  --input=1x512x4096xbf16=1 \
  --input=1x512x3xf32=1 \
  --input=1xbf16=1 \
  --input=1x768xbf16=1 \
  --input=1xbf16=1 \
  --benchmark_repetitions=10


@PhaneeshB PhaneeshB added the bug 🐞 Something isn't working label Jan 21, 2025
@ScottTodd ScottTodd added codegen Shared code generation infrastructure and dialects performance ⚡ Performance/optimization related work across the compiler and runtime codegen/rocm ROCm code generation compiler backend (HIP/HSA) and removed bug 🐞 Something isn't working labels Jan 21, 2025
@sogartar (Contributor)

Having all inputs set to 1s may deviate from a real-world run. We know that zeros cause faster execution, and we don't know how these inputs influence the intermediate values.

  --input=1x4096x64xbf16=1 \
  --input=1x4096x3xf32=1 \
  --input=1x512x4096xbf16=1 \
  --input=1x512x3xf32=1 \
  --input=1xbf16=1 \
  --input=1x768xbf16=1 \
  --input=1xbf16=1 \

The model is equipped with a function that draws a plausible sample from the input distribution. Its output should be fed as the input arguments.
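As a stopgap between all-1 splats and properly sampled inputs, the `--input=` flags can at least be randomized. This is a stdlib-only sketch; iree-run-module accepts a space-separated element list after the shape/dtype prefix, and `random.random()` here is just a stand-in for the model's own input-sampling function.

```python
# Sketch: build randomized --input flags for iree-run-module instead of
# splatting every tensor with 1s. The values here come from
# random.random(), a stand-in for the model's own sampling function.
import random

def input_flag(shape, dtype, rng=random.random):
    """Return one --input flag with independently sampled values."""
    count = 1
    for dim in shape:
        count *= dim
    values = " ".join(f"{rng():.4f}" for _ in range(count))
    dims = "x".join(str(d) for d in shape)
    return f"--input={dims}x{dtype}={values}"

# Example: a small f32 tensor. The full 1x4096x3xf32 input would be
# impractical on a command line; large tensors are better passed as
# files, e.g. --input=@values.npy.
flag = input_flag((1, 3), "f32")
```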

@sogartar (Contributor)

The benchmark real time with real inputs is about 10 ms higher.

@IanWood1 (Contributor)

Have you tried --iree-opt-generalize-matmul? I used it a while ago and it seemed to improve the time.

@sogartar (Contributor)

@IanWood1 your suggestion of the --iree-opt-generalize-matmul flag bore fruit: together with some other modifications to the RoPE kernel, it dropped the mean real time to 451 ms.

@sogartar (Contributor)

Aside from the large problem with the custom IREE RoPE kernel, which is being addressed in #19822, there are several elementwise dispatches taking 10 us to 120 us each that it may be possible to fuse.

This is before attention in the double block MMDiT.
[Image: trace of the double block dispatches before attention]

The single block looks a bit better in that respect, but there is one slow memcpy dispatch.
[Image: trace of the single block dispatches]
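The reason those small elementwise dispatches are fusion candidates is memory traffic: each unfused dispatch reads and writes the whole buffer, while a fused one touches each element once. A toy illustration in plain Python (not IREE code):

```python
# Toy illustration of elementwise fusion (not IREE code): three
# separate passes over the data vs. one combined pass. Both compute
# (2x + 1)^2 per element; the fused form reads/writes each element once.
def unfused(xs):
    # Three "dispatches": each one traverses the whole buffer.
    a = [x * 2.0 for x in xs]
    b = [x + 1.0 for x in a]
    return [x * x for x in b]

def fused(xs):
    # One "dispatch": a single traversal with the ops composed.
    return [(x * 2.0 + 1.0) ** 2 for x in xs]
```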
