
ggml : implement op fusion, starting with REGLU/GEGLU/SWIGLU #14158

Draft

CISC wants to merge 17 commits into master

Conversation

@CISC (Collaborator) commented Jun 12, 2025

Implement REGLU/GEGLU/SWIGLU ops to avoid unnecessary tensor duplications and to get slightly more efficient execution by combining several ops into one.

Implement op fusion, starting with REGLU/GEGLU/SWIGLU as a proof of concept.

Only CPU and CUDA right now, help needed to complete other backends!
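
For context, this is roughly the kind of graph construction the fused ops replace when the gate and up projections are merged into one tensor (a sketch with illustrative names; ffn_up_out, n_ff and n_tokens are assumed to be in scope):

// before: split the merged projection into two halves, make each contiguous, then gate
struct ggml_tensor * x    = ffn_up_out; // assumed [2*n_ff, n_tokens]
struct ggml_tensor * gate = ggml_cont(ctx, ggml_view_2d(ctx, x, n_ff, n_tokens, x->nb[1], 0));
struct ggml_tensor * up   = ggml_cont(ctx, ggml_view_2d(ctx, x, n_ff, n_tokens, x->nb[1],
                                      n_ff * ggml_element_size(x)));
struct ggml_tensor * out  = ggml_mul(ctx, ggml_silu(ctx, gate), up);

// after: a single node, no intermediate copies
struct ggml_tensor * out_fused = ggml_swiglu(ctx, x);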

@CISC added the help wanted label Jun 12, 2025
@CISC requested a review from ggerganov June 12, 2025 23:25
@github-actions bot added the testing, Nvidia GPU, and ggml labels Jun 12, 2025
@ggerganov (Member) left a comment

I missed that these ops change the shape of the input tensor.

I think it would be better to introduce:

enum ggml_glu_op {
    GGML_GLU_OP_REGLU,
    GGML_GLU_OP_GEGLU,
    GGML_GLU_OP_SWIGLU,
};

// similar to ggml_unary()
GGML_API struct ggml_tensor * ggml_glu(
        struct ggml_context * ctx,
         struct ggml_tensor * a,
           enum ggml_glu_op   op);

// these simply call ggml_glu()
GGML_API struct ggml_tensor * ggml_reglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_geglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
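
For illustration, the per-op wrappers could then be one-liners over ggml_glu(), mirroring the existing ggml_unary()/ggml_relu() pattern (a sketch, not necessarily the final implementation):

struct ggml_tensor * ggml_reglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_glu(ctx, a, GGML_GLU_OP_REGLU);
}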

@CISC changed the title from "ggml : implement unary REGLU/GEGLU/SWIGLU ops" to "ggml : implement REGLU/GEGLU/SWIGLU ops" Jun 13, 2025
@CISC requested a review from ggerganov June 13, 2025 08:23
@ggerganov (Member) left a comment

Hope we don't forget to implement these in the rest of the backends.

Adding @JohannesGaessler for review of the CUDA changes.

@ggerganov (Member)

Only CPU and CUDA right now, help needed to complete other backends!

Yes, let's add the rest of the backends first before merging. At least Metal and Vulkan.

@JohannesGaessler (Collaborator)

More generally, I've been thinking that it would be useful to have something like a backend-specific graph optimization step in ggml. That way you could do things like fuse tensors only if the fused tensor is supported by the backend and only if using it makes sense given the tensor shapes.

@CISC (Collaborator, Author) commented Jun 13, 2025

Only CPU and CUDA right now, help needed to complete other backends!

Yes, let's add the rest of the backends first before merging. At least Metal and Vulkan.

Any suggestions on who could help with that?

@github-actions bot added the Apple Metal label Jun 13, 2025
Collaborator (inline review comment on the ggml_swiglu declaration)

Just want to note that I have been observing one variant of swiglu: it's used by ultravox, which applies the sigmoid to the second half of the vector instead of the first half.
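
For clarity, a minimal sketch of the two orderings over one row of length 2*n (illustrative only, not the actual kernel in either project):

#include <math.h>
#include <stdbool.h>

static void swiglu_row(const float * x, float * dst, int n, bool swapped) {
    for (int i = 0; i < n; ++i) {
        const float a = swapped ? x[i + n] : x[i];     // half passed through SiLU
        const float b = swapped ? x[i]     : x[i + n]; // half used as the linear gate
        dst[i] = (a / (1.0f + expf(-a))) * b;          // silu(a) * b
    }
}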

@CISC (Collaborator, Author)

Oh, interesting. Is it worth adding a parameter for, or is it best just handled in conversion?
https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b/blob/main/ultravox_model.py#L701-L704

Collaborator

I think it would be nice to have a param since the GGUFs are already on the internet. Haven't thought about permuting the FFN up tensor before, nice suggestion

@CISC (Collaborator, Author)

Added swapped variants.

@ggerganov I didn't dare update metal code, so needs to be implemented there too. :)

@JohannesGaessler (Collaborator)

@0cc4m @jeffbolznv are either of you interested in a Vulkan implementation?

@0cc4m (Collaborator) commented Jun 13, 2025

I can look into it tomorrow.

@CISC (Collaborator, Author) commented Jun 17, 2025

The change became much bigger than I initially anticipated and I'm having some second thoughts about this. There was a previous discussion about fusing ops: see #5413 (comment). I am wondering if we should directly implement the approach that @slaren proposed there.

That sounds like a very good idea; it should be possible to refactor this PR without too many changes to the backends. I'll have a look!

@jeffbolznv (Collaborator)

Is there a plan for how to handle changes in memory allocation due to fusion? That design feels a bit incomplete. Maybe we should discuss more before you start implementing.

@jeffbolznv (Collaborator)

A very rough suggestion: no new "fused" op. Have a "use count" in the tensor that is incremented during build, so that the backend knows when a tensor is only used by one op. Add a function for the backend to report, while the graph is being built, that a tensor doesn't need to be written to memory.
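
A hypothetical sketch of that idea (the helper and the counter array are invented; only the cgraph/tensor fields are real): once the graph is built, count how many ops read each node, so a backend can tell when an intermediate value has exactly one consumer and its write to memory could be skipped.

#include "ggml-impl.h" // assumed available, as in backend code, for the ggml_cgraph fields

static void count_uses(const struct ggml_cgraph * gf, int * n_uses /* one counter per node */) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        n_uses[i] = 0;
    }
    for (int i = 0; i < gf->n_nodes; ++i) {
        const struct ggml_tensor * node = gf->nodes[i];
        for (int s = 0; s < GGML_MAX_SRC; ++s) {
            if (node->src[s] == NULL) continue;
            // find the producing node and bump its use count (linear scan, for simplicity)
            for (int j = 0; j < i; ++j) {
                if (gf->nodes[j] == node->src[s]) { n_uses[j]++; break; }
            }
        }
    }
}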

@CISC (Collaborator, Author) commented Jun 17, 2025

Is there a plan for how to handle changes in memory allocation due to fusion? That design feels a bit incomplete. Maybe we should discuss more before you start implementing.

Yeah, I can see there are quite a few implementation details that need to be laid out clearly here.

First though I think it would make refactoring easier if we merged the other PR into this one, any objections? @ggerganov @0cc4m @qnixsynapse

@0cc4m (Collaborator) commented Jun 17, 2025

Is there a plan for how to handle changes in memory allocation due to fusion? That design feels a bit incomplete. Maybe we should discuss more before you start implementing.

Yeah, I can see there are quite a few implementation details that need to be laid out clearly here.

First though I think it would make refactoring easier if we merged the other PR into this one, any objections? @ggerganov @0cc4m @qnixsynapse

No, please do.

The change became much bigger than I initially anticipated and I'm having some second thoughts about this. There was a previous discussion about fusing ops: see #5413 (comment). I am wondering if we should directly implement the approach that @slaren proposed there.

Why not just have a device-specific optimization run that checks for fusable ops only if the backend supports them? Or would we not be able to cache that "improved" graph for each repetition of the graph run? Then it might be too expensive.

@ngxson (Collaborator) commented Jun 17, 2025

I'm just wondering if fusing ops in this particular case (swiglu/geglu/etc) can actually improve the performance in a significant way.

The main issue with the non-fused version prior to this PR was that we needed ggml_cont, and worse, we were doing cont on tensors of shape n_ff.

So just out of curiosity, can we do a test without ggml_cont on CUDA, @JohannesGaessler? IIUC, at least on CUDA, unary ops and ggml_mul already support non-cont tensors, so it should work out of the box.

@CISC (Collaborator, Author) commented Jun 17, 2025

I'm just wondering if fusing ops in this particular case (swiglu/geglu/etc) can actually improve the performance in a significant way.

You can see in the other PR that we are still gaining 1-3% depending on backend and model. It's not much, but it is something, and it could be a lot more if we go all out and include the mul-mat in the fusion.

@CISC (Collaborator, Author) commented Jun 17, 2025

A very rough suggestion: no new "fused" op. Have a "use count" in the tensor that is incremented during build, so that the backend knows a tensor is only used by one op.

We still need to store the op chain, how else than a GGML_OP_FUSED?

Edit: Hmmm, so, extending @0cc4m's suggestion, we could perhaps present the longest possible chain of ops (within reason) to the backend, and it would compile that down to the shortest possible combination of fused and single ops and run it.

Edit2: @jeffbolznv I think I understand what you meant now: we don't even have to prepare that chain, we can simply keep queuing ops and execute the fused op once we have the longest possible chain.

@jeffbolznv (Collaborator)

We still need to store the op chain, how else than a GGML_OP_FUSED?

The backend can just look ahead. Ideally it would be nice to have a use list like in LLVM, but I don't think we need that complexity for now. The backend can just look at the next few nodes.
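
A rough sketch of that look-ahead, using silu(x) * y as the pattern (the helper name is invented; a real check would also verify the intermediate has no other consumers):

#include "ggml-impl.h" // assumed available, as in backend code, for the ggml_cgraph fields

static bool can_fuse_silu_mul(const struct ggml_cgraph * gf, int i) {
    if (i + 1 >= gf->n_nodes) {
        return false;
    }
    const struct ggml_tensor * cur = gf->nodes[i];
    const struct ggml_tensor * nxt = gf->nodes[i + 1];
    // silu(x) immediately followed by a mul that consumes it -> run one fused kernel, skip the next node
    return cur->op == GGML_OP_UNARY && ggml_get_unary_op(cur) == GGML_UNARY_OP_SILU &&
           nxt->op == GGML_OP_MUL   && (nxt->src[0] == cur || nxt->src[1] == cur);
}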

@CISC (Collaborator, Author) commented Jun 17, 2025

BTW, we need a way to disable specific ops from getting fused, f.ex. when running imatrix:

if (t->op == GGML_OP_MUL_MAT_ID) return true; // collect all indirect matrix multiplications
if (t->op != GGML_OP_MUL_MAT) return false;

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Akarshan <[email protected]>
@CISC marked this pull request as draft June 18, 2025 14:12
@CISC changed the title from "ggml : implement REGLU/GEGLU/SWIGLU ops" to "ggml : implement op fusion, starting with REGLU/GEGLU/SWIGLU" Jun 18, 2025
@CISC (Collaborator, Author) commented Jun 19, 2025

I wonder if it would make sense for the backends to announce what fusion they can do so that it can be taken into account before the graph is split, or if just performing the fusion at the backend is sufficient/cost-effective enough?

@ggerganov (Member)

I think the "fused op" solves all issues with the graph splits and result dependencies. @jeffbolznv What is the reason to prefer not using a dedicated fused operator?

@jeffbolznv (Collaborator)

Imagine we start by fusing A+B, and change the llama frontend to generate a fused A+B op. Then all the backends implement that. So far so good. Now somebody wants to fuse A+B+C. Do they change the frontend to generate a fused A+B+C op? Then they need to change all backends to fuse that or partially replay it, to avoid a perf regression everywhere. As we add more and more fusion, I think this gets unmanageable because nobody is really comfortable changing and testing all the backends. And it doesn't do anything to address avoiding memory allocation for intermediate tensors.

@ggerganov (Member) commented Jun 19, 2025

I think if we have fused(A+B) and we want to fuse A+B+C, then in the frontend we declare fused(fused(A+B)+C). This way if the backend has fused(A+B), but does not have fused(fused(A+B)+C), it can still invoke the partial fused op. Haven't thought this through though, so it's possible that I am missing something.

I think we are mainly targeting fusing inplace operations such as additions and multiplications, which don't require additional memory. It's not ideal in that sense if we decide to do more complex fusing in the future, but apart from requiring extra memory, there are no technical blockers. The other approach of scanning the graph, on the other hand, is technically very difficult to implement correctly because there are many cases to handle, especially with multi-backend support.

So I'm still leaning towards the "fused op" approach, though I will be giving this some further thought.

@jeffbolznv (Collaborator)

What happens if one backend wants to fuse ABC (e.g. mat mul, scale, bias) but another backend wants to fuse BCD (e.g. scale, bias, activation)? I fear the fused ops will in the long run not match how backends think about these operations (more like an optimizing compiler), and then they all kind of have to unwind them and re-fuse.

Also, if you envision a future where there are lots of ggml applications, it would be a shame if each app has to explicitly opt in to new fusion optimizations by changing their code to adopt the new fused ops. We'll miss out on what could have been free performance if the optimizations happened automatically.

@ngxson (Collaborator) commented Jun 19, 2025

I'm +1 for the idea of scanning cgraph as @jeffbolznv suggested, but mostly because I think it will provide a better DX overall.

A developer already using ggml in their product will be quite confused if we add a dedicated fused op once in a while. Downstream apps may not be able to take advantage of this, as they may not even notice that the op is added.

The best-case scenario would be to have no change to the public API while still allowing ops to be fused internally. I think it could work the same way as some ops automatically using the inplace variant when possible.

@JohannesGaessler (Collaborator)

As I think I've said before in this thread, my preferred solution for fused operations would be a backend-specific graph optimization step. This would be applicable to more than just fused operations. For example, data conversions differ between backends and could be moved out of the specific ggml ops into the compute graph. You could then for example re-use the converted data for branching matrix multiplications or move data conversions between backends to minimize the amount of data that needs to be transmitted between them.

@jeffbolznv (Collaborator)

As I think I've said before in this thread, my preferred solution for fused operations would be a backend-specific graph optimization step.

The fused ops are a lot more palatable if they're generated in a backend optimization pass. I think it's still worthwhile for the ggml common code to generate a bit more "connectivity" (a use list or use count) for the backend to be able to do this more easily.

@ggerganov (Member) commented Jun 20, 2025

The auto-fusing approach sounds lucrative, but I feel it involves a lot of extra complexity and careful design (where is this optimization pass implemented, when is it executed, what extra information to add to the graph meta data, what extra info the backends need to announce, how to enable/disable fusing at certain places, etc.). Note that if we implement the manual fuse ops, this does not prevent to implement the auto-fuse approach later on. For the manual approach we already have everything needed from the ggml common and backend perspective - it's straightforward implementation of a new operator. The code for the fused kernel implementations in the backends can obviously be reused for the auto-fuse approach if we ever figure out how to implement it, so the bulk of the code would not be wasted.

Note that even if it is more convenient for a developer to not think about fusing, this would effectively shift the complexity to runtime: every graph evaluation has to apply the same auto-fuse logic over and over again. A compiler compiles a program once and users run it millions of times, but here we would effectively be "compiling the program" all the time. A separate "analyze graph and recommend optimizations" function could probably be more appropriate instead; graphs don't change at runtime, so it makes sense to do the optimization once.

And another argument is that even with clever compilers, we still end up writing handcrafted SIMD and assembly kernels to get the best performance from the hardware. So IMO a manual fuse mechanism is going to be necessary either way.

@CISC (Collaborator, Author) commented Jun 20, 2025

Note that if we implement the manual fuse ops, this does not prevent to implement the auto-fuse approach later on.

Ok, good point, let's go manual for now.

It's probably a good idea to open up a roadmap discussion on the auto-fuse topic though, there is a lot to be discussed. :)

@slaren (Member) commented Jun 20, 2025

The manual fused ops have the disadvantage that it would require the backend running them to support all the sub-operations in the fused op. For example, if we create a fused op for mul_mat + activation, it would immediately prevent all backends that do not implement the activation function from running this op. This could break completely backends that only support matrix multiplication, such as the BLAS backend.

@CISC (Collaborator, Author) commented Jun 20, 2025

The manual fused ops have the disadvantage that it would require the backend running them to support all the sub-operations in the fused op. For example, if we create a fused op for mul_mat + activation, it would immediately prevent all backends that do not implement the activation function from running this op. This could break completely backends that only support matrix multiplication, such as the BLAS backend.

Can't the backend just check all the sub-ops when checking that it supports the fused op, in which case it will fall back to CPU?
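
For illustration, a self-contained sketch of that fallback logic; none of these types or names exist in ggml, they are invented purely to show a supports-op check recursing over the sub-ops of a hypothetical fused node:

enum   node_kind { NODE_FUSED, NODE_SINGLE };
struct node {
    enum node_kind kind;
    int            n_sub;  // number of sub-ops when kind == NODE_FUSED
    struct node *  sub[4]; // the fused sub-ops
    int            op;     // a plain op id when kind == NODE_SINGLE
};

// placeholder for the backend's existing per-op check
static bool supports_single_op(int op) { (void) op; return true; }

static bool supports_op(const struct node * n) {
    if (n->kind == NODE_FUSED) {
        for (int i = 0; i < n->n_sub; ++i) {
            // any unsupported sub-op rejects the whole fused node,
            // so it falls back to the CPU backend as usual
            if (!supports_op(n->sub[i])) return false;
        }
        return true;
    }
    return supports_single_op(n->op);
}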

@slaren (Member) commented Jun 20, 2025

Can't the backend just check all the sub-ops when checking that it supports the fused op, in which case it will fall back to CPU?

Yes, that's the assumption; if the backend does not check that, then it would crash when it tries to evaluate the graph. That's still not desirable behavior.

@ggerganov (Member)

The manual fused ops have the disadvantage that it would require the backend running them to support all the sub-operations in the fused op.

Hm, that's a drawback indeed. I don't think it's a dealbreaker though - fusing could be toggled on/off with user input if a specific backend is preferred over fusing.

@slaren (Member) commented Jun 20, 2025

Asking users to pass a flag, or adding code to try detect this situation, would be bad for UX and maintainability. I think it is as close to a dealbreaker as it gets.

@jeffbolznv suggestion of implementing this directly in the backends looks good to me. We would need to add the necessary information to the tensors (the number of references), possibly during ggml_build_forward_expand, and we could have a few functions in ggml that perform the necessary checks to determine if a sequence of nodes can be fused, that would be used by the backends.

Note that ggml-alloc also needs the information about the number of references to each tensor. Currently it is implemented in ggml-alloc as a pre-processing step, but it is something that comes up fairly frequently when looking at the graphs, and would be good to move this information to the tensors so that it only has to be computed once.

@ggerganov (Member)

I see. If the reference count is most of what is needed, I agree it would be the better approach. It would allow fusing many more combinations than the obvious x*y+z or act(x)*y+z, which are more difficult to express manually, so that is definitely an important advantage.

@jeffbolznv (Collaborator)

I can try to make a more concrete proposal when I'm back next week.

Labels: Apple Metal, ggml, help wanted, Nvidia GPU, SYCL, testing, Vulkan

9 participants