Add alloca support #1105

bjorn3 · 2019-07-27T19:42:53Z

This is necessary to implement the unsized_locals rust feature.

cc https://github.com/bjorn3/rustc_codegen_cranelift/issues/15

The text was updated successfully, but these errors were encountered:

jyn514 · 2019-10-06T03:41:21Z

I also need this to implement jyn514/saltwater#63.

…dealliance#1105) * clif-util: Make the `-t` flag work with the `wasm` subcommand The presence of the flag was checked in the code, so it was essentially already supported, but it was not properly configured when constructing the CLI arguments parser. * clif-util: also enable the `-c` flag for the `wasm` subcommand Similar to the parent commit, this functionality is already supported, just a mix up of the CLI parser construction made it not show up.

jyn514 · 2020-07-24T14:30:33Z

How hard would this be to implement? I'm willing to take a shot at it.

tschneidereit · 2020-07-26T13:15:02Z

@bnjbvr, @cfallin, @julian-seward1, can you comment on this?

bnjbvr · 2020-07-27T13:00:32Z

I'm assuming the question arises in the context of the new backend.

From looking at LLVM's docs, it seems that alloca always takes a static (= known at compile time) amount of stack space. If that's true, it should be somewhat easy to implement (add amount to SP, adjust the "nominal SP" offset, make sure to deallocate in the return paths).

If one can pass a dynamic input value that's the amount to allocate, it is likely to be much trickier, because we need to be able to track precisely the running SP value within the function's body: that's what the nominal SP offset does in a static manner. It should be implementable, but it might require using a register for this purpose.

bjorn3 · 2020-07-27T13:12:46Z

From looking at LLVM's docs, it seems that alloca always takes a static (= known at compile time) amount of stack space.

No, it also allows a dynamic input. It is just that a static input is equivalent to using stack slots in Cranelift.

cfallin · 2020-07-27T18:57:25Z

It's definitely possible to implement this with the new backends. It interacts with the way we address stack slots and spill slots; at least on aarch64, we address function arguments with fp, which stays at the top of the stack frame (invariant to any allocas), but we address stack/spill slots with offsets from sp, because positive offsets are cheaper on aarch64. We track "nominal SP" as an offset from real SP (statically during codegen), so we can continue to access this storage while we've temporarily pushed args to set up for a call (EDIT: or, with alloca support, after we've decremented real SP to allocate storage).

The most straightforward approach would probably be to (i) detect when an alloca (or just a dynamic alloca) is present; then if so, (ii) allocate a separate scratch register in the prologue and copy nominal-SP to that; then (iii) access all stack and spill slots relative to that register. We lose a register in that case but I think that's unavoidable unless we revert to negative offsets from FP (which has a higher cost -- a few percent degradation at least, because it forces add instructions to synthesize addresses when offset more than -0x80, IIRC).

Happy to point out the bits that would need to change in more detail if you would like!

peterhuene · 2020-07-27T20:20:05Z

Related to this, at least for the x86-64 ABIs, I would like to see Cranelift stop using RBP as a "traditional" frame pointer as both DWARF and Windows unwind encode enough information to properly describe frame layout without having to establish a frame pointer for frames of static size. This would free RBP to be used as a GPR for functions that do not have dynamic stack allocations or as the "nominal-SP" register for functions that have dynamic stack allocations.

In fact, on Windows x64, a "frame pointer" is supposed to be exactly what you describe the "nominal-SP" register as: a register pointing at the base (or somewhere inside) of the "static" part of the local frame and used to reference args/locals (and CSRs for unwind) by positive offset. For that ABI, a frame pointer is therefore generally only established for frames calling alloca.

Right now the x64 prologue/epilogue instructions relating to the establishment of a traditional frame pointer are simply wasted instructions on Windows.

bjorn3 · 2020-07-27T20:27:29Z

Related to this, at least for the x86-64 ABIs, I would like to see Cranelift stop using RBP as a "traditional" frame pointer as both DWARF and Windows unwind information encode enough information to properly describe frame layout without having to establish a frame pointer for frames of static size.

This should be an option in my opinion. Using DWARF unwinding for perf profiles as opposed to frame pointers results in much bigger perf.data files and slower perf report, as it requires capturing a big chunk of the stack and then performing the unwinding offline. Online unwinding using DWARF tables is simply too slow.

peterhuene · 2020-07-27T20:32:56Z

This should be an option in my opinion.

Definitely, but I think omitting a traditional frame pointer should be default for these ABIs, at least for optimized compilations. An option to opt-in when they are legitimately needed (like in the case of a tool relying on them for fast stack walks) makes sense to me.

cfallin · 2020-07-27T21:22:42Z

@peterhuene that's a good point -- could you create a separate issue for that? I definitely agree that -fomit-frame-pointer optimizations are something we should look into at some point.

peterhuene · 2020-07-27T21:29:23Z

I opened #1149 a while back specific to Windows. Should we create a more general "omit frame pointers when permitted" issue?

cfallin · 2020-07-27T22:15:56Z

Sure, I think it makes sense to track with a separate issue; it's a distinct thing that we'd want to do on any platform when we're allowed to (by ABIs and by debug requirements).

peterhuene · 2020-07-27T23:15:53Z

I've opened #2073.

bjorn3 · 2023-05-25T10:18:40Z

As per rust-lang/compiler-team#630 unsized locals will be removed from the compiler. No need to implement them in cg_clif any longer. It would still be necessary for implementing a C compiler based on Cranelift though.

jyn514 · 2023-05-25T10:38:59Z

I no longer maintain saltwater-cc and don't have time to work on this.

jameysharp · 2023-05-25T17:25:44Z

Okay, I guess let's close this issue. If somebody wants this feature in the future, feel free to re-open this issue then.

bryal · 2024-02-08T10:14:17Z

I want to implement something similar to Swift's approach to unboxed polymorphism using Value Witness Tables.[1] alloca is needed to be able to put generic intermediate values on the stack, or we'll have to make a heap allocation for the out-parameter of every other function call. Of course there are optimizations that can alleviate the issue even without alloca, but it would be much more performant and convenient to just have alloca in the first place.

[1] Implementing Swift Generics @ 2017 LLVM Developer's Meeting

jameysharp · 2024-02-08T21:16:04Z

That use-case makes sense to me, @bryal.

I gather your interest is related to https://git.sr.ht/~jojo/kapreolo/commit/93672f5, right? I like your current workaround of allocating a fixed-size stack slot and falling back to a heap allocation if you need more space, but we can definitely discuss how alloca could work in Cranelift.

The suggestions that folks made several years ago have some associated costs, both in runtime when accessing stack slots, and in maintenance time. We'll just have to consider those costs carefully in this discussion.

bryal · 2024-02-09T09:21:16Z

Exactly, @jameysharp, that's the one.

I'm not intimately familiar with any concrete ISAs. Before Cranelift, my only experience with code at this level was using LLVM. I've had to consider calling conventions, but not much more than that. That is to say, I'm not sure I have much to contribute to the discussion of how alloca should work here.

That being said, if we manage to come up with a clear plan, I'd be happy to help out with the manual labour.

jameysharp · 2024-02-12T22:00:23Z

I talked with several of the other people working on Cranelift (@cfallin, @fitzgen, @elliottt, and @lpereira) about this today and there is quite a bit to say about it, which I will try to organize here. If I misrepresent any of their positions I hope they will speak up.

First off, we would welcome a PR demonstrating how this could work! But at least among the people I talked with, working on alloca support is unlikely to be a priority for the moment. We think it's surprisingly complicated to support in conjunction with Cranelift's other goals, and the complexity is difficult for us to justify committing to without a more substantial use case. We have suggestions for things you could try instead though.

There are several reasons why a fixed-size stack frame is much easier to deal with. One is that Windows requires stack-probing in the function prologue, and while I assume there's a way to make that work with alloca, the specifics are a research question that we'd need somebody to answer.

A larger reason is that accessing stack slots for spilled registers needs to be as cheap as possible. Currently there are two registers we can add fixed offsets to in order to find any stack slot in the current frame: specifically, each target has a frame pointer and a stack pointer. On x86-64 we could choose to use the frame pointer to access everything, which would allow alloca to move the stack pointer without doing any harm. On aarch64, however, there's a cost to using negative offsets, and for frames larger than something like 128 bytes, accessing stack slots would need an extra instruction. However, the ARM ABI doesn't require the frame layout we use now ("The location of the frame record within a stack frame is not specified") so a step toward making this work could be to change our frame layout on aarch64 so that all stack slots are at positive offsets from the frame pointer. We're not certain of other consequences of that change, though.

You already have a workaround where you fall back to malloc for large allocations, which is a good approach. In a comment you noted that you're "not sure we can [free] this manually. The temporary may be passed as an arg in a tail call." I'd note that alloca wouldn't work in that case either as the allocated space would be part of the caller's frame that's overwritten by the tail call. Anywhere that you can use alloca, you can also safely use malloc/free, at some performance cost.

To avoid the performance cost of heap allocations, one suggestion that we came up with is that you could allocate a separate stack that is under your compiler's control, rather than trying to share it with the stack used for calling conventions and register allocation. This kind of "shadow stack" is a common solution when data lifetimes are tied to function call scopes. Allocation and deallocation are constant-time in the common and amortized cases, just like alloca, but you don't need magic code-generator support. I think this is your best bet.

I'm going to go ahead and close this issue again to reflect that this is not currently planned, but we still welcome further discussion.

bryal · 2024-02-13T10:18:02Z

Thank you @jameysharp for your work and thank you all for your input! Indeed I had not yet stopped to consider how my stack temporaries would play with tail calls. As I intend to employ optimized tail calls extensively in my generated code, you're of course right that this approach will not work as I had planned. I assume Swift does not (or, in 2017, did not) guarantee TCO in the same way, for the alloca to work for them.

I'll see about using a shadow stack instead -- thanks for the suggestion! Now I need to think a bit about indirectly stored temporaries, tail recursion, and memory leaks.

For my part, there's no need to implement this anymore.

alexcrichton transferred this issue from bytecodealliance/cranelift Feb 28, 2020

alexcrichton added cranelift:goal:rustc Focus area: Support for compiling Rust! cranelift Issues related to the Cranelift code generator labels Feb 28, 2020

peterhuene mentioned this issue Jul 27, 2020

Cranelift: implement omitting frame pointers (where permissible). #2073

Open

cfallin added the enhancement label May 4, 2022

jameysharp closed this as not planned Won't fix, can't repro, duplicate, stale May 25, 2023

jameysharp reopened this Feb 8, 2024

jameysharp closed this as not planned Won't fix, can't repro, duplicate, stale Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add alloca support #1105

Add alloca support #1105

bjorn3 commented Jul 27, 2019

jyn514 commented Oct 6, 2019

jyn514 commented Jul 24, 2020

tschneidereit commented Jul 26, 2020

bnjbvr commented Jul 27, 2020

bjorn3 commented Jul 27, 2020

cfallin commented Jul 27, 2020 •

edited

Loading

peterhuene commented Jul 27, 2020 •

edited

Loading

bjorn3 commented Jul 27, 2020

peterhuene commented Jul 27, 2020

cfallin commented Jul 27, 2020

peterhuene commented Jul 27, 2020

cfallin commented Jul 27, 2020

peterhuene commented Jul 27, 2020

bjorn3 commented May 25, 2023

jyn514 commented May 25, 2023 •

edited

Loading

jameysharp commented May 25, 2023

bryal commented Feb 8, 2024

jameysharp commented Feb 8, 2024

bryal commented Feb 9, 2024 •

edited

Loading

jameysharp commented Feb 12, 2024

bryal commented Feb 13, 2024

Add alloca support #1105

Add alloca support #1105

Comments

bjorn3 commented Jul 27, 2019

jyn514 commented Oct 6, 2019

jyn514 commented Jul 24, 2020

tschneidereit commented Jul 26, 2020

bnjbvr commented Jul 27, 2020

bjorn3 commented Jul 27, 2020

cfallin commented Jul 27, 2020 • edited Loading

peterhuene commented Jul 27, 2020 • edited Loading

bjorn3 commented Jul 27, 2020

peterhuene commented Jul 27, 2020

cfallin commented Jul 27, 2020

peterhuene commented Jul 27, 2020

cfallin commented Jul 27, 2020

peterhuene commented Jul 27, 2020

bjorn3 commented May 25, 2023

jyn514 commented May 25, 2023 • edited Loading

jameysharp commented May 25, 2023

bryal commented Feb 8, 2024

jameysharp commented Feb 8, 2024

bryal commented Feb 9, 2024 • edited Loading

jameysharp commented Feb 12, 2024

bryal commented Feb 13, 2024

cfallin commented Jul 27, 2020 •

edited

Loading

peterhuene commented Jul 27, 2020 •

edited

Loading

jyn514 commented May 25, 2023 •

edited

Loading

bryal commented Feb 9, 2024 •

edited

Loading