Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add alloca support #1105

Closed
bjorn3 opened this issue Jul 27, 2019 · 21 comments
Closed

Add alloca support #1105

bjorn3 opened this issue Jul 27, 2019 · 21 comments
Labels
cranelift:goal:rustc Focus area: Support for compiling Rust! cranelift Issues related to the Cranelift code generator enhancement

Comments

@bjorn3
Copy link
Contributor

bjorn3 commented Jul 27, 2019

This is necessary to implement the unsized_locals rust feature.

cc https://github.com/bjorn3/rustc_codegen_cranelift/issues/15

@jyn514
Copy link
Contributor

jyn514 commented Oct 6, 2019

I also need this to implement jyn514/saltwater#63.

@alexcrichton alexcrichton transferred this issue from bytecodealliance/cranelift Feb 28, 2020
@alexcrichton alexcrichton added cranelift:goal:rustc Focus area: Support for compiling Rust! cranelift Issues related to the Cranelift code generator labels Feb 28, 2020
arkpar pushed a commit to paritytech/wasmtime that referenced this issue Mar 4, 2020
…dealliance#1105)

* clif-util: Make the `-t` flag work with the `wasm` subcommand

The presence of the flag was checked in the code, so it was essentially already
supported, but it was not properly configured when constructing
the CLI arguments parser.

* clif-util: also enable the `-c` flag for the `wasm` subcommand

Similar to the parent commit, this functionality is already supported, just a
mix up of the CLI parser construction made it not show up.
@jyn514
Copy link
Contributor

jyn514 commented Jul 24, 2020

How hard would this be to implement? I'm willing to take a shot at it.

@tschneidereit
Copy link
Member

@bnjbvr, @cfallin, @julian-seward1, can you comment on this?

@bnjbvr
Copy link
Member

bnjbvr commented Jul 27, 2020

I'm assuming the question arises in the context of the new backend.

From looking at LLVM's docs, it seems that alloca always takes a static (= known at compile time) amount of stack space. If that's true, it should be somewhat easy to implement (add amount to SP, adjust the "nominal SP" offset, make sure to deallocate in the return paths).

If one can pass a dynamic input value that's the amount to allocate, it is likely to be much trickier, because we need to be able to track precisely the running SP value within the function's body: that's what the nominal SP offset does in a static manner. It should be implementable, but it might require using a register for this purpose.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jul 27, 2020

From looking at LLVM's docs, it seems that alloca always takes a static (= known at compile time) amount of stack space.

No, it also allows a dynamic input. It is just that a static input is equivalent to using stack slots in Cranelift.

@cfallin
Copy link
Member

cfallin commented Jul 27, 2020

It's definitely possible to implement this with the new backends. It interacts with the way we address stack slots and spill slots; at least on aarch64, we address function arguments with fp, which stays at the top of the stack frame (invariant to any allocas), but we address stack/spill slots with offsets from sp, because positive offsets are cheaper on aarch64. We track "nominal SP" as an offset from real SP (statically during codegen), so we can continue to access this storage while we've temporarily pushed args to set up for a call (EDIT: or, with alloca support, after we've decremented real SP to allocate storage).

The most straightforward approach would probably be to (i) detect when an alloca (or just a dynamic alloca) is present; then if so, (ii) allocate a separate scratch register in the prologue and copy nominal-SP to that; then (iii) access all stack and spill slots relative to that register. We lose a register in that case but I think that's unavoidable unless we revert to negative offsets from FP (which has a higher cost -- a few percent degradation at least, because it forces add instructions to synthesize addresses when offset more than -0x80, IIRC).

Happy to point out the bits that would need to change in more detail if you would like!

@peterhuene
Copy link
Member

peterhuene commented Jul 27, 2020

Related to this, at least for the x86-64 ABIs, I would like to see Cranelift stop using RBP as a "traditional" frame pointer as both DWARF and Windows unwind encode enough information to properly describe frame layout without having to establish a frame pointer for frames of static size. This would free RBP to be used as a GPR for functions that do not have dynamic stack allocations or as the "nominal-SP" register for functions that have dynamic stack allocations.

In fact, on Windows x64, a "frame pointer" is supposed to be exactly what you describe the "nominal-SP" register as: a register pointing at the base (or somewhere inside) of the "static" part of the local frame and used to reference args/locals (and CSRs for unwind) by positive offset. For that ABI, a frame pointer is therefore generally only established for frames calling alloca.

Right now the x64 prologue/epilogue instructions relating to the establishment of a traditional frame pointer are simply wasted instructions on Windows.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jul 27, 2020

Related to this, at least for the x86-64 ABIs, I would like to see Cranelift stop using RBP as a "traditional" frame pointer as both DWARF and Windows unwind information encode enough information to properly describe frame layout without having to establish a frame pointer for frames of static size.

This should be an option in my opinion. Using DWARF unwinding for perf profiles as opposed to frame pointers results in much bigger perf.data files and slower perf report, as it requires capturing a big chunk of the stack and then performing the unwinding offline. Online unwinding using DWARF tables is simply too slow.

@peterhuene
Copy link
Member

This should be an option in my opinion.

Definitely, but I think omitting a traditional frame pointer should be default for these ABIs, at least for optimized compilations. An option to opt-in when they are legitimately needed (like in the case of a tool relying on them for fast stack walks) makes sense to me.

@cfallin
Copy link
Member

cfallin commented Jul 27, 2020

@peterhuene that's a good point -- could you create a separate issue for that? I definitely agree that -fomit-frame-pointer optimizations are something we should look into at some point.

@peterhuene
Copy link
Member

I opened #1149 a while back specific to Windows. Should we create a more general "omit frame pointers when permitted" issue?

@cfallin
Copy link
Member

cfallin commented Jul 27, 2020

Sure, I think it makes sense to track with a separate issue; it's a distinct thing that we'd want to do on any platform when we're allowed to (by ABIs and by debug requirements).

@peterhuene
Copy link
Member

I've opened #2073.

@bjorn3
Copy link
Contributor Author

bjorn3 commented May 25, 2023

As per rust-lang/compiler-team#630 unsized locals will be removed from the compiler. No need to implement them in cg_clif any longer. It would still be necessary for implementing a C compiler based on Cranelift though.

@jyn514
Copy link
Contributor

jyn514 commented May 25, 2023

I no longer maintain saltwater-cc and don't have time to work on this.

@jameysharp
Copy link
Contributor

Okay, I guess let's close this issue. If somebody wants this feature in the future, feel free to re-open this issue then.

@jameysharp jameysharp closed this as not planned Won't fix, can't repro, duplicate, stale May 25, 2023
@bryal
Copy link

bryal commented Feb 8, 2024

I want to implement something similar to Swift's approach to unboxed polymorphism using Value Witness Tables.[1] alloca is needed to be able to put generic intermediate values on the stack, or we'll have to make a heap allocation for the out-parameter of every other function call. Of course there are optimizations that can alleviate the issue even without alloca, but it would be much more performant and convenient to just have alloca in the first place.

[1] Implementing Swift Generics @ 2017 LLVM Developer's Meeting

@jameysharp
Copy link
Contributor

That use-case makes sense to me, @bryal.

I gather your interest is related to https://git.sr.ht/~jojo/kapreolo/commit/93672f5, right? I like your current workaround of allocating a fixed-size stack slot and falling back to a heap allocation if you need more space, but we can definitely discuss how alloca could work in Cranelift.

The suggestions that folks made several years ago have some associated costs, both in runtime when accessing stack slots, and in maintenance time. We'll just have to consider those costs carefully in this discussion.

@jameysharp jameysharp reopened this Feb 8, 2024
@bryal
Copy link

bryal commented Feb 9, 2024

Exactly, @jameysharp, that's the one.

I'm not intimately familiar with any concrete ISAs. Before Cranelift, my only experience with code at this level was using LLVM. I've had to consider calling conventions, but not much more than that. That is to say, I'm not sure I have much to contribute to the discussion of how alloca should work here.

That being said, if we manage to come up with a clear plan, I'd be happy to help out with the manual labour.

@jameysharp
Copy link
Contributor

I talked with several of the other people working on Cranelift (@cfallin, @fitzgen, @elliottt, and @lpereira) about this today and there is quite a bit to say about it, which I will try to organize here. If I misrepresent any of their positions I hope they will speak up.

First off, we would welcome a PR demonstrating how this could work! But at least among the people I talked with, working on alloca support is unlikely to be a priority for the moment. We think it's surprisingly complicated to support in conjunction with Cranelift's other goals, and the complexity is difficult for us to justify committing to without a more substantial use case. We have suggestions for things you could try instead though.

There are several reasons why a fixed-size stack frame is much easier to deal with. One is that Windows requires stack-probing in the function prologue, and while I assume there's a way to make that work with alloca, the specifics are a research question that we'd need somebody to answer.

A larger reason is that accessing stack slots for spilled registers needs to be as cheap as possible. Currently there are two registers we can add fixed offsets to in order to find any stack slot in the current frame: specifically, each target has a frame pointer and a stack pointer. On x86-64 we could choose to use the frame pointer to access everything, which would allow alloca to move the stack pointer without doing any harm. On aarch64, however, there's a cost to using negative offsets, and for frames larger than something like 128 bytes, accessing stack slots would need an extra instruction. However, the ARM ABI doesn't require the frame layout we use now ("The location of the frame record within a stack frame is not specified") so a step toward making this work could be to change our frame layout on aarch64 so that all stack slots are at positive offsets from the frame pointer. We're not certain of other consequences of that change, though.

You already have a workaround where you fall back to malloc for large allocations, which is a good approach. In a comment you noted that you're "not sure we can [free] this manually. The temporary may be passed as an arg in a tail call." I'd note that alloca wouldn't work in that case either as the allocated space would be part of the caller's frame that's overwritten by the tail call. Anywhere that you can use alloca, you can also safely use malloc/free, at some performance cost.

To avoid the performance cost of heap allocations, one suggestion that we came up with is that you could allocate a separate stack that is under your compiler's control, rather than trying to share it with the stack used for calling conventions and register allocation. This kind of "shadow stack" is a common solution when data lifetimes are tied to function call scopes. Allocation and deallocation are constant-time in the common and amortized cases, just like alloca, but you don't need magic code-generator support. I think this is your best bet.

I'm going to go ahead and close this issue again to reflect that this is not currently planned, but we still welcome further discussion.

@jameysharp jameysharp closed this as not planned Won't fix, can't repro, duplicate, stale Feb 12, 2024
@bryal
Copy link

bryal commented Feb 13, 2024

Thank you @jameysharp for your work and thank you all for your input! Indeed I had not yet stopped to consider how my stack temporaries would play with tail calls. As I intend to employ optimized tail calls extensively in my generated code, you're of course right that this approach will not work as I had planned. I assume Swift does not (or, in 2017, did not) guarantee TCO in the same way, for the alloca to work for them.

I'll see about using a shadow stack instead -- thanks for the suggestion! Now I need to think a bit about indirectly stored temporaries, tail recursion, and memory leaks.

For my part, there's no need to implement this anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:goal:rustc Focus area: Support for compiling Rust! cranelift Issues related to the Cranelift code generator enhancement
Projects
None yet
Development

No branches or pull requests

9 participants