
Implement some form of escape analysis to allocate values on the stack #776

yorickpeterse opened this issue Oct 31, 2024 · 27 comments


yorickpeterse commented Oct 31, 2024

Description

Types defined using the class keyword are heap allocated. While #750 talks about giving users the option to explicitly opt in to stack allocating data, I feel we also need a mechanism for determining whether we can do this automatically. That is, we need some form of escape analysis, such that we can stack allocate values whenever we determine it's safe to do so.

An area of the standard library that can really benefit from this is the std.iter module. Iterators are used heavily and each one incurs a heap allocation. When iterators are used in tight loops, it's not uncommon for most of the time to be spent on the memory management these iterators require.

The basic idea is simple: if an allocated value V doesn't escape the scope S it's allocated into (meaning lifetime(V) <= lifetime(S)), we can allocate it onto the stack instead.

The implementation, however, is trickier. For example, we can only stack allocate values if they are allocated directly in their scope and not through a method call. This means good inlining is crucial for escape analysis to have a meaningful impact. In addition, the implementation of escape analysis itself may prove a challenge.

Note that it's perfectly fine for borrows of stack allocated data to outlive the stack value, as dropping the stack value would still incur a runtime borrow check and produce a panic in such cases. In other words, such cases don't inhibit the ability to stack allocate.
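To make the lifetime(V) <= lifetime(S) criterion a bit more concrete, here's a minimal Rust sketch using hypothetical `ScopeId`/`Scopes` types, not the compiler's actual data structures:

```rust
#[derive(Clone, Copy, PartialEq)]
struct ScopeId(u32);

// Scopes form a tree; each scope records its parent, with `None` for the root.
struct Scopes {
    parents: Vec<Option<ScopeId>>,
}

impl Scopes {
    fn is_nested_in(&self, inner: ScopeId, outer: ScopeId) -> bool {
        let mut current = Some(inner);

        while let Some(scope) = current {
            if scope == outer {
                return true;
            }
            current = self.parents[scope.0 as usize];
        }
        false
    }
}

// An allocation can go on the stack if every use of it (including the final
// drop) happens in the allocating scope or a scope nested inside it.
fn can_stack_allocate(alloc_scope: ScopeId, use_scopes: &[ScopeId], scopes: &Scopes) -> bool {
    use_scopes.iter().all(|&scope| scopes.is_nested_in(scope, alloc_scope))
}
```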

Related issues

Related work

yorickpeterse added the feature, performance and compiler labels on Oct 31, 2024
@yorickpeterse

A related and older idea is to bump allocate small objects using a custom allocator (#542). The idea here was that by using our own allocator we can avoid the cost of going through malloc and friends. While this may not perform better than e.g. jemalloc, glibc's allocator in particular handles large numbers of rapid allocations poorly. This means a custom allocator might be worth using for small objects (e.g. up to 64 bytes) that we can't stack allocate.

@yorickpeterse

As for what constitutes an escape: a conservative approach would be to consider any move (except for a drop) an escape. This does mean that moving stack value S₁ into stack value S₂ results in S₁ being heap allocated, but it might be good enough as a start.
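A sketch of that conservative rule over hypothetical MIR-like instructions (these aren't the real MIR types):

```rust
#[derive(Clone, Copy, PartialEq)]
struct Register(u32);

enum Instruction {
    Allocate(Register),
    Move { from: Register, to: Register },
    Drop(Register),
}

// Conservative: any move of the allocated register marks it as escaping, so
// only allocate-then-drop patterns qualify for stack allocation.
fn escapes(alloc: Register, body: &[Instruction]) -> bool {
    body.iter()
        .any(|ins| matches!(ins, Instruction::Move { from, .. } if *from == alloc))
}
```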

@yorickpeterse

The way we generate MIR poses a challenge: we can't just look at move instructions. That is, code like this:

```inko
let a = User(name: 'Alice')
```

Results in something like this:

```
R₁ = allocate User
R₁.name = 'Alice'
R₂ = move R₁
```

This means that looking at just the move instructions isn't sufficient to determine whether we can stack allocate the User value.

@yorickpeterse

Assuming the moves all happen in the same scope, it doesn't really matter how many moves there are, since we can copy the data if needed. Whether copying actually happens depends largely on how LLVM decides to compile/optimize the code.

Things do get tricky if we borrow in between such moves, because the move would invalidate the borrow.

@yorickpeterse

Given the following instructions:

```
R₁ = allocate User(...)
R₂ = move R₁
R₃ = move R₂
...
```

R₁ should be stack allocated, but the decision to do that depends on R₂, R₃ and others not escaping the scope R₁ is defined in. So given a move chain of A -> B -> C -> ..., if all of those happen in the same scope or a nested scope, we can stack allocate A.

For this to work, for every register R that's assigned the result of an allocate, we need to track the registers it's moved to (recursively), and record the scopes those moves take place in.
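A rough sketch of that bookkeeping (again with hypothetical types; the real MIR will differ):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Register(u32);

#[derive(Clone, Copy)]
struct ScopeId(u32);

// Follow the move chain A -> B -> C -> ... starting at the register that
// holds the allocation, collecting the scope of every register along the way.
// Move semantics mean a register is moved out of at most once, so this is a
// linear walk rather than a graph traversal.
fn chain_scopes(
    start: Register,
    moved_to: &HashMap<Register, Register>, // from -> to
    scope_of: &HashMap<Register, ScopeId>,
) -> Vec<ScopeId> {
    let mut scopes = vec![scope_of[&start]];
    let mut current = start;

    while let Some(&next) = moved_to.get(&current) {
        scopes.push(scope_of[&next]);
        current = next;
    }
    scopes
}
```

The allocation can then go on the stack if every collected scope is the defining scope or one nested inside it.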

@yorickpeterse

For #677 I have a WIP parser that uses the commonly used strftime()/strptime() format. This parser allocates many Option[Int] instances, where the Int is a byte.

Stack allocating these values would likely greatly improve performance. However, I'm not sure it's enough to just rely on escape analysis: for it to work, the allocations need to be inlined into the calling code, and given the complexity of the parser that's not guaranteed.

An alternative might be to not rely on escape analysis at all, but instead look at the type definition. That is, if the type size is e.g. less than or equal to 64 bytes, and all its sub values are trivial to copy, we just always stack allocate it and copy it upon borrowing. This does mean we need some sort of mut T where T is on the stack but the borrow is an actual pointer, such that mutable methods can mutate the data in-place.
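The definition-based check could look something like this (hypothetical metadata types; a full implementation would also have to reject recursive types, an issue raised further down):

```rust
enum FieldKind {
    Int,
    Float,
    Promoted, // another type that already passed this check
    Heap,     // anything that still requires heap allocation
}

struct TypeInfo {
    size: usize,
    fields: Vec<FieldKind>,
}

// Promote a type to a stack-allocated value type if it's small enough and all
// of its sub values are trivial to copy.
fn can_promote(info: &TypeInfo) -> bool {
    info.size <= 64 && info.fields.iter().all(|f| !matches!(f, FieldKind::Heap))
}
```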

@yorickpeterse

I've spent some time building a bump allocator for small objects based on Immix, as found on the bump-allocator branch. While a naive bump allocator can improve performance in allocation heavy applications, the moment you throw in reuse/recycling of memory the improvements are reduced greatly. This is because reusing memory requires additional bookkeeping, which in turn imposes a cost on both the allocation and deallocation path.

For the work-in-progress DateTime.parse implementation the bump allocator reduces the time by about 30%, but for https://github.com/jinyus/related_post_gen there's little to no difference. A key issue there seems to be that it simply requires a lot of objects and thus many blocks, so most time is spent allocating those blocks. Increasing the block size helps a little, but it's still outperformed by jemalloc (310 ms versus 325-330 ms for my bump allocator).

I'm now wondering if we perhaps could take a simpler approach. That is, instead of relying on escape analysis we instead promote types to value types based on their definition rather than their usage. For example, types such as Option[Int] could easily be turned into value types and passed around using copies, removing the need for runtime memory allocation entirely.

The downside is that "value type promotion" only works for simple types such as Option[Int], not for more complex types such as Array[Int], whereas bump allocation could also help with the more complex types. Stack allocation would help here too, but is likely to be overly conservative (i.e. failing to promote values to the stack) more often than desired.

@yorickpeterse

Another issue with value type promotion: if we just blindly promote a type to a value type (assuming it stores other value types), then we might break circular types (i.e. a linked list).

@yorickpeterse

For the bump allocation scheme, I think we can make the following changes:

  1. When finding the next hole to allocate into, simply find the next free line instead of trying to find a hole spanning multiple lines (see the sketch after this list). The total cost might be roughly the same, but at least the hole finding logic will be simpler.
  2. Finding reusable blocks whenever we exhaust the last block is too expensive, as we can end up with the pattern "find no reusable blocks -> allocate new block -> exhaust block -> find no reusable blocks -> repeat". A better approach may be to only do this every N block allocations.
  3. The mechanism for finding reusable lines using atomics is way too expensive, with a worst case of 254 atomic loads. This needs to go, but I'm not sure yet how.
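A sketch of point 1 (the block/line layout is an assumption based on Immix, not the actual constants from the bump-allocator branch):

```rust
const LINES_PER_BLOCK: usize = 254;

struct Block {
    line_used: [bool; LINES_PER_BLOCK], // one mark per line
    cursor: usize,                      // next line index to consider
}

impl Block {
    // Claim the next single free line instead of searching for a hole that
    // spans multiple lines; the caller then bump-allocates into the result.
    fn next_free_line(&mut self) -> Option<usize> {
        while self.cursor < LINES_PER_BLOCK {
            let line = self.cursor;
            self.cursor += 1;

            if !self.line_used[line] {
                self.line_used[line] = true;
                return Some(line);
            }
        }
        None // block exhausted; allocate or reuse another block
    }
}
```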

@yorickpeterse

One option is to flip things around: values go on the stack by default, and a hypothetical box expression puts them on the heap (e.g. box [10, 20, 30] or box SomeThing.new(...)). class instances still have headers and allow for dynamic dispatch, and are still subject to single ownership. Casting class values to a trait isn't possible without explicit boxing first.

To make that safe, we'd have to check the borrow count upon moving a stack value, such that we don't invalidate any borrows (a sketch of this check follows the list of downsides below). In theory we can elide such checks when we can statically determine that no borrows are present at the time of the move, but this won't remove all of them. Moving box T values around incurs no borrow check. Stack values are specialized over their size rather than their type, such that 15 different types of the same size passed to a generic type result in only one specialization.

Implementation wise this is much simpler compared to escape analysis and value type promotion, and we drastically reduce the pressure on the system allocator. This in turn is likely to improve performance significantly.

The downsides are:

  • More runtime borrow checks when moving stack data around
  • I'm not sure we can strictly guarantee our borrow checks always occur before a move, because LLVM optimizations might reorder things. We might have to insert some sort of compiler fence to prevent that from happening
  • A more complicated type system, as users now have to deal with the differences between User, box User, etc
  • Recursive types now explicitly require boxing, e.g. a linked list Node must be defined as `class Node[T] { let @value: T, let @next: Option[box Node[T]] }`
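For the first downside, the move-time check could be as simple as this sketch (a hypothetical runtime helper, not actual runtime code):

```rust
// A stack value may only be moved while its borrow count is zero; moving it
// while borrows exist would leave those borrows pointing at stale memory.
struct StackHeader {
    borrows: u32,
}

fn check_move(header: &StackHeader) {
    if header.borrows != 0 {
        panic!("can't move a value while {} borrow(s) of it exist", header.borrows);
    }
}
```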

@yorickpeterse

The above idea won't work well for resizable types such as arrays: if you have an array of N stack values and you create a borrow to a value, then resizing the array will invalidate that borrow. To make that safe as well, we'd have to check the borrow count for every value, which will be terrible performance wise.


yorickpeterse commented Nov 6, 2024

Enums could be promoted to stack based value types if the following conditions are met:

  • The values they wrap (if any) are also stack types
  • The type is not a recursive type

The reason this would be safe is that enums themselves can't be modified in-place, i.e. you can't change the constructor tag or the constructor's arguments. Technically you can modify those in-place, but for value types doing so would only modify a copy.

Essentially the idea here is that in addition to introducing an inline keyword to flag a type as an immutable value type, we'd infer this for enum types, a bit like a hypothetical class enum inline? Foo { ... } definition where inline? roughly translates to "inline whenever possible, and fall back to a heap value otherwise".

For types such as Option[Int] and Option[Float] this would bring big improvements, as we can avoid the need for heap allocating the data.

A challenge here is specialization: if we specialize for such specific types, we can remove the need for object headers, as generics can then use static dispatch. This, however, may result in more copies of methods that operate on different types sharing the same memory layout. We'd also need a form of auto-boxing when such values are cast to a trait, though we could also just disallow casting enum values to traits outright (this isn't really useful anyway, I think).

If we keep the headers this won't be necessary, but then e.g. Option[Int] is 32 bytes (16 bytes for the header, 8 for the tag, and 8 for the value) instead of 16 bytes (tag plus value). This might not be that big of a deal given such enums are likely to be short-lived.
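The layout math, mirrored with hypothetical #[repr(C)] structs just to illustrate the sizes:

```rust
use std::mem::size_of;

#[repr(C)]
struct BoxedOption {
    header: [u64; 2], // 16-byte object header
    tag: u64,         // constructor tag
    value: i64,       // the wrapped Int
}

#[repr(C)]
struct InlineOption {
    tag: u64,
    value: i64,
}

fn main() {
    // Keeping the header doubles the size of an Option[Int].
    assert_eq!(size_of::<BoxedOption>(), 32);
    assert_eq!(size_of::<InlineOption>(), 16);
}
```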

For Array this requires a change to Array.get_unchecked and Array.get_unchecked_mut: these methods read from a pointer and then use ref/mut to increment the borrow count. For stack values this would result in two copies: one from the dereference, and one from the ref/mut expression (which would copy the target). Perhaps we'd need to extend ref/mut to support operating on a pointer, and in that case do nothing if the pointer points to a stack value.

@yorickpeterse

I started work on supporting stack/inline types in #778. For my WIP implementation of DateTime.parse (#677), the use of such inline types improves performance by 45-50%, as it no longer needs to allocate Option[Int] values all over the place.

@yorickpeterse

#778 is merged, so the bulk of the work is done. I do think it's still worth implementing stack allocations on top of this. This way if a heap allocated type is used but we can trivially determine it should go on the stack, we can do just that automatically.

@yorickpeterse

Escape analysis may be tricky to implement in an effective way: when encountering a dynamic dispatch call site, we'd have to conservatively assume that any arguments passed (including the receiver) escape the call, as we have no way of knowing which method is actually called at runtime. Since we use dynamic dispatch for most generics, escape analysis might end up not being that useful.

@yorickpeterse

Another challenge: iterators are often composed together and combined with closures. These types may be constructed in function A but used in some caller B. Escape analysis then might not be able to determine whether the data escapes, and thus has to conservatively assume that it does. As a result, this combination likely produces many cases where we can't stack allocate the data.

@yorickpeterse

A different approach is to do the following: instead of escape analysis, we introduce the syntax/semantics type uni T and fn uni T. A type defined using uni exposes its borrows as uni ref T/uni mut T, the same goes for a closure. Unlike regular uni borrows, these borrows allow non-sendable types to be passed/returned, with the exception of uni borrows themselves. Or more precisely, they only enforce the "you can't assign/pass them around" restrictions.

This would be safe because it's never possible to construct a regular mut T or ref T to a type uni T, and uni ref T and uni mut T values can't stick around (e.g. in a local variable).

Because of these restrictions we can allow field assignments, which in turn makes type uni types useful for e.g. iterators (which often assign fields new values), and fn uni closures useful for closures that you just want to call and not borrow multiple times (the latter being quite rare to begin with).

Note that type uni doesn't mean an instance of such a type is automatically uni, because it can still contain/produce borrows to data outside of it. Instead it merely means that you can't borrow it the usual way. To make it uni such that you can e.g. send it to another process, you'd have to combine it with recover as usual.

In case of std.iter.Stream, it means its definition would be as follows:

```inko
type pub uni Stream[T] {
  let @func: fn uni -> Option[T]

  fn pub static new(func: fn uni -> Option[T]) -> Stream[T] {
    Stream(func)
  }
}
```

When calling methods on a type uni, the receiver is passed as a pointer for fn and fn mut methods, and by value for fn move.

This approach does have some downsides:

  • There's some conflict between type inline and type uni in terms of when to use them. In some cases it might be obvious (e.g. you want to "borrow" the data), but in other cases it might not be obvious which of the two you should use
  • It requires the developer to actually think about whether the data is going to be borrowed the usual way or not, i.e. it complicates the development experience
  • It complicates the type checker, which is already rather complicated as-is

The first downside is the biggest one, as I really don't like it when there are multiple competing ways of doing (more or less) the same thing, because it results in a mess. For example, Python famously claims "there should be only one way to do things", only to end up with urllib/urllib2/urllib3 and similar situations.

On the other hand, the benefit of this approach is that the resulting behavior is consistent/predictable: a type uni type always goes on the stack, regardless of how it's passed around. This can result in more predictable performance when used where necessary.

@yorickpeterse

type uni types would come with a potentially annoying restriction: methods defined on such a type would have self exposed as uni ref T or uni mut T. Values of such types can't be captured by closures, as doing so would violate the aliasing rules/restrictions.

We also run into an issue similar to the one described in #787: trait default methods may borrow self, but that wouldn't be sound for type uni types.

I think that in order to solve both, we essentially have to treat self as uni ref T/uni mut T as proposed above (= can return anything, just can't store it). That wouldn't be enough to account for expressions such as self.something() where something() returns a borrow of self, so we'd need something extra for that.

@yorickpeterse

Skimming through some recent papers related to escape analysis (link), it seems that the usual stack allocation percentage hovers around 10% on average, with some cases reaching 20-30%. IIRC this matches findings from papers published in the late 90s/early 2000s. If this holds true across languages (or more specifically, for Inko), then I'm not sure a 10% reduction in heap allocations is worth the compile-time complexity.


yorickpeterse commented Jan 22, 2025

I think that for closures we can just make fn uni the default behavior. I honestly can't think of a scenario where you'd want multiple mutable borrows of a closure (something that isn't even possible in Rust), so producing uni mut fn ... when borrowing a closure should be sufficient.

What we should do is change the capturing logic such that a closure capturing another closure captures it by value, as this new setup wouldn't allow capturing by borrow. This really should be the default anyway, as capturing closures by borrow almost always results in them being dropped prematurely.

One challenge is that given a uni mut fn ..., we can't tell if it's a borrow of a regular closure or a borrow of a uni fn closure. In the case of the latter we'd have to enforce the sendability rules, but for the former that isn't necessary, because we only want to restrict borrowing of the closure itself.

@yorickpeterse

My current thinking is to introduce static ref T and static mut T borrows. These can't be passed around, similar to uni ref T and uni mut T borrows, but they don't impose the sendability restrictions. As with the unique borrows, there's no actual syntax for these, such that you can never specify such a type directly.

@yorickpeterse

There's in fact an area in the standard library where we mutably borrow a closure: std.array.stable_sort is a recursive method and borrows a closure used for the comparison. We'd have to change this to return the closure when it's done such that it can be reused, or find a different approach (e.g. an iterative merge sort).
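A sketch of the "return the closure when it's done" approach, in Rust rather than Inko's actual std.array code: the comparison closure is taken by value and threaded through the recursion, so it's never borrowed:

```rust
use std::cmp::Ordering;

// A stable merge sort that owns its comparator and hands it back, so the
// recursive calls never need to borrow the closure.
fn sort_by<T, F: FnMut(&T, &T) -> Ordering>(mut values: Vec<T>, mut cmp: F) -> (Vec<T>, F) {
    if values.len() <= 1 {
        return (values, cmp);
    }

    let right_half = values.split_off(values.len() / 2);
    let (left, cmp) = sort_by(values, cmp);
    let (right, mut cmp) = sort_by(right_half, cmp);

    // Merge the sorted halves. On a tie we take from the left half first,
    // keeping the sort stable.
    let mut merged = Vec::with_capacity(left.len() + right.len());
    let mut left = left.into_iter();
    let mut right = right.into_iter();
    let mut a = left.next();
    let mut b = right.next();

    loop {
        match (a.take(), b.take()) {
            (Some(x), Some(y)) => {
                if cmp(&x, &y) == Ordering::Greater {
                    merged.push(y);
                    a = Some(x);
                    b = right.next();
                } else {
                    merged.push(x);
                    a = left.next();
                    b = Some(y);
                }
            }
            (Some(x), None) => {
                merged.push(x);
                a = left.next();
            }
            (None, Some(y)) => {
                merged.push(y);
                b = right.next();
            }
            (None, None) => break,
        }
    }
    (merged, cmp)
}
```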

yorickpeterse added a commit that referenced this issue Jan 22, 2025
For more information, refer to
#776.

Changelog: changed
yorickpeterse added a commit that referenced this issue Jan 23, 2025
For more information, refer to
#776.

Changelog: changed
yorickpeterse added a commit that referenced this issue Jan 23, 2025
For more information, refer to
#776.

Changelog: changed
@yorickpeterse

Per #786 (comment) we'll look into stack allocating closures as part of 0.19.0.


yorickpeterse commented Jan 24, 2025

I quickly hacked together a method that checks how many allocations are dropped in the same scope, taking into account any moves of the original register containing the allocation. For the standard library there are 5407 heap allocation sites, but only 150 (2.77%) of those are dropped in the same method. If I increase the inline weight from 100 to 200 this only goes up to 2.86%. If I increase the threshold to 500, we're at 7.46%.

If I force inlining of methods that accept any closures as arguments, the percentage is only 4.17% (423 out of 10156 allocations). If I combine this with only counting closure allocations, the percentage goes up to 15%. Without that forceful inlining the percentage is only 5.91%.

In other words, we'd have to inline much more aggressively for this to have any impact, and even then I doubt it's worth it: stack allocating only 15% of the closures might not justify the effort.

@yorickpeterse

OpenFlow seems to benefit a bit more from this approach: if we only count closure allocations, then without the inlining hack mentioned above, about 16% of the closure allocations can go on the stack. Enabling the closure inlining hack only bumps that up to 19%.

For clogs the results are very similar.

@yorickpeterse

Another way of approaching this: the allocation/drop check hack mentioned above is very basic, and likely far less accurate than proper escape analysis. If such a basic hack can reduce allocations by around 20% in the best case, then perhaps escape analysis can do much better than the usual 8-10% mentioned by the various papers linked in the issue description. For example, if we can stack allocate say 30% of all allocations, that's still pretty good, provided the escape analysis implementation isn't overly complex.

yorickpeterse added a commit that referenced this issue Jan 28, 2025
A unique type is a type that is allocated on the stack and subject to
move semantics (i.e. it's not copied), but severely restricts borrows.
This in turn makes it safe to assign fields new values (unlike with
inline types), as the assignments are always correctly observed (because
they always act on the same reference). Unique types are useful when
creating mutable state machines and streams such as files, sockets,
and iterators.

Similar to borrows of `uni T` _values_, one can use borrows of unique
types as the receiver of a method. For example, if a unique type is
stored in a field it's exposed as a `ref T` or `mut T`, and one can
still call methods on such a borrow. What isn't possible is assigning
such borrows, passing them around as arguments or returning them.

Refer to #776 for more
information.

Changelog: changed
@yorickpeterse

Working on #798 I realized the following: since we have single ownership and move semantics, determining whether values escape should be easier than expected. Given some value V, if V is unconditionally dropped at every return site, then we know for a fact that it doesn't escape the call. If V is merely moved, or dropped only conditionally, we have to assume it might escape. Using this information we can annotate arguments as "escapes" or "doesn't escape" and use that for inter-procedural escape analysis.

For dynamic dispatch we could produce this information for all potential targets, then "union" the results together: if given an argument A and for one or more potential targets A might escape, we treat it as escaping for all targets. Only if it doesn't escape for any of the targets would we consider it as non-escaping.
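A sketch of those per-argument summaries and the union rule (hypothetical compiler types):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Escape {
    No,
    Maybe,
}

struct MethodSummary {
    // One entry per argument, with the receiver at index 0.
    args: Vec<Escape>,
}

// For a dynamic dispatch call site, an argument only counts as non-escaping
// if it doesn't escape for any of the potential targets.
fn dynamic_call_summary(targets: &[MethodSummary]) -> Vec<Escape> {
    let arity = targets[0].args.len();

    (0..arity)
        .map(|index| {
            if targets.iter().any(|t| t.args[index] == Escape::Maybe) {
                Escape::Maybe
            } else {
                Escape::No
            }
        })
        .collect()
}
```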

With that said, we'd likely still be looking at maybe 10-20% fewer allocations in the best case. That's not bad, but I'm not sure it's worth it either.

yorickpeterse modified the milestones: 0.18.0, 0.19.0 on Feb 1, 2025