Life of an async fn
Live captioning by White Coat Captioning
TYLER: Well, in case you haven't been paying any attention, async/await is now stable in Rust. [Applause] This is a huge milestone for the language and, I think, for systems development in general, so very exciting.
Again, my name is Tyler, or tmandry. I work on Fuchsia, which is an experimental operating system at Google. We use Rust for a number of things in the operating system. I work on the toolchain team supporting Rust, which means that sometimes I get to spend time fiddling around in the compiler and I'm also a Rust release team member and compiler team contributor.
So I just want to briefly give a primer on futures before I dive into some of the async stuff. Florian covered some of this, so just to recap: a future is any computation that progresses asynchronously. That could be a request to some external database server, it could be reading and writing on disk, or doing some expensive data processing on another thread, something like that. All of these things can be represented with a future. It eventually produces a result, which is called its output, and we represent futures in Rust with the Future trait. There is an associated type called Output, which is the output of your computation, and then there's a method called poll, which has a kind of scary signature, but the job of the poll method is to make progress on the future, as much progress as possible. poll returns an enum called Poll, and the result is either Pending, which means "I made as much progress as I can and now I'm waiting for some external event", or Ready, which means "I'm completely done and here is the result of my computation".
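For reference, this is the trait and enum as they appear in the standard library (std::future::Future and std::task::Poll, reproduced here):

```rust
use std::pin::Pin;
use std::task::Context;

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

pub enum Poll<T> {
    Ready(T), // completely done; here is the output
    Pending,  // made as much progress as possible; waiting on some external event
}
```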
An important thing to remember is that in Rust we have a lazy futures model: futures don't do anything until they are polled, and it's the job of the executor to do that.
Okay, so I want to talk about async functions in this talk, and I will start with a really simple example of an async function: an async function named handle_request. We take some arguments. The first thing we do is issue a request to an external database server, to get the row from our database corresponding to the ID we were handed. We await the result of that, so we basically suspend execution of this function until we get the row back. Once we get it back, we encode it into JSON, and then finally we write the encoded JSON out onto a TCP stream. Now, if you were coding this synchronously, your function would block your thread at two potential points.
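A sketch of the function described; the Database, TcpStream and RowId types and the get_row, encode and write_all methods are stand-ins reconstructed from the talk. The two awaits are the two potential blocking points:

```rust
// Stand-in types and methods (not a real API): Database::get_row and
// TcpStream::write_all are async, json::encode is synchronous.
async fn handle_request(db: Database, stream: TcpStream, id: RowId) {
    let row = db.get_row(id).await;   // suspend until the database responds
    let encoded = json::encode(&row); // synchronous: no await needed
    stream.write_all(&encoded).await; // suspend while the socket buffer is full
}
```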
One is when you are awaiting the database server, and the other is when you are writing out to the TCP stream, because TCP sockets often have a buffer that can fill up, so you would block your entire thread: at minimum you would pay a context switch, or your thread could be blocked for a really long time. The way this connects to futures, as was mentioned earlier, is that the compiler rewrites the signature of your async function. It could just be a normal function that returns impl Future, which means we are returning some type, we are not going to tell you what type it is, that happens to implement the Future trait. In this case our async function did not have a return type, so the Output is just the unit type (), and then inside of the function body, we just wrap everything with an async block.
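Roughly what that rewrite looks like, using the same hypothetical types as above:

```rust
use std::future::Future;

// The compiler-rewritten signature: a plain fn returning some anonymous
// type that implements Future<Output = ()>.
fn handle_request(db: Database, stream: TcpStream, id: RowId) -> impl Future<Output = ()> {
    async move {
        let row = db.get_row(id).await;
        let encoded = json::encode(&row);
        stream.write_all(&encoded).await;
    }
}
```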
So let's talk about what the compiler actually does when you compile this async function. To do that, I want to walk through how you would implement this by hand in the pre-async/await world. The way you would do that is to create a data structure that represents the state of your request: we are going to call it RequestHandler, and start by moving the arguments of the function into that struct. Then we implement Future for RequestHandler; again, our async function doesn't return anything, so the Output associated type is just the empty tuple. Then we implement the poll method, and the way you implement this by hand is by writing a state machine, which is what we are going to do. We add a state field to our struct which keeps track of what state our request is in, and then, inside the poll method, every time we get polled we match on that state and behave accordingly. The very first state we can be in is that we've never been polled before; this corresponds to the very top of the async function. The next one is that we are getting the row, awaiting the result of the database. The third is that we are writing to our TCP stream. Notice I didn't create a state for encoding the JSON, because that's a synchronous operation; there's no await. Then finally we are completely done and ready to return our result.
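A sketch of that data structure, with the same hypothetical types as before; GetRowFuture and WriteAllFuture are made-up names for the sub-future types:

```rust
struct RequestHandler {
    state: State,
    // The arguments, moved into the struct:
    db: Database,
    stream: TcpStream,
    id: RowId,
    // Storage for sub-futures and intermediate values, filled in as we go:
    row_fut: Option<GetRowFuture>,
    encoded: Option<Vec<u8>>,
    write_fut: Option<WriteAllFuture>,
}

enum State {
    Unpolled,   // never been polled: the top of the async fn
    GettingRow, // awaiting the database
    Writing,    // awaiting the TCP write
    Ready,      // completely done
}
```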
Let's go back. In the case where we've never been polled before, we basically execute the very first line of our function: we call get_row, which returns a future, we store that future in a field called row_fut, and we update our state to GettingRow. Since we have to initialise all the fields up front, each of these fields is an Option: we can say "I don't have it yet" with None, and then when we do have it, set it to Some with the value of the future. Then, when we are polled in the GettingRow state, we are awaiting, so we need to propagate the poll down to the future that we are awaiting. We know there's a future inside of row_fut. If that returns Pending, that means we can't make any more progress, so we simply pass that back up the stack. If we get Poll::Ready, that means the row is ready and we can make progress, so the very first thing we do is drop row_fut, since we don't need it anymore, and then we do the next thing in the function, which is to encode our row into JSON. Then we do the next thing after that, which is to call write_all; again, we store the resulting future in our state machine and update our state to Writing.
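A sketch of those first two match arms, written against `this: &mut Self` projected out of the Pin as in the full skeleton below, and assuming for simplicity that the sub-futures are Unpin (the real generated code uses unsafe pinning internally):

```rust
State::Unpolled => {
    // Execute the first line of the async fn: kick off the database request.
    this.row_fut = Some(this.db.get_row(this.id));
    this.state = State::GettingRow;
}
State::GettingRow => {
    match Pin::new(this.row_fut.as_mut().unwrap()).poll(cx) {
        // No progress possible: pass Pending back up the stack.
        Poll::Pending => return Poll::Pending,
        Poll::Ready(row) => {
            this.row_fut = None; // drop the sub-future, we're done with it
            this.encoded = Some(json::encode(&row));
            // NOTE: handing write_all a reference into `encoded` makes the
            // struct self-referential; as the talk admits below, this does
            // not actually compile in plain safe Rust.
            this.write_fut = Some(this.stream.write_all(this.encoded.as_ref().unwrap()));
            this.state = State::Writing;
        }
    }
}
```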
All right, so the next state is Writing. This one is really simple, basically the same thing we saw before: we are now awaiting write_fut, so we propagate the poll down to it. Again, if it's Pending, we return Poll::Pending. If it's Ready, there's no output, so we immediately update our state to Ready, and that's it. The Ready state is really simple: you simply return Poll::Ready. There's one last thing we need to do, which is to add a loop around the inside of our poll body, and that means that every time poll gets called, we make as much progress as we possibly can until we either hit a Poll::Pending on a sub-future or we are completely done. So we will keep looping until we hit one of those three return statements in the body.
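The remaining arms, and the outer loop that ties the whole poll method together (same Unpin simplification as above):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

impl Future for RequestHandler {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut(); // fine if RequestHandler: Unpin
        loop {
            match this.state {
                State::Unpolled => { /* ...as sketched above... */ }
                State::GettingRow => { /* ...as sketched above... */ }
                State::Writing => {
                    match Pin::new(this.write_fut.as_mut().unwrap()).poll(cx) {
                        Poll::Pending => return Poll::Pending,
                        Poll::Ready(()) => this.state = State::Ready,
                    }
                }
                State::Ready => return Poll::Ready(()),
            }
        }
    }
}
```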
You notice there's quite a bit more code on the right side than on the left, and it's not as readable. That's one major benefit of async/await, of course. In fact, we cheated and made the code a little bit more readable than it could be in real life, because we are actually borrowing one of our struct members: the write future holds a reference to encoded, so when we call write_all we are handing it a reference into our own struct, and you can't actually write that in normal Rust. You would have to introduce a layer of indirection with a smart pointer or something, and it ends up being really not fun. This is an issue that comes up a lot in async code. Thankfully, with async/await all of that is taken care of for you, and it's more efficient: the compiler knows you are borrowing from yourself, and thanks to some abstractions (namely Pin) it can guarantee that's safe.
Another problem with our implementation that you might notice is that we are wasting some space. While we are in the middle of getting the row, we don't have any need for write_fut or the encoded bytes, and then in the next stage, when we are writing to the TCP stream, we don't need row_fut anymore. So it would be really nice if we didn't have to allocate space for all of these at once, and could just re-use the bytes. If we look at the way the struct is laid out in memory right now, you have all of the fields one after another starting at offset zero. What we would really like to do is re-use row_fut's bytes for write_fut in the next stage, when we are awaiting the write.
As recently as a few months ago the compiler didn't do this, and this problem compounds really quickly. If we assume for simplicity's sake that the size of our future is dominated by the size of its sub-futures, then we are roughly twice as big as we need to be by not re-using those bytes. And that's a problem, but it actually gets worse, because row_fut and write_fut might themselves be written as async functions, and those each have the same problem with the sub-futures they are awaiting, so those are each twice as big as they need to be. So really, the future we are writing isn't twice as big as it needs to be, it's four times as big, and if it gets awaited by some other future, maybe that's eight times as big. This really did happen in practice: in Fuchsia we had futures that took up half a megabyte on the stack, which is really big, and we had a lot of futures that actually blew the stack. You can overflow the stack in Rust; it's not undefined behaviour, but you can do it.
What the compiler wants to do is ask the question: which of these fields can I overlap? The way it does that is by looking at your programme. I will zoom in here so you can read some things. This is a graphical representation of what's called MIR, the mid-level IR inside the Rust compiler. We represent the entire program as a control flow graph made up of what are called basic blocks. On the left side, in the middle, you will see basic block number 14, and inside of that block there's a bunch of statements. These are very simple statements that can only do things like: allocate storage on the stack for a local variable, access one member of a variable, reference something, or assign something. There's no control flow inside of the basic blocks; all the control flow happens between the basic blocks, along these edges here. So at the end of block 14 we drop something, and drop is an operation which can panic, so there are two options: either the drop returns and we go down the happy path, which is normally what you think about, down to basic block 23, or we panic and unwind into basic block 7 to do all the clean-up. So what the compiler does is crawl this entire control flow graph and work out which variables are "storage live" at the same time; StorageLive is how we represent that we actually need storage for a variable. We are not actually going to do that here, because we don't have three hours for this talk, so instead we are just going to squint at the source and see if we can figure anything out. You can do this at home; it works pretty well.
So the first question we are going to ask is: can I overlap row_fut with encoded? The rule of thumb here is: if the last time you ever use field A happens before the first time you ever use field B, then you can definitely overlap them. In this case, the last time we use row_fut is when we set it equal to None, and the following line is the first time we ever use encoded. You can reason about this by looking at the source and saying: okay, these are definitely never needed at the same time. So yes, we can overlap these in memory. When I say overlap, I just mean re-use the bytes, right?
Okay, so what about row_fut with write_fut? Again, the last time we ever use row_fut is setting it equal to None, and only later do we use write_fut, so we can definitely overlap these two. Then the last two fields we care about are write_fut and encoded. In this case we assign to encoded, and then we assign to write_fut while handing it a reference to the encoded bytes, which means we definitely need both at once: write_fut contains a reference to encoded, so we can't overlap these in memory. So this analysis allows us to overlap row_fut's bytes with write_fut's.
What this ends up looking like is not so much a struct but more like an enum, where you have one variant per state your future could be in, and inside each variant you have all the fields you need for that particular state. In general, in this table we have all of the enum variants, so we could be in the Unpolled state, Suspend0, Suspend1, Suspend2, etc., and all of our local fields are on the left here. The only difference between a regular Rust enum and this is that one field can live in multiple variants. In this example, we need a local `a` starting in the Suspend0 variant but also in the Suspend1 variant, so you can have one local that persists across multiple await points. That comes up quite a lot.
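Conceptually, the generated layout for our example is something like the enum below. This is an illustration only, not what the compiler emits; a plain Rust enum cannot share one field's bytes across variants the way the real layout does:

```rust
// Hypothetical picture of the layout, using the same stand-in types.
enum RequestHandlerLayout {
    Unpolled { db: Database, stream: TcpStream, id: RowId },
    GettingRow { stream: TcpStream, row_fut: GetRowFuture },
    // write_fut re-uses the bytes row_fut occupied; encoded cannot
    // overlap with write_fut because write_fut borrows from it.
    Writing { encoded: Vec<u8>, write_fut: WriteAllFuture },
    Ready,
}
```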
So this is all very cool, but what are the performance implications? Let's look at some patterns that you might not realise have subtle performance issues. In this example we have a function called do_stuff. We take a context, which is just a smart pointer to some context somewhere, and what we are going to do is log a debug message using some of that context. There is a subtle performance issue with this code. Can anyone spot it? Nod if you think you see it. Not seeing any nodding, okay. So what can happen is: the end of every scope is where we drop our local variables, right? The compiler is going to insert an implicit drop at the very end of our function, but we don't actually need the context past the beginning of the function; we just log at the very start. So we are actually hanging on to this context all the way through awaiting that future and then finishing the task, which is kind of inefficient. If we have the only reference to the context, then we are keeping that memory alive when we don't need to be. One way to mitigate this is to move the context into an inner scope in the function, and the effect of that is to move the drop to the end of the inner scope, before the await. This is quite a bit better: we are not hanging on to the context while awaiting some other future, and that future could take a year to complete, we have no idea. But it's still not perfect. The reason is: remember, futures do nothing until they are polled, and in the case of an async function that includes all of the lines at the top of the function. None of those lines run until the first time poll is called, which means we are still hanging on to the context while the future sits in its unresumed state. So the "perfect" way of dealing with this is to desugar our async function into a regular function which returns impl Future, like we saw earlier; the compiler does this for you, but you can do it yourself. At the very top of that function we do our log, and then we write an async block, which drops us into an async context where we can await anything we want, and that block is the future we actually return. Because we don't reference the context inside of that async block at all, we never have to save it inside of our state machine, so we aren't allocating any bytes to store the smart pointer, and we are never hanging on to that reference inside of the future. Once do_stuff is called, we immediately log, then return a future that can make progress later.
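A sketch of the three versions; Context, its request_id field, some_future and finish_task are stand-ins reconstructed from the description:

```rust
use std::future::Future;
use std::sync::Arc;

struct Context { request_id: u32 } // hypothetical
async fn some_future() {}          // stand-in for any awaited future
fn finish_task() {}                // stand-in

// Version 1 (subtle issue): `context` is implicitly dropped at the end of
// the whole function body, so it stays alive across the await.
async fn do_stuff(context: Arc<Context>) {
    println!("starting request {}", context.request_id);
    some_future().await;
    finish_task();
}

// Version 2 (better): moving `context` into an inner scope drops it before
// the await, but it is still captured in the future's unresumed state.
async fn do_stuff_v2(context: Arc<Context>) {
    {
        let context = context; // moved into the inner scope
        println!("starting request {}", context.request_id);
    } // `context` dropped here, before the await
    some_future().await;
    finish_task();
}

// Version 3 ("perfect"): log eagerly when do_stuff is called; the returned
// future never captures `context` at all.
fn do_stuff_v3(context: Arc<Context>) -> impl Future<Output = ()> {
    println!("starting request {}", context.request_id);
    async {
        some_future().await;
        finish_task();
    }
} // `context` dropped when this fn returns; no bytes for it in the state machine
```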
Another pattern to look out for is when you are awaiting expressions that have really big, complicated temporaries. In this example I have a struct Big, which has an array in it that is a kilobyte in length, a pretty big type. We implement Drop for Big and do something silly like print a message when it is called. Then I have an async function foo from usize to usize; it doesn't really matter what it does. Inside of my async function bar, we construct a Big simply to get the length of the array out, and then call foo with that length. So this is semantically equivalent to calling foo with 1024; let's say we don't want to repeat ourselves by hard-coding the number, so we construct one on the stack and get the length out of it. Normally the compiler could optimise this away and it would be fine. But what happens when we try to print the size of the state machine that is returned? We get something really big back: our state machine is over a kilobyte in size. That's because it is embedding a copy of Big inside of the state machine. The reason for this is that Rust has well-defined semantics for when destructors are run, and any time you create a temporary inside of a statement, the destructor for that temporary is called at the very end of the statement. Basically, if you have a temporary expression inside of a larger statement, fast forward to the next semicolon you see, and that's when the destructor will be run for that temporary, which in this case is not good, because that's after our await, right?
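A sketch of the example, reconstructed from the description:

```rust
struct Big {
    arr: [u8; 1024],
}

impl Drop for Big {
    fn drop(&mut self) {
        println!("dropping a Big"); // something silly, as described
    }
}

async fn foo(n: usize) -> usize {
    n // the body doesn't really matter
}

async fn bar() {
    // The `Big` temporary's destructor runs at the semicolon, which is
    // after the await, so a copy of Big is embedded in the state machine.
    foo(Big { arr: [0; 1024] }.arr.len()).await;
}

fn main() {
    // Prints something well over a kilobyte.
    println!("{}", std::mem::size_of_val(&bar()));
}
```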
So this is something to look out for, and it's pretty easy to mitigate. In this case we could simply get the length inside of one statement; then there's a semicolon between that statement and the await, and that would be fine. We could also write it by constructing the entire foo future inside of one statement and then awaiting it on the next line; it doesn't really matter. In either case, when we print the size of our state machine, it's going to be much, much smaller, because we are not embedding a copy of Big. You might say: okay, that seems like a problem, but if the temporary doesn't have a destructor, in other words it doesn't implement Drop, we should be able to optimise that away; we don't need to hang on to this copy of Big in between, because we never need to call the destructor on it. And you would be right, except that unfortunately we don't do that optimisation today. That's something we would want to do in the future. So for now, just always look out for big, complicated temporaries. This only comes up occasionally in practice, but it's something to be aware of.
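The two mitigations described, continuing the sketch above:

```rust
// Mitigation 1: the Big temporary dies at the first semicolon, before the await.
async fn bar_v2() {
    let len = Big { arr: [0; 1024] }.arr.len();
    foo(len).await;
}

// Mitigation 2: construct the whole foo future in one statement, then
// await it on the next line; Big is dropped at the first semicolon here too.
async fn bar_v3() {
    let fut = foo(Big { arr: [0; 1024] }.arr.len());
    fut.await;
}
```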
So, should I use async/await? Yes, definitely use it. It's now stable, but it's a "Minimum Viable Product", and that means we are taking advantage of the Rust release process, where we release a new compiler every six weeks, to iterate on the language and the compiler implementation. All of the state machines that get produced should be as good as or better than hand-rolled ones in most cases, with a few notable exceptions, and that's going to improve.
I got to talk to some of the people writing the C++ coroutines proposal for C++20, which was really interesting; coroutines are C++'s version of async/await, and there are a couple of differences between theirs and ours. One major difference between Rust and the current C++ proposal is that in Rust, any time we allocate anything on the heap, that allocation is always explicit. If you await another future, all of the state for that future gets rolled up into your state machine, your data structure, and we don't allocate it on the heap unless you create a Box and put the future in there. In the C++ proposal, there's an implicit heap allocation every time you await a future; in many cases the compiler can optimise that away, but there are no guarantees. It's possible to move to a new compiler, have an optimisation fail, and now you have a new heap allocation in your C++ programme.
Rust also has fewer language-level extension points. One consequence of releasing a new standard only once every three years is that you tend to shove everything in at once; this works for C++, but in the case of Rust we kind of have a minimal surface area. The benefit of that is that you have to do less work keeping code performant, and you don't pay for the stuff you don't need. We also have less legacy. Async/await is not just for web servers: there are a lot of things you could use it for, and I think there are really a lot of interesting applications out there. One is high-performance compute. Another is filesystems, since any time you are writing to or reading from disk, that's a whole lot slower than your CPU, so I think there are some interesting applications there. Device drivers might be an interesting thing; I'm working on an operating system, so that's what I'm thinking about. Build systems: any time you are writing a build system, you are basically starting a compile step that can take a very long time, so there might be some interesting applications for expressing build systems and build rules. And UI is another one: you can imagine showing a pop-up to the user and then awaiting their response, things like that. It's really just a paradigm and a way of writing code that allows you to express what you want at a really high level.
If any of this interested you, in terms of the compiler or just getting involved in the Rust project in general, I highly encourage you to contribute. There are lots of ways to contribute. Probably the simplest is: if you hit an issue, whether it's a compiler bug or a confusing error message, it's a great idea to file an issue for it. There are people who do nothing except look for confusing messages in the compiler and fix them, and they are awesome, so definitely report things. If you feel like picking up an issue to tackle, there are a lot of E-easy and E-mentor tags in the Rust repo, so you can look for those, and take advantage of the community. One of the things I love about Rust is that it has a really strong community, and people are always happy to support you. There are a few different places you can go to post and ask questions; there's me, you can find me on Twitter, GitHub, Gmail; and there are also working groups that you can join if you want more of an ongoing contribution. Here is a shortlist of some working groups in the compiler team, but there are a lot of other working groups and other teams; you can find a list online. And that's everything. So, do we have time for questions? [Applause]
AUDIENCE: You mentioned that Fuchsia's use of the generators prior to those optimisations being implemented was about, what, half a megabyte on the stack or more? After the optimisations, can you share how much it shrank by?

TYLER: Yes, that's a good question. I do not remember that particular one. I think we shrunk it by over 80%. Some of the futures that we had didn't shrink at all, but some shrunk by 80 or 90% in size, so it just depended how deep the stack of futures being awaited was.

AUDIENCE: The desugaring showed that there's a loop and a match, so that even if you go from one basic block to the next, it would jump to the top of the loop and re-match again.

TYLER: Yes.

AUDIENCE: Is that how it actually compiles, or does it optimise it so that if you don't suspend the execution, it just falls through to the next basic block without jumping up to the top, matching there, and jumping back down?

TYLER: So, right, yes, it does actually have a loop in there, but any time you have an await, that is actually where the loop is introduced. So any time you await some sub-future, we basically poll that future in a loop. What I showed is semantically equivalent, but I put the loop around the body of the entire poll function, essentially the same thing; I'm just not embedding a loop at every single await point, because then it wouldn't fit on the slide.

AUDIENCE: So async/await has been in the works for four years, right, or something like that?

TYLER: It has been a while.

AUDIENCE: Now it's stable. Are you seeing anything that might show us that that was a mistake? Should we have waited a little bit more? No pun intended.

TYLER: Ooh. Uh oh. So, did we make any mistakes? I think time will tell. I'm really happy that we took the time, even though a lot of people were understandably upset by the wait. I don't think there are any red flags or things missing. Some of the performance things I covered, like the fact that we hang on to a temporary until the end of a statement even if there's an await, that was a change introduced a couple of months before stabilisation. I think that makes sense for the consistency of the language, because that's how the language works outside of an async context, but it's possible we could have defined the language differently and silently changed the semantics so that you don't hold on to temporaries across an await. Time will tell.

AUDIENCE: Hi, thank you for the talk, it was really interesting.

TYLER: Thank you.

AUDIENCE: I'm curious about the state machine. Can you ask the compiler to emit, not the Rust source, but the MIR for the state machine it generates?

TYLER: Yes, you can. It's a nightly-only flag, so you have to download the nightly compiler, and it's not super user-friendly. So yes, this diagram that I showed in the slides was actually generated by that; it emits a Graphviz dot file and you can render it. There are a couple of different modes: there's --emit=mir, and then a -Z flag I can't remember, but message me and I will send it to you. There's really nice documentation called the rustc guide for people working on the compiler; you can go and find it there, even if you just want to see your own state machine.

AUDIENCE: Thank you.

HOST: Give a big hand to Tyler. [Applause]