### Compiled code interop

The interpreter is expected to run in combination with AOT-compiled code, so we will need an efficient mechanism for entering/exiting the interpreter. Even in an interpreter-only mode, we would still need these transitions for pinvokes and reverse pinvokes. These transitions also pose a few challenges: on iOS we won't be able to dynamically generate thunks, while on Wasm the call signature must be embedded into the emitted Wasm code, meaning we can't reuse a generic thunk for calling methods with different signatures.

The interpreter operates on a separate stack space that it maintains. Every local variable, including the arguments, resides in this space. When an interpreter method starts executing, it expects the arguments to be present one after the other at the location pointed to by the stack pointer. This means that for a call exiting the interpreter, with a certain signature, we need to call a thunk that receives at least the location of the parameters and the target address, moves each argument from the interpreter stack to the corresponding register/native stack location and then makes a native call to the target address. Once the call returns, it should move the return value from the registers/stack back to the interpreter stack. If we need to pass to native/compiled code a code pointer that can be used to enter the interpreter, we need to create a thunk that moves all arguments from the native registers/stack to a separate memory location. This thunk should also have embedded a pointer that identifies the interpreter method that we need to execute. The thunk should then pass the memory location where the arguments have been copied, together with the interpreter method to execute, to the interpreter entry code, which can set up a new interpreter frame and begin executing the method.
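
A minimal sketch of the two transition shapes this implies, with illustrative names that are not actual runtime APIs:

```
typedef struct InterpMethod InterpMethod; /* identifies a method to interpret */

/* Interpreter exit: one thunk per native signature. It loads each argument
   from the interpreter stack, performs the native call and writes the
   return value back to the interpreter stack. */
typedef void (*InterpExitThunk) (void *target, void *interp_args);

/* Interpreter entry: a thunk callable with the native calling convention.
   It embeds the method to execute and forwards to a translator that spills
   the native register/stack arguments into a buffer before interpreting. */
typedef struct {
    InterpMethod *method;
    void (*translate) (InterpMethod *method, void *spilled_args);
} InterpEntryThunkData;
```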

### Interop with compiled code on native architectures (e.g. arm64)

##### Interpreter exit

Let's assume the interpreter needs to either do a pinvoke call or a call to a method that is present in an AOT-compiled image. At the call site, in interpreter code, we know the exact signature of the call as well as the native code pointer that needs to be called. We could have two different invocation paths: either a per-signature specialized path or a generic path.

1. Specialized path

   A specialized path means that we will have a thunk specialized per signature that can be used to call any method with that signature. The basic set of signatures that are needed by an application is easy to compute. We will require a separate thunk for every pinvoke signature as well as for every signature of a method that is AOT compiled, because each AOT-compiled method can end up being called from the interpreter. While these thunks could be emitted directly in assembly code, it makes more sense to compile them as IL wrappers and emit them together with the rest of the managed code when AOT compiling an assembly. Each one of these wrappers will receive as arguments the native pointer of the method to call and the address on the interpreter stack where the arguments reside. The wrapper will be able to compute the base address of every single argument and load the argument value. It will then execute the native call and write the result back to the interpreter stack, in place of the arguments.
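
   As an illustration, here is a C approximation of what such a wrapper does for the `int (int, int, struct {int, int})` signature used later in this section; the real wrapper would be an IL wrapper compiled ahead of time, and all names and stack offsets here are assumptions:

   ```
   typedef struct { int a, b; } Pair;
   typedef int (*Target_fn) (int, int, Pair);

   /* per-signature exit wrapper: interp_args points at the first argument
      slot on the interpreter stack; slots are laid out one after another */
   void exit_wrapper (void *target, char *interp_args)
   {
       int  a0 = *(int *) (interp_args + 0);
       int  a1 = *(int *) (interp_args + 8);
       Pair a2 = *(Pair *)(interp_args + 16);

       /* the compiler of the wrapper emits the actual calling convention */
       int ret = ((Target_fn)target) (a0, a1, a2);

       /* the return value overwrites the arguments on the interpreter stack */
       *(int *)interp_args = ret;
   }
   ```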

1. Generic path

   The purpose of the generic path is to implement interpreter exit calls with a signature that is not known at compile time. This is probably not a common scenario, but there are cases where one could generate IL code at runtime that does a pinvoke call through a pointer with a dynamically generated signature that is not known at compile time. In this situation we need a unique thunk, or at least a limited number of thunks, that can handle any type of signature.

   A possible solution for the generic path would be to use a lower-level interpreter (let's call it the transition interpreter) that handles the calling convention and is based on hand-written assembly thunks. In the case of interpreter exit, the entry thunk for the transition interpreter would receive as arguments the compiled code that it needs to call, the pointer to the interpreter stack where the arguments reside and an additional pointer to the opcode stream that it needs to execute.

   Let's say we need to call a method with a signature like `native_method (int, int, struct {int, int})`. Assume that the first two arguments need to be stored in registers R0 and R1, half of the struct in R2 and the other half in the first slot on the stack. The code that the transition wrapper would need to run would be something like:

   ```
   SAVE_REG_CONTEXT // if we need to unwind and resume back to the interpreter
   MOV_ISTACK_TO_R0 (off 0) // move arg0 that is on the interp stack at off 0
   MOV_ISTACK_TO_R1 (off 8)
   MOV_ISTACK_TO_R2 (off 16)
   MOV_ISTACK_TO_STACK (off 20, off 0) // move from interp stack to the native param area on the real stack
   CALL // dispatch to compiled code
   MOV_R0_TO_ISTACK (off 0) // assuming the return is in R0, store it back on the interpreter stack at offset 0
   RET
   ```

   All these opcodes would have fixed implementations in assembly; their implementation, as well as the dispatching between opcodes, could use scratch registers so they don't clobber the registers that have already been loaded with arguments for the native call. We would only need to implement a handful of such thunks and the execution speed would be very fast. We could implement transition code for any signature at runtime by generating a list of such opcodes that is later interpreted when doing the call. This would also allow other low-level operations that might be necessary, like saving the register context so EH can resume in the interpreter.
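
   For illustration, a hypothetical encoding of the transition program for the example signature above; the opcode names mirror the pseudo-code listing, while the encoding format itself is an assumption:

   ```
   /* hypothetical opcode set, mirroring the listing above */
   enum TransitionOp {
       OP_SAVE_REG_CONTEXT,
       OP_MOV_ISTACK_TO_R0, OP_MOV_ISTACK_TO_R1, OP_MOV_ISTACK_TO_R2,
       OP_MOV_ISTACK_TO_STACK,
       OP_CALL,
       OP_MOV_R0_TO_ISTACK,
       OP_RET
   };

   /* one opcode followed by its offset operands, generated once per
      dynamic signature and then interpreted by the assembly thunks */
   static const short transition_program[] = {
       OP_SAVE_REG_CONTEXT,
       OP_MOV_ISTACK_TO_R0, /* interp stack offset */ 0,
       OP_MOV_ISTACK_TO_R1, 8,
       OP_MOV_ISTACK_TO_R2, 16,
       OP_MOV_ISTACK_TO_STACK, /* interp off */ 20, /* native stack off */ 0,
       OP_CALL,
       OP_MOV_R0_TO_ISTACK, 0,
       OP_RET
   };
   ```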

   Another approach to this problem would be to have a single assembly thunk that receives the number of general purpose registers, the number of stack slots and the number of floating point registers that it needs to pass. When compiling an interpreter method that has to do an interop call for which no fast wrapper is found compiled in the image, the offset allocator for vars would need to ensure that the arguments of the method end up in the correct order. I think the only change in the order would come from floating point arguments, since they would need to be moved to the end of the stack. While this could be a relatively simple approach on paper, I'm not completely sold on the idea because it would move some of the complexity of the calling convention over to the var offset allocator, which will already be fairly complex. Also, if it turns out that this simple approach does not map perfectly onto the native calling conventions that we need to support on a certain architecture, additional logic would have to be inserted, making the solution ugly and possibly not feasible.
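
   The interface of such a thunk could look roughly like the declaration below, assuming the var offset allocator has laid the arguments out as integer args, then stack args, then floating point args; the name and parameters are illustrative:

   ```
   /* single generic exit thunk, parameterized by slot counts: it fills
      R0..Rn from the first n_int_slots, copies n_stack_slots to the
      outgoing native stack area and fills D0..Dn from the fp slots */
   void generic_exit_thunk (void *target,
                            void *interp_args,
                            int   n_int_slots,
                            int   n_stack_slots,
                            int   n_fp_slots);
   ```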

##### Interpreter entry

In scenarios where we need to pass a function pointer from the interpreter to a pinvoke (an `UnmanagedCallersOnly` method), or in cases where compiled code needs to invoke a method that wasn't AOT compiled, the runtime will have to dynamically generate a thunk that enters execution in the interpreter. Given that this function pointer needs to be callable from compiled code with the same signature as the original method, the thunk will have to embed at least the interpreter method pointer that identifies the method to be interpreted and an additional function pointer that will be called to do the actual argument translation and call into the interpreter. The runtime could maintain pairs of pages for the thunks, one for code and one for data, that can be remapped as many times as necessary whenever a new thunk needs to be generated. Similarly to the interpreter exit case, we could have two approaches for the argument translation, a specialized one and a generic one.
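
A sketch of how the paired code/data pages could be managed; the stub layout and all names are assumptions:

```
typedef struct InterpMethod InterpMethod;

#define STUB_SIZE 16 /* size of one code stub, illustrative */

/* the code page is a prebuilt template of identical stubs; stub i locates
   its data record at a fixed delta from its own address, loads the method
   pointer into the designated register and jumps to the translator */
typedef struct {
    InterpMethod *method;     /* embedded per-thunk data           */
    void         *translator; /* specialized or generic translator */
} EntryThunkData;

/* reserve the next (code, data) slot from the current page pair */
void *reserve_entry_thunk (EntryThunkData *data_page, void *code_page,
                           int index, InterpMethod *method, void *translator)
{
    data_page [index].method = method;
    data_page [index].translator = translator;
    return (char *)code_page + index * STUB_SIZE;
}
```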

1. Specialized path

   We will generate an IL wrapper for every signature that needs to be handled. The wrapper will receive the arguments of the call in the native registers/stack, according to the native calling convention, plus the interpreter method pointer in a special register that the wrapper would need to be able to access via special IL. The wrapper should then obtain the interpreter stack pointer of the current thread from a TLS variable and proceed to write every single argument to this location. Since the wrapper is AOT compiled, the calling convention details are handled by the compiler by design. The wrapper will then dispatch to an interpreter entry method, written in C++, that just needs to set up a new interpreter frame and begin execution. As the method finishes execution, the return value will be at the top of the interpreter stack. The compiled wrapper will then load this value from the interpreter stack and return it normally.
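
   A C approximation of what this wrapper does for an `int (int, double)` signature; the real wrapper is IL, `get_interp_thread_stack` and `interp_entry` are hypothetical names, and the method pointer arriving in a special register cannot be expressed in C, so it appears here as a regular first parameter:

   ```
   typedef struct InterpMethod InterpMethod;

   extern char *get_interp_thread_stack (void);                  /* from TLS */
   extern void  interp_entry (InterpMethod *method, void *args); /* C++ side */

   int entry_wrapper (InterpMethod *method, int a0, double a1)
   {
       char *istack = get_interp_thread_stack ();
       /* spill the native arguments one after the other */
       *(int *)   (istack + 0) = a0;
       *(double *)(istack + 8) = a1;
       /* sets up a new interpreter frame and executes the method */
       interp_entry (method, istack);
       /* the return value is left at the top of the interpreter stack */
       return *(int *)istack;
   }
   ```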

1. Generic path

   The generic path is very important for the interpreter entry scenario because this entry transition is required for methods that are interpreted, which by definition are methods that are not really known/handled at AOT compile time. This means it would be common not to know the signature needed for the method entry, so we can't AOT compile the necessary wrapper in advance. We could follow an approach similar to the interpreter exit generic path.

   The thunk embedding the interpreter method pointer would instead call into the transition interpreter thunks, passing, as before, the interpreter method pointer in a special register. The starting thunk will first store the pointer to the transition interpreter opcodes into a scratch register (the pointer would be obtained from the interpreter method data), then it will obtain the interpreter stack pointer and start executing each instruction, moving values from the native registers/stack to the interpreter stack according to the opcodes generated for the signature of the method. Once the arguments are moved, it will call into C++ where the actual method execution can start with the values on the interpreter stack.
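
   Continuing the earlier encoding sketch, the entry-side program for an `int (int, double)` signature could look like this; the fp-register and interpreter-entry opcodes are hypothetical additions to the opcode set:

   ```
   /* hypothetical entry-side opcodes */
   enum EntryTransitionOp { OP_MOV_R0_TO_ISTACK, OP_MOV_D0_TO_ISTACK, OP_CALL_INTERP };

   /* reverse of the exit case: spill the incoming native registers to the
      interpreter stack, then hand off to the C++ interpreter entry */
   static const short entry_program[] = {
       OP_MOV_R0_TO_ISTACK, 0, /* first integer arg goes to interp stack off 0 */
       OP_MOV_D0_TO_ISTACK, 8, /* first fp arg goes to interp stack off 8      */
       OP_CALL_INTERP          /* set up the frame and start interpreting      */
   };
   ```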

### Interop with compiled code on Wasm

In addition to other Wasm limitations, the design might as well assume that dynamically generating code thunks is impossible, given this constraint will be present on WASI. The Wasm architecture brings two main differences. One is that it is impossible to have a generic thunk, because every call in Wasm has an explicit signature embedded in the code, so we can't reuse the same thunk as an entry/exit point for methods with different signatures. The second difference is that Wasm doesn't have valuetypes, which simplifies the signatures that we need to support. While on a native architecture we would need to account for a valuetype's size, for it being passed partly in registers and partly on the stack and so on, on Wasm every valuetype is passed simply as an int32 offset into Wasm memory. This greatly reduces the number of signatures that we need to support.
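
To make the reduction concrete, here is a sketch of the lowering, using a hypothetical helper; the rules follow the description above:

```
typedef enum { WASM_I32, WASM_I64, WASM_F32, WASM_F64 } WasmType;

/* every managed parameter lowers to one of four Wasm value types, with
   all valuetypes degenerating to an i32 offset into linear memory */
WasmType lower_to_wasm_type (int is_valuetype, int is_fp, int size)
{
    if (is_valuetype)
        return WASM_I32; /* passed as a memory offset */
    if (is_fp)
        return size == 4 ? WASM_F32 : WASM_F64;
    return size <= 4 ? WASM_I32 : WASM_I64;
}
```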

##### Interpreter exit

In order to support the pinvoke or compiled code call paths, we could use the same approach as on a native architecture. The only difference is that we won't be able to have a generic path, but for this transition that should rarely be a problem since we have a clear picture of the code we would need to invoke, and signatures are typically reused. When the application is AOT compiled, we will include a compiled wrapper for every signature of a compiled method as well as for every pinvoke signature. The wrapper will receive the target pointer to call and the address on the interpreter stack where the arguments are present. On Mono these wrappers are C code that is dynamically generated at app compilation time. I think it makes more sense to include them as compiled IL wrappers, which should allow for code reuse with the native architecture approach and for the invocation path to be as fast as possible.
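
The shape of such a per-signature exit wrapper, in the spirit of the C wrappers Mono generates today; the name and the `int (int, int)` example signature are illustrative:

```
typedef int (*Fn_ii_i) (int, int);

/* each Wasm wrapper must be a distinct function because the callee
   signature is embedded in the emitted call instruction */
void wasm_exit_wrapper_ii_i (void *target, char *interp_args)
{
    int res = ((Fn_ii_i)target) (*(int *)(interp_args + 0),
                                 *(int *)(interp_args + 8));
    *(int *)interp_args = res; /* return value overwrites the args */
}
```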

##### Interpreter entry

Given that we can't dynamically generate thunks that can be invoked, in order to produce an interpreter entry point we could reuse the fat pointer functionality that is already used with native AOT. Fat pointers, when a designated bit is set, point to additional data rather than being actual function pointers. Instead of calling the pointer directly, the calling code checks the most significant bit. If it is set, it dereferences the pointer and obtains the real function pointer together with the additional argument that is passed to the call. For the purpose of entering the interpreter, we would generate a fat pointer whose target is a compiled IL wrapper for the signature in question, together with the interpreter method pointer to be passed along. The wrapper will obtain the pointer to the interpreter stack, move all arguments there and call into the C++ interpreter path, passing the method and the address on the stack where the arguments have been written. When AOT compiling an assembly, we would need to consider every single call as a potential entry into the interpreter and, where we deem it possible, generate an interpreter entry wrapper for the call signature in question.
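
A sketch of the calling sequence at an indirect call site; the bit choice and data layout follow the description above, but the concrete encoding is an assumption:

```
#include <stdint.h>

typedef struct InterpMethod InterpMethod;

#define FAT_BIT ((uintptr_t)1 << (sizeof (uintptr_t) * 8 - 1))

typedef struct {
    void         *wrapper; /* compiled IL entry wrapper for this signature */
    InterpMethod *method;  /* extra argument passed to the wrapper         */
} FatPointerData;

/* indirect call of an int (int, int) function pointer that may be fat */
int call_indirect_ii (void *fnptr, int a0, int a1)
{
    if ((uintptr_t)fnptr & FAT_BIT) {
        FatPointerData *fat = (FatPointerData *)((uintptr_t)fnptr & ~FAT_BIT);
        return ((int (*)(InterpMethod *, int, int))fat->wrapper) (fat->method, a0, a1);
    }
    return ((int (*)(int, int))fnptr) (a0, a1);
}
```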

Native code has no knowledge of fat pointers, so we will need to explicitly generate small thunks for every single `UnmanagedCallersOnly` method. On Mono, during app compilation, a build task scans the assembly for all `UnmanagedCallersOnly` methods and dynamically generates a separate C method for each one of them. Each such method has its own data that is later initialized with the function pointer of the calling convention translation wrapper, together with the interpreter method pointer argument. Rather than having this logic in special build tasks that dynamically generate C code, it might make more sense to simply generate a special direct call wrapper for `UnmanagedCallersOnly` methods that are not AOT compiled into the Wasm image (this might represent a scenario that is just nice to have, for interp-only, but not really mandatory since we can choose to always AOT compile these methods).
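
The shape of such a generated per-method thunk, in the spirit of what Mono's build task emits; every name here is illustrative:

```
typedef struct InterpMethod InterpMethod;
typedef int (*TranslationWrapper_ii) (InterpMethod *method, int a0, int a1);

/* filled in at startup once the interpreter method is resolved */
static TranslationWrapper_ii my_export_wrapper;
static InterpMethod         *my_export_method;

/* the symbol native code actually calls for this UnmanagedCallersOnly
   method; it forwards to the cconv translation wrapper together with
   the embedded interpreter method pointer */
int MyExport (int a0, int a1)
{
    return my_export_wrapper (my_export_method, a0, a1);
}
```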

Given that on Wasm we might encounter situations where we don't have the necessary wrapper for a certain signature, we need a fallback approach. While in the browser we could rely on dynamic generation of Wasm code, on WASI we have no alternative. For the rare cases where users run into such scenarios, I think a simple approach would be for the runtime to report the missing signature when crashing and instruct the user to specify these signatures in a separate file that can then be consumed by the Wasm application build, so that the additional wrappers are compiled.