
Explore Wasmtime as an alternative WebAssembly runtime #458


Open
ia0 opened this issue May 7, 2024 · 11 comments
Labels
for:usability Improves users (and maintainers) life needs:design Needs design to make progress

Comments

@ia0
Member

ia0 commented May 7, 2024

Now that Wasmtime has no-std support, it becomes a possible alternative for the platform Wasm runtime. This task tracks the feasibility of using Wasmtime, since many roadblocks are expected (page size, memory and binary footprint, supported target architectures, releasing control flow, etc.).

In particular, we should try to use Pulley.

Action items:

@ia0 ia0 added needs:design Needs design to make progress for:usability Improves users (and maintainers) life labels May 7, 2024
@ia0
Member Author

ia0 commented Feb 18, 2025

There have been recent developments on bytecodealliance/wasmtime#7311. I tried to use Pulley on Nordic with the wasm-bench crate (see #753). It seems the generated Pulley bytecode is 34 times larger than the Wasm bytecode (it's an ELF file). Besides, it seems Wasmtime needs to copy it to RAM, which is another issue.

@tschneidereit

Thank you for starting a conversation about this in the BA's Zulip, @ia0! <3 As @alexcrichton said over there, we'll gladly help out with making Wasmtime a viable option however we can.

Alex already filed two issues [1, 2], which should address the issues with Pulley bytecode size and with having to keep the bytecode in RAM.

Besides that, Alex also mentioned being able to reduce the size of the runtime itself, by removing the dependency on Serde. We know that there are other ways to shrink the binary size, but perhaps the biggest one might come from disabling a feature: SIMD support incurs a substantial size increase, because of how many opcodes need to be handled. Disabling that should shrink the interpreter meaningfully.

@ia0
Member Author

ia0 commented Feb 20, 2025

Thank you and Alex for the quick answers and follow-up!

Let me describe how WebAssembly is used in Wasefire and answer Alex's questions:

I realize that this may be a bit of a stretch, but if you're able to describe what your embedding does (or even better have a fork/project that can be built)

  • Wasefire provides 2 APIs (the board API and the applet API) and a "scheduler" sitting between both.
  • The board API is a hardware abstraction in Rust (board implementations and the scheduler are written in Rust). It simplifies support for new embedded devices: adding a device only requires implementing this API.
  • The applet API is a system abstraction in WebAssembly (applets are Wasm modules and the scheduler is a Wasm runtime). This API provides applet portability across different embedded devices, think Embedded WASI except it's custom for now (API using Component Model or Embedded WASI #63).
  • The scheduler is meant to run multiple applets concurrently, although currently only one applet can be installed (and executed) at a time. However, applets can be installed or updated dynamically (through a custom USB protocol).
  • After considering Wasmtime, Wasmer, Wasmi, and Wasm3, I decided to write my own in-place interpreter¹. It differentiates itself from the rest with a small binary size, a small memory footprint, slow interpretation, returning control flow on host function calls, and default function linking.
  • There's also the option to link one native applet to the scheduler (bypassing WebAssembly). This is the other extreme in the design space (applet performance, no applet sandboxing). Ideally Wasmtime would provide yet another point in the design space (applet performance/sandboxing, scheduler flash and RAM footprint, and limited applet binary portability).
  • I'm doing Wasm runtime experiments in crates/wasm-bench. This benchmark uses the minimal CoreMark from Wasm3, and it's really just to get orders of magnitude (or even answer feasibility questions).

We know that there are other ways to shrink the binary size, but perhaps the biggest one might come from disabling a feature

I always use default-features = false and enable only what I use, so I'm already expecting to use the minimum set of Wasmtime features.

and/or describe what the wasm is doing (or even better share a sample wasm) that'd be awesome.

Applets use less than the WebAssembly MVP. The current interpreter doesn't even support SIMD. It also has optional floats (disabled by default). If you want to check some actual wasm modules, you can run cargo xtask applet rust NAME where NAME is the name of the applet and is just a crate at path examples/rust/NAME/Cargo.toml. This will produce target/wasefire/applet.wasm (and applet.wasm.orig in the same directory before wasm-strip and wasm-opt). The biggest example so far is opensk. Note that you can also use cargo xtask --release applet rust opensk (to remove debug printing support) or cargo xtask --release applet rust opensk --opt-level=z to optimize for size.

I'm currently on vacation (with the kid, thus little time), but as soon as I'm back I'll try to see if I can add Wasmtime support behind a cargo feature. The main difficulty will be that the scheduler currently assumes the runtime returns control flow on host function calls. I guess I'll be able to use the async API of Wasmtime for this purpose (without an async runtime, just calling poll myself). Another difficulty is that the current interpreter provides a way to always link imported functions, but that's only to support linking new applets on old platforms, as long as the imported function is allowed to return an error at runtime (there's a common format for all functions). That's probably not going to be a blocker.
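For reference, polling a future manually without any async runtime boils down to driving `poll` with a no-op waker. A minimal sketch in plain std Rust (the `block_on` helper and the async block are illustrative, not Wasmtime code; a Wasmtime embedding could poll the future returned by `Func::call_async` the same way, regaining control at each suspension point):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal no-op waker: we poll in a loop ourselves, so wake-ups are unused.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Drive a future to completion by calling `poll` directly.
fn block_on<F: Future>(mut future: F) -> F::Output {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Safety: `future` is shadowed and never moved after being pinned.
    let mut future = unsafe { Pin::new_unchecked(&mut future) };
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => { /* a scheduler could run other work here */ }
        }
    }
}

fn main() {
    let value = block_on(async { 21 * 2 });
    assert_eq!(value, 42);
}
```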

I'll post updates on this issue.

Footnotes

  1. I later discovered Ben Titzer's paper A fast in-place interpreter for WebAssembly, whose ideas are currently being implemented in the dev/fast-interp branch.

@alexcrichton

A bit delayed, but thank you for writing that up! It'll take some time to fully digest this but I hope to poke at this in the future.

In the meantime bytecodealliance/wasmtime#10285 triggered another thought/recommendation, you'll want to be sure to set Config::generate_address_map to false if you aren't already. That should ~halve the size of the *.cwasm and means that you'll lose the ability to get wasm bytecode offsets in backtraces, which I suspect is probably suitable for your use case. (although if it's not there's some assorted ideas on bytecodealliance/wasmtime#3547 for making this section smaller)

Also, to confirm but I suspect you're already doing this, if you strip the binary before compiling it (e.g. remove the name section) it'll make the *.cwasm a bit smaller by removing that from the original binary. (or we could also plumb a Config option to retaining that in the *.cwasm if you'd prefer to not strip)
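For concreteness, setting that option is a one-liner on the engine configuration. A sketch assuming a recent Wasmtime (the `engine` helper is illustrative):

```rust
use wasmtime::{Config, Engine, Result};

// Illustrative helper: build an Engine without the address-map section.
fn engine() -> Result<Engine> {
    let mut config = Config::new();
    // Dropping the wasm-offset address map shrinks the *.cwasm at the cost
    // of wasm bytecode offsets in backtraces.
    config.generate_address_map(false);
    Engine::new(&config)
}
```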

@ia0
Member Author

ia0 commented May 2, 2025

Sorry for the very long delay. I finally got time to use the new version in #819. You can see the diff for the tuning I had to do (rather simple). In terms of performance it's essentially 20x faster than what I currently use, uses 2x the memory, and 2.5x the flash. So it seems usable, I'll try to integrate it in the final product.

You can also see that, compared to Wasmi, it has comparable performance, uses 50% more memory (but that's probably just my tuning; I could probably ask for a 32k wasm stack instead of 64k), and uses half the flash.

Also important: the changes to the cwasm were significant; the module is now 15k.

@alexcrichton

Oh that's awesome, thanks for the update!

FWIW the max_wasm_stack option doesn't actually proactively allocate stack; it's just a limiter that prevents going above that threshold, so changing the setting probably won't lead to less memory consumption. That being said, 2x memory may mean there's still room to improve within Wasmtime, so if you're able to identify some of the larger allocations we can try to shrink them or make more of them optional.
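For reference, the option in question is set on the engine configuration. A minimal sketch, assuming Wasmtime's `Config::max_wasm_stack` (the `config` helper is illustrative):

```rust
use wasmtime::Config;

// Illustrative helper: cap wasm stack usage at 16 KiB.
fn config() -> Config {
    let mut config = Config::new();
    config.max_wasm_stack(16 * 1024);
    config
}
```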

@ia0
Member Author

ia0 commented May 4, 2025

FWIW the max_wasm_stack option doesn't actually proactively allocate stack

For Pulley the stack is actually allocated proactively (and I guess it makes sense). I reduced it to 16k and saw the memory reduction.

Regarding memory usage, there are only 3 allocations of more than 1kB:

  • 16k for the stack (or whatever max_wasm_stack is set to)
  • 1056 bytes for the VM itself
  • 64k for the wasm memory

This seems very reasonable to me. The only remaining limitation is the binary size, but I'll just optimize wasmtime for size (since that's the biggest part, 70kB to 80kB) and pulley-interpreter for perf (since it gives 40% perf improvement for double the footprint from 25k to 50k).
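The per-crate optimization split can be expressed with Cargo's profile overrides. A sketch assuming the repository's release-size profile and the crates.io package names:

```toml
# Optimize the bulk of the runtime for size...
[profile.release-size.package.wasmtime]
opt-level = "z"

# ...but keep the interpreter hot loop optimized for speed.
[profile.release-size.package.pulley-interpreter]
opt-level = 3
```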

@tschneidereit

64k for the wasm memory

That's something the custom page sizes proposal can help with, which is already supported by Wasmtime. It sounds like that should work well for your use case, potentially with page sizes of 4kb, or even less.
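As a sketch, enabling the proposal on the embedder side might look like this (assuming Wasmtime's `Config::wasm_custom_page_sizes` switch; the module itself must also declare its smaller page size):

```rust
use wasmtime::Config;

// Illustrative: enable the custom-page-sizes proposal so modules can
// declare a page size below the default 64 KiB (e.g. a single 4 KiB page).
fn config() -> Config {
    let mut config = Config::new();
    config.wasm_custom_page_sizes(true);
    config
}
```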

@alexcrichton

Oh right yes I forgot about that stack, sorry! That should definitely be ok to decrease as you see fit.

I'm not sure I reproduced exactly right but locally I was seeing a 680k binary for the wasm-bench folder compiled to a thumb target, and with bytecodealliance/wasmtime#10727 I was able to get that number down to 580k, so if you don't need simd that should help? That should also shrink the size of the VM allocation too by removing the (probably unused) vector registers from the VM state, leaving just float/integer ops.

@ia0
Member Author

ia0 commented May 5, 2025

That's something the custom page sizes proposal can help with

Good point. That's definitely going to be useful when we support multiple applets. For now this is not a blocker.

I'm not sure I reproduced exactly right but locally I was seeing a 680k binary

Weird, the repo should be somewhat hermetic. Running the following command at commit d85f65661519e5159d3d79d240e12ae1dc70b60:

cargo-size --profile=release-size -Zbuild-std=core,alloc -Zbuild-std-features=panic_immediate_abort,optimize_for_size --target=thumbv7em-none-eabi --features=target-nordic,runtime-wasmtime

should give:

   text	   data	    bss	    dec	    hex	filename
 276708	     24	   1220	 277952	  43dc0	wasm-bench

so if you don't need simd that should help

Indeed, I don't need SIMD. The current interpreter I'm using doesn't even support it (and floats are behind a feature flag). I'll follow the PR.

@alexcrichton

alexcrichton commented May 5, 2025

Aha, I was having various issues which I have now resolved. I couldn't find cargo-size over the weekend, so I was just looking at the output ELF size. Now I've found it though! Additionally, I was using --release vs --profile=release-size. In release mode I'm seeing a 10% reduction in text size from removing simd in the interpreter (410016 => 366260, 43756 bytes removed), and in release-size I'm seeing a 5% reduction (273944 => 259424, 14520 bytes removed). This was compiled with/without CARGO_TARGET_THUMBV7EM_NONE_EABI_RUSTFLAGS=--cfg=pulley_disable_interp_simd on bd85f65; the SHA you linked above I couldn't find in the repo.
