Skip to content

Trade interpreter footprint for performance pre-computing a side table #804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 55 commits into from
Apr 30, 2025

Conversation

ia0
Copy link
Member

@ia0 ia0 commented Mar 20, 2025

This is part of #46 and was developed in the dev/fast-interp branch.

zhouwfang and others added 30 commits August 28, 2024 13:40
google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
This is the first step towards adding a value stack in `Thread`. The
next step is to merge `locals` and `labels_values` in `Frame`. After
that, we can create a value stack in `Thread` by combining the values
from the `Frame`'s.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
Continuation of google#598 towards adding a value stack in `Thread`.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
Reverted google#605 and keep values separate from locals for now.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
The first step towards building the side table.

The next step is to build the side table in `push_label()`,
`pop_label()`, etc.

google#46

Co-authored-by: Zhou Fang <[email protected]>
Make validation return side tables, and store them in module.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
There was an earlier review in
zhouwfang#2.

google#46

---------

Co-authored-by: Julien Cretin <[email protected]>
Co-authored-by: Zhou Fang <[email protected]>
Jumping from "if without else" should be 1 step less than jumping from
"if with else". For "if without else", we don't want to jump over the
"end" because `exit_label()` need to be
[called](https://github.com/zhouwfang/wasefire/blob/de6c7e889865736cb1f11877d5d4e1e9f50729a8/crates/interpreter/src/exec.rs#L788)
at the "end". But for "if with else", we want to jump over the "else"
because otherwise we would be forced to jump to the "end" at "else"
(normally, we would reach "else" after executing the true branch of
"if", and we should not execute the "else" branch).

This is equivalent to
https://github.com/google/wasefire/blob/main/crates/interpreter/src/parser.rs#L566-L569,
which will be deprecated with the side table applied in `exec.rs`.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
When "if" has an "else", we should skip the entry for "else" in the side
table. It was addressed in google#690 during execution and should be moved to
validation.

google#46

Co-authored-by: Zhou Fang <[email protected]>
Apply `delta_ip` and `delta_stp` from side table in `exec.rs`.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
On my Linux Docker container:

1. 

CoreMark results of this PR: 37.991516 (in 18.695s), 36.7287 (in
19.258s)

CoreMark results of google#690:
37.101162 (in 19.133s), 37.39716 (in 18.996s)

Look very similar. Not sure about `nordic`.

2. 

CoreMark results of `main`: 28.791477 (in 18.09s), 28.814291 (in
17.734s)

So there does seem some minor performance improvement with `delta_ip`
and `delta_stp`.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
It improves the performance by removing the costly
`last_frame_values_cnt()`.

On linux, `CoreMark result: 39.795715 (in 17.853s)`. cf. CoreMark result
was about 37 in 19s in google#705.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
- Apply `val_cnt` and `pop_cnt`.
- Remove `Label` by introducing `Frame::labels_cnt`.

There is not much performance improvement on Linux due to this PR, which
replaces
```
let values_cnt: usize = frame.labels[i ..].iter().map(|label| label.values_cnt).sum();
```
in `pop_label()` by `pop_cnt + val_cnt`.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
Based on experiments, these are the smallest masks to pass CI. In other
words, `SideTableEntry` only requires `u35` at this point. I'll add the
fields from `func_type()` and `func()` in `module.rs` to the side table
in google#711, and `SideTableEntry` as `u64` might not be sufficient.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
I'd like to get some feedback about the design: the "section table" is
used to improve the performance of [these
accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L176-L209)
in `module.rs`, which are used in execution. For these three accessors
that return `Parser`, we can precompute the `ParseRange` during
validation, and create a parser based on the range during execution.

I'm not sure about [the other
accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119-L174)
that return types or even `Option<ExportDesc>`. IIUC, we want to store
the section table in flash as the side table. Does that mean we would
have to design some encoding rules for these return types?

**Update 1**

We will only optimize
[`func()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L188)
and
[`func_type()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119)
in `module.rs`, because the other accessors should be fast.

**Update 2**

- Rename the previous `SideTableEntry` as `BranchTableEntry`.
- Create per-function `SideTable` to include branch table and other
function metadata.

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
CoreMark result: 37.404152 (in 18.937s)

Slight performance improvement on Linux compared to [the previous
state](google#46 (comment)).

google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
ia0 and others added 8 commits March 7, 2025 11:10
The main changes are:
- Do not create a branch table entry for unreachable code. This saves
space and permits a uniform treatment of branches.
- Use the correct stack length for the Loop branch target (we're taking
the branch before being in the label, so the current stack needs to be
added).
- Take the Else source and target branches at the correct time (i.e.
when the stack is the expected one such that the stack length is
correctly computed).
This way we can make sure which part of it is unsupported, namely the
large table.
google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
google#46

---------

Co-authored-by: Zhou Fang <[email protected]>
Co-authored-by: Julien Cretin <[email protected]>
@ia0 ia0 added crate:interpreter Modifies the interpreter crate:cli Modifies the users CLI for:performance Improves firmware performance labels Mar 20, 2025
@ia0 ia0 requested a review from ia0-review March 20, 2025 11:32
Copy link

google-cla bot commented Mar 20, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@ia0
Copy link
Member Author

ia0 commented Mar 20, 2025

cc @zhouwfang I'll try to review this when I get time. I'll directly modify the PR as I review it.

@zhouwfang
Copy link
Contributor

Gentle ping :)

@ia0
Copy link
Member Author

ia0 commented Apr 29, 2025

We currently have:

run.sh CoreMark Time RAM Platform Applet
nordic base size 0.04453232 449.875s 5416 88756 7771
nordic base perf 0.092720516 216.018s 5400 138952 7771
nordic PR size 0.050766833 394.835s 3128 92144 9459
nordic PR perf 0.15536636 129.018s 3096 138736 9459

Logging this before doing the "cursor" change, to see if the unsafe is worth it or not.

@ia0 ia0 requested a review from ia0-review April 30, 2025 10:32
@ia0
Copy link
Member Author

ia0 commented Apr 30, 2025

Final measurements:

run.sh CoreMark Time RAM Platform Applet
nordic base size 0.04453232 449.875s 5416 88756 7771
nordic base perf 0.092720516 216.018s 5400 138952 7771
nordic PR size 0.043123893 464.771s 3256 92444 9459
nordic PR perf 0.1507 132.99s 3288 145004 9459

@ia0
Copy link
Member Author

ia0 commented Apr 30, 2025

HWCI nordic passed

@ia0 ia0 merged commit d632760 into google:main Apr 30, 2025
22 checks passed
@ia0 ia0 deleted the side-table branch April 30, 2025 16:12
@zhouwfang
Copy link
Contributor

Thanks!

@ia0
Copy link
Member Author

ia0 commented Apr 30, 2025

Thanks!

Thank you! That was a big piece of work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
crate:cli Modifies the users CLI crate:interpreter Modifies the interpreter for:performance Improves firmware performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants