Performance profiling of Spicy parsers

This captures some notes for gathering performance statistics for Spicy parsers. The following assumes you are using Spicy and plugin bundled with Zeek.

Creating Flame Graphs.

The two main challenges when creating flame graphs is (1) compiling everything with frame pointer to get consistent stack traces, and (2) cleaning up the collected performance samples to account for Spicy's stack switching.

Frame pointers.

Build Zeek (and hence Spicy's runtime library) with frame pointers:
```
# CFLAGS=-fno-omit-frame-pointer  CXXFLAGS=-fno-omit-frame-pointer ./configure --generator=Ninja --ccache --disable-zeekctl --disable-python
```
If you also want to use a non-standard compiler (say, clang), add CC=/path/to/clang CXX=/path/to/clang++.
We also need malloc and friends with frame pointers. One way to do that is using a custom jemalloc. It's easy to build with frame pointers:
```
# CFLAGS=-fno-omit-frame-pointer ./configure && make && make install
```
Then inject it into processes through LD_PRELOAD, like globally like this:
```
# export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
```

Generating graphs

With frame pointers in place, we can now collect perf data for Zeek:

# perf record -F 99 --call-graph fp -- zeek -Cr trace.pcap my-analyzer.hlto

To create a flame graph out of that, we need to clean up the stack traces first. Get the flame graph tools and run:

# perf script | awk '/^$/ { skip=0; } !skip { print; } /hilti::rt::detail::Callback::Callback/ { skip=1; }' | stackcollapse-perf.pl /dev/stdin >perf.folded
# flamegraph.pl perf.folded >~/tmp/perf.svg

You should now see the Spicy parsers as their stacks in the graphs, placed right on top of the main Zeek process.

Retaining generated C++ code

Normally, when using a profiler/debugger with *.hlto files, they won't have access to the generated C++ code, meaning they can't show source code information for a lot of what's going on. To retain the source code when building the analyzer, use spicyz with -T:

# TMPDIR=/dir/to/place/code spicyz -T -o my-analyzer.hlto my-analyzer.spicy my-analyzer.evt

With a recent Spicy plugin from git, you can also set that flag when building your analyzer through CMake:

# `cmake -DSPICYZ_FLAGS=-T [...]`

Using Spicy's built-in profiler.

Since version 1.8 (currently only in git main), Spicy also provides a built-in profiling mode (-Z) that adds measurement instrumentation to your analyzer:

# spicyz -Z -o my-analyzer.hlto my-analyzer.spicy my-analyzer.evt

Then run Zeek with Spicy::enable_profiling to get a timing dump at the end:

[TODO: Update example output, it looks a bit different now.]

# zeek -r -Cr trace.pcap my-analyzer.hlto Spicy::enable_profiling=T
#
# Profiling results
#
# name                                                  count       time      avg-%    total-%
hlt/fiber/save-stack                                        2       8938       0.09       0.18
hlt/func/Foo::X::__on_x                                     1    4446076      87.58      87.58
hlt/func/Foo::X::__parse_Foo_X_stage2                       1    4565265      89.93      89.93
hlt/func/Foo::X::__parse_stage1                             1    4649733      91.59      91.59
hlt/func/Foo::X::parse3                                     1    4690787      92.40      92.40
hlt/func/Foo::__register_Foo_X                              1       5705       0.11       0.11
hlt/func/Foo::fibo                                          1    4401818      86.71      86.71
hlt/func/Foo::y                                            88     141955       0.03       2.80
hlt/total                                                   1    5076714     100.00     100.00
spicy/unit/Foo::X                                           1    4608137      90.77      90.77
spicy/unit/Foo::X::x                                        1    4506502      88.77      88.77

Usually the most interesting parts are the lines starting with spicy/unit (showing execution times for parsing units and their fields), and spicy/event (showing the time it takes to create (not process) a Zeek event). The latter isn't shown in the example.

The time values represent a high-precision internal clock with no direct meaning, so the percentages are most useful. The measure corresponding blocks of code executed. count counts how often the blocks were executed. For recursive functions/units, only the top-level instance has their time measured (but all instances are counted).

Provide feedback

Saved searches