Performance profiling of Spicy parsers

This captures some notes for gathering performance statistics for Spicy parsers. The following assumes we're using Spicy and plugin bundled with Zeek.

Creating Flame Graphs.

The two main challenges when creating flame graphs is (1) compiling everything with frame pointer to get consistent stack traces, and (2) cleaning up the collected performance samples to account for Spicy's stack switching.

Frame pointers.

Build Zeek (and hence Spicy's runtime librart) with frame pointers:
```
# CFLAGS=-fno-omit-frame-pointer  CXXFLAGS=-fno-omit-frame-pointer ./configure --generator=Ninja --ccache --disable-zeekctl --disable-python
```
If you also want to use a non-standard compiler (say, clang), add CC=/path/to/clang CXX=/path/to/clang++.
We also need malloc and friends with frame pointers. One way to do that is using jemalloc. It's easy to compile yourself:
```
# CFLAGS=-fno-omit-frame-pointer ./configure && make && make install
```
(It might even compile with frame pointers by default, not sure.)

Then inject it into processes like this:
```
# LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
```

Generating graphs

With frame pointers in place, we can now collect perf data for Zeek:

# perf record -F 99 --call-graph fp -- zeek -Cr trace.pcap my-analyzer.hlto

To create a flame graph out of that, we need to clean up the stack traces first. Get the flame graph tools and run:

# perf script | awk '/^$/ { skip=0; } !skip { print; } /hilti::rt::detail::Callback::Callback/ { skip=1; }' | stackcollapse-perf.pl /dev/stdin >perf.folded
# flamegraph.pl perf.folded >~/tmp/perf.svg

You should now see the Spicy parsers as their stacks in the graphs, placed right on top of the main Zeek process.

Retaining generated C++ code

Normally, when using a profiler/debugger with *.hlto files, they won't have access to the generated C++ code, meaning they can't show source code information for a lot of what's going on. To retain the source code when building the analyzer, use spicyz with -T:

# TMPDIR=/dir/to/place/code spicyz -T -o my-analyzer.hlto my-analyzer.spicy my-analyzer.evt

Once https://github.com/zeek/spicy-plugin/pull/205 will have landed, you can pass that -T to CMake when building your analyzer through the standard build setup.

Using Spicy's built-in profiler.

# spicyz -Z -o my-analyzer.hlto my-analyzer.spicy my-analyzer.evt

Then run Zeek with Spicy::enable_profiling to get a timing dump at the end:

# zeek -r -Cr trace.pcap my-analyzer.hlto Spicy::enable_profiling=T
#
# Profiling results
#
# name                                                  count       time      avg-%    total-%
hlt/fiber/save-stack                                        2       8938       0.09       0.18
hlt/func/Foo::X::__on_x                                     1    4446076      87.58      87.58
hlt/func/Foo::X::__parse_Foo_X_stage2                       1    4565265      89.93      89.93
hlt/func/Foo::X::__parse_stage1                             1    4649733      91.59      91.59
hlt/func/Foo::X::parse3                                     1    4690787      92.40      92.40
hlt/func/Foo::__register_Foo_X                              1       5705       0.11       0.11
hlt/func/Foo::fibo                                          1    4401818      86.71      86.71
hlt/func/Foo::y                                            88     141955       0.03       2.80
hlt/total                                                   1    5076714     100.00     100.00
spicy/unit/Foo::X                                           1    4608137      90.77      90.77
spicy/unit/Foo::X::x                                        1    4506502      88.77      88.77

Usually the most interesting parts are the lines starting with spicy/unit (showing execution times for parsing units and their fields), and spicy/event (showing the time it takes to create (not process) a Zeek event). The latter isn't shown in the example.

The time values represent a high-precision internal clock with no direct meaning, so the percentages are most useful. The measure corresponding blocks of code executed. count counts how often the blocks were executed. For recursive functions/units, only the top-level instance is counted/measured.

Provide feedback

Saved searches