Sharing a math inner-loop (O(N^2)) benchmark, loop unrolling and autovectorization #561

ewmailing · 2023-02-22T23:32:12Z

Hi, I just wanted to share the results of another little Pallene benchmark I did that is very positive.

For reference I was implementing the algorithm/equation found here, which computes something called a "Money Flow Index" at a specific point on a stock:
https://tulipindicators.org/mfi

In my version, I am calculating the MFI for all candle bars in a stock chart history, which means I put this inside a for-loop.

The slowest part of the formula is the summation. Essentially, this is a summation of values in an array within a small sliding window (the default window size is 14, but it is a variable parameter).
So this is basically a smaller nested for-loop inside the main for-loop: N times the window size, so O(N^2) in the worst case where the window grows with N.
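For readers who want to picture the loop shape, here is a minimal C sketch of the nested summation, assuming the usual MFI definition (typical price = (high + low + close) / 3, money flow = typical price * volume, summed separately for rising and falling bars). The function name `mfi_naive`, the argument layout, and the choice to write 0.0 where there is not enough history or no price movement are assumptions made for illustration. This is not the Pallene code that was benchmarked.

```c
#include <stddef.h>

/* Hypothetical helper: Money Flow Index for every bar, computed with the
 * naive nested loop. All input arrays hold n bars. out[i] is meaningful for
 * i >= period; earlier entries are set to 0.0 as a placeholder. */
void mfi_naive(const double *high, const double *low, const double *close,
               const double *volume, size_t n, size_t period, double *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = 0.0;
        if (i < period)
            continue;                 /* not enough history yet */

        double up = 0.0, down = 0.0;

        /* Inner loop: re-sum the last `period` bars for every i.
         * Since i >= period, j starts at 1 or later, so j - 1 is valid. */
        for (size_t j = i - period + 1; j <= i; j++) {
            double tp      = (high[j] + low[j] + close[j]) / 3.0;
            double tp_prev = (high[j - 1] + low[j - 1] + close[j - 1]) / 3.0;
            double flow    = tp * volume[j];

            if (tp > tp_prev)
                up += flow;           /* positive money flow */
            else if (tp < tp_prev)
                down += flow;         /* negative money flow */
        }

        double total = up + down;
        /* Sketch convention (an assumption): report 0.0 when no flow occurred. */
        out[i] = (total > 0.0) ? 100.0 * up / total : 0.0;
    }
}
```

The inner loop runs `period` times for each of the n bars, which is where the nested-loop cost comes from.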

So in a quick benchmark test I whipped up using os.clock() (which reports CPU time in seconds, so lower is better):

Lua (via --emit-lua): 8.0
Pallene (default -O2): 0.80

Because in C-land, this seems like a terrific candidate for loop-unrolling and vectorization, I tried activating those flags. -O3 wasn't sufficient to change anything by itself, so my flags were:

export CFLAGS="-O3 -funroll-loops -ftree-vectorize"

Pallene (-O3 -funroll-loops -ftree-vectorize): 0.75

So, about another 6% decrease in run time (0.80 to 0.75).

I separated out the flags and found that the vectorizer did nothing, and it was the -funroll-loops that had the impact.
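One possible contributing factor, not verified against Pallene's generated C, is that a floating-point sum is a reduction whose addition order the compiler generally has to preserve unless reassociation is explicitly allowed (for example with GCC's -ffast-math or -fassociative-math). Unrolling alone keeps the order and only reduces loop overhead. The sketch below contrasts a plain windowed sum with a hand-unrolled version that uses independent accumulators, which is roughly the shape a vectorizer would need to produce. The names `window_sum` and `window_sum_unrolled4` are hypothetical.

```c
#include <stddef.h>

/* Straightforward windowed sum: one accumulator, strict left-to-right order. */
double window_sum(const double *x, size_t start, size_t len)
{
    double s = 0.0;
    for (size_t k = 0; k < len; k++)
        s += x[start + k];
    return s;
}

/* Hand-unrolled by 4 with independent accumulators. This changes the order
 * of the floating-point additions, so the result may differ from
 * window_sum() in the last bits for general inputs. */
double window_sum_unrolled4(const double *x, size_t start, size_t len)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t k = 0;

    /* Main body: four independent partial sums per iteration. */
    for (; k + 4 <= len; k += 4) {
        s0 += x[start + k];
        s1 += x[start + k + 1];
        s2 += x[start + k + 2];
        s3 += x[start + k + 3];
    }

    double s = (s0 + s1) + (s2 + s3);

    /* Remainder: up to three leftover elements. */
    for (; k < len; k++)
        s += x[start + k];
    return s;
}
```

With the default window of 14, the unrolled loop body runs three times and the remainder loop handles the last two elements.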

Also, FYI, these verbose flags show all the missed things the optimizer couldn't handle. Lots of Pallene/Lua C functions are highlighted as "clobbering memory". Probably all expected.

export CFLAGS="-O3 -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -fopt-info-vec-missed"

Anyway, I thought this was overall very positive for Pallene, and I thought you would enjoy seeing it.
But this does make me wonder if Pallene in the future could detect cases like this where it might be a good candidate for SIMD/vectorization, and generate better code that could help the autovectorizer do its thing.

Also, on a semi-related tangent, my program has to load lots of stock history datasets, and that loading has become a bottleneck. I started playing with serialized binary formats, such as Google Protobufs. While my performance is better, I know that it must use the standard Lua C API to push each element into Lua one at a time. It got me wondering if Pallene could offer a more efficient C API to:

- Batch push (or pull) arrays (of homogeneous types) into Lua/Pallene.
- Possibly push (or pull) types directly into a Pallene record type instead of a Lua table.

These are not nearly my most pressing concerns, but I thought I would stir the pot of ideas before moving on.

Thanks

hugomg (Maintainer) · 2023-02-24T14:48:09Z

Very cool! I agree with you, some kind of way to access homogeneous C buffers from Pallene would be nice. LuaJIT does this as part of its FFI, maybe we should also put it in the (to be done) FFI for Pallene?
