Replies: 1 comment
-
Very cool! I agree with you: some way to access homogeneous C buffers from Pallene would be nice. LuaJIT does this as part of its FFI, so maybe we should also put it in the (to be done) FFI for Pallene?
-
Hi, I just wanted to share the results of another little Pallene benchmark I did, which turned out very positive.
For reference, I was implementing the algorithm/equation found here, which computes something called a "Money Flow Index" (MFI) at a specific point in a stock's price history:
https://tulipindicators.org/mfi
In my version, I am calculating the MFI for all candle bars in a stock chart history, which means I put this inside a for-loop.
The slowest part of the formula is the summation. Essentially, this is a summation of values in an array within a small sliding window (the default window size is 14, but it is a variable parameter).
So this is basically a smaller nested for-loop inside the main for-loop, so O(N × W) for N bars and window size W (O(N^2) in the worst case, when the window grows with N).
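To make the loop structure concrete, here is a minimal C sketch of the nested sliding-window summation. It is not the Pallene code from this benchmark. It assumes the usual MFI definition (typical price = (high + low + close) / 3, money flow = typical price × volume, MFI = 100 × up / (up + down) over the window). The function name `mfi_naive`, the argument layout, and the choice to return 50 when the window has no money flow at all are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Naive Money Flow Index over a whole price history (illustrative sketch).
 * For every bar i >= period, the inner loop re-sums the last `period`
 * money flows, so the total cost is O(n * period).
 * out[i] is set to 0 for the first `period` bars, which lack a full window. */
void mfi_naive(const double *high, const double *low, const double *close,
               const double *volume, size_t n, size_t period, double *out)
{
    assert(period > 0);
    for (size_t i = 0; i < n; i++) {
        out[i] = 0.0;
        if (i < period) {
            continue; /* not enough history yet */
        }
        double up = 0.0, down = 0.0;
        /* Sliding window: bars i-period+1 .. i, each compared to its predecessor. */
        for (size_t j = i - period + 1; j <= i; j++) {
            double tp   = (high[j] + low[j] + close[j]) / 3.0;
            double prev = (high[j - 1] + low[j - 1] + close[j - 1]) / 3.0;
            double flow = tp * volume[j];
            if (tp > prev) {
                up += flow;
            } else if (tp < prev) {
                down += flow;
            }
        }
        double total = up + down;
        /* Assumption: report a neutral 50 when there is no flow in the window. */
        out[i] = (total > 0.0) ? 100.0 * up / total : 50.0;
    }
}
```

The inner loop has a small, fixed trip count per bar, which is why it looks like a natural target for unrolling. Keeping running `up` and `down` sums (add the newest flow, subtract the oldest) would bring the cost down to O(n), at the price of slightly different floating-point rounding.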
So in a quick benchmark test I whipped up using os.clock():
Because in C-land, this seems like a terrific candidate for loop-unrolling and vectorization, I tried activating those flags. -O3 wasn't sufficient to change anything by itself, so my flags were:
So, about another 6% decrease.
I separated out the flags and found that the vectorizer did nothing; it was -funroll-loops that had the impact.
Also, FYI, these verbose flags show all the things the optimizer missed or couldn't handle. Lots of Pallene/Lua C functions are highlighted as "clobbering memory". Probably all expected.
Anyway, I thought this was overall very positive for Pallene, and I thought you would enjoy seeing it.
But this does make me wonder whether Pallene could, in the future, detect cases like this that are good candidates for SIMD/vectorization, and generate code that helps the autovectorizer do its thing.
Also, on a semi-related tangent, my program has to load lots of stock history datasets, and that loading has become a bottleneck. I started playing with serialized binary formats, such as Google Protobufs. While my performance is better, I know that it must still use the standard Lua C API to push each element into Lua. It got me wondering if Pallene could offer a more efficient C API to:
These are not nearly my most pressing concerns, but I thought I would stir the pot of ideas before moving on.
Thanks