-
-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVX Surface.fill() setup, AVX BLEND_ADD #2382
AVX Surface.fill() setup, AVX BLEND_ADD #2382
Conversation
b2bd30d
to
5d7f47c
Compare
Couple preliminary things. I was interested to see if I could find any examples of people actually using Surface.fill with a blend flag online. I did, I found this: https://github.com/Rabbid76/PyGameExamplesAndAnswers/blob/master/documentation/pygame/pygame_blending_and_transaprency.md#change-the-color-of-a-surface-area-mask AVX2 only works on x86, our SSE2 code also runs on ARM due to SSE2Neon.h. So SSE2 is more broadly important to us, since it will help x86 computers as well as ARM macs and other ARM devices. I like that you're taking inspiration from my macro strategy. I see that you're using the SSE and AVX2 registers together, my blit macros use masked stores on the non-aligned edges so everything can be done with AVX2 registers and instructions. Is that something you'd like to do here? |
If what you mean is that i should do the SSE2 version first i guess it's fine to see this PR as a setup for avx and then the blend add SSE2 version and setup will come next instead of expanding AVX2, so after these two PRs we could either implement both the sse and avx versions of the same flags in a single PR or separately.
yep =).
Yeah I've already replicated your work to compare performance and see the benefits. Tbh i thought about the "only avx" stategy myself without knowing your implementation. In practice i didn't see much of a difference, basically the same. I've also switched unrolled loops with a for loop. The main benefit there is that we would just need a single code for filling instead of two which is good. I didn't push that yet but might be wise for simplicity. |
You’re talking about two codes, in my macro you only need 1, and you only need AVX for the AVX blitter. This is not about performance, this is about code simplicity. And this is the approach I would prefer. Unrolled vs normal for loop— I don’t care too much. Unrolled has a larger code size, so if there’s no measurable benefit I’d do a normal for loop. |
This comment was marked as outdated.
This comment was marked as outdated.
I saw that you moved to my favored strategy and then moved away from it. I think there are speedups to be had in your implementation of my suggested strategy. Another consideration is code size: your current implementation copy-pastes the "add" code into the final binary 14 times by my count, because of the loop unrolling and the macro. Code size could be a bigger impact if it was a bigger routine (like a blit), rather than just a handful of instructions, but it's something to keep in mind. |
This comment was marked as outdated.
This comment was marked as outdated.
da66157
to
46a4483
Compare
looks like this needs a merge with main to get past the old CircleCI failure. |
…ead of vector calculations.
1e75d63
to
de7b49b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, LGTM 👍 (passes all my visual tests, and I also see the expected speedup)
SIMD with the add blend is so nice and straightforward! 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
On to an SSE2 implementation, and roll out to other flags?
I'd like to be squashed down a bit before merge, please. |
Our current implementation of
Surface.fill()
when using blend flags only implements the single-pixel strategy. This is a massive opportunity to speed things up.This PR tries to start the changes with
BLEND_ADD
.Results:
Test Program: