Surface.fill() performance improvements #2390
Comments
Another update. I've conducted further testing and found that we could also implement an approach similar to SDL's SSE implementation of fillrect, but for AVX2. I've also made a graph comparing SDL's implementation (which in my case uses SSE), straight memcpy, an AVX2 implementation similar to the one found in #2382, and an AVX2 port of SDL's SSE approach (the first one I listed). The results show that a naive AVX2 implementation could be even more reliable than memcpy at low sizes (and shows lows that are better than memcpy's), but the new AVX2 algorithm seems promising for bigger sizes, being even faster on average than SSE. I still think that a hybrid approach would work best, using memcpy up until a certain point (this i…
A zoom on the critical interval of 0-1000 shows how memcpy surpasses every approach. To be correct, memcpy seems a bit weird around 1000, but using it would save tons of lines of code and setup.
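A minimal sketch of the hybrid idea, under stated assumptions: the cutoff value and its unit are not fixed anywhere in this thread, so `HYBRID_THRESHOLD` is a placeholder, and all function names are invented for illustration. The large-size path is a plain loop standing in for a SIMD (e.g. AVX2) routine, which is not reproduced here. The memcpy path shown (seed one pixel, then repeatedly double the filled prefix) is one possible way to fill with memcpy, not necessarily the one that was benchmarked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cutoff; it would have to be tuned from real measurements. */
#define HYBRID_THRESHOLD 1000

/* Small sizes: write one pixel, then double the filled prefix with memcpy.
   Source and destination never overlap because n <= filled. */
static void fill32_memcpy(uint32_t *dst, uint32_t color, size_t count)
{
    if (count == 0) {
        return;
    }
    dst[0] = color;
    size_t filled = 1;
    while (filled < count) {
        size_t remaining = count - filled;
        size_t n = filled < remaining ? filled : remaining;
        memcpy(dst + filled, dst, n * sizeof(uint32_t));
        filled += n;
    }
}

/* Large sizes: placeholder for a wide-store (SIMD) path. */
static void fill32_wide(uint32_t *dst, uint32_t color, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = color;
    }
}

/* Dispatch on size: memcpy-based fill below the cutoff, wide path above. */
static void fill32_hybrid(uint32_t *dst, uint32_t color, size_t count)
{
    if (count <= HYBRID_THRESHOLD) {
        fill32_memcpy(dst, color, count);
    } else {
        fill32_wide(dst, color, count);
    }
}
```

The only design decision the sketch commits to is the size-based dispatch; where the crossover sits on a given machine is exactly what the graphs would have to determine.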
Is the surface size in these charts the total area of pixels, or the size of a dimension in a square surface? I’m surprised AVX2 is not that much better than SSE2. I suppose it’s memory constrained, there isn’t much to gain with the higher width? That being said, it’s still an improvement. This is something you could implement SDL side and benefit all SDL projects (and without any maintenance burden on pygame-ce)
Just like #2388, it's unclear what the best strategy would be, both in terms of actual implementation and where to implement it (SDL or here). It's even unclear how well each strategy would scale on different hardware, so with all of these unknowns I'm closing this issue.
While investigating for #2388, I concluded that the main cause of inconsistent or sub-optimal performance is SDL's custom implementation of memcpy, and also of memset, as the latter is used inside the SDL_FillRect function. In this issue I'm focused on the 32-bit (4-byte) surface case. Long story short, there looks to be a 6x performance improvement still on the table. I also discovered a similar issue with fill, which has very inconsistent performance across surface sizes. At first it looked like just an inconsistency issue, but I later discovered that the SDL_memset4 function was causing massive performance deficits.
This is SDL's implementation of the memset function used to fill a solid rect in the 4-byte case:
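For illustration only, a scalar 4-byte fill of this general kind can be sketched as below. This is a hypothetical stand-in (the name `fill32_unrolled` is invented for this sketch), not a copy of SDL's source. It assumes a generic, non-SIMD path that unrolls the store loop four times and jumps into it with a switch on the remainder; the real code should be checked in SDL's headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, not SDL's source: store `val` into `dwords`
   consecutive 32-bit slots, one scalar store at a time. The loop body is
   unrolled four times and entered through a switch on dwords % 4. */
static void fill32_unrolled(uint32_t *dst, uint32_t val, size_t dwords)
{
    size_t n = (dwords + 3) / 4; /* passes through the do-while loop */

    if (dwords == 0) {
        return;
    }
    switch (dwords % 4) {
    case 0: do { *dst++ = val; /* fall through */
    case 3:      *dst++ = val; /* fall through */
    case 2:      *dst++ = val; /* fall through */
    case 1:      *dst++ = val;
            } while (--n);
    }
}
```

A loop of this shape writes 4 bytes per store regardless of how wide the machine's registers are, which is one plausible reason a fill built on it would fall behind memcpy or SIMD variants.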
This is a graph made with the program below. It shows how the performance levels aren't on a smooth curve, but rather on fairly separate levels that cause this inconsistency in performance.
Text results:
Program:
I've already tried to find a solution by substituting SDL's function with a custom function made for just the 4-byte case, though something similar could be made for all other cases.
This is the new function (it's similar to SDL_FillRect but only works on 4-byte surfaces):
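As a rough sketch of what a 4-byte-only rectangle fill can look like (not the function from this post, which is not reproduced here), the helper below fills the first row with plain stores and copies it to the remaining rows with memcpy. The name `fill_rect32` and its parameters are hypothetical. It assumes `pixels` points at the rectangle's top-left pixel, that rows are 4-byte aligned, and that `pitch` is the row stride in bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill a w x h rectangle of 32-bit pixels.
   `pixels` is the rect's top-left pixel, `pitch` the row stride in bytes. */
static void fill_rect32(void *pixels, int pitch, int w, int h, uint32_t color)
{
    if (w <= 0 || h <= 0) {
        return;
    }

    uint8_t *row = (uint8_t *)pixels;
    uint32_t *first = (uint32_t *)pixels;

    /* Fill the first row pixel by pixel. */
    for (int x = 0; x < w; x++) {
        first[x] = color;
    }

    /* Copy the first row into every following row. */
    for (int y = 1; y < h; y++) {
        row += pitch;
        memcpy(row, first, (size_t)w * sizeof(uint32_t));
    }
}
```

Clipping the rectangle against the surface and locking the surface, which SDL_FillRect handles itself, are left to the caller in this sketch.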
Results:
Results: