Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface.fill() performance improvements #2390

Closed
itzpr3d4t0r opened this issue Aug 10, 2023 · 4 comments
Closed

Surface.fill() performance improvements #2390

itzpr3d4t0r opened this issue Aug 10, 2023 · 4 comments
Labels
enhancement Performance Related to the speed or resource usage of the project Surface pygame.Surface

Comments

@itzpr3d4t0r
Copy link
Member

itzpr3d4t0r commented Aug 10, 2023

While investigating for #2388, I've concluded that the main issue with inconsistent or sub-optimal performance is caused by SDL's custom implementantion for memcpy, but also for memset, as it's used inside the SDL_FilLRect function. In this Issue i'm focused on the 32-4 byte surface case. Long story short there looks to be a 6X performance improvement still on the table.

And I discovered a similar issue with fill, having very inconsistent performance across surface sizes. At first it looked like just a inconsistency issue, but i later discovered that the SDL_memset4 function was causing massive performance deficits.

This is SDL's implementation of the memset function used to fill a solid rect in the 4-byte case:

SDL_FORCE_INLINE void SDL_memset4(void *dst, int val, size_t len)
{
#if defined(__GNUC__) && defined(i386)
    int u0, u1, u2;
    __asm__ __volatile__ (
        "cld \n\t"
        "rep ; stosl \n\t"
        : "=&D" (u0), "=&a" (u1), "=&c" (u2)
        : "0" (dst), "1" (val), "2" (SDL_static_cast(Uint32, len))
        : "memory"
    );
#else
    size_t _n = (len + 3) / 4;
    Uint32 *_p = SDL_static_cast(Uint32 *, dst);
    Uint32 _val = (val);
    if (len == 0)
        return;
    switch (len % 4)
    {
        case 0: do {    *_p++ = _val;
        case 3:         *_p++ = _val;
        case 2:         *_p++ = _val;
        case 1:         *_p++ = _val;
        } while ( --_n );
    }
#endif
}

This is a graph made with the program below. It shows how the performance levels aren't on a smooth curve, but rather on fairly separate levels that cause this inconsistency in performance.

main sdl memses

Text results:

Total time: 2.479284100008954
Mean: 0.004948670858301305
Median: 0.0048034999999799766
Stdev: 0.0032880930364022334

Program:

import pygame
from matplotlib import pyplot as plt
from timeit import timeit
from statistics import mean, stdev, median

pygame.init()

max_size = 500
color = (233, 12, 34)
total_time = 0
data = []

for s in range(max_size + 1):
    base = pygame.Surface((s, s))
    G = globals()
    im_data = timeit("base.fill(color)", globals=G, number=100)
    total_time += im_data
    data.append(im_data)
    plt.scatter(s, im_data)

print(f"Total time: {total_time}"
      f"\nMean: {mean(data)}"
      f"\nMedian: {median(data)}"
        f"\nStdev: {stdev(data)}")

plt.title("Surface.fill() test")
plt.xlabel("Surface size (px)")
plt.ylabel("Time (s)")
plt.show()

I've already tried to find a solution with substituting SDL's function with a custom function made for just the 4-byte case, though something similar could be made for all other cases.

This is the new function (it's similar to SDL_FillRect but only works on 4 byte surfaces):

int
pgFillRect(SDL_Surface *dst, SDL_Rect *rect, Uint32 color)
{
    SDL_Rect clipped;
    Uint8 *pixels;

    if (!dst) {
        return SDL_SetError("Passed NULL destination surface");
    }

    /* If 'rect' == NULL, then fill the whole surface */
    if (rect) {
        /* Perform clipping */
        if (!SDL_IntersectRect(rect, &dst->clip_rect, &clipped)) {
            return 0;
        }
        rect = &clipped;
    }
    else {
        rect = &dst->clip_rect;
        /* Don't attempt to fill if the surface's clip_rect is empty */
        if (SDL_RectEmpty(rect)) {
            return 0;
        }
    }

    /* Perform software fill */
    if (!dst->pixels) {
        return SDL_SetError("SDL_FillRect(): You must lock the surface");
    }

    int h = rect->h;
    int w = rect->w;
    int bpp = dst->format->BytesPerPixel;
    pixels = (Uint8 *)dst->pixels + rect->y * dst->pitch + rect->x * bpp;
    Uint32 *colorarray = (Uint32 *)malloc(sizeof(Uint32) * w);

    for (int i = 0; i < w; i++) {
        colorarray[i] = color;
    }

    while (h--) {
        memcpy(pixels, colorarray, w * bpp);
        pixels += dst->pitch;
    }

    free(colorarray);

    return 0;
}

Results:

results

Results:

Total time: 0.4106500999259879
Mean: 0.0008196608780957842
Median: 0.0005631000021821819
Stdev: 0.0007310078819052501
@itzpr3d4t0r itzpr3d4t0r added Performance Related to the speed or resource usage of the project enhancement labels Aug 10, 2023
@itzpr3d4t0r
Copy link
Member Author

itzpr3d4t0r commented Aug 10, 2023

An update on this. Unfortunately, this suggestion works up until a certain point which i suspect is the processor's cache size. So The approach seems to be extremely valid for small/medium/medium-large surfaces while SDL's approach seems better in the long run. My processor is using the SSE path of SDL's memset4, so it's a bit faster already.
issue_at_higt_dims

We either incroporate a hybrid approach with memcpy for low sizes and fallback to the old one for bigger sizes or we keep it as is.

@itzpr3d4t0r
Copy link
Member Author

itzpr3d4t0r commented Aug 11, 2023

Another update.

I've conducted further testing, and found that we could also implement a similar approach to SDL's SSE implementation of fillrect but for AVX2. I've also made a graph comparing the SDL's implementation (which in my case uses SSE), straight memcpy, an AVX2 implementation similar to the one found in #2382 and an AVX2 implementation similar to the first one i listed but for AVX2.

The results show that a naive AVX2 implementation could be even more reliable than memcpy at low sizes(and shown lows that are better than memcopy's), but the new AVX2 algorithm seems promising for bigget sizes, being even faster on average than SSE. I still think that a hybrid approach would work best, using memcpy up until a certain point (this i
s a critical point that would need to be researched more) and then dispatch to the available instruction set (like SSE or AVX) after that.

all_together

A zoom on the critical interval of 0-1000 which shows how memcpy surpasses every approach. To be coorect memcpy seems a bit weird around 1000 but using it would save tons of lines of code and setup.

image

@Starbuck5
Copy link
Member

Is the surface size in these charts the total area of pixels, or the size of a dimension in a square surface?

I’m surprised AVX2 is not that much better than SSE2. I suppose it’s memory constrained, there isn’t much to gain with the higher width?

That being said, it’s still an improvement. This is something you could implement SDL side and benefit all SDL projects (and without any maintenance burden on pygame-ce)

@MyreMylar MyreMylar added the Surface pygame.Surface label Oct 7, 2023
@itzpr3d4t0r
Copy link
Member Author

Just like #2388, it's uclear what the best strategy would be, both in terms of actual implementation and where to implement it (SDL or here). It's even unclear how much each strategy would scale better on different hardware, so with all of these unknowns I'm closing this issue.

@itzpr3d4t0r itzpr3d4t0r closed this as not planned Won't fix, can't repro, duplicate, stale Mar 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Performance Related to the speed or resource usage of the project Surface pygame.Surface
Projects
None yet
Development

No branches or pull requests

3 participants