`Surface.fill()` performance improvements #2390

itzpr3d4t0r · 2023-08-10T10:36:52Z

While investigating for #2388, I've concluded that the main issue with inconsistent or sub-optimal performance is caused by SDL's custom implementantion for memcpy, but also for memset, as it's used inside the SDL_FilLRect function. In this Issue i'm focused on the 32-4 byte surface case. Long story short there looks to be a 6X performance improvement still on the table.

And I discovered a similar issue with fill, having very inconsistent performance across surface sizes. At first it looked like just a inconsistency issue, but i later discovered that the SDL_memset4 function was causing massive performance deficits.

This is SDL's implementation of the memset function used to fill a solid rect in the 4-byte case:

SDL_FORCE_INLINE void SDL_memset4(void *dst, int val, size_t len)
{
#if defined(__GNUC__) && defined(i386)
    int u0, u1, u2;
    __asm__ __volatile__ (
        "cld \n\t"
        "rep ; stosl \n\t"
        : "=&D" (u0), "=&a" (u1), "=&c" (u2)
        : "0" (dst), "1" (val), "2" (SDL_static_cast(Uint32, len))
        : "memory"
    );
#else
    size_t _n = (len + 3) / 4;
    Uint32 *_p = SDL_static_cast(Uint32 *, dst);
    Uint32 _val = (val);
    if (len == 0)
        return;
    switch (len % 4)
    {
        case 0: do {    *_p++ = _val;
        case 3:         *_p++ = _val;
        case 2:         *_p++ = _val;
        case 1:         *_p++ = _val;
        } while ( --_n );
    }
#endif
}

This is a graph made with the program below. It shows how the performance levels aren't on a smooth curve, but rather on fairly separate levels that cause this inconsistency in performance.

Text results:

Total time: 2.479284100008954
Mean: 0.004948670858301305
Median: 0.0048034999999799766
Stdev: 0.0032880930364022334

Program:

import pygame
from matplotlib import pyplot as plt
from timeit import timeit
from statistics import mean, stdev, median

pygame.init()

max_size = 500
color = (233, 12, 34)
total_time = 0
data = []

for s in range(max_size + 1):
    base = pygame.Surface((s, s))
    G = globals()
    im_data = timeit("base.fill(color)", globals=G, number=100)
    total_time += im_data
    data.append(im_data)
    plt.scatter(s, im_data)

print(f"Total time: {total_time}"
      f"\nMean: {mean(data)}"
      f"\nMedian: {median(data)}"
        f"\nStdev: {stdev(data)}")

plt.title("Surface.fill() test")
plt.xlabel("Surface size (px)")
plt.ylabel("Time (s)")
plt.show()

I've already tried to find a solution with substituting SDL's function with a custom function made for just the 4-byte case, though something similar could be made for all other cases.

This is the new function (it's similar to SDL_FillRect but only works on 4 byte surfaces):

int
pgFillRect(SDL_Surface *dst, SDL_Rect *rect, Uint32 color)
{
    SDL_Rect clipped;
    Uint8 *pixels;

    if (!dst) {
        return SDL_SetError("Passed NULL destination surface");
    }

    /* If 'rect' == NULL, then fill the whole surface */
    if (rect) {
        /* Perform clipping */
        if (!SDL_IntersectRect(rect, &dst->clip_rect, &clipped)) {
            return 0;
        }
        rect = &clipped;
    }
    else {
        rect = &dst->clip_rect;
        /* Don't attempt to fill if the surface's clip_rect is empty */
        if (SDL_RectEmpty(rect)) {
            return 0;
        }
    }

    /* Perform software fill */
    if (!dst->pixels) {
        return SDL_SetError("SDL_FillRect(): You must lock the surface");
    }

    int h = rect->h;
    int w = rect->w;
    int bpp = dst->format->BytesPerPixel;
    pixels = (Uint8 *)dst->pixels + rect->y * dst->pitch + rect->x * bpp;
    Uint32 *colorarray = (Uint32 *)malloc(sizeof(Uint32) * w);

    for (int i = 0; i < w; i++) {
        colorarray[i] = color;
    }

    while (h--) {
        memcpy(pixels, colorarray, w * bpp);
        pixels += dst->pitch;
    }

    free(colorarray);

    return 0;
}

Results:

Total time: 0.4106500999259879
Mean: 0.0008196608780957842
Median: 0.0005631000021821819
Stdev: 0.0007310078819052501

The text was updated successfully, but these errors were encountered:

itzpr3d4t0r · 2023-08-10T14:20:16Z

An update on this. Unfortunately, this suggestion works up until a certain point which i suspect is the processor's cache size. So The approach seems to be extremely valid for small/medium/medium-large surfaces while SDL's approach seems better in the long run. My processor is using the SSE path of SDL's memset4, so it's a bit faster already.

We either incroporate a hybrid approach with memcpy for low sizes and fallback to the old one for bigger sizes or we keep it as is.

itzpr3d4t0r · 2023-08-11T12:29:34Z

Another update.

I've conducted further testing, and found that we could also implement a similar approach to SDL's SSE implementation of fillrect but for AVX2. I've also made a graph comparing the SDL's implementation (which in my case uses SSE), straight memcpy, an AVX2 implementation similar to the one found in #2382 and an AVX2 implementation similar to the first one i listed but for AVX2.

The results show that a naive AVX2 implementation could be even more reliable than memcpy at low sizes(and shown lows that are better than memcopy's), but the new AVX2 algorithm seems promising for bigget sizes, being even faster on average than SSE. I still think that a hybrid approach would work best, using memcpy up until a certain point (this i
s a critical point that would need to be researched more) and then dispatch to the available instruction set (like SSE or AVX) after that.

A zoom on the critical interval of 0-1000 which shows how memcpy surpasses every approach. To be coorect memcpy seems a bit weird around 1000 but using it would save tons of lines of code and setup.

Starbuck5 · 2023-08-13T09:23:30Z

Is the surface size in these charts the total area of pixels, or the size of a dimension in a square surface?

I’m surprised AVX2 is not that much better than SSE2. I suppose it’s memory constrained, there isn’t much to gain with the higher width?

That being said, it’s still an improvement. This is something you could implement SDL side and benefit all SDL projects (and without any maintenance burden on pygame-ce)

itzpr3d4t0r · 2024-03-31T17:34:12Z

Just like #2388, it's uclear what the best strategy would be, both in terms of actual implementation and where to implement it (SDL or here). It's even unclear how much each strategy would scale better on different hardware, so with all of these unknowns I'm closing this issue.

itzpr3d4t0r added Performance Related to the speed or resource usage of the project enhancement labels Aug 10, 2023

MyreMylar added the Surface pygame.Surface label Oct 7, 2023

itzpr3d4t0r mentioned this issue Jan 5, 2024

SIMD surface.fill #2657

Closed

itzpr3d4t0r closed this as not planned Won't fix, can't repro, duplicate, stale Mar 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Surface.fill()` performance improvements #2390

`Surface.fill()` performance improvements #2390

itzpr3d4t0r commented Aug 10, 2023 •

edited

Loading

itzpr3d4t0r commented Aug 10, 2023 •

edited

Loading

itzpr3d4t0r commented Aug 11, 2023 •

edited

Loading

Starbuck5 commented Aug 13, 2023

itzpr3d4t0r commented Mar 31, 2024

Surface.fill() performance improvements #2390

Surface.fill() performance improvements #2390

Comments

itzpr3d4t0r commented Aug 10, 2023 • edited Loading

itzpr3d4t0r commented Aug 10, 2023 • edited Loading

itzpr3d4t0r commented Aug 11, 2023 • edited Loading

Starbuck5 commented Aug 13, 2023

itzpr3d4t0r commented Mar 31, 2024

`Surface.fill()` performance improvements #2390

`Surface.fill()` performance improvements #2390

itzpr3d4t0r commented Aug 10, 2023 •

edited

Loading

itzpr3d4t0r commented Aug 10, 2023 •

edited

Loading

itzpr3d4t0r commented Aug 11, 2023 •

edited

Loading