
Copying off-screen images to screen is not always optimal in Xorg fb code #9

hglm opened this issue May 22, 2013 · 10 comments


hglm commented May 22, 2013

Modern applications often write into an off-screen pixmap or image that is regularly transferred to the screen window. Depending on the method used, the code paths used by the Xorg fb driver for this purpose may not always be optimal. This is especially true when the ShmPutImage method is used, and to a lesser extent for normal (piped) PutImage. The newer Xorg shared memory pixmap extension is not affected and is already optimal since it uses CopyArea.

Looking at the Xorg ShmPutImage implementation, it may use either CopyArea (which is optimized in the sunxifb driver) or PutImage. Running x11perf -shmput500 triggers the CopyArea path, so it does not show the weakness. However, it appears that the PutImage path is regularly used by applications in practice.
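Whichever path is taken, the transfer itself boils down to a row-by-row copy between two buffers with independent strides. A minimal portable sketch of such a blit (names are mine, not the driver's):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: a generic image-to-framebuffer copy is a
 * per-row memcpy between buffers whose strides (row pitches, in bytes)
 * differ.  NEON-optimized blt functions do the same work with vector
 * loads/stores and framebuffer-friendly access patterns. */
static void blt_rows(uint8_t *dst, int dst_stride,
                     const uint8_t *src, int src_stride,
                     int width_bytes, int height)
{
    for (int y = 0; y < height; y++)
        memcpy(dst + (size_t)y * dst_stride,
               src + (size_t)y * src_stride,
               width_bytes);
}
```

The benchmarks below essentially measure how efficiently the server performs this loop for each protocol path.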

Using a slightly modified version of a program I found on the web that repeatedly writes into an image and transfers it to the screen, I benchmarked an optimized version of PutImage that uses the NEON optimized blt functions that are already present in the sunxifb driver:

512x512 image, 1920x1080x32bpp@60Hz, results in Mpix/s

                           XPutImage XShmPutImage Pixmap SHM
32bpp, non optimized       14.57     27.84        40.03
32bpp, optimized PutImage  17.77     40.22        40.76

200x200 image

                           XPutImage XShmPutImage Pixmap SHM
32bpp, non-optimized       13.46     25.59        33.23
32bpp, optimized PutImage  15.36     34.04        34.34

500x500 image, 1280x720x32bpp@60Hz

                           XPutImage XShmPutImage Pixmap SHM
32bpp, non-optimized       14.80     26.79        35.14
32bpp, optimized PutImage  19.72     47.17        47.30

The 35.14 value is anomalous and probably a typo.

It is clear that the PutImage path is not optimal. This may be even more true for 16bpp mode. I will make available a patch that implements the optimized PutImage.


hglm commented May 22, 2013

At the moment overlapped_blt_neon is separated into the CPU back-end and not easily accessible if the G2D back-end is selected. For off-screen to screen blitting, is there any point in using overlapped_blt_neon? I guess it is optimized for screen-to-screen blits. Is pixman (using NEON) good enough for this purpose?

BTW, I have run the X Test Suite on my patches on a bare server with encouraging results, although the optimized PutImage implementation is showing some errors at 16bpp related to ROPs.


ssvb commented May 24, 2013

Yes, overlapped_blt_neon is mostly optimized for reading from uncached sources. In addition, because it performs reads in aligned 32-byte chunks, it can access memory before and after the source buffer. For normal malloc-allocated pixmap buffers, this may result in reads before and after the allocated block, causing valgrind to complain. There are similar optimizations in glibc, which are handled by valgrind suppressions. But in our case it is just not worth the effort.
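The over-read comes purely from address rounding. A sketch of the arithmetic (illustrative only, not the actual NEON code):

```c
#include <stdint.h>
#include <stddef.h>

/* A reader that fetches aligned 32-byte chunks rounds the start address
 * down and the end address up to a 32-byte boundary, so for an
 * unaligned buffer it touches bytes outside the allocation -- harmless
 * on this hardware, but flagged by valgrind. */
#define CHUNK 32u

static uintptr_t chunk_start(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CHUNK - 1);      /* round down */
}

static uintptr_t chunk_end(uintptr_t addr, size_t len)
{
    return (addr + len + CHUNK - 1) & ~(uintptr_t)(CHUNK - 1); /* round up */
}
```

For a buffer at 0x1005 of length 10, the chunked reader spans 0x1000..0x1020, i.e. bytes both before and after the allocation.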

The NEON code from pixman is generally good as long as you are only reading cached buffers. Pixman does not support operations with overlapped source/destination though.

As you have noticed, we also need to have a fallback from G2D to CPU backend for unsupported operations (not having both source and destination inside the framebuffer, or dragging windows to the right). G2D support is still not really complete.


hglm commented May 29, 2013

> As you have noticed, we also need to have a fallback from G2D to CPU backend for unsupported operations (not having both source and destination inside the framebuffer, or dragging windows to the right). G2D support is still not really complete.

Indeed, a fallback from G2D to the CPU backend when dragging windows to the right is needed for better performance in this case. When G2D is enabled, a "reversed" (right to left with src_y == dst_y) blit triggers the fbBlt code, which is very slow, as can be seen from the following benchmarks using benchx (1920x1080x32bpp):

G2D enabled (triggers fbBlt):

ScreenCopyRightwards (5 x 5): 55635.62 ops/sec (5.31 MB/s), CPU 89% + 9% = 99%
ScreenCopyRightwards (7 x 7): 49420.80 ops/sec (9.24 MB/s), CPU 91% + 6% = 98%
ScreenCopyRightwards (10 x 10): 31897.66 ops/sec (12.17 MB/s), CPU 95% + 4% = 99%
ScreenCopyRightwards (15 x 15): 17798.71 ops/sec (15.28 MB/s), CPU 97% + 1% = 99%
ScreenCopyRightwards (22 x 22): 8392.44 ops/sec (15.50 MB/s), CPU 91% + 5% = 97%
ScreenCopyRightwards (33 x 33): 4764.56 ops/sec (19.79 MB/s), CPU 83% + 10% = 94%
ScreenCopyRightwards (49 x 49): 2550.80 ops/sec (23.36 MB/s), CPU 95% + 2% = 98%
ScreenCopyRightwards (73 x 73): 1186.38 ops/sec (24.12 MB/s), CPU 96% + 0% = 97%
ScreenCopyRightwards (109 x 109): 551.54 ops/sec (25.00 MB/s), CPU 99% + 0% = 99%
ScreenCopyRightwards (163 x 163): 248.29 ops/sec (25.16 MB/s), CPU 99% + 0% = 99%
ScreenCopyRightwards (244 x 244): 109.56 ops/sec (24.88 MB/s), CPU 97% + 0% = 98%
ScreenCopyRightwards (366 x 366): 49.38 ops/sec (25.23 MB/s), CPU 99% + 0% = 99%
ScreenCopyRightwards (549 x 549): 22.01 ops/sec (25.31 MB/s), CPU 99% + 0% = 99%

G2D disabled (triggers overlapped_blt_neon):

ScreenCopyRightwards (5 x 5): 76151.64 ops/sec (7.26 MB/s), CPU 83% + 14% = 97%
ScreenCopyRightwards (7 x 7): 89401.41 ops/sec (16.71 MB/s), CPU 85% + 12% = 98%
ScreenCopyRightwards (10 x 10): 76142.92 ops/sec (29.05 MB/s), CPU 87% + 10% = 98%
ScreenCopyRightwards (15 x 15): 62945.30 ops/sec (54.03 MB/s), CPU 86% + 11% = 97%
ScreenCopyRightwards (22 x 22): 50601.21 ops/sec (93.43 MB/s), CPU 91% + 7% = 99%
ScreenCopyRightwards (33 x 33): 30193.92 ops/sec (125.43 MB/s), CPU 90% + 7% = 98%
ScreenCopyRightwards (49 x 49): 18217.87 ops/sec (166.86 MB/s), CPU 90% + 7% = 97%
ScreenCopyRightwards (73 x 73): 9546.63 ops/sec (194.07 MB/s), CPU 95% + 3% = 98%
ScreenCopyRightwards (109 x 109): 5012.64 ops/sec (227.18 MB/s), CPU 98% + 0% = 99%
ScreenCopyRightwards (163 x 163): 2416.72 ops/sec (244.94 MB/s), CPU 99% + 0% = 100%
ScreenCopyRightwards (244 x 244): 1130.12 ops/sec (256.66 MB/s), CPU 99% + 0% = 99%
ScreenCopyRightwards (366 x 366): 516.80 ops/sec (264.09 MB/s), CPU 98% + 0% = 99%
ScreenCopyRightwards (549 x 549): 233.85 ops/sec (268.87 MB/s), CPU 97% + 1% = 99%

Fixing this requires overlapped_blt_neon to be available in some way when G2D is enabled.
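For reference, the direction rule an overlapped blit has to follow (the same rule memmove applies), sketched for a single row with illustrative names, not the driver's actual code:

```c
#include <stdint.h>

/* When source and destination overlap in the same row and dst > src
 * (dragging a window to the right), a plain left-to-right copy
 * overwrites source pixels before reading them, so the copy must run
 * right-to-left.  overlapped_blt_neon handles this case; plain pixman
 * blits do not support overlapping source/destination. */
static void overlapped_row_copy(uint32_t *row, int src_x, int dst_x, int w)
{
    if (dst_x > src_x && src_x + w > dst_x) {
        for (int i = w - 1; i >= 0; i--)   /* backward: overlap to the right */
            row[dst_x + i] = row[src_x + i];
    } else {
        for (int i = 0; i < w; i++)        /* forward is safe otherwise */
            row[dst_x + i] = row[src_x + i];
    }
}
```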


hglm commented Jun 3, 2013

Returning to the title subject (PutImage):

> I'm a bit undecided about this one. The speedup is not so major, especially considering that everyone should be using ShmPutImage anyway. And the performance issue itself entirely belongs to the Xorg 'fb' module (it needs no knowledge about the target hardware, does not have to handle overlapping, has no read back from the framebuffer, etc.). It would be nice to eventually fix it in upstream Xorg, and benefit from upstream patch review and larger scale testing by more users.

I agree that it belongs in upstream, and the speed-up is not so major. When I was seeing significant speed-ups in ShmPutImage initially, I thought it would be a nice improvement, but this speed-up hasn't been confirmed in further testing. It might still be worth investigating whether any real-world performance gains could be made for ShmPutImage (depending on whether it calls the PutImage code path instead of the CopyArea path).


hglm commented Jun 5, 2013

As it turns out, the slow code path in Xorg also happens for ShmPutImage. It is triggered when the width of the image to be copied to the screen is equal to the full image width as stored in the SHM image, and not a subimage. This may be a fairly common case in applications. The slow-down is pretty significant for these cases.
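A toy model of the dispatch as just described (hypothetical names, not the actual xserver code):

```c
/* Sketch of the observed behavior: ShmPutImage can wrap the SHM segment
 * in a scratch pixmap and use the fast CopyArea path only for a
 * sub-image; when the requested width equals the full stored image
 * width, it falls through to the slow generic PutImage path. */
typedef enum { PATH_COPY_AREA, PATH_PUT_IMAGE } shm_path;

static shm_path shm_put_image_path(int src_w, int total_w)
{
    return (src_w == total_w) ? PATH_PUT_IMAGE : PATH_COPY_AREA;
}
```

Ironically, the full-width case is the most contiguous one and should be the easiest to copy quickly, which is why the slow path here is so unnecessary.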

A comparison follows of running benchx with the ShmPutImageFullWidth and AlignedShmPutImageFullWidth tests at 1920x1080x32bpp, with standard sunxifb and a version of sunxifb that hooks PutImage and redirects calls to pixman if possible. The speed-up is even greater at lower resolutions and color depths (due to the higher bandwidth available). This really is a stupid and unnecessary flaw in the Xorg fb/fbdev driver that is trivially fixed.

ShmPutImageFullWidth (5 x 5): Speed up 9%
ShmPutImageFullWidth (7 x 7): Slow down 5%
ShmPutImageFullWidth (22 x 22): Speed up 8%
ShmPutImageFullWidth (49 x 49): Speed up 19%
ShmPutImageFullWidth (73 x 73): Speed up 55%
ShmPutImageFullWidth (109 x 109): Speed up 50%
ShmPutImageFullWidth (163 x 163): Speed up 37%
ShmPutImageFullWidth (244 x 244): Speed up 111%
ShmPutImageFullWidth (366 x 366): Speed up 77%
ShmPutImageFullWidth (549 x 549): Speed up 92%
AlignedShmPutImageFullWidth (5 x 5): Slow down 14%
AlignedShmPutImageFullWidth (7 x 7): Slow down 6%
AlignedShmPutImageFullWidth (15 x 15): Speed up 10%
AlignedShmPutImageFullWidth (22 x 22): Speed up 9%
AlignedShmPutImageFullWidth (33 x 33): Speed up 21%
AlignedShmPutImageFullWidth (49 x 49): Speed up 28%
AlignedShmPutImageFullWidth (73 x 73): Speed up 30%
AlignedShmPutImageFullWidth (109 x 109): Speed up 47%
AlignedShmPutImageFullWidth (163 x 163): Speed up 38%
AlignedShmPutImageFullWidth (244 x 244): Speed up 63%
AlignedShmPutImageFullWidth (366 x 366): Speed up 84%
AlignedShmPutImageFullWidth (549 x 549): Speed up 89%

Full output:

Unoptimized:

Screen size 1920 x 1080, depth 24 (32 bpp), area size 600 x 600
ShmPutImageFullWidth (5 x 5): 116790.35 ops/sec (11.14 MB/s), CPU 70% + 23% = 93%
ShmPutImageFullWidth (7 x 7): 147224.38 ops/sec (27.52 MB/s), CPU 73% + 22% = 95%
ShmPutImageFullWidth (10 x 10): 131139.05 ops/sec (50.03 MB/s), CPU 72% + 21% = 93%
ShmPutImageFullWidth (15 x 15): 112567.60 ops/sec (96.62 MB/s), CPU 81% + 16% = 97%
ShmPutImageFullWidth (22 x 22): 88794.38 ops/sec (163.94 MB/s), CPU 82% + 14% = 97%
ShmPutImageFullWidth (33 x 33): 59980.08 ops/sec (249.17 MB/s), CPU 86% + 11% = 97%
ShmPutImageFullWidth (49 x 49): 37509.17 ops/sec (343.55 MB/s), CPU 90% + 7% = 98%
ShmPutImageFullWidth (73 x 73): 16941.51 ops/sec (344.40 MB/s), CPU 80% + 12% = 93%
ShmPutImageFullWidth (109 x 109): 9193.11 ops/sec (416.65 MB/s), CPU 94% + 4% = 99%
ShmPutImageFullWidth (163 x 163): 4622.99 ops/sec (468.55 MB/s), CPU 98% + 1% = 100%
ShmPutImageFullWidth (244 x 244): 1376.27 ops/sec (312.57 MB/s), CPU 96% + 2% = 99%
ShmPutImageFullWidth (366 x 366): 452.01 ops/sec (230.98 MB/s), CPU 98% + 1% = 100%
ShmPutImageFullWidth (549 x 549): 179.05 ops/sec (205.86 MB/s), CPU 99% + 0% = 99%
Screen size 1920 x 1080, depth 24 (32 bpp), area size 600 x 600
AlignedShmPutImageFullWidth (5 x 5): 128156.43 ops/sec (12.22 MB/s), CPU 69% + 24% = 93%
AlignedShmPutImageFullWidth (7 x 7): 146385.49 ops/sec (27.36 MB/s), CPU 74% + 22% = 97%
AlignedShmPutImageFullWidth (10 x 10): 132538.94 ops/sec (50.56 MB/s), CPU 77% + 19% = 96%
AlignedShmPutImageFullWidth (15 x 15): 110090.07 ops/sec (94.49 MB/s), CPU 85% + 11% = 97%
AlignedShmPutImageFullWidth (22 x 22): 89368.89 ops/sec (165.00 MB/s), CPU 86% + 11% = 97%
AlignedShmPutImageFullWidth (33 x 33): 58461.60 ops/sec (242.86 MB/s), CPU 85% + 11% = 97%
AlignedShmPutImageFullWidth (49 x 49): 36987.81 ops/sec (338.77 MB/s), CPU 94% + 4% = 98%
AlignedShmPutImageFullWidth (73 x 73): 20396.15 ops/sec (414.62 MB/s), CPU 95% + 4% = 100%
AlignedShmPutImageFullWidth (109 x 109): 9442.66 ops/sec (427.96 MB/s), CPU 98% + 1% = 99%
AlignedShmPutImageFullWidth (163 x 163): 4607.43 ops/sec (466.98 MB/s), CPU 98% + 1% = 100%
AlignedShmPutImageFullWidth (244 x 244): 1626.23 ops/sec (369.34 MB/s), CPU 98% + 0% = 99%
AlignedShmPutImageFullWidth (366 x 366): 457.02 ops/sec (233.54 MB/s), CPU 99% + 0% = 99%
AlignedShmPutImageFullWidth (549 x 549): 179.23 ops/sec (206.07 MB/s), CPU 99% + 0% = 100%

Optimized:

Screen size 1920 x 1080, depth 24 (32 bpp), area size 600 x 600
ShmPutImageFullWidth (5 x 5): 127875.45 ops/sec (12.20 MB/s), CPU 75% + 20% = 95%
ShmPutImageFullWidth (7 x 7): 139339.98 ops/sec (26.05 MB/s), CPU 75% + 21% = 96%
ShmPutImageFullWidth (10 x 10): 130371.98 ops/sec (49.73 MB/s), CPU 74% + 20% = 94%
ShmPutImageFullWidth (15 x 15): 115001.83 ops/sec (98.71 MB/s), CPU 80% + 15% = 95%
ShmPutImageFullWidth (22 x 22): 96109.40 ops/sec (177.45 MB/s), CPU 83% + 13% = 96%
ShmPutImageFullWidth (33 x 33): 60831.40 ops/sec (252.71 MB/s), CPU 78% + 15% = 93%
ShmPutImageFullWidth (49 x 49): 44682.81 ops/sec (409.25 MB/s), CPU 84% + 11% = 96%
ShmPutImageFullWidth (73 x 73): 26238.86 ops/sec (533.40 MB/s), CPU 93% + 4% = 98%
ShmPutImageFullWidth (109 x 109): 13744.38 ops/sec (622.93 MB/s), CPU 98% + 1% = 100%
ShmPutImageFullWidth (163 x 163): 6320.21 ops/sec (640.57 MB/s), CPU 98% + 1% = 99%
ShmPutImageFullWidth (244 x 244): 2909.55 ops/sec (660.79 MB/s), CPU 98% + 0% = 99%
ShmPutImageFullWidth (366 x 366): 800.46 ops/sec (409.04 MB/s), CPU 94% + 3% = 98%
ShmPutImageFullWidth (549 x 549): 343.55 ops/sec (395.00 MB/s), CPU 99% + 0% = 100%
Screen size 1920 x 1080, depth 24 (32 bpp), area size 600 x 600
AlignedShmPutImageFullWidth (5 x 5): 110843.24 ops/sec (10.57 MB/s), CPU 73% + 22% = 95%
AlignedShmPutImageFullWidth (7 x 7): 138109.82 ops/sec (25.82 MB/s), CPU 74% + 20% = 95%
AlignedShmPutImageFullWidth (10 x 10): 131585.14 ops/sec (50.20 MB/s), CPU 77% + 19% = 96%
AlignedShmPutImageFullWidth (15 x 15): 120740.70 ops/sec (103.63 MB/s), CPU 76% + 20% = 97%
AlignedShmPutImageFullWidth (22 x 22): 97351.17 ops/sec (179.74 MB/s), CPU 81% + 15% = 97%
AlignedShmPutImageFullWidth (33 x 33): 70962.85 ops/sec (294.79 MB/s), CPU 86% + 12% = 98%
AlignedShmPutImageFullWidth (49 x 49): 47360.65 ops/sec (433.78 MB/s), CPU 91% + 6% = 98%
AlignedShmPutImageFullWidth (73 x 73): 26443.97 ops/sec (537.57 MB/s), CPU 96% + 3% = 99%
AlignedShmPutImageFullWidth (109 x 109): 13870.96 ops/sec (628.67 MB/s), CPU 95% + 3% = 99%
AlignedShmPutImageFullWidth (163 x 163): 6369.61 ops/sec (645.58 MB/s), CPU 98% + 1% = 100%
AlignedShmPutImageFullWidth (244 x 244): 2645.23 ops/sec (600.76 MB/s), CPU 98% + 1% = 100%
AlignedShmPutImageFullWidth (366 x 366): 840.93 ops/sec (429.71 MB/s), CPU 98% + 0% = 99%
AlignedShmPutImageFullWidth (549 x 549): 338.47 ops/sec (389.16 MB/s), CPU 97% + 1% = 99%


ssvb commented Jun 6, 2013

> As it turns out, the slow code path in Xorg also happens for ShmPutImage. It is triggered
> when the width of the image to be copied to the screen is equal to the full image width
> as stored in the SHM image, and not a subimage. This may be a fairly common case
> in applications. The slow-down is pretty significant for these cases.

Thanks! That's a really good catch. Could you please make a patch with an updated commit message so that I can just apply it? Finding an example of some popular application triggering the slow path would also be useful.

Also, eventually we probably want to get this performance problem fixed in Xorg. Xorg itself has a 6-month release cycle, and Linux distributions lag behind it and do not always provide the latest Xorg version, which causes additional delay. This makes it worth having a workaround in the DDX to deploy the solution to end users sooner, but Xorg developers also need to be notified (maybe via the https://bugs.freedesktop.org/ bugzilla). Maybe you could also try to make a patch for Xorg yourself. That's your find, and it would be nice if you get properly credited for it :)


hglm commented Jun 6, 2013

OK, will make a patch with an updated commit message, and I'll investigate how common the operation is.

As for Xorg, I can imagine some resistance to patching the "stable" fb layer, and as you note it would take a long time for the change to trickle down to releases that are actually used. I may report it to the bugzilla.

Slightly off-topic: the patch helps the RPi even more, since CPU cycles are more expensive there. Running rpifb at 16bpp gives the following speed-ups:

ShmPutImageFullWidth (10 x 10): Speed up 7%
ShmPutImageFullWidth (15 x 15): Speed up 9%
ShmPutImageFullWidth (22 x 22): Speed up 15%
ShmPutImageFullWidth (33 x 33): Speed up 16%
ShmPutImageFullWidth (49 x 49): Speed up 34%
ShmPutImageFullWidth (73 x 73): Speed up 49%
ShmPutImageFullWidth (109 x 109): Speed up 70%
ShmPutImageFullWidth (163 x 163): Speed up 106%
ShmPutImageFullWidth (244 x 244): Speed up 135%
ShmPutImageFullWidth (366 x 366): Speed up 132%
ShmPutImageFullWidth (549 x 549): Speed up 141%
AlignedShmPutImageFullWidth (15 x 15): Speed up 5%
AlignedShmPutImageFullWidth (22 x 22): Slow down 11%, CPU usage decrease from 95% to 75%
AlignedShmPutImageFullWidth (33 x 33): Speed up 21%
AlignedShmPutImageFullWidth (49 x 49): Speed up 37%
AlignedShmPutImageFullWidth (73 x 73): Speed up 55%
AlignedShmPutImageFullWidth (109 x 109): Speed up 87%
AlignedShmPutImageFullWidth (163 x 163): Speed up 136%
AlignedShmPutImageFullWidth (244 x 244): Speed up 101%
AlignedShmPutImageFullWidth (366 x 366): Speed up 168%
AlignedShmPutImageFullWidth (549 x 549): Speed up 198%


hglm commented Jun 7, 2013

I've rebased my various-optimization-stable branch of xf86-video-sunxifb at https://github.com/hglm/xf86-video-sunxifb. It now contains a new commit implementing the PutImage optimization.

I've investigated a little how often the sub-optimal PutImage code path is used by applications. It seems to be fairly common: the LXDE/openbox GUI environment uses it for things like status bars and window decorations, and it is used for display of images by the Midori browser. Regular GTK user interface elements (e.g. gtkperf) don't seem to trigger it, however.


ssvb commented Jun 12, 2013

Thanks, pushed with the following tweaks:

--- a/src/sunxi_x_g2d.c
+++ b/src/sunxi_x_g2d.c
@@ -215,7 +215,7 @@ xCopyArea(DrawablePtr pSrcDrawable,
  * The following function is adapted from xserver/fb/fbPutImage.c.
  */

-void xPutImage(DrawablePtr pDrawable,
+static void xPutImage(DrawablePtr pDrawable,
            GCPtr pGC,
            int depth,
            int x, int y, int w, int h, int leftPad, int format, char *pImage)
@@ -286,7 +286,7 @@ void xPutImage(DrawablePtr pDrawable,
                  y1 + dstYoff, w,
                  h);
         }
-        // otherwise fall back to fb */
+        /* otherwise fall back to fb */
         if (!done)
             fbBlt(src + (y1 - y) * srcStride,
                   srcStride,

Profiling Firefox ("gfx.xrender.enabled" is set to "false" in about:config)
with Fishtank demo:
http://ie.microsoft.com/testdrive/Performance/FishIETank/Default.html

Before:

 18.60% firefox  libpixman-1.so.0.30.0 [.] pixman_composite_over_8888_8888_asm_neon
 13.72%       X  libfb.so              [.] fbBlt
 12.51% firefox  libpixman-1.so.0.30.0 [.] pixman_scaled_bilinear_scanline_8888_8888_OVER_asm_neon
  7.07% firefox  libpixman-1.so.0.30.0 [.] pixman_composite_src_8888_8888_asm_neon
  5.68% firefox  libpixman-1.so.0.30.0 [.] pixman_composite_src_n_8888_asm_neon
  2.06% firefox  libpixman-1.so.0.30.0 [.] pixman_scaled_bilinear_scanline_8888_8_8888_OVER_asm_neon

After:

 19.85% firefox  libpixman-1.so.0.30.0 [.] pixman_composite_over_8888_8888_asm_neon
 16.18% firefox  libpixman-1.so.0.30.0 [.] pixman_scaled_bilinear_scanline_8888_8888_OVER_asm_neon
  7.42% firefox  libpixman-1.so.0.30.0 [.] pixman_composite_src_8888_8888_asm_neon
  5.91% firefox  libpixman-1.so.0.30.0 [.] pixman_composite_src_n_8888_asm_neon
  5.13%       X  libpixman-1.so.0.30.0 [.] pixman_composite_src_8888_8888_asm_neon
  2.29% firefox  libpixman-1.so.0.30.0 [.] pixman_scaled_bilinear_scanline_8888_8_8888_OVER_asm_neon

13.72% fbBlt -> 5.13% pixman_composite_src_8888_8888_asm_neon is a nice improvement.


ssvb commented Jun 12, 2013

Though this issue is still open until it gets resolved in Xorg :-) Having duct-tape band-aid code is not very good.
