-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copying off-sceen images to screen is not always optimal in Xorg fb code #9
Comments
At the moment overlapped_blt_neon is seperated into the CPU back-end and not easily accessible if the G2D back-end is selected. For off-screen to screen blitting, is there in any point in using overlapped_blt_neon? I guess it is optimized for screen-to-screen blits. Is pixman (using NEON) good enough for this purpose? BTW. I have run the X Test Suite on my patches on a bare server with encouraging results, although the optimized PutImage implementation is showing some errors at 16bpp related to ROPS. |
Yes, overlapped_blt_neon is mostly optimized for reading from uncached sources. In addition, because it performs reading using aligned 32 byte chunks, it can access memory before and after the source buffer. For normal malloc allocated pixmap buffers, this may result in the reads before and after the allocated block, causing valgrind to complain. There are similar optimizations in glibs, which are handled by valgrind suppressions. But in our case it is just not worth the efforts. The NEON code from pixman is generally good as long as you are only reading cached buffers. Pixman does not support operations with overlapped source/destination though. As you have noticed, we also need to have a fallback from G2D to CPU backend for unsupported operations (not having both source and destination inside the framebuffer, or dragging windows to the right). G2D support is still not really complete. |
Indeed, a fallback from G2D to the CPU backend when dragging windows to the right is needed for better performance in this case. When G2D is enabled, a "reversed" (right to left with src_y == dst_y) blit triggers the fbBlt code, which is very slow, as can be seen from the following benchmarks using benchx (1920x1080x32bpp): G2D enabled (triggers fbBlt):
G2D disabled (triggers overlapped_blt_neon):
Fixing this requires overlapped_blt_neon to be available in some way when G2D is enabled. |
Returning to the title subject (PutImage):
I agree that it belongs in upstream, and the speed-up is not so major. When I was seeing significant speed-ups in ShmPutImage initially, I thought it be a nice improvement, but this speed-up hasn't been confirmed in further testing. It might still be worth investigating whether any real-world performance gains could be made for ShmPutImage (depending on whether it calls the PutImage code path instead of the CopyArea path). |
As it turns out, the slow code path in Xorg also happens for ShmPutImage. It is triggered when the width of the image to be copied to the screen is equal to the full image width as stored in the SHM image, and not a subimage. This may be a fairly common case in applications. The slow-down is pretty significant for these cases. A comparison follows of running benchx with the ShmPutImageFullWidth and AlignedShmPutImageFullWidth tests at 1920x1080x32bpp with standard sunxifb and a version of sunxifb that hooks PutImage and redirects calls to pixman if possible. The speed-up is even greater at lower resolutions and color depth (due to the higher bandwidth available). This really is a stupid and unneccessary flaw in Xorg fb/fbdev driver that is trivially fixed.
Full output: Unoptimized:
Optimized:
|
Thanks! That's a really good catch. Could you please make a patch with an updated commit message so that I could just apply it? Finding an example of some popular application triggering the slow path would be also useful. Also eventually we probably want to get this performance problem fixed in Xorg. The Xorg itself has 6 month release cycle. And then the Linux distributions are lagging behind and not always providing the latest Xorg version, which causes additional delay. This makes it worth having a workaround in DDX to deploy the solution to the end users sooner, but Xorg developers also need to be notified (maybe via https://bugs.freedesktop.org/ bugzilla). Maybe you could also try to make a patch for Xorg yourself. That's your find, and it would be nice if you get properly credited for it :) |
OK, will make a patch with an updated commit message, and I'll investigate how common the operation is. As for Xorg, I can imagine some resistance to patching the "stable" fb layer, and as you note it would take a long time for the change to trickle down to releases that are actually used. I may report it to the bugzilla. Slightly off-topic, the patch helps the RPi even more due to more expensive CPU cycles, running rpifb at 16bpp gives the following speed-up:
|
I've rebased my various-optimization-stable branch of xf86-video-sunxifb at https://github.com/hglm/xf86-video-sunxifb. It now contains a new commit implementing the PutImage optimization. I've investigated a little how often the sub-optimal PutImage code path is used by applications. It seems to be fairly common, the LXDE/openbox GUI environment uses it for things like status bars and window decorations, and it is used for display of images by the Midori browser. Regular GTK user interface elements (e.g. gktperf) don't seem to trigger it however. |
Thanks, pushed with the following tweaks:
Profiling Firefox ("gfx.xrender.enabled" is set to "false" in about:config)
13.72% fbBlt -> 5.13% pixman_composite_src_8888_8888_asm_neon is a nice improvement. |
Though this issue is still open until it it gets resolved in Xorg :-) Having duct-tape band aid code is not very good. |
Modern applications often write into an off-screen pixmap or image that is regularly transferred to the screen window. Depending on the method used, it appears that the code paths used by the Xorg fb driver for this purpose may not always be optimal. This is especialy true if the ShmPutImage method is used, and to lesser extent for normal (piped) PutImage. The newer Xorg shared memory pixmap extension is not affected and is already optimal since it uses CopyArea.
Looking a the Xorg ShmPutImage implementation, it may use either CopyArea (which is optimized in the sunxifb driver) or PutImage. Running x11perf -shmput500 triggers the CopyArea path so it does not show the weakness. However, it appears that the PutImage path is regularly used by applications in practice.
Using a slightly modified version of a program I found on the web that repeatedly writes into an image and transfers it to the screen, I benchmarked an optimized version of PutImage that uses the NEON optimized blt functions that are already present in the sunxifb driver:
512x512 image, 1920x1080x32bpp@60Hz, results in Mpix/s
200x200 image
500x500 image, 1280x720x32bpp@60Hz
The 35.14 value is anomalous and probably a typo.
It is clear that the PutImage path is not optimal. This maybe even more true for 16bpp mode. I will make available a patch that implements the optimized PutImage.
The text was updated successfully, but these errors were encountered: