FillRect could be accelerated #8
One challenge here is that the performance of G2D is fillrate limited. It is currently clocked at 1/2 of the memory clock speed, which means 240MHz on cubieboard (which runs memory at 480MHz). The performance limit is one pixel per cycle (or 240 million pixels per second in total). This means that G2D can only utilize ~960MB/s of memory bandwidth for 32bpp and just 480MB/s for 16bpp. For comparison, the CPU can easily use ~1400MB/s of memory bandwidth for fill operations. A primitive fill operation, which does not involve a lot of computation and does not do many memory accesses per pixel, is not the best workload for G2D. The CPU is much faster than G2D for fills if we only consider wall clock time. You can also try the https://github.com/ssvb/xf86-video-sunxifb/blob/master/test/sunxi_g2d_bench.c test program to get some numbers.
Another challenge is that G2D works only with physically contiguous memory. The framebuffer is physically contiguous, but the offscreen pixmaps are allocated in normal cached memory (this makes a lot of sense when they are primarily accessed by just the CPU). So right now G2D can potentially only do fills that go directly to windows on screen (or to the root window). The practical use for it is probably not so significant. Maybe moving windows on top of a solid background and rendering the exposed parts of this solid background?
To sum it up: there are some drawbacks, and it's not a clear win. But if you have some patches and benchmark results demonstrating the practical usefulness of G2D fill, then they are very much welcome. That said, this issue still can/should be revisited when we have better G2D support in the kernel. |
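As a rough check of the figures in the comment above, the fillrate ceiling is just the engine clock multiplied by the bytes written per pixel. The sketch below assumes one pixel per cycle and a G2D clock of half the memory clock, as stated in the comment; the function names are hypothetical and are not part of the driver.

```c
#include <stdint.h>

/* Hypothetical helper: the G2D clock is half the memory clock,
 * as described in the comment above. */
static uint64_t g2d_clock_hz(uint64_t memory_clock_hz)
{
    return memory_clock_hz / 2;
}

/* Hypothetical helper: peak fill bandwidth in bytes per second for an
 * engine that writes at most one pixel per clock cycle. */
static uint64_t peak_fill_bytes_per_sec(uint64_t engine_clock_hz,
                                        unsigned bits_per_pixel)
{
    return engine_clock_hz * (bits_per_pixel / 8);
}
```

With a 480MHz memory clock this gives 240MHz for G2D, hence 960 000 000 bytes/s at 32bpp and 480 000 000 bytes/s at 16bpp, both below the ~1400MB/s quoted for CPU fills.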
I can see the drawbacks now. Fills may indeed more commonly go to off-screen pixmaps. Still, at 1920x1080x32bpp, the fillrate difference between G2D and CPU using sunxi_g2d_bench is minimal:
So at 1920x1080x32bpp, there might be a benefit to using G2D fills. Especially in the sense of relieving the CPU and allowing background processes to continue while the fill takes place. I may try to experiment with this. |
> Still, at 1920x1080x32bpp, the fillrate difference between G2D and CPU using sunxi_g2d_bench is minimal
There is actually one more interesting thing. The memory performance does not decrease gradually, but changes abruptly at certain points. If your monitor can handle it, you can set a 50Hz refresh rate to save some bandwidth (add "disp.screen0_output_mode=1920x1080p50" to the kernel command line). The next thing is to increase the memory clock frequency a bit in u-boot. Going just from 408 MHz to 432 MHz for the memory clock improves CPU fill speed from ~552 MB/s to ~819 MB/s. The improvement for memory copy is not so dramatic, but also noticeable. I suspect that it might be something like getting an extra cycle of penalty somewhere in the memory subsystem. So a minor change in screen refresh rate or memory clock frequency may change overall desktop responsiveness a great deal. |
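To put a number on the bandwidth saved by lowering the refresh rate, here is a back-of-the-envelope sketch. It assumes the display controller reads exactly one full frame of active pixels per refresh and ignores blanking and any other layers; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: bytes per second read by scanout, assuming one
 * full frame of active pixels is fetched from memory per refresh. */
static uint64_t scanout_bytes_per_sec(unsigned width, unsigned height,
                                      unsigned bits_per_pixel,
                                      unsigned refresh_hz)
{
    uint64_t frame_bytes = (uint64_t)width * height * (bits_per_pixel / 8);
    return frame_bytes * refresh_hz;
}
```

Under these assumptions a 1920x1080x32bpp mode needs 497 664 000 bytes/s at 60Hz and 414 720 000 bytes/s at 50Hz, a saving of about 83MB/s. That is small next to the jump reported above for the 408 to 432 MHz change, which fits the suspicion that a threshold effect, rather than raw bandwidth alone, is involved.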
I noticed that too, I did some benchmarks with tinymembench that I posted on the wiki (Optimizing system performance) that show a similar drop-off at 1920x1080x32bpp with memory clock increasing from 360 to 408 MHz not helping much (while it helped a lot in lower resolution modes). I guess I should try running at 432 MHz. |
I sent a post in the mailing list with mostly the same information some time ago: https://groups.google.com/d/msg/linux-sunxi/0pGua9gzZTQ/VZN3jHo5Ss4J :) |
I experimented with accelerated FillRect. 16-bit G2D is indeed much slower than software fill, because the pixel fill-rate of G2D holds it back. However, at 32-bit color, on a loaded system, there could really be some benefit. When background processes are fully loading the CPU, the very small CPU utilization of G2D fill makes the fill operation faster and also gives the background process more CPU time (roughly double in the case of a load of 1). At higher loads, the benefit should increase. On an unloaded system, running x11perf shows:
With kernel compile in background:
Timing a single-threaded CPU benchmark with x11perf -rect500 running concurrently:
|
16-bit fill operations can be partially emulated using 32-bit fills; however, in unaligned cases we might need to additionally process the 1-pixel wide leftmost and 1-pixel wide rightmost columns separately. This makes up to 3 ioctls instead of 1, and introduces the hassle of adding extra heuristics to decide when this optimization is beneficial or not. |
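The splitting described in the comment above could look roughly like the sketch below. It is only an illustration: the struct and function names are hypothetical, x is assumed to be non-negative, and the scanline stride is assumed to be a multiple of 4 bytes so that even 16bpp columns line up with 32-bit pixels. It does not issue any ioctls; it only plans how many fills would be needed.

```c
#include <stdint.h>

/* Hypothetical plan for one 16bpp fill emulated with 32-bit fills. */
struct fill_plan {
    int has_left;   /* 1 if a 1-pixel wide 16bpp column at left_x is needed */
    int left_x;
    int has_right;  /* 1 if a 1-pixel wide 16bpp column at right_x is needed */
    int right_x;
    int mid_x32;    /* start of the aligned middle part, in 32-bit pixels */
    int mid_w32;    /* width of the middle part in 32-bit pixels, 0 if empty */
};

/* Split the 16bpp column range [x, x + w) into an aligned middle part
 * plus optional unaligned edge columns. Assumes x >= 0. */
static struct fill_plan plan_16bpp_fill(int x, int w)
{
    struct fill_plan p = {0};
    int start = x;
    int end = x + w;            /* exclusive */

    if (w <= 0)
        return p;
    if (start & 1) {            /* odd start: peel off the leftmost column */
        p.has_left = 1;
        p.left_x = start;
        start++;
    }
    if ((end & 1) && end > start) {  /* odd end: peel off the rightmost column */
        p.has_right = 1;
        p.right_x = end - 1;
        end--;
    }
    p.mid_x32 = start / 2;
    p.mid_w32 = (end - start) / 2;
    return p;
}

/* Number of separate fill requests the plan needs (0 to 3). */
static int fill_op_count(const struct fill_plan *p)
{
    return p->has_left + p->has_right + (p->mid_w32 > 0 ? 1 : 0);
}

/* A 32-bit fill writes two 16bpp pixels, so the color is duplicated. */
static uint32_t replicate_16bpp_color(uint16_t color)
{
    return ((uint32_t)color << 16) | color;
}
```

For example, a fill starting at x = 3 with width 10 would need all three parts (column 3, 32-bit pixels 2 to 5, and column 12), whereas a fill starting at x = 4 with width 8 needs a single 32-bit fill. A heuristic could use `fill_op_count` together with the rectangle size to decide whether the emulation is worth it.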
I'm still not totally convinced by x11perf alone and would like to see a more realistic use case justifying the optimization of fills with G2D. The current code is more like a placeholder. We need a better kernel driver to make G2D really useful for more advanced things. |
Also it would be a good idea to have a more real time conversation on #linux-sunxi irc :) |
Interesting, I didn't realize this was possible. I might try it.
Yeah, x11perf is extreme and not typical of normal usage. However, on a loaded system, G2D FillRect should be beneficial whichever way you look at it, reducing all kinds of kernel and CPU cache related penalties that come with running two CPU-burning processes at the same time. The actual number of FillRect calls made by applications may not be that high, or at least not as influential as blitting for scrolling or dragging windows, but it is an optimization. I'll try to think of a way to construct a realistic use case, but there are not many good X benchmarks. Maybe it is possible to use the current CPU load as an extra heuristic for choosing between G2D and CPU FillRect. A bit far-fetched, but possible.
I thought G2D can only blit and fill - what more advanced things are possible? The kernel driver is simple but it seems the kernel handles the sleep on IRQ fairly well. |
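The load-based heuristic floated in the comment above might be sketched as follows. This is purely illustrative: the function name and both thresholds are hypothetical, picked only to reflect the observations in this thread (16bpp G2D fills are slower than software, while 32bpp fills help once the load is around 1). The load value could come from something like `getloadavg()`, but here it is passed in as a parameter.

```c
/* Hypothetical thresholds, not measured values. */
#define HYPOTHETICAL_MIN_LOAD  1.0     /* load average at which G2D is preferred */
#define HYPOTHETICAL_MIN_AREA  10000L  /* pixels; skip ioctl overhead on tiny fills */

/* Return 1 if a fill should be sent to G2D, 0 if the CPU should do it.
 * bits_per_pixel: framebuffer depth
 * area_pixels:    width * height of the rectangle
 * load_average:   current system load, supplied by the caller */
static int prefer_g2d_fill(int bits_per_pixel, long area_pixels,
                           double load_average)
{
    if (bits_per_pixel != 32)
        return 0;   /* 16bpp G2D fill is slower than software fill */
    if (area_pixels < HYPOTHETICAL_MIN_AREA)
        return 0;   /* too small to be worth a kernel round trip */
    return load_average >= HYPOTHETICAL_MIN_LOAD;
}
```

Whether such a switch pays off in practice would still need the kind of realistic benchmark asked for earlier in the thread.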
I have released a patch for testing/evaluation. It is available at https://github.com/hglm/patches
|
> I thought G2D can only blit and fill - what more advanced things are possible?
Basically, scaling, rotation, and conversion between formats are supported. Alpha blending is a bit of a challenge, because there is unfortunately no direct premultiplied alpha support for doing it in one pass. You can check the documentation (Mixer Processor section): http://free-electrons.com/~maxime/pub/datasheet/A10%20User%20manual%20V1.20%2020120409.pdf |
> I have released a patch for testing/evaluation. It is available at https://github.com/hglm/patches
Thanks. It would be best to actually fork the xf86-video-sunxifb repository, create a git branch for your changes, and split the big patch into smaller, logically independent parts. Some good practices are described here: https://www.kernel.org/pub/software/scm/git/docs/user-manual.html#patch-series Also, I'm trying to organize the code in such a way that it works on any hardware (using acceleration based on what features are available). For example, Allwinner A13 does not have G2D, so only NEON works there. Moreover, the same driver works on the Samsung Exynos based ODROID-X board (with support for Mali DRI2 GLES acceleration, but no layers or hardware cursor). And nothing prevents it from running on x86 systems, where it works in exactly the same way as xf86-video-fbdev. |
Thanks for the suggestion, I'll look into creating a git branch. My code is not fully tested yet, and I've eliminated a few bugs along the way. Splitting the patch into logical parts should make it easier to manage. I also noticed that the driver is in principle device-independent, which I sort of skipped over in my patch. It should be possible to make the G2D functions optional/generalized and compile the core driver without needing sunxi_disp. |
Some offtopic about Exynos:
I believe that in kernel 3.8, which became available for X2/U2, we can use the mainlined s5p-tv driver layering system. Even if the driver itself can't provide the layering, it uses the also mainlined videobuf2 v4l2 system, which can report at least something needed for direct rendering to the framebuffers. For example, @mdrjr from hardkernel inserted the needed ioctls for UMP stuff into the vb2 framework. |
@rzk unfortunately there does not seem to be any public documentation about the layering system hardware in Exynos :( It's probably more interesting to add some basic blit/fill acceleration for Raspberry Pi using DMA to fix "X11 struggles to get to 10fps just moving an unscaled, opaque, window" problem described in http://fooishbar.org/tell-me-about/wayland-on-raspberry-pi/ :) |
I've put up some patches at https://github.com/hglm/patches/sunxifb The X driver FillRect patches are left out for now, mainly because they are the only feature that requires changes to the device-independent structures so that sunxi_x_g2d.c would remain device independent. I've separated the patches so that they only apply to either sunxi_disp or sunxi_x_g2d. For example, one patch extends the low level fill primitives in sunxi_disp.c, and another adds the double speed 16bpp blit. There's also a PutImage patch for sunxi_x_g2d.c that should be device independent. |
BTW, it appears that this kind of acceleration could actually be useful for the color key filling done in libvdpau-sunxi by @jemk |
It appears the kernel G2D driver offers Fill Rectangle acceleration. This is currently not used by sunxifb. Implementing this, if it can work, should improve performance in X, especially with respect to lowering CPU utilization when large areas are filled.
I realize that software fill is probably not much slower, but it's the CPU utilization where gains can be made.