- 04 Apr, 2016 1 commit
Siarhei Siamashka authored
On a PINE64 board (ARM Cortex-A53), this provides ~180 MB/s for framebuffer readback. For comparison, a normal memcpy in cached buffers runs at ~1200 MB/s. Such readback speed is not very fast and is borderline usable: with a 1920x1080 32bpp screen resolution, it results in roughly 20 FPS scrolling.

Benchmark vs. shadow framebuffer (1920x1080 32bpp):

== Shadow framebuffer in xf86-video-fbdev ==

    $ wget http://mirror.its.dal.ca/gutenberg/3/2/0/3/32032/32032.txt
    $ time DISPLAY=:0 xterm +j -maximized -e cat 32032.txt

    real    0m43.909s
    user    0m0.820s
    sys     0m0.300s

    $ DISPLAY=:0 x11perf -scroll500 -copywinwin500 -copypixwin500 -copywinpix500

    15000 trep @ 1.8460 msec ( 542.0/sec): Scroll 500x500 pixels
    12000 trep @ 2.2629 msec ( 442.0/sec): Copy 500x500 from window to window
    12000 trep @ 2.2096 msec ( 453.0/sec): Copy 500x500 from pixmap to window
    14000 trep @ 1.9740 msec ( 507.0/sec): Copy 500x500 from window to pixmap

== Direct framebuffer readback in xf86-video-fbturbo ==

    $ wget http://mirror.its.dal.ca/gutenberg/3/2/0/3/32032/32032.txt
    $ time DISPLAY=:0 xterm +j -maximized -e cat 32032.txt

    real    2m5.741s
    user    0m0.390s
    sys     0m0.190s

    $ DISPLAY=:0 x11perf -scroll500 -copywinwin500 -copypixwin500 -copywinpix500

     4500 trep @ 5.9201 msec ( 169.0/sec): Scroll 500x500 pixels
     6000 trep @ 5.9211 msec ( 169.0/sec): Copy 500x500 from window to window
    18000 trep @ 1.5341 msec ( 652.0/sec): Copy 500x500 from pixmap to window
     4000 trep @ 6.4657 msec ( 155.0/sec): Copy 500x500 from window to pixmap

Direct framebuffer access without the shadow framebuffer layer makes scrolling and moving windows slower, but copying from pixmaps to windows becomes faster. In the real world, copying from offscreen pixmaps to windows is much more important, because it is one of the performance bottlenecks for almost every X11 application, while reading back from the framebuffer is only used for a few very specialized tasks (scrolling/moving windows and taking screenshots).

On 32-bit ARM systems, the uncached framebuffer readback used to perform better. Even the Cortex-A53 running in 32-bit mode can do framebuffer readback at more than 300 MB/s:

    https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64)

Scrolling/moving windows can still be accelerated by the kernel (via DMA, a dedicated 2D accelerator or some other method) and hooked into xf86-video-fbturbo.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
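The "~20 FPS" estimate above follows from dividing the readback bandwidth by the size of one full frame. A minimal sketch of that arithmetic, assuming 1 MB = 10^6 bytes and that every frame moves the whole screen once; the helper names `frame_bytes` and `full_frame_rate` are made up for this illustration and are not part of the driver:

```c
#include <stdint.h>

/* Size of one full frame in bytes (hypothetical helper). */
static uint64_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bits_per_pixel)
{
    return (uint64_t)width * height * (bits_per_pixel / 8);
}

/*
 * Upper bound on full-screen updates per second when every frame has to be
 * read back once at the given bandwidth. Assumes 1 MB = 1,000,000 bytes.
 */
static double full_frame_rate(double mb_per_sec, uint32_t width,
                              uint32_t height, uint32_t bits_per_pixel)
{
    return mb_per_sec * 1e6 /
           (double)frame_bytes(width, height, bits_per_pixel);
}
```

For 1920x1080 at 32bpp a frame is 8,294,400 bytes, so `full_frame_rate(180.0, 1920, 1080, 32)` comes out at a little under 22 frames per second, in line with the rough figure quoted in the commit message.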
- 17 Oct, 2013 1 commit
Siarhei Siamashka authored
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
- 16 Oct, 2013 1 commit
Siarhei Siamashka authored
The Marvell PJ4 core used in CuBox handles uncached VFP reads from the framebuffer very poorly. Using WMMX or ARM LDM reads is much faster, with LDM instructions having a minor advantage. This improves framebuffer read performance from ~50 MB/s to ~100 MB/s. WMMX runtime detection and PJ4 core identification are also added as part of this fix.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
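The commit message does not say how the runtime WMMX detection is done. One common approach on Linux is to look for the `iwmmxt` flag on the `Features` line of `/proc/cpuinfo`; the sketch below assumes that approach and is not taken from the driver. The function name `cpuinfo_has_feature` is hypothetical, and it works on text that the caller has already read into memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `flag` appears as a whole word on a "Features" line of
 * /proc/cpuinfo-style text. Hypothetical helper for illustration only.
 */
static bool cpuinfo_has_feature(const char *cpuinfo, const char *flag)
{
    size_t flag_len = strlen(flag);
    const char *line = cpuinfo;

    if (flag_len == 0)
        return false;

    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t line_len = eol ? (size_t)(eol - line) : strlen(line);
        const char *end = line + line_len;

        if (line_len >= 8 && strncmp(line, "Features", 8) == 0) {
            const char *p = memchr(line, ':', line_len);
            if (p) {
                p++; /* skip the ':' separator */
                while (p < end) {
                    /* skip whitespace between tokens */
                    while (p < end && (*p == ' ' || *p == '\t'))
                        p++;
                    const char *tok = p;
                    while (p < end && *p != ' ' && *p != '\t')
                        p++;
                    /* exact match only, so "vfp" does not match "vfpv3" */
                    if ((size_t)(p - tok) == flag_len &&
                        strncmp(tok, flag, flag_len) == 0)
                        return true;
                }
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return false;
}
```

A caller would read `/proc/cpuinfo` into a NUL-terminated buffer once at startup and pick the WMMX or LDM read path only when `cpuinfo_has_feature(buf, "iwmmxt")` is true. Identifying the PJ4 core itself needs additional checks that are not sketched here.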
- 05 Jun, 2013 1 commit
Siarhei Siamashka authored
Using VFP, we can load up to 128 bytes with a single VLDM instruction. Before this patch, only a NEON implementation was available, because it showed better results than VFP on Allwinner A10 and this DDX driver used to primarily target just sunxi hardware. But it looks like it makes sense to also target other devices (at least ODROID-X, which has the same Mali400 GPU and can use the same DRI2 integration for EGL and GLESv2 support). And on the other ARM devices, aligned VFP reads generally work better than NEON. The benchmark results are listed below:

1280x720, 32bpp, testing "x11perf -scroll500"

== Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement disabled ==
    NEON : 10000 trep @ 3.7101 msec ( 270.0/sec): Scroll 500x500 pixels
    VFP  : 10000 trep @ 2.6678 msec ( 375.0/sec): Scroll 500x500 pixels

== Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement enabled ==
    NEON : 15000 trep @ 2.2568 msec ( 443.0/sec): Scroll 500x500 pixels
    VFP  : 15000 trep @ 2.3016 msec ( 434.0/sec): Scroll 500x500 pixels

== Exynos 4412, Cortex-A9 ==
    NEON : 10000 trep @ 4.5125 msec ( 222.0/sec): Scroll 500x500 pixels
    VFP  : 10000 trep @ 2.7015 msec ( 370.0/sec): Scroll 500x500 pixels

== TI DM3730, Cortex-A8 ==
    NEON : 15000 trep @ 2.2303 msec ( 448.0/sec): Scroll 500x500 pixels
    VFP  : 10000 trep @ 3.0670 msec ( 326.0/sec): Scroll 500x500 pixels

== Allwinner A10, Cortex-A8 ==
    NEON : 10000 trep @ 2.5559 msec ( 391.0/sec): Scroll 500x500 pixels
    VFP  : 10000 trep @ 3.0580 msec ( 327.0/sec): Scroll 500x500 pixels

== Raspberry Pi, BCM2708, ARM1176 ==
    VFP  :  3000 trep @ 8.7699 msec ( 114.0/sec): Scroll 500x500 pixels

The benchmark numbers in this particular test setup roughly represent memory copy bandwidth measured in MB/s (when doing overlapped blits inside of a writecombine mapped framebuffer): a 500x500 blit at 32bpp moves 1,000,000 bytes, so one operation per second corresponds to about 1 MB/s.

-----------------------------------------------------------------------

Note: the VFP two-pass overlapped copy is still not enabled by default in place of ShadowFB when running on Raspberry Pi, because the performance results there are not so great.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
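The general shape of a two-pass overlapped copy can be sketched in portable C: each chunk is first read from the source into a small temporary buffer in cached memory, then written from there to the destination, and the copy direction is chosen so that overlapping regions are never overwritten before they are read. This is only an illustration of the idea under those assumptions. The real driver uses hand-written VFP/NEON reads for the first pass, while this sketch substitutes plain `memcpy`; the name `two_pass_overlapped_copy` and the 2048-byte chunk size are made up here.

```c
#include <stddef.h>
#include <string.h>

#define TWO_PASS_CHUNK 2048 /* assumed scratch size, not from the driver */

/*
 * Copy n bytes from src to dst, where the regions may overlap (as they do
 * when scrolling inside one framebuffer). Each chunk is staged through a
 * small buffer on the stack: pass 1 reads the source, pass 2 writes the
 * destination.
 */
static void two_pass_overlapped_copy(unsigned char *dst,
                                     const unsigned char *src, size_t n)
{
    unsigned char tmp[TWO_PASS_CHUNK];

    if (dst == src || n == 0)
        return;

    if (dst < src) {
        /* Destination is below the source: walk forward. */
        while (n > 0) {
            size_t len = n < TWO_PASS_CHUNK ? n : TWO_PASS_CHUNK;
            memcpy(tmp, src, len); /* pass 1: read into scratch */
            memcpy(dst, tmp, len); /* pass 2: write from scratch */
            src += len;
            dst += len;
            n -= len;
        }
    } else {
        /* Destination is above the source: walk backward from the end. */
        while (n > 0) {
            size_t len = n < TWO_PASS_CHUNK ? n : TWO_PASS_CHUNK;
            memcpy(tmp, src + n - len, len);
            memcpy(dst + n - len, tmp, len);
            n -= len;
        }
    }
}
```

The result matches `memmove` for any overlap. The point of the staging buffer is that the slow uncached reads can be issued as long aligned bursts, separately from the write-combined stores.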
- 30 Mar, 2013 1 commit
Siarhei Siamashka authored
Should be useful for better performance when moving windows and scrolling on devices without a dedicated 2D hardware accelerator (Allwinner A13).

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>