Commits · a5e698f117ee19c1da8c0d7ccb79e0d997dff6eb · adam.huang / Xf86 Video Fbturbo

04 Apr, 2016 1 commit

Framebuffer readback assembly code for AArch64 · a5e698f1

Siarhei Siamashka authored Apr 04, 2016

On a PINE64 board (ARM Cortex-A53), this provides ~180 MB/s
framebuffer readback speed. For comparison, a normal memcpy
operation in cached buffers runs at ~1200 MB/s.

This readback speed is not very fast and is only borderline
usable. At a 1920x1080 32bpp screen resolution, it works out to
roughly 20 FPS when scrolling.
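
The ~20 FPS figure follows from simple arithmetic: one 1920x1080 32bpp frame is about 8.3 MB, and a full-screen scroll has to read back roughly one frame per update. A minimal sketch of that estimate in C, assuming 1 MB = 1,000,000 bytes and ignoring the cost of writes and per-call overhead. The helper names `frame_bytes` and `readback_fps` are hypothetical and not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one full frame for the given resolution and bits per pixel. */
static uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bpp)
{
    assert(bpp % 8 == 0); /* this sketch only handles whole-byte pixels */
    return (uint64_t)width * height * (bpp / 8);
}

/*
 * Rough upper bound on full-screen updates per second when every update
 * needs one readback pass over the frame at the given bandwidth.
 * Assumption: 1 MB = 1,000,000 bytes; write cost is ignored.
 */
static double readback_fps(double mb_per_sec, uint32_t width,
                           uint32_t height, uint32_t bpp)
{
    return mb_per_sec * 1e6 / (double)frame_bytes(width, height, bpp);
}
```

With the numbers above, `readback_fps(180.0, 1920, 1080, 32)` evaluates to about 21.7, in line with the ~20 FPS estimate.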

Benchmark vs. shadow framebuffer (1920x1080 32bpp):

  == Shadow framebuffer in xf86-video-fbdev ==

     $ wget http://mirror.its.dal.ca/gutenberg/3/2/0/3/32032/32032.txt
     $ time DISPLAY=:0 xterm +j -maximized -e cat 32032.txt

     real 0m43.909s
     user 0m0.820s
     sys  0m0.300s

     $ DISPLAY=:0 x11perf -scroll500 -copywinwin500 -copypixwin500 -copywinpix500

     15000 trep @   1.8460 msec (   542.0/sec): Scroll 500x500 pixels
     12000 trep @   2.2629 msec (   442.0/sec): Copy 500x500 from window to window
     12000 trep @   2.2096 msec (   453.0/sec): Copy 500x500 from pixmap to window
     14000 trep @   1.9740 msec (   507.0/sec): Copy 500x500 from window to pixmap

  == Direct framebuffer readback in xf86-video-fbturbo ==

     $ wget http://mirror.its.dal.ca/gutenberg/3/2/0/3/32032/32032.txt
     $ time DISPLAY=:0 xterm +j -maximized -e cat 32032.txt

     real 2m5.741s
     user 0m0.390s
     sys  0m0.190s

     $ DISPLAY=:0 x11perf -scroll500 -copywinwin500 -copypixwin500 -copywinpix500

      4500 trep @   5.9201 msec (   169.0/sec): Scroll 500x500 pixels
      6000 trep @   5.9211 msec (   169.0/sec): Copy 500x500 from window to window
     18000 trep @   1.5341 msec (   652.0/sec): Copy 500x500 from pixmap to window
      4000 trep @   6.4657 msec (   155.0/sec): Copy 500x500 from window to pixmap

  ==

Direct framebuffer access without the shadow framebuffer layer
makes scrolling and moving windows slower, but copying from pixmaps
to windows becomes faster. In the real world, copying from offscreen
pixmaps to windows is much more important, because it is one of the
performance bottlenecks for almost every X11 application, whereas
reading back from the framebuffer is only used for a few very
specialized tasks (scrolling/moving windows and making screenshots).

On 32-bit ARM systems, the uncached framebuffer readback used to
perform better. Even the Cortex-A53 running in 32-bit mode can
do framebuffer readback at more than 300 MB/s:
    https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64)



Scrolling/moving windows can still be accelerated by the kernel
(via DMA, a dedicated 2D accelerator or some other method) and
hooked into xf86-video-fbturbo.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>

17 Oct, 2013 1 commit

Fix the 'forgotten else' regression to use NEON on Cortex-A8 again · 8ad03c9d

Siarhei Siamashka authored Oct 17, 2013

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>

16 Oct, 2013 1 commit

Use ARM LDM instead of VFP for uncached reads on Marvell PJ4 · e9f978f3

Siarhei Siamashka authored Oct 17, 2013



The Marvell PJ4 core used in the CuBox handles uncached VFP reads
from the framebuffer very poorly. Using WMMX or ARM LDM reads is much
faster, with LDM instructions having a minor advantage. This
improves framebuffer read performance from ~50 MB/s to ~100 MB/s.

WMMX runtime detection and PJ4 core identification are also added
as part of this fix.
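
The detection code itself is not shown in this log. As a rough illustration of what runtime WMMX detection can look like on Linux, the sketch below scans `/proc/cpuinfo`-style text for a feature flag. It assumes the kernel reports iWMMXt support as `iwmmxt` on the `Features` line; the function name `cpuinfo_has_feature` is hypothetical and is not the driver's API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `feature` appears as a whole word on a "Features" line
 * of /proc/cpuinfo-style text. The caller is expected to have read the
 * file into a NUL-terminated buffer beforehand.
 */
static bool cpuinfo_has_feature(const char *cpuinfo_text, const char *feature)
{
    assert(cpuinfo_text != NULL && feature != NULL);
    size_t flen = strlen(feature);
    const char *line = cpuinfo_text;

    while (line != NULL && *line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len >= 8 && strncmp(line, "Features", 8) == 0) {
            const char *end = line + len;
            const char *p = memchr(line, ':', len);
            p = p ? p + 1 : end;
            while (p < end) {
                /* Skip separators, then measure one token. */
                while (p < end && (*p == ' ' || *p == '\t'))
                    p++;
                const char *tok = p;
                while (p < end && *p != ' ' && *p != '\t')
                    p++;
                if (flen > 0 && (size_t)(p - tok) == flen &&
                    memcmp(tok, feature, flen) == 0)
                    return true;
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return false;
}
```

A caller could use `cpuinfo_has_feature(text, "iwmmxt")` as one input when choosing between the WMMX, LDM and VFP read paths. Whole-word matching matters here, since a substring search for `vfp` would also match `vfpv3`.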

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>

05 Jun, 2013 1 commit

CPU: Added ARM VFP two-pass overlapped blit implementation · b93dab5c

Siarhei Siamashka authored Jun 05, 2013



Using VFP, we can load up to 128 bytes with a single VLDM instruction.
But before this patch, only a NEON implementation was available, simply
because it showed better results than VFP on Allwinner A10, and this
DDX driver used to primarily target just sunxi hardware.

It now looks like it makes sense to also target other devices (at least
ODROID-X, which has the same Mali400 GPU and can use the same DRI2
integration for EGL and GLESv2 support). And on other ARM devices,
aligned VFP reads generally work better than NEON. The benchmark
results are listed below:

            1280x720, 32bpp, testing "x11perf -scroll500"

== Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement disabled ==

NEON : 10000 trep @   3.7101 msec (   270.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @   2.6678 msec (   375.0/sec): Scroll 500x500 pixels

== Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement enabled ==

NEON : 15000 trep @   2.2568 msec (   443.0/sec): Scroll 500x500 pixels
VFP  : 15000 trep @   2.3016 msec (   434.0/sec): Scroll 500x500 pixels

== Exynos 4412, Cortex-A9 ==

NEON : 10000 trep @   4.5125 msec (   222.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @   2.7015 msec (   370.0/sec): Scroll 500x500 pixels

== TI DM3730, Cortex-A8 ==

NEON : 15000 trep @   2.2303 msec (   448.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @   3.0670 msec (   326.0/sec): Scroll 500x500 pixels

== Allwinner A10, Cortex-A8 ==

NEON : 10000 trep @   2.5559 msec (   391.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @   3.0580 msec (   327.0/sec): Scroll 500x500 pixels

== Raspberry Pi, BCM2708, ARM1176 ==

VFP  :  3000 trep @   8.7699 msec (   114.0/sec): Scroll 500x500 pixels

The benchmark numbers in this particular test setup roughly represent
memory copy bandwidth measured in MB/s (when doing overlapped blits
inside of a writecombine mapped framebuffer).
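
The correspondence between x11perf rates and MB/s comes from the test geometry: a 500x500 area at 32bpp is exactly 1,000,000 bytes, so each operation moves about 1 MB. A small sketch of the conversion in C, assuming 1 MB = 1,000,000 bytes (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved by one blit of a w x h area at the given bits per pixel. */
static uint64_t blit_bytes(uint32_t w, uint32_t h, uint32_t bpp)
{
    assert(bpp % 8 == 0); /* this sketch only handles whole-byte pixels */
    return (uint64_t)w * h * (bpp / 8);
}

/*
 * Convert an x11perf rate (operations per second) into approximate copy
 * bandwidth. Assumption: 1 MB = 1,000,000 bytes, and the time spent
 * outside of the actual copy is negligible.
 */
static double ops_to_mb_per_sec(double ops_per_sec, uint32_t w,
                                uint32_t h, uint32_t bpp)
{
    return ops_per_sec * (double)blit_bytes(w, h, bpp) / 1e6;
}
```

For example, the Exynos 4412 VFP result of 370.0/sec corresponds to roughly 370 MB/s.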

-----------------------------------------------------------------------

Note: the use of VFP two-pass overlapped copy instead of ShadowFB is
      still not enabled by default when running on Raspberry Pi
      because the performance results are not so great.
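
The commit message does not spell out what "two-pass" means. One plausible reading, sketched below in portable C, is that each chunk is first read from the framebuffer into a small cached scratch buffer (pass 1) and then written back out from that buffer (pass 2), with the chunk order chosen so that overlapping regions are handled like `memmove`. The function name, the scratch size and the use of plain `memcpy` in place of VLDM/NEON loads are all assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCRATCH_SIZE 2048 /* hypothetical; small enough to stay in cache */

/* memmove-like copy that stages every chunk through a scratch buffer. */
static void twopass_memmove_sketch(void *dst, const void *src, size_t size)
{
    uint8_t scratch[SCRATCH_SIZE];
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert(dst != NULL && src != NULL);
    if (d == s || size == 0)
        return;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is below the source: walk chunks front to back. */
        size_t off = 0;
        while (off < size) {
            size_t n = size - off;
            if (n > SCRATCH_SIZE)
                n = SCRATCH_SIZE;
            memcpy(scratch, s + off, n); /* pass 1: read from source   */
            memcpy(d + off, scratch, n); /* pass 2: write to destination */
            off += n;
        }
    } else {
        /* Destination is above the source: walk chunks back to front. */
        size_t remaining = size;
        while (remaining > 0) {
            size_t n = remaining > SCRATCH_SIZE ? SCRATCH_SIZE : remaining;
            remaining -= n;
            memcpy(scratch, s + remaining, n);
            memcpy(d + remaining, scratch, n);
        }
    }
}
```

Because every chunk is fully read before any of it is written, and chunks are visited in the direction that never overwrites unread source bytes, the result matches `memmove` for any overlap. In a real driver the read pass is where the uncached or write-combined access pattern matters, which is why the choice between NEON and VFP loads affects the numbers above.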

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>

30 Mar, 2013 1 commit

CPU: Added ARM NEON optimized CopyWindow/CopyArea implementation · 24d05b1d

Siarhei Siamashka authored Mar 30, 2013



This should be useful for better performance when moving windows
and scrolling on devices without a dedicated 2D hardware
accelerator (Allwinner A13).

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
