1. 04 Apr, 2016 1 commit
    • Framebuffer readback assembly code for AArch64 · a5e698f1
      Siarhei Siamashka authored
      On a PINE64 board (ARM Cortex-A53), this provides a framebuffer
      readback speed of ~180 MB/s. For comparison, a normal memcpy
      operation in cached buffers runs at ~1200 MB/s.
      
      This readback speed is not very fast and is only borderline
      usable. At a 1920x1080 32bpp screen resolution, it results in
      roughly 20 FPS when scrolling.
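
The ~20 FPS figure follows from simple arithmetic: one full 1920x1080 32bpp frame is 1920 * 1080 * 4 = 8294400 bytes, so ~180 MB/s of readback covers roughly 21 such frames per second. A minimal sketch of that estimate, assuming 1 MB = 10^6 bytes and that scrolling has to read back about one full screen per frame (the function name is hypothetical and not part of the driver):

```c
/*
 * Hypothetical helper: how many full frames per second can be read back
 * at a given readback bandwidth. Assumes 1 MB = 10^6 bytes.
 */
double full_frame_readbacks_per_second(double readback_mb_per_s,
                                       int width, int height,
                                       int bytes_per_pixel)
{
    /* Size of one full frame in bytes, e.g. 1920 * 1080 * 4 = 8294400. */
    double frame_bytes = (double)width * height * bytes_per_pixel;

    return readback_mb_per_s * 1e6 / frame_bytes;
}
```

Calling `full_frame_readbacks_per_second(180.0, 1920, 1080, 4)` gives about 21.7, in line with the rough ~20 FPS estimate.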
      
      Benchmark vs. shadow framebuffer (1920x1080 32bpp):
      
        == Shadow framebuffer in xf86-video-fbdev ==
      
           $ wget http://mirror.its.dal.ca/gutenberg/3/2/0/3/32032/32032.txt
           $ time DISPLAY=:0 xterm +j -maximized -e cat 32032.txt
      
           real 0m43.909s
           user 0m0.820s
           sys  0m0.300s
      
           $ DISPLAY=:0 x11perf -scroll500 -copywinwin500 -copypixwin500 -copywinpix500
      
           15000 trep @   1.8460 msec (   542.0/sec): Scroll 500x500 pixels
           12000 trep @   2.2629 msec (   442.0/sec): Copy 500x500 from window to window
           12000 trep @   2.2096 msec (   453.0/sec): Copy 500x500 from pixmap to window
           14000 trep @   1.9740 msec (   507.0/sec): Copy 500x500 from window to pixmap
      
        == Direct framebuffer readback in xf86-video-fbturbo ==
      
           $ wget http://mirror.its.dal.ca/gutenberg/3/2/0/3/32032/32032.txt
           $ time DISPLAY=:0 xterm +j -maximized -e cat 32032.txt
      
           real 2m5.741s
           user 0m0.390s
           sys  0m0.190s
      
           $ DISPLAY=:0 x11perf -scroll500 -copywinwin500 -copypixwin500 -copywinpix500
      
            4500 trep @   5.9201 msec (   169.0/sec): Scroll 500x500 pixels
            6000 trep @   5.9211 msec (   169.0/sec): Copy 500x500 from window to window
           18000 trep @   1.5341 msec (   652.0/sec): Copy 500x500 from pixmap to window
            4000 trep @   6.4657 msec (   155.0/sec): Copy 500x500 from window to pixmap
      
        ==
      
      Direct framebuffer access without the shadow framebuffer layer
      makes scrolling and moving windows slower, but copying from pixmaps
      to windows becomes faster. In the real world, copying from offscreen
      pixmaps to windows is much more important, because it is one of the
      performance bottlenecks for almost every X11 application, whereas
      reading back from the framebuffer is only used for a few very
      specialized tasks (scrolling/moving windows and making screenshots).
      
      On 32-bit ARM systems, the uncached framebuffer readback used to
      perform better. Even the Cortex-A53 running in 32-bit mode can
      do framebuffer readback at more than 300 MB/s:
          https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64)
      
      Scrolling/moving windows can still be accelerated by the kernel
      (via DMA, a dedicated 2D accelerator or some other method) and
      hooked into xf86-video-fbturbo.
      
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  2. 17 Oct, 2013 1 commit
  3. 16 Oct, 2013 1 commit
  4. 05 Jun, 2013 1 commit
    • CPU: Added ARM VFP two-pass overlapped blit implementation · b93dab5c
      Siarhei Siamashka authored
      Using VFP, we can load up to 128 bytes with a single VLDM instruction.
      But before this patch, only a NEON implementation was available,
      simply because it showed better results than VFP on Allwinner A10,
      and this DDX driver used to primarily target just sunxi hardware.
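
For orientation, here is a minimal portable C sketch of what a two-pass overlapped copy can look like. It is not the driver's VFP assembly. It assumes that "two-pass" means staging each chunk through a small cached scratch buffer (pass 1: read from the framebuffer, pass 2: write back to it); the function name and the 2048-byte chunk size are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk size, chosen so the scratch buffer stays in L1 cache. */
#define SCRATCH_BYTES 2048

/*
 * Overlap-safe copy that stages every chunk through a cached scratch buffer:
 *   pass 1: read a chunk from the source into the scratch buffer
 *   pass 2: write the chunk from the scratch buffer to the destination
 * The copy direction is picked like memmove() so that source bytes are
 * always read before they can be overwritten.
 */
void twopass_memmove_sketch(uint8_t *dst, const uint8_t *src, size_t size)
{
    uint8_t scratch[SCRATCH_BYTES];

    if (dst == src || size == 0)
        return;

    if ((uintptr_t)dst < (uintptr_t)src) {
        /* Destination is below the source: copy chunks front to back. */
        while (size > 0) {
            size_t n = size < SCRATCH_BYTES ? size : SCRATCH_BYTES;
            memcpy(scratch, src, n);   /* pass 1: read */
            memcpy(dst, scratch, n);   /* pass 2: write */
            src += n;
            dst += n;
            size -= n;
        }
    } else {
        /* Destination is above the source: copy chunks back to front. */
        while (size > 0) {
            size_t n = size < SCRATCH_BYTES ? size : SCRATCH_BYTES;
            size -= n;
            memcpy(scratch, src + size, n);   /* pass 1: read */
            memcpy(dst + size, scratch, n);   /* pass 2: write */
        }
    }
}
```

In a real implementation the read pass is where wide loads such as VLDM would be used, since reads from an uncached or writecombine mapping are the slow part.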
      
      But it looks like it makes sense to also target other devices (at
      least ODROID-X, which has the same Mali400 GPU and can use the same
      DRI2 integration for EGL and GLESv2 support). And on the other ARM
      devices, VFP aligned reads generally work better than NEON. The
      benchmark results are listed below:
      
                  1280x720, 32bpp, testing "x11perf -scroll500"
      
      == Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement disabled ==
      
      NEON : 10000 trep @   3.7101 msec (   270.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   2.6678 msec (   375.0/sec): Scroll 500x500 pixels
      
      == Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement enabled ==
      
      NEON : 15000 trep @   2.2568 msec (   443.0/sec): Scroll 500x500 pixels
      VFP  : 15000 trep @   2.3016 msec (   434.0/sec): Scroll 500x500 pixels
      
      == Exynos 4412, Cortex-A9 ==
      
      NEON : 10000 trep @   4.5125 msec (   222.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   2.7015 msec (   370.0/sec): Scroll 500x500 pixels
      
      == TI DM3730, Cortex-A8 ==
      
      NEON : 15000 trep @   2.2303 msec (   448.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   3.0670 msec (   326.0/sec): Scroll 500x500 pixels
      
      == Allwinner A10, Cortex-A8 ==
      
      NEON : 10000 trep @   2.5559 msec (   391.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   3.0580 msec (   327.0/sec): Scroll 500x500 pixels
      
      == Raspberry Pi, BCM2708, ARM1176 ==
      
      VFP  :  3000 trep @   8.7699 msec (   114.0/sec): Scroll 500x500 pixels
      
      The benchmark numbers in this particular test setup roughly represent
      memory copy bandwidth measured in MB/s (when doing overlapped blits
      inside of a writecombine mapped framebuffer).
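
The reason the x11perf rates read almost directly as MB/s is that one 500x500 scroll at 32bpp moves 500 * 500 * 4 = 1000000 bytes. A small sketch of the conversion, assuming 1 MB = 10^6 bytes and counting each copied byte once (the function name is hypothetical):

```c
/*
 * Hypothetical helper: convert an x11perf rate (operations per second)
 * into approximate copy bandwidth in MB/s, with 1 MB = 10^6 bytes.
 */
double x11perf_rate_to_mb_per_s(double ops_per_sec,
                                int width, int height,
                                int bytes_per_pixel)
{
    /* Bytes moved by one operation, e.g. 500 * 500 * 4 = 1000000. */
    double bytes_per_op = (double)width * height * bytes_per_pixel;

    return ops_per_sec * bytes_per_op / 1e6;
}
```

For example, the 375.0/sec VFP result on the Exynos 5250 corresponds to `x11perf_rate_to_mb_per_s(375.0, 500, 500, 4)`, which is about 375 MB/s.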
      
      -----------------------------------------------------------------------
      
      Note: the use of VFP two-pass overlapped copy instead of ShadowFB is
            still not enabled by default when running on Raspberry Pi
            because the performance results are not so great.
      
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  5. 30 Mar, 2013 1 commit