1. 12 Jan, 2014 1 commit
  2. 09 Dec, 2013 1 commit
    • mali: detect and workaround mismatch between back and front buffers · eed17d55
      Siarhei Siamashka authored
      After window creation or resize, the mali blob on the client side
      requests two dri2 buffers (for back and front) from the ddx. The
      problem is that the 'swap' and 'get_buffer' operations are executed
      out of order relative to each other and we may have different
      possible patterns of dri2 communication:
      
      1. swap swap swap swap get_buffer swap get_buffer swap swap ...
      2. swap swap swap get_buffer swap swap get_buffer swap swap ...
      
      A major annoyance is that both the mali blob on the client side and
      the ddx driver in the xserver need to have the same idea about which
      one of these two buffers goes to front and which goes to back. The older
      commit https://github.com/ssvb/xf86-video-fbturbo/commit/30b4ca27d1c4
      tried to address this problem in a mostly empirical way and managed
      to solve it at least for the synthetic test gles-rgb-cycle-demo and
      for most of the real programs (such as Qt5 applications, etc.)
      
      However, it appears that this heuristic is not 100% reliable in all
      cases. The Extreme Tux Racer game run in glshim manages to trigger
      the back and front buffers mismatch, which manifests itself as
      erratic penguin movement.
      
      This patch adds a special check, which now randomly samples certain
      bytes from the dri2 buffers to see which one of them has been
      modified by the client application between buffer swaps. If we see
      that the rendering actually happens to the front buffer instead of
      the back buffer, then we just change the roles of these buffers.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      eed17d55
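The sampling check described in this commit can be illustrated with a minimal sketch. It assumes the two buffers are plain CPU-visible byte arrays; the names `buf_samples_t`, `take_samples`, `was_modified`, `roles_look_swapped` and the sample count are hypothetical and are not taken from the driver source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SAMPLES 16  /* hypothetical number of sampled bytes */

typedef struct {
    size_t  offset[NUM_SAMPLES];  /* where each sample was taken */
    uint8_t value[NUM_SAMPLES];   /* byte value seen at that offset */
} buf_samples_t;

/* Small LCG, so the sketch does not depend on global rand() state */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Remember the bytes found at pseudo-random offsets of the buffer */
void take_samples(buf_samples_t *s, const uint8_t *buf, size_t size,
                  uint32_t *seed)
{
    assert(size > 0);
    for (int i = 0; i < NUM_SAMPLES; i++) {
        s->offset[i] = next_random(seed) % size;
        s->value[i]  = buf[s->offset[i]];
    }
}

/* Non-zero if any sampled byte differs from the remembered value */
int was_modified(const buf_samples_t *s, const uint8_t *buf)
{
    for (int i = 0; i < NUM_SAMPLES; i++)
        if (buf[s->offset[i]] != s->value[i])
            return 1;
    return 0;
}

/* The client is expected to draw into the back buffer only. If the
 * front buffer changed and the back buffer did not, the roles of the
 * two buffers are probably swapped. */
int roles_look_swapped(const buf_samples_t *back_s, const uint8_t *back,
                       const buf_samples_t *front_s, const uint8_t *front)
{
    return was_modified(front_s, front) && !was_modified(back_s, back);
}
```

In this sketch, samples would be taken at one buffer swap and compared at the next one. The check is still a heuristic: a frame that happens to leave every sampled byte unchanged is not detected.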
  3. 15 Nov, 2013 1 commit
  4. 26 Oct, 2013 1 commit
  5. 19 Oct, 2013 2 commits
  6. 17 Oct, 2013 1 commit
  7. 16 Oct, 2013 1 commit
  8. 08 Oct, 2013 1 commit
    • RPi: implement threshold for deciding between CPU and DMA blits · 102957f9
      Siarhei Siamashka authored
      
      
      Benchmarked with x11perf, modified to support a wider range of sizes
      for the scroll operation. Tests were run at the stock 700MHz CPU
      clock frequency and with a 1280x720 32bpp desktop.
      
      $ DISPLAY=:0 ./x11perf -scroll5 -scroll10 -scroll15 -scroll20 \
                             -scroll30 -scroll50 -scroll100
      
      == CPU ==
      
      1000000 trep @   0.0289 msec ( 34600.0/sec): Scroll 5x5 pixels
      1000000 trep @   0.0387 msec ( 25800.0/sec): Scroll 10x10 pixels
      1000000 trep @   0.0459 msec ( 21800.0/sec): Scroll 15x15 pixels
       450000 trep @   0.0576 msec ( 17300.0/sec): Scroll 20x20 pixels
       350000 trep @   0.0817 msec ( 12200.0/sec): Scroll 30x30 pixels
       200000 trep @   0.1564 msec (  6390.0/sec): Scroll 50x50 pixels
       100000 trep @   0.4446 msec (  2250.0/sec): Scroll 100x100 pixels
      
      == fb_copyarea (DMA) acceleration ==
      
      1000000 trep @   0.0307 msec ( 32500.0/sec): Scroll 5x5 pixels
      1000000 trep @   0.0353 msec ( 28300.0/sec): Scroll 10x10 pixels
      1000000 trep @   0.0397 msec ( 25200.0/sec): Scroll 15x15 pixels
      1000000 trep @   0.0464 msec ( 21600.0/sec): Scroll 20x20 pixels
       400000 trep @   0.0645 msec ( 15500.0/sec): Scroll 30x30 pixels
       250000 trep @   0.1177 msec (  8500.0/sec): Scroll 50x50 pixels
       100000 trep @   0.2783 msec (  3590.0/sec): Scroll 100x100 pixels
      
      This shows that the ioctl overhead and the DMA setup cost are not so
      significant for the Raspberry Pi. DMA already becomes a bit faster
      than the CPU at the 10x10 blit size.
      
      Even though there is no significant difference between CPU and DMA
      for extremely small sizes of operations (the other overhead is clearly
      dominating), setting a threshold is not going to harm:
      
      == mixed CPU / fb_copyarea (DMA) with 90 pixels threshold ==
      
      1000000 trep @   0.0291 msec ( 34300.0/sec): Scroll 5x5 pixels
      1000000 trep @   0.0345 msec ( 29000.0/sec): Scroll 10x10 pixels
      1000000 trep @   0.0395 msec ( 25300.0/sec): Scroll 15x15 pixels
      1000000 trep @   0.0466 msec ( 21400.0/sec): Scroll 20x20 pixels
       400000 trep @   0.0650 msec ( 15400.0/sec): Scroll 30x30 pixels
       250000 trep @   0.1181 msec (  8470.0/sec): Scroll 50x50 pixels
       100000 trep @   0.2784 msec (  3590.0/sec): Scroll 100x100 pixels
      
      If some other ARM devices also implement a Raspberry Pi compatible
      accelerated fb_copyarea ioctl, then the threshold selection may
      be reconsidered.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      102957f9
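The threshold decision from this commit can be sketched as follows. The sketch assumes that the 90 pixel threshold is compared against the blit area (width times height), which is consistent with the mixed results above: the 5x5 case tracks the CPU numbers and the 10x10 case tracks the DMA numbers. The names `BLT_SIZE_THRESHOLD`, `blit_path_t` and `choose_blit_path` are hypothetical.

```c
#include <assert.h>

#define BLT_SIZE_THRESHOLD 90  /* pixels, value from the commit message */

typedef enum { BLIT_CPU, BLIT_DMA } blit_path_t;

/* Pick the CPU for tiny blits and the fb_copyarea (DMA) path otherwise */
blit_path_t choose_blit_path(int w, int h)
{
    assert(w > 0 && h > 0);
    return ((long)w * h < BLT_SIZE_THRESHOLD) ? BLIT_CPU : BLIT_DMA;
}
```

With these assumptions, `choose_blit_path(5, 5)` selects the CPU path (25 pixels) and `choose_blit_path(10, 10)` selects the DMA path (100 pixels).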
  9. 07 Oct, 2013 1 commit
  10. 03 Oct, 2013 1 commit
  11. 22 Sep, 2013 1 commit
  12. 09 Sep, 2013 2 commits
  13. 07 Sep, 2013 2 commits
  14. 06 Sep, 2013 1 commit
    • sunxi: Only enable scaler for the layer when it is really necessary · 64a0d642
      Siarhei Siamashka authored
      
      
      Now the scaler is enabled for the sunxi disp layer only when we want
      to use it for YUV format with XV. Whenever the layer is configured
      for RGB format or deactivated, the scaler gets disabled.
      
      This should make the driver more friendly to the other potential
      scaled layer users. The total number of available scalers is only
      2 for Allwinner A10 and only 1 for Allwinner A13.
      
      The potential drawback is that now we may get an error when trying
      to enable the scaler (if somebody else has used up all the available
      scalers) instead of always having it reserved and ready for use.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      64a0d642
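The trade-off described in this commit (scalers are claimed on demand, so claiming one can now fail) can be modelled with a small sketch. It is not the sunxi-disp API: `disp_t`, `layer_t` and `layer_set_mode` are hypothetical names, and the scaler pool is reduced to a simple counter.

```c
#include <assert.h>

typedef enum { LAYER_OFF, LAYER_RGB, LAYER_YUV } layer_mode_t;

typedef struct { int free_scalers; } disp_t;  /* 2 on A10, 1 on A13 */
typedef struct { layer_mode_t mode; int has_scaler; } layer_t;

/* Switch the layer mode, claiming a scaler only for YUV and giving it
 * back for RGB or a deactivated layer. Returns 0 on success, or -1 if
 * a scaler is needed but none is available (the layer is unchanged). */
int layer_set_mode(disp_t *disp, layer_t *layer, layer_mode_t mode)
{
    assert(disp->free_scalers >= 0);
    if (mode == LAYER_YUV) {
        if (!layer->has_scaler) {
            if (disp->free_scalers == 0)
                return -1;
            disp->free_scalers--;
            layer->has_scaler = 1;
        }
    } else if (layer->has_scaler) {
        disp->free_scalers++;
        layer->has_scaler = 0;
    }
    layer->mode = mode;
    return 0;
}
```

On a device with a single scaler, a second user asking for YUV gets an error until the first layer goes back to RGB or is deactivated, which is the drawback mentioned in the commit message.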
  15. 13 Aug, 2013 1 commit
  16. 04 Aug, 2013 2 commits
  17. 03 Aug, 2013 1 commit
  18. 31 Jul, 2013 3 commits
  19. 30 Jul, 2013 1 commit
  20. 29 Jul, 2013 2 commits
  21. 26 Jul, 2013 2 commits
    • DRI2: CPU copy fallback path does not drop half of the frames anymore · 0fd7d5de
      Siarhei Siamashka authored
      The recent commit 9e0a8731 (the part of it that suppressed buffer
      reuse in the Xorg DRI2 framework) introduced a regression. Half of
      the frames stopped reaching the screen on the CPU copy fallback path
      because the Mali blob now ended up rendering them to the "wrong"
      buffer.
      
      It just confirms that we need to completely move from the standard
      DRI2 framework in the Xorg server to our own buffers bookkeeping
      logic. This patch fixes the regression by introducing a single UMP
      buffer per window, which is shared between back and front DRI2
      buffers. We can do this because double buffering does not make much
      sense on the fallback path at the moment (we can't set scanout from
      this buffer and anyway have to copy this data elsewhere immediately
      after we get it from Mali).
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      0fd7d5de
    • DRI2: only pay attention to back buffers requests · 7994a0f3
      Siarhei Siamashka authored
      
      
      Bail out earlier for the uninteresting types of DRI2 buffer
      requests (by just returning a dummy null UMP buffer). This makes
      the code a bit simpler on the common path.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      7994a0f3
  22. 24 Jul, 2013 2 commits
    • DRI2: Refine the workaround for Mali r3p0 window resizing issue · d59ae8a7
      Siarhei Siamashka authored
      
      
      Using the secure id 1 (framebuffer) to trick the Mali blob into
      requesting DRI2 buffers again was not a very good idea. The problem
      is that the blob still writes something there and corrupts the
      framebuffer. So instead we try to assign secure id 2 to a dummy
      4KiB UMP buffer allocated in memory and use it for the same purpose.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      d59ae8a7
    • DRI2: Workaround window resize bug in Mali r3p0 blob · 9e0a8731
      Siarhei Siamashka authored
      The Mali blob is doing something like this:
      
       1. Request BackLeft DRI2 buffer (buffer A) and render to it
       2. Swap buffers
       3. Request BackLeft DRI2 buffer (buffer B)
       4. Check window size, and if it has changed - go back to step 1.
       5. Render to the current back buffer (either buffer A or B)
       6. Swap buffers
       7. Go back to step 4
      
      A very serious show stopper problem is that the Mali blob ignores
      DRI2-InvalidateBuffers events and just uses GetGeometry polling
      to check whether the window size has changed. Unfortunately this
      is racy and we may end up with a size mismatch between buffer A
      and buffer B. This is particularly easy to trigger when the window
      size changes exactly between steps 1 and 3.
      
      See test/gles-yellow-blue-flip.c program which demonstrates this.
      Qt5 applications also trigger this bug.
      
      We work around the issue by explicitly tracking the requests for
      BackLeft buffers and checking whether the sizes of these buffers
      match at step 1 and step 3. However, the real challenge here is
      notifying the client application that these buffers are no good,
      so that it can request them again. As DRI2-InvalidateBuffers
      events are ignored, we are in a pretty difficult situation.
      Fortunately I remembered a weird behaviour observed earlier:
      
          https://groups.google.com/forum/#!msg/linux-sunxi/qnxpVaqp1Ys/aVTq09DVih0J

      Actually if we return UMP secure ID value 1 for the second DRI2
      buffer request, the blob responds to this by spitting out the
      following error message:
      
          [EGL-X11] [2274] DETECTED ONLY ONE FRAMEBUFFER - FORCING A RESIZE
          [EGL-X11] [2274] DRI2 UMP ID 0x3 retrieved
          [EGL-X11] [2274] DRI2 WINDOW UMP SECURE ID CHANGED (0x3 -> 0x3)
      
      And then it proceeds by re-trying to request a pair of DRI2 buffers.
      But that's exactly the behaviour we want! As a downside, some ugly
      flashing can be seen on screen at the time when this workaround kicks
      in, but then everything normalizes. And unfortunately, the race
      condition is still not totally eliminated because the blob is
      apparently getting DRI2 buffer sizes from the separate GetGeometry
      requests instead of using the information provided by DRI2GetBuffers.
      But now the problem is at least very hard to trigger.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      9e0a8731
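The size check between the step 1 and step 3 BackLeft requests can be sketched as a tiny state machine. This is a simplification under stated assumptions: requests are assumed to arrive in pairs per window, and `backleft_tracker_t` and `on_backleft_request` are hypothetical names, not the driver's own.

```c
#include <assert.h>

typedef struct {
    int count;          /* BackLeft requests seen in the current pair */
    int width, height;  /* size seen at the first request of the pair */
} backleft_tracker_t;

/* Called for every BackLeft request with the current window size.
 * Returns 1 if the buffer can be handed out normally, or 0 if the pair
 * has mismatching sizes and the caller should return the special UMP
 * secure ID that makes the blob request both buffers again. */
int on_backleft_request(backleft_tracker_t *t, int w, int h)
{
    assert(w > 0 && h > 0);
    if (t->count != 1) {
        /* First request of a new pair (step 1): remember the size */
        t->count  = 1;
        t->width  = w;
        t->height = h;
        return 1;
    }
    /* Second request of the pair (step 3): sizes must match */
    if (w != t->width || h != t->height) {
        t->count = 0;  /* force a fresh pair */
        return 0;
    }
    t->count = 2;
    return 1;
}
```

A tracker initialized to all zeros accepts two requests of the same size. A resize between the two requests makes the second call return 0, after which the next two requests form a fresh pair.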
  23. 20 Jul, 2013 1 commit
    • Fix XV border artifacts when using gstreamer 1.0 · 0a3dbfba
      Harm Hanemaaijer authored
      Since version 1.0, gstreamer (when using xvimagesink) often
      allocates a larger XV image for the video with padding on all
      four sides and then calls XvPutImage() to render a part of this
      image. With the current XV implementation this results in
      artifacts on the borders of the image, with a green bar at the
      bottom.
      
      I am observing this when playing a 1280x720 video on a 1920x1080
      screen at 32bpp; the size of the video window doesn't matter.

      This problem seems to be an exaggeration of the one described in
      https://bugzilla.gnome.org/show_bug.cgi?id=685305.
      
      The solution appears to be to use the source area dimensions as
      requested in the XvPutImage() call, as opposed to the dimensions
      of the originally allocated image, and to honour the offsets
      (src_x, src_y) when setting the source region on the display
      controller. With this relatively simple change, the problem seems
      to go away, and gstreamer 1.0 (which is faster than gstreamer 0.10
      due to a zero-copy strategy) provides an acceptable solution for
      video playback.
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
      0a3dbfba
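The fix described in this commit amounts to deriving the source region from the XvPutImage() arguments rather than from the allocated image size. A minimal sketch of that computation follows, with clamping to the image bounds added as a defensive assumption. `rect_t` and `xv_source_region` are hypothetical names, and the padded image size used in the note below is made up for illustration.

```c
#include <assert.h>

typedef struct { int x, y, width, height; } rect_t;

/* Source region to program into the display controller: the requested
 * (src_x, src_y, src_w, src_h) area, clipped to the allocated image. */
rect_t xv_source_region(int img_w, int img_h,
                        int src_x, int src_y, int src_w, int src_h)
{
    rect_t r;
    assert(img_w > 0 && img_h > 0);
    if (src_x < 0) { src_w += src_x; src_x = 0; }
    if (src_y < 0) { src_h += src_y; src_y = 0; }
    if (src_x > img_w) src_x = img_w;
    if (src_y > img_h) src_y = img_h;
    if (src_w > img_w - src_x) src_w = img_w - src_x;
    if (src_h > img_h - src_y) src_h = img_h - src_y;
    if (src_w < 0) src_w = 0;
    if (src_h < 0) src_h = 0;
    r.x = src_x;
    r.y = src_y;
    r.width = src_w;
    r.height = src_h;
    return r;
}
```

For a 1280x720 video stored in a padded 1344x784 image with 32 pixels of padding on each side, `xv_source_region(1344, 784, 32, 32, 1280, 720)` selects only the visible part, so the padding never reaches the screen.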
  24. 19 Jul, 2013 1 commit
    • Don't initialize XV if we can't reserve a scalable sunxi-disp layer · febafa2b
      Siarhei Siamashka authored
      
      
      If an attempt to reserve a scalable sunxi-disp layer fails, don't
      initialize XV at all. Otherwise any attempt to use the XV overlay
      is not going to work correctly and just results in the following
      dmesg spam:
      
      [  728.280000] [DISP] not supported yuv channel format:18 in img_sw_para_to_reg
      
      This may happen on Allwinner A13 if scaler mode is enabled in the
      .fex file (A13 only has one DEFE scaler). Allwinner A10 can also
      have similar troubles in a dual-head configuration if scaler
      mode is enabled for one or both screens (A10 has two DEFE scalers).
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      febafa2b
  25. 18 Jul, 2013 1 commit
  26. 17 Jul, 2013 1 commit
  27. 16 Jul, 2013 1 commit
  28. 11 Jul, 2013 1 commit
  29. 12 Jun, 2013 3 commits
    • Add CPU optimization for PutImage · 06f5aec6
      Harm Hanemaaijer authored
      Benchmark tests reveal that xorg's fb layer PutImage implementation
      does not follow an optimal code path for requests without special
      raster operations, which is due to the use of a slower general blit
      function instead of the pixman library. This affects Xlib PutImage
      requests and some ShmPutImage requests. In the case of ShmPutImage,
      xorg directs ShmPutImage requests to PutImage only if the width of
      the part of the image to be copied is equal to the full width of
      the image, resulting in relatively poor performance. If the width
      of the part of the image that is copied is smaller than the full
      image, then xorg uses CopyArea which results in the use of the
      already optimal pixman blit functions. The sub-optimal path is
      commonly triggered by applications such as window managers and web
      browsers.
      
      To fix this unnecessary performance flaw, PutImage is replaced with
      a version that uses pixman for the common case of GXcopy with all
      plane mask bits set. This change is device-independent and only uses
      pixman CPU blit functions that are already present in the xorg server.

      Using the low-level benchmark program benchx
      (https://github.com/hglm/benchx.git), the following speed-ups were
      measured (1920x1080x32bpp) on an Allwinner A10 device:
      
      ShmPutImageFullWidth (5 x 5): Speed up 9%
      ShmPutImageFullWidth (7 x 7): Slow down 5%
      ShmPutImageFullWidth (22 x 22): Speed up 8%
      ShmPutImageFullWidth (49 x 49): Speed up 19%
      ShmPutImageFullWidth (73 x 73): Speed up 55%
      ShmPutImageFullWidth (109 x 109): Speed up 50%
      ShmPutImageFullWidth (163 x 163): Speed up 37%
      ShmPutImageFullWidth (244 x 244): Speed up 111%
      ShmPutImageFullWidth (366 x 366): Speed up 77%
      ShmPutImageFullWidth (549 x 549): Speed up 92%
      AlignedShmPutImageFullWidth (5 x 5): Slow down 14%
      AlignedShmPutImageFullWidth (7 x 7): Slow down 6%
      AlignedShmPutImageFullWidth (15 x 15): Speed up 10%
      AlignedShmPutImageFullWidth (22 x 22): Speed up 9%
      AlignedShmPutImageFullWidth (33 x 33): Speed up 21%
      AlignedShmPutImageFullWidth (49 x 49): Speed up 28%
      AlignedShmPutImageFullWidth (73 x 73): Speed up 30%
      AlignedShmPutImageFullWidth (109 x 109): Speed up 47%
      AlignedShmPutImageFullWidth (163 x 163): Speed up 38%
      AlignedShmPutImageFullWidth (244 x 244): Speed up 63%
      AlignedShmPutImageFullWidth (366 x 366): Speed up 84%
      AlignedShmPutImageFullWidth (549 x 549): Speed up 89%
      
      At 16bpp the speed-up is even greater:
      
      ShmPutImageFullWidth (5 x 5): Slow down 8%
      ShmPutImageFullWidth (7 x 7): Slow down 8%
      ShmPutImageFullWidth (10 x 10): Slow down 6%
      ShmPutImageFullWidth (22 x 22): Speed up 9%
      ShmPutImageFullWidth (33 x 33): Speed up 20%
      ShmPutImageFullWidth (49 x 49): Speed up 27%
      ShmPutImageFullWidth (73 x 73): Speed up 69%
      ShmPutImageFullWidth (109 x 109): Speed up 74%
      ShmPutImageFullWidth (163 x 163): Speed up 100%
      ShmPutImageFullWidth (244 x 244): Speed up 111%
      ShmPutImageFullWidth (366 x 366): Speed up 133%
      ShmPutImageFullWidth (549 x 549): Speed up 123%
      AlignedShmPutImageFullWidth (5 x 5): Speed up 6%
      AlignedShmPutImageFullWidth (7 x 7): Slow down 9%
      AlignedShmPutImageFullWidth (10 x 10): Slow down 10%
      AlignedShmPutImageFullWidth (33 x 33): Speed up 17%
      AlignedShmPutImageFullWidth (49 x 49): Speed up 34%
      AlignedShmPutImageFullWidth (73 x 73): Speed up 49%
      AlignedShmPutImageFullWidth (109 x 109): Speed up 53%
      AlignedShmPutImageFullWidth (163 x 163): Speed up 69%
      AlignedShmPutImageFullWidth (244 x 244): Speed up 82%
      AlignedShmPutImageFullWidth (366 x 366): Speed up 116%
      AlignedShmPutImageFullWidth (549 x 549): Speed up 110%
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
      06f5aec6
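The condition for taking the pixman fast path (plain copy with every plane enabled) can be sketched as a small predicate. The sketch is standalone and does not include the X server headers: `ALU_COPY` is a local stand-in with the same numeric value as `GXcopy` in `X11/X.h`, and `planemask_all_set` and `can_use_fast_putimage` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define ALU_COPY 0x3  /* same numeric value as GXcopy in X11/X.h */

/* True if all plane mask bits that matter for this depth are set */
static int planemask_all_set(uint32_t planemask, int depth)
{
    uint32_t all;
    assert(depth > 0 && depth <= 32);
    all = (depth == 32) ? 0xFFFFFFFFu : ((1u << depth) - 1u);
    return (planemask & all) == all;
}

/* The fast path only handles a straight copy that touches all planes.
 * Any other raster operation or a partial plane mask has to go through
 * the general (slower) implementation. */
int can_use_fast_putimage(int alu, uint32_t planemask, int depth)
{
    return alu == ALU_COPY && planemask_all_set(planemask, depth);
}
```

A request that fails this test would simply be forwarded to the original PutImage implementation, so only the common case changes behaviour.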
    • CPU: use VFP overlapped blit on VFP-capable hardware by default · 3ad74420
      Siarhei Siamashka authored
      
      
      This should be useful for Raspberry Pi. When reading uncached source buffers,
      the VFP optimized overlapped two-pass blit is roughly 2-3 times slower than
      memcpy in cached memory. This makes it reasonably competitive compared to
      ShadowFB (considering that ShadowFB allocates an extra buffer, does extra
      memory copies which take time and thrash L2 cache, etc.). It even provides
      a slight performance advantage in a more or less realistic use case
      (scrolling in xterm), which needs reads from the framebuffer:
      
      ==== Before (xf86-video-fbdev with ShadowFB) ====
      
      $ time DISPLAY=:0 xterm +j -maximized -e cat longtext.txt
      
      real    1m50.245s
      user    0m1.750s
      sys     0m0.800s
      
      ==== After (xf86-video-sunxifb without ShadowFB) ====
      
      $ time DISPLAY=:0 xterm +j -maximized -e cat longtext.txt
      
      real    1m27.709s
      user    0m1.690s
      sys     0m0.920s
      
      We get decent results even when reading from the framebuffer. However
      in many typical workloads (excluding scrolling and dragging windows)
      the framebuffer is primarily used as write-only. In write-only use
      cases ShadowFB is just pure overhead. So getting rid of it is a
      very good idea as this improves overall graphics performance.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      3ad74420
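The general idea of a two-pass overlapped copy can be sketched in portable C. This version uses plain `memcpy` instead of VFP instructions and is only meant to show the structure: each chunk is first fetched from the source into a small scratch buffer (which would live in cached memory), then written to the destination. The name `twopass_overlapped_copy` and the scratch size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCRATCH_SIZE 256  /* hypothetical scratch buffer size in bytes */

/* memmove-like copy for regions within the same buffer, done in two
 * passes per chunk: source -> scratch, then scratch -> destination */
void twopass_overlapped_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint8_t scratch[SCRATCH_SIZE];

    assert(dst != NULL && src != NULL);
    if (dst == src || n == 0)
        return;

    if (dst < src) {
        /* Copy forward, so unread source bytes are never overwritten */
        size_t done = 0;
        while (done < n) {
            size_t chunk = n - done < SCRATCH_SIZE ? n - done : SCRATCH_SIZE;
            memcpy(scratch, src + done, chunk);  /* pass 1: fetch */
            memcpy(dst + done, scratch, chunk);  /* pass 2: store */
            done += chunk;
        }
    } else {
        /* Copy backward for the opposite overlap direction */
        size_t left = n;
        while (left > 0) {
            size_t chunk = left < SCRATCH_SIZE ? left : SCRATCH_SIZE;
            left -= chunk;
            memcpy(scratch, src + left, chunk);  /* pass 1: fetch */
            memcpy(dst + left, scratch, chunk);  /* pass 2: store */
        }
    }
}
```

Going through the scratch buffer keeps each `memcpy` free of overlap, and it separates the slow reads from uncached memory from the writes. That separation is where an optimized fetch routine would plug in.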
    • Fix segfault on exit (introduced by the new backing store code) · 3676a495
      Siarhei Siamashka authored
      
      
      A small typo in a function argument and the C compiler happily accepting
      void pointers instead of something else is a dangerous combo. Need to
      be more careful.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      3676a495