1. 09 Sep, 2013 2 commits
  2. 07 Sep, 2013 3 commits
  3. 06 Sep, 2013 1 commit
    • sunxi: Only enable scaler for the layer when it is really necessary · 64a0d642
      Siarhei Siamashka authored
      
      
      Now the scaler is enabled for the sunxi disp layer only when we want
      to use it for YUV format with XV. Whenever the layer is configured
      for RGB format or deactivated, the scaler gets disabled.
      
      This should make the driver more friendly to the other potential
      scaled layer users. The total number of available scalers is only
      2 for Allwinner A10 and only 1 for Allwinner A13.
      
      The potential drawback is that now we may get an error when trying
      to enable the scaler (if somebody else has used up all the available
      scalers) instead of always having it reserved and ready for use.
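      The on-demand policy described above can be sketched as follows. This
      is a minimal Python model, not the driver's C code: `ScalerPool`,
      `Layer` and their methods are hypothetical names that only illustrate
      the bookkeeping, and the real driver goes through the sunxi disp
      kernel interface instead.

```python
class ScalerPool:
    """Models the limited number of hardware scalers (2 on A10, 1 on A13)."""

    def __init__(self, total):
        self.free = total

    def acquire(self):
        # Fails when some other user already holds every scaler.
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


class Layer:
    """Hypothetical layer that holds a scaler only while showing YUV."""

    def __init__(self, pool):
        self.pool = pool
        self.has_scaler = False

    def set_format(self, fmt):
        # fmt is "yuv", "rgb" or None (layer deactivated).
        if fmt == "yuv":
            if not self.has_scaler:
                self.has_scaler = self.pool.acquire()
            return self.has_scaler  # False means the enable request failed
        if self.has_scaler:
            # RGB or deactivated: give the scaler back to other users.
            self.pool.release()
            self.has_scaler = False
        return True
```

      With a pool of one scaler (the A13 case), a second layer asking for
      YUV fails until the first layer switches back to RGB, which is the
      drawback mentioned above.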
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  4. 13 Aug, 2013 1 commit
  5. 04 Aug, 2013 3 commits
  6. 03 Aug, 2013 1 commit
  7. 31 Jul, 2013 3 commits
  8. 30 Jul, 2013 1 commit
  9. 29 Jul, 2013 2 commits
  10. 28 Jul, 2013 1 commit
  11. 26 Jul, 2013 2 commits
    • DRI2: CPU copy fallback path does not drop half of the frames anymore · 0fd7d5de
      Siarhei Siamashka authored
      The recent commit 9e0a8731 (the part of it that suppressed buffer
      reuse in the Xorg DRI2 framework) introduced a regression. Half of
      the frames stopped reaching the screen on the CPU copy fallback path
      because the Mali blob now ended up rendering them to the "wrong"
      buffer.
      
      This just confirms that we need to move away completely from the
      standard DRI2 framework in the Xorg server to our own buffer
      bookkeeping logic. This patch fixes the regression by introducing a
      single UMP buffer per window, which is shared between the back and
      front DRI2 buffers. We can do this because double buffering does not
      make much sense on the fallback path at the moment (we can't set
      scanout from this buffer and have to copy the data elsewhere
      immediately after we get it from Mali anyway).
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • DRI2: only pay attention to back buffers requests · 7994a0f3
      Siarhei Siamashka authored
      
      
      Bail out earlier for the uninteresting types of DRI2 buffer
      requests (by just returning a dummy null UMP buffer). This makes
      the code a bit simpler on the common path.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  12. 24 Jul, 2013 3 commits
    • test: new gles-rgb-cycle-demo for testing the correctness of DRI2 · 1f89628c
      Siarhei Siamashka authored
      
      
      The test program cycles through 3 colors (red, green, blue), so
      it is easier to see if we get the color change pattern wrong.
      Also the X11 window title is updated to indicate the current
      color information. If we have any problems with window
      decorations handling, they are likely to be exposed.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • DRI2: Refine the workaround for Mali r3p0 window resizing issue · d59ae8a7
      Siarhei Siamashka authored
      
      
      Using the secure id 1 (framebuffer) to trick the Mali blob into
      requesting DRI2 buffers again was not a very good idea. The problem
      is that the blob still writes something there and corrupts the
      framebuffer. So instead we try to assign secure id 2 to a dummy
      4KiB UMP buffer allocated in memory and use it for the same purpose.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • DRI2: Workaround window resize bug in Mali r3p0 blob · 9e0a8731
      Siarhei Siamashka authored
      The Mali blob is doing something like this:
      
       1. Request BackLeft DRI2 buffer (buffer A) and render to it
       2. Swap buffers
       3. Request BackLeft DRI2 buffer (buffer B)
       4. Check window size, and if it has changed - go back to step 1.
       5. Render to the current back buffer (either buffer A or B)
       6. Swap buffers
       7. Go back to step 4
      
      A very serious show stopper problem is that the Mali blob ignores
      DRI2-InvalidateBuffers events and just uses GetGeometry polling
      to check whether the window size has changed. Unfortunately this
      is racy and we may end up with a size mismatch between buffer A
      and buffer B. This is particularly easy to trigger when the window
      size changes exactly between steps 1 and 3.
      
      See the test/gles-yellow-blue-flip.c program, which demonstrates this.
      Qt5 applications also trigger this bug.
      
      We work around the issue by explicitly tracking the requests for
      BackLeft buffers and checking whether the sizes of these buffers
      match at step 1 and step 3. However, the real challenge here is
      notifying the client application that these buffers are no good,
      so that it can request them again. As DRI2-InvalidateBuffers
      events are ignored, we are in a pretty difficult situation.
      Fortunately I remembered a weird behaviour observed earlier:
      
          https://groups.google.com/forum/#!msg/linux-sunxi/qnxpVaqp1Ys/aVTq09DVih0J
      
      
      
      If we return UMP secure ID value 1 for the second DRI2 buffer
      request, the blob responds by spitting out the following error
      message:
      
          [EGL-X11] [2274] DETECTED ONLY ONE FRAMEBUFFER - FORCING A RESIZE
          [EGL-X11] [2274] DRI2 UMP ID 0x3 retrieved
          [EGL-X11] [2274] DRI2 WINDOW UMP SECURE ID CHANGED (0x3 -> 0x3)
      
      It then proceeds to retry requesting a pair of DRI2 buffers, which
      is exactly the behaviour we want. As a downside, some ugly flashing
      can be seen on screen when this workaround kicks in, but then
      everything normalizes. Unfortunately, the race condition is still
      not totally eliminated, because the blob apparently gets DRI2
      buffer sizes from separate GetGeometry requests instead of using
      the information provided by DRI2GetBuffers. But now the problem is
      at least very hard to trigger.
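      The size check between step 1 and step 3 can be modelled roughly as
      shown here. This is an illustrative Python sketch rather than the
      driver's C code: `BackBufferTracker`, `FORCE_RESIZE_ID` and the way
      consecutive requests are paired are assumptions made for the example.

```python
FORCE_RESIZE_ID = 1  # UMP secure ID 1, which makes the blob re-request buffers


class BackBufferTracker:
    """Pairs up consecutive BackLeft requests for one window and compares sizes."""

    def __init__(self):
        self.first_size = None  # size seen at step 1, None if no request pending

    def on_back_left_request(self, size, real_secure_id):
        """Return the secure ID to hand back for a BackLeft request of `size`."""
        if self.first_size is None:
            # Step 1: remember the size of buffer A.
            self.first_size = size
            return real_secure_id
        # Step 3: buffer B must match buffer A.
        mismatch = size != self.first_size
        self.first_size = None  # the pair is complete either way
        return FORCE_RESIZE_ID if mismatch else real_secure_id
```

      When the window is resized between the two requests, the second call
      returns ID 1 and the blob starts over with a fresh pair of requests,
      which the tracker then treats as a new step 1 and step 3.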
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  13. 20 Jul, 2013 1 commit
    • Fix XV border artifacts when using gstreamer 1.0 · 0a3dbfba
      Harm Hanemaaijer authored
      Since version 1.0, gstreamer (when using xvimagesink) often
      allocates a larger XV image for the video with padding on all
      four sides and then calls XvPutImage() to render a part of this
      image. With the current XV implementation this results in
      artifacts on the borders of the image, with a green bar at the
      bottom.
      
      I am observing this when playing a 1280x720 video on a 1920x1080
      screen at 32bpp; the size of the video window doesn't matter.
      
      This problem seems to be an exaggeration of the one described in
      https://bugzilla.gnome.org/show_bug.cgi?id=685305.
      
      The solution appears to be to use the source area dimensions as
      requested in the XvPutImage() call, as opposed to the dimensions
      of the originally allocated image, and to honour the offsets
      (src_x, src_y) when setting the source region on the display
      controller. With this relatively simple change, the problem seems
      to go away, and gstreamer 1.0 (which is faster than gstreamer 0.10
      due to a zero-copy strategy) provides an acceptable solution for
      video playback.
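      A minimal sketch of the region selection described above, written in
      Python for brevity. The function name `source_region` is hypothetical,
      and the clamping to the allocated image is an assumption of this
      sketch rather than something stated in the commit message.

```python
def source_region(image_w, image_h, src_x, src_y, src_w, src_h):
    """Return the (x, y, w, h) rectangle to scan out of the XV image.

    image_w/image_h are the dimensions of the allocated (possibly padded)
    image; src_* are the values passed to XvPutImage().
    """
    # Clamp to the allocated image (assumption of this sketch).
    x = max(0, min(src_x, image_w))
    y = max(0, min(src_y, image_h))
    w = max(0, min(src_w, image_w - x))
    h = max(0, min(src_h, image_h - y))
    return x, y, w, h
```

      For instance, with a hypothetical 1344x784 allocation holding a
      1280x720 frame at offset (32, 32), the function returns
      (32, 32, 1280, 720) instead of the full padded image, so the padding
      never reaches the screen.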
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
  14. 19 Jul, 2013 1 commit
    • Don't initialize XV if we can't reserve a scalable sunxi-disp layer · febafa2b
      Siarhei Siamashka authored
      
      
      If an attempt to reserve a scalable sunxi-disp layer fails, don't
      initialize XV at all. Otherwise any attempt to use the XV overlay
      is not going to work correctly and just results in the following
      dmesg spam:
      
      [  728.280000] [DISP] not supported yuv channel format:18 in img_sw_para_to_reg
      
      This may happen on Allwinner A13 if scaler mode is enabled in the
      .fex file (A13 only has one DEFE scaler). Allwinner A10 can also
      have similar trouble in a dual-head configuration if scaler mode
      is enabled for one or both screens (A10 has two DEFE scalers).
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  15. 18 Jul, 2013 2 commits
  16. 17 Jul, 2013 2 commits
  17. 16 Jul, 2013 1 commit
  18. 11 Jul, 2013 1 commit
  19. 12 Jun, 2013 4 commits
    • Add CPU optimization for PutImage · 06f5aec6
      Harm Hanemaaijer authored
      Benchmark tests reveal that xorg's fb layer PutImage implementation
      does not follow an optimal code path for requests without special
      raster operations, because it uses a slower general blit function
      instead of the pixman library. This affects Xlib PutImage requests
      and some ShmPutImage requests. xorg directs a ShmPutImage request to
      PutImage only if the width of the part of the image to be copied is
      equal to the full width of the image, which results in relatively
      poor performance. If the part of the image that is copied is
      narrower than the full image, xorg uses CopyArea, which already uses
      the optimal pixman blit functions. The sub-optimal path is commonly
      triggered by applications such as window managers and web browsers.
      
      To fix this unnecessary performance flaw, PutImage is replaced with
      a version that uses pixman for the common case of GXcopy with all
      plane mask bits set. This change is device-independent and only uses
      pixman CPU blit functions that are already present in the xorg server.
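      The two conditions described above can be sketched like this. It is
      a simplified Python model, not xorg code: the function names are
      hypothetical, and the ShmPutImage dispatch is reduced to the width
      test named in the text.

```python
GXCOPY = 0x3  # value of GXcopy in X11's X.h


def can_use_pixman_blit(alu, planemask, depth):
    """True when a PutImage request is a plain copy that pixman can handle."""
    all_planes = (1 << depth) - 1
    return alu == GXCOPY and (planemask & all_planes) == all_planes


def shm_put_image_path(src_w, image_w):
    """Which core path a ShmPutImage request takes, per the description above."""
    return "PutImage" if src_w == image_w else "CopyArea"
```

      Requests that fail the first test (for example GXxor, or a partial
      plane mask) would still need the general blit function.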
      
      Using the low-level benchmark program benchx
      (https://github.com/hglm/benchx.git), the following speed-ups were
      measured (1920x1080x32bpp) on an Allwinner A10 device:
      
      ShmPutImageFullWidth (5 x 5): Speed up 9%
      ShmPutImageFullWidth (7 x 7): Slow down 5%
      ShmPutImageFullWidth (22 x 22): Speed up 8%
      ShmPutImageFullWidth (49 x 49): Speed up 19%
      ShmPutImageFullWidth (73 x 73): Speed up 55%
      ShmPutImageFullWidth (109 x 109): Speed up 50%
      ShmPutImageFullWidth (163 x 163): Speed up 37%
      ShmPutImageFullWidth (244 x 244): Speed up 111%
      ShmPutImageFullWidth (366 x 366): Speed up 77%
      ShmPutImageFullWidth (549 x 549): Speed up 92%
      AlignedShmPutImageFullWidth (5 x 5): Slow down 14%
      AlignedShmPutImageFullWidth (7 x 7): Slow down 6%
      AlignedShmPutImageFullWidth (15 x 15): Speed up 10%
      AlignedShmPutImageFullWidth (22 x 22): Speed up 9%
      AlignedShmPutImageFullWidth (33 x 33): Speed up 21%
      AlignedShmPutImageFullWidth (49 x 49): Speed up 28%
      AlignedShmPutImageFullWidth (73 x 73): Speed up 30%
      AlignedShmPutImageFullWidth (109 x 109): Speed up 47%
      AlignedShmPutImageFullWidth (163 x 163): Speed up 38%
      AlignedShmPutImageFullWidth (244 x 244): Speed up 63%
      AlignedShmPutImageFullWidth (366 x 366): Speed up 84%
      AlignedShmPutImageFullWidth (549 x 549): Speed up 89%
      
      At 16bpp the speed-up is even greater:
      
      ShmPutImageFullWidth (5 x 5): Slow down 8%
      ShmPutImageFullWidth (7 x 7): Slow down 8%
      ShmPutImageFullWidth (10 x 10): Slow down 6%
      ShmPutImageFullWidth (22 x 22): Speed up 9%
      ShmPutImageFullWidth (33 x 33): Speed up 20%
      ShmPutImageFullWidth (49 x 49): Speed up 27%
      ShmPutImageFullWidth (73 x 73): Speed up 69%
      ShmPutImageFullWidth (109 x 109): Speed up 74%
      ShmPutImageFullWidth (163 x 163): Speed up 100%
      ShmPutImageFullWidth (244 x 244): Speed up 111%
      ShmPutImageFullWidth (366 x 366): Speed up 133%
      ShmPutImageFullWidth (549 x 549): Speed up 123%
      AlignedShmPutImageFullWidth (5 x 5): Speed up 6%
      AlignedShmPutImageFullWidth (7 x 7): Slow down 9%
      AlignedShmPutImageFullWidth (10 x 10): Slow down 10%
      AlignedShmPutImageFullWidth (33 x 33): Speed up 17%
      AlignedShmPutImageFullWidth (49 x 49): Speed up 34%
      AlignedShmPutImageFullWidth (73 x 73): Speed up 49%
      AlignedShmPutImageFullWidth (109 x 109): Speed up 53%
      AlignedShmPutImageFullWidth (163 x 163): Speed up 69%
      AlignedShmPutImageFullWidth (244 x 244): Speed up 82%
      AlignedShmPutImageFullWidth (366 x 366): Speed up 116%
      AlignedShmPutImageFullWidth (549 x 549): Speed up 110%
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
    • CPU: use VFP overlapped blit on VFP-capable hardware by default · 3ad74420
      Siarhei Siamashka authored
      
      
      This should be useful for Raspberry Pi. When reading uncached source
      buffers, the VFP optimized overlapped two-pass blit is roughly 2-3 times
      slower than memcpy in cached memory, which makes it reasonably
      competitive with ShadowFB (considering that ShadowFB allocates an extra
      buffer, does extra memory copies which take time and thrash the L2 cache,
      etc.). It even provides a slight performance advantage in a more or less
      realistic use case (scrolling in xterm), which needs reads from the
      framebuffer:
      
      ==== Before (xf86-video-fbdev with ShadowFB) ====
      
      $ time DISPLAY=:0 xterm +j -maximized -e cat longtext.txt
      
      real    1m50.245s
      user    0m1.750s
      sys     0m0.800s
      
      ==== After (xf86-video-sunxifb without ShadowFB) ====
      
      $ time DISPLAY=:0 xterm +j -maximized -e cat longtext.txt
      
      real    1m27.709s
      user    0m1.690s
      sys     0m0.920s
      
      We get decent results even when reading from the framebuffer. However,
      in many typical workloads (excluding scrolling and dragging windows)
      the framebuffer is primarily write-only. In write-only use cases
      ShadowFB is pure overhead, so getting rid of it is a very good idea
      as this improves overall graphics performance.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • Fix segfault on exit (introduced by the new backing store code) · 3676a495
      Siarhei Siamashka authored
      
      
      A small typo in a function argument, combined with the C compiler
      happily accepting void pointers in place of other pointer types, is
      a dangerous combination. We need to be more careful.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • Backing store heuristics for improving windows dragging performance · f5501ff1
      Siarhei Siamashka authored
      
      
      This patch implements a heuristic which enables backing store for some
      windows. When backing store is enabled for a window, the window gets a
      backing pixmap (via automatic redirection provided by the composite
      extension). It acts a bit like ShadowFB, but for individual windows.
      
      The advantage of backing store is that we can avoid the "expose event ->
      redraw" animated trail in the exposed area when dragging another window
      on top of it. Dragging windows becomes much smoother and faster.
      
      The disadvantages of backing store are the same as for ShadowFB: a loss
      of precious RAM, an extra buffer copy when somebody updates the window
      content, and potentially skipped frames in fast animation (they just do
      not reach the screen). Also, hardware accelerated scrolling does not
      currently work for windows with backing store enabled.
      
      We try to make the best use of backing store by enabling it for all the
      windows that are direct children of root, except the one which has
      keyboard focus (either directly or via one of its children). In practice
      this heuristic seems to provide nearly perfect results:
       1) dragging windows is fast and smooth.
       2) the top level window with the keyboard focus (typically the application
          that a user is working with) is G2D accelerated and does not suffer from
          any intermediate buffer copy overhead.
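      The selection rule can be illustrated with a small Python model of
      the window tree. The `Window` class and both helper functions are
      hypothetical and stand in for the X server's own window structures.

```python
class Window:
    """Minimal stand-in for a window with a list of child windows."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def contains(window, target):
    """True if `target` is `window` itself or one of its descendants."""
    return window is target or any(contains(c, target) for c in window.children)


def windows_with_backing_store(root, focused):
    """Direct children of root, minus the one that holds keyboard focus."""
    return [w.name for w in root.children if not contains(w, focused)]
```

      Whenever the keyboard focus moves to another top-level window, the
      set has to be recomputed, so the newly focused window loses its
      backing store and the previously focused one gains it.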
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  20. 10 Jun, 2013 1 commit
  21. 07 Jun, 2013 2 commits
  22. 05 Jun, 2013 1 commit
    • CPU: Added ARM VFP two-pass overlapped blit implementation · b93dab5c
      Siarhei Siamashka authored
      
      
      Using VFP, we can load up to 128 bytes with a single VLDM instruction.
      But before this patch, only a NEON implementation was available, simply
      because it showed better results than VFP on Allwinner A10, and this DDX
      driver used to primarily target just sunxi hardware.
      
      But it looks like it makes sense to also target other devices (at least
      ODROID-X, which has the same Mali400 GPU and can use the same DRI2
      integration for EGL and GLESv2 support). And on the other ARM devices,
      VFP aligned reads generally work better than NEON. The benchmark
      results are listed below:
      
                  1280x720, 32bpp, testing "x11perf -scroll500"
      
      == Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement disabled ==
      
      NEON : 10000 trep @   3.7101 msec (   270.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   2.6678 msec (   375.0/sec): Scroll 500x500 pixels
      
      == Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement enabled ==
      
      NEON : 15000 trep @   2.2568 msec (   443.0/sec): Scroll 500x500 pixels
      VFP  : 15000 trep @   2.3016 msec (   434.0/sec): Scroll 500x500 pixels
      
      == Exynos 4412, Cortex-A9 ==
      
      NEON : 10000 trep @   4.5125 msec (   222.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   2.7015 msec (   370.0/sec): Scroll 500x500 pixels
      
      == TI DM3730, Cortex-A8 ==
      
      NEON : 15000 trep @   2.2303 msec (   448.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   3.0670 msec (   326.0/sec): Scroll 500x500 pixels
      
      == Allwinner A10, Cortex-A8 ==
      
      NEON : 10000 trep @   2.5559 msec (   391.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   3.0580 msec (   327.0/sec): Scroll 500x500 pixels
      
      == Raspberry Pi, BCM2708, ARM1176 ==
      
      VFP  :  3000 trep @   8.7699 msec (   114.0/sec): Scroll 500x500 pixels
      
      The benchmark numbers in this particular test setup roughly represent
      memory copy bandwidth measured in MB/s (when doing overlapped blits
      inside of a writecombine mapped framebuffer).
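      The reason the scroll rate reads as MB/s is simple arithmetic: a
      500x500 area at 32bpp is 500 * 500 * 4 = 1,000,000 bytes. A short
      Python check, assuming each scroll copies the whole area once and
      taking 1 MB as 10**6 bytes (the function name is made up for this
      example):

```python
def scroll_bandwidth_mb_per_s(scrolls_per_sec, width=500, height=500,
                              bytes_per_pixel=4):
    """Bytes copied per second, expressed in MB/s (1 MB = 10**6 bytes)."""
    return scrolls_per_sec * width * height * bytes_per_pixel / 1e6
```

      So the 391.0/sec NEON result on Allwinner A10 corresponds to roughly
      391 MB/s under these assumptions.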
      
      -----------------------------------------------------------------------
      
      Note: the use of VFP two-pass overlapped copy instead of ShadowFB is
            still not enabled by default when running on Raspberry Pi
            because the performance results are not so great.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  23. 03 Jun, 2013 1 commit