1. 26 Jul, 2013 1 commit
  2. 24 Jul, 2013 2 commits
    • DRI2: Refine the workaround for Mali r3p0 window resizing issue · d59ae8a7
      Siarhei Siamashka authored
      
      
      Using secure ID 1 (the framebuffer) to trick the Mali blob into
      requesting DRI2 buffers again was not a very good idea. The problem
      is that the blob still writes something there and corrupts the
      framebuffer. So instead we try to assign secure ID 2 to a dummy
      4KiB UMP buffer allocated in memory and use it for the same purpose.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • DRI2: Workaround window resize bug in Mali r3p0 blob · 9e0a8731
      Siarhei Siamashka authored
      The Mali blob is doing something like this:
      
       1. Request BackLeft DRI2 buffer (buffer A) and render to it
       2. Swap buffers
       3. Request BackLeft DRI2 buffer (buffer B)
       4. Check the window size, and if it has changed, go back to step 1.
       5. Render to the current back buffer (either buffer A or B)
       6. Swap buffers
       7. Go back to step 4
      
      A serious show-stopper problem is that the Mali blob ignores
      DRI2-InvalidateBuffers events and just polls with GetGeometry to
      check whether the window size has changed. Unfortunately this is
      racy, and we may end up with a size mismatch between buffer A and
      buffer B. This is particularly easy to trigger when the window
      size changes exactly between steps 1 and 3.
      
      See test/gles-yellow-blue-flip.c program which demonstrates this.
      Qt5 applications also trigger this bug.
      
      We work around the issue by explicitly tracking the requests for
      BackLeft buffers and checking whether the sizes of these buffers
      match at step 1 and step 3. However, the real challenge here is
      notifying the client application that these buffers are no good,
      so that it can request them again. As DRI2-InvalidateBuffers
      events are ignored, we are in a pretty difficult situation.
      Fortunately I remembered a weird behaviour observed earlier:
      
          https://groups.google.com/forum/#!msg/linux-sunxi/qnxpVaqp1Ys/aVTq09DVih0J
      
      
      
      It turns out that if we return UMP secure ID 1 for the second DRI2
      buffer request, the blob responds by spitting out the following
      error messages:
      
          [EGL-X11] [2274] DETECTED ONLY ONE FRAMEBUFFER - FORCING A RESIZE
          [EGL-X11] [2274] DRI2 UMP ID 0x3 retrieved
          [EGL-X11] [2274] DRI2 WINDOW UMP SECURE ID CHANGED (0x3 -> 0x3)
      
      And then it proceeds by re-trying to request a pair of DRI2 buffers.
      But that's exactly the behaviour we want! As a downside, some ugly
      flashing can be seen on screen when this workaround kicks in, but
      then everything normalizes. Unfortunately, the race condition is
      still not totally eliminated, because the blob apparently gets the
      DRI2 buffer sizes from separate GetGeometry requests instead of
      using the information provided by DRI2GetBuffers. But now the
      problem is at least very hard to trigger.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  3. 20 Jul, 2013 1 commit
    • Fix XV border artifacts when using gstreamer 1.0 · 0a3dbfba
      Harm Hanemaaijer authored
      Since version 1.0, gstreamer (when using xvimagesink) often
      allocates a larger XV image for the video with padding on all
      four sides and then calls XvPutImage() to render a part of this
      image. With the current XV implementation this results in
      artifacts on the borders of the image, with a green bar at the
      bottom.
      
      I am observing this when playing a 1280x720 video on a 1920x1080
      screen at 32bpp; the size of the video window doesn't matter.
      
      This problem seems to be an exaggeration of the one described in
      GNOME Bugzilla (https://bugzilla.gnome.org/show_bug.cgi?id=685305).
      
      The solution appears to be to use the source area dimensions as
      requested in the XvPutImage() call, as opposed to the dimensions
      of the originally allocated image, and to honour the offsets
      (src_x, src_y) when setting the source region on the display
      controller. With this relatively simple change, the problem seems
      to go away, and gstreamer 1.0 (which is faster than gstreamer 0.10
      due to a zero-copy strategy) provides an acceptable solution for
      video playback.
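      A minimal sketch of the change, using a hypothetical src_region structure rather than the real sunxi-disp layer parameters: the allocated image size is still reported as the source framebuffer size, while the source window is taken from the src_x, src_y, src_w and src_h arguments of XvPutImage().

```c
#include <assert.h>

/* Hypothetical description of what the display controller scans out
 * from a (possibly padded) XV image; not the sunxi-disp structure. */
typedef struct {
    int fb_width, fb_height;   /* size of the whole allocated XV image */
    int x, y, width, height;   /* region that is actually displayed */
} src_region;

/* Previously the whole image was used as the source, so the padding
 * added by gstreamer 1.0 became visible. Here the XvPutImage() source
 * rectangle, including its offsets, is honoured instead. */
static src_region xv_source_region(int image_w, int image_h,
                                   int src_x, int src_y,
                                   int src_w, int src_h)
{
    src_region r;

    assert(src_x >= 0 && src_y >= 0 && src_w >= 0 && src_h >= 0);
    assert(src_x + src_w <= image_w && src_y + src_h <= image_h);

    r.fb_width  = image_w;
    r.fb_height = image_h;
    r.x      = src_x;
    r.y      = src_y;
    r.width  = src_w;
    r.height = src_h;
    return r;
}
```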
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
  4. 19 Jul, 2013 1 commit
    • Don't initialize XV if we can't reserve a scalable sunxi-disp layer · febafa2b
      Siarhei Siamashka authored
      
      
      If an attempt to reserve a scalable sunxi-disp layer fails, don't
      initialize XV at all. Otherwise any attempt to use the XV overlay
      is not going to work correctly and just results in the following
      dmesg spam:
      
      [  728.280000] [DISP] not supported yuv channel format:18 in img_sw_para_to_reg
      
      This may happen on Allwinner A13 if scaler mode is enabled in the
      .fex file (A13 only has one DEFE scaler). Allwinner A10 can also
      have similar trouble in a dual-head configuration if scaler mode
      is enabled for one or both screens (A10 has two DEFE scalers).
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  5. 18 Jul, 2013 1 commit
  6. 17 Jul, 2013 1 commit
  7. 16 Jul, 2013 1 commit
  8. 11 Jul, 2013 1 commit
  9. 12 Jun, 2013 4 commits
    • Add CPU optimization for PutImage · 06f5aec6
      Harm Hanemaaijer authored
      Benchmark tests reveal that xorg's fb layer PutImage implementation
      does not follow an optimal code path for requests without special
      raster operations, because it uses a slower general blit function
      instead of the pixman library. This affects Xlib PutImage requests
      and some ShmPutImage requests. xorg directs ShmPutImage requests to
      PutImage only if the width of the part of the image to be copied is
      equal to the full width of the image, and these requests then get
      relatively poor performance. If the copied part is narrower than the
      full image, xorg uses CopyArea instead, which already ends up in the
      optimal pixman blit functions. The sub-optimal path is commonly
      triggered by applications such as window managers and web browsers.
      
      To fix this unnecessary performance flaw, PutImage is replaced with
      a version that uses pixman for the common case of GXcopy with all
      plane mask bits set. This change is device-independent and only uses
      pixman CPU blit functions that are already present in the xorg server.
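      A sketch of the fast-path test, assuming the usual X11 value 0x3 for GXcopy and treating the plane mask as "all set" when every bit that matters for the drawable depth is set. blit_rows() is a hypothetical memcpy()-based stand-in for the pixman blit that the patch actually calls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GX_COPY 0x3  /* same value as X11's GXcopy */

/* All plane-mask bits that matter for a drawable of the given depth. */
static uint32_t full_plane_mask(int depth)
{
    assert(depth >= 1 && depth <= 32);
    return depth == 32 ? 0xffffffffu : ((1u << depth) - 1u);
}

/* The fast path applies only to a plain copy with every plane enabled. */
static int putimage_can_use_fast_path(int alu, uint32_t planemask, int depth)
{
    uint32_t full = full_plane_mask(depth);
    return alu == GX_COPY && (planemask & full) == full;
}

/* Hypothetical stand-in for the pixman blit: copy a rectangle row by
 * row. Strides and widths are in bytes in this sketch. */
static void blit_rows(const uint8_t *src, int src_stride,
                      uint8_t *dst, int dst_stride,
                      int width_bytes, int height)
{
    for (int y = 0; y < height; y++)
        memcpy(dst + (size_t)y * dst_stride,
               src + (size_t)y * src_stride, (size_t)width_bytes);
}
```

      Everything else (other raster operations, partial plane masks) keeps going through the existing general fb code.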
      
      Using the low-level benchmark program benchx
      (https://github.com/hglm/benchx.git), the following speed-ups were
      measured (1920x1080x32bpp) on an Allwinner A10 device:
      
      ShmPutImageFullWidth (5 x 5): Speed up 9%
      ShmPutImageFullWidth (7 x 7): Slow down 5%
      ShmPutImageFullWidth (22 x 22): Speed up 8%
      ShmPutImageFullWidth (49 x 49): Speed up 19%
      ShmPutImageFullWidth (73 x 73): Speed up 55%
      ShmPutImageFullWidth (109 x 109): Speed up 50%
      ShmPutImageFullWidth (163 x 163): Speed up 37%
      ShmPutImageFullWidth (244 x 244): Speed up 111%
      ShmPutImageFullWidth (366 x 366): Speed up 77%
      ShmPutImageFullWidth (549 x 549): Speed up 92%
      AlignedShmPutImageFullWidth (5 x 5): Slow down 14%
      AlignedShmPutImageFullWidth (7 x 7): Slow down 6%
      AlignedShmPutImageFullWidth (15 x 15): Speed up 10%
      AlignedShmPutImageFullWidth (22 x 22): Speed up 9%
      AlignedShmPutImageFullWidth (33 x 33): Speed up 21%
      AlignedShmPutImageFullWidth (49 x 49): Speed up 28%
      AlignedShmPutImageFullWidth (73 x 73): Speed up 30%
      AlignedShmPutImageFullWidth (109 x 109): Speed up 47%
      AlignedShmPutImageFullWidth (163 x 163): Speed up 38%
      AlignedShmPutImageFullWidth (244 x 244): Speed up 63%
      AlignedShmPutImageFullWidth (366 x 366): Speed up 84%
      AlignedShmPutImageFullWidth (549 x 549): Speed up 89%
      
      At 16bpp the speed-up is even greater:
      
      ShmPutImageFullWidth (5 x 5): Slow down 8%
      ShmPutImageFullWidth (7 x 7): Slow down 8%
      ShmPutImageFullWidth (10 x 10): Slow down 6%
      ShmPutImageFullWidth (22 x 22): Speed up 9%
      ShmPutImageFullWidth (33 x 33): Speed up 20%
      ShmPutImageFullWidth (49 x 49): Speed up 27%
      ShmPutImageFullWidth (73 x 73): Speed up 69%
      ShmPutImageFullWidth (109 x 109): Speed up 74%
      ShmPutImageFullWidth (163 x 163): Speed up 100%
      ShmPutImageFullWidth (244 x 244): Speed up 111%
      ShmPutImageFullWidth (366 x 366): Speed up 133%
      ShmPutImageFullWidth (549 x 549): Speed up 123%
      AlignedShmPutImageFullWidth (5 x 5): Speed up 6%
      AlignedShmPutImageFullWidth (7 x 7): Slow down 9%
      AlignedShmPutImageFullWidth (10 x 10): Slow down 10%
      AlignedShmPutImageFullWidth (33 x 33): Speed up 17%
      AlignedShmPutImageFullWidth (49 x 49): Speed up 34%
      AlignedShmPutImageFullWidth (73 x 73): Speed up 49%
      AlignedShmPutImageFullWidth (109 x 109): Speed up 53%
      AlignedShmPutImageFullWidth (163 x 163): Speed up 69%
      AlignedShmPutImageFullWidth (244 x 244): Speed up 82%
      AlignedShmPutImageFullWidth (366 x 366): Speed up 116%
      AlignedShmPutImageFullWidth (549 x 549): Speed up 110%
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
    • CPU: use VFP overlapped blit on VFP-capable hardware by default · 3ad74420
      Siarhei Siamashka authored
      
      
      This should be useful for Raspberry Pi. When reading uncached source buffers,
      the VFP-optimized overlapped two-pass blit is roughly 2-3 times slower than
      memcpy in cached memory, which makes it reasonably competitive with ShadowFB
      (considering that ShadowFB allocates an extra buffer and does extra memory
      copies, which take time and thrash the L2 cache, etc.). It even provides
      a slight performance advantage in a more or less realistic use case
      (scrolling in xterm), which needs reads from the framebuffer:
      
      ==== Before (xf86-video-fbdev with ShadowFB) ====
      
      $ time DISPLAY=:0 xterm +j -maximized -e cat longtext.txt
      
      real    1m50.245s
      user    0m1.750s
      sys     0m0.800s
      
      ==== After (xf86-video-sunxifb without ShadowFB) ====
      
      $ time DISPLAY=:0 xterm +j -maximized -e cat longtext.txt
      
      real    1m27.709s
      user    0m1.690s
      sys     0m0.920s
      
      We get decent results even when reading from the framebuffer. However
      in many typical workloads (excluding scrolling and dragging windows)
      the framebuffer is primarily used as write-only. In write-only use
      cases ShadowFB is just pure overhead. So getting rid of it is a
      very good idea as this improves overall graphics performance.
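      One way a driver could decide that the hardware is "VFP-capable" on ARM Linux is to look for the vfp flag on the Features line of /proc/cpuinfo. The sketch below parses such text; cpuinfo_has_feature() is a hypothetical name, and this is not necessarily how xf86-video-sunxifb detects VFP.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `feature` appears as a whole token on the
 * "Features" line of /proc/cpuinfo-style text. */
static bool cpuinfo_has_feature(const char *cpuinfo, const char *feature)
{
    assert(cpuinfo != NULL && feature != NULL);
    size_t flen = strlen(feature);
    const char *line = cpuinfo;

    while (line != NULL && *line != '\0') {
        const char *eol = strchr(line, '\n');
        const char *end = eol ? eol : line + strlen(line);

        if (strncmp(line, "Features", 8) == 0) {
            const char *p = memchr(line, ':', (size_t)(end - line));
            if (p != NULL) {
                p++;
                while (p < end) {
                    /* Skip separators, then measure one token. */
                    while (p < end && (*p == ' ' || *p == '\t'))
                        p++;
                    const char *tok = p;
                    while (p < end && *p != ' ' && *p != '\t')
                        p++;
                    if ((size_t)(p - tok) == flen &&
                        strncmp(tok, feature, flen) == 0)
                        return true;
                }
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return false;
}
```

      Matching whole tokens matters here: a substring search for "vfp" would also match "vfpv3" or "vfpv4".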
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • Fix segfault on exit (introduced by the new backing store code) · 3676a495
      Siarhei Siamashka authored
      
      
      A small typo in a function argument, combined with a C compiler
      happily accepting void pointers in place of something else, is a
      dangerous combo. Need to be more careful.
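      An illustration of the kind of mistake meant here, with hypothetical names (backing_priv, free_priv_untyped(), free_priv_typed()); it is not the actual code that was fixed. With a void * parameter, passing &priv instead of priv compiles silently, while a typed parameter lets the compiler report an incompatible pointer type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical private data attached to a window. */
typedef struct {
    int *pixels;
} backing_priv;

/* Generic cleanup hook: any object pointer converts to void * without
 * a diagnostic, so free_priv_untyped(&priv) would compile and then
 * crash at run time. */
static void free_priv_untyped(void *data)
{
    backing_priv *priv = data;
    free(priv->pixels);
    free(priv);
}

/* Typed version: free_priv_typed(&priv) passes a backing_priv ** where
 * a backing_priv * is expected, which the compiler diagnoses. */
static void free_priv_typed(backing_priv *priv)
{
    assert(priv != NULL);
    free(priv->pixels);
    free(priv);
}
```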
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • Backing store heuristics for improving windows dragging performance · f5501ff1
      Siarhei Siamashka authored
      
      
      This patch implements a heuristic which enables backing store for some
      windows. When backing store is enabled for a window, the window gets a
      backing pixmap (via automatic redirection provided by the Composite
      extension). It acts a bit like ShadowFB, but for individual windows.

      The advantage of backing store is that we can avoid the animated
      "expose event -> redraw" trail in the exposed area when dragging another
      window on top of it. Dragging windows becomes much smoother and faster.

      But the disadvantage of backing store is the same as for ShadowFB: a
      loss of precious RAM, an extra buffer copy whenever somebody updates the
      window content, and potentially skipped frames in fast animations (they
      just do not reach the screen). Also, hardware-accelerated scrolling does
      not currently work for windows with backing store enabled.
      
      We try to make the best use of backing store by enabling it for all
      the windows that are direct children of root, except the one which has
      keyboard focus (either directly or via one of its children). In practice
      this heuristic seems to provide nearly perfect results:
       1) dragging windows is fast and smooth.
       2) the top level window with the keyboard focus (typically the application
          that a user is working with) is G2D accelerated and does not suffer from
          any intermediate buffer copy overhead.
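      A sketch of the heuristic on a toy window tree. The win structure and update_backing_store() are hypothetical simplifications, not the X server's WindowRec or the driver's functions; focus is the window that has keyboard focus, or NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal window tree node. */
typedef struct win {
    struct win *parent;
    bool backing_store;
} win;

/* True if `w` is `other` itself or one of its ancestors. */
static bool is_same_or_ancestor(const win *w, const win *other)
{
    for (; other != NULL; other = other->parent)
        if (other == w)
            return true;
    return false;
}

/* Enable backing store for every direct child of root, except the one
 * that holds keyboard focus directly or through a descendant. */
static void update_backing_store(win *const *top_level, size_t n,
                                 const win *focus)
{
    for (size_t i = 0; i < n; i++) {
        assert(top_level[i]->parent != NULL); /* must be a child of root */
        top_level[i]->backing_store = !is_same_or_ancestor(top_level[i], focus);
    }
}
```

      The heuristic has to be re-evaluated whenever the focus changes, so that the newly focused top-level window drops its backing pixmap.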
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  10. 10 Jun, 2013 1 commit
  11. 07 Jun, 2013 2 commits
  12. 05 Jun, 2013 1 commit
    • CPU: Added ARM VFP two-pass overlapped blit implementation · b93dab5c
      Siarhei Siamashka authored
      
      
      Using VFP, we can load up to 128 bytes with a single VLDM instruction.
      But before this patch, only a NEON implementation was available, simply
      because it showed better results than VFP on Allwinner A10, and this
      DDX driver used to target primarily sunxi hardware.

      But it looks like it makes sense to also target other devices (at least
      ODROID-X, which has the same Mali400 GPU and can use the same DRI2
      integration for EGL and GLESv2 support). And on the other ARM devices,
      VFP aligned reads generally work better than NEON. The benchmark
      results are listed below:
      
                  1280x720, 32bpp, testing "x11perf -scroll500"
      
      == Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement disabled ==
      
      NEON : 10000 trep @   3.7101 msec (   270.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   2.6678 msec (   375.0/sec): Scroll 500x500 pixels
      
      == Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement enabled ==
      
      NEON : 15000 trep @   2.2568 msec (   443.0/sec): Scroll 500x500 pixels
      VFP  : 15000 trep @   2.3016 msec (   434.0/sec): Scroll 500x500 pixels
      
      == Exynos 4412, Cortex-A9 ==
      
      NEON : 10000 trep @   4.5125 msec (   222.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   2.7015 msec (   370.0/sec): Scroll 500x500 pixels
      
      == TI DM3730, Cortex-A8 ==
      
      NEON : 15000 trep @   2.2303 msec (   448.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   3.0670 msec (   326.0/sec): Scroll 500x500 pixels
      
      == Allwinner A10, Cortex-A8 ==
      
      NEON : 10000 trep @   2.5559 msec (   391.0/sec): Scroll 500x500 pixels
      VFP  : 10000 trep @   3.0580 msec (   327.0/sec): Scroll 500x500 pixels
      
      == Raspberry Pi, BCM2708, ARM1176 ==
      
      VFP  :  3000 trep @   8.7699 msec (   114.0/sec): Scroll 500x500 pixels
      
      The benchmark numbers in this particular test setup roughly represent
      memory copy bandwidth measured in MB/s (when doing overlapped blits
      inside of a writecombine mapped framebuffer).
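      The idea of a two-pass overlapped blit can be sketched in portable C. This is a minimal sketch assuming a single buffer, x offsets given in bytes, and rows that fit in a fixed-size bounce buffer; the driver's implementation is hand-written VFP (or NEON) code, and memcpy() stands in for it here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TMP_BYTES 4096  /* assumed size of the cached bounce buffer */

/* Copy a rectangle within one buffer where source and destination may
 * overlap. Each row goes src -> tmp -> dst (two passes), so the source
 * row is fully read before any of it can be overwritten. Rows are
 * processed bottom-up when the destination is below the source. */
static void overlapped_blit(uint8_t *base, int stride,
                            int src_xb, int src_y, int dst_xb, int dst_y,
                            int width_bytes, int height)
{
    uint8_t tmp[TMP_BYTES];
    int start = 0, end = height, step = 1;

    assert(width_bytes >= 0 && width_bytes <= TMP_BYTES);
    if (dst_y > src_y) {
        start = height - 1;
        end = -1;
        step = -1;
    }
    for (int y = start; y != end; y += step) {
        memcpy(tmp, base + (size_t)(src_y + y) * stride + src_xb,
               (size_t)width_bytes);
        memcpy(base + (size_t)(dst_y + y) * stride + dst_xb, tmp,
               (size_t)width_bytes);
    }
}
```

      The bounce buffer also handles horizontal overlap within a single row, because the whole source row is read before anything is written.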
      
      -----------------------------------------------------------------------
      
      Note: the use of VFP two-pass overlapped copy instead of ShadowFB is
            still not enabled by default when running on Raspberry Pi
            because the performance results are not so great.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  13. 03 Jun, 2013 1 commit
  14. 02 Jun, 2013 2 commits
    • G2D: Implement "double speed" 16bpp blits · 98f1b119
      Harm Hanemaaijer authored
      
      
      When source and destination coordinates allow it, a 16bpp
      screen-to-screen blit is divided into up to three segments: two
      optional one-pixel-wide edges and an aligned middle segment that
      is copied in 32-bit mode.
      
      This patch adds the low-level function sunxi_g2d_blit_r5g6b5_in_three
      and adds logic to the general blit function to use it for 16bpp to
      16bpp blits if the source and destination coordinates allow it. This
      patch automatically enables the use of this optimization in the
      sunxi G2D X driver. The area threshold for using G2D for
      16bpp-to-16bpp blits was introduced in a previous patch.
      
      Benchmarks:
      
      1920x1080x16bpp@60Hz, ShadowFB disabled:
      
      x11perf -scroll100
      Before:
       350000 trep @   0.0881 msec ( 11400.0/sec): Scroll 100x100 pixels
      After:
       350000 trep @   0.0819 msec ( 12200.0/sec): Scroll 100x100 pixels
      
      x11perf -scroll500
      Before:
        20000 trep @   1.3547 msec (   738.0/sec): Scroll 500x500 pixels
      After:
        35000 trep @   0.8005 msec (  1250.0/sec): Scroll 500x500 pixels
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
    • G2D: Implement an area threshold for using G2D blits. · b3c2fd2c
      Harm Hanemaaijer authored
      
      
      Due to the overhead of G2D for small screen-to-screen blits, CPU blits
      are faster for small areas. This patch introduces an area threshold
      below which CPU blits are used. It is currently set to 1000 for 32bpp
      and 2500 for 16bpp based on test results.
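      A sketch of the threshold check, assuming the area is width x height in pixels. The function names are hypothetical; only the 1000 and 2500 values come from this commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Area thresholds from this commit: 1000 for 32bpp, 2500 for 16bpp. */
static long g2d_area_threshold(int bpp)
{
    assert(bpp == 16 || bpp == 32);
    return bpp == 16 ? 2500 : 1000;
}

/* Blits with an area below the threshold are done by the CPU. */
static bool blit_use_g2d(int width, int height, int bpp)
{
    return (long)width * height >= g2d_area_threshold(bpp);
}
```

      The 16bpp threshold is higher, presumably because each pixel carries half as many bytes, so the fixed G2D overhead is amortized over less data.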
      
      Some benchmarks:
      
      1920x1080x16bpp@60Hz, ShadowFB disabled:
      
      x11perf -scroll10
      
      Before:
      1500000 trep @   0.0239 msec ( 41800.0/sec): Scroll 10x10 pixels
      After:
      2500000 trep @   0.0110 msec ( 90900.0/sec): Scroll 10x10 pixels
      
      x11perf -copywinwin10
      
      Before:
      1200000 trep @   0.0247 msec ( 40500.0/sec): Copy 10x10 from window to window
      After:
      1800000 trep @   0.0146 msec ( 68600.0/sec): Copy 10x10 from window to window
      Signed-off-by: Harm Hanemaaijer <fgenfb@yahoo.com>
  15. 30 Mar, 2013 1 commit
  16. 28 Mar, 2013 1 commit
  17. 26 Mar, 2013 3 commits
  18. 22 Mar, 2013 2 commits
  19. 21 Mar, 2013 1 commit
    • G2D: accelerate CopyArea between different pixmaps in framebuffer · cc1b1410
      Siarhei Siamashka authored
      
      
      Now source and destination pixmaps don't need to be the same for
      using G2D acceleration (as long as both of them are allocated in
      the framebuffer). This allows using G2D to copy pixels from DRI2
      buffers to the framebuffer on the fallback path (when the window
      of an OpenGL ES application is partially overlapped by some other
      windows). Though it only works when the Composite extension is
      disabled, for example by adding the following to xorg.conf:
      
          Section "Extensions"
              Option "Composite" "Disable"
          EndSection
      
      If the Composite extension is enabled, windows have backing pixmaps,
      and we have a longer chain of copies:

         DRI2 buffer -> backing pixmap -> framebuffer

      Because the backing pixmap is not allocated in physically contiguous
      memory, it can't be copied using G2D yet.
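      A sketch of the "both pixmaps in the framebuffer" condition, assuming the framebuffer is a single contiguous mapping and each pixmap is described by its pixel data pointer and size. fb_mapping and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the mapped framebuffer memory. */
typedef struct {
    const uint8_t *base;
    size_t size;
} fb_mapping;

/* True if [ptr, ptr + len) lies entirely inside the framebuffer. */
static bool in_framebuffer(const fb_mapping *fb, const void *ptr, size_t len)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t b = (uintptr_t)fb->base;
    return p >= b && len <= fb->size && p - b <= fb->size - len;
}

/* G2D needs physically contiguous memory, so both the source and the
 * destination pixmap have to live in the framebuffer. */
static bool copyarea_can_use_g2d(const fb_mapping *fb,
                                 const void *src, size_t src_len,
                                 const void *dst, size_t dst_len)
{
    assert(fb != NULL);
    return in_framebuffer(fb, src, src_len) &&
           in_framebuffer(fb, dst, dst_len);
}
```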
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  20. 19 Mar, 2013 1 commit
  21. 18 Mar, 2013 1 commit
    • G2D: Hardware acceleration for XCopyArea (initially 32bpp only) · ecfeb4aa
      Siarhei Siamashka authored
      
      
      Wrap the CreateGC function to add a hook for the CopyArea operation,
      which can be accelerated using G2D for buffers inside the visible
      part of the framebuffer. In the future we may also try to ensure
      that DRI2 buffers are copied using G2D instead of the CPU if we hit
      the fallback path and can't avoid this copy.
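      A toy model of the wrapping pattern: save the screen's original CreateGC, call it from a replacement, then swap in a copy of the GC ops whose CopyArea tries the accelerated path first. The structures are heavily simplified hypothetical stand-ins, not the X server's ScreenRec, GCRec or GCOps, and "accelerated" here just returns 1.

```c
#include <assert.h>
#include <stddef.h>

typedef struct gc_ops gc_ops;
typedef struct gc {
    const gc_ops *ops;
    int drawable_in_fb;  /* hypothetical: target is in the visible framebuffer */
} gc;
struct gc_ops {
    int (*copy_area)(gc *g, int w, int h);
};
typedef struct {
    int (*create_gc)(gc *g);
} screen;

/* Software implementation; returns 0 for "copied by the CPU". */
static int sw_copy_area(gc *g, int w, int h) { (void)g; (void)w; (void)h; return 0; }
static const gc_ops sw_ops = { sw_copy_area };
static int sw_create_gc(gc *g) { g->ops = &sw_ops; return 1; }

static int (*saved_create_gc)(gc *g);  /* original CreateGC */
static const gc_ops *saved_ops;        /* original GC ops (fallback) */
static gc_ops accel_ops;               /* copy of the ops with our CopyArea */

/* Returns 1 when the copy would be done by G2D. */
static int accel_copy_area(gc *g, int w, int h)
{
    if (g->drawable_in_fb)
        return 1;                           /* G2D path (placeholder) */
    return saved_ops->copy_area(g, w, h);   /* fall back to software */
}

static int wrapped_create_gc(gc *g)
{
    int ok = saved_create_gc(g);        /* let the original set up the GC */
    if (ok) {
        saved_ops = g->ops;
        accel_ops = *g->ops;            /* keep all other ops unchanged */
        accel_ops.copy_area = accel_copy_area;
        g->ops = &accel_ops;
    }
    return ok;
}

static void wrap_create_gc(screen *s)
{
    assert(s != NULL && s->create_gc != NULL);
    saved_create_gc = s->create_gc;
    s->create_gc = wrapped_create_gc;
}
```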
      
      Benchmark using "x11perf -scroll500 -copywinwin500":
      
      === ShadowFB (software rendering) ===
      
         3000 reps @   2.0308 msec (   492.0/sec): Scroll 500x500 pixels
         3000 reps @   1.9741 msec (   507.0/sec): Scroll 500x500 pixels
         3000 reps @   1.9826 msec (   504.0/sec): Scroll 500x500 pixels
         3000 reps @   1.9830 msec (   504.0/sec): Scroll 500x500 pixels
         3000 reps @   1.9965 msec (   501.0/sec): Scroll 500x500 pixels
        15000 trep @   1.9934 msec (   502.0/sec): Scroll 500x500 pixels
      
         1600 reps @   3.3054 msec (   303.0/sec): Copy 500x500 from window to window
         1600 reps @   3.3179 msec (   301.0/sec): Copy 500x500 from window to window
         1600 reps @   3.2263 msec (   310.0/sec): Copy 500x500 from window to window
         1600 reps @   3.2491 msec (   308.0/sec): Copy 500x500 from window to window
         1600 reps @   3.2357 msec (   309.0/sec): Copy 500x500 from window to window
         8000 trep @   3.2669 msec (   306.0/sec): Copy 500x500 from window to window
      
      === G2D (hardware acceleration) ===
      
         3000 reps @   2.1949 msec (   456.0/sec): Scroll 500x500 pixels
         3000 reps @   2.1929 msec (   456.0/sec): Scroll 500x500 pixels
         3000 reps @   2.1923 msec (   456.0/sec): Scroll 500x500 pixels
         3000 reps @   2.1889 msec (   457.0/sec): Scroll 500x500 pixels
         3000 reps @   2.1941 msec (   456.0/sec): Scroll 500x500 pixels
        15000 trep @   2.1926 msec (   456.0/sec): Scroll 500x500 pixels
      
         2800 reps @   1.8114 msec (   552.0/sec): Copy 500x500 from window to window
         2800 reps @   1.8103 msec (   552.0/sec): Copy 500x500 from window to window
         2800 reps @   1.8160 msec (   551.0/sec): Copy 500x500 from window to window
         2800 reps @   1.8099 msec (   553.0/sec): Copy 500x500 from window to window
         2800 reps @   1.8126 msec (   552.0/sec): Copy 500x500 from window to window
        14000 trep @   1.8120 msec (   552.0/sec): Copy 500x500 from window to window
      
      CPU usage remains low when running this test with G2D acceleration enabled.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  22. 17 Mar, 2013 2 commits
  23. 16 Mar, 2013 1 commit
  24. 14 Mar, 2013 2 commits
  25. 13 Mar, 2013 1 commit
    • Added ioctl wrappers for simple G2D fill and blit operations · df534b22
      Siarhei Siamashka authored
      
      
      The existing kernel driver from Allwinner for the G2D accelerator
      is quite bad because its ioctls are synchronous and block the
      caller thread, compromise security (basically it is a backdoor
      for copying data in memory between any arbitrary physical
      addresses) and have high overhead (each individual fill or
      blit operation needs an ioctl). But we need to start with
      something, so use this stuff as a placeholder.
      
      The g2d_driver.h header file is taken from linux-sunxi-3.4.
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
  26. 12 Mar, 2013 1 commit
  27. 11 Mar, 2013 1 commit
  28. 23 Feb, 2013 1 commit
  29. 18 Feb, 2013 1 commit
    • Add support for hardware ARGB cursors up to 32x32 size · 06ae9b68
      Siarhei Siamashka authored
      
      
      Actually they are converted to 32x32 with a 256-color palette. If
      there are more than 256 unique colors, the color components of the
      pixels are reduced from 8-bit to 7-bit, then to 6-bit if necessary
      and so on (until the number of unique colors is small enough to fit
      the palette). In the worst case we may theoretically end up with
      just 2 bits per A, R, G and B channel, but in practice 7 or 6 bits
      seem to be enough.
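      A sketch of the palette reduction loop, assuming simple truncation of the low bits of each ARGB8888 channel; the real driver may quantize differently, and all names are hypothetical. With 2 bits for each of the 4 channels there are at most 256 distinct colors, which is why the loop always terminates with a fitting palette.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CURSOR_PIXELS (32 * 32)

/* Keep only the top `bits` bits of each 8-bit channel of an ARGB pixel. */
static uint32_t reduce_argb(uint32_t argb, int bits)
{
    uint32_t m = (0xffu << (8 - bits)) & 0xffu;
    uint32_t mask = m | (m << 8) | (m << 16) | (m << 24);
    return argb & mask;
}

/* Collect unique reduced colors into `palette`; returns limit + 1 as
 * soon as more than `limit` distinct colors are found. */
static size_t count_unique(const uint32_t *px, size_t n, int bits,
                           size_t limit, uint32_t *palette)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = reduce_argb(px[i], bits);
        size_t j;
        for (j = 0; j < count; j++)
            if (palette[j] == c)
                break;
        if (j == count) {
            if (count == limit)
                return limit + 1;
            palette[count++] = c;
        }
    }
    return count;
}

/* Largest bits-per-channel (8 down to 2) whose palette fits 256 entries. */
static int cursor_palette_bits(const uint32_t *px, uint32_t palette[256],
                               size_t *ncolors)
{
    assert(px != NULL && palette != NULL && ncolors != NULL);
    for (int bits = 8; bits >= 2; bits--) {
        size_t n = count_unique(px, CURSOR_PIXELS, bits, 256, palette);
        if (n <= 256) {
            *ncolors = n;
            return bits;
        }
    }
    return -1; /* not reached: 2 bits x 4 channels gives at most 256 colors */
}
```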
      Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>