- 16 Oct, 2013 1 commit
Siarhei Siamashka authored
The Marvell PJ4 core used in the CuBox handles uncached VFP reads from the framebuffer very poorly. Using WMMX or ARM LDM reads is much faster, with LDM instructions having a minor advantage. This improves framebuffer read performance from ~50MB/s to ~100MB/s. WMMX runtime detection and PJ4 core identification are also added as part of this fix.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
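As an illustration of what runtime WMMX detection can look like on Linux, here is a minimal sketch that scans /proc/cpuinfo-style text for a capability flag. The helper name `cpuinfo_has_feature` is hypothetical, and the flag name `iwmmxt` is an assumption about how the kernel reports the capability; this is not the code added by the commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `feature` appears as a whole word on the "Features"
 * line of a /proc/cpuinfo-style text buffer.
 * Hypothetical helper for illustration only.
 */
static bool cpuinfo_has_feature(const char *cpuinfo, const char *feature)
{
    size_t flen = strlen(feature);
    if (flen == 0)
        return false;

    /* Locate the "Features" line and skip past its colon. */
    const char *line = strstr(cpuinfo, "Features");
    if (line == NULL)
        return false;
    const char *p = strchr(line, ':');
    if (p == NULL)
        return false;
    p++;

    /* Walk the whitespace-separated tokens up to the end of the line. */
    while (*p != '\0' && *p != '\n') {
        while (*p == ' ' || *p == '\t')
            p++;
        size_t tlen = strcspn(p, " \t\n");
        if (tlen == flen && strncmp(p, feature, flen) == 0)
            return true;
        p += tlen;
    }
    return false;
}
```

A caller would read /proc/cpuinfo into a buffer once at startup and pick the WMMX or LDM read path only if `cpuinfo_has_feature(buf, "iwmmxt")` returns true. Matching whole tokens avoids false positives such as `vfp` matching `vfpv3`.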
-
- 05 Jun, 2013 1 commit
Siarhei Siamashka authored
Using VFP, we can load up to 128 bytes with a single VLDM instruction. Before this patch, however, only a NEON implementation was available, because it showed better results than VFP on Allwinner A10 and this DDX driver used to target primarily sunxi hardware. It now looks like it makes sense to target other devices as well (at least ODROID-X, which has the same Mali400 GPU and can use the same DRI2 integration for EGL and GLESv2 support). And on the other ARM devices, aligned VFP reads generally work better than NEON.

The benchmark results are listed below: 1280x720, 32bpp, testing "x11perf -scroll500"

== Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement disabled ==
NEON : 10000 trep @ 3.7101 msec ( 270.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @ 2.6678 msec ( 375.0/sec): Scroll 500x500 pixels

== Exynos 5250, Cortex-A15, Non-cacheable streaming enhancement enabled ==
NEON : 15000 trep @ 2.2568 msec ( 443.0/sec): Scroll 500x500 pixels
VFP  : 15000 trep @ 2.3016 msec ( 434.0/sec): Scroll 500x500 pixels

== Exynos 4412, Cortex-A9 ==
NEON : 10000 trep @ 4.5125 msec ( 222.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @ 2.7015 msec ( 370.0/sec): Scroll 500x500 pixels

== TI DM3730, Cortex-A8 ==
NEON : 15000 trep @ 2.2303 msec ( 448.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @ 3.0670 msec ( 326.0/sec): Scroll 500x500 pixels

== Allwinner A10, Cortex-A8 ==
NEON : 10000 trep @ 2.5559 msec ( 391.0/sec): Scroll 500x500 pixels
VFP  : 10000 trep @ 3.0580 msec ( 327.0/sec): Scroll 500x500 pixels

== Raspberry Pi, BCM2708, ARM1176 ==
VFP  : 3000 trep @ 8.7699 msec ( 114.0/sec): Scroll 500x500 pixels

The benchmark numbers in this particular test setup roughly represent memory copy bandwidth measured in MB/s (when doing overlapped blits inside of a writecombine mapped framebuffer).
-----------------------------------------------------------------------

Note: the use of the VFP two-pass overlapped copy instead of ShadowFB is still not enabled by default when running on Raspberry Pi, because the performance results are not so great.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
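The reason the x11perf rates can be read as MB/s is simple arithmetic: a 500x500 area at 32bpp is 500 × 500 × 4 = 1,000,000 bytes, so each scroll operation moves roughly 1 MB. The sketch below spells this out. The function name is hypothetical, 1 MB is taken as 10^6 bytes, and it assumes the whole 500x500 area is copied on every operation, which is only approximately true.

```c
/*
 * Approximate copy bandwidth in MB/s (1 MB = 10^6 bytes) for a blit of
 * width x height pixels, repeated ops_per_sec times per second.
 * Hypothetical helper for illustration only.
 */
static double blit_bandwidth_mb_per_sec(int width, int height,
                                        int bytes_per_pixel,
                                        double ops_per_sec)
{
    double bytes_per_op = (double)width * height * bytes_per_pixel;
    return bytes_per_op * ops_per_sec / 1e6;
}
```

For example, the Exynos 5250 VFP result of 375.0 operations per second corresponds to `blit_bandwidth_mb_per_sec(500, 500, 4, 375.0)`, that is, about 375 MB/s.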
-
- 30 Mar, 2013 1 commit
Siarhei Siamashka authored
This should be useful for better performance when moving windows and scrolling on devices without a dedicated 2D hardware accelerator (such as Allwinner A13).

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
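Moving a window or scrolling comes down to copying a region onto an overlapping region of the same framebuffer. As a rough picture of how a two-pass overlapped copy can work, the portable C sketch below stages each chunk in a small temporary buffer before writing it out, and picks the copy direction so that source bytes are always read before they are overwritten. The function name, the 128-byte chunk size, and the use of `memcpy` in place of NEON or VFP loads are all assumptions made for illustration; this is not the driver's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TWOPASS_CHUNK = 128 };  /* assumed staging buffer size */

/*
 * memmove-like copy that goes through a small staging buffer:
 * pass 1 reads a chunk from the source, pass 2 writes it to the
 * destination. Hypothetical sketch for illustration only.
 */
static void twopass_memmove(void *dst_, const void *src_, size_t n)
{
    unsigned char *dst = dst_;
    const unsigned char *src = src_;
    unsigned char tmp[TWOPASS_CHUNK];

    if (dst == src || n == 0)
        return;

    if ((uintptr_t)dst < (uintptr_t)src) {
        /* Destination is lower: copy forward, front to back. */
        while (n > 0) {
            size_t len = n < TWOPASS_CHUNK ? n : TWOPASS_CHUNK;
            memcpy(tmp, src, len);   /* pass 1: read chunk */
            memcpy(dst, tmp, len);   /* pass 2: write chunk */
            src += len;
            dst += len;
            n -= len;
        }
    } else {
        /* Destination is higher: copy backward, back to front. */
        while (n > 0) {
            size_t len = n < TWOPASS_CHUNK ? n : TWOPASS_CHUNK;
            n -= len;
            memcpy(tmp, src + n, len);   /* pass 1: read chunk */
            memcpy(dst + n, tmp, len);   /* pass 2: write chunk */
        }
    }
}
```

Because every chunk is fully read before it is written, any overlap inside a chunk is harmless, and the direction choice handles overlap between chunks. In a real driver the read pass is where the uncached framebuffer access happens, so that is the part worth specializing per CPU.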
-