Xpra: Ticket #422: opencl acceleration for csc and/or encoding



Mon, 26 Aug 2013 14:43:17 GMT - Antoine Martin: attachment set

stub opencl csc module


Tue, 27 Aug 2013 16:26:13 GMT - Antoine Martin: attachment set

minor tweaks


Tue, 27 Aug 2013 16:33:36 GMT - Antoine Martin: owner, status changed

More kernels we may be able to use:


Wed, 28 Aug 2013 07:05:10 GMT - Antoine Martin:

Testing with plain x264 command line (running a couple of times to ensure the values are consistent - they are..):

Resulting files:

$ du -sk *opencl.x264
83404	no-opencl.x264
82776	opencl.x264

So this doesn't look like it makes much of a difference unfortunately (at least on my GTS 450), if anything it is a tad slower.

The one thing where this may still be useful is for motion detection, where we could increase the search diameter without incurring too much more CPU usage.

Enabling it looks simple enough, in x264.h:

int b_opencl;            /* use OpenCL when available */

(assuming that x264 is built with opencl support)


Wed, 28 Aug 2013 08:16:23 GMT - Antoine Martin: description changed

For the record, this is what I had to do to get pyopencl to build on Fedora 19 with the nvidia SDK to avoid this error at import time:

ImportError: /usr/lib/python2.7/dist-packages/pyopencl/_cl.so: \
    symbol clRetainDevice, version OPENCL_1.2 not defined in file libOpenCL.so.1 with link time reference

The existing headers look like this:

$ ls -la /usr/include/CL
lrwxrwxrwx. 1 root root 32 Aug 28 12:39 /usr/include/CL -> /etc/alternatives/opencl-headers

Edit: Just downgrading the version of opencl-headers to 1.1 is enough.


Alternatively, we can move the headers to a version specific directory and add the OpenCL 1.1 headers:

cd /etc/alternatives/
mv opencl-headers opencl-headers-1.2
mkdir opencl-headers-1.1
ln -sf opencl-headers-1.1 opencl-headers
cd opencl-headers-1.1
wget http://www.khronos.org/registry/cl/api/1.1/cl_gl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
wget http://www.khronos.org/registry/cl/api/1.1/opencl.h

Then we need to ensure pyopencl will be built against 1.1, so siteconf.py contains:

CL_PRETEND_VERSION = '1.1'

Wed, 28 Aug 2013 08:19:16 GMT - Antoine Martin:

Having installed freeocl, I now have 3 providers available:

$ LD_LIBRARY_PATH=/opt/cuda/lib64/ XPRA_SWSCALE_DEBUG=0 PYTHONPATH=. python ./tests/xpra/codecs/test_csc_opencl.py
PyOpenCL OpenGL support: True
found 3 OpenCL platforms:
* FreeOCL (FreeOCL developers) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 FreeOCL-0.3.6 / OpenCL C 1.2)
* NVIDIA CUDA (NVIDIA Corporation) - 1 devices:
 + GPU: GeForce GTS 450 (OpenCL 1.1 CUDA / OpenCL C 1.1 )
* Intel(R) OpenCL (Intel(R) Corporation) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 (Build 67279) / OpenCL C 1.2 )
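
A listing like the one above can be produced with PyOpenCL's platform and device queries. The following is only a minimal sketch, not the code from test_csc_opencl.py: `describe_platforms` is a hypothetical helper name, and it assumes PyOpenCL may be missing or unable to find a platform, in which case it simply returns what it has gathered so far.

```python
def describe_platforms():
    """Return one text line per OpenCL platform and device (empty if unavailable)."""
    try:
        import pyopencl as cl
    except ImportError:
        # PyOpenCL is not installed, or failed to link against libOpenCL
        return []
    lines = []
    try:
        for platform in cl.get_platforms():
            devices = platform.get_devices()
            lines.append("* %s (%s) - %i devices:"
                         % (platform.name, platform.vendor, len(devices)))
            for device in devices:
                dtype = cl.device_type.to_string(device.type)
                lines.append(" + %s: %s (%s / %s)"
                             % (dtype, device.name, device.version,
                                device.opencl_c_version))
    except Exception:
        # e.g. no ICD installed, or a platform without usable devices
        pass
    return lines


if __name__ == "__main__":
    for line in describe_platforms():
        print(line)
```

Note that merely calling this loads every installed vendor library, which matters for the loader problems discussed later in this ticket.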

Wed, 28 Aug 2013 15:55:46 GMT - Antoine Martin: attachment set

works ok but only one format so far: YUV420P to RGB


Wed, 28 Aug 2013 16:26:18 GMT - Antoine Martin:

Please try the patch above and report on performance. You may need to adjust some env vars for finding the libraries in the cuda paths and for selecting the opencl platform/device:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/cuda/lib64/
export PYTHONPATH=.
XPRA_OPENCL_DEVICE_TYPE=GPU python ./tests/xpra/codecs/test_csc_opencl.py
XPRA_OPENCL_DEVICE_TYPE=CPU python ./tests/xpra/codecs/test_csc_opencl.py

Note: careful with LD_LIBRARY_PATH, putting cuda ahead of regular libraries can cause some serious problems (conflicts with libopencl versions for example).


Results deleted (those figures were wrong because of a bug)

The results aren't as bad as they look for nvidia:

Even then, I think there is room for improvement since we copy the pixels in and out and we may not need to (we just need a buffer interface).

Interestingly, the performance varies widely depending on the picture size.. will need to look into the worksize/localsize settings.
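
On the worksize/localsize point: in OpenCL 1.x, when a local work size is given, the global work size has to be a multiple of it in each dimension, so picture dimensions are usually rounded up and the kernel skips the out-of-range items. A small sketch of that rounding, using a hypothetical helper name and an assumed 16x16 local size (not necessarily what the patch uses):

```python
def round_up_global_size(width, height, local_size=(16, 16)):
    """Round each dimension up to the next multiple of the local work size."""
    lw, lh = local_size
    gw = ((width + lw - 1) // lw) * lw
    gh = ((height + lh - 1) // lh) * lh
    return gw, gh


print(round_up_global_size(1920, 1080))  # (1920, 1088)
```

With such padding, some picture sizes waste more work items than others, which may be one of the reasons the throughput varies with the picture size.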


Wed, 28 Aug 2013 16:27:29 GMT - Antoine Martin: attachment set

updated patch - fix crash with swscale


Wed, 28 Aug 2013 16:45:36 GMT - Smo:

Here are the results on Nvidia K1 (Nvidia) OpenCL

At 1920x1080 191 MPixels/s 223 MPixels/s 161 MPixels/s 184 MPixels/s 172 MPixels/s


Thu, 29 Aug 2013 16:18:43 GMT - Antoine Martin: attachment set

working version with all yuv formats as input and both BGRX and RGBX as output


Thu, 29 Aug 2013 16:22:05 GMT - Antoine Martin: owner, status changed

Please re-run with patch v10 which fixes some important bugs.

I am afraid that I cannot commit it as-is because the OpenCL shared libraries we end up loading cause some serious problems:

Traceback (most recent call last):
  File "/usr/bin/xpra", line 6, in <module>
    sys.exit(xpra.scripts.main.main(__file__, sys.argv))
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/main.py", line 432, in main
    return run_server(parser, options, mode, script_file, args)
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/server.py", line 454, in run_server
    import gtk.gdk          #@Reimport
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 40, in <module>
    from gtk import _gtk
ImportError: dlopen: cannot load any more object with static TLS

Fri, 30 Aug 2013 14:04:00 GMT - Antoine Martin: attachment set

updated patch with support for RGB to YUV444P (and more to come)


Sat, 31 Aug 2013 05:17:47 GMT - Antoine Martin: owner, status changed

Added support in r4247

The chroma subsampling (according to BT.601) is lazier than MPEG1's, but since OpenCL is so cheap to run (it is the memory transfers that cost us), I went for the more exhaustive MPEG1-like calculations instead (using an average of all source pixel values).
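
To illustrate the difference between picking a single sample and averaging all the source pixels, here is a small pure-Python sketch. It assumes the remark above is about halving a chroma plane in both directions (as for YUV420P), works on a plain list-of-rows plane with even dimensions, and uses hypothetical function names; it is not the OpenCL kernel from the patch.

```python
def subsample_nearest(plane):
    """Keep the top-left sample of every 2x2 block (cheap)."""
    return [row[0::2] for row in plane[0::2]]


def subsample_average(plane):
    """Average the four samples of every 2x2 block, rounding to nearest."""
    out = []
    for y in range(0, len(plane) - 1, 2):
        top, bottom = plane[y], plane[y + 1]
        out.append([
            (top[x] + top[x + 1] + bottom[x] + bottom[x + 1] + 2) // 4
            for x in range(0, len(top) - 1, 2)
        ])
    return out


plane = [
    [100, 120, 10, 20],
    [140, 160, 30, 40],
]
print(subsample_nearest(plane))  # [[100, 10]]
print(subsample_average(plane))  # [[130, 25]]
```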

Still have to figure out the TLS issue before this can be of any use..


Wed, 04 Sep 2013 11:51:03 GMT - Antoine Martin:

Testing on a dual Xeon E5-2670 with dual NVidia K1s (more results here), I found that the individual K1 GPU cores are actually slower than my GTS 450 and so using OpenCL with x264 actually makes it run slower (and I believe the CPU savings are not worth much either):


Fri, 06 Sep 2013 12:53:09 GMT - Antoine Martin: status changed; resolution set

The TLS issue has been solved in r4282 by only properly initializing csc_opencl (getting a context) after we have loaded GTK... which works around the problem rather than solving it properly.
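
The workaround described above amounts to deferring the library-loading step until first use. A generic sketch of that pattern follows; the class and function names are hypothetical and the factory merely stands in for whatever creates the OpenCL context, so this is not the code from r4282.

```python
import threading


class LazyContext(object):
    """Create the wrapped context on first use rather than at import time."""

    def __init__(self, factory):
        self._factory = factory      # callable that builds the real context
        self._context = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._context is None:
                self._context = self._factory()
            return self._context


calls = []


def fake_create_context():
    # stand-in for the real context creation (which would load OpenCL libraries)
    calls.append("created")
    return object()


lazy = LazyContext(fake_create_context)
# ... GTK and other libraries would be imported here, before the first get() ...
first = lazy.get()
second = lazy.get()
print(first is second, len(calls))  # True 1
```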

OpenCL is now enabled (r4298) and working well so closing this ticket.

Note: we may still want some enhancements:


Mon, 07 Oct 2013 08:45:59 GMT - Antoine Martin:

See also #437


Tue, 15 Oct 2013 12:19:02 GMT - Antoine Martin:

There were many more changes and tweaks (too many to list).


Note: the TLS issue is discussed on the PyOpenCL mailing list. Looks like a PyOpenCL build issue - may need to revisit when testing with the Nvidia SDK which only supports OpenCL 1.1 ...


Fri, 18 Oct 2013 03:45:44 GMT - Antoine Martin: status changed; resolution deleted

Just found that the AMD icd causes the client to get into a spin and waste CPU on a spinlock. Simply having the AMD icd in /etc/OpenCL/vendors is enough to trigger the problem, so OpenCL should probably be disabled by default to prevent this. What is really odd is that this only affects the client: the server will happily run with the AMD icd (you can force it to be used with: XPRA_FORCE_CSC_MODE=YUV420P XPRA_CSC_TYPE=opencl xpra start ...). We cannot do a runtime check, as calling any OpenCL API will cause the loader to dlopen the problematic library... and we're toast.

Beware: one cannot strace the xpra client (the machine locks up - need ssh to come and kill the strace process)

Here's what strace has to say:

open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
read(10, "0-7\n", 8192)                 = 4
close(10)                               = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f78e9007000
mprotect(0x7f78e9007000, 4096, PROT_NONE) = 0
clone(Process 2797 attached
 <unfinished ...>
[pid  2797] set_robust_list(0x7f78e98079e0, 24 <unfinished ...>
[pid  2655] <... clone resumed> child_stack=0x7f78e9806fb0, \
    flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, \
    parent_tidptr=0x7f78e98079d0, tls=0x7f78e9807700, child_tidptr=0x7f78e98079d0) = 2797
[pid  2797] <... set_robust_list resumed> ) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0x4008642a <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2655] <... ioctl resumed> , 0x7fff7aabbb08) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0xc03064a6 <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

The futex call repeats forever and the xpra client process consumes >70% CPU doing absolutely nothing.
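
Since any OpenCL API call makes the loader dlopen the vendor libraries, one way to detect a problematic ICD without touching OpenCL would be to read the ICD registry files directly: on Linux the loader looks for `*.icd` files in /etc/OpenCL/vendors, each containing the name or path of a vendor library. The sketch below is an assumption about how such a check could look, not something xpra is known to do; the function names and the crude "amd" substring match are hypothetical.

```python
import os
import tempfile


def list_icd_libraries(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd filename: library name} without loading any OpenCL library."""
    libraries = {}
    if not os.path.isdir(vendors_dir):
        return libraries
    for name in sorted(os.listdir(vendors_dir)):
        if not name.endswith(".icd"):
            continue
        with open(os.path.join(vendors_dir, name)) as f:
            libraries[name] = f.read().strip()
    return libraries


def has_amd_icd(vendors_dir="/etc/OpenCL/vendors"):
    # hypothetical heuristic: look for "amd" in the file or library name
    return any("amd" in (name + lib).lower()
               for name, lib in list_icd_libraries(vendors_dir).items())


# demonstration against a temporary directory instead of the real registry
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "amdocl64.icd"), "w") as f:
    f.write("libamdocl64.so\n")
print(has_amd_icd(demo_dir))  # True
```

A check like this only reads text files, so it could run before deciding whether to import PyOpenCL at all.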


Mon, 11 Nov 2013 09:59:47 GMT - Antoine Martin: description changed


Thu, 05 Dec 2013 16:15:40 GMT - Antoine Martin:

And another one for good measure, Intel this time, is doing an illegal memory access, caught with valgrind:

==27195== Invalid read of size 8
==27195==    at 0x118DDA1C: __intel_sse2_strrchr (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C8531: tbb::internal::init_dl_data() (dynamic_link.cpp:290)
==27195==    by 0x118C8466: __sti__$E (dynamic_link.cpp:449)
==27195==    by 0x118E8001: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C367A: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x7FF000276: ???
==27195==    by 0x6E6F687479702E: ???
==27195==    by 0x6E69622F7273752E: ???
==27195==    by 0x746100617270782E: ???
==27195==    by 0x652D2D0068636173: ???
==27195==    by 0x3D676E69646F636D: ???
==27195==    by 0x6E2D2D0034363267: ???
==27195==  Address 0xec4c5d8 is 56 bytes inside a block of size 58 alloc'd
==27195==    at 0x4A06409: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27195==    by 0x3452405C95: open_path (dl-load.c:2036)
==27195==    by 0x34524086DC: _dl_map_object (dl-load.c:2223)
==27195==    by 0x345240CAD1: openaux (dl-deps.c:63)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x345240D1D1: _dl_map_object_deps (dl-deps.c:256)
==27195==    by 0x34524138BB: dl_open_worker (dl-open.c:265)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x34524131EA: _dl_open (dl-open.c:656)
==27195==    by 0x3452C0102A: dlopen_doit (dlopen.c:66)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x3452C0162C: _dlerror_run (dlerror.c:163)

Tue, 10 Dec 2013 09:01:27 GMT - Antoine Martin: owner, status changed

I have added the most important setup and configuration information here: wiki/CSC and the performance data now lives here: wiki/CSC/Performance


There are new SDKs available:


Maybe this can be enabled by default server side?

I don't think we will ever bother using OpenCL or nvcuda (#384) for CSC on the client side, since we're better off using OpenGL for CSC, scaling and rendering (it is now stable enough to use).


Fri, 20 Dec 2013 00:46:54 GMT - Smo:

I've tested the Intel, AMD and Nvidia OpenCL ICDs with no problems; however, there is an issue with the AMD ICD which prevents Xorg from receiving a kill signal. Even just having this ICD available seems to be enough to trigger it.

I'm going to work from a clean install and try to find a set of instructions that includes all the above info to install the Intel + Nvidia ICD's on Fedora 20 to work with xpra.


Sat, 04 Jan 2014 05:35:16 GMT - Antoine Martin:

I've just hit this error:

clFinish failed: invalid command queue

After a computer suspend-resume, it seems that the context becomes invalid (must have been cleared from the GPU during suspend). r5110 fixes that.


Quite likely to affect nvenc (added to #466) and csc_nvcuda (added to #384)
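
A common way to cope with a context that has become invalid (for example after suspend-resume) is to drop it and retry once with a fresh one. The sketch below shows that pattern with stand-in classes; all names are hypothetical and this is not a description of what r5110 actually does.

```python
class ContextLost(Exception):
    """Stand-in for an 'invalid command queue' style error."""


class Converter(object):
    def __init__(self, create_context):
        self._create_context = create_context
        self._context = None

    def convert(self, pixels):
        for attempt in (1, 2):
            if self._context is None:
                self._context = self._create_context()
            try:
                return self._context.run(pixels)
            except ContextLost:
                # discard the stale context; retry once with a new one
                self._context = None
                if attempt == 2:
                    raise


class FakeContext(object):
    """Fails when 'stale', to simulate a context lost during suspend."""

    def __init__(self, stale):
        self.stale = stale

    def run(self, pixels):
        if self.stale:
            raise ContextLost("invalid command queue")
        return pixels[::-1]


created = []


def create_context():
    # the first context simulates one invalidated by a suspend, later ones work
    ctx = FakeContext(stale=(len(created) == 0))
    created.append(ctx)
    return ctx


converter = Converter(create_context)
result = converter.convert([1, 2, 3])
print(result, len(created))  # [3, 2, 1] 2
```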


Thu, 09 Jan 2014 00:29:30 GMT - Smo:

Trying to test with AMD OpenCL using HD 6870 GPU

Getting some strange output, is this normal?

using new OpenCL context
YUV420P to BGRX    at  1920x1080        : 90 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV420P to RGBX    at  1920x1080        : 128 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to BGRX    at  1920x1080        : 113 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to RGBX    at  1920x1080        : 131 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to BGRX    at  1920x1080        : 141 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to RGBX    at  1920x1080        : 112 MPixels/s

Seems to be starting many new contexts.


Thu, 09 Jan 2014 00:59:46 GMT - Smo:

Tested a few suspend/resume with r5153 with an ATI HD6870 and no issue.

2014-01-08 17:55:44,912 PyOpenCL loaded, header version: 1.2, GL support: False
2014-01-08 17:55:44,913  using platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
2014-01-08 17:55:44,913  using device: GPU: Barts (OpenCL 1.2 AMD-APP (1348.4) / OpenCL C 1.2 )

For more info


Thu, 09 Jan 2014 02:08:06 GMT - Antoine Martin:

From comment:20: that's odd, are you not seeing any "using new OpenCL context" messages after suspend/resume as I was? (I will try an intel chipset too.) The patch attachment/ticket/422/opencl-forcewait.patch makes it easier to hit the context problems: it adds a 10 second delay in the encoding so that we can more easily suspend a PC whilst the GPU context is active.


Also, the log from comment:19 is worrying: the context should not have changed during the same run and I don't see how it could. r5154 will tell us what has changed (the context or the "program"). If you still get multiple occurrences of "using new OpenCL context" during the test run, please run the test with XPRA_OPENGL_DEBUG=1 and post the lines preceding these ones; they should read something like: old program=(..), new program=(..) or old context=(..), new context=(..).


Thu, 09 Jan 2014 02:21:04 GMT - Antoine Martin: attachment set

introduces a 10 second delay in the encoding to make it easier to suspend with a live context


Thu, 09 Jan 2014 05:08:11 GMT - Smo:

For comment:20

init_context(..) channel order=RGBA, filter mode=NEAREST
init_context(..) kernel_function RGB_to_YUV422P: <pyopencl._cl.Kernel object at 0x3300628>
old program=<pyopencl.Program object at 0x2e21510>, new program=<pyopencl.Program object at 0x2e21510>
using new OpenCL context (program changed)
init_context(..) kernel source=

Thu, 09 Jan 2014 06:12:38 GMT - Antoine Martin: attachment set

try to use the underlying int_ptr to compare opencl program instances


Thu, 09 Jan 2014 06:16:45 GMT - Antoine Martin:

What the? the programs are clearly the same... yet fail the comparison test.

Looks like the docs are wrong: pyopencl.Program: Instances of this class are hashable, and two instances of this class may be compared using “==” and ”!=”. (Hashability was added in version 2011.2.) (unless you are using an outdated version of PyOpenCL?)

Can you please try once more with attachment/ticket/422/opencl-programcompare.patch to see if the spurious "using new OpenCL context" messages still occur? (And please post your version of the PyOpenCL package.) The easy alternative would be to remove the program test altogether: I have manually verified that we always re-initialize the programs when we re-initialize the device, so this would be safe for now. But it would make the code much more brittle.
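
The idea behind that patch, comparing the underlying handles rather than relying on "==", can be sketched as follows. PyOpenCL objects expose the raw OpenCL handle as `int_ptr` in versions that provide it; the helper name and the fallback behaviour here are assumptions, and the stand-in class exists only so that the example runs without PyOpenCL.

```python
def same_cl_object(a, b):
    """Compare two OpenCL wrapper objects by their underlying handle."""
    if a is None or b is None:
        return a is b
    pa = getattr(a, "int_ptr", None)
    pb = getattr(b, "int_ptr", None)
    if pa is not None and pb is not None:
        return pa == pb
    # no int_ptr attribute: fall back to the wrapper's own comparison
    return a == b


class FakeProgram(object):
    """Stand-in wrapper whose == always fails, mimicking the spurious mismatch."""

    def __init__(self, int_ptr):
        self.int_ptr = int_ptr

    def __eq__(self, other):
        return False

    __hash__ = object.__hash__


p1 = FakeProgram(0x2e21510)
p2 = FakeProgram(0x2e21510)
print(p1 == p2, same_cl_object(p1, p2))  # False True
```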


Thu, 09 Jan 2014 15:34:35 GMT - Smo:

Odd pyopencl seems to be installed 32 bit??

Using /usr/lib/python2.7/site-packages/pyopencl-2013.2-py2.7-linux-x86_64.egg. I installed this with easy_install -Z pyopencl; I may have to do it by hand, we'll see.

I applied your patch and they seem to be all gone now.


Thu, 09 Jan 2014 15:42:46 GMT - Antoine Martin:

OK, I'll try to produce a test case to report the bug to PyOpenCL, which I will have to ask you to test for me since I can't reproduce this weirdness. In the meantime, r5157 merges the workaround with a long comment explaining its purpose.

FYI: /usr/lib/python2.7/site-packages/ can contain both 32-bit and 64-bit extensions..


Thu, 09 Jan 2014 15:46:16 GMT - Smo:

Thanks for the clarification. I'll update the performance chart with my numbers from this machine and a quick instruction set for being able to run it.

AMD drivers require some extra stuff like exporting COMPUTE=:0 so I assume you actually have to have an X server running?

That said, I think we've now tried out opencl_csc on several platforms and with several OpenCL ICDs.


Thu, 09 Jan 2014 23:42:56 GMT - Smo:

Install AMD OpenCL on Fedora 20

I did this from a fresh install with LXDE

From a root terminal

yum group install "Development Tools"; yum install kernel-devel opencl-headers gcc-c++
cd /tmp
wget http://www2.ati.com/drivers/beta/amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip
unzip amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip
chmod +x Install-AMD-APP.sh; ./Install-AMD-APP.sh

I chose to do an express install. It may ask you to reboot; I chose to do this after I had installed the AMD App SDK.

Download AMD-APP-SDK-v2.9-lnx64.tgz from http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/

tar xfvz ../AMD-APP-SDK-v2.9-lnx64.tgz
./Install-AMD-App.sh

I rebooted after this install and proceeded to install pyopencl with easy_install:

easy_install -Z pyopencl

Started and tested xpra with this command line

COMPUTE=:0 XPRA_OPENCL_DEVICE_TYPE=GPU xpra --no-daemon --bind-tcp=0.0.0.0:1300 --start-child="xterm -fg white -bg black" start :13

Wed, 12 Feb 2014 19:18:47 GMT - Smo: status changed; resolution set

Works well with both AMD and Nvidia


Sat, 23 Jan 2021 04:54:59 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/422