Opened 5 years ago

Closed 5 years ago

#422 closed task (fixed)

opencl acceleration for csc and/or encoding

Reported by: Antoine Martin
Owned by: SmO
Priority: major
Milestone: 0.11
Component: core
Version:
Keywords:
Cc:

Description (last modified by Antoine Martin)

References:

Attachments (8)

add-csc-opencl.patch (13.7 KB) - added by Antoine Martin 5 years ago.
stub opencl csc module
add-csc-opencl-v3.patch (19.7 KB) - added by Antoine Martin 5 years ago.
minor tweaks
add-csc-opencl-v6.patch (22.7 KB) - added by Antoine Martin 5 years ago.
works ok but only one format so far: YUV420P to RGB
add-csc-opencl-v7.patch (23.0 KB) - added by Antoine Martin 5 years ago.
updated patch - fix crash with swscale
add-csc-opencl-v10.patch (17.3 KB) - added by Antoine Martin 5 years ago.
working version with all yuv formats as input and both BGRX and RGBX as output
add-csc-opencl-v13.patch (35.6 KB) - added by Antoine Martin 5 years ago.
updated patch with support for RGB to YUV444P (and more to come)
opencl-forcewait.patch (530 bytes) - added by Antoine Martin 5 years ago.
introduces a 10 second delay in the encoding to make it easier to suspend with a live context
opencl-programcompare.patch (919 bytes) - added by Antoine Martin 5 years ago.
try to use the underlying int_ptr to compare opencl program instances

Change History (36)

Changed 5 years ago by Antoine Martin

Attachment: add-csc-opencl.patch added

stub opencl csc module

Changed 5 years ago by Antoine Martin

Attachment: add-csc-opencl-v3.patch added

minor tweaks

comment:1 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to Antoine Martin
Status: new → assigned

More kernels we may be able to use:

comment:2 Changed 5 years ago by Antoine Martin

Testing with plain x264 command line (running a couple of times to ensure the values are consistent - they are..):

  • OpenCL enabled:
    $ time ./x264 --opencl  -o opencl.x264  video.mp4 
    lavf [info]: 720x404p 0:1 @ 24000/1001 fps (vfr)
    x264 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
    x264 [info]: OpenCL acceleration enabled with NVIDIA Corporation GeForce GTS 450 
    x264 [info]: profile High, level 3.0
    x264 [info]: frame I:364   Avg QP:15.09  size: 37254                           
    x264 [info]: frame P:10936 Avg QP:20.31  size:  5108
    x264 [info]: frame B:19868 Avg QP:23.11  size:   772
    x264 [info]: consecutive B-frames: 10.2% 11.5%  8.4% 69.9%
    x264 [info]: mb I  I16..4: 29.4% 17.4% 53.2%
    x264 [info]: mb P  I16..4:  2.0%  2.6%  3.3%  P16..4: 11.9%  6.5%  4.6%  0.0%  0.0%    skip:69.2%
    x264 [info]: mb B  I16..4:  0.1%  0.1%  0.2%  B16..8:  8.5%  2.2%  0.8%  direct: 0.7%  skip:87.4%  L0:48.4% L1:45.2% BI: 6.5%
    x264 [info]: 8x8 transform intra:28.0% inter:27.9%
    x264 [info]: coded y,uvDC,uvAC intra: 37.6% 57.8% 45.3% inter: 3.7% 4.7% 2.0%
    x264 [info]: i16 v,h,dc,p: 64% 27%  8%  2%
    x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 19% 15% 58%  1%  1%  1%  1%  1%  2%
    x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 22% 23%  4%  4%  4%  5%  4%  4%
    x264 [info]: i8c dc,h,v,p: 51% 24% 22%  4%
    x264 [info]: Weighted P-Frames: Y:1.1% UV:1.0%
    x264 [info]: ref P L0: 64.6%  7.0% 17.6% 10.7%  0.1%
    x264 [info]: ref B L0: 79.6% 17.1%  3.3%
    x264 [info]: ref B L1: 95.0%  5.0%
    x264 [info]: kb/s:521.59
    
    encoded 31168 frames, 175.77 fps, 521.59 kb/s
    
    real	2m57.650s
    user	10m12.278s
    sys	0m36.051s
    
  • without OpenCL:
    $ time ./x264  -o no-opencl.x264  video.mp4 
    lavf [info]: 720x404p 0:1 @ 24000/1001 fps (vfr)
    x264 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
    x264 [info]: profile High, level 3.0
    x264 [info]: frame I:373   Avg QP:16.18  size: 36484                           
    x264 [info]: frame P:12582 Avg QP:20.97  size:  4720
    x264 [info]: frame B:18213 Avg QP:23.12  size:   681
    x264 [info]: consecutive B-frames: 17.9% 10.8%  5.7% 65.7%
    x264 [info]: mb I  I16..4: 23.1% 24.5% 52.5%
    x264 [info]: mb P  I16..4:  1.6%  2.4%  2.8%  P16..4: 11.8%  6.5%  4.5%  0.0%  0.0%    skip:70.5%
    x264 [info]: mb B  I16..4:  0.1%  0.1%  0.2%  B16..8:  7.6%  1.9%  0.7%  direct: 0.6%  skip:88.8%  L0:47.1% L1:46.3% BI: 6.6%
    x264 [info]: 8x8 transform intra:31.5% inter:27.4%
    x264 [info]: coded y,uvDC,uvAC intra: 36.9% 56.1% 43.1% inter: 3.9% 4.9% 2.1%
    x264 [info]: i16 v,h,dc,p: 61% 29%  8%  2%
    x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 20% 15% 58%  1%  1%  1%  1%  1%  2%
    x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 22% 22%  4%  4%  4%  5%  4%  4%
    x264 [info]: i8c dc,h,v,p: 51% 23% 22%  4%
    x264 [info]: Weighted P-Frames: Y:0.8% UV:0.7%
    x264 [info]: ref P L0: 64.9%  6.8% 17.7% 10.5%  0.0%
    x264 [info]: ref B L0: 78.6% 18.1%  3.3%
    x264 [info]: ref B L1: 95.4%  4.6%
    x264 [info]: kb/s:525.55
    
    encoded 31168 frames, 186.50 fps, 525.55 kb/s
    
    real	2m47.235s
    user	10m10.138s
    sys	0m6.067s
    

Resulting files:

$ du -sk *opencl.x264
83404	no-opencl.x264
82776	opencl.x264

So this doesn't make much of a difference unfortunately (at least on my GTS 450): if anything it is a tad slower (175.77 fps with OpenCL vs 186.50 fps without), and the output files are almost the same size.
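
The comparison above can be made concrete with a little arithmetic. A minimal sketch, using only the figures reported by the two x264 runs (the dictionary layout and the helper name are illustrative, not part of xpra or x264):

```python
# Figures copied from the two x264 runs above.
runs = {
    "opencl":    {"fps": 175.77, "real_s": 2 * 60 + 57.650, "sys_s": 36.051, "size_kb": 82776},
    "no-opencl": {"fps": 186.50, "real_s": 2 * 60 + 47.235, "sys_s": 6.067,  "size_kb": 83404},
}

def percent_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old

ocl, cpu = runs["opencl"], runs["no-opencl"]
fps_change = percent_change(ocl["fps"], cpu["fps"])
real_change = percent_change(ocl["real_s"], cpu["real_s"])
size_change = percent_change(ocl["size_kb"], cpu["size_kb"])
sys_ratio = ocl["sys_s"] / cpu["sys_s"]

print("fps with OpenCL:        %+.2f%%" % fps_change)
print("wall clock with OpenCL: %+.2f%%" % real_change)
print("file size with OpenCL:  %+.2f%%" % size_change)
print("sys time ratio:         %.1fx" % sys_ratio)
```

The sys time is the one figure that changes a lot between the two runs; everything else moves by a few percent at most.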

The one thing where this may still be useful is for motion detection, where we could increase the search diameter without incurring too much more CPU usage.

Enabling it looks simple enough, in x264.h:

int b_opencl;            /* use OpenCL when available */

(assuming that x264 is built with opencl support)

comment:3 Changed 5 years ago by Antoine Martin

Description: modified

For the record, this is what I had to do to get pyopencl to build on Fedora 19 with the nvidia SDK to avoid this error at import time:

ImportError: /usr/lib/python2.7/dist-packages/pyopencl/_cl.so: \
    symbol clRetainDevice, version OPENCL_1.2 not defined in file libOpenCL.so.1 with link time reference

The existing headers look like this:

$ ls -la /usr/include/CL
lrwxrwxrwx. 1 root root 32 Aug 28 12:39 /usr/include/CL -> /etc/alternatives/opencl-headers

Edit: Just downgrading the version of opencl-headers to 1.1 is enough.


Alternatively, we can move the headers to a version specific directory and add the OpenCL 1.1 headers:

cd /etc/alternatives/
mv opencl-headers opencl-headers-1.2
mkdir opencl-headers-1.1
ln -sf opencl-headers-1.1 opencl-headers
cd opencl-headers-1.1

wget http://www.khronos.org/registry/cl/api/1.1/cl_gl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
wget http://www.khronos.org/registry/cl/api/1.1/opencl.h

Then we need to ensure pyopencl will be built against 1.1, so siteconf.py contains:

CL_PRETEND_VERSION = '1.1'
Last edited 5 years ago by Antoine Martin

comment:4 Changed 5 years ago by Antoine Martin

Having installed freeocl, I now have 3 providers available:

$ LD_LIBRARY_PATH=/opt/cuda/lib64/ XPRA_SWSCALE_DEBUG=0 PYTHONPATH=. python ./tests/xpra/codecs/test_csc_opencl.py 
PyOpenCL OpenGL support: True
found 3 OpenCL platforms:
* FreeOCL (FreeOCL developers) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 FreeOCL-0.3.6 / OpenCL C 1.2)
* NVIDIA CUDA (NVIDIA Corporation) - 1 devices:
 + GPU: GeForce GTS 450 (OpenCL 1.1 CUDA / OpenCL C 1.1 )
* Intel(R) OpenCL (Intel(R) Corporation) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 (Build 67279) / OpenCL C 1.2 )
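
A listing like the one above can be produced with a few PyOpenCL calls. This is only a sketch and not the code of test_csc_opencl.py: `describe_platforms` is a hypothetical helper, and it relies on the `name`, `vendor`, `type`, `version` and `opencl_c_version` attributes that PyOpenCL exposes on platforms and devices.

```python
def describe_platforms(platforms, type_name=str):
    """Format platform and device information in the style of the listing above.

    `platforms` is what pyopencl.get_platforms() returns; `type_name` turns a
    device type value into a label (pyopencl.device_type.to_string in real use).
    """
    lines = ["found %i OpenCL platforms:" % len(platforms)]
    for platform in platforms:
        devices = platform.get_devices()
        lines.append("* %s (%s) - %i devices:" % (platform.name, platform.vendor, len(devices)))
        for device in devices:
            lines.append(" + %s: %s (%s / %s)" % (
                type_name(device.type), device.name,
                device.version, device.opencl_c_version))
    return lines

if __name__ == "__main__":
    try:
        import pyopencl as cl
        platforms = cl.get_platforms()
    except Exception as e:  # pyopencl missing, or no usable ICD installed
        print("OpenCL is not available: %s" % e)
    else:
        for line in describe_platforms(platforms, cl.device_type.to_string):
            print(line)
```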

Changed 5 years ago by Antoine Martin

Attachment: add-csc-opencl-v6.patch added

works ok but only one format so far: YUV420P to RGB

comment:5 Changed 5 years ago by Antoine Martin

Please try the patch above and report on performance.
You may need to adjust some env vars for finding the libraries in the cuda paths and for selecting the opencl platform/device:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/cuda/lib64/
export PYTHONPATH=.
XPRA_OPENCL_DEVICE_TYPE=GPU python ./tests/xpra/codecs/test_csc_opencl.py
XPRA_OPENCL_DEVICE_TYPE=CPU python ./tests/xpra/codecs/test_csc_opencl.py 

Note: be careful with LD_LIBRARY_PATH: putting the cuda directory ahead of the regular library paths can cause some serious problems (conflicts between libopencl versions, for example).
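
For illustration, this is roughly how a device could be chosen from the XPRA_OPENCL_DEVICE_TYPE variable used above. The patch itself is not reproduced here, so `select_device`, the "GPU" default and the fall back to the first device are all assumptions made for this sketch:

```python
import os

def select_device(devices, wanted=None):
    """Pick a device from a list of (type, name) pairs.

    `wanted` defaults to the XPRA_OPENCL_DEVICE_TYPE environment variable
    (assumed here to default to "GPU"). If nothing matches, fall back to the
    first device; both choices are assumptions of this sketch.
    """
    wanted = (wanted or os.environ.get("XPRA_OPENCL_DEVICE_TYPE", "GPU")).upper()
    for dev_type, name in devices:
        if dev_type.upper() == wanted:
            return dev_type, name
    return devices[0] if devices else None

if __name__ == "__main__":
    # Device list taken from the platform listing earlier in this ticket.
    devices = [
        ("CPU", "AMD Phenom(tm) II X4 945 Processor"),
        ("GPU", "GeForce GTS 450"),
    ]
    print(select_device(devices))
```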


Results deleted (those figures were wrong because of a bug)

The results aren't as bad as they look for nvidia:

  • cpu csc is already very fast since it is such a simple operation
  • hopefully the difference will be more noticeable when we add scaling
  • the gfx card is quite slow by modern standards (we'll see if faster ones help - not guaranteed it will make a huge difference here since the cost is mostly memory bandwidth)
  • most of the cpu time is spent copying buffers to and from the gfx card and on modern cpus that is slightly better than doing fpu or more general instruction decoding

Even then, I think there is room for improvement since we copy the pixels in and out and we may not need to (we just need a buffer interface).

Interestingly, the performance varies widely depending on the picture size.. will need to look into the worksize/localsize settings.

Last edited 5 years ago by Antoine Martin

Changed 5 years ago by Antoine Martin

Attachment: add-csc-opencl-v7.patch added

updated patch - fix crash with swscale

comment:6 Changed 5 years ago by Smo

Here are the results on Nvidia K1 (Nvidia) OpenCL

At 1920x1080
191 MPixels/s
223 MPixels/s
161 MPixels/s
184 MPixels/s
172 MPixels/s
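
To put these numbers in perspective, they can be converted into full 1920x1080 frames per second. A small sketch using only the five samples above (the variable names are illustrative):

```python
# MPixels/s samples reported above for 1920x1080 input.
samples = [191, 223, 161, 184, 172]
width, height = 1920, 1080

mean_mpixels = sum(samples) / len(samples)
frame_mpixels = width * height / 1e6           # megapixels in one frame
frames_per_second = mean_mpixels / frame_mpixels

print("mean throughput: %.1f MPixels/s" % mean_mpixels)
print("one frame:       %.4f MPixels" % frame_mpixels)
print("equivalent to:   %.1f frames/s" % frames_per_second)
```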

Changed 5 years ago by Antoine Martin

Attachment: add-csc-opencl-v10.patch added

working version with all yuv formats as input and both BGRX and RGBX as output

comment:7 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo
Status: assigned → new

Please re-run with patch v10 which fixes some important bugs.

I am afraid that I cannot commit it as-is because the OpenCL shared libraries we end up loading cause some serious problems:

Traceback (most recent call last):
  File "/usr/bin/xpra", line 6, in <module>
    sys.exit(xpra.scripts.main.main(__file__, sys.argv))
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/main.py", line 432, in main
    return run_server(parser, options, mode, script_file, args)
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/server.py", line 454, in run_server
    import gtk.gdk          #@Reimport
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 40, in <module>
    from gtk import _gtk
ImportError: dlopen: cannot load any more object with static TLS

Changed 5 years ago by Antoine Martin

Attachment: add-csc-opencl-v13.patch added

updated patch with support for RGB to YUV444P (and more to come)

comment:8 Changed 5 years ago by Antoine Martin

Owner: changed from Smo to Antoine Martin
Status: new → assigned

Added support in r4247

According to Recommended 8-Bit YUV Formats for Video Rendering (section on "YUV Sampling"), MPEG2's subsampling code (BT.601) is more lazy than MPEG1's - but since OpenCL is so cheap to run (it is the memory transfers that cost us), I went for the MPEG1-like more exhaustive calculations instead (using an average of all source pixel values).
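
The "average of all source pixel values" approach can be illustrated outside of OpenCL. The following pure Python sketch is not the kernel from r4247: it only shows the idea of halving a chroma plane by averaging each 2x2 block, next to a cheaper single-sample variant included purely for contrast. Both helper names are hypothetical and both assume even plane dimensions.

```python
def subsample_average(plane):
    """Halve a chroma plane in both directions by averaging each 2x2 block."""
    height, width = len(plane), len(plane[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            total = (plane[y][x] + plane[y][x + 1] +
                     plane[y + 1][x] + plane[y + 1][x + 1])
            row.append((total + 2) // 4)    # integer mean, rounded to nearest
        out.append(row)
    return out

def subsample_pick(plane):
    """Cheaper variant: keep only the top-left sample of each 2x2 block."""
    return [row[::2] for row in plane[::2]]

if __name__ == "__main__":
    u_plane = [
        [10, 20, 30, 40],
        [30, 40, 50, 60],
    ]
    print(subsample_average(u_plane))
    print(subsample_pick(u_plane))
```

The averaged version reads four times as many samples per output value, which matters little when the memory transfers dominate the cost.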

Still have to figure out the TLS issue before this can be of any use..

Last edited 5 years ago by Antoine Martin

comment:9 Changed 5 years ago by Antoine Martin

Testing on a dual Xeon E5-2670 with dual NVidia K1s (more results here), I found that the individual K1 GPU cores are actually slower than my GTS 450 and so using OpenCL with x264 actually makes it run slower (and I believe the CPU savings are not worth much either):

  • without OpenCL:
    encoded 3347 frames, 148.74 fps, 1853.13 kb/s
    
    real	0m22.759s
    user	6m40.754s
    sys	0m7.133s
    
  • with OpenCL:
    encoded 3347 frames, 89.80 fps, 1866.38 kb/s
    
    real	0m46.335s
    user	4m42.685s
    sys	0m26.054s
    

comment:10 Changed 5 years ago by Antoine Martin

Resolution: fixed
Status: assigned → closed

The TLS issue has been solved in r4282 by only properly initializing csc_opencl (getting a context) after we have loaded GTK... which works around the problem rather than solving it properly.
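
The workaround amounts to deferring the expensive initialization until it is first needed. The code from r4282 is not shown in this ticket, so the following is only a generic sketch of that pattern; `LazyContext` is a hypothetical name, and in real use the factory could be something like `pyopencl.create_some_context`.

```python
class LazyContext:
    """Create the underlying context on first use instead of at import time."""

    def __init__(self, factory):
        self._factory = factory     # callable that builds the real context
        self._context = None

    def get(self):
        if self._context is None:
            # First call: by now the other libraries (GTK in this ticket)
            # are expected to have been loaded already.
            self._context = self._factory()
        return self._context

if __name__ == "__main__":
    lazy = LazyContext(lambda: "stand-in for an OpenCL context")
    # ... import and initialize everything else first ...
    print(lazy.get())
```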

OpenCL is now enabled (r4298) and working well so closing this ticket.

Note: we may still want some enhancements:

  • handle more modes with generated kernel byteswapping for channel modes not handled by the runtime library (easy)
  • handle scaling (big!)
  • debug kernel build errors with FreeOCL and pocl
Last edited 5 years ago by Antoine Martin

comment:11 Changed 5 years ago by Antoine Martin

  • scaling was added in r4310
  • generating missing rgb modes was added in r4303

See also #437

comment:12 Changed 5 years ago by Antoine Martin

There were many more changes and tweaks (too many to list).


Note: the TLS issue is discussed here on the PyOpenCL mailing list.
Looks like a PyOpenCL build issue - may need to revisit when testing with the Nvidia SDK which only supports OpenCL 1.1 ...

comment:13 Changed 5 years ago by Antoine Martin

Resolution: fixed
Status: closed → reopened

Just found that the AMD icd causes the client to get into a spin and waste CPU on a spinlock.
Simply having the AMD icd in /etc/OpenCL/vendors is enough to trigger the problem, so OpenCL should probably be disabled by default to prevent this. What is really odd is that this only affects the client, the server will happily run with the AMD icd (you can force it to be used with: XPRA_FORCE_CSC_MODE=YUV420P XPRA_CSC_TYPE=opencl xpra start ...)
We cannot do a runtime check as calling any OpenCL API will cause the loader to dlopen the problematic library.. and we're toast.
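
One thing that can be done without calling any OpenCL API is to look at the ICD registration files themselves. The sketch below only reads the `.icd` files in /etc/OpenCL/vendors (each one names the vendor library to load); it is not something xpra is known to do, `list_icds` is a hypothetical helper, and deciding which entries are problematic is left to the caller.

```python
import glob
import os

def list_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Map each .icd file name to the library it names, without loading OpenCL."""
    found = {}
    for path in sorted(glob.glob(os.path.join(vendors_dir, "*.icd"))):
        with open(path) as f:
            found[os.path.basename(path)] = f.read().strip()
    return found

if __name__ == "__main__":
    for icd, library in list_icds().items():
        print("%s -> %s" % (icd, library))
```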

Beware: one cannot strace the xpra client (the machine locks up - need ssh to come and kill the strace process)

Here's what strace has to say:

open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
read(10, "0-7\n", 8192)                 = 4
close(10)                               = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f78e9007000
mprotect(0x7f78e9007000, 4096, PROT_NONE) = 0
clone(Process 2797 attached
 <unfinished ...>
[pid  2797] set_robust_list(0x7f78e98079e0, 24 <unfinished ...>
[pid  2655] <... clone resumed> child_stack=0x7f78e9806fb0, \
    flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, \
    parent_tidptr=0x7f78e98079d0, tls=0x7f78e9807700, child_tidptr=0x7f78e98079d0) = 2797
[pid  2797] <... set_robust_list resumed> ) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0x4008642a <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2655] <... ioctl resumed> , 0x7fff7aabbb08) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0xc03064a6 <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

The futex call repeats forever and the xpra client process consumes >70% CPU doing absolutely nothing.

Last edited 5 years ago by Antoine Martin

comment:14 Changed 5 years ago by Antoine Martin

Description: modified

comment:15 Changed 5 years ago by Antoine Martin

And another one for good measure, Intel this time, is doing an illegal memory access, caught with valgrind:

==27195== Invalid read of size 8
==27195==    at 0x118DDA1C: __intel_sse2_strrchr (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C8531: tbb::internal::init_dl_data() (dynamic_link.cpp:290)
==27195==    by 0x118C8466: __sti__$E (dynamic_link.cpp:449)
==27195==    by 0x118E8001: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C367A: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x7FF000276: ???
==27195==    by 0x6E6F687479702E: ???
==27195==    by 0x6E69622F7273752E: ???
==27195==    by 0x746100617270782E: ???
==27195==    by 0x652D2D0068636173: ???
==27195==    by 0x3D676E69646F636D: ???
==27195==    by 0x6E2D2D0034363267: ???
==27195==  Address 0xec4c5d8 is 56 bytes inside a block of size 58 alloc'd
==27195==    at 0x4A06409: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27195==    by 0x3452405C95: open_path (dl-load.c:2036)
==27195==    by 0x34524086DC: _dl_map_object (dl-load.c:2223)
==27195==    by 0x345240CAD1: openaux (dl-deps.c:63)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x345240D1D1: _dl_map_object_deps (dl-deps.c:256)
==27195==    by 0x34524138BB: dl_open_worker (dl-open.c:265)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x34524131EA: _dl_open (dl-open.c:656)
==27195==    by 0x3452C0102A: dlopen_doit (dlopen.c:66)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x3452C0162C: _dlerror_run (dlerror.c:163)
Last edited 5 years ago by Antoine Martin

comment:16 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to SmO
Status: reopened → new

I have added the most important setup and configuration information here: wiki/CSC and the performance data now lives here: wiki/CSC/Performance


There are new SDKs available:

  • Intel SDK XE 2013 R2 - which I am unable to test on my AMD CPU, can you please check that it still runs OK and maybe add or update the performance data (hopefully they will have fixed the invalid 64-bit memory access from comment:15 - if you have time, run the minimal opencl tests under valgrind)
  • AMD APP SDK v2.9 - and I can no longer reproduce the client problems.


Maybe this can be enabled by default server side?

I don't think we will ever bother using OpenCL or nvcuda (#384) for CSC on the client side, since we're better off using OpenGL for CSC, scaling and rendering (it is now stable enough to use).

comment:17 Changed 5 years ago by Smo

I've tested the Intel, AMD and Nvidia OpenCL ICDs without problems. However, there is an issue with the AMD ICD which prevents Xorg from receiving a kill signal: even just having this ICD available seems to be enough to trigger it.

I'm going to work from a clean install and try to find a set of instructions that includes all the above info to install the Intel + Nvidia ICD's on Fedora 20 to work with xpra.

comment:18 Changed 5 years ago by Antoine Martin

I've just hit this error:

clFinish failed: invalid command queue

After a computer suspend-resume, it seems that the context becomes invalid (must have been cleared from the GPU during suspend). r5110 fixes that.
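
The general shape of such a fix is to rebuild the context and retry when the old one turns out to be invalid. The change in r5110 is not shown here, so this is only a sketch of the pattern: `ContextLost` is a stand-in exception (with PyOpenCL the failure would surface as `pyopencl.Error` or one of its subclasses), and `run_with_recovery` is a hypothetical helper.

```python
class ContextLost(Exception):
    """Stand-in for the error raised when the GPU context has become invalid."""

def run_with_recovery(make_context, work, retries=1):
    """Run work(context); on ContextLost, create a fresh context and try again."""
    context = make_context()
    for attempt in range(retries + 1):
        try:
            return work(context)
        except ContextLost:
            if attempt == retries:
                raise                   # give up: the new context failed too
            context = make_context()    # e.g. after a suspend/resume cycle

if __name__ == "__main__":
    created = []

    def make_context():
        created.append(len(created))
        return created[-1]

    def convert(context):
        if context == 0:                # pretend the first context was lost
            raise ContextLost("clFinish failed: invalid command queue")
        return "converted using context %i" % context

    print(run_with_recovery(make_context, convert))
```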


Quite likely to affect nvenc (added to #466) and csc_nvcuda (added to #384)

comment:19 Changed 5 years ago by Smo

Trying to test with AMD OpenCL using HD 6870 GPU

Getting some strange output, is this normal?

using new OpenCL context
YUV420P to BGRX    at  1920x1080        : 90 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV420P to RGBX    at  1920x1080        : 128 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to BGRX    at  1920x1080        : 113 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to RGBX    at  1920x1080        : 131 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to BGRX    at  1920x1080        : 141 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to RGBX    at  1920x1080        : 112 MPixels/s

Seems to be starting many new contexts.

comment:20 Changed 5 years ago by Smo

Tested a few suspend/resume with r5153 with an ATI HD6870 and no issue.

2014-01-08 17:55:44,912 PyOpenCL loaded, header version: 1.2, GL support: False
2014-01-08 17:55:44,913  using platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
2014-01-08 17:55:44,913  using device: GPU: Barts (OpenCL 1.2 AMD-APP (1348.4) / OpenCL C 1.2 )

For more info

Last edited 5 years ago by Smo

comment:21 Changed 5 years ago by Antoine Martin

From comment:20: that's odd, are you not seeing any "using new OpenCL context" messages after suspend/resume as I was? (I will try an intel chipset too)
The patch attachment/ticket/422/opencl-forcewait.patch makes it easier to hit the context problems: adding a 10 second delay in the encoding so that we can more easily suspend a PC whilst the GPU context is active.


Also, the log from comment:19 is worrying: the context should not have changed during the same run and I don't see how it could..
r5154 will tell us what has changed (the context or the "program"). If you still get multiple occurrences of "using new OpenCL context" during the test run, please run the test with XPRA_OPENGL_DEBUG=1 and post the lines preceding them; they should read something like: old program=(..), new program=(..) or old context=(..), new context=(..).

Last edited 5 years ago by Antoine Martin

Changed 5 years ago by Antoine Martin

Attachment: opencl-forcewait.patch added

introduces a 10 second delay in the encoding to make it easier to suspend with a live context

comment:22 Changed 5 years ago by Smo

For comment:20

init_context(..) channel order=RGBA, filter mode=NEAREST
init_context(..) kernel_function RGB_to_YUV422P: <pyopencl._cl.Kernel object at 0x3300628>
old program=<pyopencl.Program object at 0x2e21510>, new program=<pyopencl.Program object at 0x2e21510>
using new OpenCL context (program changed)
init_context(..) kernel source=

Changed 5 years ago by Antoine Martin

Attachment: opencl-programcompare.patch added

try to use the underlying int_ptr to compare opencl program instances

comment:23 Changed 5 years ago by Antoine Martin

What the? the programs are clearly the same... yet fail the comparison test.

Looks like the docs are wrong: pyopencl.Program: Instances of this class are hashable, and two instances of this class may be compared using “==” and ”!=”. (Hashability was added in version 2011.2.) (unless you are using an outdated version of PyOpenCL?)

Can you please try once more with attachment/ticket/422/opencl-programcompare.patch to see if the spurious "using new OpenCL context" messages still occur? (and post your version of the PyOpenCL package)
The easy alternative would be to remove the program test altogether. I have manually verified that we always re-initialize the programs when we re-initialize the device, so this would be safe for now, but it would make the code much more brittle.
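
The idea behind the patch, comparing the underlying `int_ptr` handles rather than relying on "==", can be sketched as follows. This is not the patch itself: `same_cl_object` is a hypothetical helper and `FakeProgram` is a stand-in used here to mimic an object whose "==" misbehaves; real PyOpenCL objects expose `int_ptr` in recent versions.

```python
def same_cl_object(a, b):
    """Compare two OpenCL wrapper objects by their underlying handle."""
    if a is None or b is None:
        return a is b
    return a.int_ptr == b.int_ptr

class FakeProgram:
    """Stand-in for a program object whose '==' always reports a difference."""

    def __init__(self, int_ptr):
        self.int_ptr = int_ptr

    def __eq__(self, other):
        return False

    __hash__ = object.__hash__

if __name__ == "__main__":
    old = FakeProgram(0x2e21510)
    new = FakeProgram(0x2e21510)
    print("== says:", old == new)
    print("int_ptr says:", same_cl_object(old, new))
```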

comment:24 Changed 5 years ago by Smo

Odd pyopencl seems to be installed 32 bit??

Using /usr/lib/python2.7/site-packages/pyopencl-2013.2-py2.7-linux-x86_64.egg
I installed this with easy_install -Z pyopencl. I may have to do it by hand, we'll see.

I applied your patch and they seem to be all gone now.

Last edited 5 years ago by Antoine Martin

comment:25 Changed 5 years ago by Antoine Martin

OK, I'll try to produce a test case to report the bug to PyOpenCL, which I will have to ask you to test for me since I can't reproduce this weirdness.
In the meantime, r5157 merges the workaround with a long comment explaining its purpose.

FYI: /usr/lib/python2.7/site-packages/ can contain both 32-bit and 64-bit extensions..
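
Since the directory name says nothing about word size, a more direct check is to ask the interpreter itself and to look up the installed package version. A small sketch using only the standard library (Python 3.8 or later for `importlib.metadata`; the two helper names are hypothetical):

```python
import struct
from importlib import metadata

def interpreter_bits():
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8

def package_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    print("interpreter: %i-bit" % interpreter_bits())
    print("pyopencl version: %s" % package_version("pyopencl"))
```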

comment:26 Changed 5 years ago by Smo

Thanks for the clarification. I'll update the performance chart with my numbers from this machine and a quick instruction set for being able to run it.

AMD drivers require some extra stuff like exporting COMPUTE=:0 so I assume you actually have to have an X server running?

That said I think we've tried out opencl_csc on several platforms now and several opencl ICD's

Last edited 5 years ago by Antoine Martin

comment:27 Changed 5 years ago by Smo

Install AMD OpenCL on Fedora 20

I did this from a fresh install with LXDE

From a root terminal

yum group install "Development Tools"; yum install kernel-devel opencl-headers gcc-c++

cd /tmp
wget http://www2.ati.com/drivers/beta/amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip

unzip amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip
chmod +x Install-AMD-APP.sh; ./Install-AMD-APP.sh

I chose to do an express install. It may ask you to reboot; I chose to do this after I installed the AMD App SDK.

Download AMD-APP-SDK-v2.9-lnx64.tgz from http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/

tar xfvz ../AMD-APP-SDK-v2.9-lnx64.tgz
./Install-AMD-App.sh

I rebooted after this install and proceeded to install pyopencl with easy_install

easy_install -Z pyopencl

Started and tested xpra with this command line

COMPUTE=:0 XPRA_OPENCL_DEVICE_TYPE=GPU xpra --no-daemon --bind-tcp=0.0.0.0:1300 --start-child="xterm -fg white -bg black" start :13

comment:28 Changed 5 years ago by Smo

Resolution: fixed
Status: new → closed

Works well with both AMD and Nvidia
