
Opened 5 years ago

Closed 5 years ago

#653 closed task (fixed)

NVENC SDK version 4 support

Reported by: Antoine Martin
Owned by: Smo
Priority: major
Milestone: 0.15
Component: encodings
Version: trunk
Keywords: nvenc
Cc:

Description

The updated SDK was released today; build fixes are in r7397 and should probably be backported.

Some points extracted from the NVENC App Note:

  • we need a driver / SDK / kernel matrix; there are just too many combinations (see #466)
  • H.264 Lossless Encoding

Could be very interesting for us, as we would then not need to auto-refresh anything.
(bandwidth and GPU performance permitting - no doubt this is going to be more expensive than lossy compression, so it should be used sparingly)

  • H.264 Regular YUV 4:4:4

Well, we already had this feature working in version 3 (eventually), and it is inexplicably broken in recent driver versions (see #466). So they've changed it somehow; their API documentation is as hopeless as before on this subject: it does not mention it even once.

  • Claimed Enhanced Performance and Enhanced Quality: The Maxwell NVENC hardware doubles the encoder performance as compared to Kepler NVENC.

Sounds good, remains to be measured.

  • Works on GeForce, Quadro, Tesla, and GRID boards. For low-end Quadro and GeForce boards, two sessions of NVENC are allowed.

Now, this is worrying: I hope this means that we can have only two sessions without license keys, and are not limited if we do have the license keys. This will need to be tested. If not, we can stick with CentOS 7 + kernel 3.11 + older drivers + SDK version 3, but then we miss out on some of the version 4 enhancements.

Attachments (9)

nvenc-v4.patch (6.0 KB) - added by Antoine Martin 5 years ago.
better patch using constants.pxi for missing enum
nvenc-split.patch (166.3 KB) - added by Antoine Martin 5 years ago.
work in progress patch: splitting into a new codec source module to prevent clashes and regressions
nvenc-v3-v4.diff (58.5 KB) - added by Antoine Martin 5 years ago.
diff between SDK v3 and v4
nvenc-debug-and-dumpsettings.patch (23.7 KB) - added by Antoine Martin 5 years ago.
adds debug to all API calls and logs all the settings used
nvenc-split-v2.patch (188.4 KB) - added by Antoine Martin 5 years ago.
updated split codec patch
test_nvenc_api.c (7.7 KB) - added by Antoine Martin 5 years ago.
updated api test code, closer to the cython version and more complete
nvenc-split-v4.patch (188.1 KB) - added by Antoine Martin 5 years ago.
new version of the nvenc4 codec with cuda init done using ctypes instead of pycuda
nvenc-split-v6.2.patch (197.3 KB) - added by Antoine Martin 5 years ago.
same patch, but excluding the license key
nvenc-split-v9.patch (190.4 KB) - added by Antoine Martin 5 years ago.
latest, working patch


Change History (25)

comment:1 Changed 5 years ago by Antoine Martin

Instead of including the GUID presets from the header file, we should just duplicate the values in our code, since we can check whether they are supported at runtime. That's much easier than fiddling with #ifdef to compile against different versions of the SDK.
e.g. the new lossless GUIDs:

static const GUID NV_ENC_PRESET_LOSSLESS_DEFAULT_GUID = 
    { 0xd5bfb716, 0xc604, 0x44e7, { 0x9b, 0xb8, 0xde, 0xa5, 0x51, 0xf, 0xc3, 0xac } };
static const GUID NV_ENC_PRESET_LOSSLESS_HP_GUID = 
    { 0x149998e7, 0x2364, 0x411d, { 0x82, 0xef, 0x17, 0x98, 0x88, 0x9, 0x34, 0x9 } };

We can also check for lossless support with the caps: NV_ENC_CAPS_SUPPORT_LOSSLESS_ENCODE.
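
A minimal sketch of that runtime check (untested; "encoder" is assumed to be the session handle returned by nvEncOpenEncodeSessionEx, and "functions" the NV_ENCODE_API_FUNCTION_LIST filled in by NvEncodeAPICreateInstance):

NV_ENC_CAPS_PARAM caps_param;
int supported = 0;
memset(&caps_param, 0, sizeof(caps_param));
caps_param.version = NV_ENC_CAPS_PARAM_VER;
caps_param.capsToQuery = NV_ENC_CAPS_SUPPORT_LOSSLESS_ENCODE;
if (functions.nvEncGetEncodeCaps(encoder, NV_ENC_CODEC_H264_GUID,
        &caps_param, &supported) == NV_ENC_SUCCESS && supported) {
    /* safe to select one of the lossless preset GUIDs above */
}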

Changed 5 years ago by Antoine Martin

Attachment: nvenc-v4.patch added

better patch using constants.pxi for missing enum

comment:2 Changed 5 years ago by Antoine Martin

Looks like the reason why 444 mode is broken in recent driver versions is that the support may have been added to the API proper this time: rather than doing 3 encoding rounds, one per plane, we may have to check the new capability NV_ENC_CAPS_SUPPORT_YUV444_ENCODE, and chromaFormatIDC should then be set to 3 instead... As usual, the documentation is non-existent.
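
A sketch of that suspected new path (untested; reuses the same "functions"/"encoder" handles as in the caps check above, and the chromaFormatIDC value of 3 comes from reading the headers, not from any documentation):

int yuv444 = 0;
caps_param.capsToQuery = NV_ENC_CAPS_SUPPORT_YUV444_ENCODE;
functions.nvEncGetEncodeCaps(encoder, NV_ENC_CODEC_H264_GUID, &caps_param, &yuv444);
if (yuv444) {
    /* "config" is the NV_ENC_CONFIG later passed to nvEncInitializeEncoder */
    config.encodeCodecConfig.h264Config.chromaFormatIDC = 3;
}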

Maybe we should copy the code and start versioning the nvenc codec (for this new way of doing yuv444, and for the new lossless mode)? Then we can ship two versions of the codec and just load the appropriate one at runtime, based on which driver version is installed.

comment:3 Changed 5 years ago by Antoine Martin

With r7702, we can more easily build against different versions of the nvenc SDK: just use --with-nvenc=3 or --with-nvenc=4.

Minor fixes to test code in r7703, and minor fixes / improvements in r7704 + r7705.

I've run some performance tests on a GeForce GTX 760 vs a GeForce GTX 750 Ti (same system, same bus speed, driver version 331.79 - so no yuv444) and the 760 was marginally faster with SDK v3, which is to be expected I guess (256-bit memory interface vs 128-bit, and almost double the number of cuda cores).
Will do some more tests with SDK v4 after installing newer drivers, and with yuv444 after installing older drivers (and kernel, etc.).

Last edited 5 years ago by Antoine Martin (previous) (diff)

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split.patch added

work in progress patch: splitting into a new codec source module to prevent clashes and regressions

Changed 5 years ago by Antoine Martin

Attachment: nvenc-v3-v4.diff added

diff between SDK v3 and v4

Changed 5 years ago by Antoine Martin

Attachment: nvenc-debug-and-dumpsettings.patch added

adds debug to all API calls and logs all the settings used

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v2.patch added

updated split codec patch

comment:4 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to Antoine Martin
Status: new → assigned

Need a little help checking my logic: the sample code in the SDK works OK (obviously), but xpra's nvenc module does not when running against the newer drivers and libs.

So first I diffed the two SDKs and didn't find much (see attachment/ticket/653/nvenc-v3-v4.diff): whitespace here and there, removal of all the other codecs but h264, some new GUIDs, etc.
We needed a small fix because the newer libs check the value of NV_ENC_CONFIG_VER in a sub-structure, done in r7937.

Then I got as far as nvEncInitializeEncoder, but this returned an error: "This indicates that one or more of the parameter passed to the API call is invalid".
Easy, I thought: they must have changed something, so I'll just copy their encoding settings to get past it, modifying the example code to print everything (see attachment/ticket/653/nvenc-debug-and-dumpsettings.patch). Except copying every single setting used in the example still gave the same error!
Then I thought I must have missed one, so I enhanced the debug patch to dump the whole settings structure in hex form, and modified the xpra nvenc code to load these binary settings directly into the memory structures (see attachment/ticket/653/nvenc-split-v2.patch), taking care to overwrite the pointers (there is only one). Still no go!

Then I started from the opposite direction: write the C code based on what we do in Cython (see attachment/ticket/653/test_nvenc_api.c). Compile it with:

gcc -o main xpra/codecs/nvenc4/test_nvenc_api.c \
   -L/usr/local/cuda/lib64 \
   -I/opt/nvenc_4.0_sdk/Samples/nvEncodeApp/inc/ -I/opt/cuda-6.5/include/ \
    -lcuda -L/usr/lib64 -lnvidia-encode

Run it with:

LD_LIBRARY_PATH=/opt/cuda-6.5/lib64:/opt/cuda-6.5/targets/x86_64-linux/lib/:$LD_LIBRARY_PATH \
   PATH=/opt/cuda-6.5/bin:$PATH \
   ./main

And it succeeds in opening an encode session... I am stumped. Ideas?

Last edited 5 years ago by Antoine Martin (previous) (diff)

Changed 5 years ago by Antoine Martin

Attachment: test_nvenc_api.c added

updated api test code, closer to the cython version and more complete

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v4.patch added

new version of the nvenc4 codec with cuda init done using ctypes instead of pycuda

comment:5 Changed 5 years ago by Antoine Martin

OK, I got the C code very, very close to the Cython version, then I thought that the only thing that was potentially different was the way we obtained the cuda context: using pycuda instead of using the C API directly. So the v4 patch used ctypes to call the cuda library directly (and hacked the input format check to get past that bit). And it still errored out!
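
For reference, the driver-API sequence that the ctypes shim (and normally pycuda) performs, as a minimal sketch:

#include <cuda.h>

CUdevice dev;
CUcontext ctx;
cuInit(0);                 /* must be the first driver-API call */
cuDeviceGet(&dev, 0);      /* first GPU; a real implementation would pick the device more carefully */
cuCtxCreate(&ctx, 0, dev); /* this context is later passed to nvEncOpenEncodeSessionEx */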

Then I thought it must be something in the environment that makes nvenc unhappy, and sure enough: calling the C version from the cython code causes a segfault in nvEncInitializeEncoder (run with the v6 patch).

gdb shows us:

#0  __memmove_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:2927
#1  0x00007fffef09251c in ?? () from /lib64/libnvidia-encode.so.1
#2  0x00007fffef094572 in ?? () from /lib64/libnvidia-encode.so.1
#3  0x00007fffef09d42a in ?? () from /lib64/libnvidia-encode.so.1
#4  0x00007ffff0291703 in test_nvenc_api () at xpra/codecs/nvenc4/test_nvenc_api.c:255
...

Which is called from this bit of code:

   0x00007fffef092503:	mov    0x30(%rbx),%edx
   0x00007fffef092506:	mov    0x168(%rbx),%rsi
   0x00007fffef09250d:	mov    %rax,%rdi
   0x00007fffef092510:	mov    %rax,0x1530(%rbx)
=> 0x00007fffef092517:	callq  0x7fffef08fdd0 <memcpy@plt>

If I break before the memcpy:

(gdb) p/x $rax
$1 = 0x7fffe419a010
(gdb) p/x $rbx+0x1530
$2 = 0x10f9ca0

For reference, the memcpy code copies 16 bytes at a time through 8 xmm registers:

movdqu	-0x10(%rsi), %xmm0
movdqu	-0x20(%rsi), %xmm1
etc..

And I have no idea what it is copying; it's not our config structure, for sure.
It's copying gigabytes of data until it hits addresses lower than 0x600000, where it fails. Could be a sign issue.

Looks like a bug to me. Not sure why it only triggers when we call the library from a different context. (it shouldn't make any difference)


And THAT is why proprietary software sucks. Bugs happen, all the time. No way to figure out a workaround or fix them.
(and judging by the poor quality of the documentation and example code that comes with nvenc, this is very unlikely to be the only issue)

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v6.2.patch added

same patch, but excluding the license key

comment:6 Changed 5 years ago by Antoine Martin

Ah, something interesting: using malloc instead of stack variables avoids the crash.
Though it doesn't help the cython version, which still fails with the same invalid parameter crap.
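
A sketch of the workaround (assumption: the library scribbles past a stack-allocated structure, so moving it to the heap sidesteps the corruption; "functions"/"encoder" as in the test code):

/* instead of a stack variable:
 *   NV_ENC_INITIALIZE_PARAMS params;   <- segfaults inside the library
 * allocate from the heap: */
NV_ENC_INITIALIZE_PARAMS *params = calloc(1, sizeof(*params));
params->version = NV_ENC_INITIALIZE_PARAMS_VER;
/* ... fill in the remaining fields as before, then: */
NVENCSTATUS status = functions.nvEncInitializeEncoder(encoder, params);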

comment:7 Changed 5 years ago by Antoine Martin

I've tried with the 343.22 drivers instead of 340.46, no change.

I've also tried copying the memory structure (via hex strings) to ensure that I use exactly the same settings, bar the pointer. Still the same error.
(also added an nvEncDestroyEncoder call in the C version to ensure we clean up after use, but since the same code fails the same way without even calling into the C code... that was a bit pointless anyway)

I guess the next step is to try to call the nvEncInitializeEncoder in plain C from the Cython codec, to see if that helps...

What a waste of time this is. Shoddy API.

comment:8 Changed 5 years ago by Antoine Martin

Never give up!
Well, obviously... the problem was not in the way we call nvEncInitializeEncoder, but with nvEncOpenEncodeSessionEx.
The API just happens to notice later on... and tells us in the wrong place. FFS.
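
For what it's worth, a minimal sketch of opening the session and checking the status where the mistake actually happens (untested; "functions" as before, "cuda_context" is the CUcontext created earlier, "encoder" the resulting void* session handle):

NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS params;
memset(&params, 0, sizeof(params));
params.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
params.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
params.device = (void *) cuda_context;
params.apiVersion = NVENCAPI_VERSION;
if (functions.nvEncOpenEncodeSessionEx(&params, &encoder) != NV_ENC_SUCCESS) {
    /* fail fast here: a bad session only surfaces later as an
       "invalid param" error from nvEncInitializeEncoder */
}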

I have code that sort of works. Performance tuning will be needed as this seems to be running at 60% of the speed we had before! (ouch)

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v9.patch added

latest, working patch

comment:9 Changed 5 years ago by Antoine Martin

The patch above mostly works, as long as you use the right client keys (the older SDK would give you a helpful error if you used the wrong key; this one just errors out on other method calls). Still to do:

  • malloc tidy up
  • remove old yuv444 mode and try to implement the new one
  • figure out how to load the correct module (and prefer SDK 4 if supported?)

etc..

comment:10 Changed 5 years ago by Antoine Martin

Done in r7944 (see commit message for details).

With this change, we can now support multiple driver versions with a single build. We build both the NVENC v3 and v4 encoders, and choose whichever one happens to work with the driver + kernel + card that we find.

still TODO:

  • allow us to choose a different cuda sdk for each nvenc module?
  • testing: will add results to #595
  • YUV444 and lossless mode not done
  • figure out if there are new restrictions on the number of contexts..
  • figure out how to get more contexts, parallel encoding hits a bottleneck somewhere, hopefully not the hardware... (see ticket:595#comment:13)

etc

Last edited 5 years ago by Antoine Martin (previous) (diff)

comment:11 Changed 5 years ago by Antoine Martin

My laptop does not have an Nvidia card, yet I still want to build the cuda bits and ship them in the RPM without requiring the full SDK at deployment time, so r7958 now builds fat binaries.
It should be possible to run nvenc with only the drivers installed, saving almost 2GB of disk space!
We may want to tune the target architectures (which are currently hardcoded) if new ones offer significant performance improvements (which is doubtful: what we do with CUDA is too simple for it to matter much). This probably means CUDA 6+ is now required to build.
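
For reference, a fat binary simply embeds code for several target architectures in a single file; the nvcc invocation involved looks something like this (the kernel file name and the exact -gencode list are illustrative, not the values hardcoded in setup.py):

nvcc -fatbin \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -o BGRA_to_NV12.fatbin BGRA_to_NV12.cu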

comment:12 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo
Status: assigned → new

NVENC4 support has been working for a while in trunk / 0.15; let's get this tested.

The YUV444 and lossless mode support are now tracked in a separate ticket: #796.

comment:13 Changed 5 years ago by Smo

It is indeed possible to run with the RPMs from the xpra Fedora 20 beta repo; however, I had to compile pycuda myself. When installing from RPM, I hit this issue since I installed the drivers manually from nvidia.com:

Resolving Dependencies
--> Running transaction check
---> Package python-pycuda.x86_64 0:2014.1-2 will be installed
--> Processing Dependency: python-pytools for package: python-pycuda-2014.1-2.x86_64
--> Processing Dependency: libcuda.so.1()(64bit) for package: python-pycuda-2014.1-2.x86_64
--> Running transaction check
---> Package python-pycuda.x86_64 0:2014.1-2 will be installed
--> Processing Dependency: libcuda.so.1()(64bit) for package: python-pycuda-2014.1-2.x86_64
---> Package python-pytools.noarch 0:2014.3-1.fc20 will be installed
--> Finished Dependency Resolution
Error: Package: python-pycuda-2014.1-2.x86_64 (winswitch)
           Requires: libcuda.so.1()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Not sure how we can make it skip this check.

Tested this on the same machine with 2 cards:

a GeForce GTX 650 and a GeForce GTX 750 Ti,
with the driver installer NVIDIA-Linux-x86_64-346.47.run, which is the long-lived branch from http://www.nvidia.ca/object/unix.html.
It would appear that without a license key, nvenc4 greatly reduces the number of contexts that can be used (2, I think).

I will do some more testing with a more recent card to make sure and update this ticket.

comment:14 Changed 5 years ago by Antoine Martin

When installing python-pycuda, you have two options:

  • always install the drivers from RPM... (not ideal: not flexible)
  • use rpm -i --nodeps (also not ideal: broken dependencies end up installed)

It is worth mentioning negativo17 again: I like the way they package things (nvidia-driver-cuda-libs can be installed without any dependencies, for example), and they support both long-lived and short-lived driver branches for Fedora.

Maybe we can remove the dependency on libcuda.so.1 using something like:

%define __requires_exclude ^lib.*$

in the pycuda specfile. Does that work for you?
(see Tweaking Dependency Generators and auto provides and requires filtering for details)
This would be a much better solution than using the big hammer AutoReqProv: no (though I guess we could use that as a last resort).
According to Can some specific autodetected dependency be ignored upon rpmbuild, this is only supported on Fedora and CentOS 7...
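
A quick way to verify the result on a rebuilt package (the package file name here is illustrative):

rpm -qp --requires python-pycuda-2014.1-2.x86_64.rpm | grep libcuda

If nothing prints, the automatic libcuda.so.1 requirement was filtered out.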

The only other alternative I can think of is to roll your own driver RPMs. You could use negativo17 as a starting point.


It would appear that without a license key, nvenc4 greatly reduces the number of contexts that can be used (2, I think).


Yes, I believe that the consumer cards are limited to just 2 contexts, and the "pro" cards (Quadro and GRID) to 32.

To figure out what the real limit is, uncomment the test_context_limits test in the nvenc test class.

If you have more than one card in the same system, you can also test the load balancing code.

Last edited 5 years ago by Antoine Martin (previous) (diff)

comment:15 Changed 5 years ago by Antoine Martin

I have tested the libcuda dependency exclude and merged it in r8856.
We can now install the python-pycuda RPM without having libcuda installed via RPM (i.e. when using the nvidia binary installers) - at least on distros that support the macro.

This has been pushed as an update to the stable repository. Most of the builds are now against cuda 7.0, at least the ones that have had repository updates (centos, or those I had to do by hand: rawhide...).

comment:16 Changed 5 years ago by Antoine Martin

Resolution: fixed
Status: new → closed

Tested on a number of deployments now, so closing this ticket.
