
Opened 5 years ago

Closed 5 years ago

#653 closed task (fixed)

NVENC SDK version 4 support

Reported by: Antoine Martin
Owned by: Smo
Priority: major
Milestone: 0.15
Component: encodings
Version: trunk
Keywords: nvenc
Cc:

Description

The updated SDK was released today; build fixes are in r7397 and should probably be backported.

Some points extracted from the NVENC App Note:

  • we need a driver / SDK / kernel matrix; there are just too many combinations (see #466)
  • H.264 Lossless Encoding

Could be very interesting for us, as we would then not need to auto-refresh anything.
(bandwidth and GPU performance permitting - no doubt this is going to be more expensive than lossy compression, so it should be used sparingly)

  • H.264 Regular YUV 4:4:4

Well, we already had this feature working in version 3 (eventually), and it is inexplicably broken in recent driver versions (see #466). So they've changed it somehow; their API documentation is as hopeless as before on this subject: it does not mention it even once.

  • Claimed Enhanced Performance and Enhanced Quality: The Maxwell NVENC hardware doubles the encoder performance as compared to Kepler NVENC.

Sounds good, remains to be measured.

  • Works on GeForce, Quadro, Tesla, and GRID boards. For low-end Quadro and GeForce boards, two sessions of NVENC are allowed.

Now, this is worrying: I hope this means that we can have only two sessions without license keys, and are not limited if we do have the license keys. This will need to be tested. If not, we can stick with CentOS 7 + kernel 3.11 + older drivers + SDK version 3, but then we miss out on some of the version 4 enhancements.

Attachments (9)

nvenc-v4.patch (6.0 KB) - added by Antoine Martin 5 years ago.
better patch using constants.pxi for missing enum
nvenc-split.patch (166.3 KB) - added by Antoine Martin 5 years ago.
work in progress patch: splitting into a new codec source module to prevent clashes and regressions
nvenc-v3-v4.diff (58.5 KB) - added by Antoine Martin 5 years ago.
diff between SDK v3 and v4
nvenc-debug-and-dumpsettings.patch (23.7 KB) - added by Antoine Martin 5 years ago.
adds debug to all API calls and logs all the settings used
nvenc-split-v2.patch (188.4 KB) - added by Antoine Martin 5 years ago.
updated split codec patch
test_nvenc_api.c (7.7 KB) - added by Antoine Martin 5 years ago.
updated api test code, closer to the cython version and more complete
nvenc-split-v4.patch (188.1 KB) - added by Antoine Martin 5 years ago.
new version of the nvenc4 codec with cuda init done using ctypes instead of pycuda
nvenc-split-v6.2.patch (197.3 KB) - added by Antoine Martin 5 years ago.
same patch, but excluding the license key
nvenc-split-v9.patch (190.4 KB) - added by Antoine Martin 5 years ago.
latest, working patch


Change History (25)

comment:1 Changed 5 years ago by Antoine Martin

Instead of including the GUID presets from the header file, we should just duplicate the values in our code, since we can check whether they are supported at runtime. That's much easier than fiddling with #ifdef to compile against different versions of the SDK.
e.g. the new lossless GUIDs:

static const GUID NV_ENC_PRESET_LOSSLESS_DEFAULT_GUID = 
    { 0xd5bfb716, 0xc604, 0x44e7, { 0x9b, 0xb8, 0xde, 0xa5, 0x51, 0xf, 0xc3, 0xac } };
static const GUID NV_ENC_PRESET_LOSSLESS_HP_GUID = 
    { 0x149998e7, 0x2364, 0x411d, { 0x82, 0xef, 0x17, 0x98, 0x88, 0x9, 0x34, 0x9 } };

We can also check for lossless support with the caps: NV_ENC_CAPS_SUPPORT_LOSSLESS_ENCODE.
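
A minimal sketch of that runtime check (untested; "encoder" is assumed to be the session handle returned by nvEncOpenEncodeSessionEx, and "functions" the NV_ENCODE_API_FUNCTION_LIST filled in by NvEncodeAPICreateInstance):

NV_ENC_CAPS_PARAM caps_param;
int supported = 0;
memset(&caps_param, 0, sizeof(caps_param));
caps_param.version = NV_ENC_CAPS_PARAM_VER;
caps_param.capsToQuery = NV_ENC_CAPS_SUPPORT_LOSSLESS_ENCODE;
if (functions.nvEncGetEncodeCaps(encoder, NV_ENC_CODEC_H264_GUID,
        &caps_param, &supported) == NV_ENC_SUCCESS && supported) {
    /* safe to select one of the lossless preset GUIDs above */
}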

Changed 5 years ago by Antoine Martin

Attachment: nvenc-v4.patch added

better patch using constants.pxi for missing enum

comment:2 Changed 5 years ago by Antoine Martin

Looks like the reason why 444 mode is broken in recent driver versions is that the support may have been added to the API proper this time: rather than doing 3 encoding rounds, one per plane, we may have to check the new capability NV_ENC_CAPS_SUPPORT_YUV444_ENCODE, and chromaFormatIDC should then be set to 3 instead... As usual, the documentation is non-existent.
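
A sketch of that suspected new path (untested; reuses the same "functions"/"encoder" handles as in the caps check above, and the chromaFormatIDC value of 3 comes from reading the headers, not from any documentation):

int yuv444 = 0;
caps_param.capsToQuery = NV_ENC_CAPS_SUPPORT_YUV444_ENCODE;
functions.nvEncGetEncodeCaps(encoder, NV_ENC_CODEC_H264_GUID, &caps_param, &yuv444);
if (yuv444) {
    /* "config" is the NV_ENC_CONFIG later passed to nvEncInitializeEncoder */
    config.encodeCodecConfig.h264Config.chromaFormatIDC = 3;
}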

Maybe we should copy the code and start versioning the nvenc codec (for this new way of doing yuv444, and for the new lossless mode)? Then we can ship two versions of the codec and just load the appropriate one at runtime, based on which driver version is installed.

comment:3 Changed 5 years ago by Antoine Martin

With r7702, we can more easily build against different versions of the nvenc SDK: just use --with-nvenc=3 or --with-nvenc=4.

Minor fixes to test code in r7703, and minor fixes / improvements in r7704 + r7705.

I've run some performance tests on a GeForce GTX 760 vs a GeForce GTX 750 Ti (same system, same bus speed, driver version 331.79 - so no yuv444) and the 760 was marginally faster with SDK v3, which is to be expected I guess (256-bit memory interface vs 128-bit, and almost double the number of cuda cores).
Will do some more tests with SDK v4 after installing newer drivers, and with yuv444 after installing older drivers (and kernel, etc.).

Last edited 5 years ago by Antoine Martin (previous) (diff)

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split.patch added

work in progress patch: splitting into a new codec source module to prevent clashes and regressions

Changed 5 years ago by Antoine Martin

Attachment: nvenc-v3-v4.diff added

diff between SDK v3 and v4

Changed 5 years ago by Antoine Martin

Attachment: nvenc-debug-and-dumpsettings.patch added

adds debug to all API calls and logs all the settings used

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v2.patch added

updated split codec patch

comment:4 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to Antoine Martin
Status: new → assigned

Need a little help checking my logic: the sample code in the SDK works OK (obviously), but xpra's nvenc module does not when running against the newer drivers and libs.

So first I diffed the two SDKs and didn't find much (see attachment/ticket/653/nvenc-v3-v4.diff): whitespace here and there, removal of all the other codecs but h264, some new GUIDs, etc.
We needed a small fix because the newer libs check the value of NV_ENC_CONFIG_VER in a sub-structure, done in r7937.

Then I got as far as nvEncInitializeEncoder, but this returned an error: "This indicates that one or more of the parameter passed to the API call is invalid".
Easy, I thought: they must have changed something, so I'll just copy their encoding settings to get past it, modifying the example code to print everything (see attachment/ticket/653/nvenc-debug-and-dumpsettings.patch). Except copying every single setting used in the example still gave the same error!
Then I thought I must have missed one, so I enhanced the debug patch to dump the whole settings structure in hex form, and modified the xpra nvenc code to load these binary settings directly into the memory structures (see attachment/ticket/653/nvenc-split-v2.patch), taking care to overwrite the pointers (there is only one). Still no go!

Then I started from the opposite direction: write the C code based on what we do in Cython (see attachment/ticket/653/test_nvenc_api.c). Compile it with:

gcc -o main xpra/codecs/nvenc4/test_nvenc_api.c \
   -L/usr/local/cuda/lib64 \
   -I/opt/nvenc_4.0_sdk/Samples/nvEncodeApp/inc/ -I/opt/cuda-6.5/include/ \
    -lcuda -L/usr/lib64 -lnvidia-encode

Run it with:

LD_LIBRARY_PATH=/opt/cuda-6.5/lib64:/opt/cuda-6.5/targets/x86_64-linux/lib/:$LD_LIBRARY_PATH \
   PATH=/opt/cuda-6.5/bin:$PATH \
   ./main

And it succeeds in opening an encode session... I am stumped. Ideas?

Last edited 5 years ago by Antoine Martin (previous) (diff)

Changed 5 years ago by Antoine Martin

Attachment: test_nvenc_api.c added

updated api test code, closer to the cython version and more complete

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v4.patch added

new version of the nvenc4 codec with cuda init done using ctypes instead of pycuda

comment:5 Changed 5 years ago by Antoine Martin

OK, I got the C code very, very close to the Cython version, then I thought that the only thing that was potentially different was the way we obtained the cuda context: using pycuda instead of using the C API directly. So the v4 patch used ctypes to call the cuda library directly (and hacked the input format check to get past that bit). And it still errored out!
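
For reference, the driver-API sequence that the ctypes shim (and normally pycuda) performs, as a minimal sketch:

#include <cuda.h>

CUdevice dev;
CUcontext ctx;
cuInit(0);                 /* must be the first driver-API call */
cuDeviceGet(&dev, 0);      /* first GPU; a real implementation would pick the device more carefully */
cuCtxCreate(&ctx, 0, dev); /* this context is later passed to nvEncOpenEncodeSessionEx */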

Then I thought it must be something in the environment that makes nvenc unhappy, and sure enough: calling the C version from the cython code causes a segfault in nvEncInitializeEncoder (run with the v6 patch).

gdb shows us:

#0  __memmove_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:2927
#1  0x00007fffef09251c in ?? () from /lib64/libnvidia-encode.so.1
#2  0x00007fffef094572 in ?? () from /lib64/libnvidia-encode.so.1
#3  0x00007fffef09d42a in ?? () from /lib64/libnvidia-encode.so.1
#4  0x00007ffff0291703 in test_nvenc_api () at xpra/codecs/nvenc4/test_nvenc_api.c:255
...

Which is called from this bit of code:

   0x00007fffef092503:	mov    0x30(%rbx),%edx
   0x00007fffef092506:	mov    0x168(%rbx),%rsi
   0x00007fffef09250d:	mov    %rax,%rdi
   0x00007fffef092510:	mov    %rax,0x1530(%rbx)
=> 0x00007fffef092517:	callq  0x7fffef08fdd0 <memcpy@plt>

If I break before the memcpy:

(gdb) p/x $rax
$1 = 0x7fffe419a010
(gdb) p/x $rbx+0x1530
$2 = 0x10f9ca0

For reference, the memcpy code copies 16 bytes at a time through 8 xmm registers:

movdqu	-0x10(%rsi), %xmm0
movdqu	-0x20(%rsi), %xmm1
etc..

And I have no idea what it is copying; it's not our config structure, for sure.
It's copying gigabytes of data until it hits addresses lower than 0x600000, where it fails. Could be a sign issue.

Looks like a bug to me. Not sure why it only triggers when we call the library from a different context. (it shouldn't make any difference)


And THAT is why proprietary software sucks. Bugs happen, all the time. No way to figure out a workaround or fix them.
(and judging by the poor quality of the documentation and example code that comes with nvenc, this is very unlikely to be the only issue)

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v6.2.patch added

same patch, but excluding the license key

comment:6 Changed 5 years ago by Antoine Martin

Ah, something interesting: using malloc instead of stack variables avoids the crash.
Though it doesn't help the cython version, which still fails with the same invalid parameter crap.
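
A sketch of the workaround (assumption: the library scribbles past a stack-allocated structure, so moving it to the heap sidesteps the corruption; "functions"/"encoder" as in the test code):

/* instead of a stack variable:
 *   NV_ENC_INITIALIZE_PARAMS params;   <- segfaults inside the library
 * allocate from the heap: */
NV_ENC_INITIALIZE_PARAMS *params = calloc(1, sizeof(*params));
params->version = NV_ENC_INITIALIZE_PARAMS_VER;
/* ... fill in the remaining fields as before, then: */
NVENCSTATUS status = functions.nvEncInitializeEncoder(encoder, params);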

comment:7 Changed 5 years ago by Antoine Martin

I've tried with the 343.22 drivers instead of 340.46, no change.

I've also tried copying the memory structure (via hex strings) to ensure that I use exactly the same settings, bar the pointer. Still the same error.
(also added an nvEncDestroyEncoder call in the C version to ensure we clean up after use, but since the same code fails the same way without even calling into the C code... that was a bit pointless anyway)

I guess the next step is to try to call the nvEncInitializeEncoder in plain C from the Cython codec, to see if that helps...

What a waste of time this is. Shoddy API.

comment:8 Changed 5 years ago by Antoine Martin

Never give up!
Well, obviously... the problem was not in the way we call nvEncInitializeEncoder, but with nvEncOpenEncodeSessionEx.
The API just happens to notice later on... and tells us in the wrong place. FFS.
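
For what it's worth, a minimal sketch of opening the session and checking the status where the mistake actually happens (untested; "functions" as before, "cuda_context" is the CUcontext created earlier, "encoder" the resulting void* session handle):

NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS params;
memset(&params, 0, sizeof(params));
params.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
params.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
params.device = (void *) cuda_context;
params.apiVersion = NVENCAPI_VERSION;
if (functions.nvEncOpenEncodeSessionEx(&params, &encoder) != NV_ENC_SUCCESS) {
    /* fail fast here: a bad session only surfaces later as an
       "invalid param" error from nvEncInitializeEncoder */
}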

I have code that sort of works. Performance tuning will be needed as this seems to be running at 60% of the speed we had before! (ouch)

Changed 5 years ago by Antoine Martin

Attachment: nvenc-split-v9.patch added

latest, working patch

comment:9 Changed 5 years ago by Antoine Martin

The patch above mostly works, as long as you use the right client keys (the older SDK would give you a helpful error if you used the wrong key; this one just errors out on other method calls). Still to do:

  • malloc tidy up
  • remove old yuv444 mode and try to implement the new one
  • figure out how to load the correct module (and prefer SDK 4 if supported?)

etc..

comment:10 Changed 5 years ago by Antoine Martin

Done in r7944 (see commit message for details).

With this change, we can now support multiple driver versions with a single build. We build both the NVENC v3 and v4 encoders, and choose whichever one happens to work with the driver + kernel + card that we find.

still TODO:

  • allow us to choose a different cuda sdk for each nvenc module?
  • testing: will add results to #595
  • YUV444 and lossless mode not done
  • figure out if there are new restrictions on the number of contexts..
  • figure out how to get more contexts, parallel encoding hits a bottleneck somewhere, hopefully not the hardware... (see ticket:595#comment:13)

etc

Last edited 5 years ago by Antoine Martin (previous) (diff)

comment:11 Changed 5 years ago by Antoine Martin

My laptop does not have an Nvidia card, yet I still want to build the cuda bits and ship them in the RPM without requiring the full SDK at deployment time, so r7958 now builds fat binaries.
It should be possible to run nvenc with only the drivers installed, saving almost 2GB of disk space!
We may want to tune the target architectures (which are currently hardcoded) if new ones offer significant performance improvements (which is doubtful: what we do with CUDA is too simple for it to matter much). This probably means CUDA 6+ is now required to build.
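
For reference, a fat binary simply embeds code for several target architectures in a single file; the nvcc invocation involved looks something like this (the kernel file name and the exact -gencode list are illustrative, not the values hardcoded in setup.py):

nvcc -fatbin \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -o BGRA_to_NV12.fatbin BGRA_to_NV12.cu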

comment:12 Changed 5 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo
Status: assigned → new

NVENC4 support has been working for a while in trunk / 0.15; let's get this tested.

The YUV444 and lossless mode support are now tracked in a separate ticket: #796.

comment:13 Changed 5 years ago by Smo

It is indeed possible to run with the RPMs from the xpra Fedora 20 beta repo; however, I had to compile pycuda myself. When installing from RPM, I hit this issue since I installed the drivers manually from nvidia.com:

Resolving Dependencies
--> Running transaction check
---> Package python-pycuda.x86_64 0:2014.1-2 will be installed
--> Processing Dependency: python-pytools for package: python-pycuda-2014.1-2.x86_64
--> Processing Dependency: libcuda.so.1()(64bit) for package: python-pycuda-2014.1-2.x86_64
--> Running transaction check
---> Package python-pycuda.x86_64 0:2014.1-2 will be installed
--> Processing Dependency: libcuda.so.1()(64bit) for package: python-pycuda-2014.1-2.x86_64
---> Package python-pytools.noarch 0:2014.3-1.fc20 will be installed
--> Finished Dependency Resolution
Error: Package: python-pycuda-2014.1-2.x86_64 (winswitch)
           Requires: libcuda.so.1()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Not sure how we can make it skip this check.

Tested this on the same machine with 2 cards:

a GeForce GTX 650 and a GeForce GTX 750 Ti,
with the driver installer NVIDIA-Linux-x86_64-346.47.run, which is the long-lived branch from http://www.nvidia.ca/object/unix.html.
It would appear that without a license key, nvenc4 greatly reduces the number of contexts that can be used (2, I think).

I will do some more testing with a more recent card to make sure and update this ticket.

comment:14 Changed 5 years ago by Antoine Martin

When installing python-pycuda, you have two options:

  • always install the drivers from RPM... (not ideal: not flexible)
  • use rpm -i --nodeps (also not ideal: broken dependencies end up installed)

It is worth mentioning negativo17 again: I like the way they package things (nvidia-driver-cuda-libs can be installed without any dependencies, for example), and they support both long-lived and short-lived driver branches for Fedora.

Maybe we can remove the dependency on libcuda.so.1 using something like:

%define __requires_exclude ^lib.*$

in the pycuda specfile. Does that work for you?
(see Tweaking Dependency Generators and auto provides and requires filtering for details)
This would be a much better solution than using the big hammer AutoReqProv: no (though I guess we could use that as a last resort).
According to Can some specific autodetected dependency be ignored upon rpmbuild, this is only supported on Fedora and CentOS 7...
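
A quick way to verify the result on a rebuilt package (the package file name here is illustrative):

rpm -qp --requires python-pycuda-2014.1-2.x86_64.rpm | grep libcuda

If nothing prints, the automatic libcuda.so.1 requirement was filtered out.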

The only other alternative I can think of is to roll your own driver RPMs. You could use negativo17 as a starting point.


It would appear that without a license key, nvenc4 greatly reduces the number of contexts that can be used (2, I think).


Yes, I believe that the consumer cards are limited to just 2 contexts, and the "pro" cards (Quadro and GRID) to 32.

To figure out what the real limit is, uncomment the test_context_limits test in the nvenc test class.

If you have more than one card in the same system, you can also test the load balancing code.

Last edited 5 years ago by Antoine Martin (previous) (diff)

comment:15 Changed 5 years ago by Antoine Martin

I have tested the libcuda dependency exclude and merged it in r8856.
We can now install the python-pycuda RPM without having libcuda installed via RPM (i.e. when using the nvidia binary installers) - at least on distros that support the macro.

This has been pushed as an update to the stable repository. Most of the builds are now against cuda 7.0, at least the ones that have had repository updates (centos, or those I had to do by hand: rawhide...).

comment:16 Changed 5 years ago by Antoine Martin

Resolution: fixed
Status: new → closed

Tested on a number of deployments now, so closing this ticket.
