Xpra: Ticket #466: nvenc improvements: YUV444P mode and bandwidth auto tuning

split from #370:

zero out the image padding since it does get encoded!
honouring max_block_sizes, max_grid_sizes and max_threads_per_block - doesn't seem to be causing problems yet
handle YUV444P mode - needs docs (apparently not supported by the hardware??)
- then we can handle quality changes by swapping the kernel we use (NV12 / YUV444P)
handle resize without re-init
handle speed/quality changes with nvEncReconfigureEncoder (with edge resistance if it causes a new IDR frame)
allocate memory when needed rather than keeping it allocated for the duration of the encoder (fit more encoders on one card)
upload pixels in place? (skip inputBuffer)
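
As an illustration of the first item above (zeroing the padding), here is a minimal host-side sketch. It assumes a tightly packed source image and uses hypothetical names (`pad_image`, `padded_width`, `padded_height`); it is not the encoder's actual code, which this ticket does not show.

```python
def pad_image(pixels: bytes, width: int, height: int, bpp: int,
              padded_width: int, padded_height: int) -> bytearray:
    """Copy a tightly packed image into a zero-filled padded buffer.

    All names are hypothetical; bpp is the number of bytes per pixel.
    """
    src_stride = width * bpp
    dst_stride = padded_width * bpp
    # bytearray(n) is zero-initialised, so the padding never holds stale data
    out = bytearray(dst_stride * padded_height)
    for y in range(height):
        src = y * src_stride
        dst = y * dst_stride
        out[dst:dst + src_stride] = pixels[src:src + src_stride]
    return out
```

Since the padded area is part of what gets encoded, leaving stale data in it would presumably cost bits for pixels that are never shown.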

deal with GPU suspend-resume cycles (see r5110 for opencl in #422) - here is the CUDA error with nvenc:

  File "encoder.pyx", line 1588, in xpra.codecs.nvenc.encoder.Encoder.compress_image (xpra/codecs/nvenc/encoder.c:12085)
  File "encoder.pyx", line 1624, in xpra.codecs.nvenc.encoder.Encoder.do_compress_image (xpra/codecs/nvenc/encoder.c:12598)
LogicError: cuMemcpyHtoD failed: invalid/unknown error code
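
One possible way to survive this kind of failure is to rebuild the encoder and retry once. The sketch below is only an assumption about how that could look: `GPUError`, `reinit` and `compress_with_recovery` are hypothetical names, and `GPUError` merely stands in for the `LogicError` in the traceback above.

```python
class GPUError(Exception):
    """Hypothetical stand-in for the CUDA LogicError seen in the traceback."""


def compress_with_recovery(encoder, image, reinit, retries=1):
    """Call encoder.compress_image, re-creating the encoder after a GPU error.

    reinit is a hypothetical callable returning a freshly initialised encoder.
    Returns (compressed_data, encoder_in_use).
    """
    for attempt in range(retries + 1):
        try:
            return encoder.compress_image(image), encoder
        except GPUError:
            if attempt == retries:
                raise
            # assumption: the old context is unusable after resume,
            # so we start again with a new encoder
            encoder = reinit()
```

The caller has to keep the returned encoder, since the original one may have been replaced.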

Lower priority:

choose the cuda device using gpuGetMaxGflopsDeviceId: max_gflops = device_properties.multiProcessorCount * device_properties.clockRate;
handle other RGB modes in kernel (easy - allows us to run in big endian servers)
access nvenc encoder statistics info?
try using nvenc on win32 for shadow servers
when downscaling automatically (one of the dimensions is >4k), we don't need to downscale both dimensions by the same ratio: a very wide window could be downscaled horizontally only
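
The last item can be sketched as follows: each dimension gets its own divisor. The 4096 limit is an assumption taken from the ">4k" remark, and the use of integer divisors and the name `downscale_factors` are illustrative choices.

```python
from math import ceil

MAX_DIMENSION = 4096  # assumed limit, taken from the ">4k" remark


def downscale_factors(width: int, height: int, limit: int = MAX_DIMENSION):
    """Return the smallest integer divisors (x, y) that bring each
    dimension to at most `limit`, computed independently per axis."""
    x = max(1, ceil(width / limit))
    y = max(1, ceil(height / limit))
    return x, y
```

With this, a very wide 8000x600 window would be halved horizontally only, giving 4000x600, instead of also losing vertical resolution.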

Sat, 04 Jan 2014 05:33:36 GMT - Antoine Martin: owner, status, description changed

owner changed from Antoine Martin to Antoine Martin
status changed from new to assigned
description modified (diff)

Tue, 04 Feb 2014 03:15:51 GMT - Antoine Martin: description changed

description modified (diff)

Wed, 19 Feb 2014 11:47:32 GMT - Antoine Martin: attachment set

attachment set to nvenc-yuv444p.patch

YUV444 for NVENC: using 3 pass encoding (one for each of Y, U and V)

Wed, 19 Feb 2014 14:23:35 GMT - Antoine Martin: description changed

description modified (diff)

YUV444P support added in r5515. There is scope for optimization: parallelizing the kernels and the encoding, mapping each plane to a locked input buffer, or both.
the CUDA context selection is in #520 (and mostly done already)
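
For reference, the device-selection heuristic quoted in the description (multiProcessorCount * clockRate) can be sketched without any CUDA bindings. `DeviceProperties` and `max_gflops_device` are hypothetical names; only the two property names come from the description, and this is not the code from #520.

```python
from dataclasses import dataclass


@dataclass
class DeviceProperties:
    # field names mirror the CUDA device properties quoted in the description
    device_id: int
    multiProcessorCount: int
    clockRate: int  # in kHz, as CUDA reports it


def max_gflops_device(devices):
    """Return the id of the device with the highest
    multiProcessorCount * clockRate product."""
    best = max(devices, key=lambda d: d.multiProcessorCount * d.clockRate)
    return best.device_id
```

This product is only a rough proxy for compute throughput, and it says nothing about how busy the device already is.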

Updated TODO list:

benchmark it, how much slower than YUV420P is it?
use speed and window dimensions to derive target bitrate
handle resize without re-init (at least up to current padded rowstride value)
nvEncReconfigureEncoder on the fly?

Sat, 22 Feb 2014 12:05:31 GMT - Antoine Martin:

r5515 caused a big memory leak client side, fixed in r5542

Sun, 02 Mar 2014 13:02:56 GMT - Antoine Martin:

Important YUV444P fix in r5667: so this is what the undocumented colourPlaneId does!

r5664 and r5666 also allow us to tune the bitrate based on the usual "speed" setting and the encoder input size, using an exponential scale to prefer low bandwidth (see changeset for details).
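
The idea of an exponential scale can be sketched as below. This is not the formula from the changesets: the constants (`min_bpp`, `max_bpp`, `fps`), the function name, and the assumption that a higher speed means a higher bitrate are all illustrative.

```python
def target_bitrate(width: int, height: int, speed: int,
                   min_bpp: float = 0.05, max_bpp: float = 2.0,
                   fps: int = 25) -> int:
    """Bits per second for a given frame size and a 0-100 speed setting.

    All constants are made up for illustration.
    """
    speed = max(0, min(100, speed))
    # exponential interpolation: bits-per-pixel grows geometrically with
    # speed, so most of the speed range stays near the low-bandwidth end
    bpp = min_bpp * (max_bpp / min_bpp) ** (speed / 100.0)
    return int(width * height * bpp * fps)
```

With a linear scale, speed=50 would land halfway between the two extremes; here it stays much closer to the minimum, which is what "prefer low bandwidth" suggests.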

Testing with glxspheres64 and -d nvenc, auto-scaling turned off with XPRA_SCALING=0:

with quality=100 (YUV444P mode), typically:

compress_image(..) returning 129399 bytes (1.4%), complete compression for frame 645 took 39.4ms

with quality=50 (YUV420P mode), typically:

compress_image(..) returning 33506 bytes (0.4%), complete compression for frame 365 took 17.5ms

So YUV420P is much faster than the 3-pass YUV444P mode (roughly 17.5ms vs 39.4ms per frame in this test), as expected.

Note: r5668 enables YUV444P for quality>=50%, but with r5669 we don't bother with it when downscaling.
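
The rule described in this note amounts to the following. The function name and signature are hypothetical; only the 50% threshold and the downscaling exception come from the note.

```python
def choose_pixel_format(quality: int, downscaling: bool) -> str:
    """Pick the encoder input format from the quality setting.

    YUV444P is used for quality >= 50 unless we are downscaling,
    in which case the extra chroma resolution is not worth the cost.
    """
    if quality >= 50 and not downscaling:
        return "YUV444P"
    return "YUV420P"
```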

Wed, 19 Mar 2014 07:28:25 GMT - Antoine Martin: owner, status, summary changed

owner changed from Antoine Martin to Smo
status changed from assigned to new
summary changed from nvenc improvements to nvenc improvements: YUV444P mode and bandwidth auto tuning

This will have to do for this release, most of the important remaining items are too intrusive to change this late in the release cycle.

Remaining items moved to #538 and #564

smo: please test:

YUV444P mode (see above), compare it with YUV420P.
effect of bitrate tuning via speed setting

Thu, 15 May 2014 20:42:02 GMT - Smo:

I'm not able to run glxspheres64 because of the nature of the setup. Is there something else that I could try that doesn't involve GL?

I'm hoping to close this as it seems to work well for me but I want to post some information from my setup before closing.

Sat, 17 May 2014 09:09:25 GMT - Antoine Martin:

I use glxspheres and glxgears often only because they produce lots of frames without requiring any external data; playing a video will do just as well.

Note: you may be able to run GL stuff against software mesa rendering, without needing an X11 server running and with the nvidia libGL installed on the system, by using LD_PRELOAD tricks.

Tue, 10 Jun 2014 05:20:28 GMT - Smo: status changed; resolution set

status changed from new to closed
resolution set to worksforme

Needs more testing with newer NVIDIA drivers / cuda sdk but will close now until there is something to comment on.

Tue, 10 Jun 2014 08:24:26 GMT - Antoine Martin:

Did you measure the bitrate and performance as per comment:6?

FYI: r6699 allows us to specify multiple license keys in CSV format:

XPRA_NVENC_CLIENT_KEY="key1,key2" /usr/bin/xpra start ...

This makes it easier to deal with the constant nvidia license key breakage across driver updates.

Sun, 24 Aug 2014 06:39:57 GMT - Antoine Martin: status changed; resolution deleted

status changed from closed to reopened
resolution worksforme deleted

Please test with nvenc v4, see #653

Mon, 27 Oct 2014 17:51:44 GMT - Antoine Martin: owner, status changed

owner changed from Smo to Antoine Martin
status changed from reopened to new

I am taking this ticket back as YUV444 in nvenc4 is completely different from SDK v3 and is going to require quite a few changes, which should give us a nice performance improvement. Will re-assign for testing + benchmarking afterwards.

Tue, 27 Jan 2015 06:12:36 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to Smo

Moving the new YUV444 mode to a new ticket so this can get more testing, together with #653.

smo: not sure who should test this, but it's been ready for months, time to get on it.

Thu, 12 Mar 2015 17:14:23 GMT - Smo:

Here are some performance numbers from 3 cards

quality=100 (YUV444P mode)
GTX 650
compress_image(..) returning 54939 bytes (1.1%), complete compression for frame 6875 took 9.3ms
compress_image(..) returning 54939 bytes (1.1%), complete compression for frame 6876 took 8.8ms
GTX 750 ti
compress_image(..) returning 164794 bytes (3.4%), complete compression for frame 64 took 10.2ms
compress_image(..) returning 164816 bytes (3.4%), complete compression for frame 63 took 13.3ms
GTX 970
compress_image(..) returning 325881 bytes (4.5%), complete compression for frame 151 took 14.5ms
compress_image(..) returning 321035 bytes (4.5%), complete compression for frame 152 took 14.8ms

quality=50 (YUV420P mode)
GTX 650
compress_image(..) returning 15659 bytes (0.3%), complete compression for frame 310 took 8.6ms
compress_image(..) returning 15617 bytes (0.3%), complete compression for frame 311 took 8.5ms
GTX 750 ti
compress_image(..) returning 10193 bytes (0.2%), complete compression for frame 1085 took 11.7ms
compress_image(..) returning 10193 bytes (0.2%), complete compression for frame 1086 took 11.0ms
GTX 970
compress_image(..) returning 18628 bytes (0.4%), complete compression for frame 178 took 9.2ms
compress_image(..) returning 18356 bytes (0.4%), complete compression for frame 179 took 8.3ms

I have 1 more card to test and I will update this when I do some more testing.
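
To compare runs like the ones above, the log lines can be parsed and averaged. A minimal sketch, assuming the `compress_image(..)` line format shown in this ticket; `parse_compress_line` and `mean_ms` are hypothetical helper names.

```python
import re

# matches the tail of the compress_image(..) debug lines quoted in this ticket
LINE_RE = re.compile(
    r"returning (\d+) bytes \(([\d.]+)%\), "
    r"complete compression for frame (\d+) took ([\d.]+)ms"
)


def parse_compress_line(line: str):
    """Return (bytes, percent, frame, milliseconds), or None if no match."""
    m = LINE_RE.search(line)
    if not m:
        return None
    size, pct, frame, ms = m.groups()
    return int(size), float(pct), int(frame), float(ms)


def mean_ms(lines):
    """Average compression time over all matching lines, or None if none match."""
    times = [r[3] for r in map(parse_compress_line, lines) if r]
    return sum(times) / len(times) if times else None
```

Averaging over a few hundred frames rather than two would smooth out the per-frame jitter visible in these samples.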

Sat, 14 Mar 2015 10:11:17 GMT - Antoine Martin:

Interesting to see the GTX 970 going more slowly than I expected, more slowly than when I had tested it IIRC.

Some things worth mentioning:

this is not a real benchmark test (the data we feed is generated and not like a real application); we could try to code up a more realistic one
you have to make sure that the card is warm or that it is set to performance mode (without any frequency scaling)
you can put the license keys in nvenc3.keys or nvenc4.keys, either in /etc/xpra or in the per-user directory ~/.xpra. The environment variable XPRA_NVENC_CLIENT_KEY still overrides all keys defined in those files. As of r8778, you can also put your license keys in nvenc.keys, which will be used by both codecs (and by nvenc5 and later, when I get around to it). You can mix license keys for different driver versions and the code will validate them and figure out which ones can be used, but:
- newer drivers may just fall back to the 2 context limit instead!
- you will get a warning in the log output
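
The lookup order described above can be sketched like this. The file format (one key per line, `#` for comments), the order in which the files are read, and the name `load_license_keys` are assumptions; only the file names, directories, CSV environment variable and its precedence come from this ticket. Key validation against the driver is not shown.

```python
import os


def load_license_keys(version,
                      dirs=("/etc/xpra", os.path.expanduser("~/.xpra")),
                      env=os.environ):
    """Collect candidate NVENC license keys for the given codec version."""
    # the environment variable overrides every key file (CSV format)
    env_keys = env.get("XPRA_NVENC_CLIENT_KEY", "")
    if env_keys:
        return [k.strip() for k in env_keys.split(",") if k.strip()]
    keys = []
    for d in dirs:
        # version specific file first, then the shared nvenc.keys
        for name in ("nvenc%s.keys" % version, "nvenc.keys"):
            path = os.path.join(d, name)
            if not os.path.isfile(path):
                continue
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    # assumed format: one key per line, '#' starts a comment
                    if line and not line.startswith("#") and line not in keys:
                        keys.append(line)
    return keys
```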

Thu, 23 Apr 2015 20:14:07 GMT - Smo: status changed; resolution set

status changed from new to closed
resolution set to fixed

I agree there are many factors when trying to benchmark. It would be a good idea to come up with a better way.

I'm closing this for now as it is working.

Sat, 23 Jan 2021 04:56:25 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/466