#466 closed task (fixed)
nvenc improvements: YUV444P mode and bandwidth auto tuning
Reported by: | Antoine Martin | Owned by: | Smo |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | server | Version: | |
Keywords: | Cc: |
Description (last modified by )
split from #370:
- zero out the image padding since it does get encoded!
- honouring max_block_sizes, max_grid_sizes and max_threads_per_block - doesn't seem to be causing problems yet
- handle YUV444P mode - needs docs (apparently not supported by the hardware??)
  - then we can handle quality changes by swapping the kernel we use (NV12 / YUV444P)
- handle resize without re-init
- handle speed/quality changes with nvEncReconfigureEncoder (with edge resistance if it causes a new IDR frame)
- allocate memory when needed rather than keeping it allocated for the duration of the encoder (fit more encoders on one card)
- upload pixels in place? (skip inputBuffer)
- deal with GPU suspend-resume cycles (see r5110 for opencl in #422) - here is the CUDA error with nvenc:
  File "encoder.pyx", line 1588, in xpra.codecs.nvenc.encoder.Encoder.compress_image (xpra/codecs/nvenc/encoder.c:12085)
  File "encoder.pyx", line 1624, in xpra.codecs.nvenc.encoder.Encoder.do_compress_image (xpra/codecs/nvenc/encoder.c:12598)
  LogicError: cuMemcpyHtoD failed: invalid/unknown error code
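To illustrate the first item in the list above (zeroing the image padding), here is a minimal sketch. It assumes a single plane stored row by row in a bytearray, with a rowstride wider than the visible row and some extra padded rows at the bottom. The function name, its parameters and the buffer layout are hypothetical; this is not the nvenc encoder code.

```python
def zero_padding(buf, width_bytes, height, rowstride, padded_height):
    """Zero every byte outside the visible width_bytes x height area, in place.

    buf is assumed to be a bytearray of at least rowstride * padded_height bytes
    (hypothetical layout, for illustration only).
    """
    if width_bytes > rowstride or height > padded_height:
        raise ValueError("visible area is larger than the padded buffer")
    if len(buf) < rowstride * padded_height:
        raise ValueError("buffer is too small for the padded dimensions")
    pad = rowstride - width_bytes
    if pad:
        zeros = bytes(pad)
        # clear the padding at the end of each visible row
        for y in range(height):
            start = y * rowstride + width_bytes
            buf[start:start + pad] = zeros
    # clear the padded rows below the visible area
    start = height * rowstride
    end = padded_height * rowstride
    buf[start:end] = bytes(end - start)
    return buf
```

Without this step, whatever was left in the padded area from a previous frame ends up in the encoder input.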
Lower priority:
- choose the cuda device using gpuGetMaxGflopsDeviceId: max_gflops = device_properties.multiProcessorCount * device_properties.clockRate;
- handle other RGB modes in kernel (easy - allows us to run in big endian servers)
- access nvenc encoder statistics info?
- try using nvenc on win32 for shadow servers
- when downscaling automatically (one of the dimensions is >4k), we don't need to downscale both dimensions by the same ratio: a very wide window could be downscaled horizontally only
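The device selection heuristic from the first item of this list (multiProcessorCount * clockRate) could be sketched as follows. The device properties are passed in as plain dicts so that the example runs without a GPU; the sample values are made up, and how the properties are actually queried from CUDA is left out.

```python
def pick_device(devices):
    """Return the device with the highest multiProcessorCount * clockRate.

    devices is a list of dicts holding the two properties used by the
    heuristic (hypothetical representation, for illustration only).
    """
    if not devices:
        raise ValueError("no devices to choose from")
    return max(devices, key=lambda d: d["multiProcessorCount"] * d["clockRate"])


# made-up values, for illustration only
devices = [
    {"name": "card-a", "multiProcessorCount": 5, "clockRate": 1000},
    {"name": "card-b", "multiProcessorCount": 2, "clockRate": 3000},
]
best = pick_device(devices)
print(best["name"])
```

This score is only a rough estimate: it ignores how much memory is free on each card and how many encoders are already running on it.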
Attachments (1)
Change History (18)
comment:1 Changed 7 years ago by
Description: | modified (diff) |
---|---|
Owner: | changed from Antoine Martin to Antoine Martin |
Status: | new → assigned |
comment:2 Changed 7 years ago by
Description: | modified (diff) |
---|
Changed 7 years ago by
Attachment: | nvenc-yuv444p.patch added |
---|
comment:3 Changed 7 years ago by
Description: | modified (diff) |
---|
- YUV444P support added in r5515 - has scope for optimization: parallelize kernels and encoding, map each plane to a locked input buffer, or both, etc.
- the CUDA context selection is in #520 (and mostly done already)
Updated TODO list:
- benchmark it, how much slower than YUV420P is it?
- use speed and window dimensions to derive target bitrate
- handle resize without re-init (at least up to current padded rowstride value)
- nvEncReconfigureEncoder on the fly?
comment:5 Changed 7 years ago by
Important YUV444P fix in r5667: so this is what the undocumented colourPlaneId does!
r5664 and r5666 also allow us to tune the bitrate based on the usual "speed" setting and the encoder input size, using an exponential scale to prefer low bandwidth (see changeset for details).
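As a rough illustration of what "an exponential scale to prefer low bandwidth" can look like, here is a sketch that derives a target bitrate from the speed setting and the input size. The curve, the steepness and the bits-per-pixel bounds are all assumptions made for this example; the real formula is in the changesets and is not reproduced here.

```python
import math


def speed_fraction(speed, steepness=4.0):
    """Map speed (0..100) to 0..1 on an exponential curve.

    The curve stays low for most of the range, so mid-range speed
    values still favour low bandwidth (steepness is a made-up default).
    """
    s = min(100, max(0, speed)) / 100
    return (math.exp(steepness * s) - 1) / (math.exp(steepness) - 1)


def target_bitrate(width, height, speed, low=1.0, high=30.0):
    """Return a target bitrate in bits per second.

    low and high are hypothetical bounds in bits per pixel per second.
    """
    f = speed_fraction(speed)
    return int(width * height * (low + f * (high - low)))


for speed in (0, 50, 100):
    print(speed, target_bitrate(1920, 1080, speed))
```

With these made-up defaults, speed=50 lands well below the midpoint between the two bounds, which is the "prefer low bandwidth" behaviour described above.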
Testing with glxspheres64 and -d nvenc, auto-scaling turned off with XPRA_SCALING=0:
- with quality=100 (YUV444P mode), typically: compress_image(..) returning 129399 bytes (1.4%), complete compression for frame 645 took 39.4ms
- with quality=50 (YUV420P mode), typically: compress_image(..) returning 33506 bytes (0.4%), complete compression for frame 365 took 17.5ms
So YUV420P is much faster than the 3-pass YUV444P mode, as expected.
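To put numbers on that comparison, the two log lines quoted above can be parsed with a small script. The regular expression is written against the format of those two lines only and is an assumption about the debug output in general.

```python
import re

# matches the "compress_image(..) returning ..." debug lines quoted above
LOG_RE = re.compile(
    r"returning (\d+) bytes \(([\d.]+)%\), "
    r"complete compression for frame (\d+) took ([\d.]+)ms"
)


def parse_compress_line(line):
    """Extract size, ratio, frame number and time from one log line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "bytes": int(m.group(1)),
        "percent": float(m.group(2)),
        "frame": int(m.group(3)),
        "ms": float(m.group(4)),
    }


yuv444p = parse_compress_line(
    "compress_image(..) returning 129399 bytes (1.4%), "
    "complete compression for frame 645 took 39.4ms")
yuv420p = parse_compress_line(
    "compress_image(..) returning 33506 bytes (0.4%), "
    "complete compression for frame 365 took 17.5ms")

time_ratio = yuv444p["ms"] / yuv420p["ms"]
size_ratio = yuv444p["bytes"] / yuv420p["bytes"]
print("YUV444P vs YUV420P: %.2fx the time, %.1fx the size" % (time_ratio, size_ratio))
```

For these two samples, the YUV444P frame took about 2.25 times as long and was about 3.9 times as large. These are single frames taken at different quality settings, so the figures are only indicative.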
Note: r5668 enables YUV444P for quality>=50%, but with r5669 we don't bother with it when downscaling.
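The rule in this note boils down to a single condition. A sketch, with a hypothetical function name and plain strings standing in for whatever the encoder uses internally:

```python
def select_pixel_format(quality, downscaling):
    """Pick the encoder input format from quality (0..100) and scaling.

    Mirrors the rule stated in the note: YUV444P from quality 50 upwards,
    unless the frame is being downscaled anyway.
    """
    if quality >= 50 and not downscaling:
        return "YUV444P"
    return "YUV420P"


print(select_pixel_format(100, False))
print(select_pixel_format(100, True))
print(select_pixel_format(40, False))
```

Skipping YUV444P when downscaling presumably reflects that full chroma resolution adds little once the picture has already been shrunk.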
comment:6 Changed 7 years ago by
Owner: | changed from Antoine Martin to Smo |
---|---|
Status: | assigned → new |
Summary: | nvenc improvements → nvenc improvements: YUV444P mode and bandwidth auto tuning |
comment:7 Changed 7 years ago by
I'm not able to run glxspheres64 because of the nature of the setup. Is there something else that I could try that doesn't involve GL?
I'm hoping to close this as it seems to work well for me but I want to post some information from my setup before closing.
comment:8 Changed 7 years ago by
I use glxspheres and glxgears often only because they produce lots of frames without requiring any external data, but playing a video will do just as well.
Note: you may be able to run GL stuff against software mesa rendering, without needing an X11 server running and with the nvidia libGL installed on the system, by using LD_PRELOAD tricks.
comment:9 Changed 7 years ago by
Resolution: | → worksforme |
---|---|
Status: | new → closed |
Needs more testing with newer NVIDIA drivers / cuda sdk but will close now until there is something to comment on.
comment:10 Changed 7 years ago by
Did you measure the bitrate and performance as per comment:6?
FYI: r6699 allows us to specify multiple license keys in CSV format:
XPRA_NVENC_CLIENT_KEY="key1,key2" /usr/bin/xpra start ...
Which makes it easier to deal with the constant nvidia license key driver breakage.
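Splitting such a CSV value into individual keys could look like the sketch below. The function name is hypothetical, and stripping whitespace and skipping empty entries are assumptions rather than documented behaviour.

```python
import os


def get_client_keys(environ=os.environ):
    """Return the list of keys found in XPRA_NVENC_CLIENT_KEY (CSV format)."""
    raw = environ.get("XPRA_NVENC_CLIENT_KEY", "")
    # assumption: surrounding whitespace and empty entries are ignored
    return [key.strip() for key in raw.split(",") if key.strip()]


print(get_client_keys({"XPRA_NVENC_CLIENT_KEY": "key1,key2"}))
```

The encoder can then try each key in turn until one is accepted by the driver.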
comment:11 Changed 7 years ago by
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
Please test with nvenc v4, see #653
comment:12 Changed 6 years ago by
Owner: | changed from Smo to Antoine Martin |
---|---|
Status: | reopened → new |
I am taking this ticket back as YUV444 in nvenc4 is completely different from SDK v4 and is going to require quite a few changes - which should give us a nice performance improvement. Will re-assign for testing + benchmarking afterwards.
comment:13 Changed 6 years ago by
Owner: | changed from Antoine Martin to Smo |
---|
Moving the new YUV444 mode to a new ticket so this can get more testing, together with #653.
smo: not sure who should test this, but it's been ready for months, time to get on it.
comment:14 Changed 6 years ago by
Here are some performance numbers from 2 cards
quality=100 (YUV444P mode)
- GTX 750 ti
  compress_image(..) returning 164794 bytes (3.4%), complete compression for frame 64 took 10.2ms
  compress_image(..) returning 164816 bytes (3.4%), complete compression for frame 63 took 13.3ms
- GTX 650
  compress_image(..) returning 54939 bytes (1.1%), complete compression for frame 6875 took 9.3ms
  compress_image(..) returning 54939 bytes (1.1%), complete compression for frame 6876 took 8.8ms
- GTX 970
  compress_image(..) returning 325881 bytes (4.5%), complete compression for frame 151 took 14.5ms
  compress_image(..) returning 321035 bytes (4.5%), complete compression for frame 152 took 14.8ms
quality=50 (YUV420P mode)
- GTX 750 ti
  compress_image(..) returning 10193 bytes (0.2%), complete compression for frame 1085 took 11.7ms
  compress_image(..) returning 10193 bytes (0.2%), complete compression for frame 1086 took 11.0ms
- GTX 650
  compress_image(..) returning 15659 bytes (0.3%), complete compression for frame 310 took 8.6ms
  compress_image(..) returning 15617 bytes (0.3%), complete compression for frame 311 took 8.5ms
- GTX 970
  compress_image(..) returning 18628 bytes (0.4%), complete compression for frame 178 took 9.2ms
  compress_image(..) returning 18356 bytes (0.4%), complete compression for frame 179 took 8.3ms
I have 1 more card to test and I will update this when I do some more testing.
comment:15 Changed 6 years ago by
Interesting to see the GTX 970 going more slowly than I expected, more slowly than when I had tested it IIRC.
Some things worth mentioning:
- this is not a real benchmark test (the data we feed is not like a real application - it is generated), we could try to code up a more realistic one
- you have to make sure that the card is warm or that it is set to performance mode (without any frequency scaling)
- you can put the license keys in nvenc3.keys or nvenc4.keys either in /etc/xpra or in the per-user directory ~/.xpra. The environment variable XPRA_NVENC_CLIENT_KEY still overrides all keys defined in those files. As of r8778, you can also put your license keys in nvenc.keys which will be used by both codecs (and by nvenc5 and later when I get around to it). You can mix license keys for different driver versions and the code will validate them and figure out which ones can be used, but:
  - newer drivers may just fallback to the 2 context limit instead!
  - you will get a warning in the log output
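The lookup order described in the last item could be sketched like this. The file format (one key per line, with "#" comment lines), the order in which the directories and files are read, and the function name are all assumptions; only the file names, the two directories and the environment variable override come from the comment above.

```python
import os


def load_keys(version, environ=None, dirs=("/etc/xpra", "~/.xpra")):
    """Collect candidate license keys for the given nvenc version (3 or 4)."""
    environ = os.environ if environ is None else environ
    env_value = environ.get("XPRA_NVENC_CLIENT_KEY", "")
    env_keys = [k.strip() for k in env_value.split(",") if k.strip()]
    if env_keys:
        # the environment variable overrides all keys defined in the files
        return env_keys
    keys = []
    for d in dirs:
        # version-specific file first, then the shared nvenc.keys
        for name in ("nvenc%s.keys" % version, "nvenc.keys"):
            path = os.path.join(os.path.expanduser(d), name)
            if not os.path.isfile(path):
                continue
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    # assumption: one key per line, "#" starts a comment
                    if line and not line.startswith("#") and line not in keys:
                        keys.append(line)
    return keys


print(load_keys(4, environ={"XPRA_NVENC_CLIENT_KEY": "key1,key2"}))
```

Validating each key against the installed driver, as mentioned above, would happen after this step and is not shown.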
comment:16 Changed 6 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
I agree there are many factors when trying to benchmark. It would be a good idea to come up with a better way.
I'm closing this for now as it is working.
comment:17 Changed 3 months ago by
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/466
YUV444 for NVENC: using 3 pass encoding (one for each of Y, U and V)