
Opened 3 years ago

Closed 2 years ago

#466 closed task (fixed)

nvenc improvements: YUV444P mode and bandwidth auto tuning

Reported by: Antoine Martin
Owned by: Smo
Priority: minor
Milestone:
Component: server
Version:
Keywords:
Cc:

Description (last modified by Antoine Martin)

split from #370:

  • zero out the image padding since it does get encoded!
  • honouring max_block_sizes, max_grid_sizes and max_threads_per_block - doesn't seem to be causing problems yet
  • handle YUV444P mode - needs docs (apparently not supported by the hardware??)
    • then we can handle quality changes by swapping the kernel we use (NV12 / YUV444P)
  • handle resize without re-init
  • handle speed/quality changes with nvEncReconfigureEncoder (with edge resistance if it causes a new IDR frame)
  • allocate memory when needed rather than keeping it allocated for the duration of the encoder (fit more encoders on one card)
  • upload pixels in place? (skip inputBuffer)
  • deal with GPU suspend-resume cycles (see r5110 for opencl in #422) - here is the CUDA error with nvenc:
      File "encoder.pyx", line 1588, in xpra.codecs.nvenc.encoder.Encoder.compress_image (xpra/codecs/nvenc/encoder.c:12085)
      File "encoder.pyx", line 1624, in xpra.codecs.nvenc.encoder.Encoder.do_compress_image (xpra/codecs/nvenc/encoder.c:12598)
    LogicError: cuMemcpyHtoD failed: invalid/unknown error code
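
The first item in the list above (zeroing the padding) can be illustrated with a small host-side sketch. This is not the CUDA kernel used by the encoder: `pad_image`, `roundup` and the alignment values are hypothetical and only show why the padded area needs to be cleared before it is handed to the encoder.

```python
def roundup(n, m):
    """Round n up to the next multiple of m."""
    return (n + m - 1) // m * m


def pad_image(pixels, rowstride, height, align_w=64, align_h=32):
    """Copy an image into a zero-filled buffer with aligned dimensions.

    The alignment defaults are illustrative assumptions, not the values
    used by the real encoder.
    """
    padded_stride = roundup(rowstride, align_w)
    padded_height = roundup(height, align_h)
    # bytearray(n) is zero-initialised, so every padding byte starts at 0
    out = bytearray(padded_stride * padded_height)
    for y in range(height):
        src = y * rowstride
        dst = y * padded_stride
        # copy only the real pixels, the rest of the row stays zeroed
        out[dst:dst + rowstride] = pixels[src:src + rowstride]
    return out, padded_stride, padded_height
```

Without the zero fill, whatever was left in the buffer would be encoded along with the picture and would cost bandwidth.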
    


Lower priority:

  • choose the cuda device using gpuGetMaxGflopsDeviceId: max_gflops = device_properties.multiProcessorCount * device_properties.clockRate;
  • handle other RGB modes in kernel (easy - allows us to run in big endian servers)
  • access nvenc encoder statistics info?
  • try using nvenc on win32 for shadow servers
  • when downscaling automatically (one of the dimensions is >4k), we don't need to downscale both dimensions by the same ratio: a very wide window could be downscaled horizontally only
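
Two of the lower priority items can be sketched in a few lines. This is only an illustration: the device list is a plain list of dictionaries rather than real CUDA device properties, the function names are made up, and the 4096 limit is an assumption standing in for the ">4k" mentioned above.

```python
import math


def pick_device(devices):
    """Return the index of the device with the highest
    multiProcessorCount * clockRate product."""
    def max_gflops(i):
        props = devices[i]
        return props["multiProcessorCount"] * props["clockRate"]
    return max(range(len(devices)), key=max_gflops)


def downscale_factors(width, height, max_dim=4096):
    """Integer divisor for each dimension, computed independently.

    A dimension that already fits within max_dim is left alone (divisor 1).
    """
    x = max(1, math.ceil(width / max_dim))
    y = max(1, math.ceil(height / max_dim))
    return x, y
```

With this approach a very wide window, for example 10000x1080, would only be scaled horizontally and would keep its full vertical resolution.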

Attachments (1)

nvenc-yuv444p.patch (35.7 KB) - added by Antoine Martin 3 years ago.
YUV444 for NVENC: using 3 pass encoding (one for each of Y, U and V)

Download all attachments as: .zip

Change History (17)

comment:1 Changed 3 years ago by Antoine Martin

Description: modified (diff)
Owner: changed from Antoine Martin to Antoine Martin
Status: new → assigned

comment:2 Changed 3 years ago by Antoine Martin

Description: modified (diff)

Changed 3 years ago by Antoine Martin

Attachment: nvenc-yuv444p.patch added

YUV444 for NVENC: using 3 pass encoding (one for each of Y, U and V)

comment:3 Changed 3 years ago by Antoine Martin

Description: modified (diff)
  • YUV444P support added in r5515 - has scope for optimization: parallelize kernels and encoding, map each plane to a locked input buffer, or both, etc.
  • the CUDA context selection is in #520 (and mostly done already)

Updated TODO list:

  • benchmark it, how much slower than YUV420P is it?
  • use speed and window dimensions to derive target bitrate
  • handle resize without re-init (at least up to current padded rowstride value)
  • nvEncReconfigureEncoder on the fly?
Last edited 3 years ago by Antoine Martin (previous) (diff)

comment:4 Changed 3 years ago by Antoine Martin

r5515 caused a big memory leak client side, fixed in r5542

comment:5 Changed 3 years ago by Antoine Martin

Important YUV444P fix in r5667: so this is what the undocumented colourPlaneId does!


r5664 and r5666 also allow us to tune the bitrate based on the usual "speed" setting and the encoder input size, using an exponential scale to prefer low bandwidth (see changeset for details).
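
To give an idea of what an exponential scale looks like, here is a rough sketch. It is not the formula from the changesets: the function name, the bitrate bounds, the reference size and the steepness are all made-up values, and it assumes that a higher speed setting is allowed to use more bandwidth.

```python
import math


def target_bitrate(width, height, speed,
                   min_bps=1_000_000, max_bps=50_000_000,
                   reference_pixels=1920 * 1080, steepness=4.0):
    """Derive a bitrate (bits per second) from the speed setting (0..100)
    and the encoder input size. All defaults are hypothetical."""
    s = min(100, max(0, speed)) / 100.0
    # exponential ramp from 0 to 1: it stays close to 0 for most of the
    # range, so mid-range speeds still get a low bitrate
    ramp = (math.exp(steepness * s) - 1.0) / (math.exp(steepness) - 1.0)
    # scale the budget with the number of pixels to encode
    size_factor = (width * height) / reference_pixels
    return int((min_bps + (max_bps - min_bps) * ramp) * size_factor)
```

Compared with a linear interpolation, the mid-range settings end up much closer to the minimum bitrate, which is one way to "prefer low bandwidth".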


Testing with glxspheres64 and -d nvenc, auto-scaling turned off with XPRA_SCALING=0:

  • with quality=100 (YUV444P mode), typically:
    compress_image(..) returning 129399 bytes (1.4%), complete compression for frame 645 took 39.4ms
    
  • with quality=50 (YUV420P mode), typically:
    compress_image(..) returning 33506 bytes (0.4%), complete compression for frame 365 took 17.5ms
    


So YUV420P is much faster than the 3-pass YUV444P mode, as expected.

Note: r5668 enables YUV444P for quality>=50%, but with r5669 we don't bother with it when downscaling.
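
The selection rule described in this note boils down to something like the sketch below. The function name and signature are made up for illustration; only the 50% threshold and the downscaling exception come from the note.

```python
def choose_pixel_format(quality, downscaling, threshold=50):
    """Pick the encoder input format from the quality setting (0..100).

    YUV444P costs more time and bandwidth, so it is only used for high
    quality and never when the picture is being downscaled anyway.
    """
    if quality >= threshold and not downscaling:
        return "YUV444P"
    return "YUV420P"
```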

Last edited 3 years ago by Antoine Martin (previous) (diff)

comment:6 Changed 3 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo
Status: assigned → new
Summary: nvenc improvements → nvenc improvements: YUV444P mode and bandwidth auto tuning

This will have to do for this release, most of the important remaining items are too intrusive to change this late in the release cycle.

Remaining items moved to #538 and #564

smo: please test:

  • YUV444P mode (see above), compare it with YUV420P.
  • effect of bitrate tuning via speed setting
Last edited 3 years ago by Antoine Martin (previous) (diff)

comment:7 Changed 3 years ago by Smo

I'm not able to run glxspheres64 because of the nature of the setup. Is there something else that I could try that doesn't involve GL?

I'm hoping to close this as it seems to work well for me but I want to post some information from my setup before closing.

comment:8 Changed 3 years ago by Antoine Martin

I often use glxspheres and glxgears simply because they produce lots of frames without requiring any external data, but playing a video will do just as well.

Note: you may be able to run GL stuff against software mesa rendering, without needing an X11 server running and with the nvidia libGL installed on the system, by using LD_PRELOAD tricks.

Last edited 3 years ago by Antoine Martin (previous) (diff)

comment:9 Changed 3 years ago by Smo

Resolution: worksforme
Status: new → closed

Needs more testing with newer NVIDIA drivers / cuda sdk but will close now until there is something to comment on.

comment:10 Changed 3 years ago by Antoine Martin

Did you measure the bitrate and performance as per comment:6?


FYI: r6699 allows us to specify multiple license keys in CSV format:

XPRA_NVENC_CLIENT_KEY="key1,key2" /usr/bin/xpra start ...

Which makes it easier to deal with the constant nvidia license key driver breakage
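
For reference, splitting such a value is straightforward. The sketch below is not the actual xpra code: `get_client_keys` is a hypothetical helper, and ignoring whitespace and empty entries is an assumption.

```python
import os


def get_client_keys(environ=os.environ):
    """Return the list of license keys found in XPRA_NVENC_CLIENT_KEY."""
    raw = environ.get("XPRA_NVENC_CLIENT_KEY", "")
    # tolerate spaces around the commas and skip empty entries
    return [key.strip() for key in raw.split(",") if key.strip()]
```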

comment:11 Changed 3 years ago by Antoine Martin

Resolution: worksforme
Status: closed → reopened

Please test with nvenc v4, see #653

comment:12 Changed 3 years ago by Antoine Martin

Owner: changed from Smo to Antoine Martin
Status: reopened → new

I am taking this ticket back as YUV444 in nvenc4 is completely different from SDK v3 and is going to require quite a few changes - which should give us a nice performance improvement. Will re-assign for testing + benchmarking afterwards.

comment:13 Changed 2 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo

Moving the new YUV444 mode to a new ticket so this can get more testing, together with #653.

smo: not sure who should test this, but it's been ready for months, time to get on it.

comment:14 Changed 2 years ago by Smo

Here are some performance numbers from 3 cards

quality=100 (YUV444P mode)

GTX 650
compress_image(..) returning 54939 bytes (1.1%), complete compression for frame 6875 took 9.3ms
compress_image(..) returning 54939 bytes (1.1%), complete compression for frame 6876 took 8.8ms

GTX 750 ti
compress_image(..) returning 164794 bytes (3.4%), complete compression for frame 64 took 10.2ms
compress_image(..) returning 164816 bytes (3.4%), complete compression for frame 63 took 13.3ms

GTX 970 
compress_image(..) returning 325881 bytes (4.5%), complete compression for frame 151 took 14.5ms
compress_image(..) returning 321035 bytes (4.5%), complete compression for frame 152 took 14.8ms

quality=50 (YUV420P mode)

GTX 650
compress_image(..) returning 15659 bytes (0.3%), complete compression for frame 310 took 8.6ms
compress_image(..) returning 15617 bytes (0.3%), complete compression for frame 311 took 8.5ms

GTX 750 ti
compress_image(..) returning 10193 bytes (0.2%), complete compression for frame 1085 took 11.7ms
compress_image(..) returning 10193 bytes (0.2%), complete compression for frame 1086 took 11.0ms

GTX 970
compress_image(..) returning 18628 bytes (0.4%), complete compression for frame 178 took 9.2ms
compress_image(..) returning 18356 bytes (0.4%), complete compression for frame 179 took 8.3ms
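
Log lines like the ones above are easy to average with a small script, which makes it simpler to compare cards and modes over more than two frames. This is just a sketch: the regular expression is written against the lines pasted in this comment, and `summarize` is a hypothetical helper, not part of xpra.

```python
import re
from statistics import mean

# matches: "returning 54939 bytes (1.1%), complete compression for frame 6875 took 9.3ms"
LINE_RE = re.compile(
    r"returning (\d+) bytes \(([\d.]+)%\), "
    r"complete compression for frame (\d+) took ([\d.]+)ms"
)


def summarize(lines):
    """Return (average size in bytes, average time in ms) for the
    compress_image log lines found in `lines`, or None if there are none."""
    sizes, times = [], []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            sizes.append(int(m.group(1)))
            times.append(float(m.group(4)))
    if not times:
        return None
    return mean(sizes), mean(times)
```

Feeding it the output of a longer run for each card and quality setting would give more reliable averages than the two-frame samples shown here.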

I have 1 more card to test and I will update this when I do some more testing.

Last edited 2 years ago by Smo (previous) (diff)

comment:15 Changed 2 years ago by Antoine Martin

Interesting to see the GTX 970 going more slowly than I expected, more slowly than when I had tested it IIRC.

Some things worth mentioning:

  • this is not a real benchmark test (the data we feed is generated, so it is not like a real application); we could try to code up a more realistic one
  • you have to make sure that the card is warm or that it is set to performance mode (without any frequency scaling)
  • you can put the license keys in nvenc3.keys or nvenc4.keys, either in /etc/xpra or in the per-user directory ~/.xpra. The environment variable XPRA_NVENC_CLIENT_KEY still overrides all keys defined in those files. As of r8778, you can also put your license keys in nvenc.keys, which will be used by both codecs (and by nvenc5 and later when I get around to it). You can mix license keys for different driver versions: the code will validate them and figure out which ones can be used, but:
    • newer drivers may just fall back to the 2 context limit instead!
    • you will get a warning in the log output
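
The lookup order described above could look roughly like the sketch below. It is not the actual xpra implementation: `load_license_keys` is a hypothetical helper, and the file format (one key per line, `#` for comments) and the order in which the files are read are assumptions.

```python
import os


def load_license_keys(version,
                      dirs=("/etc/xpra", os.path.expanduser("~/.xpra")),
                      environ=os.environ):
    """Collect candidate license keys for the given nvenc version."""
    # the environment variable overrides anything found in the key files
    env_keys = environ.get("XPRA_NVENC_CLIENT_KEY", "")
    if env_keys:
        return [k.strip() for k in env_keys.split(",") if k.strip()]
    keys = []
    for d in dirs:
        # version specific file first, then the shared nvenc.keys
        for name in ("nvenc%i.keys" % version, "nvenc.keys"):
            path = os.path.join(d, name)
            if not os.path.isfile(path):
                continue
            with open(path) as f:
                for line in f:
                    key = line.strip()
                    # skip blank lines, comments and duplicates
                    if key and not key.startswith("#") and key not in keys:
                        keys.append(key)
    return keys
```

The caller would still need to validate each key against the installed driver, as described above.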

comment:16 Changed 2 years ago by Smo

Resolution: fixed
Status: new → closed

I agree there are many factors when trying to benchmark. It would be a good idea to come up with a better way.

I'm closing this for now as it is working.
