Unless I run xpra with only h264 enabled,
/usr/bin/python2 /usr/bin/xpra start -d encoder --start=xterm --no-daemon --video-encoders=nvenc --encodings=h264 :57
I can observe from the debug data that very few frames are actually rendered using NVENC, even when I have h264 selected from the UI. When I run it using the above command, it works wonderfully. I'm using r20858 on CentOS 7 with VirtualGL on the server and an r20882 client on Windows. The driver is 390.87. I'm using a C2075 for render and a K20c for encode.
I do not think the problem has much to do with my hardware or revision, because I'm having this trouble on all of my configurations, including a brand-new computer (the other one is pretty old) with a GTX 1060 running the release 2.3.4 server with the release 2.4 client.
Nathan
Xpra will only use video encoders when it thinks it is necessary; for more details on the original feature see #410 - there have been many updates and fixes since.
Roughly speaking, the video encoder will not be triggered unless something on screen refreshes rapidly enough.
By only allowing h264, you are bypassing all the automatic encoding logic.
To test, you can run glxgears or anything else that generates lots of frames and verify that NVENC does get used.
If you feel that your screen updates would be better compressed with a video encoder and that the automatic logic is too slow to switch to the video encoder, you can debug the video region detection with -d regiondetect, and the encoding selection logic with -d encoding.
Bear in mind that GPU encoders take longer to set up, so it is best to only use them when we're certain that the content is video-ish and is going to keep producing screen updates regularly.
Maybe we could tweak the logic to switch to video sooner when we know we're dealing with an efficient encoder (ie: NVENC).
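To make that concrete, here is an illustrative sketch (not xpra's actual code; the 5fps threshold and 1 second window are invented numbers) of the kind of rate check involved:

```python
def looks_like_video(update_times, now, min_fps=5.0, window_s=1.0):
    """Treat a screen region as "video-ish" if it received at least
    min_fps updates per second over the last window_s seconds.
    This is a toy illustration of the kind of heuristic involved,
    not xpra's real detection logic."""
    recent = [t for t in update_times if now - t <= window_s]
    return len(recent) >= min_fps * window_s

# A burst of 10 updates in half a second qualifies:
burst = [10.0 + i * 0.05 for i in range(10)]
print(looks_like_video(burst, now=10.5))        # True
# Two updates in the last second do not:
print(looks_like_video([9.8, 10.4], now=10.5))  # False
```

Lowering such a threshold for cheap-to-start encoders is the kind of tweak meant above.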
Here's a "-d compress" log. For context, this log was collected during a *sustained* rotation operation on an OpenGL rendering of a finite element model of vehicle impact (it is a standard benchmark for us). The log looks like the excerpt below the whole way through. When XPRA jumps between RGB and YUV, the screen briefly locks hard.
I have my system tray UI set to speed=auto, quality=auto and encoding = h264.
There are two cases I want to include:
(1) Without "XPRA_FORCE_CSC_MODE=BGRX" it is slow and sometimes throws "fail to encode" errors for NVENC, especially when I set speed to 100 (I had speed set to auto). In this case it didn't throw those errors, but the whole picture *did* turn yellow a few times during the collection of this log.
(2) With XPRA_FORCE_CSC_MODE=BGRX it is also very slow, but not as slow as before (though nothing like the speed and perfection of having only NVENC enabled). There were noticeable jerks when it switched between encodings, but never any NVENC encoding errors nor random color changes like when BGRX isn't set.
Case (1): plain, no XPRA_FORCE_CSC_MODE.
2018-11-01 13:12:06,968 compress: 1.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 134, client_options={'zlib': 5, 'delta': 181, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 186}
2018-11-01 13:12:07,553 compress: 582.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.0% ( 6128KB to 61KB), sequence 135, client_options={u'speed': 41, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 20}
2018-11-01 13:12:07,695 compress: 2.0ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 136, client_options={'zlib': 5, 'delta': 186, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 189}
2018-11-01 13:12:08,232 compress: 533.5ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.0% ( 6128KB to 60KB), sequence 137, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,318 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 138, client_options={'zlib': 5, 'delta': 189, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 192}
2018-11-01 13:12:08,365 compress: 46.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.8% ( 6128KB to 48KB), sequence 139, client_options={u'speed': 46, u'frame': 1, u'pts': 626, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,506 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 140, client_options={'zlib': 5, 'delta': 192, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 195}
2018-11-01 13:12:08,548 compress: 40.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.6% ( 6128KB to 34KB), sequence 141, client_options={u'speed': 46, u'frame': 2, u'pts': 814, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,696 compress: 1.4ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 142, client_options={'zlib': 5, 'delta': 195, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 198}
2018-11-01 13:12:09,204 compress: 505.4ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.0% ( 6128KB to 58KB), sequence 143, client_options={u'speed': 44, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 44}
2018-11-01 13:12:09,300 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 144, client_options={'zlib': 5, 'delta': 198, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 201}
2018-11-01 13:12:09,914 compress: 612.8ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 2.2% ( 6128KB to 137KB), sequence 145, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:12:09,924 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 146, client_options={'zlib': 5, 'delta': 201, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 204}
2018-11-01 13:12:10,584 compress: 658.1ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.2% ( 6128KB to 73KB), sequence 147, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 24}
2018-11-01 13:12:10,648 compress: 0.4ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 148, client_options={'zlib': 5, 'delta': 204, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 207}
2018-11-01 13:12:11,210 compress: 561.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.2% ( 6128KB to 70KB), sequence 149, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:11,231 compress: 0.1ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 150, client_options={'zlib': 6, 'delta': 207, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 210}
2018-11-01 13:12:11,802 compress: 569.8ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.4% ( 6128KB to 23KB), sequence 151, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:11,822 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 152, client_options={'zlib': 6, 'delta': 210, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 213}
2018-11-01 13:12:11,846 compress: 23.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.2% ( 6128KB to 13KB), sequence 153, client_options={u'speed': 30, u'frame': 1, u'pts': 591, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:12,834 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 154, client_options={'zlib': 7, 'delta': 213, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 216}
2018-11-01 13:12:13,367 compress: 532.6ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.3% ( 6128KB to 21KB), sequence 155, client_options={u'speed': 28, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 4}
2018-11-01 13:12:13,368 compress: 0.1ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 156, client_options={'zlib': 7, 'delta': 216, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 219}
2018-11-01 13:12:13,423 compress: 52.1ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.1% ( 6128KB to 8KB), sequence 157, client_options={u'speed': 28, u'frame': 1, u'pts': 453, 'csc': 'YUV444P', u'quality': 4}
2018-11-01 13:12:13,744 compress: 0.8ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 158, client_options={'zlib': 6, 'delta': 219, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 222}
2018-11-01 13:12:14,316 compress: 569.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.4% ( 6128KB to 22KB), sequence 159, client_options={u'speed': 32, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 8}
Case (2): with XPRA_FORCE_CSC_MODE=BGRX.
2018-11-01 13:09:17,544 compress: 0.5ms for 36x36 pixels at 1596,411 for wid=2 using rgb24 with ratio 37.5% ( 5KB to 1KB), sequence 323, client_options={'zlib': 5, 'delta': 401, 'bucket': 1, 'rgb_format': 'BGRX', 'flush': 1, 'store': 469}
2018-11-01 13:09:17,547 compress: 1.1ms for 1512x16 pixels at 2,941 for wid=2 using jpeg with ratio 1.7% ( 94KB to 1KB), sequence 324, client_options={'quality': 49}
2018-11-01 13:09:18,099 compress: 89.3ms for 1636x959 pixels at 0,0 for wid=2 using webp with ratio 2.6% ( 6128KB to 161KB), sequence 325, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
2018-11-01 13:09:18,234 compress: 0.6ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 326, client_options={'zlib': 4, 'delta': 460, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 475}
2018-11-01 13:09:18,814 compress: 577.6ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 4.7% ( 6128KB to 290KB), sequence 327, client_options={u'speed': 39, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:18,835 compress: 0.5ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 328, client_options={'zlib': 6, 'delta': 475, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 478}
2018-11-01 13:09:18,892 compress: 56.2ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 2.2% ( 6128KB to 133KB), sequence 329, client_options={u'speed': 39, u'frame': 1, u'pts': 606, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:19,385 compress: 0.7ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.9% ( 6KB to 0KB), sequence 330, client_options={'flush': 1, 'zlib': 6, 'bucket': 0, 'store': 481, 'rgb_format': 'BGRX'}
2018-11-01 13:09:19,442 compress: 55.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 3.2% ( 6128KB to 193KB), sequence 331, client_options={u'speed': 39, u'frame': 2, u'pts': 1156, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:19,841 compress: 1.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 332, client_options={'zlib': 5, 'delta': 481, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 484}
2018-11-01 13:09:20,438 compress: 592.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.4% ( 6128KB to 26KB), sequence 333, client_options={u'speed': 41, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 7}
2018-11-01 13:09:20,569 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 334, client_options={'zlib': 5, 'delta': 484, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 487}
2018-11-01 13:09:20,620 compress: 50.2ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.3% ( 6128KB to 16KB), sequence 335, client_options={u'speed': 41, u'frame': 1, u'pts': 730, 'csc': 'YUV444P', u'quality': 7}
2018-11-01 13:09:20,832 compress: 0.8ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 336, client_options={'zlib': 5, 'delta': 487, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 490}
2018-11-01 13:09:21,400 compress: 562.8ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.7% ( 6128KB to 41KB), sequence 337, client_options={u'speed': 47, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 23}
2018-11-01 13:09:21,504 compress: 0.7ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 338, client_options={'zlib': 5, 'delta': 490, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 493}
2018-11-01 13:09:21,514 compress: 1.2ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 8 rectangles ( 6128KB) , sequence 340, client_options={'flush': 2}
2018-11-01 13:09:21,543 compress: 28.7ms for 1636x561 pixels at 0,144 for wid=2 using jpeg with ratio 0.9% ( 3585KB to 31KB), sequence 341, client_options={'quality': 23, 'optimize': True, 'flush': 1}
2018-11-01 13:09:21,547 compress: 3.7ms for 1636x61 pixels at 0,763 for wid=2 using jpeg with ratio 0.6% ( 389KB to 2KB), sequence 342, client_options={'quality': 23, 'optimize': True}
2018-11-01 13:09:21,699 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 342, client_options={'zlib': 5, 'delta': 493, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 496}
2018-11-01 13:09:21,707 compress: 0.8ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 1 rectangles ( 6128KB) , sequence 344, client_options={'flush': 1}
2018-11-01 13:09:21,740 compress: 32.2ms for 1636x606 pixels at 0,260 for wid=2 using jpeg with ratio 1.1% ( 3872KB to 42KB), sequence 345, client_options={'quality': 23, 'optimize': True}
2018-11-01 13:09:21,929 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 345, client_options={'zlib': 5, 'delta': 496, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 499}
2018-11-01 13:09:22,522 compress: 591.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.7% ( 6128KB to 101KB), sequence 346, client_options={u'speed': 49, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,523 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 347, client_options={'zlib': 5, 'delta': 499, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 502}
2018-11-01 13:09:22,571 compress: 47.3ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.9% ( 6128KB to 56KB), sequence 348, client_options={u'speed': 49, u'frame': 1, u'pts': 531, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,634 compress: 1.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 349, client_options={'zlib': 5, 'delta': 502, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 505}
2018-11-01 13:09:22,688 compress: 51.2ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.7% ( 6128KB to 41KB), sequence 350, client_options={u'speed': 49, u'frame': 2, u'pts': 703, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,827 compress: 0.7ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 351, client_options={'zlib': 5, 'delta': 505, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 508}
2018-11-01 13:09:23,371 compress: 542.7ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.8% ( 6128KB to 49KB), sequence 352, client_options={u'speed': 51, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:09:23,427 compress: 55.6ms for 1636x959 pixels at 0,0 for wid=2 using webp with ratio 2.0% ( 6128KB to 122KB), sequence 353, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
2018-11-01 13:09:23,495 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 354, client_options={'zlib': 5, 'delta': 508, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 513}
2018-11-01 13:09:23,542 compress: 45.6ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.5% ( 6128KB to 28KB), sequence 355, client_options={u'speed': 51, u'frame': 1, u'pts': 669, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:09:23,654 compress: 0.9ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 356, client_options={'zlib': 5, 'delta': 513, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 516}
2018-11-01 13:09:23,663 compress: 0.8ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 1 rectangles ( 6128KB) , sequence 358, client_options={'flush': 1}
2018-11-01 13:09:23,683 compress: 20.4ms for 1636x601 pixels at 0,266 for wid=2 using jpeg with ratio 0.9% ( 3840KB to 35KB), sequence 359, client_options={'quality': 24}
2018-11-01 13:09:23,986 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 359, client_options={'zlib': 5, 'delta': 516, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 519}
2018-11-01 13:09:23,999 compress: 0.6ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 3 rectangles ( 6128KB) , sequence 361, client_options={'flush': 4}
2018-11-01 13:09:24,016 compress: 16.3ms for 1636x516 pixels at 0,312 for wid=2 using jpeg with ratio 1.1% ( 3297KB to 35KB), sequence 362, client_options={'quality': 26, 'flush': 3}
2018-11-01 13:09:24,018 compress: 1.0ms for 1636x4 pixels at 0,836 for wid=2 using png with ratio 0.8% ( 25KB to 0KB), sequence 363, client_options={'flush': 2}
2018-11-01 13:09:24,020 compress: 1.8ms for 1636x5 pixels at 0,915 for wid=2 using png with ratio 4.6% ( 31KB to 1KB), sequence 364, client_options={'flush': 1}
2018-11-01 13:09:24,023 compress: 1.5ms for 1636x12 pixels at 0,926 for wid=2 using jpeg with ratio 3.4% ( 76KB to 2KB), sequence 365, client_options={'quality': 26}
2018-11-01 13:09:24,619 compress: 73.4ms for 1636x959 pixels at 0,0 for wid=2 using webp with ratio 1.9% ( 6128KB to 116KB), sequence 365, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
First things first: something is not right with your setup if rgb ends up using zlib compression. You should have lz4 installed, and if not, you should see a big warning in your server log on startup.
You can confirm the status by running xpra/net/net_util.py on Linux and Network_info.exe on win32.
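If you want to sanity-check the python bindings by hand, a minimal sketch (assuming the lz4.block API of the python-lz4 package; older releases exposed a different top-level API):

```python
def have_lz4():
    """Return True if working python lz4 bindings are available.
    Without them, xpra falls back to zlib for rgb packets."""
    try:
        from lz4 import block
    except ImportError:
        return False
    data = b"xpra" * 256
    try:
        # round-trip a buffer to make sure the module actually works
        return block.decompress(block.compress(data)) == data
    except Exception:
        return False

print("lz4 available:", have_lz4())
```

This only checks the python side; net_util.py above is still the authoritative report of what xpra will actually use.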
As for the logs, I meant "-d compress" as well as, or combined with, the -d regiondetect and -d encoding from comment:1.
From the first log sample, most of the time it seems to be taking 500ms to compress at a resolution of 1636x958. That's dreadfully slow. You may want to raise min-speed and leave the encoding set to "auto" so it can at least switch to vp8 / vp9 when not using NVENC.
Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.
it is slow and throws "fail to encode errors" for NVENC sometimes
Please post the exact error message, including the full stacktrace.
You may need to add -d video,encoder to see it.
That could also cause encoder selection problems; we should be able to fix those, or at least avoid the conditions that trigger it.
with XPRA_FORCE_CSC_MODE=BGRX
Why are you changing this switch exactly?
Setting a high enough (min-)quality should give the same result.
There are two ways that NVENC can handle BGRX input: natively for the card and driver combinations that support it (this can be turned off with XPRA_NVENC_NATIVE_RGB=0), or via a CUDA conversion step otherwise.
It's not clear what your card supports.
Please post the output of the server startup with -d nvenc.
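For reference, the CUDA conversion step mentioned above is just a colorspace transform. A minimal NumPy sketch of BGRX to YUV444 (assuming full-range BT.601 coefficients; the real CUDA kernel does the same math on the GPU, much faster):

```python
import numpy as np

def bgrx_to_yuv444(bgrx):
    """Convert an (H, W, 4) uint8 BGRX image to three full-range
    BT.601 YUV444 planes. Illustrative only."""
    b = bgrx[..., 0].astype(np.float32)
    g = bgrx[..., 1].astype(np.float32)
    r = bgrx[..., 2].astype(np.float32)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.14713 * r - 0.28886 * g + 0.436 * b + 128.0
    v = 0.615 * r - 0.51499 * g - 0.10001 * b + 128.0
    return [np.clip(np.rint(p), 0, 255).astype(np.uint8) for p in (y, u, v)]

# A uniform gray image maps to Y=gray, U=V=128 (no chroma):
gray = np.full((4, 4, 4), 100, dtype=np.uint8)
y, u, v = bgrx_to_yuv444(gray)
```

The point is that this step costs real time per frame, which is why skipping it via native BGRX input (when the card supports it) matters.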
Either way, NVENC should win the scoring compared to doing the CSC step via another module (libyuv or whatever).
There are some scroll encoding packets in the middle, those can cause problems when encoding video as we then skip frames and encode them differently. You can play with raising:
XPRA_SCROLL_MIN_PERCENT=80
Or even disable scroll encoding altogether:
XPRA_SCROLL_ENCODING=0
As for the IRC discussion regarding setting hints, we can worry about that once the encoding selection bug is fixed - it will help (see #1952 and #2023).
After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong. For some reason, on my computer, MINGW_BUILD.sh doesn't include lz4 correctly for version 2.1.0 or 2.1.1. When I backed down to version 1.1.0 my Windows build correctly includes lz4, and I see this:
2018-11-02 13:01:50,174 compress: 0.3ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 1 rectangles ( 6128KB) , sequence 111, client_options={'flush': 2}
2018-11-02 13:01:50,315 compress: 141.1ms for 1636x593 pixels at 0,233 for wid=2 using rgb24 with ratio 1.7% ( 3789KB to 66KB), sequence 112, client_options={'flush': 1, 'lz4': 1, 'rgb_format': 'BGRX'}
2018-11-02 13:01:50,320 compress: 4.2ms for 1636x20 pixels at 0,915 for wid=2 using rgb24 with ratio 3.4% ( 127KB to 4KB), sequence 113, client_options={'lz4': 1, 'rgb_format': 'BGRX'}
2018-11-02 13:01:50,740 compress: 8.3ms for 1636x959 pixels at 0,0 for wid=2 using rgb24 with ratio 3.3% ( 6128KB to 199KB), sequence 113, client_options={'lz4': 1, 'rgb_format': 'BGRX'}
Version 2.1.0 works only if I run the python script from within the mingw shell. However, lz4 doesn't seem to be getting put into the build directory correctly when I use version 2.1.0. Consequently, it doesn't make it into the installer. I'm not sure why it's wrong, and I'll look into it a little more before giving up and moving on to your other suggestions. (This problem is less important to me than the encoder one, especially since I've got a workaround, namely downgrading to 1.1.0.)
Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.
By running python with "-S" in the "dist/lib" directory I've determined that lz4 2.1.1 works *if* I copy the "egg" data into "dist/lib".
Python is far outside my core expertise, so I can only speculate about the right fix. Maybe the install_exe function of setup.py needs to include the "egg" directories...
Command line:
xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc :57
For this capture I had the encoder set to h264 (everything else was set to auto). I rendered a 3D model and rotated it. I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some). This particular run was *very* jerky on the gigabit LAN. I have no idea why.
If you are interested I can send you all the software I am running.
raw-attachment/ticket/2022/1.txt
Same as before except everything is set to "auto" including encoder. Very slow.
In this case I disabled all encodings except rgb and h264. When I begin to rotate it's jerky, but once it gets going it's great!
Command line:
xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc --encodings=h264,h265,rgb :57
Output:
Log for second try.
third example
re-adding attachment with txt extension so it can be viewed online
After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong. For some reason, on my computer,
MINGW_BUILD.sh
doesn't include lz4 correctly for version 2.1.0 or 2.1.1.
So you're using custom builds rather than our official releases. Please make sure to mention that every time, as per wiki/ReportingBugs. And test with official builds too, to make sure that the problem doesn't come from your builds. As for lz4, this is a well-documented issue: ticket:1883#comment:4, with its own ticket: #1929.
Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.
It wasn't meant to. It was essential to find out why your output was not the one that was expected. And it will help with performance (outdated benchmark here: https://xpra.org/stats/lz4/)
I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some)
"scroll" encoding is not used for windows we correctly tag as "video", which is why you should look into tagging your video windows.
xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc --encodings=h264,h265,rgb :57
As I mentioned before, that's a very bad idea:
Considering those important configuration issues, the logs you attached are not very telling. Please see comment:3 :
Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.
Please also include both:
-d score,video
-d score,encoding,video,scaling,csc,stats
(just in case details are missing from the first one)
What I do see in there:
send_delayed_regions: using full window update 1636x959 as h264, from <bound method WindowVideoSource.get_best_encoding_video of WindowVideoSource(2 : (1636, 959))>
do_video_encode(h264, XShmImageWrapper(BGRX: 0, 0, 1636, 959), {})
make_instance() nvenc - instance count=0
So it is choosing the correct encoding (h264) and the correct encoder (nvenc), but for some reason the context gets recycled every time. And since nvenc is slow to set up, that's probably where the delay comes from.
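To put rough numbers on that (a toy model only, using the ~500ms setup and ~50ms per-frame encode times visible in the logs):

```python
def total_encode_ms(frames, context_reused, setup_ms=500.0, encode_ms=50.0):
    """Toy cost model: if the nvenc context is reused, the ~500ms setup
    cost is paid once; if it gets recycled, it is paid on every frame."""
    if context_reused:
        return setup_ms + frames * encode_ms
    return frames * (setup_ms + encode_ms)

print(total_encode_ms(10, context_reused=True))   # 1000.0 -> ~10fps after warmup
print(total_encode_ms(10, context_reused=False))  # 5500.0 -> under 2fps
```

Which matches what the log shows: ~550ms per frame whenever "frame: 0" (a fresh context) appears, versus ~50ms for subsequent frames.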
send_delayed for wid 2, batch delay is 233ms, elapsed time is 233ms
So something already happened before that which reduced the framerate to less than 5fps.
Could well be the wacky settings, or the re-cycling issue; -d stats should be enough to see why that is.
I *have* been testing with both builds consistently on the client side, and I've been using the builds from the winswitch beta on CentOS 7. There are no other surprises in my configs. I have not, for instance, set LD_LIBRARY_PATH to anything interesting. I had not seen wiki/ReportingBugs before (I should have thought to look for such a document). My apologies!
Regarding tagging, can you point me to some information on how that works? (I'm ignorant about how a program passes this kind of information through.) I will pass it on to our UI guys, and I expect that if it's straightforward I can get them to put it in our UI trunk right away. (They are using wxWidgets with gtk2 on X. Our software's window has a rectangular video region surrounded by buttons.)
Regarding the additional debug captures, I will try to do that in a few hours.
Again, thank you.
Regarding tagging, can you point me to some information on how that works
It's all in this ticket which is assigned to you since I knew you would be interested: ticket:2023#comment:1. There's a trivial example included: browser/xpra/trunk/src/tests/xpra/test_apps/test_window_content_type.py.
Basically, set the _XPRA_CONTENT_TYPE X11 window property to what you want (ie: "video" or "text").
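For example, with python-xlib, a helper along these lines should work (a sketch only: the window id and the use of python-xlib are assumptions, not necessarily how a wxWidgets app would do it; gdk/gtk property calls or a plain xprop -set achieve the same thing):

```python
def tag_content_type(dpy, window, content_type="video"):
    """Set the _XPRA_CONTENT_TYPE property on an X11 window so that
    xpra treats it as "video" (or "text") content.
    `dpy` is an Xlib Display and `window` an Xlib Window object."""
    prop = dpy.intern_atom("_XPRA_CONTENT_TYPE")
    string_type = dpy.intern_atom("STRING")
    # format=8 means the property data is an array of 8-bit items:
    window.change_property(prop, string_type, 8, content_type.encode("latin-1"))
    dpy.flush()

# Typical usage (assumes python-xlib and a running X server):
#   from Xlib import display
#   dpy = display.Display()
#   win = dpy.create_resource_object("window", 0x2c00004)  # id from xwininfo
#   tag_content_type(dpy, win, "video")
```

You can check that the property took effect with xprop on the window.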
I'm doing the first requested test: debug on score and video. My command line:
xpra start -d score,video --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --video-encoders=nvenc :57 2>&1 |tee 4.txt
The "vglrun" command is so that I can spare you all the startup logging. Once I connected to the server I went straight to rotating.
Same exercise but with -d score,video,compress
Same exercise but with -d score,encoding,video,scaling,csc,stats
Same exercise but with -d score,encoding,video,scaling,csc,stats,compress
Same exercise but with -d score,encoding,video,scaling,csc,stats,compress using the r20905 client from the web site.
After uploading the 7th one, only as a control, I tried "--encodings=h264,h265" (this is not what I did for 8.txt). I found that it locks up just a tiny little bit at the beginning, unless I let it settle before setting in on rotating. For 8.txt (which had all encoders enabled and was with the stock client) I was extra careful to let it "settle". As I expected, with settling it still has the issue for which I opened this ticket, but I figure it's best to isolate one problem at a time. I therefore regard 8.txt as the nicest debug log.
I've forwarded this to our chief UI developer. I'll let you know when it's done.
Replying to Antoine Martin:
Regarding tagging, can you point me to some information on how that works
It's all in this ticket which is assigned to you since I knew you would be interested: ticket:2023#comment:1. There's a trivial example included: browser/xpra/trunk/src/tests/xpra/test_apps/test_window_content_type.py. Basically, set the _XPRA_CONTENT_TYPE X11 window property to what you want (ie: "video" or "text").
As per previous comments, do not disable the other video encoders.
What I found in the logs despite that:
DPI set to 45 x 25 (wanted 96 x 96)
You should be using the patched dummy driver from the repo.
video size 484x92 out of range for nvenc, min 128x128 (..) Error: no video pipeline options found for BGRX at 485x93
CUDA 9.2.0 / PyCUDA 2018.1, found 1 device: + GeForce GTX 1060 6GB @ 0000:AF:00.0 (memory: 98% free, compute: 6.1) (..) setup_pipeline failed for (84, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 520, 188, video_spec(nvenc)): cannot open encoding session, context is NULL
And that causes failures when we reach the limit. 2 contexts should be enough if you only have one "video" window though. When we do hit this limit, things go downhill fast:
setup_pipeline(..) failed! took 2197.59ms
That's the encoding pipeline blocked for over 2 seconds! ouch.
update_batch_delay: congestion : 1.87,8.74 {}
You may want to turn off bandwidth detection to see if that helps, with --bandwidth-detection=no. (#1912)
Or post the -d bandwidth log output so we can fix this.
But then it also goes up because of this:
update_batch_delay: compression-work-queue : 2.64,5.12 {u'target': 1000, u'weight_multiplier': 3648, u'smoothing': u'logp', u'aim': 250, u'aimed_avg': 5247, u'div': 1000, u'avg': 12057, u'recent': 13314}
That's nvenc hogging the compression work queue because it is slow to set up, and even worse when it hits the 2 context limit and fails (2 seconds as per above)... not much we can do about this.
check_pipeline('auto', 451, 134, 'BGRX') setting up a new pipeline as check failed - encodings=['h264', 'h265']
Why it failed is not logged, so I can't be sure, but here are some possibilities:
r20927 also adds an env var you can use to turn on video context cleanup debugging:
XPRA_DEBUG_VIDEO_CLEAN=1 xpra start ...
This should tell us why the video context is being destroyed so quickly.
There are new centos 7.5 builds here: https://xpra.org/beta.
I did not understand what you meant by "do not disable the other video encoders". For some reason I thought you meant "don't use --encodings="; I just now realize you don't want "--video-encoders=" either. I apologize for that. I was not being intentionally obtuse. I am very grateful for your help and do not mean to waste your time!
I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it. Is there an easy way to buy or request licenses from NVIDIA (for testing)?
All of my users have GTX cards and, except for a solitary Tesla card, that's all I have. I have two Kepler-era K20c Tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA. I infer older cards have some limitations in this regard. That card, by the way, does work great for the render window if I add the "CSC_MODE=BGRX" environment variable and an --encodings option. I think that card can also be made to work. If getting a Quadro (or some other card) will expedite this, I'll buy whatever you recommend. Any recommendations?
Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine), so hopefully we can eventually use GTX cards. I'll try to get into the office later today so that I can perform the followup tests you requested.
Thanks again!
I did not understand what you meant
My bad, a lot of things that seem obvious to me probably aren't.
I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it
If we can fix all the bugs, you should be OK with just 2 contexts.
Is there an easy way to buy or request licenses from NVIDIA (for testing)?
I don't know.
I have two kepler-era K20c tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA
Could be bugs with lossless and / or YUV444 mode with those older cards. If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.
If getting a Quadro (or some other card) will expedite this I'll buy whatever you recommend. Any recommendations?
I've used both gamer cards (GTX 7xx to GTX 10xx - with and without license keys) and Quadros myself. There is absolutely no difference in performance; the Quadros just cost at least 4 times as much. See the Video Encode and Decode GPU Support Matrix.
Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine)
Tagging windows should allow you to do that.
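Window tagging is done with the _XPRA_CONTENT_TYPE X property. A hedged example using xprop (the window id below is a placeholder; get the real one from xwininfo or wmctrl -l):

```shell
# Tag a window so xpra treats its contents as video.
# 0x2e000c4 is a placeholder window id - substitute your own.
xprop -id 0x2e000c4 -f _XPRA_CONTENT_TYPE 8u -set _XPRA_CONTENT_TYPE video
```

If the application sets the property itself (as discussed later in this ticket), no xprop call is needed.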
I've installed the dummy driver. I've updated to r20927. I'm running with the environment variable you specified. I've turned off bandwidth detection. Bandwidth detection made no noticeable difference. The DPI problem seems to be resolved (based on my browsing of the log file).
I've run:
xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no :57 2>&1 |tee 11.txt
Here's the full log:
After enabling XPRA_DEBUG_VIDEO_CLEAN=1 I'm getting this:
2018-11-04 14:27:23,066 video_context_clean()
  File "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/site-packages/xpra/server/background_worker.py", line 62, in run
    item()
  File "/usr/lib64/python2.7/site-packages/xpra/server/source/encodings_mixin.py", line 160, in recalculate_delays
    ws.reconfigure()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1102, in reconfigure
    self.update_encoding_options(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1035, in update_encoding_options
    self.check_pipeline_score(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1131, in check_pipeline_score
    self.video_context_clean()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 300, in video_context_clean
    traceback.print_stack()
2018-11-04 14:27:23,069 video_context_clean() done
Replying to Antoine Martin:
I did not understand what you meant
My bad, a lot of things that seem obvious to me probably aren't.
I very much appreciate your patience!
If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.
I just tried this exercise from the gigabit LAN. This older config is working much better than the GTX config. With bandwidth detection turned off my eyeball cannot detect anything wrong (when bandwidth detection is on it locks occasionally). Last time I saw the colors go weird I was using broadband, so I'll go back home and try that (I came to the office for this test).
For completeness I'm posting my experience on gigabit on my old config. I'm rendering to a C2075 (which is so old it cannot encode) and encoding on a K20c (which doesn't support rendering).
Command:
xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no :57 2>&1 |tee 12.txt
Environment:
[nathan@bobross lsprepost4.5_centos7]$ env | grep XPRA
XPRA_CUDA_DEVICE_NAME=Tesla K20c
XPRA_CUDA_DEVICE=0
XPRA_SCROLL_ENCODING=0
[nathan@bobross lsprepost4.5_centos7]$
Update: From home I find that with nvenc and only nvenc enabled (--encodings) performance is great. With normal options (nothing disabled) performance is bad and my client eventually crashes. I've attached the server log from that.
This older config is working much better than the GTX config
Probably because of the 2 context limit on gamer cards.
From this new log:
get_video_pipeline_options(['h264', 'vp9', 'vp8', 'mpeg1', 'mpeg2'], 3840, 1027, 'BGRX') \ scores=[ (81, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 3840, 1026, video_spec(nvenc)), (..) ]
So disabling the other video encoders is counter-productive.
XPRA_DEBUG_VIDEO_CLEAN=1
so I still cannot tell why the video encoder context gets recycled...
The sample from comment:17 lacks the log context around the backtrace. A single backtrace is not enough in any case, some video context re-cycling is normal. That said, there aren't many instances of re-cycling during this test, so this may not tell us much.
video_encode nvenc encoder: h264 1462x756 result is 148828 bytes (56.9 MPixels/s), \ client options={u'speed': 94, u'frame': 778, u'pts': 19778, 'csc': 'YUV444P', u'quality': 97}
The batch delay was also very good (low) at that point:
calculate_batch_delay for wid=1 current batch delay=8, last update 1 seconds ago
You seem to be getting at least 25 fps, reaching closer to 50fps near the end of the log - which is quite impressive. Was it not smooth at that point?
At which point in time in this log file am I looking for problems?
With bandwidth detection turned off my eyeball cannot detect anything wrong
Maybe no problems in this log then?
I agree that the K20c run on the LAN was good (I meant to indicate that in my original post). I've since added 14.txt (I did not realize that you had already replied, otherwise I would have made a new reply).
14.txt was on broadband. Over broadband I'm seeing that --encodings=h264 is much better than the default options (I am only sending logs for the default options per your request).
Regarding the environment variable, you are right. I did forget to add the new environment variable on 12.txt and 14.txt. However, I did have the environment variable set correctly on 11.txt (my most recent run on the gtx). I'll redo 12.txt and 14.txt in the morning.
Ah, didn't see that 11.txt had the backtraces.
From this 11.txt log:
setup_pipeline failed for (78, (1, 1), (1, 1), 520, 188, video_spec(swscale), u'XRGB', (1, 1), 520, 188, video_spec(nvenc)): \ cannot open encoding session, context is NULL
and we try a few other nvenc combinations (like nvenc without swscale, etc) - which takes time. but later on we re-try and it succeeds:
check_pipeline_score(False) \ found a better video encoder class than \ <type 'xpra.codecs.enc_x264.encoder.Encoder'>: (86, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 520, 188, video_spec(nvenc))
We could use the failures to detect the actual context limit value for each card, and then use this value to avoid trying to create new contexts when the limit is reached. (We have code for doing this already - it just uses the wrong limit: the default is 32.) It is difficult to see why this happened just from the log output. If you don't have too many windows, then I suspect it could be because the encoder context cleanup happens in the encode thread: when we destroy an nvenc encoder (for resizing or whatever), it isn't actually freed immediately, whereas we may try to create the new one immediately. I'm not sure how easy it would be to change that.
video subregion was None, now rectangle(3, 27, 1558, 879) (window size: 1652,998)
video subregion was rectangle(3, 27, 1558, 879), now None (window size: 1652,998)
video subregion was None, now rectangle(3, 27, 1558, 879) (window size: 1652,998)
video_encoder_timeout
There's a dbus API you could use to turn off video region detection, or to set a particular region: #1060, #1401. Maybe we could just enhance the content-type to support values like "video:full", "video:subregion" or something like that. Re-cycling contexts is expensive with nvenc.
With normal options (nothing disabled) performance is bad and my client eventually crashes.
How did it crash? Can you run it from the command line and see its output? Please try with the latest builds, preferably the 64-bit ones.
From the 14.txt log:
Warning: XShm get_pixels_ptr XImage is NULL
nvenc(XRGB/NV12/H264 - hp - 520x188 ).compress_image(XShmImageWrapper(BGRX: 0, 0, 520, 189), 40, 99, {'av-sync': True, 'bandwidth-limit': 0, 'content-type': ''})
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1921, in do_video_encode
    ret = ve.compress_image(csc_image, quality, speed, options)
  File "xpra/codecs/nvenc/encoder.pyx", line 2179, in xpra.codecs.nvenc.encoder.Encoder.compress_image
  File "xpra/codecs/nvenc/encoder.pyx", line 2216, in xpra.codecs.nvenc.encoder.Encoder.do_compress_image
  File "xpra/codecs/nvenc/encoder.pyx", line 2245, in xpra.codecs.nvenc.encoder.Encoder.copy_image
AssertionError: failed to get pixels from XShmImageWrapper(BGRX: 0, 0, 520, 189)
No idea what caused this. Unless this happens repeatedly, I think we can ignore this.
compress: 429.1ms for 1632x914 pixels at 0,0 for wid=1 using h264 with ratio 3.4% ( 5833KB to 197KB), sequence 13, client_options={u'speed': 99, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 40}
Please enable XPRA_DEBUG_VIDEO_CLEAN=1 for that.
- how did you install the latest builds without installing libyuv? it is an RPM dependency in v2.5
I think it is installed. What line in the log indicates that it is not?
[nathan@curry ~]$ rpm -q --info libyuv
Name        : libyuv
Version     : 0
Release     : 0.31.20180904git9a07219.el7_5
Architecture: x86_64
Install Date: Mon 29 Oct 2018 11:39:10 AM PDT
Group       : Unspecified
Size        : 361374
License     : BSD
Signature   : DSA/SHA1, Tue 23 Oct 2018 11:37:47 PM PDT, Key ID 18adb31cf18ad6bb
Source RPM  : libyuv-0-0.31.20180904git9a07219.el7_5.src.rpm
Build Date  : Tue 23 Oct 2018 10:00:00 PM PDT
Build Host  : buildbot
Relocations : (not relocatable)
URL         : https://chromium.googlesource.com/libyuv/libyuv
Summary     : YUV conversion and scaling functionality library
Description :
This is an open source project that includes YUV conversion and scaling functionality. Converts all webcam formats to YUV (I420). Convert YUV to formats for rendering/effects. Rotate by 90 degrees to adjust for mobile devices in portrait mode. Scale YUV to prepare content for compression, with point, bilinear or box filter.
[nathan@curry ~]$ rpm -q --verify -v libyuv
.........    /usr/lib64/libyuv.so.0
.........    /usr/lib64/libyuv.so.0.0.0
.........    /usr/share/doc/libyuv-0
.........  d /usr/share/doc/libyuv-0/AUTHORS
.........  d /usr/share/doc/libyuv-0/PATENTS
.........  d /usr/share/doc/libyuv-0/README.md
.........    /usr/share/licenses/libyuv-0
.........  l /usr/share/licenses/libyuv-0/LICENSE
[nathan@curry ~]$
I think it is installed. What line in the log indicates that it is not?
My bad, something has gone wrong with the centos7 builds, tracking this here: ticket:2004#comment:2
I made a video demonstrating what our software looks like. This was rendered on the machine with the Tesla K20c (this is similar to 12.txt). I made it so that you can see what I am doing. For instance, the window closing events are for the "help" box and the "animate" box. The only part that needs nvenc encoding is the box with the 3d drawing.
This case works pretty well. It locks up during a rotation at 9 1/3 seconds (it would not do that with --encodings=h264), but it's pretty good. (I don't have a proper debug log because I was making it for demonstration rather than debug purposes).
Again, thanks for all your help. I'll be following up shortly with more debug logs.
https://ftp.lstc.com/anonymous/outgoing/nathan/bobross-ok.mp4
I was able to get a repeat of 14.txt (with XPRA_DEBUG_VIDEO_CLEAN=1). Same server, different client. This time I was using the stock XPRA 64-bit client (from the web site) connecting through my cell phone hotspot. When tested with --encodings=h264 (for control) the performance was shockingly good (*on a cellphone hotspot*). However, with the normal options it was slow and eventually crashed.
[nathan@bobross lsprepost4.5_centos7]$ xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no :57 2>&1 | tee 15.txt^C
[nathan@bobross lsprepost4.5_centos7]$ env | grep XPRA
XPRA_CUDA_DEVICE_NAME=Tesla
XPRA_CUDA_DEVICE=0
XPRA_SCROLL_ENCODING=0
XPRA_DEBUG_VIDEO_CLEAN=1
[nathan@bobross lsprepost4.5_centos7]$
I *was not* able to replicate using the same commands when connected on either 802.11g or 802.11n. In all cases I was port forwarding ssh. This trouble arises *only* on cell phone and broadband. Broadband is an especially important case, however, because we have several offsite users.
How do I get debug log off the client?
However, with the normal options it was slow and eventually crashed.
As per comment:21, how did it crash, what was the output, etc?
How do I get debug log off the client?
By default, the client debug log output is sent to the server (see --remote-logging).
But when dealing with crashes, the last messages may not make it to the server, so just run:
xpra_cmd.exe attach ssh://whatever/ -d whateverdebugflags
If it isn't clear what is causing the crash, using -d all might be required to cover all possibilities.
From this latest log:
try --bandwidth-detection=no --bandwidth-limit=100mbps and add -d bandwidth to the server debug flags to confirm (the client overrules the server's bandwidth-detection flag - and maybe we shouldn't allow that?) - h264 is more efficient, so it could be avoiding some bandwidth issues
try --auto-refresh-delay=0 on the client and see if that helps
The video_context_clean events come from window close events (_lost_window), twice - in both cases, the windows used h264 for a single frame (and webp, jpeg for a few more):
encoding_totals for wid=2 with primary encoding=auto : {'h264': [1, 60300], 'webp': [2, 120868], 'jpeg': [3, 129888]}
Normally, this isn't a huge problem but with nvenc this is more costly because of the initial setup cost and the fact that this is "wasting" a precious encoder context. If the video encoders trigger too early, we could favour software encoders until we have enough frames to be certain that a hardware video encoder is worth using.
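The "favour software encoders until we have enough frames" idea could be sketched like this (illustrative only - names such as HW_WARMUP_FRAMES are invented here, this is not xpra's scoring code):

```python
# Illustrative sketch, not xpra code: hold back an expensive hardware
# encoder until a window has produced enough frames to justify the
# context setup cost. HW_WARMUP_FRAMES is an invented tunable.
HW_WARMUP_FRAMES = 10

def choose_encoder(frames_seen, nvenc_available):
    # Early frames: always use the cheap software encoder, so that
    # short-lived bursts never waste a precious nvenc context.
    if nvenc_available and frames_seen >= HW_WARMUP_FRAMES:
        return "nvenc"
    return "x264"
```

The trade-off is exactly the one described above: a later switch wastes fewer contexts but pays the software encoder's CPU cost for the first few frames.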
video subregion was rectangle(0, 0, 1604, 881), now rectangle(3, 27, 1450, 762) (window size: 1604,881)
A single change should not be a huge problem.
... with the normal options it was slow ... This trouble arises *only* on cell phone and broadband.
Can you describe this "trouble"? Is it stuttering, occasionally pausing, or just generally slow? If the problems are punctual, can you match up the problems with timecodes from the log output? (roughly) Try capturing "xpra info" at a time when things are slow.
I've been away all day dealing with some other stuff. I think you are on to something. When I turn off the bandwidth detection on the client side the performance is better than I've ever seen. I (almost) cannot distinguish broadband from LAN.
The interesting part: the longer I leave the session running the slower things get. I'll try to get debug data before the day is out. Otherwise I'll have it first thing tomorrow.
Again, thank you!
Added the --auto-refresh-delay option. Everything runs smoothly (with the Tesla card) for as long as I leave the window open.
I like auto-refresh very much so I'll be doing anything I can to help fix it. (With auto-refresh turned on, the video streams become slower and slower, seemingly with each refresh, until the client crashes - after reconnecting it's perfect at first, and then the process repeats exactly as before.) I will be taking debug logs tomorrow. I'll also try to match lock-ups to timestamps in the logs.
I like auto-refresh very much so I'll be doing anything I can to help fix it.
The debug we need to get to the bottom of this would be:
enable --bandwidth-detection=yes on the client and post the server's -d bandwidth logs - it looks like the bandwidth detection is detecting the difference between LAN and broadband / phone connections, but maybe over-compensating
-d refresh,regionrefresh,compress,video for when things get out of whack - the bandwidth detection should have an impact on the auto-refresh so you may want to try with / without; nvenc's slow startup could be causing the auto-refresh to trigger too early, or maybe it triggers and causes context re-cycling - we'll see
Since I've updated my server to the latest beta and disabled the bandwidth detection I've not been able to replicate any trouble on the Tesla card when I use clients compiled by xpra.org. I've tried both 2.4 and the latest betas and they work!
The good news for me is that my builds give the same bad results with or without my patches. I really want to get my patches incorporated into mainline, but I don't want to send them until I make them work on Linux and MacOS.
The bad news (for me, not for you) is that, until I get our UI team to add the X properties and until I go back to testing the gamer cards, I think my only remaining problems are created by something in the way I'm building XPRA on Windows.
I'm not really sure where to begin debugging this. I'm going to start with the suggestions in your previous replies (I haven't tried them all yet). The log captures on the server side look pretty similar for both clients (lots of h264). My client is slow even when it *is* using h264 on the server side. I can't figure out how to read the "-d all" output because there are so many GTK messages.
I am working on this today.
Again, I thank you and your team for all your help!
Since I've updated my server to the latest beta and disabled the bandwidth detection I've not been able to replicate any trouble on the Tesla card when I use clients compiled by xpra.org.
Did you identify what fixed things? My guess is the fixes from comment:14, likely r20914.
I really want to get my patches incorporated into mainline, but I don't want to send them until I make them work on Linux and MacOS.
What do those patches do? Can you open a ticket for that?
I think my only problems are created by something in the way I'm building XPRA on Windows.
The number 1 performance issue is likely to be the lack of opengl, see ticket:1883#comment:8 If you can't figure it out, please open a new ticket for that.
FYI: as of r20971 the server controls the bandwidth-detection option and the client can no longer enable this feature on its own.
I'm not exactly sure why my custom builds were not working on windows. I think I was accidentally setting some "speed=" things in my "conf.d" folder. Anyway, I can now connect flawlessly to my computer with a K20c Tesla (for encoding) and a C2075 (for render) if I disable the bandwidth-detection.
I still cannot make the computer with the GTX card work right. I'm attaching a log from an XPRA session on a computer with a single GTX card running the opengl job.
I am using gigabit Ethernet.
I'm still working on getting _XPRA_CONTENT_TYPE added to our software.
Here's the command I used to collect the debug data
xpra start --no-daemon --bandwidth-detection=no --bandwidth-limit=100mbps --start="vglrun lsprepost4.5_centos7/lspp45 d3plots/d3plot" -d score,encoding,video,scaling,csc,stats,compress :250 2>&1
attachment/ticket/2022/gtx1.txt
Since I've updated my server to the latest beta and disabled the bandwidth detection
As per comment:29, it would really help if you could reproduce the problems with bandwidth-detection enabled and -d bandwidth.
Then we can make it less aggressive and work better out of the box for everyone.
I still cannot make the computer with the GTX card work right. I'm attaching a log from a XPRA session on a computer with single GTX card running the opengl job.
XPRA_NVENC_CONTEXT_LIMIT=2 xpra start ...
There is no overcommit, so this should prevent the worst of the stuttering as we won't be trying to use nvenc once two contexts are in use (trying and failing takes time)
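The fail-fast guard being described here amounts to a simple counter check before attempting a context (an illustrative sketch with invented names, not xpra's implementation; the limit of 2 matches consumer GeForce cards, which is also what XPRA_NVENC_CONTEXT_LIMIT=2 caps):

```python
# Illustrative sketch: refuse a new nvenc context up front once the
# per-card session limit is reached, instead of letting the driver
# fail slowly. All names here are invented for illustration.
NVENC_CONTEXT_LIMIT = 2
_active = []

def try_acquire_context():
    if len(_active) >= NVENC_CONTEXT_LIMIT:
        return None  # fail fast: no slow round-trip to the driver
    name = "nvenc-context-%d" % (len(_active) + 1)
    _active.append(name)
    return name
```

With the limit enforced up front, the third window simply falls back to a software encoder instead of stuttering while the driver rejects the request.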
If you still have problems, then that probably means that we're choosing the wrong window for nvenc.
Please post -d encoding,damage with an explanation of which window id should be using the video encoders.
Note: I have seen what looks like some malformed frames in the client output: "unknown format returned" - ffmpeg fails to decode a frame, then we reset the encoder... which takes a bit of time. New ticket: #2051.
I will do the tests with bandwidth detection ASAP (probably within the next week). At the moment looking, among other things, at the ssh patch I submitted and the gnome-terminal ticket. Thanks for all your help!
Stumbled upon this hack for bypassing the licensing restrictions: https://github.com/keylase/nvidia-patch. It patches the driver so that the license key check always succeeds. Another approach which can be used is to break on the license key checking code in gdb and then extract the license keys included in the blob.
Using the debug options discussed in this ticket I have found that, at this point, a good deal of my NVENC context recycling is caused by the video-region-detection dimensions changing. I am now working with our LS-PrePost developers to get a callback so that I can plug in some code I've written to notify XPRA via dbus of the OpenGL canvas coordinates.
I have already demonstrated that this helps a lot using the "xpra control" commands. It may even completely resolve the NVENC context recycling issues, but I'm not sure yet...
Using the debug options discussed in this ticket I have found that, at this point, a good deal of my NVENC context recycling is caused by the video-region-detection dimensions changing.
If you can provide the sample debug output, maybe we can improve the code so that it doesn't change the region quite so often. -d damage,regiondetect should be enough.
And even when we do recycle the context, there are two ways we can make it better still:
(faster but harder to implement)
I have gone ahead and implemented dbus calls.
I am calling
(1) SetVideoRegionDetection to false
(2) SetVideoRegionEnabled to true
(3) ResetVideoRegion (not sure if I need this)
(4) SetVideoRegion
When I do this the video region gets stuck forever on x264 unless I disable x264 at startup (and x264 is laggy - also not sure if that is a bug). I'm not sure if this is a bug or if I'm doing something wrong.
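For anyone following along, roughly the same calls can be tried interactively with "xpra control" (mentioned earlier in this ticket); the sub-command names and arguments below are my best recollection, so verify them against the output of "xpra control :57 help" on your version:

```shell
# Assumed sub-command names - check "xpra control :57 help" first.
xpra control :57 video-region-detection 1 off   # (1) disable auto-detection for wid 1
xpra control :57 video-region-enabled 1 on      # (2) force the region on
xpra control :57 video-region 1 3 27 1450 762   # (4) set the region geometry
```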
(3) ResetVideoRegion (not sure if I need this)
Not needed.
When I do this the video region gets stuck forever on x264 unless I disable x264 at startup
That's what setting a video region does.
(and x264 is laggy, also not sure if that is a bug). I'm not sure if this is a bug, or if I'm doing something wrong.
Are you talking about the laggyness?
What exactly is the symptom? Did you capture xpra info?
Using "video" may trigger some delayed frames with x264. (but not with nvenc - not yet anyway)
Assuming that you're using the x264 encoder, you may want to only enable vpx and not x264: --video-encoders=vpx.
Or you can also try to cap the number of delayed frames with x264:
XPRA_X264_MAX_DELAYED_FRAMES=1 xpra start ..
Maybe what we need to do is to define a new content-type ("animation" perhaps?) that enables video encoders but without the features that cause delayed frames (b-frames, multiple reference frames) as those are only really useful for true video content (ie: video player or youtube).
I'm encountering two unrelated problems:
My users are comparing xpra to VNC (and NX), and the only consistent complaint I have heard (from a small number of them) has to do with (at least I think it has to do with) the time it takes nvenc to start.
This region that I am flagging with DBUS is either resting (a constant lossless frame) or animating. Nothing like scrolling or UI happens in this region. My experiments tell me that VPX and X264 cannot keep up, but nvenc is amazing and turbojpeg is the best when there is no nvenc (at least on old westmere chips).
Based on my experience, therefore, and assuming I've made no other mistakes, I would like for my DBUS calls to make a *permanent* nvenc context in that region, as the shape of the region never changes without triggering my resize callback which will notify xpra.
I've done a little attempt at debugging and I've found that the score for nvenc never gets better than x264 when I set the region using dbus. I think it is because the setup penalty always swamps everything else for some reason, but I'm not sure that I'm reading this right.
Every iteration has a -300:
Dec 31 15:23:29 curry sh[158375]: 2018-12-31 15:23:29,349 get_pipeline_score(BGRX , None , x264(BGRX to h264) , 1276, 918) quality: 63, speed: 34, setup: 100 - 100 runtime: 100 scaling: (1, 1) / (1, 1), encoder dimensions=1276x918, sizescore= 89, client score delta= 0, cpu score= 0, gpu score= 0, score= 57
Dec 31 15:23:29 curry sh[158375]: 2018-12-31 15:23:29,350 get_pipeline_score(BGRX , None , nvenc(BGRX to h264) , 1276, 918) quality: 63, speed: 48, setup: 100 - -300 runtime: 100 scaling: (1, 1) / (1, 1), encoder dimensions=1276x918, sizescore= 87, client score delta= 0, cpu score= 0, gpu score=100, score= 39
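The real get_pipeline_score formula lives in xpra's codec scoring code; this is only a toy stand-in (the averaging and all names are invented) to show how a single large negative setup term can outweigh every other advantage:

```python
# Toy illustration only - NOT xpra's real get_pipeline_score formula.
# It borrows the fields from the log lines above just enough to show
# that one large negative "setup" penalty can drag an otherwise
# superior encoder below a mediocre one.
def toy_score(quality, speed, setup, runtime, sizescore, gpu_score):
    # a crude unweighted average of the factors
    return (quality + speed + setup + runtime + sizescore + gpu_score) // 6

x264_score = toy_score(63, 34, 100, 100, 89, 0)      # setup cost already paid
nvenc_score = toy_score(63, 48, -300, 100, 87, 100)  # -300 setup penalty
```

Even though nvenc wins on speed and gpu score, the setup term dominates, which matches the x264=57 vs nvenc=39 outcome in the log.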
x264 and vpx are both laggy, but I think this might be an insurmountable computational intensity issue (on the old Westmere chip I'm testing on, at least). I'm seeing other behaviors consistent with this theory: if I turn off the mesh, for instance, it gets less laggy (and the lag sometimes goes away entirely).
I'm not sure what "turn off the mesh" means here. Setting a high speed setting should reduce the encoding latency, it will just use up more bandwidth instead.
When I set the video region by hand (after using it a little) XPRA ends up in nvenc, but when I set it right at start it is in x264 and stays in x264 forever, and it never switches to nvenc.
Two things are likely at play here:
XPRA_MIN_FPS_COST
My users are comparing xpra to VNC (and NX), and the only consistent complaint I have heard (from a small number of them) has to do with (at least I think it has to do with) the time it takes nvenc to start.
We can make it start quicker, but then users may complain about jerkiness for the first few frames. (due to the nvenc initialization delay)
I've done a little attempt at debugging and I've found that the score for nvenc never gets better than x264 when I set the region using dbus. I think it is because the setup penalty always swamps everything else for some reason, but I'm not sure that I'm reading this right.
Could be the fps change (r21057), and / or maybe setting the video region interferes with that.
- we don't use nvenc immediately unless we are certain that the video is here to stay - so initially x264 will win the scoring, even more so recently (see r21057); you can try lowering XPRA_MIN_FPS_COST
What's going on is that identify_video_subregion ends before it can update the fps when detection is disabled, so the fps stays at 0 forever. I'm trying to come up with a patch...
I have written a patch that gets my vgl app working perfectly. It:
Thanks, merged in separate commits with some minor changes:
XPRA_VIDEO_NODETECT_TIMEOUT=0 to disable the timeout completely - there is no need for a get_detection method as this can be replaced with 2 lines of code
@nathan_lstc: does that work for you?
As per comment:39:
If you can provide the sample debug output, maybe we can improve the code so that it doesn't change the region quite so often.
-d damage,regiondetect should be enough.
This would still be worth doing. (making xpra work correctly out of the box rather than relying on application hints to do the right thing)
Looks good. I'll test it more thoroughly later, but I'm pretty sure everything is right.
Another problem: when I reattach, xpra seems to forget everything that the client apps told it via dbus. Ideas for fixing:
(1) have my app somehow grab startup-complete signal and resend stuff to the dbus
(2) have xpra send some XEvent to my program
(3) have xpra remember the video-region stuff across attaches.
Presently, I'm looking into option (1), is that the right thing?
I've implemented option (1). It works fine. I'm very satisfied with how well LSTC's (my employer) UI stuff is working. I'm going to test out unrelated ticket number #2090 tomorrow. Thanks.
Can I close this ticket as "fixed"?
I've retested everything on LAN and WIFI with default settings. I can say everything works perfectly as of r21284.
I still need to test on broadband before I can say for sure that everything works.
My environment:
ExecStart=/bin/sh -c "cd ~;PATH=/opt/xpra/bin:$PATH xpra -d compress --no-daemon start --no-printing --start-via-proxy=no --systemd-run=no --start=\"xrdb -merge $HOME/.Xresources\" --start-child=xterm --exit-with-children --xsettings=no --file-transfer=yes --mdns=no :`id -u`"
Environment=PYTHONPATH=/opt/xpra/lib64/python2.7/site-packages
Environment=LD_LIBRARY_PATH=/opt/libjpeg-turbo/lib64/:/usr/local/cuda/lib64:/usr/lib64/xpra:/opt/x264/lib
Environment=CUDA_VISIBLE_DEVICES=1
Environment=XPRA_VIDEO_NODETECT_TIMEOUT=0
Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=25
As per ticket:2090#comment:7 :
Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=25
It's working fine from my LTE phone over a bluetooth link share. I can manipulate a 3D model in real-time over this (using quality=15). You may close this ticket!
It's working fine from my LTE phone over a bluetooth link share. I can manipulate a 3D model in real-time over this (using quality=15).
Doesn't the quality auto-adjust well enough? (Maybe we should set min-quality=0 to allow it to get there?)
I have deployed this and almost immediately got a report that XPRA had hung for one of my users. I will include two interesting-looking log excerpts. How can I help debug?
The initial problem seems to be related to some kind of network congestion:
Jan 03 19:01:01 curry run-parts(/etc/cron.hourly)[86181]: finished mcelog.cron
Jan 03 19:06:20 curry sh[2000]: 2019-01-03 19:06:20,407 Error in worker thread processing item <bound method ClientConnection.recalculate_delays of ClientConnection(3 : Protocol(unix-domain socket:/run/user/267/xpra/curry-267))>
Jan 03 19:06:20 curry sh[2000]: Traceback (most recent call last):
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/background_worker.py", line 62, in run
Jan 03 19:06:20 curry sh[2000]: item()
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/source/encodings_mixin.py", line 137, in recalculate_delays
Jan 03 19:06:20 curry sh[2000]: self.statistics.update_averages()
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/source/source_stats.py", line 118, in update_averages
Jan 03 19:06:20 curry sh[2000]: css = tuple(x for x in self.congestion_send_speed if x[0]>min_time)
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/source/source_stats.py", line 118, in <genexpr>
Jan 03 19:06:20 curry sh[2000]: css = tuple(x for x in self.congestion_send_speed if x[0]>min_time)
Jan 03 19:06:20 curry sh[2000]: RuntimeError: deque mutated during iteration
Jan 03 19:07:07 curry kernel: nvidia-modeset: WARNING: GPU:1: Correcting number of heads for current head configuration (0x00)
Jan 03 19:07:10 curry sh[2000]: 2019-01-03 19:07:10,746 client @53.594 server is not responding, drawing spinners over the windows
Jan 03 19:07:10 curry sh[2000]: 2019-01-03 19:07:10,877 client @53.859 server is OK again
Jan 03 19:10:01 curry systemd[1]: Started Session 4132 of user root.
Here's the messages after it tries to recover. There are many lines of this:
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,027 Warning: delayed region timeout
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 region is 15 seconds old, will retry - bad connection?
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 5 late responses:
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 54 h264 : 18s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 55 h264 : 18s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 56 h264 : 17s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 57 h264 : 17s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,029 58 h264 : 15s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 Warning: delayed region timeout
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 region is 15 seconds old, will retry - bad connection?
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 22 late responses:
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 3944 jpeg : 22s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3945 scroll: 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3946 png : 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3947 jpeg : 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3948 jpeg : 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3949 scroll: 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3950 rgb24: 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3951 jpeg : 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3952 jpeg : 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3953 scroll: 20s
Then I start to see:
Jan 03 19:49:12 curry sh[2000]: 2019-01-03 19:49:12,453 Error: connection timed out: unix-domain socket:/run/user/267/xpra/curry-267
Jan 03 19:49:12 curry sh[2000]: 2019-01-03 19:49:12,456 Error: connection timed out: unix-domain socket:/run/xpra/curry-267
Jan 03 19:49:44 curry sh[2000]: *** BUG ***
Jan 03 19:49:44 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 19:49:44 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
Jan 03 19:50:01 curry systemd[1]: Started Session 4144 of user root.
Jan 03 19:50:01 curry CROND[156120]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 03 19:55:46 curry sh[2000]: *** BUG ***
Jan 03 19:55:46 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 19:55:46 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
Jan 03 19:56:44 curry sh[2000]: *** BUG ***
Jan 03 19:56:44 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 19:56:44 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
Jan 03 20:00:01 curry systemd[1]: Started Session 4145 of user root.
Jan 03 20:00:01 curry CROND[156756]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 03 20:00:50 curry sh[2000]: *** BUG ***
Jan 03 20:00:50 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 20:00:50 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
All of my clients also had bandwidth detection disabled, so my claim that I had re-enabled bandwidth detection might have been false. However, my user from last night certainly had bandwidth detection *on*. I'll be retesting that momentarily.
deque mutated during iteration
This is a real bug, but it has been present for a long time, so it may just be an unlucky symptom of another problem: you may be hitting this race condition now because there are many congestion events. Fixed in r21288.
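For reference, the failure mode is easy to reproduce in isolation: CPython's deque iterator raises RuntimeError as soon as the deque changes length mid-iteration. A minimal sketch below; the snapshot approach shown is one common fix, not necessarily what r21288 does:

```python
from collections import deque

# Mutating the deque while iterating it raises RuntimeError,
# exactly as in the traceback above:
d = deque([1, 2, 3])
try:
    for x in d:
        d.append(x)
except RuntimeError as e:
    print("error: %s" % e)      # error: deque mutated during iteration

# One common fix: snapshot the deque before filtering it,
# so the generator never iterates the live container:
events = deque([(1, 100), (2, 200), (3, 300)])
min_time = 1
css = tuple(x for x in tuple(events) if x[0] > min_time)
print(css)                      # ((2, 200), (3, 300))
```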
*** BUG ***
In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Set a breakpoint on '_pixman_log_error' to debug
I have never ever seen this message. Very suspicious.
WARNING: GPU:1: Correcting number of heads for current head configuration (0x00)
Could this be the cause? Maybe the GPU is hanging and causing the whole system to stall? This would explain the other timeouts that follow:
Warning: delayed region timeout
Error: connection timed out: unix-domain socket:/...
The "correcting number of heads" warning is a red herring, I believe. All of my headless servers say that and it never causes any trouble. I've never looked into it.
What I have found is that he reports that the "Network Performance Issue" dialog pops up as soon as I turn the bandwidth-detection feature back on, and his connection gets steadily worse over time.
If I turn it off, his connection is great, he says. It is 100% reproducible, he tells me. I cannot, however, seem to reproduce the problem in my office.
What logs shall I collect?
What I have found is that he reports that the "Network Performance Issue" dialog pops up as soon as I turn the bandwidth-detection feature back on, and his connection gets steadily worse over time.
This could be a similar problem to ticket:2090#comment:9 : a batch delay that stays too high following congestion. (real or not..)
What logs shall I collect?
-d bandwidth. Maybe correlate it with -d compress too.
There was even a bug a while back that triggered more congestion events after the first one. This could be something similar, where the congestion countermeasures just trigger repeatedly.
I've got a log for you: log2.txt
The guy running the client tells me that he started the 3D software at 43:19 and by 42:28 things were already broken.
He's using the 64-bit Windows beta client from the xpra website, from 12-18.
The guy running the client tells me that he started the 3D software at 43:19 and by 42:28 things were already broken.
Unless he's going back in time, that's 42:19 ;)
Found a few things of interest in there:
12:42:23,435:
networksend_congestion_event(late-ack for sequence 72: late by 102ms, target latency=272 ((118, 54, 0, 100)), 465, 1354781) 1323Kbps (average=0Kbps)
We have a target latency of 272ms for this connection: 118ms for network (based on measured latency), 54ms for sending the packet, and 100ms tolerance. Yet somehow the ack packet comes back 102ms late. Spurious events like these can happen, so this may or may not be a problem. There are a couple more events, before this important one:
compress: 504.6ms for 1186x938 pixels at 3,23 for wid=2 using h264 with ratio 0.1% (4350KB to 4KB), sequence 3, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 40}
Immediately after that, we see congestion events with a 500ms delay:
networksend_congestion_event(late-ack for sequence 5: late by 508ms, target latency=260 ((118, 42, 0, 100)), 1513, 284030) 134Kbps (average=157Kbps)
I don't understand how the encode thread could affect the network send thread... but that's too much to be a coincidence.
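To make the numbers in these events concrete, here is a rough sketch of the late-ack arithmetic. The function names and parameters are illustrative, not xpra's actual code; only the breakdown itself (measured network latency + send time + fixed tolerance, from the "(118, 54, 0, 100)" tuple) is taken from the log lines:

```python
def target_latency(network_ms, send_ms, jitter_ms=0, tolerance_ms=100):
    # Mirrors the "(118, 54, 0, 100)" breakdown from the log line:
    # measured network latency + time to push the packet + tolerance.
    return network_ms + send_ms + jitter_ms + tolerance_ms

def ack_lateness(sent_ts, ack_ts, target_ms):
    # Positive result means the ack came back that many ms too late,
    # which would be flagged as a congestion event.
    elapsed_ms = (ack_ts - sent_ts) * 1000.0
    return elapsed_ms - target_ms

target = target_latency(118, 54)            # 272ms, as in the log
late_by = ack_lateness(10.0, 10.374, target)  # ack arrived 374ms after send
print(target, round(late_by))               # 272 102
```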
I have seen unexplained / spurious congestion events before, especially with mswindows clients. You may want to try:
XPRA_ACK_TOLERANCE=200 xpra start ...
If that works then we may have to double it by default, or just for win32 clients.
It may also be worth testing the client with socket nodelay toggled both ways:
XPRA_SOCKET_NODELAY=0 xpra_cmd.exe attach ...
XPRA_SOCKET_NODELAY=1 xpra_cmd.exe attach ...
DOH: regarding the 500ms delay, that's obviously something holding onto the python GIL during nvenc / pycuda setup.
If you can provide the -d cuda,nvenc output and also set XPRA_NVENC_DEBUG_API=1, I should be able to figure out which call is blocking the whole python interpreter, and then we can cythonize it and release the GIL.
Found a detailed NVENC log sample in ticket:1472#comment:7; it seems that the delays happen between "testing device .." and "created context", and between "init_cuda cuda_device" and "init_cuda cuda_context".
In both cases, this is where we call pycuda.driver.Device.make_context.
Also relatively slow is the encoder context cleanup call (after cuda_clean) - but since we call it twice, it could just be elsewhere... r21292 will help clarify that.
More and more, I am thinking of solving delay problems (see comment:21 and ticket:2048#comment:2) by moving video encoding to its own thread. This would only fix some of the problems (#2048 mostly), but not if pycuda holds the GIL...
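A minimal sketch of what "video encoding in its own thread" could look like: a queue-fed worker, so the damage and network threads never block on a slow encoder call. This is illustrative only, not xpra's actual design, and as noted it would not help if pycuda holds the GIL during setup:

```python
from threading import Thread
from queue import Queue

def encode_worker(inbox, send):
    # Dedicated encoder thread: slow encoder calls only ever
    # stall this thread, not the thread that queued the frame.
    while True:
        frame = inbox.get()
        if frame is None:          # sentinel: shut down cleanly
            break
        send("h264:%s" % frame)    # stand-in for the real encoder call

sent = []
inbox = Queue(maxsize=4)           # bounded: caller can drop/coalesce
t = Thread(target=encode_worker, args=(inbox, sent.append))
t.daemon = True
t.start()

for frame in ("f1", "f2", "f3"):
    inbox.put(frame)
inbox.put(None)
t.join()
print(sent)                        # ['h264:f1', 'h264:f2', 'h264:f3']
```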
test if make_context releases the GIL
@nathan_lstc: please run attachment/ticket/2022/test_cuda_context_gil.py on a system with nvenc so we will know if pycuda is holding the GIL.
I've modified it as follows (added init). I also reduced the sleep number from 0.1 to 0.01:
from pycuda import driver
driver.init()
print("initilizing device")
output:
[root@curry nathan]# !py
python test_cuda_context_gil.py
******initilizing device
*device=<pycuda._driver.Device object at 0x7ff61455a680>
context=<pycuda._driver.Context object at 0x7ff5ce4ad668>
*done
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being cleaned
up. At this point in our execution, CUDA may already have been
deinitialized, so there is no way we can finish cleanly. The
program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
[root@curry nathan]#
The error printed at the end can probably be avoided by calling cuda_context.detach().
In hindsight, this test code output is not very useful... I wanted to know if the print thread was being blocked during the call to make_context. Was it? (did the "*" stop being printed out?)
A better way would have been to print timestamps every time instead.
Here's the output for a few runs:
[root@curry nathan]# python test_cuda_context_gil.py
******************initilizing device
*device=<pycuda._driver.Device object at 0x7fe1928ef680>
1546898949.39
1546898949.6
context=<pycuda._driver.Context object at 0x7fe14c867668>
**********done
*[root@curry nathan]# python test_cuda_context_gil.py
*************************************************initilizing device
device=<pycuda._driver.Device object at 0x7f0f6c8e2680>
*1546898951.31
1546898951.49
context=<pycuda._driver.Context object at 0x7f0f2685b668>
********done
[root@curry nathan]#
The current version:
#!/usr/bin/env python
import sys
import time

def print_thread():
    while True:
        sys.stdout.write("*")
        sys.stdout.flush()
        time.sleep(0.001)

from threading import Thread
t = Thread(target=print_thread)
t.daemon = True
t.start()
time.sleep(0)
from pycuda import driver
driver.init()
print("initilizing device")
device = driver.Device(0)
print("device=%s" % (device,))
print(time.time())
context = device.make_context()
print(time.time())
print("context=%s" % (context,))
time.sleep(0.01)
context.detach()
print("done")
It's still printing stars "*" and not timestamps...
We want to see if the print_thread is paused during the call to make_context.
There is a "print time.time()" on either side of the make_context call. The stars are printing fast enough that I think a pause would have been visible. But you are right, it would be clearer with timestamps. Here it is:
[root@curry nathan]# !py
python test_cuda_context_gil.py
*print thread: 0.00265097618103
*print thread: 0.00385713577271
*print thread: 0.00500512123108
*print thread: 0.00671005249023
*print thread: 0.00848293304443
*print thread: 0.00974202156067
*print thread: 0.0113780498505
*print thread: 0.0127730369568
*print thread: 0.01389503479
*print thread: 0.0152540206909
*print thread: 0.0163860321045
*print thread: 0.0179829597473
*print thread: 0.019357919693
*print thread: 0.0204629898071
*print thread: 0.0215699672699
*print thread: 0.0227890014648
*print thread: 0.0246050357819
*print thread: 0.026202917099
*print thread: 0.0370080471039
*print thread: 0.0383989810944
*print thread: 0.039715051651
*print thread: 0.0414950847626
*print thread: 0.0425989627838
*print thread: 0.0456280708313
*print thread: 0.0471770763397
*print thread: 0.0486969947815
*print thread: 0.0498020648956
*print thread: 0.0512359142303
*print thread: 0.0525970458984
*print thread: 0.0548369884491
*print thread: 0.0561380386353
*print thread: 0.0572350025177
*print thread: 0.0583231449127
*print thread: 0.0594069957733
*print thread: 0.0614700317383
*print thread: 0.0625629425049
*print thread: 0.0636930465698
*print thread: 0.0647871494293
*print thread: 0.0663070678711
*print thread: 0.0675950050354
*print thread: 0.0690340995789
*print thread: 0.0701401233673
*print thread: 0.0712389945984
*print thread: 0.0740530490875
*print thread: 0.078320980072
*print thread: 0.07941198349
*print thread: 0.080500125885
*print thread: 0.0816259384155
*print thread: 0.0827519893646
*print thread: 0.0839450359344
*print thread: 0.0852839946747
*print thread: 0.0901050567627
*initilizing device
device=<pycuda._driver.Device object at 0x7f36ca67c6d8>
print thread: 0.258333921432
1546899654.25
1546899654.47
context=<pycuda._driver.Context object at 0x7f36845e66e0>
*print thread: 0.471437931061
*print thread: 0.472643136978
*print thread: 0.473837137222
*print thread: 0.474987030029
*print thread: 0.476146936417
*print thread: 0.477342128754
*print thread: 0.478557109833
*print thread: 0.479777097702
*done
[root@curry nathan]#
Sorry, didn't see that.
So, as I expected, the problem is that python-pycuda is stalling the whole python interpreter during both device init and context creation. I'll ask for help; otherwise we may have to rewrite the cuda glue in cython so we can call it with the GIL released (see Cython: Interfacing with External C Code: Releasing the GIL).
Should I still have my user do the XPRA_SOCKET_NODELAY and the XPRA_ACK_TOLERANCE test?
Should I still have my user do the XPRA_SOCKET_NODELAY and the XPRA_ACK_TOLERANCE test?
Probably not. The pycuda blockage is probably the main cause of the latency problems.
Question on pycuda mailing list: releasing the GIL during init and make_context
Batch delay fixes tracked in ticket:2090#comment:10
As per this reply: "No particular reason from my end. I'd be happy to consider a PR with such a change. Should be pretty straightforward." Time for me to work on pycuda a bit then.
In the meantime, this should achieve the same results as setting XPRA_BATCH_MAX_DELAY=50, but better: we expire the delayed regions quicker, but they only actually get sent if there really isn't any reason to wait longer.
This doesn't fix the problem where we are too sensitive to latency going up and not sensitive enough when it goes back down, but it should mitigate most of it.
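The "too sensitive going up, not sensitive enough coming down" behaviour can be pictured as an asymmetric smoothing of the batch delay. This is a toy model, not xpra's actual heuristics; the factor values are made up for illustration:

```python
def update_batch_delay(current, measured, up_factor=0.8, down_factor=0.1):
    # React quickly when the measured delay rises (congestion),
    # and decay slowly when it falls - which is why the batch delay
    # can stay too high long after the congestion is gone.
    if measured > current:
        return current + up_factor * (measured - current)
    return current + down_factor * (measured - current)

delay = 10.0
for measured in (200, 10, 10, 10, 10):   # one congestion spike, then calm
    delay = update_batch_delay(delay, measured)
    print(round(delay, 1))
# 162.0, 146.8, 133.1, 120.8, 109.7: still ~11x the baseline
# four ticks after a single spike.
```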
@nathan_lstc: is this better out of the box? (bar pycuda issue..)
increase timer frequency so we can verify the change more clearly
I've just tried on two systems with GTX cards:
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py
10:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) (prog-if 00 [VGA controller])
* 1547391204.69
initilizing device
device=<pycuda._driver.Device object at 0x7f5304885890>
context=<pycuda._driver.Context object at 0x7f52e7a29cf8>
* 1547391204.8
done
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
* 1547391242.08
initilizing device
device=<pycuda._driver.Device object at 0x7fe51232e998>
context=<pycuda._driver.Context object at 0x7fe4f5c32cf8>
* 1547391242.25
done
The slowest one takes 170ms, which isn't very high, but it is enough to verify the changes with the updated test script firing the timer every ~10ms:
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
* 1547398645.759
* 1547398645.765
loading pycuda
* 1547398645.779
* 1547398645.793
* 1547398645.805
* 1547398645.805
driver init
* 1547398645.815
* 1547398645.824
initializing device
* 1547398645.824
device=<pycuda._driver.Device object at 0x7ff8569cdcb0>
* 1547398645.825
* 1547398645.836
* 1547398645.847
* 1547398645.857
* 1547398645.867
* 1547398645.878
* 1547398645.888
* 1547398645.898
* 1547398645.908
* 1547398645.914
context=<pycuda._driver.Context object at 0x7ff83a038de8>
all done
The timer now runs regularly, even during the call to make_context.
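This kind of verification can also be automated instead of eyeballed: run a ticker thread and record the largest gap between its wakeups while the suspect call runs; a gap much larger than the tick interval means the call held the GIL. A small sketch, using time.sleep as a stand-in for make_context (sleep releases the GIL, so the gaps stay small here):

```python
import time
from threading import Thread, Event

def ticker(stop, gaps, interval=0.01):
    # Record the gap between consecutive wakeups; if another thread
    # holds the GIL, this thread cannot run and the gap grows.
    last = time.time()
    while not stop.is_set():
        time.sleep(interval)
        now = time.time()
        gaps.append(now - last)
        last = now

stop, gaps = Event(), []
t = Thread(target=ticker, args=(stop, gaps))
t.start()
time.sleep(0.3)        # stand-in for the suspect call (GIL released)
stop.set()
t.join()
print("worst gap: %.3fs" % max(gaps))
```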
r21358 adds a patch to pycuda, which we can submit upstream after further testing. I've uploaded new python-pycuda RPM packages to the https://xpra.org/beta repository.
@nathan_lstc: does that work for you? You should no longer see spurious congestion events during nvenc context initialization.
I have audited the few remaining slow calls to pycuda and nvenc:
- copy_image calls either memcpy_htod or memcpy_dtod, and both already allow other threads to run
- nvEncInitializeEncoder and nvEncDestroyEncoder already use Cython's nogil
- cuda_context.detach() can take ~25ms; I have not modified it because it uses CUDAPP_CALL_GUARDED_CLEANUP, which does not have a THREADED variant, and I think there might well be a reason for that - this can be revisited if the call takes much longer on some setups
The user who reported this problem has tested and he reports that it works. I have likewise tested and it works (but I never saw the problem very much to begin with).
So, yes, I think it works! Thanks!
Based on comment:35, I've created #2111. As per comment:47 and others, without the log sample we won't be improving the detection heuristics so you will need to use the dbus video-region api to avoid context recycling.
I hope that the congestion events were caused by pycuda and that they're now all gone.
At last, closing!
I totally forgot about the log sample.
I have, indeed, resolved my problem with the help of DBUS. XPRA is working great for LS-PREPOST now.
However, I am happy to help you improve XPRA more generally! If you are still interested, I will do it right away - just let me know. Shall I do it?
Open a new ticket, perhaps?
Open a new ticket, perhaps?
Yes please, something like "better video region detection". With a log of screen updates, and what the video region should have been. Then we can see if we can tune the code to auto-detect it properly, given time..
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/2022