Unless I run xpra with only h264 enabled,
/usr/bin/python2 /usr/bin/xpra start -d encoder --start=xterm --no-daemon --video-encoders=nvenc --encodings=h264 :57
I can observe from the debug data that very few frames are actually rendered using NVENC, even when I have h264 selected from the UI. When I run it using the above command, it works wonderfully. I'm using r20858 on CentOS 7 with VirtualGL on the server and an r20882 client on Windows. The driver is 390.87. I'm using a C2075 for render and a K20c for encode.
I do not think the problem has much to do with my hardware or revision, because I'm having this trouble on all of my configurations, including a brand-new computer (the other one is pretty old) with a GTX 1060 running the release 2.3.4 server with the release 2.4 client.
Nathan
Xpra will only use video encoders when it thinks it is necessary; for more details on the original feature see #410 - there have been many updates and fixes since.
Roughly speaking, the video encoder will not be triggered unless something on screen refreshes rapidly enough.
By only allowing h264, you are bypassing all the automatic encoding logic.
To test, you can run glxgears or anything else that generates lots of frames and verify that NVENC does get used.
If you feel that your screen updates would be better compressed with a video encoder and that the automatic logic is too slow to switch to the video encoder, you can debug the video region detection with -d regiondetect, and the encoding selection logic with -d encoding.
Bear in mind that GPU encoders take longer to set up, so it is best to only use them when we're certain that the content is video-ish and is going to keep producing screen updates regularly.
Maybe we could tweak the logic to switch to video sooner when we know we're dealing with an efficient encoder (ie: NVENC).
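To make that concrete, here is an illustrative sketch (not xpra's actual code; the 5fps threshold and 1 second window are invented numbers) of the kind of rate check involved:

```python
def looks_like_video(update_times, now, min_fps=5.0, window_s=1.0):
    """Treat a screen region as "video-ish" if it received at least
    min_fps updates per second over the last window_s seconds.
    This is a toy illustration of the kind of heuristic involved,
    not xpra's real detection logic."""
    recent = [t for t in update_times if now - t <= window_s]
    return len(recent) >= min_fps * window_s

# A burst of 10 updates in half a second qualifies:
burst = [10.0 + i * 0.05 for i in range(10)]
print(looks_like_video(burst, now=10.5))        # True
# Two updates in the last second do not:
print(looks_like_video([9.8, 10.4], now=10.5))  # False
```

Lowering such a threshold for cheap-to-start encoders is the kind of tweak meant above.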
Here's a "-d compress" log. For context, this log was collected during a *sustained* rotation operation on an OpenGL rendering of a finite element model of vehicle impact (it is a standard benchmark for us). The log looks like the excerpt below the whole way through. When XPRA jumps between RGB and YUV, the screen briefly locks hard.
I have my system tray UI set to speed=auto, quality=auto and encoding = h264.
There are two cases I want to include:
(1) Without "XPRA_FORCE_CSC_MODE=BGRX" it is slow and sometimes throws "fail to encode" errors for NVENC, especially when I set speed to 100 (I had speed set to auto). In this case it didn't throw those errors, but the whole picture *did* turn yellow a few times during the collection of this log.
(2) With XPRA_FORCE_CSC_MODE=BGRX it is also very slow, but not as slow as before (though nothing like the speed and perfection of having only NVENC enabled). There were noticeable jerks when it switched between encodings, but never any NVENC encoding errors nor random color changes like when BGRX isn't set.
Case (1): plain, no XPRA_FORCE_CSC_MODE.
2018-11-01 13:12:06,968 compress: 1.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 134, client_options={'zlib': 5, 'delta': 181, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 186}
2018-11-01 13:12:07,553 compress: 582.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.0% ( 6128KB to 61KB), sequence 135, client_options={u'speed': 41, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 20}
2018-11-01 13:12:07,695 compress: 2.0ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 136, client_options={'zlib': 5, 'delta': 186, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 189}
2018-11-01 13:12:08,232 compress: 533.5ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.0% ( 6128KB to 60KB), sequence 137, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,318 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 138, client_options={'zlib': 5, 'delta': 189, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 192}
2018-11-01 13:12:08,365 compress: 46.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.8% ( 6128KB to 48KB), sequence 139, client_options={u'speed': 46, u'frame': 1, u'pts': 626, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,506 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 140, client_options={'zlib': 5, 'delta': 192, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 195}
2018-11-01 13:12:08,548 compress: 40.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.6% ( 6128KB to 34KB), sequence 141, client_options={u'speed': 46, u'frame': 2, u'pts': 814, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,696 compress: 1.4ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 142, client_options={'zlib': 5, 'delta': 195, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 198}
2018-11-01 13:12:09,204 compress: 505.4ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.0% ( 6128KB to 58KB), sequence 143, client_options={u'speed': 44, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 44}
2018-11-01 13:12:09,300 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 144, client_options={'zlib': 5, 'delta': 198, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 201}
2018-11-01 13:12:09,914 compress: 612.8ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 2.2% ( 6128KB to 137KB), sequence 145, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:12:09,924 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 146, client_options={'zlib': 5, 'delta': 201, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 204}
2018-11-01 13:12:10,584 compress: 658.1ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.2% ( 6128KB to 73KB), sequence 147, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 24}
2018-11-01 13:12:10,648 compress: 0.4ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 148, client_options={'zlib': 5, 'delta': 204, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 207}
2018-11-01 13:12:11,210 compress: 561.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.2% ( 6128KB to 70KB), sequence 149, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:11,231 compress: 0.1ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 150, client_options={'zlib': 6, 'delta': 207, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 210}
2018-11-01 13:12:11,802 compress: 569.8ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.4% ( 6128KB to 23KB), sequence 151, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:11,822 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 152, client_options={'zlib': 6, 'delta': 210, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 213}
2018-11-01 13:12:11,846 compress: 23.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.2% ( 6128KB to 13KB), sequence 153, client_options={u'speed': 30, u'frame': 1, u'pts': 591, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:12,834 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 154, client_options={'zlib': 7, 'delta': 213, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 216}
2018-11-01 13:12:13,367 compress: 532.6ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.3% ( 6128KB to 21KB), sequence 155, client_options={u'speed': 28, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 4}
2018-11-01 13:12:13,368 compress: 0.1ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 156, client_options={'zlib': 7, 'delta': 216, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 219}
2018-11-01 13:12:13,423 compress: 52.1ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.1% ( 6128KB to 8KB), sequence 157, client_options={u'speed': 28, u'frame': 1, u'pts': 453, 'csc': 'YUV444P', u'quality': 4}
2018-11-01 13:12:13,744 compress: 0.8ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 158, client_options={'zlib': 6, 'delta': 219, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 222}
2018-11-01 13:12:14,316 compress: 569.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.4% ( 6128KB to 22KB), sequence 159, client_options={u'speed': 32, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 8}
Case (2): with XPRA_FORCE_CSC_MODE=BGRX.
2018-11-01 13:09:17,544 compress: 0.5ms for 36x36 pixels at 1596,411 for wid=2 using rgb24 with ratio 37.5% ( 5KB to 1KB), sequence 323, client_options={'zlib': 5, 'delta': 401, 'bucket': 1, 'rgb_format': 'BGRX', 'flush': 1, 'store': 469}
2018-11-01 13:09:17,547 compress: 1.1ms for 1512x16 pixels at 2,941 for wid=2 using jpeg with ratio 1.7% ( 94KB to 1KB), sequence 324, client_options={'quality': 49}
2018-11-01 13:09:18,099 compress: 89.3ms for 1636x959 pixels at 0,0 for wid=2 using webp with ratio 2.6% ( 6128KB to 161KB), sequence 325, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
2018-11-01 13:09:18,234 compress: 0.6ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 326, client_options={'zlib': 4, 'delta': 460, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 475}
2018-11-01 13:09:18,814 compress: 577.6ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 4.7% ( 6128KB to 290KB), sequence 327, client_options={u'speed': 39, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:18,835 compress: 0.5ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 328, client_options={'zlib': 6, 'delta': 475, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 478}
2018-11-01 13:09:18,892 compress: 56.2ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 2.2% ( 6128KB to 133KB), sequence 329, client_options={u'speed': 39, u'frame': 1, u'pts': 606, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:19,385 compress: 0.7ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.9% ( 6KB to 0KB), sequence 330, client_options={'flush': 1, 'zlib': 6, 'bucket': 0, 'store': 481, 'rgb_format': 'BGRX'}
2018-11-01 13:09:19,442 compress: 55.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 3.2% ( 6128KB to 193KB), sequence 331, client_options={u'speed': 39, u'frame': 2, u'pts': 1156, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:19,841 compress: 1.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 332, client_options={'zlib': 5, 'delta': 481, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 484}
2018-11-01 13:09:20,438 compress: 592.9ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.4% ( 6128KB to 26KB), sequence 333, client_options={u'speed': 41, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 7}
2018-11-01 13:09:20,569 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 334, client_options={'zlib': 5, 'delta': 484, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 487}
2018-11-01 13:09:20,620 compress: 50.2ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.3% ( 6128KB to 16KB), sequence 335, client_options={u'speed': 41, u'frame': 1, u'pts': 730, 'csc': 'YUV444P', u'quality': 7}
2018-11-01 13:09:20,832 compress: 0.8ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 336, client_options={'zlib': 5, 'delta': 487, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 490}
2018-11-01 13:09:21,400 compress: 562.8ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.7% ( 6128KB to 41KB), sequence 337, client_options={u'speed': 47, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 23}
2018-11-01 13:09:21,504 compress: 0.7ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 338, client_options={'zlib': 5, 'delta': 490, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 493}
2018-11-01 13:09:21,514 compress: 1.2ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 8 rectangles ( 6128KB) , sequence 340, client_options={'flush': 2}
2018-11-01 13:09:21,543 compress: 28.7ms for 1636x561 pixels at 0,144 for wid=2 using jpeg with ratio 0.9% ( 3585KB to 31KB), sequence 341, client_options={'quality': 23, 'optimize': True, 'flush': 1}
2018-11-01 13:09:21,547 compress: 3.7ms for 1636x61 pixels at 0,763 for wid=2 using jpeg with ratio 0.6% ( 389KB to 2KB), sequence 342, client_options={'quality': 23, 'optimize': True}
2018-11-01 13:09:21,699 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 342, client_options={'zlib': 5, 'delta': 493, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 496}
2018-11-01 13:09:21,707 compress: 0.8ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 1 rectangles ( 6128KB) , sequence 344, client_options={'flush': 1}
2018-11-01 13:09:21,740 compress: 32.2ms for 1636x606 pixels at 0,260 for wid=2 using jpeg with ratio 1.1% ( 3872KB to 42KB), sequence 345, client_options={'quality': 23, 'optimize': True}
2018-11-01 13:09:21,929 compress: 0.2ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 345, client_options={'zlib': 5, 'delta': 496, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 499}
2018-11-01 13:09:22,522 compress: 591.0ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 1.7% ( 6128KB to 101KB), sequence 346, client_options={u'speed': 49, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,523 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 347, client_options={'zlib': 5, 'delta': 499, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 502}
2018-11-01 13:09:22,571 compress: 47.3ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.9% ( 6128KB to 56KB), sequence 348, client_options={u'speed': 49, u'frame': 1, u'pts': 531, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,634 compress: 1.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 349, client_options={'zlib': 5, 'delta': 502, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 505}
2018-11-01 13:09:22,688 compress: 51.2ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.7% ( 6128KB to 41KB), sequence 350, client_options={u'speed': 49, u'frame': 2, u'pts': 703, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,827 compress: 0.7ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 351, client_options={'zlib': 5, 'delta': 505, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 508}
2018-11-01 13:09:23,371 compress: 542.7ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.8% ( 6128KB to 49KB), sequence 352, client_options={u'speed': 51, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:09:23,427 compress: 55.6ms for 1636x959 pixels at 0,0 for wid=2 using webp with ratio 2.0% ( 6128KB to 122KB), sequence 353, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
2018-11-01 13:09:23,495 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 354, client_options={'zlib': 5, 'delta': 508, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 513}
2018-11-01 13:09:23,542 compress: 45.6ms for 1636x958 pixels at 0,0 for wid=2 using h264 with ratio 0.5% ( 6128KB to 28KB), sequence 355, client_options={u'speed': 51, u'frame': 1, u'pts': 669, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:09:23,654 compress: 0.9ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 356, client_options={'zlib': 5, 'delta': 513, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 516}
2018-11-01 13:09:23,663 compress: 0.8ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 1 rectangles ( 6128KB) , sequence 358, client_options={'flush': 1}
2018-11-01 13:09:23,683 compress: 20.4ms for 1636x601 pixels at 0,266 for wid=2 using jpeg with ratio 0.9% ( 3840KB to 35KB), sequence 359, client_options={'quality': 24}
2018-11-01 13:09:23,986 compress: 0.3ms for 1636x1 pixels at 0,958 for wid=2 using rgb24 with ratio 0.8% ( 6KB to 0KB), sequence 359, client_options={'zlib': 5, 'delta': 516, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 519}
2018-11-01 13:09:23,999 compress: 0.6ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 3 rectangles ( 6128KB) , sequence 361, client_options={'flush': 4}
2018-11-01 13:09:24,016 compress: 16.3ms for 1636x516 pixels at 0,312 for wid=2 using jpeg with ratio 1.1% ( 3297KB to 35KB), sequence 362, client_options={'quality': 26, 'flush': 3}
2018-11-01 13:09:24,018 compress: 1.0ms for 1636x4 pixels at 0,836 for wid=2 using png with ratio 0.8% ( 25KB to 0KB), sequence 363, client_options={'flush': 2}
2018-11-01 13:09:24,020 compress: 1.8ms for 1636x5 pixels at 0,915 for wid=2 using png with ratio 4.6% ( 31KB to 1KB), sequence 364, client_options={'flush': 1}
2018-11-01 13:09:24,023 compress: 1.5ms for 1636x12 pixels at 0,926 for wid=2 using jpeg with ratio 3.4% ( 76KB to 2KB), sequence 365, client_options={'quality': 26}
2018-11-01 13:09:24,619 compress: 73.4ms for 1636x959 pixels at 0,0 for wid=2 using webp with ratio 1.9% ( 6128KB to 116KB), sequence 365, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
First things first: something is not right with your setup if rgb ends up using zlib compression. You should have lz4 installed, and if not, you should see a big warning in your server log on startup.
You can confirm the status by running xpra/net/net_util.py on Linux and Network_info.exe on win32.
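If you want to sanity-check the python bindings by hand, a minimal sketch (assuming the lz4.block API of the python-lz4 package; older releases exposed a different top-level API):

```python
def have_lz4():
    """Return True if working python lz4 bindings are available.
    Without them, xpra falls back to zlib for rgb packets."""
    try:
        from lz4 import block
    except ImportError:
        return False
    data = b"xpra" * 256
    try:
        # round-trip a buffer to make sure the module actually works
        return block.decompress(block.compress(data)) == data
    except Exception:
        return False

print("lz4 available:", have_lz4())
```

This only checks the python side; net_util.py above is still the authoritative report of what xpra will actually use.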
As for the logs, I meant "-d compress" as well as, or combined with, the -d regiondetect and -d encoding from comment:1.
From the first log sample, most of the time it seems to be taking 500ms to compress at a resolution of 1636x958. That's dreadfully slow. You may want to raise min-speed and leave the encoding set to "auto" so it can at least switch to vp8 / vp9 when not using NVENC.
Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.
it is slow and throws "fail to encode errors" for NVENC sometimes
Please post the exact error message, including the full stacktrace.
You may need to add -d video,encoder to see it.
That could also cause encoder selection problems; we should be able to fix those, or at least avoid the conditions that trigger it.
with XPRA_FORCE_CSC_MODE=BGRX
Why are you changing this switch exactly?
Setting a high enough (min-)quality should give the same result.
There are two ways that NVENC can handle BGRX input: natively for the card and driver combinations that support it (this can be turned off with XPRA_NVENC_NATIVE_RGB=0), or via a CUDA conversion step otherwise.
It's not clear what your card supports.
Please post the output of the server startup with -d nvenc.
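For reference, the CUDA conversion step mentioned above is just a colorspace transform. A minimal NumPy sketch of BGRX to YUV444 (assuming full-range BT.601 coefficients; the real CUDA kernel does the same math on the GPU, much faster):

```python
import numpy as np

def bgrx_to_yuv444(bgrx):
    """Convert an (H, W, 4) uint8 BGRX image to three full-range
    BT.601 YUV444 planes. Illustrative only."""
    b = bgrx[..., 0].astype(np.float32)
    g = bgrx[..., 1].astype(np.float32)
    r = bgrx[..., 2].astype(np.float32)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.14713 * r - 0.28886 * g + 0.436 * b + 128.0
    v = 0.615 * r - 0.51499 * g - 0.10001 * b + 128.0
    return [np.clip(np.rint(p), 0, 255).astype(np.uint8) for p in (y, u, v)]

# A uniform gray image maps to Y=gray, U=V=128 (no chroma):
gray = np.full((4, 4, 4), 100, dtype=np.uint8)
y, u, v = bgrx_to_yuv444(gray)
```

The point is that this step costs real time per frame, which is why skipping it via native BGRX input (when the card supports it) matters.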
Either way, NVENC should win the scoring compared to doing the CSC step via another module (libyuv or whatever).
There are some scroll encoding packets in the middle, those can cause problems when encoding video as we then skip frames and encode them differently. You can play with raising:
XPRA_SCROLL_MIN_PERCENT=80
Or even disable scroll encoding altogether:
XPRA_SCROLL_ENCODING=0
As for the IRC discussion regarding setting hints, we can worry about that once the encoding selection bug is fixed - it will help (see #1952 and #2023).
After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong. For some reason, on my computer, MINGW_BUILD.sh doesn't include lz4 correctly for version 2.1.0 or 2.1.1. When I backed down to version 1.1.0 my Windows build correctly includes lz4, and I see this:
2018-11-02 13:01:50,174 compress: 0.3ms for 1636x959 pixels at 0,0 for wid=2 using scroll as 1 rectangles ( 6128KB) , sequence 111, client_options={'flush': 2}
2018-11-02 13:01:50,315 compress: 141.1ms for 1636x593 pixels at 0,233 for wid=2 using rgb24 with ratio 1.7% ( 3789KB to 66KB), sequence 112, client_options={'flush': 1, 'lz4': 1, 'rgb_format': 'BGRX'}
2018-11-02 13:01:50,320 compress: 4.2ms for 1636x20 pixels at 0,915 for wid=2 using rgb24 with ratio 3.4% ( 127KB to 4KB), sequence 113, client_options={'lz4': 1, 'rgb_format': 'BGRX'}
2018-11-02 13:01:50,740 compress: 8.3ms for 1636x959 pixels at 0,0 for wid=2 using rgb24 with ratio 3.3% ( 6128KB to 199KB), sequence 113, client_options={'lz4': 1, 'rgb_format': 'BGRX'}
Version 2.1.0 works only if I run the python script from within the mingw shell. However, lz4 doesn't seem to be getting put into the build directory correctly when I use version 2.1.0. Consequently, it doesn't make it into the installer. I'm not sure why it's wrong, and I'll look into it a little more before giving up and moving on to your other suggestions. (This problem is less important to me than the encoder one, especially since I've got a workaround, namely downgrading to 1.1.0.)
Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.
By running python with "-S" in the "dist/lib" directory I've determined that lz4 2.1.1 works *if* I copy the "egg" data into "dist/lib".
Python is far outside my core expertise, so I can only speculate about the right fix. Maybe the install_exe function of setup.py needs to include the "egg" directories...
Command line:
xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc :57
For this capture I had the encoder set to h264 (everything else was set to auto). I rendered a 3D model and rotated it. I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some). This particular run was *very* jerky on the gigabit LAN. I have no idea why.
If you are interested I can send you all the software I am running.
raw-attachment/ticket/2022/1.txt
Same as before except everything is set to "auto" including encoder. Very slow.
In this case I disabled all encodings except rgb and h264. When I begin to rotate it's jerky, but once it gets going it's great!
Command line:
xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc --encodings=h264,h265,rgb :57
Output:
Log for second try.
third example
re-adding attachment with txt extension so it can be viewed online
After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong. For some reason, on my computer,
MINGW_BUILD.sh
doesn't include lz4 correctly for version 2.1.0 or 2.1.1.
So you're using custom builds rather than our official releases. Please make sure to mention that every time, as per wiki/ReportingBugs. And test with official builds too, to make sure that the problem doesn't come from your builds. As for lz4, this is a well-documented issue: ticket:1883#comment:4, with its own ticket: #1929.
Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.
It wasn't meant to. It was essential to find out why your output was not the one that was expected. And it will help with performance (outdated benchmark here: https://xpra.org/stats/lz4/)
I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some)
"scroll" encoding is not used for windows we correctly tag as "video", which is why you should look into tagging your video windows.
xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc --encodings=h264,h265,rgb :57
As I mentioned before, that's a very bad idea:
Considering those important configuration issues, the logs you attached are not very telling. Please see comment:3 :
Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.
Please also include both:
-d score,video
-d score,encoding,video,scaling,csc,stats
(just in case details are missing from the first one)
What I do see in there:
send_delayed_regions: using full window update 1636x959 as h264, from <bound method WindowVideoSource.get_best_encoding_video of WindowVideoSource(2 : (1636, 959))>
do_video_encode(h264, XShmImageWrapper(BGRX: 0, 0, 1636, 959), {})
make_instance() nvenc - instance count=0
So it is choosing the correct encoding (h264) and the correct encoder (nvenc), but for some reason the context gets recycled every time. And since nvenc is slow to set up, that's probably where the delay comes from.
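To put rough numbers on that (a toy model only, using the ~500ms setup and ~50ms per-frame encode times visible in the logs):

```python
def total_encode_ms(frames, context_reused, setup_ms=500.0, encode_ms=50.0):
    """Toy cost model: if the nvenc context is reused, the ~500ms setup
    cost is paid once; if it gets recycled, it is paid on every frame."""
    if context_reused:
        return setup_ms + frames * encode_ms
    return frames * (setup_ms + encode_ms)

print(total_encode_ms(10, context_reused=True))   # 1000.0 -> ~10fps after warmup
print(total_encode_ms(10, context_reused=False))  # 5500.0 -> under 2fps
```

Which matches what the log shows: ~550ms per frame whenever "frame: 0" (a fresh context) appears, versus ~50ms for subsequent frames.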
send_delayed for wid 2, batch delay is 233ms, elapsed time is 233ms
So something already happened before that which reduced the framerate to less than 5fps.
Could well be the wacky settings, or the re-cycling issue; -d stats should be enough to see why that is.
I *have* been testing with both builds consistently on the client side, and I've been using the builds from the winswitch beta on CentOS 7. There are no other surprises in my configs. I have not, for instance, set LD_LIBRARY_PATH to anything interesting. I had not seen wiki/ReportingBugs before (I should have thought to look for such a document). My apologies!
Regarding tagging, can you point me to some information on how that works? (I'm ignorant about how a program passes this kind of information through.) I will pass it on to our UI guys, and I expect that if it's straightforward I can get them to put it in our UI trunk right away. (They are using wxWidgets with gtk2 on X. Our software's window has a rectangular video region surrounded by buttons.)
Regarding the additional debug captures, I will try to do that in a few hours.
Again, thank you.
Regarding tagging, can you point me to some information on how that works
It's all in this ticket which is assigned to you since I knew you would be interested: ticket:2023#comment:1. There's a trivial example included: browser/xpra/trunk/src/tests/xpra/test_apps/test_window_content_type.py.
Basically, set the _XPRA_CONTENT_TYPE X11 window property to what you want (ie: "video" or "text").
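For example, with python-xlib, a helper along these lines should work (a sketch only: the window id and the use of python-xlib are assumptions, not necessarily how a wxWidgets app would do it; gdk/gtk property calls or a plain xprop -set achieve the same thing):

```python
def tag_content_type(dpy, window, content_type="video"):
    """Set the _XPRA_CONTENT_TYPE property on an X11 window so that
    xpra treats it as "video" (or "text") content.
    `dpy` is an Xlib Display and `window` an Xlib Window object."""
    prop = dpy.intern_atom("_XPRA_CONTENT_TYPE")
    string_type = dpy.intern_atom("STRING")
    # format=8 means the property data is an array of 8-bit items:
    window.change_property(prop, string_type, 8, content_type.encode("latin-1"))
    dpy.flush()

# Typical usage (assumes python-xlib and a running X server):
#   from Xlib import display
#   dpy = display.Display()
#   win = dpy.create_resource_object("window", 0x2c00004)  # id from xwininfo
#   tag_content_type(dpy, win, "video")
```

You can check that the property took effect with xprop on the window.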
I'm doing the first requested test: debug on score and video. My command line:
xpra start -d score,video --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --video-encoders=nvenc :57 2>&1 |tee 4.txt
The "vglrun" command is so that I can spare you all the startup logging. Once I connected to the server I went straight to rotating.
Same exercise but with -d score,video,compress
Same exercise but with -d score,encoding,video,scaling,csc,stats
Same exercise but with -d score,encoding,video,scaling,csc,stats,compress
Same exercise but with -d score,encoding,video,scaling,csc,stats,compress using the r20905 client from the web site.
After uploading the 7th one, only as a control, I tried "--encodings=h264,h265" (this is not what I did for 8.txt). I found that it locks up just a tiny little bit at the beginning, unless I let it settle before setting in on rotating. For 8.txt (which had all encoders enabled and was with the stock client) I was extra careful to let it "settle". As I expected, with settling it still has the issue for which I opened this ticket, but I figure it's best to isolate one problem at a time. I therefore regard 8.txt as the nicest debug log.
I've forwarded this to our chief UI developer. I'll let you know when it's done.
Replying to Antoine Martin:
Regarding tagging, can you point me to some information on how that works
It's all in this ticket which is assigned to you since I knew you would be interested: ticket:2023#comment:1. There's a trivial example included: browser/xpra/trunk/src/tests/xpra/test_apps/test_window_content_type.py. Basically, set the _XPRA_CONTENT_TYPE X11 window property to what you want (ie: "video" or "text").
As per previous comments, do not disable the other video encoders.
What I found in the logs despite that:
DPI set to 45 x 25 (wanted 96 x 96)
You should be using the patched dummy driver from the repo.
video size 484x92 out of range for nvenc, min 128x128 (..) Error: no video pipeline options found for BGRX at 485x93
CUDA 9.2.0 / PyCUDA 2018.1, found 1 device: + GeForce GTX 1060 6GB @ 0000:AF:00.0 (memory: 98% free, compute: 6.1) (..) setup_pipeline failed for (84, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 520, 188, video_spec(nvenc)): cannot open encoding session, context is NULL
And that causes failures when we reach the limit. 2 contexts should be enough if you only have one "video" window though. When we do hit this limit, things go downhill fast:
setup_pipeline(..) failed! took 2197.59ms
That's the encoding pipeline blocked for over 2 seconds! ouch.
update_batch_delay: congestion : 1.87,8.74 {}
You may want to turn off bandwidth detection to see if that helps, with --bandwidth-detection=no. (#1912)
Or post the -d bandwidth log output so we can fix this.
But then it also goes up because of this:
update_batch_delay: compression-work-queue : 2.64,5.12 {u'target': 1000, u'weight_multiplier': 3648, u'smoothing': u'logp', u'aim': 250, u'aimed_avg': 5247, u'div': 1000, u'avg': 12057, u'recent': 13314}
That's nvenc hogging the compression work queue because it is slow to set up, and even worse when it hits the 2 context limit and fails (2 seconds as per above)... not much we can do about this.
check_pipeline('auto', 451, 134, 'BGRX') setting up a new pipeline as check failed - encodings=['h264', 'h265']
Why it failed is not logged, so I can't be sure, but here are some possibilities:
r20927 also adds an env var you can use to turn on video context cleanup debugging:
XPRA_DEBUG_VIDEO_CLEAN=1 xpra start ...
This should tell us why the video context is being destroyed so quickly.
There are new centos 7.5 builds here: https://xpra.org/beta.
I did not understand what you meant by "do not disable the other video encoders". For some reason I thought you meant "don't use --encodings="; I just now realize you don't want "--video-encoders=" either. I apologize for that. I was not being intentionally obtuse. I am very grateful for your help and do not mean to waste your time!
I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it. Is there an easy way to buy or request licenses from NVIDIA (for testing)?
All of my users have GTX cards and, except for a solitary Tesla card, that's all I have. I have two Kepler-era K20c Tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA. I infer older cards have some limitations in this regard. That card, by the way, does work great for the render window if I add the "CSC_MODE=BGRX" environment variable and an --encodings option. I think that card can also be made to work. If getting a Quadro (or some other card) will expedite this, I'll buy whatever you recommend. Any recommendations?
Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine), so hopefully we can eventually use GTX cards. I'll try to get into the office later today so that I can perform the followup tests you requested.
Thanks again!
I did not understand what you meant
My bad, a lot of things that seem obvious to me probably aren't.
I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it
If we can fix all the bugs, you should be OK with just 2 contexts.
Is there an easy way to buy or request licenses from NVIDIA (for testing)?
I don't know.
I have two kepler-era K20c tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA
Could be bugs with lossless and / or YUV444 mode with those older cards. If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.
If getting a Quadro (or some other card) will expedite this I'll buy whatever you recommend. Any recommendations?
I've used both gamer cards (GTX 7xx to GTX 10xx - with and without license keys) and Quadros myself. There is absolutely no difference in performance; the Quadros just cost at least 4 times as much. See the Video Encode and Decode GPU Support Matrix.
Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine)
Tagging windows should allow you to do that.
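Window tagging is done with the _XPRA_CONTENT_TYPE X property. A hedged example using xprop (the window id below is a placeholder; get the real one from xwininfo or wmctrl -l):

```shell
# Tag a window so xpra treats its contents as video.
# 0x2e000c4 is a placeholder window id - substitute your own.
xprop -id 0x2e000c4 -f _XPRA_CONTENT_TYPE 8u -set _XPRA_CONTENT_TYPE video
```

If the application sets the property itself (as discussed later in this ticket), no xprop call is needed.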
I've installed the dummy driver. I've updated to r20927. I'm running with the environment variable you specified. I've turned off bandwidth detection. Bandwidth detection made no noticeable difference. The DPI problem seems to be resolved (based on my browsing of the log file).
I've run:
xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no :57 2>&1 |tee 11.txt
Here's the full log:
After enabling XPRA_DEBUG_VIDEO_CLEAN=1 I'm getting this:
2018-11-04 14:27:23,066 video_context_clean()
  File "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/site-packages/xpra/server/background_worker.py", line 62, in run
    item()
  File "/usr/lib64/python2.7/site-packages/xpra/server/source/encodings_mixin.py", line 160, in recalculate_delays
    ws.reconfigure()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1102, in reconfigure
    self.update_encoding_options(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1035, in update_encoding_options
    self.check_pipeline_score(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1131, in check_pipeline_score
    self.video_context_clean()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 300, in video_context_clean
    traceback.print_stack()
2018-11-04 14:27:23,069 video_context_clean() done
Replying to Antoine Martin:
I did not understand what you meant
My bad, a lot of things that seem obvious to me probably aren't.
I very much appreciate your patience!
If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.
I just tried this exercise from the gigabit LAN. This older config is working much better than the GTX config. With bandwidth detection turned off my eyeball cannot detect anything wrong (when bandwidth detection is on it locks occasionally). Last time I saw the colors go weird I was using broadband, so I'll go back home and try that (I came to the office for this test).
For completeness I'm posting my experience on gigabit on my old config. I'm rendering to a C2075 (which is so old it cannot encode) and encoding on a K20c (which doesn't support rendering).
Command:
xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no :57 2>&1 |tee 12.txt
Environment:
[nathan@bobross lsprepost4.5_centos7]$ env | grep XPRA
XPRA_CUDA_DEVICE_NAME=Tesla K20c
XPRA_CUDA_DEVICE=0
XPRA_SCROLL_ENCODING=0
[nathan@bobross lsprepost4.5_centos7]$
Update: From home I find that with nvenc and only nvenc enabled (--encodings) performance is great. With normal options (nothing disabled) performance is bad and my client eventually crashes. I've attached the server log from that.
This older config is working much better than the GTX config
Probably because of the 2 context limit on gamer cards.
From this new log:
get_video_pipeline_options(['h264', 'vp9', 'vp8', 'mpeg1', 'mpeg2'], 3840, 1027, 'BGRX') \ scores=[ (81, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 3840, 1026, video_spec(nvenc)), (..) ]
So disabling the other video encoders is counter-productive.
XPRA_DEBUG_VIDEO_CLEAN=1
so I still cannot tell why the video encoder context gets recycled...
The sample from comment:17 lacks the log context around the backtrace. A single backtrace is not enough in any case, some video context re-cycling is normal. That said, there aren't many instances of re-cycling during this test, so this may not tell us much.
video_encode nvenc encoder: h264 1462x756 result is 148828 bytes (56.9 MPixels/s), \ client options={u'speed': 94, u'frame': 778, u'pts': 19778, 'csc': 'YUV444P', u'quality': 97}
The batch delay was also very good (low) at that point:
calculate_batch_delay for wid=1 current batch delay=8, last update 1 seconds ago
You seem to be getting at least 25 fps, reaching closer to 50fps near the end of the log - which is quite impressive. Was it not smooth at that point?
At which point in time in this log file am I looking for problems?
With bandwidth detection turned off my eyeball cannot detect anything wrong
Maybe no problems in this log then?
I agree that the K20c run on the LAN was good (I meant to indicate that in my original post). I've since added 14.txt (I did not realize that you had already replied, otherwise I would have made a new reply).
14.txt was on broadband. Over broadband I'm seeing that --encodings=h264 is much better than the default options (I am only sending logs for the default options per your request).
Regarding the environment variable, you are right. I did forget to add the new environment variable on 12.txt and 14.txt. However, I did have the environment variable set correctly on 11.txt (my most recent run on the gtx). I'll redo 12.txt and 14.txt in the morning.
Ah, didn't see that 11.txt had the backtraces.
From this 11.txt log:
setup_pipeline failed for (78, (1, 1), (1, 1), 520, 188, video_spec(swscale), u'XRGB', (1, 1), 520, 188, video_spec(nvenc)): \ cannot open encoding session, context is NULL
and we try a few other nvenc combinations (like nvenc without swscale, etc) - which takes time. but later on we re-try and it succeeds:
check_pipeline_score(False) \ found a better video encoder class than \ <type 'xpra.codecs.enc_x264.encoder.Encoder'>: (86, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 520, 188, video_spec(nvenc))
We could use the failures to detect the actual context limit value for each card, and then use this value to avoid trying to create new contexts when the limit is reached. (We have code for doing this already - it just uses the wrong limit: the default is 32.) It is difficult to see why this happened just from the log output. If you don't have too many windows, then I suspect it could be because the encoder context cleanup happens in the encode thread: when we destroy an nvenc encoder (for resizing or whatever), it isn't actually freed immediately, whereas we may try to create the new one immediately. I'm not sure how easy it would be to change that.
video subregion was None, now rectangle(3, 27, 1558, 879) (window size: 1652,998)
video subregion was rectangle(3, 27, 1558, 879), now None (window size: 1652,998)
video subregion was None, now rectangle(3, 27, 1558, 879) (window size: 1652,998)
video_encoder_timeout
There's a dbus API you could use to turn off video region detection, or to set a particular region: #1060, #1401. Maybe we could just enhance the content-type to support values like "video:full", "video:subregion" or something like that. Re-cycling contexts is expensive with nvenc.
With normal options (nothing disabled) performance is bad and my client eventually crashes.
How did it crash? Can you run it from the command line and see its output? Please try with the latest builds, preferably the 64-bit ones.
From the 14.txt log:
Warning: XShm get_pixels_ptr XImage is NULL
nvenc(XRGB/NV12/H264 - hp - 520x188 ).compress_image(XShmImageWrapper(BGRX: 0, 0, 520, 189), 40, 99, {'av-sync': True, 'bandwidth-limit': 0, 'content-type': ''})
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1921, in do_video_encode
    ret = ve.compress_image(csc_image, quality, speed, options)
  File "xpra/codecs/nvenc/encoder.pyx", line 2179, in xpra.codecs.nvenc.encoder.Encoder.compress_image
  File "xpra/codecs/nvenc/encoder.pyx", line 2216, in xpra.codecs.nvenc.encoder.Encoder.do_compress_image
  File "xpra/codecs/nvenc/encoder.pyx", line 2245, in xpra.codecs.nvenc.encoder.Encoder.copy_image
AssertionError: failed to get pixels from XShmImageWrapper(BGRX: 0, 0, 520, 189)
No idea what caused this. Unless this happens repeatedly, I think we can ignore this.
compress: 429.1ms for 1632x914 pixels at 0,0 for wid=1 using h264 with ratio 3.4% ( 5833KB to 197KB), sequence 13, client_options={u'speed': 99, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 40}
Please enable XPRA_DEBUG_VIDEO_CLEAN=1 for that.
- how did you install the latest builds without installing libyuv? it is an RPM dependency in v2.5
I think it is installed. What line in the log indicates that it is not?
[nathan@curry ~]$ rpm -q --info libyuv
Name        : libyuv
Version     : 0
Release     : 0.31.20180904git9a07219.el7_5
Architecture: x86_64
Install Date: Mon 29 Oct 2018 11:39:10 AM PDT
Group       : Unspecified
Size        : 361374
License     : BSD
Signature   : DSA/SHA1, Tue 23 Oct 2018 11:37:47 PM PDT, Key ID 18adb31cf18ad6bb
Source RPM  : libyuv-0-0.31.20180904git9a07219.el7_5.src.rpm
Build Date  : Tue 23 Oct 2018 10:00:00 PM PDT
Build Host  : buildbot
Relocations : (not relocatable)
URL         : https://chromium.googlesource.com/libyuv/libyuv
Summary     : YUV conversion and scaling functionality library
Description :
This is an open source project that includes YUV conversion and scaling functionality. Converts all webcam formats to YUV (I420). Convert YUV to formats for rendering/effects. Rotate by 90 degrees to adjust for mobile devices in portrait mode. Scale YUV to prepare content for compression, with point, bilinear or box filter.
[nathan@curry ~]$ rpm -q --verify -v libyuv
.........    /usr/lib64/libyuv.so.0
.........    /usr/lib64/libyuv.so.0.0.0
.........    /usr/share/doc/libyuv-0
.........  d /usr/share/doc/libyuv-0/AUTHORS
.........  d /usr/share/doc/libyuv-0/PATENTS
.........  d /usr/share/doc/libyuv-0/README.md
.........    /usr/share/licenses/libyuv-0
.........  l /usr/share/licenses/libyuv-0/LICENSE
[nathan@curry ~]$
I think it is installed. What line in the log indicates that it is not?
My bad, something has gone wrong with the centos7 builds, tracking this here: ticket:2004#comment:2
I made a video demonstrating what our software looks like. This was rendered on the machine with the Tesla K20c (this is similar to 12.txt). I made it so that you can see what I am doing. For instance, the window closing events are for the "help" box and the "animate" box. The only part that needs nvenc encoding is the box with the 3d drawing.
This case works pretty well. It locks up during a rotation at 9 1/3 seconds (it would not do that with --encodings=h264), but it's pretty good. (I don't have a proper debug log because I was making it for demonstration rather than debug purposes).
Again, thanks for all your help. I'll be following up shortly with more debug logs.
https://ftp.lstc.com/anonymous/outgoing/nathan/bobross-ok.mp4
I was able to get a repeat of 14.txt (with XPRA_DEBUG_VIDEO_CLEAN=1). Same server, different client. This time I was using the stock XPRA 64-bit client (from the web site) connecting through my cell phone hotspot. When tested with --encodings=h264 (for control) the performance was shockingly good (*on a cellphone hotspot*). However, with the normal options it was slow and eventually crashed.
[nathan@bobross lsprepost4.5_centos7]$ xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no :57 2>&1 | tee 15.txt^C
[nathan@bobross lsprepost4.5_centos7]$ env | grep XPRA
XPRA_CUDA_DEVICE_NAME=Tesla
XPRA_CUDA_DEVICE=0
XPRA_SCROLL_ENCODING=0
XPRA_DEBUG_VIDEO_CLEAN=1
[nathan@bobross lsprepost4.5_centos7]$
I *was not* able to replicate using the same commands when connected on either 802.11g or 802.11n. In all cases I was port forwarding ssh. This trouble arises *only* on cell phone and broadband. Broadband is an especially important case, however, because we have several offsite users.
How do I get debug log off the client?
However, with the normal options it was slow and eventually crashed.
As per comment:21, how did it crash, what was the output, etc?
How do I get debug log off the client?
By default, the client debug log output is sent to the server (see --remote-logging).
But when dealing with crashes, the last messages may not make it to the server, so just run:
xpra_cmd.exe attach ssh://whatever/ -d whateverdebugflags
If it isn't clear what is causing the crash, using -d all might be required to cover all possibilities.
From this latest log:
try --bandwidth-detection=no --bandwidth-limit=100mbps and add -d bandwidth to the server debug flags to confirm (the client overrules the server's bandwidth-detection flag - and maybe we shouldn't allow that?) - h264 is more efficient, so it could be avoiding some bandwidth issues
try --auto-refresh-delay=0 on the client and see if that helps
The video_context_clean events come from window close events (_lost_window), twice - in both cases, the windows used h264 for a single frame (and webp, jpeg for a few more):
encoding_totals for wid=2 with primary encoding=auto : {'h264': [1, 60300], 'webp': [2, 120868], 'jpeg': [3, 129888]}
Normally, this isn't a huge problem but with nvenc this is more costly because of the initial setup cost and the fact that this is "wasting" a precious encoder context. If the video encoders trigger too early, we could favour software encoders until we have enough frames to be certain that a hardware video encoder is worth using.
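The "favour software encoders until we have enough frames" idea could be sketched like this (illustrative only - names such as HW_WARMUP_FRAMES are invented here, this is not xpra's scoring code):

```python
# Illustrative sketch, not xpra code: hold back an expensive hardware
# encoder until a window has produced enough frames to justify the
# context setup cost. HW_WARMUP_FRAMES is an invented tunable.
HW_WARMUP_FRAMES = 10

def choose_encoder(frames_seen, nvenc_available):
    # Early frames: always use the cheap software encoder, so that
    # short-lived bursts never waste a precious nvenc context.
    if nvenc_available and frames_seen >= HW_WARMUP_FRAMES:
        return "nvenc"
    return "x264"
```

The trade-off is exactly the one described above: a later switch wastes fewer contexts but pays the software encoder's CPU cost for the first few frames.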
video subregion was rectangle(0, 0, 1604, 881), now rectangle(3, 27, 1450, 762) (window size: 1604,881)
A single change should not be a huge problem.
... with the normal options it was slow ... This trouble arises *only* on cell phone and broadband.
Can you describe this "trouble"? Is it stuttering, occasionally pausing, or just generally slow? If the problems are punctual, can you match up the problems with timecodes from the log output? (roughly) Try capturing "xpra info" at a time when things are slow.
I've been away all day dealing with some other stuff. I think you are on to something. When I turn off the bandwidth detection on the client side the performance is better than I've ever seen. I (almost) cannot distinguish broadband from LAN.
The interesting part: the longer I leave the session running the slower things get. I'll try to get debug data before the day is out. Otherwise I'll have it first thing tomorrow.
Again, thank you!
Added the --auto-refresh-delay option. Everything runs smoothly (with the Tesla card) for as long as I leave the window open.
I like auto-refresh very much so I'll be doing anything I can to help fix it. (With auto-refresh turned on, the video streams become slower and slower, seemingly with each refresh, until the client crashes - after reconnecting it's perfect at first, and then the process repeats exactly as before.) I will be taking debug logs tomorrow. I'll also try to match lock-ups to timestamps in the logs.
I like auto-refresh very much so I'll be doing anything I can to help fix it.
The debug we need to get to the bottom of this would be:
enable --bandwidth-detection=yes on the client and post the server's -d bandwidth logs - it looks like the bandwidth detection is detecting the difference between LAN and broadband / phone connections, but maybe over-compensating
-d refresh,regionrefresh,compress,video for when things get out of whack - the bandwidth detection should have an impact on the auto-refresh so you may want to try with / without; nvenc's slow startup could be causing the auto-refresh to trigger too early, or maybe it triggers and causes context re-cycling - we'll see
Since I've updated my server to the latest beta and disabled the bandwidth detection I've not been able to replicate any trouble on the Tesla card when I use clients compiled by xpra.org. I've tried both 2.4 and the latest betas and they work!
The good news for me is that my builds give the same bad results with or without my patches. I really want to get my patches incorporated into mainline, but I don't want to send them until I make them work on Linux and MacOS.
The bad news (for me, not for you) is that, until I get our UI team to add the X properties and until I go back to testing the gamer cards, I think my only remaining problems are created by something in the way I'm building XPRA on Windows.
I'm not really sure where to begin debugging this. I'm going to start with the suggestions in your previous replies (I haven't tried them all yet). The log captures on the server side look pretty similar for both clients (lots of h264). My client is slow even when it *is* using h264 on the server side. I can't figure out how to read the "-d all" output because there are so many GTK messages.
I am working on this today.
Again, I thank you and your team for all your help!
Since I've updated my server to the latest beta and disabled the bandwidth detection I've not been able to replicate any trouble on the Tesla card when I use clients compiled by xpra.org.
Did you identify what fixed things? My guess is the fixes from comment:14, likely r20914.
I really want to get my patches incorporated into mainline, but I don't want to send them until I make them work on Linux and MacOS.
What do those patches do? Can you open a ticket for that?
I think my only problems are created by something in the way I'm building XPRA on Windows.
The number 1 performance issue is likely to be the lack of opengl, see ticket:1883#comment:8 If you can't figure it out, please open a new ticket for that.
FYI: as of r20971 the server controls the bandwidth-detection option and the client can no longer enable this feature on its own.
I'm not exactly sure why my custom builds were not working on windows. I think I was accidentally setting some "speed=" things in my "conf.d" folder. Anyway, I can now connect flawlessly to my computer with a K20c Tesla (for encoding) and a C2075 (for render) if I disable the bandwidth-detection.
I still cannot make the computer with the GTX card work right. I'm attaching a log from an XPRA session on a computer with a single GTX card running the opengl job.
I am using gigabit Ethernet.
I'm still working on getting _XPRA_CONTENT_TYPE added to our software.
Here's the command I used to collect the debug data
xpra start --no-daemon --bandwidth-detection=no --bandwidth-limit=100mbps --start="vglrun lsprepost4.5_centos7/lspp45 d3plots/d3plot" -d score,encoding,video,scaling,csc,stats,compress :250 2>&1
attachment/ticket/2022/gtx1.txt
Since I've updated my server to the latest beta and disabled the bandwidth detection
As per comment:29, it would really help if you could reproduce the problems with bandwidth-detection enabled and -d bandwidth.
Then we can make it less aggressive and work better out of the box for everyone.
I still cannot make the computer with the GTX card work right. I'm attaching a log from a XPRA session on a computer with single GTX card running the opengl job.
XPRA_NVENC_CONTEXT_LIMIT=2 xpra start ...
There is no overcommit, so this should prevent the worst of the stuttering as we won't be trying to use nvenc once two contexts are in use (trying and failing takes time)
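The fail-fast guard being described here amounts to a simple counter check before attempting a context (an illustrative sketch with invented names, not xpra's implementation; the limit of 2 matches consumer GeForce cards, which is also what XPRA_NVENC_CONTEXT_LIMIT=2 caps):

```python
# Illustrative sketch: refuse a new nvenc context up front once the
# per-card session limit is reached, instead of letting the driver
# fail slowly. All names here are invented for illustration.
NVENC_CONTEXT_LIMIT = 2
_active = []

def try_acquire_context():
    if len(_active) >= NVENC_CONTEXT_LIMIT:
        return None  # fail fast: no slow round-trip to the driver
    name = "nvenc-context-%d" % (len(_active) + 1)
    _active.append(name)
    return name
```

With the limit enforced up front, the third window simply falls back to a software encoder instead of stuttering while the driver rejects the request.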
If you still have problems, then that probably means that we're choosing the wrong window for nvenc.
Please post -d encoding,damage with an explanation of which window id should be using the video encoders.
Note: I have seen what looks like some malformed frames in the client output: "unknown format returned" - ffmpeg fails to decode a frame, then we reset the encoder... which takes a bit of time. New ticket: #2051.
I will do the tests with bandwidth detection ASAP (probably within the next week). At the moment looking, among other things, at the ssh patch I submitted and the gnome-terminal ticket. Thanks for all your help!
Stumbled upon this hack for bypassing the licensing restrictions: https://github.com/keylase/nvidia-patch. It patches the driver so that the license key check always succeeds. Another approach which can be used is to break on the license key checking code in gdb and then extract the license keys included in the blob.
Using the debug options discussed in this ticket I have found that, at this point, a good deal of my NVENC context recycling is caused by the video-region-detection dimensions changing. I am now working with our LS-PrePost developers to get a callback so that I can plug in some code I've written to notify XPRA via dbus of the OpenGL canvas coordinates.
I have already demonstrated that this helps a lot using the "xpra control" commands. It may even completely resolve the NVENC context recycling issues, but I'm not sure yet...
Using the debug options discussed in this ticket I have found that, at this point, a good deal of my NVENC context recycling is caused by the video-region-detection dimensions changing.
If you can provide the sample debug output, maybe we can improve the code so that it doesn't change the region quite so often. -d damage,regiondetect should be enough.
And even when we do recycle the context, there are two ways we can make it better still:
(faster but harder to implement)
I have gone ahead and implemented dbus calls.
I am calling
(1) SetVideoRegionDetection to false
(2) SetVideoRegionEnabled to true
(3) ResetVideoRegion (not sure if I need this)
(4) SetVideoRegion
When I do this the video region gets stuck forever on x264 unless I disable x264 at startup (and x264 is laggy - also not sure if that is a bug). I'm not sure if this is a bug or if I'm doing something wrong.
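For anyone following along, roughly the same calls can be tried interactively with "xpra control" (mentioned earlier in this ticket); the sub-command names and arguments below are my best recollection, so verify them against the output of "xpra control :57 help" on your version:

```shell
# Assumed sub-command names - check "xpra control :57 help" first.
xpra control :57 video-region-detection 1 off   # (1) disable auto-detection for wid 1
xpra control :57 video-region-enabled 1 on      # (2) force the region on
xpra control :57 video-region 1 3 27 1450 762   # (4) set the region geometry
```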
(3) ResetVideoRegion (not sure if I need this)
Not needed.
When I do this the video region gets stuck forever on x264 unless I disable x264 at startup
That's what setting a video region does.
(and x264 is laggy, also not sure if that is a bug). I'm not sure if this is a bug, or if I'm doing something wrong.
Are you talking about the laggyness?
What exactly is the symptom? Did you capture xpra info?
Using "video" may trigger some delayed frames with x264. (but not with nvenc - not yet anyway)
Assuming that you're using the x264 encoder, you may want to only enable vpx and not x264: --video-encoders=vpx.
Or you can also try to cap the number of delayed frames with x264:
XPRA_X264_MAX_DELAYED_FRAMES=1 xpra start ..
Maybe what we need to do is to define a new content-type ("animation" perhaps?) that enables video encoders but without the features that cause delayed frames (b-frames, multiple reference frames) as those are only really useful for true video content (ie: video player or youtube).
I'm encountering two unrelated problems:
My users are comparing xpra to VNC (and NX), and the only consistent complaint I have heard (from a small number of them) has to do with (at least I think it has to do with) the time it takes nvenc to start.
This region that I am flagging with DBUS is either resting (a constant lossless frame) or animating. Nothing like scrolling or UI happens in this region. My experiments tell me that VPX and X264 cannot keep up, but nvenc is amazing and turbojpeg is the best when there is no nvenc (at least on old westmere chips).
Based on my experience, therefore, and assuming I've made no other mistakes, I would like for my DBUS calls to make a *permanent* nvenc context in that region, as the shape of the region never changes without triggering my resize callback which will notify xpra.
I've done a little attempt at debugging and I've found that the score for nvenc never gets better than x264 when I set the region using dbus. I think it is because the setup penalty always swamps everything else for some reason, but I'm not sure that I'm reading this right.
Every iteration has a -300:
Dec 31 15:23:29 curry sh[158375]: 2018-12-31 15:23:29,349 get_pipeline_score(BGRX , None , x264(BGRX to h264) , 1276, 918) quality: 63, speed: 34, setup: 100 - 100 runtime: 100 scaling: (1, 1) / (1, 1), encoder dimensions=1276x918, sizescore= 89, client score delta= 0, cpu score= 0, gpu score= 0, score= 57
Dec 31 15:23:29 curry sh[158375]: 2018-12-31 15:23:29,350 get_pipeline_score(BGRX , None , nvenc(BGRX to h264) , 1276, 918) quality: 63, speed: 48, setup: 100 - -300 runtime: 100 scaling: (1, 1) / (1, 1), encoder dimensions=1276x918, sizescore= 87, client score delta= 0, cpu score= 0, gpu score=100, score= 39
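The real get_pipeline_score formula lives in xpra's codec scoring code; this is only a toy stand-in (the averaging and all names are invented) to show how a single large negative setup term can outweigh every other advantage:

```python
# Toy illustration only - NOT xpra's real get_pipeline_score formula.
# It borrows the fields from the log lines above just enough to show
# that one large negative "setup" penalty can drag an otherwise
# superior encoder below a mediocre one.
def toy_score(quality, speed, setup, runtime, sizescore, gpu_score):
    # a crude unweighted average of the factors
    return (quality + speed + setup + runtime + sizescore + gpu_score) // 6

x264_score = toy_score(63, 34, 100, 100, 89, 0)      # setup cost already paid
nvenc_score = toy_score(63, 48, -300, 100, 87, 100)  # -300 setup penalty
```

Even though nvenc wins on speed and gpu score, the setup term dominates, which matches the x264=57 vs nvenc=39 outcome in the log.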
x264 and vpx are both laggy, but I think this might be an insurmountable computational intensity issue (on the old Westmere chip I'm testing on, at least). I'm seeing other behaviors consistent with this theory: if I turn off the mesh, for instance, it gets less laggy (and the lag sometimes goes away entirely).
I'm not sure what "turn off the mesh" means here. Setting a high speed setting should reduce the encoding latency, it will just use up more bandwidth instead.
When I set the video region by hand (after using it a little) XPRA ends up in nvenc, but when I set it right at start it is in x264 and stays in x264 forever, and it never switches to nvenc.
Two things are likely at play here:
XPRA_MIN_FPS_COST
My users are comparing xpra to VNC (and NX), and the only consistent complaint I have heard (from a small number of them) has to do with (at least I think it has to do with) the time it takes nvenc to start.
We can make it start quicker, but then users may complain about jerkiness for the first few frames. (due to the nvenc initialization delay)
I've done a little attempt at debugging and I've found that the score for nvenc never gets better than x264 when I set the region using dbus. I think it is because the setup penalty always swamps everything else for some reason, but I'm not sure that I'm reading this right.
Could be the fps change (r21057), and / or maybe setting the video region interferes with that.
- we don't use nvenc immediately unless we are certain that the video is here to stay - so initially x264 will win the scoring, even more so recently (see r21057); you can try lowering XPRA_MIN_FPS_COST
What's going on is that identify_video_subregion ends before it can update the fps when detection is disabled, so the fps stays at 0 forever. I'm trying to come up with a patch...
I have written a patch that gets my vgl app working perfectly. It:
Thanks, merged in separate commits with some minor changes:
XPRA_VIDEO_NODETECT_TIMEOUT=0 to disable the timeout completely - there is no need for a get_detection method as this can be replaced with 2 lines of code
@nathan_lstc: does that work for you?
As per comment:39:
If you can provide the sample debug output, maybe we can improve the code so that it doesn't change the region quite so often.
-d damage,regiondetect should be enough.
This would still be worth doing. (making xpra work correctly out of the box rather than relying on application hints to do the right thing)
Looks good. I'll test it more thoroughly later, but I'm pretty sure everything is right.
Another problem: when I reattach, xpra seems to forget everything that the client apps told it via dbus. Ideas for fixing:
(1) have my app somehow grab startup-complete signal and resend stuff to the dbus
(2) have xpra send some XEvent to my program
(3) have xpra remember the video-region stuff across attaches.
Presently, I'm looking into option (1), is that the right thing?
I've implemented option (1). It works fine. I'm very satisfied with how well LSTC's (my employer) UI stuff is working. I'm going to test out unrelated ticket number #2090 tomorrow. Thanks.
Can I close this ticket as "fixed"?
I've retested everything on LAN and WIFI with default settings. I can say everything works perfectly as of r21284.
I still need to test on broadband before I can say for sure that everything works.
My environment:
ExecStart=/bin/sh -c "cd ~;PATH=/opt/xpra/bin:$PATH xpra -d compress --no-daemon start --no-printing --start-via-proxy=no --systemd-run=no --start=\"xrdb -merge $HOME/.Xresources\" --start-child=xterm --exit-with-children --xsettings=no --file-transfer=yes --mdns=no :`id -u`"
Environment=PYTHONPATH=/opt/xpra/lib64/python2.7/site-packages
Environment=LD_LIBRARY_PATH=/opt/libjpeg-turbo/lib64/:/usr/local/cuda/lib64:/usr/lib64/xpra:/opt/x264/lib
Environment=CUDA_VISIBLE_DEVICES=1
Environment=XPRA_VIDEO_NODETECT_TIMEOUT=0
Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=25
As per ticket:2090#comment:7 :
Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=25
It's working fine from my LTE phone over a bluetooth link share. I can manipulate a 3D model in real-time over this (using quality=15). You may close this ticket!
It's working fine from my LTE phone over a bluetooth link share. I can manipulate a 3D model in real-time over this (using quality=15).
Doesn't the quality auto-adjust well enough? (Maybe we should set min-quality=0 to allow it to get there?)
I have deployed this and almost immediately got a report that XPRA had hung for one of my users. I will include two interesting-looking log excerpts. How can I help debug?
The initial problem seems to be related to some kind of network congestion:
Jan 03 19:01:01 curry run-parts(/etc/cron.hourly)[86181]: finished mcelog.cron
Jan 03 19:06:20 curry sh[2000]: 2019-01-03 19:06:20,407 Error in worker thread processing item <bound method ClientConnection.recalculate_delays of ClientConnection(3 : Protocol(unix-domain socket:/run/user/267/xpra/curry-267))>
Jan 03 19:06:20 curry sh[2000]: Traceback (most recent call last):
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/background_worker.py", line 62, in run
Jan 03 19:06:20 curry sh[2000]: item()
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/source/encodings_mixin.py", line 137, in recalculate_delays
Jan 03 19:06:20 curry sh[2000]: self.statistics.update_averages()
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/source/source_stats.py", line 118, in update_averages
Jan 03 19:06:20 curry sh[2000]: css = tuple(x for x in self.congestion_send_speed if x[0]>min_time)
Jan 03 19:06:20 curry sh[2000]: File "/opt/xpra/lib64/python2.7/site-packages/xpra/server/source/source_stats.py", line 118, in <genexpr>
Jan 03 19:06:20 curry sh[2000]: css = tuple(x for x in self.congestion_send_speed if x[0]>min_time)
Jan 03 19:06:20 curry sh[2000]: RuntimeError: deque mutated during iteration
Jan 03 19:07:07 curry kernel: nvidia-modeset: WARNING: GPU:1: Correcting number of heads for current head configuration (0x00)
Jan 03 19:07:10 curry sh[2000]: 2019-01-03 19:07:10,746 client @53.594 server is not responding, drawing spinners over the windows
Jan 03 19:07:10 curry sh[2000]: 2019-01-03 19:07:10,877 client @53.859 server is OK again
Jan 03 19:10:01 curry systemd[1]: Started Session 4132 of user root.
Here's the messages after it tries to recover. There are many lines of this:
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,027 Warning: delayed region timeout
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 region is 15 seconds old, will retry - bad connection?
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 5 late responses:
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 54 h264 : 18s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 55 h264 : 18s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 56 h264 : 17s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,028 57 h264 : 17s
Jan 03 19:42:36 curry sh[2000]: 2019-01-03 19:42:36,029 58 h264 : 15s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 Warning: delayed region timeout
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 region is 15 seconds old, will retry - bad connection?
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 22 late responses:
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,172 3944 jpeg : 22s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3945 scroll: 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3946 png : 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3947 jpeg : 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3948 jpeg : 21s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3949 scroll: 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3950 rgb24: 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3951 jpeg : 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3952 jpeg : 20s
Jan 03 19:42:39 curry sh[2000]: 2019-01-03 19:42:39,173 3953 scroll: 20s
Then I start to see:
Jan 03 19:49:12 curry sh[2000]: 2019-01-03 19:49:12,453 Error: connection timed out: unix-domain socket:/run/user/267/xpra/curry-267
Jan 03 19:49:12 curry sh[2000]: 2019-01-03 19:49:12,456 Error: connection timed out: unix-domain socket:/run/xpra/curry-267
Jan 03 19:49:44 curry sh[2000]: *** BUG ***
Jan 03 19:49:44 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 19:49:44 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
Jan 03 19:50:01 curry systemd[1]: Started Session 4144 of user root.
Jan 03 19:50:01 curry CROND[156120]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 03 19:55:46 curry sh[2000]: *** BUG ***
Jan 03 19:55:46 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 19:55:46 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
Jan 03 19:56:44 curry sh[2000]: *** BUG ***
Jan 03 19:56:44 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 19:56:44 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
Jan 03 20:00:01 curry systemd[1]: Started Session 4145 of user root.
Jan 03 20:00:01 curry CROND[156756]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 03 20:00:50 curry sh[2000]: *** BUG ***
Jan 03 20:00:50 curry sh[2000]: In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Jan 03 20:00:50 curry sh[2000]: Set a breakpoint on '_pixman_log_error' to debug
All of my clients also had bandwidth detection disabled, so my claim that I had re-enabled bandwidth detection might have been false. However, my user from last night certainly had bandwidth detection *on*. I'll be retesting that momentarily.
deque mutated during iteration
This is a real bug, but it has been present for a long time, so it may just be an unlucky symptom of another problem: you may be hitting this race condition now because there are many congestion events. Fixed in r21288.
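For reference, the failure mode is easy to reproduce in isolation: CPython's deque iterator raises RuntimeError as soon as the deque changes length mid-iteration. A minimal sketch below; the snapshot approach shown is one common fix, not necessarily what r21288 does:

```python
from collections import deque

# Mutating the deque while iterating it raises RuntimeError,
# exactly as in the traceback above:
d = deque([1, 2, 3])
try:
    for x in d:
        d.append(x)
except RuntimeError as e:
    print("error: %s" % e)      # error: deque mutated during iteration

# One common fix: snapshot the deque before filtering it,
# so the generator never iterates the live container:
events = deque([(1, 100), (2, 200), (3, 300)])
min_time = 1
css = tuple(x for x in tuple(events) if x[0] > min_time)
print(css)                      # ((2, 200), (3, 300))
```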
*** BUG ***
In pixman_set_extents: The expression region->extents.x1 < region->extents.x2 was false
Set a breakpoint on '_pixman_log_error' to debug
I have never ever seen this message. Very suspicious.
WARNING: GPU:1: Correcting number of heads for current head configuration (0x00)
Could this be the cause? Maybe the GPU is hanging and causing the whole system to stall? This would explain the other timeouts that follow:
Warning: delayed region timeout
Error: connection timed out: unix-domain socket:/...
The "correcting number of heads" warning is a red herring, I believe. All of my headless servers say that and it never causes any trouble. I've never looked into it.
What I have found is that he reports that the "Network Performance Issue" dialog pops up as soon as I turn the bandwidth-detection feature back on, and his connection gets steadily worse over time.
If I turn it off, his connection is great, he says. It is 100% reproducible, he tells me. I cannot, however, seem to reproduce the problem in my office.
What logs shall I collect?
What I have found is that he reports that the "Network Performance Issue" dialog pops up as soon as I turn the bandwidth-detection feature back on, and his connection gets steadily worse over time.
This could be a similar problem to ticket:2090#comment:9 : a batch delay that stays too high following congestion. (real or not..)
What logs shall I collect?
-d bandwidth. Maybe correlate it with -d compress too.
There was even a bug a while back that triggered more congestion events after the first one. This could be something similar, where the congestion countermeasures just trigger repeatedly.
I've got a log for you: log2.txt
The guy running the client tells me that he started the 3D software at 43:19 and by 42:28 things were already broken.
He's using the 64-bit Windows beta client from the xpra website, from 12-18.
The guy running the client tells me that he started the 3D software at 43:19 and by 42:28 things were already broken.
Unless he's going back in time, that's 42:19 ;)
Found a few things of interest in there:
12:42:23,435:
networksend_congestion_event(late-ack for sequence 72: late by 102ms, target latency=272 ((118, 54, 0, 100)), 465, 1354781) 1323Kbps (average=0Kbps)
We have a target latency of 272ms for this connection: 118ms for network (based on measured latency), 54ms for sending the packet, and 100ms tolerance. Yet somehow the ack packet comes back 102ms late. Spurious events like these can happen, so this may or may not be a problem. There are a couple more events, before this important one:
compress: 504.6ms for 1186x938 pixels at 3,23 for wid=2 using h264 with ratio 0.1% (4350KB to 4KB), sequence 3, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 40}
Immediately after that, we see congestion events with a 500ms delay:
networksend_congestion_event(late-ack for sequence 5: late by 508ms, target latency=260 ((118, 42, 0, 100)), 1513, 284030) 134Kbps (average=157Kbps)
I don't understand how the encode thread could affect the network send thread... but that's too much to be a coincidence.
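To make the numbers in these events concrete, here is a rough sketch of the late-ack arithmetic. The function names and parameters are illustrative, not xpra's actual code; only the breakdown itself (measured network latency + send time + fixed tolerance, from the "(118, 54, 0, 100)" tuple) is taken from the log lines:

```python
def target_latency(network_ms, send_ms, jitter_ms=0, tolerance_ms=100):
    # Mirrors the "(118, 54, 0, 100)" breakdown from the log line:
    # measured network latency + time to push the packet + tolerance.
    return network_ms + send_ms + jitter_ms + tolerance_ms

def ack_lateness(sent_ts, ack_ts, target_ms):
    # Positive result means the ack came back that many ms too late,
    # which would be flagged as a congestion event.
    elapsed_ms = (ack_ts - sent_ts) * 1000.0
    return elapsed_ms - target_ms

target = target_latency(118, 54)            # 272ms, as in the log
late_by = ack_lateness(10.0, 10.374, target)  # ack arrived 374ms after send
print(target, round(late_by))               # 272 102
```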
I have seen unexplained / spurious congestion events before, especially with mswindows clients. You may want to try:
XPRA_ACK_TOLERANCE=200 xpra start ...
If that works then we may have to double it by default, or just for win32 clients.
It may also be worth testing the client with socket nodelay toggled both ways:
XPRA_SOCKET_NODELAY=0 xpra_cmd.exe attach ...
XPRA_SOCKET_NODELAY=1 xpra_cmd.exe attach ...
DOH: regarding the 500ms delay, that's obviously something holding onto the python GIL during nvenc / pycuda setup.
If you can provide the -d cuda,nvenc output and also set XPRA_NVENC_DEBUG_API=1, I should be able to figure out which call is blocking the whole python interpreter, and then we can cythonize it and release the GIL.
Found a detailed NVENC log sample in ticket:1472#comment:7; it seems that the delays happen between "testing device .." and "created context", and between "init_cuda cuda_device" and "init_cuda cuda_context".
In both cases, this is where we call pycuda.driver.Device.make_context.
Also relatively slow is the encoder context cleanup call (after cuda_clean) - but since we call it twice, it could just be elsewhere... r21292 will help clarify that.
More and more, I am thinking of solving delay problems (see comment:21 and ticket:2048#comment:2) by moving video encoding to its own thread. This would only fix some of the problems (#2048 mostly), but not if pycuda holds the GIL...
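A minimal sketch of what "video encoding in its own thread" could look like: a queue-fed worker, so the damage and network threads never block on a slow encoder call. This is illustrative only, not xpra's actual design, and as noted it would not help if pycuda holds the GIL during setup:

```python
from threading import Thread
from queue import Queue

def encode_worker(inbox, send):
    # Dedicated encoder thread: slow encoder calls only ever
    # stall this thread, not the thread that queued the frame.
    while True:
        frame = inbox.get()
        if frame is None:          # sentinel: shut down cleanly
            break
        send("h264:%s" % frame)    # stand-in for the real encoder call

sent = []
inbox = Queue(maxsize=4)           # bounded: caller can drop/coalesce
t = Thread(target=encode_worker, args=(inbox, sent.append))
t.daemon = True
t.start()

for frame in ("f1", "f2", "f3"):
    inbox.put(frame)
inbox.put(None)
t.join()
print(sent)                        # ['h264:f1', 'h264:f2', 'h264:f3']
```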
test if make_context releases the GIL
@nathan_lstc: please run attachment/ticket/2022/test_cuda_context_gil.py on a system with nvenc so we will know if pycuda is holding the GIL.
I've modified it as follows (added init). I also reduced the sleep number from 0.1 to 0.01:
from pycuda import driver
driver.init()
print("initilizing device")
output:
[root@curry nathan]# !py
python test_cuda_context_gil.py
******initilizing device
*device=<pycuda._driver.Device object at 0x7ff61455a680>
context=<pycuda._driver.Context object at 0x7ff5ce4ad668>
*done
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being cleaned
up. At this point in our execution, CUDA may already have been
deinitialized, so there is no way we can finish cleanly. The
program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
[root@curry nathan]#
The error printed at the end can probably be avoided by calling cuda_context.detach().
In hindsight, this test code output is not very useful... I wanted to know if the print thread was being blocked during the call to make_context. Was it? (did the "*" stop being printed out?)
A better way would have been to print timestamps every time instead.
Here's the output for a few runs:
[root@curry nathan]# python test_cuda_context_gil.py
******************initilizing device
*device=<pycuda._driver.Device object at 0x7fe1928ef680>
1546898949.39
1546898949.6
context=<pycuda._driver.Context object at 0x7fe14c867668>
**********done
*[root@curry nathan]# python test_cuda_context_gil.py
*************************************************initilizing device
device=<pycuda._driver.Device object at 0x7f0f6c8e2680>
*1546898951.31
1546898951.49
context=<pycuda._driver.Context object at 0x7f0f2685b668>
********done
[root@curry nathan]#
The current version:
#!/usr/bin/env python
import sys
import time

def print_thread():
    while True:
        sys.stdout.write("*")
        sys.stdout.flush()
        time.sleep(0.001)

from threading import Thread
t = Thread(target=print_thread)
t.daemon = True
t.start()
time.sleep(0)
from pycuda import driver
driver.init()
print("initilizing device")
device = driver.Device(0)
print("device=%s" % (device,))
print(time.time())
context = device.make_context()
print(time.time())
print("context=%s" % (context,))
time.sleep(0.01)
context.detach()
print("done")
It's still printing stars "*" and not timestamps...
We want to see if the print_thread is paused during the call to make_context.
There is a "print time.time()" on either side of the make_context call. The stars are printing fast enough that I think a pause would have been visible. But you are right, it would be clearer with timestamps. Here it is:
[root@curry nathan]# !py
python test_cuda_context_gil.py
*print thread: 0.00265097618103
*print thread: 0.00385713577271
*print thread: 0.00500512123108
*print thread: 0.00671005249023
*print thread: 0.00848293304443
*print thread: 0.00974202156067
*print thread: 0.0113780498505
*print thread: 0.0127730369568
*print thread: 0.01389503479
*print thread: 0.0152540206909
*print thread: 0.0163860321045
*print thread: 0.0179829597473
*print thread: 0.019357919693
*print thread: 0.0204629898071
*print thread: 0.0215699672699
*print thread: 0.0227890014648
*print thread: 0.0246050357819
*print thread: 0.026202917099
*print thread: 0.0370080471039
*print thread: 0.0383989810944
*print thread: 0.039715051651
*print thread: 0.0414950847626
*print thread: 0.0425989627838
*print thread: 0.0456280708313
*print thread: 0.0471770763397
*print thread: 0.0486969947815
*print thread: 0.0498020648956
*print thread: 0.0512359142303
*print thread: 0.0525970458984
*print thread: 0.0548369884491
*print thread: 0.0561380386353
*print thread: 0.0572350025177
*print thread: 0.0583231449127
*print thread: 0.0594069957733
*print thread: 0.0614700317383
*print thread: 0.0625629425049
*print thread: 0.0636930465698
*print thread: 0.0647871494293
*print thread: 0.0663070678711
*print thread: 0.0675950050354
*print thread: 0.0690340995789
*print thread: 0.0701401233673
*print thread: 0.0712389945984
*print thread: 0.0740530490875
*print thread: 0.078320980072
*print thread: 0.07941198349
*print thread: 0.080500125885
*print thread: 0.0816259384155
*print thread: 0.0827519893646
*print thread: 0.0839450359344
*print thread: 0.0852839946747
*print thread: 0.0901050567627
*initilizing device
device=<pycuda._driver.Device object at 0x7f36ca67c6d8>
print thread: 0.258333921432
1546899654.25
1546899654.47
context=<pycuda._driver.Context object at 0x7f36845e66e0>
*print thread: 0.471437931061
*print thread: 0.472643136978
*print thread: 0.473837137222
*print thread: 0.474987030029
*print thread: 0.476146936417
*print thread: 0.477342128754
*print thread: 0.478557109833
*print thread: 0.479777097702
*done
[root@curry nathan]#
Sorry, didn't see that.
So, as I expected, the problem is that python-pycuda is stalling the whole python interpreter during both device init and context creation. I'll ask for help; otherwise we may have to rewrite the cuda glue in cython so we can call it with the GIL released (see Cython: Interfacing with External C Code: Releasing the GIL).
Should I still have my user do the XPRA_SOCKET_NODELAY and the XPRA_ACK_TOLERANCE test?
Should I still have my user do the XPRA_SOCKET_NODELAY and the XPRA_ACK_TOLERANCE test?
Probably not. The pycuda blockage is probably the main cause of the latency problems.
Question on pycuda mailing list: releasing the GIL during init and make_context
Batch delay fixes tracked in ticket:2090#comment:10
As per this reply: "No particular reason from my end. I'd be happy to consider a PR with such a change. Should be pretty straightforward." Time for me to work on pycuda a bit then.
In the meantime, this should achieve the same results as setting XPRA_BATCH_MAX_DELAY=50, but better: we expire the delayed regions quicker, but they only actually get sent if there really isn't any reason to wait longer.
This doesn't fix the problem where we are too sensitive to latency going up and not sensitive enough when it goes back down, but it should mitigate most of it.
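The "too sensitive going up, not sensitive enough coming down" behaviour can be pictured as an asymmetric smoothing of the batch delay. This is a toy model, not xpra's actual heuristics; the factor values are made up for illustration:

```python
def update_batch_delay(current, measured, up_factor=0.8, down_factor=0.1):
    # React quickly when the measured delay rises (congestion),
    # and decay slowly when it falls - which is why the batch delay
    # can stay too high long after the congestion is gone.
    if measured > current:
        return current + up_factor * (measured - current)
    return current + down_factor * (measured - current)

delay = 10.0
for measured in (200, 10, 10, 10, 10):   # one congestion spike, then calm
    delay = update_batch_delay(delay, measured)
    print(round(delay, 1))
# 162.0, 146.8, 133.1, 120.8, 109.7: still ~11x the baseline
# four ticks after a single spike.
```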
@nathan_lstc: is this better out of the box? (bar pycuda issue..)
increase timer frequency so we can verify the change more clearly
I've just tried on two systems with GTX cards:
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py
10:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) (prog-if 00 [VGA controller])
* 1547391204.69
initilizing device
device=<pycuda._driver.Device object at 0x7f5304885890>
context=<pycuda._driver.Context object at 0x7f52e7a29cf8>
* 1547391204.8
done
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
* 1547391242.08
initilizing device
device=<pycuda._driver.Device object at 0x7fe51232e998>
context=<pycuda._driver.Context object at 0x7fe4f5c32cf8>
* 1547391242.25
done
The slowest one takes 170ms, which isn't very high, but it is enough to verify the changes with the updated test script firing the timer every ~10ms:
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
* 1547398645.759
* 1547398645.765
loading pycuda
* 1547398645.779
* 1547398645.793
* 1547398645.805
* 1547398645.805
driver init
* 1547398645.815
* 1547398645.824
initializing device
* 1547398645.824
device=<pycuda._driver.Device object at 0x7ff8569cdcb0>
* 1547398645.825
* 1547398645.836
* 1547398645.847
* 1547398645.857
* 1547398645.867
* 1547398645.878
* 1547398645.888
* 1547398645.898
* 1547398645.908
* 1547398645.914
context=<pycuda._driver.Context object at 0x7ff83a038de8>
all done
The timer now runs regularly, even during the call to make_context.
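This kind of verification can also be automated instead of eyeballed: run a ticker thread and record the largest gap between its wakeups while the suspect call runs; a gap much larger than the tick interval means the call held the GIL. A small sketch, using time.sleep as a stand-in for make_context (sleep releases the GIL, so the gaps stay small here):

```python
import time
from threading import Thread, Event

def ticker(stop, gaps, interval=0.01):
    # Record the gap between consecutive wakeups; if another thread
    # holds the GIL, this thread cannot run and the gap grows.
    last = time.time()
    while not stop.is_set():
        time.sleep(interval)
        now = time.time()
        gaps.append(now - last)
        last = now

stop, gaps = Event(), []
t = Thread(target=ticker, args=(stop, gaps))
t.start()
time.sleep(0.3)        # stand-in for the suspect call (GIL released)
stop.set()
t.join()
print("worst gap: %.3fs" % max(gaps))
```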
r21358 adds a patch to pycuda, which we can submit upstream after further testing. I've uploaded new python-pycuda RPM packages to the https://xpra.org/beta repository.
@nathan_lstc: does that work for you? You should no longer see spurious congestion events during nvenc context initialization.
I have audited the few remaining slow calls to pycuda and nvenc:
- copy_image calls either memcpy_htod or memcpy_dtod, and both already allow other threads to run
- nvEncInitializeEncoder and nvEncDestroyEncoder already use Cython's nogil
- cuda_context.detach() can take ~25ms; I have not modified it because it uses CUDAPP_CALL_GUARDED_CLEANUP, which does not have a THREADED variant, and I think there might well be a reason for that - this can be revisited if the call takes much longer on some setups
The user who reported this problem has tested and he reports that it works. I have likewise tested and it works (but I never saw the problem very much to begin with).
So, yes, I think it works! Thanks!
Based on comment:35, I've created #2111. As per comment:47 and others, without the log sample we won't be improving the detection heuristics so you will need to use the dbus video-region api to avoid context recycling.
I hope that the congestion events were caused by pycuda and that they're now all gone.
At last, closing!
I totally forgot about the log sample.
I have, indeed, resolved my problem with the help of DBUS. XPRA is working great for LS-PREPOST now.
However, I am happy to help you improve XPRA more generally! If you are still interested, I will do it right away - just let me know. Shall I do it?
Open a new ticket, perhaps?
Open a new ticket, perhaps?
Yes please, something like "better video region detection". With a log of screen updates, and what the video region should have been. Then we can see if we can tune the code to auto-detect it properly, given time..
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/2022