Xpra: Ticket #733: nvenc out of memory, leaks, crashes

Not sure if this is due to nvenc v4 (#653), the driver version, changes in our code, or what.. But it's pretty bad.

I've also seen it surface as "error during picture encoding - returned 10: ..." from self.functionList.nvEncEncodePicture. Once this happens, it is also possible to hit "AssertionError: no NVENC device found!", since there is no more memory available on the card.

I keep an eye on the server state using:

watch 'xpra info | egrep -e \
    "window\[[0-9]*\].encoder=|encoder_height|encoder_width|last_failure|context_count|device_count|generation|kernel"'

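The same filter can be applied from a script, for example to log the values over time rather than just watching them. This is only a minimal sketch: it assumes `xpra info` prints one `key=value` pair per line (which is what the egrep expression above implies), it is written for a current Python 3 rather than the interpreter xpra used at the time, and `filter_info` and `snapshot` are hypothetical helper names, not part of xpra.

```python
import re
import subprocess

# Same alternation as the egrep expression above.
PATTERN = re.compile(
    r"window\[[0-9]*\].encoder=|encoder_height|encoder_width|"
    r"last_failure|context_count|device_count|generation|kernel"
)


def filter_info(text):
    """Return only the lines of `xpra info` output that match PATTERN."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


def snapshot():
    """Run `xpra info` once and return the filtered lines (needs a running server)."""
    result = subprocess.run(
        ["xpra", "info"], capture_output=True, text=True, check=True
    )
    return filter_info(result.stdout)
```

Calling `snapshot()` in a loop with a sleep gives roughly what `watch` shows, with the option of writing each sample to a file.
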
The memory does not go back down when we disconnect the client either.

This is a blocker for #653, #466



Sun, 09 Nov 2014 08:36:21 GMT - Antoine Martin: owner, status, description changed


Tue, 11 Nov 2014 15:23:25 GMT - Antoine Martin:

I think it is a threading issue; r8097 adds tests for threading. The problem is that the compression code can create a new video encoding pipeline from multiple threads (the encode thread, the timer worker thread, etc.), and the same goes for closing the encoder.

I believe we need to do ALL of these things, all in the same thread.
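
The idea of funnelling every pipeline operation through one thread can be sketched as follows. This is not the xpra code and not the eventual fix: `EncodeWorker`, `create_pipeline` and `close_pipeline` are hypothetical names used only to illustrate the pattern, where other threads queue requests instead of touching the encoder themselves.

```python
import queue
import threading


class EncodeWorker:
    """Runs every submitted callable on one dedicated thread, in FIFO order."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, name="encode", daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop the worker
                break
            fn, args = item
            try:
                fn(*args)
            except Exception as e:
                # keep the worker alive if one request fails
                print("encode worker: %r failed: %s" % (fn, e))

    def call(self, fn, *args):
        """Queue fn(*args) to run later on the encode thread; safe from any thread."""
        self._queue.put((fn, args))

    def stop(self):
        """Process everything already queued, then stop and join the thread."""
        self._queue.put(None)
        self._thread.join()


def demo():
    seen = []

    # Stand-ins for the real (thread-sensitive) encoder setup and teardown.
    def create_pipeline(origin):
        seen.append(("create from " + origin, threading.current_thread().name))

    def close_pipeline(origin):
        seen.append(("close from " + origin, threading.current_thread().name))

    worker = EncodeWorker()
    # A timer-like thread requests a new pipeline...
    timer = threading.Thread(
        target=worker.call, args=(create_pipeline, "timer"), name="timer"
    )
    timer.start()
    timer.join()
    # ...and the main thread requests the close.
    worker.call(close_pipeline, "main")
    worker.stop()
    return seen
```

Whichever thread makes the request, both stand-in functions run on the thread named "encode", so no locking is needed around the encoder state itself. The cost is that callers no longer get the result synchronously.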


Thu, 13 Nov 2014 06:54:35 GMT - Antoine Martin:

Confirmed as a threading issue. This trivial patch prevents the lockups, but it also causes significant stuttering as we always evaluate the encoding pipeline in the encoding thread - this penalizes x264 and vpx unnecessarily:

--- src/xpra/server/window_video_source.py	(revision 8097)
+++ src/xpra/server/window_video_source.py	(working copy)
@@ -614,7 +614,7 @@
                 self._lossless_threshold_base = min(80, 10+self._current_speed/5)
                 self._lossless_threshold_pixel_boost = 90
-        if self._video_encoder:
+        if self._video_encoder and False:
             self.check_pipeline_score(force_reload)
     def check_pipeline_score(self, force_reload):

Thu, 13 Nov 2014 10:53:24 GMT - Antoine Martin:

I thought I had managed to reproduce it with the patch applied, thinking it wasn't a fix but that it just made the bug harder to hit. It even locked up my X11 session at one point! Not sure if this is related:

[407579.282272] NVRM: Xid (PCI:0000:01:00): 31, Ch 0000000d, engmask 00002100, intr 10000000

I also see this error on the client side sometimes:

Exception: avcodec decoding failed to decode 19847 bytes of h264 data (frame 0, step 1 of 1)

Always on frame 0. As if the data is either invalid or incomplete.

But now that I am trying hard to reproduce it, I cannot hit the bug!?! I thought it might be caused by small dimensions: no change. Frequent changes: I managed over 200 generations without problem, etc. And just as I write this, it happens again!

Sending a SIGUSR1 to the server process shows some threads doing:

Thoughts: maybe this has something to do with the xpra info connection, there were reports of server hangs IIRC. Or maybe it's a threading issue in the auto-refresh. So:

With full nvenc debug enabled, the lockup looks like this:

2014-11-13 17:56:26,762 compress_image(XShmImageWrapper(BGRX: 0, 0, 621, 417), {}) thread=<Thread(encode, started daemon 140456921536256)>
2014-11-13 17:56:26,763 compress_image(..) host buffer populated with 1035828 bytes (max 4259840)
2014-11-13 17:56:26,764 compress_image(..) input buffer copied to device
2014-11-13 17:56:26,765 compress_image(..) kernel BGRA_to_NV12 executed - CSC took 0.2 ms
2014-11-13 17:56:26,766 nvEncMapInputResource(0x7fbeace9e420)
2014-11-13 17:56:26,789 compress_image(..) device buffer mapped to 0x7fbe9c27b450
2014-11-13 17:56:26,790 nvEncEncodePicture(0x7fbeace9f040)
2014-11-13 17:56:26,795 compress_image(..) encoded in 29.6 ms
2014-11-13 17:56:26,796 nvEncLockBitstream(0x7fbeace9ea30)
2014-11-13 17:56:48,670 compress_image(..) output buffer locked, bitstreamBufferPtr=0x7fbe96a8f000
found 7 frames:
2014-11-13 17:56:48,671 nvEncUnlockBitstream(0x7fbe9c26cb30)
140456570648320 - <frame object at 0x29c98f0>:
2014-11-13 17:56:48,674 nvEncUnmapInputResource(0x7fbe9c26cb30)
2014-11-13 17:56:48,683 compress_image(..) download took 21888.6 ms
2014-11-13 17:56:48,684 compress_image(..) returning 5930 bytes (0.1%), complete compression for frame 17 took 21921.4ms

There's a 22 second gap in the nvEncLockBitstream call, and I think even the SIGUSR1 signal was delayed.
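
A gap like this can be located mechanically by comparing the timestamps of consecutive log lines. The following is a minimal sketch that assumes the "YYYY-MM-DD HH:MM:SS,mmm" prefix seen in the log above; `parse_stamp` and `largest_gap` are hypothetical helpers, not part of xpra.

```python
from datetime import datetime


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        # 23 characters: "2014-11-13 17:56:26,796"
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None


def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the biggest timestamp gap."""
    previous = None
    worst = (0.0, None, None)
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue  # e.g. interleaved output without a timestamp
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if gap > worst[0]:
                worst = (gap, previous[1], line)
        previous = (stamp, line)
    return worst
```

Fed the excerpt above, the largest gap falls between the nvEncLockBitstream line (17:56:26,796) and the "output buffer locked" line (17:56:48,670), which is 21.874 seconds.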

Maybe we should release the GIL?


Thu, 13 Nov 2014 14:21:38 GMT - Antoine Martin: attachment set

simple xdotool script to constantly resize the glxgears window to try to cause the server hang


Thu, 13 Nov 2014 14:58:42 GMT - Antoine Martin: keywords set

With r8104 and the change from comment:3, I am not getting any lockups anymore. Tested with the glxgears resize script and by hand. I can reach generation>1000 (that's more than 1000 contexts created and then destroyed), without any server hangs.

Now, I just need to find a better solution than comment:3 ..


Fri, 14 Nov 2014 06:00:41 GMT - Antoine Martin: attachment set

work in progress patch which does everything in the encode thread and removes locking


Fri, 14 Nov 2014 09:50:50 GMT - Antoine Martin: attachment set

better glxgears test script: creates and destroys glxgears instances as well as resizing them


Fri, 14 Nov 2014 10:35:29 GMT - Antoine Martin: owner, status changed

Looks fixed with r8108.

@afarr: can you break it?


Tue, 25 Nov 2014 01:03:43 GMT - alas: owner changed

Looks like this is probably in good shape (you said you ran the server for hours without problem?) ... can go ahead and close unless you think there's some chance of crashing it by going crazy.


Tue, 25 Nov 2014 02:20:59 GMT - Nick Centanni: status changed; resolution set


Sat, 23 Jan 2021 05:04:23 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/733