Xpra: Ticket #733: nvenc out of memory, leaks, crashes

Not sure if this is due to nvenc v4 (#653), the driver version, changes in our code, or something else, but it's pretty bad.

r8085 adds some debugging to xpra info. It shows that when we resize a window, we go through many cycles of creating and then destroying nvenc contexts before we settle down (the pipeline scoring is playing yo-yo). This needs to be fixed (though it is actually quite useful for testing!), and in itself it should not be critical... except that:
nvenc leaks memory, about 50MB per context: resize the window once and you can go through 20 contexts and about 1GB of memory! This looks like pinned device memory, judging by this error message: MemoryError: cuCtxCreate failed: out of memory, which fires from the pycuda code device.make_context(flags=cf.SCHED_YIELD | cf.MAP_HOST).

I've also seen it coming up as error during picture encoding - returned 10: ... during self.functionList.nvEncEncodePicture. Once this happens, it is also possible to hit AssertionError: no NVENC device found! since there is no more memory available on the card!

I keep an eye on the server state using:

watch 'xpra info | egrep -e \
    "window\[[0-9]*\].encoder=|encoder_height|encoder_width|last_failure|context_count|device_count|generation|kernel"'

The memory does not go back down when we disconnect the client either.
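
For scripted checks, the same filtering as the watch command above can be sketched in Python. This is only an illustration: the regular expression is copied from the egrep expression, but the sample keys and values below are made up for the example and are not real xpra info output.

```python
import re

# Same alternatives as the egrep expression used with `watch` above.
INFO_KEYS = re.compile(
    r"window\[[0-9]*\].encoder=|encoder_height|encoder_width|"
    r"last_failure|context_count|device_count|generation|kernel"
)

def filter_info(text):
    """Return the matching 'key=value' lines of an info dump as a dict."""
    selected = {}
    for line in text.splitlines():
        if "=" in line and INFO_KEYS.search(line):
            key, _, value = line.partition("=")
            selected[key.strip()] = value.strip()
    return selected

# Hypothetical sample, for illustration only (not captured from a real server).
sample = """\
window[1].encoder=nvenc
window[1].encoder_width=621
session.name=test
example.context_count=3
"""

for key, value in sorted(filter_info(sample).items()):
    print("%s=%s" % (key, value))
```

Feeding it the output of xpra info (for example via subprocess) would make it easy to log context_count over time instead of watching it by eye.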

This is a blocker for #653, #466

Sun, 09 Nov 2014 08:36:21 GMT - Antoine Martin: owner, status, description changed

owner changed from Antoine Martin to Antoine Martin
status changed from new to assigned
description modified (diff)

Tue, 11 Nov 2014 15:23:25 GMT - Antoine Martin:

I think it is a threading issue; r8097 adds tests for threading. The problem is that the compression code can create a new video encoding pipeline from multiple threads: from the encode thread, from the timer worker thread, etc. The same goes for closing the encoder.

I believe we need to do ALL of these things, all in the same thread.
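
A minimal sketch of the "everything in one thread" idea, assuming a single-worker executor is acceptable. This is not the xpra code: FakeEncoder, create_encoder and close_encoder are hypothetical names used only to show that creation and cleanup end up on the same thread no matter which thread asks for them.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One worker only: every job submitted here runs on the same thread, in order.
encoder_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="encode")

class FakeEncoder:
    """Hypothetical stand-in for an encoder context; records which threads touch it."""
    def __init__(self):
        self.threads = {threading.current_thread().name}

    def close(self):
        self.threads.add(threading.current_thread().name)

def create_encoder():
    # The constructor runs on the worker thread, the caller just waits for the result.
    return encoder_executor.submit(FakeEncoder).result()

def close_encoder(encoder):
    # Cleanup is funnelled through the same worker thread.
    encoder_executor.submit(encoder.close).result()

def demo():
    """Create and close encoders from several threads, return the threads that did the work."""
    seen = set()

    def from_other_thread():
        encoder = create_encoder()
        close_encoder(encoder)
        seen.update(encoder.threads)

    callers = [threading.Thread(target=from_other_thread) for _ in range(4)]
    for t in callers:
        t.start()
    for t in callers:
        t.join()
    return seen

print(demo())
```

Note that calling these helpers from the worker thread itself would block forever on result(), so real code would have to detect that case and call the function directly.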

Thu, 13 Nov 2014 06:54:35 GMT - Antoine Martin:

Confirmed as a threading issue. This trivial patch prevents the lockups, but it also causes significant stuttering because we always evaluate the encoding pipeline in the encoding thread, which penalizes x264 and vpx unnecessarily:

--- src/xpra/server/window_video_source.py	(revision 8097)
+++ src/xpra/server/window_video_source.py	(working copy)
@@ -614,7 +614,7 @@
                 self._lossless_threshold_base = min(80, 10+self._current_speed/5)
                 self._lossless_threshold_pixel_boost = 90
-        if self._video_encoder:
+        if self._video_encoder and False:
             self.check_pipeline_score(force_reload)
     def check_pipeline_score(self, force_reload):

Thu, 13 Nov 2014 10:53:24 GMT - Antoine Martin:

I thought I had managed to reproduce it with the patch applied, thinking it wasn't a fix but that it just made the bug harder to hit. It even locked up my X11 session at one point! Not sure if this is related:

[407579.282272] NVRM: Xid (PCI:0000:01:00): 31, Ch 0000000d, engmask 00002100, intr 10000000

I also see this error on the client side sometimes:

Exception: avcodec decoding failed to decode 19847 bytes of h264 data (frame 0, step 1 of 1)

Always on frame 0. As if the data is either invalid or incomplete.

But now that I am trying hard to reproduce it, I cannot hit the bug!?! I thought it might be caused by small dimensions: no change. Frequent changes: managed over 200 generations without problem... etc. And just as I write this, it happens again!

Sending a SIGUSR1 to the server process shows some threads doing:

condition.wait from _write_format_thread_loop, _read_parse_thread_loop, _write_thread_loop and _write_format_thread_loop, background_worker.
untilConcludes from _read_thread_loop * 2 (info thread)
and itself: dump_frames from sigusr1

this confusing trace:

140729086670592 - <frame object at 0x7ffdf801c0f0>:
  File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 1582, in encode_loop
    fn_and_args = self.compression_work_queue.get(True)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window_source.py", line 1201, in make_data_packet_cb
    refreshlog("auto refresh: %5s screen update (quality=%3i), %s (region=%s, refresh regions=%s)", encoding, actual_quality, msg, region, self.refresh_regions)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window_source.py", line 1470, in make_data_packet
    return packet
  File "/usr/lib64/python2.7/site-packages/xpra/server/window_video_source.py", line 1192, in video_encode
    return self._video_encoder.get_encoding(), Compressed(encoding, data), client_options, width, height, 0, 24
  File "/usr/lib64/python2.7/site-packages/xpra/log.py", line 163, in __call__
    self.log(logging.DEBUG, msg, *args, **kwargs)
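
For reference, a per-thread dump like the one above can be produced with sys._current_frames(). The following is a minimal sketch and not the actual xpra dump_frames implementation; the function names are made up for the example.

```python
import signal
import sys
import threading
import traceback

def format_all_thread_stacks():
    """Return one text block with the current stack of every Python thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    frames = sys._current_frames()
    lines = ["found %i frames:" % len(frames)]
    for ident, frame in frames.items():
        lines.append("%s - %s:" % (ident, names.get(ident, "unknown")))
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)

def install_sigusr1_dump():
    """Print all thread stacks to stderr on SIGUSR1 (Unix only, call from the main thread)."""
    def handler(signum, frame):
        print(format_all_thread_stacks(), file=sys.stderr)
    signal.signal(signal.SIGUSR1, handler)

if __name__ == "__main__":
    print(format_all_thread_stacks())
```

Python level signal handlers only run in the main thread, between bytecode instructions, so a native call that holds the GIL for a long time can delay the dump until that call returns.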

Thoughts: maybe this has something to do with the xpra info connection; there were reports of server hangs IIRC. Or maybe it's a threading issue in the auto-refresh. So:

must try to reproduce without info running in parallel - or with aggressive info, to see if it crashes sooner
try with auto-refresh turned off
try without the video regions code (who knows..)

With full nvenc debug enabled, the lockup looks like this:

2014-11-13 17:56:26,762 compress_image(XShmImageWrapper(BGRX: 0, 0, 621, 417), {}) thread=<Thread(encode, started daemon 140456921536256)>
2014-11-13 17:56:26,763 compress_image(..) host buffer populated with 1035828 bytes (max 4259840)
2014-11-13 17:56:26,764 compress_image(..) input buffer copied to device
2014-11-13 17:56:26,765 compress_image(..) kernel BGRA_to_NV12 executed - CSC took 0.2 ms
2014-11-13 17:56:26,766 nvEncMapInputResource(0x7fbeace9e420)
2014-11-13 17:56:26,789 compress_image(..) device buffer mapped to 0x7fbe9c27b450
2014-11-13 17:56:26,790 nvEncEncodePicture(0x7fbeace9f040)
2014-11-13 17:56:26,795 compress_image(..) encoded in 29.6 ms
2014-11-13 17:56:26,796 nvEncLockBitstream(0x7fbeace9ea30)
2014-11-13 17:56:48,670 compress_image(..) output buffer locked, bitstreamBufferPtr=0x7fbe96a8f000
found 7 frames:
2014-11-13 17:56:48,671 nvEncUnlockBitstream(0x7fbe9c26cb30)
140456570648320 - <frame object at 0x29c98f0>:
2014-11-13 17:56:48,674 nvEncUnmapInputResource(0x7fbe9c26cb30)
2014-11-13 17:56:48,683 compress_image(..) download took 21888.6 ms
2014-11-13 17:56:48,684 compress_image(..) returning 5930 bytes (0.1%), complete compression for frame 17 took 21921.4ms

There's a 22 second gap while calling nvEncLockBitstream, and I think even the SIGUSR1 signal was delayed.
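
Gaps like this are easy to miss in long debug logs. A small sketch that finds the largest gap between consecutive timestamped lines, assuming the timestamp format shown in the log above (lines without a timestamp, such as the interleaved frame dump, are skipped):

```python
from datetime import datetime

def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp, or return None."""
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None

def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the biggest pause in the log."""
    best = (0.0, None, None)
    prev_ts = prev_line = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # not a log line (for example the SIGUSR1 dump output)
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best

# A few lines copied from the log excerpt above.
excerpt = """\
2014-11-13 17:56:26,795 compress_image(..) encoded in 29.6 ms
2014-11-13 17:56:26,796 nvEncLockBitstream(0x7fbeace9ea30)
2014-11-13 17:56:48,670 compress_image(..) output buffer locked, bitstreamBufferPtr=0x7fbe96a8f000
found 7 frames:
2014-11-13 17:56:48,671 nvEncUnlockBitstream(0x7fbe9c26cb30)
""".splitlines()

seconds, before, after = largest_gap(excerpt)
print("%.3f seconds after: %s" % (seconds, before))
```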

Maybe we should release the GIL?

Thu, 13 Nov 2014 14:21:38 GMT - Antoine Martin: attachment set

attachment set to glxgears-resize.sh

simple xdotool script to constantly resize the glxgears window to try to cause the server hang

Thu, 13 Nov 2014 14:58:42 GMT - Antoine Martin: keywords set

keywords nvenc added

With r8104 and the change from comment:3, I am not getting any lockups anymore. Tested with the glxgears resize script and by hand. I can reach generation>1000 (that's more than 1000 contexts created and then destroyed), without any server hangs.

Now, I just need to find a better solution than comment:3...

Fri, 14 Nov 2014 06:00:41 GMT - Antoine Martin: attachment set

attachment set to video-lockless.patch

work in progress patch which does everything in the encode thread and removes locking

Fri, 14 Nov 2014 09:50:50 GMT - Antoine Martin: attachment set

attachment set to glxgears-resize.2.sh

better glxgears test script: creates and destroys glxgears instances as well as resizing them

Fri, 14 Nov 2014 10:35:29 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to alas
status changed from assigned to new

Looks fixed with r8108.

@afarr: can you break it?

Tue, 25 Nov 2014 01:03:43 GMT - alas: owner changed

owner changed from alas to Nick Centanni

Looks like this is probably in good shape (you said you ran the server for hours without problem?) ... can go ahead and close unless you think there's some chance of crashing it by going crazy.

Tue, 25 Nov 2014 02:20:59 GMT - Nick Centanni: status changed; resolution set

status changed from new to closed
resolution set to fixed

Sat, 23 Jan 2021 05:04:23 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/733