Bug tracker and wiki

Opened 6 years ago

Closed 6 years ago

Last modified 5 years ago

#517 closed defect (fixed)

nvenc memory leak

Reported by: Antoine Martin
Owned by: Antoine Martin
Priority: critical
Milestone: 0.12
Component: server
Version:
Keywords: nvenc
Cc:

Description

Video sub regions are a little bit unpredictable and often end up destroying video contexts and re-creating them later... which quickly led to:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window_video_source.py", line 1011, in setup_pipeline
    self._video_encoder.init_context(enc_width, enc_height, enc_in_format, encoder_spec.encoding, quality, speed, encoder_scaling, self.encoding_options)
  File "encoder.pyx", line 1315, in xpra.codecs.nvenc.encoder.Encoder.init_context (xpra/codecs/nvenc/encoder.c:7730)
  File "encoder.pyx", line 1351, in xpra.codecs.nvenc.encoder.Encoder.init_cuda (xpra/codecs/nvenc/encoder.c:8492)
  File "encoder.pyx", line 1209, in xpra.codecs.nvenc.encoder.get_BGRA2NV12 (xpra/codecs/nvenc/encoder.c:6516)
  File "encoder.pyx", line 1197, in xpra.codecs.nvenc.encoder.get_CUDA_kernel (xpra/codecs/nvenc/encoder.c:6268)
MemoryError: cuModuleLoadDataEx failed: out of memory - 

And probably also this one:

setup_pipeline failed for (61, None, 'BGRX', codec_spec(nvenc:nvenc))
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window_video_source.py", line 1011, in setup_pipeline
    self._video_encoder.init_context(enc_width, enc_height, enc_in_format, encoder_spec.encoding, quality, speed, encoder_scaling, self.encoding_options)
  File "encoder.pyx", line 1315, in xpra.codecs.nvenc.encoder.Encoder.init_context (xpra/codecs/nvenc/encoder.c:7730)
  File "encoder.pyx", line 1377, in xpra.codecs.nvenc.encoder.Encoder.init_cuda (xpra/codecs/nvenc/encoder.c:9153)
  File "encoder.pyx", line 1424, in xpra.codecs.nvenc.encoder.Encoder.init_nvenc (xpra/codecs/nvenc/encoder.c:9582)
  File "encoder.pyx", line 1183, in xpra.codecs.nvenc.encoder.raiseNVENC (xpra/codecs/nvenc/encoder.c:6065)
Exception: initializing encoder - returned 8: This indicates that one or more of the parameter passed to the API call is invalid.

Note: As part of the work on video regions (#410), nvenc also needed a fix (r5442) for handling input data with a larger rowstride than anticipated (which is often the case with video subregions and XShm).

Change History (5)

comment:1 Changed 6 years ago by Antoine Martin

Status: new → assigned

This is 100% reproducible: simply resizing a fast-refreshing window causes the encoder to re-initialize many times, often losing 15 to 30MB of memory each time.

Strange thing is, when I run a test designed specifically to reproduce this bug by creating and destroying lots of contexts (see r5469), I cannot reproduce the leak there, even after randomizing the input (r5470)!?

comment:2 Changed 6 years ago by Antoine Martin

I think I have found it: we clean the encoder contexts using the background worker to prevent delays in the encoding thread. Calling encoder.clean directly (as done in the tests) prevents the leak.

Either CUDA or NVENC (or both) isn't really thread safe, despite their claims, or the worker gets stuck (which is very unlikely).
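To make the two cleanup paths concrete, here is a minimal sketch in Python 3 (the traceback above is from Python 2.7). StubEncoder and BackgroundWorker are hypothetical stand-ins, not xpra's actual classes; the stub only records which thread ran clean, which is the difference between the leaking path and the one the tests use.

```python
import queue
import threading


class StubEncoder:
    """Hypothetical stand-in for an encoder; only records which thread cleaned it."""

    def __init__(self):
        self.cleaned_in = None

    def clean(self):
        self.cleaned_in = threading.current_thread().name


class BackgroundWorker(threading.Thread):
    """Runs queued callables on a separate thread (illustrative, not xpra's worker)."""

    def __init__(self):
        super().__init__(name="worker", daemon=True)
        self.items = queue.Queue()

    def add(self, fn):
        self.items.put(fn)

    def stop(self):
        self.items.put(None)  # sentinel: exit once earlier items have run

    def run(self):
        while True:
            fn = self.items.get()
            if fn is None:
                break
            fn()


def demo():
    worker = BackgroundWorker()
    worker.start()

    deferred = StubEncoder()
    worker.add(deferred.clean)  # deferred path: clean runs on the worker thread

    direct = StubEncoder()
    direct.clean()              # direct path (as in the tests): clean runs in the caller

    worker.stop()
    worker.join()
    return deferred.cleaned_in, direct.cleaned_in
```

Calling demo() returns the thread name each encoder was cleaned in: the deferred one reports the worker thread, the direct one reports the caller's thread.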

comment:3 Changed 6 years ago by Antoine Martin

Third option, likely the right one: we need locking around the CUDA context switching code (push/pop) to prevent multiple threads (in this case, the encoding thread and the worker thread calling clean) from interacting with the same GPU at the same time. This applies even though each thread goes through a different context object, and could be related to how Python does its threading.

Which means that we will often end up serializing access from the encoding thread anyway, so why bother doing clean in the worker thread and add the complication and overhead of locking? Probably best to just clean from the encoding thread directly.
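For reference, the locking option could look roughly like the sketch below. It is only an illustration: StubContext is a hypothetical stand-in that counts how many contexts are pushed at once, and gpu_context and _gpu_lock are made-up names, not code from the nvenc encoder. A single module-level lock assumes a single GPU.

```python
import threading
from contextlib import contextmanager

# One lock per GPU; a single module-level lock assumes a single device.
_gpu_lock = threading.Lock()


class StubContext:
    """Hypothetical stand-in for a CUDA context exposing push() and pop()."""

    active = 0      # number of contexts currently pushed, across all threads
    max_active = 0  # highest value 'active' has reached

    def push(self):
        StubContext.active += 1
        StubContext.max_active = max(StubContext.max_active, StubContext.active)

    def pop(self):
        StubContext.active -= 1


@contextmanager
def gpu_context(ctx):
    """Hold the GPU lock for the whole push ... pop window."""
    with _gpu_lock:
        ctx.push()
        try:
            yield ctx
        finally:
            ctx.pop()


def use_gpu(ctx, iterations):
    for _ in range(iterations):
        with gpu_context(ctx):
            pass  # encoding or cleanup work would go here


def demo():
    # Two threads, each with its own context object, sharing one GPU lock.
    threads = [
        threading.Thread(target=use_gpu, args=(StubContext(), 1000), name=name)
        for name in ("encode", "worker")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return StubContext.max_active
```

With the lock held across push and pop, no two contexts are ever pushed at the same time, which is exactly the serialization described above: the encoding thread ends up waiting on the worker whenever the worker is cleaning.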

Alternatively, we could split the cleanup into 2 parts:

  • clean: called from encoding thread, nvenc (and probably also cuda csc) can do everything from there
  • destroy: called from worker thread later, other encoders can do their cleanup there cheaply
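A rough sketch of that two-part split, under the assumption of a common base class with both methods. Codec, GpuCodec and retire are hypothetical names for illustration only, and the GPU teardown is represented by a flag rather than real CUDA or NVENC calls.

```python
import threading


class Codec:
    """Hypothetical base class offering both cleanup hooks."""

    def __init__(self):
        self.log = []

    def clean(self):
        # Called from the encoding thread.
        self.log.append(("clean", threading.current_thread().name))

    def destroy(self):
        # Called later from the worker thread; cheap place for slow cleanup.
        self.log.append(("destroy", threading.current_thread().name))


class GpuCodec(Codec):
    """Stand-in for nvenc: all GPU teardown happens in clean()."""

    def __init__(self):
        super().__init__()
        self.gpu_resources_freed = False

    def clean(self):
        super().clean()
        self.gpu_resources_freed = True  # placeholder for the real GPU teardown

    def destroy(self):
        super().destroy()
        assert self.gpu_resources_freed  # nothing GPU-related is left to do here


def retire(codec):
    """Run clean() in the calling thread, then destroy() on another thread."""
    codec.clean()
    worker = threading.Thread(target=codec.destroy, name="worker")
    worker.start()
    worker.join()  # joined only so the demo is deterministic
    return codec.log
```

Calling retire(GpuCodec()) returns a log with the clean step recorded in the caller's thread followed by the destroy step recorded in the worker thread.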

The cleanup was moved to the worker thread because the cost of setting up or destroying an nvenc context is high (see r4708). The simpler option is still probably better, and it makes #466 more pressing: keeping the same context whilst resizing will mitigate this cost.

This will need to be backported to v0.11.x.

Last edited 6 years ago by Antoine Martin

comment:4 Changed 6 years ago by Antoine Martin

Resolution: fixed
Status: assigned → closed

Fixed in r5473, backport in r5474. Will follow up in #466.

Another leak, on the decoding side this time, was caused by r5515 and fixed in r5542.

Last edited 6 years ago by Antoine Martin

comment:5 Changed 5 years ago by Antoine Martin

Note for those landing here: NVENC is not safe to use in versions older than 0.15 because of a context leak due to threading.
