xpra icon
Bug tracker and wiki

Opened 2 years ago

Closed 20 months ago

#910 closed enhancement (fixed)

fault injection for testing and triggering bugs

Reported by: Antoine Martin Owned by: Antoine Martin
Priority: major Milestone: 0.16
Component: core Version: 0.15.x
Keywords: Cc:

Description

Hopefully this can help us with hard to trigger bugs and for catching corner cases that are difficult to encounter under normal circumstances. Things like error handlers, etc..

For example:

  • gibberish should always cause the new sound wrapper process to abort, r9888 fixes this (r9891 improves the logging) - r9889 makes it easier to trigger by adding fault injection environment variables:
    • XPRA_PROTOCOL_FAULT_INJECTION_RATE=1000 xpra ... will cause the xpra protocol layer to send an invalid packet every 1000 packets - this should cause the other end to hangup
    • XPRA_WRAPPER_FAULT_INJECTION_RATE=500 xpra ... does the same thing but for the wrapper layer
  • failures to encode packets are rare, but can occur - r9892 makes it easier to trigger by adding "None" values to the sound data (which fails when using the bencoder - no failures with rencode..)

More to come, for #902.

Change History (6)

comment:1 Changed 2 years ago by Antoine Martin

Status: newassigned

r9894 does the same thing for the client, pretending that we failed to paint something (which should trigger a refresh):

$ XPRA_PAINT_FAULT_INJECTION_RATE=100 xpra attach --no-mmap
(...)
injecting paint fault for h264 packet 100
injecting paint fault for h264 packet 200
...

comment:2 Changed 2 years ago by Antoine Martin

r9899 adds the ability to fail to report the failures to the server with:

XPRA_PAINT_FAULT_INJECTION_RATE=100 XPRA_PAINT_FAULT_INJECTION_TELL=0 xpra attach

Which makes the server stop sending window updates and eventually print:

delayed_region_timeout: region is 15025ms old, bad connection?
...

r9824 also added the ability to simulate "hello" initial packet parsing errors with:

XPRA_SIMULATE_SERVER_HELLO_ERROR=1 xpra start ...

Which tests the hello error handling.

comment:3 Changed 22 months ago by Antoine Martin

Owner: changed from Antoine Martin to alas
Status: assignednew

This will do for now.
As of r10587, all the fault injection environment variables are documented here: wiki/Usage/EnvironmentOptions.

@afarr: the idea here is to make it easier to exercise code which would not normally fire. we want to check:

  • that the environment variables do trigger those errors
  • that the system deals with it as gracefully as it can

PS: not doing an xpra control interface as this is too intricate

comment:4 Changed 20 months ago by Antoine Martin

r10992 also makes it possible to fail individual codecs:

  • as if the module was not installed:
    XPRA_CODEC_FAIL_IMPORT=enc_pillow,dec_pillow ./xpra/codecs/loader.py
    
  • as if the selftests had failed (which triggers a warning):
    XPRA_CODEC_FAIL_SELFTEST=enc_pillow,dec_pillow ./xpra/codecs/loader.py 
    

This can be useful for testing different decoders/encoders, in particular disabling pillow support. (see also #1010)

comment:5 Changed 20 months ago by alas

Owner: changed from alas to Antoine Martin
  • XPRA_CODEC_FAIL_IMPORT=enc_pillow,dec_pillow ./xpra/codecs/loader.py run on the server (0.16.0 r11031 fedora 21) outputs:
[root@jimador codecs]# XPRA_CODEC_FAIL_IMPORT=enc_pillow,dec_pillow python loader.py
codecs and csc modules found:
* csc_cython           : csc_cython/colorspace_converter.so
* csc_opencl           :
* csc_swscale          : csc_swscale/colorspace_converter.so
* dec_avcodec2         : dec_avcodec2/decoder.so
* dec_pillow           :
* dec_vpx              : vpx/decoder.so
* dec_webp             : webp/decode.so
* enc_pillow           :
* enc_vpx              : vpx/encoder.so
* enc_webp             : webp/encode.so
* enc_x264             : enc_x264/encoder.so
* enc_x265             :
* nvenc4               :
* nvenc5               :

codecs versions:
* PIL                  : 3.0.0
* avcodec2             : 56.41.100
* buffer_api           : 1
* cython               : 1.0.23.1
* dec_pillow           : 3.0.0
* dec_webp             : 0.4.3
* enc_pillow           : 3.0.0
* enc_webp             : 0.4.3
* numpy                : 1.8.2
* swscale              : 3.1.101
* vpx                  : 1.4.0
* x264                 : 146

... presumably as expected.

Likewise, XPRA_CODEC_FAIL_SELFTEST=enc_pillow,dec_pillow ./xpra/codecs/loader.py outputs:

[root@jimador codecs]# XPRA_CODEC_FAIL_SELFTEST=enc_pillow,dec_pillow python loader.py
failed to import Pillow encoder (enc_pillow)
 codec found in fail selftest list
failed to import Pillow decoder (dec_pillow)
 codec found in fail selftest list
codecs and csc modules found:
* csc_cython           : csc_cython/colorspace_converter.so
* csc_opencl           :
* csc_swscale          : csc_swscale/colorspace_converter.so
* dec_avcodec2         : dec_avcodec2/decoder.so
* dec_pillow           :
* dec_vpx              : vpx/decoder.so
* dec_webp             : webp/decode.so
* enc_pillow           :
* enc_vpx              : vpx/encoder.so
* enc_webp             : webp/encode.so
* enc_x264             : enc_x264/encoder.so
* enc_x265             :
* nvenc4               :
* nvenc5               :

codecs versions:
* PIL                  : 3.0.0
* avcodec2             : 56.41.100
* buffer_api           : 1
* cython               : 1.0.23.1
* dec_pillow           : 3.0.0
* dec_webp             : 0.4.3
* enc_pillow           : 3.0.0
* enc_webp             : 0.4.3
* numpy                : 1.8.2
* swscale              : 3.1.101
* vpx                  : 1.4.0
* x264                 : 146

  • Setting XPRA_PAINT_FAULT_INJECTION_RATE=100 on client side shows the expected injecting paint fault warnings, but also triggered the following traceback on a 0.16.0 r10983 windows 8.1 client:
2015-10-26 18:19:27,586 draw error
Traceback (most recent call last):
  File "xpra\client\ui_client_base.pyc", line 2475, in _do_draw
  File "xpra\client\client_window_base.pyc", line 539, in draw_region
  File "xpra\client\window_backing_base.pyc", line 516, in draw_region
  File "xpra\client\window_backing_base.pyc", line 301, in paint_rgb32
  File "xpra\client\window_backing_base.pyc", line 183, in process_delta
AssertionError: delta bucket 2 data does not match: expected (1051, 1, 2078) but got (1051, 1, 2075)

Disconnecting then running with both XPRA_PAINT_FAULT_INJECTION_RATE=100 XPRA_PAINT_FAULT_INJECTION_TELL=0 xpra attach - I did very quickly get not only the expected 2015-10-26 18:21:03,944 delayed_region_timeout: region is 15041ms old, bad connection? error (and a black screen in place of browser app contents until the video stopped "playing") - I also got an error that might be of interest:

2015-10-26 18:23:51,826 Error: expiring missing damage acks: [96].


Running server with XPRA_SIMULATE_SERVER_HELLO_ERROR=1, with no encryption or with encryption and password-file and encryption-keyfile triggers errors that I expect are expected:

Server-side:

2015-10-26 18:29:51,038 Connection lost
2015-10-26 18:29:51,250 server error processing new connection from Protocol(tcp socket: 10.0.32.136:2200 <- 10.0.11.162:64264): Simulating a server error
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_core.py", line 657, in _process_hello
    raise Exception("Simulating a server error")
Exception: Simulating a server error
2015-10-26 18:29:51,251 Disconnecting client '10.0.11.162:64264':
2015-10-26 18:29:51,252  server error (error accepting new connection)
2015-10-26 18:29:51,253 Connection lost

And client side (in yellow, which I take it indicates it is a simulated error?):

2015-10-26 18:29:34,960 server failure: disconnected before the session could be established
2015-10-26 18:29:34,960 server requested disconnect: server error (error accepting new connection)
2015-10-26 18:29:34,994 Connection lost

When terminating the server running with encryption and password-file and encryption-keyfile, with a control-C, I got an unusual traceback:

^C/usr/lib64/python2.7/threading.py:1163: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  return _active[_get_ident()]
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/child_reaper.py", line 175, in reap
    self.poll()
  File "/usr/lib64/python2.7/site-packages/xpra/child_reaper.py", line 113, in poll
    self.add_dead_process(procinfo)
  File "/usr/lib64/python2.7/site-packages/xpra/child_reaper.py", line 154, in add_dead_process
    cb(procinfo.process)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 386, in pulseaudio_ended
    soundlog.warn("Warning: the pulseaudio server process has terminated after %i seconds", int(elapsed))
  File "/usr/lib64/python2.7/site-packages/xpra/log.py", line 329, in warn
    self.log(logging.WARN, msg, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/xpra/log.py", line 318, in log
    global_logging_handler(self.logger.log, level, msg, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/xpra/log.py", line 100, in standard_logging
    log(level, msg, *args, **kwargs)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1224, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in _log
    record = self.makeRecord(self.name, level, fn, lno, msg, args, exc_info, func, extra)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1252, in makeRecord
    rv = LogRecord(name, level, fn, lno, msg, args, exc_info, func)
  File "/usr/lib64/python2.7/logging/__init__.py", line 291, in __init__
    self.threadName = threading.current_thread().name
  File "/usr/lib64/python2.7/threading.py", line 1163, in currentThread
    return _active[_get_ident()]
KeyboardInterrupt

  • XPRA_PROTOCOL_FAULT_INJECTION_RATE=1000 did indeed cause the client to disconnect with the following output (in yellow):
2015-10-26 18:43:26,890 Received uninterpretable nonsense: invalid packet header: '50726f746f636f6c' read buffer='Protocol JUNK! added by fault injection code'
2015-10-26 18:43:26,890  packet no 369 data: 'Protocol JUNK! added by fault injection code'
2015-10-26 18:43:26,890 sound-sink stopping
2015-10-26 18:43:26,951 Connection lost

Server-side though, I got the fault injection warning in yellow, but then an error in red (though, the yellow does explain that, rather):

2015-10-26 18:43:43,167 injecting fault in Protocol(tcp socket: 10.0.32.136:2200 <- 10.0.11.162:64292)
2015-10-26 18:43:43,874 internal error: read connection tcp socket: 10.0.32.136:2200 <- 10.0.11.162:64292 reset
2015-10-26 18:43:43,874  [Errno 104] Connection reset by peer
  • Oddly (maybe I misread the above?) XPRA_WRAPPER_FAULT_INJECTION_RATE=500 doesn't do "the same thing but for the wrapper layer" - instead it seems to target the sound channel (?):

Server side:

2015-10-26 18:47:11,597 injecting fault in Protocol(Pipe(sound-source))
2015-10-26 18:47:11,599 sound-source injecting fault in Protocol(Pipe())
2015-10-26 18:47:11,602 sound-source gibberish received:
2015-10-26 18:47:11,603 sound-source  "invalid packet header: '5772617070657220' read buffer='Wrapper JUNK! added by fa"...
2015-10-26 18:47:11,603 sound-source  stopping
2015-10-26 18:47:11,604 sound-source stopping on gibberish:
2015-10-26 18:47:11,603 sound-source stopping
2015-10-26 18:47:11,604  "invalid packet header: '5772617070657220' read buffer='Wrapper JUNK! added by fa"...

And client side all I see is 2015-10-26 18:46:55,338 sound-sink stopping.


Not sure if the wrapper layer fault injection is meant to only target the sound or not... in either case, I'll assign this back to you to wait on more new debugging tricks.

comment:6 Changed 20 months ago by Antoine Martin

Resolution: fixed
Status: newclosed

For the codec stuff XPRA_CODEC_FAIL_IMPORT and XPRA_CODEC_FAIL_SELFTEST see ticket:1010#comment:4


With XPRA_PAINT_FAULT_INJECTION_TELL=0, you are meant to get both:
region is 15041ms old, bad connection? and Error: expiring missing damage acks: [96]: the missing ack is making us delay the sending of new updates until we hit the timeout of 15 seconds.


And client side (in yellow, which I take it indicates it is a simulated error?):


0.16 includes a coloured stream logger, which is used with the server when running with {{{daemon=no) and with the client when running from a console.
The definition of the colours can be found in the source: colorstreamhandler.py.


When terminating the server running with encryption and password-file and encryption-keyfile, with a control-C, I got an unusual traceback:
^C/usr/lib64/python2.7/threading.py:1163: RuntimeWarning: tp_compare didn't return -1 or -2 for exception


I believe that this only happens when running with daemon=no and using control-c, and it shouldn't be too much of a problem: if the server doesn't immediately exit, just hit control-c again.
So I'm leaving this one as it is.


Server-side though, I got the fault injection warning in yellow, but then an error in red (though, the yellow does explain that, rather):
..


The warning tells you it is sending garbage to the client, the client then drops the connection and the server sees that as an error. All is well.


Not sure if the wrapper layer fault injection is meant to only target the sound or not... in either case, I'll assign this back to you to wait on more new debugging tricks.


The wrapper is generic but is currently only used for the sound subprocess.

I think this is enough for now, I have added a link to this ticket from wiki/Debugging under "fault injection", and I will try to add more fault injection as I go along.

Note: See TracTickets for help on using tickets.