Xpra: Ticket #1475: X errors crash server

This undesirable behaviour appears to be new-ish (last 2 weeks or so).

With Ubuntu 16.10 on both client and server, xpra server will crash unexpectedly. LibreOffice? application is particularly vulnerable: open a spreadsheet containing data, try to scroll using mouse-wheel. Slow scrolling often works; faster scrolling kills the server.

Sample log file entry from server shows version number at top, X problems at bottom:

 $ tail -n 60 \:82.log
2017-03-27 07:39:59,925 New unix-domain connection received on /home/loew/.xpra/infty-82
2017-03-27 07:39:59,930 Handshake complete; enabling connection
2017-03-27 07:40:00,017 Python/Gtk2 Linux Ubuntu 16.04 xenial client version 2.0-r15319 64-bit
2017-03-27 07:40:00,017  connected from 'mpdhost' as 'loew' - 'Philip Loewen'
2017-03-27 07:40:00,017  using rgb as primary encoding
2017-03-27 07:40:00,017  also available:
2017-03-27 07:40:00,017   h264, vp9, vp8, png, png/P, png/L, rgb24, jpeg, rgb32
2017-03-27 07:40:00,018  client root window size is 1680x1050 with 1 display:
2017-03-27 07:40:00,019   :0.0 (444x277 mm - DPI: 96x96) workarea: 1680x1024
2017-03-27 07:40:00,019     monitor 1 (474x296 mm - DPI: 90x90)
2017-03-27 07:40:00,019     monitor 2 1280x720 (339x190 mm - DPI: 95x96)
2017-03-27 07:40:00,046 setting key repeat rate from client: 500ms delay / 30ms interval
2017-03-27 07:40:00,047 setting keymap: rules=evdev, model=pc105, layout=us
2017-03-27 07:40:00,144 client 1: Attached to ssh/loew@infty.math.ubc.ca/:82 (press Control-C to detach)
2017-03-27 07:40:01,133 client 1: Error: printing disabled:
2017-03-27 07:40:01,133 client 1:  No module named cups
2017-03-27 07:40:03,145 client 1: Warning: the sound output process has failed to start
2017-03-27 07:40:03,965 client 1: Warning: the opus+ogg sound sink has stopped
2017-03-27 07:40:05,329 Introspect error on :1.6:/org/libreoffice: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.6 was not provided by any .service files
2017-03-27 07:40:05,330 Executing introspect queue due to error
2017-03-27 07:40:05,330 Introspect error on :1.6:/org/libreoffice/window/14680136: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.6 was not provided by any .service files
2017-03-27 07:40:05,330 Executing introspect queue due to error
2017-03-27 07:40:05,332 Introspect error on :1.6:/org/libreoffice/menus/appmenu: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.6 was not provided by any .service files
2017-03-27 07:40:05,332 Executing introspect queue due to error
2017-03-27 07:40:05,332 Error: failed to query org.gtk.Actions at /org/libreoffice:
2017-03-27 07:40:05,333  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:05,333 Error: failed to query org.gtk.Actions at /org/libreoffice:
2017-03-27 07:40:05,333  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:05,333 Error: failed to query org.gtk.Actions at /org/libreoffice/window/14680136:
2017-03-27 07:40:05,333  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:05,333 Error: failed to query org.gtk.Actions at /org/libreoffice/window/14680136:
2017-03-27 07:40:05,333  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:05,334 Error: failed to query org.gtk.Menus at /org/libreoffice/menus/appmenu:
2017-03-27 07:40:05,334  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:08,588 Introspect error on :1.6:/org/libreoffice: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.6 was not provided by any .service files
2017-03-27 07:40:08,588 Executing introspect queue due to error
2017-03-27 07:40:08,588 Introspect error on :1.6:/org/libreoffice/window/14680365: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.6 was not provided by any .service files
2017-03-27 07:40:08,588 Executing introspect queue due to error
2017-03-27 07:40:08,588 Introspect error on :1.6:/org/libreoffice/menus/appmenu: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name :1.6 was not provided by any .service files
2017-03-27 07:40:08,589 Executing introspect queue due to error
2017-03-27 07:40:08,589 Error: failed to query org.gtk.Actions at /org/libreoffice:
2017-03-27 07:40:08,589  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:08,590 Error: failed to query org.gtk.Actions at /org/libreoffice:
2017-03-27 07:40:08,590  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:08,591 Error: failed to query org.gtk.Actions at /org/libreoffice/window/14680365:
2017-03-27 07:40:08,591  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:08,591 Error: failed to query org.gtk.Actions at /org/libreoffice/window/14680365:
2017-03-27 07:40:08,591  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:40:08,591 Error: failed to query org.gtk.Menus at /org/libreoffice/menus/appmenu:
2017-03-27 07:40:08,591  (DBusException(dbus.String(u'The name :1.6 was not provided by any .service files'),),)
2017-03-27 07:47:30,588 client 1: ignoring draw received for a window which is not realized yet!
The program 'Xpra' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadDrawable (invalid Pixmap or Window parameter)'.
  (Details: serial 83017 error_code 9 request_code 14 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)

I get this with client "xpra v2.0-r15319" on Ubuntu 16.10, but also with a recent xpra client on Mac OSX. So far I have seen it *only* when running LibreOffice? ... in both spreadsheet and word processing modes.

I think this problem started with a comparatively recent xpra update.

Here is the command I use when launching the server:

dbus-launch xpra start              \
  --start-child=/usr/bin/urxvt      \
  --encoding=rgb                    \
  --opengl=no                       \
  --pulseaudio=no                   \
  --mdns=no                         \
  :82 2> /dev/null &

The LibreOffice? version on the server is 5.1.6.2, Build ID: 1:5.1.6~rc2-0ubuntu1~xenial1, CPU Threads: 4; OS Version: Linux 4.4; UI Render: default; Locale: en-CA (en_CA.UTF-8); Calc: group



Mon, 27 Mar 2017 15:53:08 GMT - Antoine Martin: priority, status changed; milestone set

Faster scrolling triggers the "new and enhanced" scrolling detection code: #1426. Does this go away if you start the server with:

XPRA_SCROLL_ENCODING=0 xpra start ...

If this is reproducible, I should be able to fix that quickly. Strange thing is that the scrolling detection code got tested pretty thoroughly. Mostly with Firefox here, but the application that triggers the scrolling detection code shouldn't make any difference.


FYI minor things you may not know:


Mon, 27 Mar 2017 17:08:08 GMT - Philip D Loewen:

Regrettably, the suggested change to an environment variable did not solve the problem. (I accepted all other suggestions in comment:1.)

Further info from the client side: sometimes the last thing printed on the controlling terminal window before the server dies is a line like this:

2017-03-27 09:55:43,112 ignoring draw received for a window which is not realized yet!

Tue, 28 Mar 2017 08:19:52 GMT - Antoine Martin: attachment set

ubuntu 16.04 crash log with "-d all"


Tue, 28 Mar 2017 11:03:59 GMT - Antoine Martin: owner, status changed

This bug can be reproduced, but not very reliably... sometimes the bug triggers immediately, sometimes it takes much longer. It also only triggers on Ubuntu and not Fedora because the Ubuntu version uses short-lived tooltip windows showing the row number - this is what triggers the bug. Adding debug statements to all the calls following do_xpra_child_map_event also seems to make the problem go away! (Heisenbug) Making it harder to track down.

This sort of low level X11 bug usually happens when we make X11 calls without synchronizing (calling the gdk X11 flush logic). Problem is that I cannot see any naked X11 calls in there, all the X11 calls in the window model code is already guarded by "xsync" context managers. So r15436 may well "fix" the problem in that it will be much harder to hit, but not necessarily completely fixed either.. At least I can't seem to hit the bug anymore.

Not sure why this only started occurring in the last 2 weeks, as this code has not really changed for years...

@Philip D Loewen: does that fix the problem for you? (new beta builds posted for Ubuntu 16.x)


Tue, 28 Mar 2017 15:15:38 GMT - Philip D Loewen:

Installed r15436 and tested lightly -- looks good so far. Will test aggressively through the day (UTC-7 here) and report again.

(I crashed the server also with the LibreOffice word processor and even the Thunderbird email client -- so the tooltips in LO Calc would have been a coal-mine canary of sorts rather than the single key issue. Heisenbug indeed.)


Fri, 31 Mar 2017 04:06:27 GMT - Philip D Loewen:

Normal usage that formerly crashed the server now just works. I didn't test very aggressively, but I would optimistically call this issue fixed. Thanks a lot, Antoine, for whatever you did. (And, as usual, for the vast amount of work you did building this excellent system in the first place!)


Fri, 31 Mar 2017 07:42:12 GMT - Antoine Martin: status changed; resolution set

Thanks!


Mon, 26 Mar 2018 05:56:44 GMT - Philip D Loewen: status changed; resolution deleted

These symptoms re-appeared with considerable intensity about a month ago, but I was too busy to report. The only new item in the logs is this:

2018-03-25 20:25:04,003 Missing property or wrong property type WM_TRANSIENT_FOR (window)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/xpra/x11/gtk_x11/prop.py", line 153, in prop_get
    data = X11Window.XGetWindowProperty(get_xwindow(target), key, atom, etype)
  File "xpra/x11/bindings/window_bindings.pyx", line 915, in xpra.x11.bindings.window_bindings._X11WindowBindings.XGetWindowProperty (xpra/x11/bindings/window_bindings.c:9883)
PropertyError: no such window
2018-03-25 20:32:05,044 Missing property or wrong property type WM_TRANSIENT_FOR (window)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/xpra/x11/gtk_x11/prop.py", line 153, in prop_get
    data = X11Window.XGetWindowProperty(get_xwindow(target), key, atom, etype)
  File "xpra/x11/bindings/window_bindings.pyx", line 915, in xpra.x11.bindings.window_bindings._X11WindowBindings.XGetWindowProperty (xpra/x11/bindings/window_bindings.c:9883)
PropertyError: no such window
The program 'Xpra' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadWindow (invalid Window parameter)'.
  (Details: serial 393361 error_code 3 request_code 3 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)

This is Ubuntu 16.04.4 LTS ... counting down to the 18.04 LTS upgrade!


Mon, 26 Mar 2018 16:39:07 GMT - Antoine Martin: owner, status changed

It is harder to reproduce, but I am seeing the bug. I was hoping that the missing wrapping added in r18865 would fix this, but it doesn't. I am getting a fairly reliable SIGBUS doing memcpy-avc-unaligned from mmap_write which is running in the encode thread. As if the XShmImageWrapper buffer we copy from was no longer valid.


Mon, 26 Mar 2018 17:16:16 GMT - Antoine Martin: owner changed

Philip D Loewen: the SIGBUS may not be related, so it is worth applying r18865 and turning off mmap if you hit it. I am now unable to reproduce the crash - but I'll keep trying.


Tue, 27 Mar 2018 11:45:19 GMT - Antoine Martin:

If you can still reproduce the crash, please run the server with:

XPRA_X_SYNC=1 xpra start -d all

And post the last ~100 lines of debug output.


Instead of manually inspecting all the places where we can inadvertently end up calling an X11 function, r18869 adds a context checker in all the X11 functions that can raise an error and fixes some naked calls detected that way (some backported: r18874 for v2.2.x and r18875 for v1.0.x). Nothing major was found but this should prevent future bugs as any unmanaged calls will now trigger a scary error message. The runtime cost should be minimal.

PS: more in r18894, r18921, r18910, r18905, r18932. See also #1798


Tue, 03 Apr 2018 05:51:21 GMT - Philip D Loewen: cc set

Sorry for going quiet -- I had come disconnected from this ticket and was not aware that progress was being made. I'll pay special attention to the performance of version 2.2.6-r18969-1 over the next few days and report back on what I observe. Thanks a lot!


Sat, 07 Apr 2018 04:36:29 GMT - Antoine Martin:

New since 2.2.6: r19002, r19055, r19125.


Sun, 22 Apr 2018 06:30:44 GMT - Antoine Martin: status changed; resolution set

Feel free to re-open if you can reproduce the problems or if you see one of those scary warnings. (and please include the stacktrace that goes with it)


Sat, 23 Jan 2021 05:25:17 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/1475