Xpra: Ticket #208: xpra crashes when closing a window in SAS -- possible race

I've come across a bug in xpra 0.7.2 running on SLES 11. When I close a window in SAS, the xpra server always crashes immediately. (I can recover my session by running xpra upgrade and then reconnecting.) I looked into this a little bit, and it seems like it may be a race condition: At one point, after having added a number of debugging statements, the bug disappeared, but then reappeared later with the exact same set of debugging statements in.

The crash occurs in wimpiggy/window.py::composite_configure_event(line 722) when self.client_window.get_geometry() is called. For some reason this spits back a BadDrawable XError. Here's the tail end of the log (run with XPRA_SYNCHRONIZE=1 GDK_SYNCHRONIZE=1 XPRA_X11_DEBUG=1 and -d all):

2012-11-09 12:50:16,631 x_event_filter event=('wimpiggy-configure-event', None)/ConfigureNotify
2012-11-09 12:50:16,631 ConfigureNotify event received
2012-11-09 12:50:16,631   event was delivered to window itself
2012-11-09 12:50:16,631   not forwarding to WindowModel handler, it has no wimpiggy-configure-event signal
2012-11-09 12:50:16,631   forwarding event to a CompositeHelper handler's wimpiggy-configure-event signal
2012-11-09 12:50:16,631 WindowModel.composite_configure_event(<CompositeHelper object at 0x7f7e90036c80 (wimpiggy+composite+CompositeHelper at 0xfc5940)>, <AdHocStruct object, contents: {'delivered_to': <gtk.gdk.Window object at 0x7f7e90036e10 (GdkWindow at 0x7f7e90050bf0)>, 'send_event': 0, 'height': 597, 'width': 1106, 'window': <gtk.gdk.Window object at 0x7f7e90036e10 (GdkWindow at 0x7f7e90050bf0)>, 'y': 0, 'x': 0, 'border_width': 0, 'type': 22, 'display': <gtk.gdk.Display object at 0xb8e3c0 (GdkDisplayX11 at 0xc1c200)>}>)
2012-11-09 12:50:16,632 composite_configure_event, client_windows = <gtk.gdk.Window object at 0x7f7e90036e10 (GdkWindow at 0x7f7e90050bf0)>
The program 'xpra' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadDrawable (invalid Pixmap or Window parameter)'.
  (Details: serial 8408 error_code 9 request_code 14 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)

The whole log is attached.

Fri, 09 Nov 2012 17:56:25 GMT - alis: attachment set

xpra logfile

Fri, 09 Nov 2012 17:57:38 GMT - Antoine Martin: status, description changed

Any chance you could also give us a gdb backtrace? SLES 11 is a bit old, so you are probably using an old Xvfb, not sure if that's relevant yet, but the BadDrawable here is odd.

Fri, 09 Nov 2012 20:03:47 GMT - alis:

Sure. The Xvfb is from package xorg-x11-server-extra-7.4- , built 6/29/2012, if that helps.

Here's the backtrace from gdb. Let me know if you need to see anything else...

(gdb) break _XError
Breakpoint 1 at 0x7fe8f8bf6c30: file XlibInt.c, line 2879.
(gdb) c
Breakpoint 1, _XError (dpy=0xc227f0, rep=0xfbacb0) at XlibInt.c:2879
2879	{
(gdb) bt
#0  _XError (dpy=0xc227f0, rep=0xfbacb0) at XlibInt.c:2879
#1  0x00007fe8f8bfebb8 in _XReply (dpy=0xc227f0, rep=0x7fff124366f4, extra=0, discard=1) at xcb_io.c:419
#2  0x00007fe8f8bda1cf in XGetGeometry (dpy=0xc227f0, d=14682128, root=0x7fff124367b0, x=0x7fff124367cc, y=0x7fff124367c8, width=0x7fff124367c4, height=0x7fff124367c0,
    borderWidth=0x7fff124367bc, depth=0x7fff124367b8) at GetGeom.c:49
#3  0x00007fe8f78b99e1 in gdk_window_x11_get_geometry (window=<optimized out>, x=0x7fff1243685c, y=0x7fff12436858, width=0x7fff12436854, height=0x7fff12436850, depth=
    0x7fff1243684c) at gdkwindow-x11.c:2785
#4  0x00007fe8f788a46c in IA__gdk_window_get_geometry (window=0x7fe8e80505e0 [GdkWindow], x=0x7fff1243685c, y=0x7fff12436858, width=0x7fff12436854, height=0x7fff12436850,
    depth=0x7fff1243684c) at gdkwindow.c:7955
#5  0x00007fe8f82b80ba in _wrap_gdk_window_get_geometry (self=0x7fe8e8037690) at gdkwindow.override:404
#6  0x00007fe901a71b91 in call_function (oparg=<optimized out>, pp_stack=<optimized out>) at Python/ceval.c:3734
#7  PyEval_EvalFrameEx (f=0x14a4ae0, throwflag=<optimized out>) at Python/ceval.c:2412
#8  0x00007fe901a76902 in PyEval_EvalCodeEx (co=0xd0f4e0, globals=<optimized out>, locals=<optimized out>, args=0x7fe8e802ee78, argcount=12732984, kws=0x0, kwcount=0, defs=
    0x0, defcount=0, closure=0x0) at Python/ceval.c:3000
#9  0x00007fe901a41888 in function_call (func=0xedf050, arg=0x7fe8e802ee60, kw=0x0) at Objects/funcobject.c:524
#10 0x00007fe901a34159 in PyObject_Call (func=0xedf050, arg=0x7fe8e802ee60, kw=0x0) at Objects/abstract.c:2492
#11 0x00007fe901a371f7 in instancemethod_call (func=0xedf050, arg=0x7fe8e802ee60, kw=0x0) at Objects/classobject.c:2579
#12 0x00007fe901a34159 in PyObject_Call (func=0xf49fa0, arg=0x7fe8e8022098, kw=0x0) at Objects/abstract.c:2492
#13 0x00007fe901a70118 in PyEval_CallObjectWithKeywords (func=0xf49fa0, arg=0x7fe8e8022098, kw=0x0) at Python/ceval.c:3619
#14 0x00007fe8fd9e8685 in pyg_closure_marshal (closure=0x2c4db70, return_value=0x7fff124370d0, n_param_values=2, param_values=0xde1210, invocation_hint=<optimized out>,
    marshal_data=<optimized out>) at pygtype.c:1118
#15 0x00007fe8fe501d1d in IA__g_closure_invoke (closure=0x2c4db70, return_value=0x7fff124370d0, n_param_values=2, param_values=0xde1210, invocation_hint=0x7fff12436fe0)
    at gclosure.c:767
#16 0x00007fe8fe51766b in signal_emit_unlocked_R (node=0xcc2590, detail=0, instance=0xd41960, emission_return=0x7fff124370d0, instance_and_params=0xde1210) at gsignal.c:3247
#17 0x00007fe8fd9e02aa in pygobject_emit (self=<optimized out>, args=0x7fe8e8024908) at pygobject.c:1736
#18 0x00007fe901a34159 in PyObject_Call (func=0xf76098, arg=0x7fe8e8024908, kw=0x0) at Objects/abstract.c:2492
#19 0x00007fe8f29e4a3d in __pyx_pf_8wimpiggy_8lowlevel_8bindings_154_maybe_send_event (__pyx_v_event=<optimized out>, __pyx_v_signal=<optimized out>,
    __pyx_v_window=<optimized out>, __pyx_self=<optimized out>) at wimpiggy/lowlevel/bindings.c:22765
#20 __pyx_pw_8wimpiggy_8lowlevel_8bindings_155_maybe_send_event (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at wimpiggy/lowlevel/bindings.c:22601
#21 0x00007fe901a34159 in PyObject_Call (func=0xc825a8, arg=0x7fe8e802eaa0, kw=0x0) at Objects/abstract.c:2492
#22 0x00007fe8f29f5d66 in __pyx_pf_8wimpiggy_8lowlevel_8bindings_156_route_event (__pyx_v_parent_signal=<optimized out>, __pyx_v_signal=<optimized out>,
    __pyx_v_event=<optimized out>, __pyx_self=<optimized out>) at wimpiggy/lowlevel/bindings.c:23095
#23 __pyx_pw_8wimpiggy_8lowlevel_8bindings_157_route_event (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at wimpiggy/lowlevel/bindings.c:22929
#24 0x00007fe901a34159 in PyObject_Call (func=0xc825f0, arg=0xf77a00, kw=0x0) at Objects/abstract.c:2492
#25 0x00007fe8f29eb5e4 in __pyx_f_8wimpiggy_8lowlevel_8bindings_x_event_filter (__pyx_v_e_gdk=0x7fff124377d0, __pyx_v_gdk_event=<optimized out>,
    __pyx_v_userdata=<optimized out>) at wimpiggy/lowlevel/bindings.c:26359
#26 0x00007fe8f78a6bd1 in gdk_event_apply_filters (filters=<optimized out>, event=<optimized out>, xevent=<optimized out>) at gdkevents-x11.c:351
#27 gdk_event_translate (display=0xc17200 [GdkDisplayX11], event=0xfb69e0, xevent=0x7fff124377d0, return_exposes=0) at gdkevents-x11.c:923
#28 0x00007fe8f78a85d6 in _gdk_events_queue (display=0xc17200 [GdkDisplayX11]) at gdkevents-x11.c:2306
#29 0x00007fe8f78a8a1e in gdk_event_dispatch (source=<optimized out>, callback=0xfbacb0, user_data=0x1dd8) at gdkevents-x11.c:2367
#30 0x00007fe8fde5d60a in g_main_dispatch (context=<optimized out>) at gmain.c:1960
#31 IA__g_main_context_dispatch (context=0xc21070) at gmain.c:2513
#32 0x00007fe8fde60e88 in g_main_context_iterate (context=0xc21070, block=1, dispatch=1, self=<optimized out>) at gmain.c:2591
#33 0x00007fe8fde6135d in IA__g_main_loop_run (loop=0xe8f580) at gmain.c:2799
#34 0x00007fe8f7c36927 in IA__gtk_main () at gtkmain.c:1218
#35 0x00007fe8f829a30a in _wrap_gtk_main (self=<optimized out>) at ./gtk.override:1241
#36 0x00007fe901a71b91 in call_function (oparg=<optimized out>, pp_stack=<optimized out>) at Python/ceval.c:3734
#37 PyEval_EvalFrameEx (f=0xebbdc0, throwflag=<optimized out>) at Python/ceval.c:2412
#38 0x00007fe901a718dc in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at Python/ceval.c:3836
#39 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at Python/ceval.c:3771
#40 PyEval_EvalFrameEx (f=0x9310f0, throwflag=<optimized out>) at Python/ceval.c:2412
#41 0x00007fe901a76b07 in PyEval_EvalCodeEx (co=0x8cd648, globals=<optimized out>, locals=<optimized out>, args=0x92dcc8, argcount=12732984, kws=0x68f768, kwcount=0, defs=
    0x0, defcount=0, closure=0x0) at Python/ceval.c:3000
#42 0x00007fe901a70dd3 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at Python/ceval.c:3846
#43 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at Python/ceval.c:3771
#44 PyEval_EvalFrameEx (f=0x68f4c0, throwflag=<optimized out>) at Python/ceval.c:2412
#45 0x00007fe901a76b07 in PyEval_EvalCodeEx (co=0x7fe901e3c5d0, globals=<optimized out>, locals=<optimized out>, args=0x68f8c0, argcount=12732984, kws=0x68e2c8, kwcount=0,
    defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3000
#46 0x00007fe901a70dd3 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at Python/ceval.c:3846
#47 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at Python/ceval.c:3771
#48 PyEval_EvalFrameEx (f=0x68e130, throwflag=<optimized out>) at Python/ceval.c:2412
#49 0x00007fe901a76b07 in PyEval_EvalCodeEx (co=0x7fe901e303f0, globals=<optimized out>, locals=<optimized out>, args=0x0, argcount=12732984, kws=0x0, kwcount=0, defs=0x0,
    defcount=0, closure=0x0) at Python/ceval.c:3000
---Type <return> to continue, or q <return> to quit---
#50 0x00007fe901a6ff22 in PyEval_EvalCode (co=0xc227f0, globals=0xfbacb0, locals=0x1dd8) at Python/ceval.c:541
#51 0x00007fe901a8b5d3 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x63d120, locals=0x63d120, flags=<optimized out>, arena=<optimized out>)
    at Python/pythonrun.c:1358
#52 0x00007fe901a8b68a in PyRun_FileExFlags (fp=0x63d340, filename=0x7fff1243a533 "/usr/bin/xpra", start=257, globals=0x63d120, locals=0x63d120, closeit=1, flags=
    0x7fff124385a0) at Python/pythonrun.c:1344
#53 0x00007fe901a8c094 in PyRun_SimpleFileExFlags (fp=0x63d340, filename=0x7fff1243a533 "/usr/bin/xpra", closeit=1, flags=0x7fff124385a0) at Python/pythonrun.c:948
#54 0x00007fe901a9381b in Py_Main (argc=<optimized out>, argv=0x7fff124386b8) at Modules/main.c:617
#55 0x00007fe900dd6bc6 in __libc_start_main () from /lib64/libc.so.6
#56 0x00000000004006e9 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) c
[Thread 0x7fe8ef6b4700 (LWP 28349) exited]
[Thread 0x7fe8f06b6700 (LWP 28347) exited]
[Thread 0x7fe8eeeb3700 (LWP 28350) exited]
[Thread 0x7fe8efeb5700 (LWP 28348) exited]
[Inferior 1 (process 28299) exited with code 01]

Sat, 10 Nov 2012 06:04:51 GMT - Antoine Martin:

Strange, the crash happens straight from the gtk main loop, which calls get_geometry - similar to #167 and #71. I've actually committed a geometry fix recently (both in trunk and 0.7.x branch), I doubt it will fix the problem but it is worth a try.


Can you post the full log? (or the last few hundred lines before the crash will do)

Sat, 10 Nov 2012 15:17:17 GMT - Antoine Martin:

Also, please try with r2089 - may help when we encounter invalid windows/errors as we now try harder to cleanup properly.

Mon, 12 Nov 2012 14:35:17 GMT - alis:

I tried it again with the latest trunk, r2107, and got the same error.

I'm attaching the full log. Both this log and the earlier gdb backtrace were from an xpra server run via

XPRA_SYNCHRONIZE=1 XPRA_X11_DEBUG=1 GDK_SYNCHRONIZE=1 xpra --no-bell -d all upgrade :100

Mon, 12 Nov 2012 14:36:35 GMT - alis: attachment set

xpra logfile on r2107

Mon, 12 Nov 2012 18:03:48 GMT - alis: attachment set

Log of the same sequence of events on xpra version 0.3.2, which was the last version I used which did not have the bug. The window close which leads to a crash on later versions occurs after line 4900.

Mon, 12 Nov 2012 18:06:17 GMT - alis:

I added a logfile of the same situation from xpra 0.3.2, which is the last version I used that did not have this bug. (I tried 0.4.8, which did have the bug.)

The relevant portion of the logfile is after line 4900.

Tue, 13 Nov 2012 02:58:55 GMT - Antoine Martin:

Hah, I didn't realize this was a regression until now.

Any chance you could bisect it? Or at least, taking a very rough cut, just testing 0.4.0 and 0.3.x-latest would give us a good idea when the problematic change was made.

Note: the "guilty" commit may not necessarily be the real source of the problem if, as you suggested, this problem is indeed a race. It may just make it more likely to hit that race..

Tue, 13 Nov 2012 04:26:41 GMT - Antoine Martin:

Here is the human-readable/sanitized sequence of events with 0.3.2:

And with 0.7.x, this is what we see:

From the changelog of v0.3.3 onwards and 0.4.0, the suspicious items are:

Also, in 0.3 you were using rgb24 encoding whereas 0.7.x uses x264, does it make any difference if you switch encoding (with either version)?

Tue, 13 Nov 2012 07:12:13 GMT - Antoine Martin: attachment set

delays the resizing of the corral window until later, giving more time for the unmap event to reach us

Tue, 13 Nov 2012 07:16:41 GMT - Antoine Martin:

I think what was missing was the invalidating named pixmap, this should be fixed in r2113 and was introduced by commit r1089 for #107. Please confirm so I can backport this to all branches.

If you have the time, could you also try the "resize-later" patch above, it may saves us some unnecessary processing and bandwidth usage. (with logs so I can verify)

Tue, 13 Nov 2012 16:00:14 GMT - alis:

Hmm. I tried r2124, and the bug is still there, unfortunately. I also tried switching to rgb24 encoding by selecting it from the system tray icon; that again made no difference.

I went back and searched through the various older versions to find where the regression appeared. In the releases, the regression first appears in 0.4.0; 0.3.11 is OK.

In the trunk, the bug first appears for me with revision r1089 . r1088 is fine. I've attached logs of the triggering action with these two revisions. On r1088, the successful window close occurs between log lines 4118-4556. On r1089, the window close & subsequent crash occur after log line 4150.

Tue, 13 Nov 2012 16:01:02 GMT - alis: attachment set

Lines 4118-4556 show no crash on window close.

Tue, 13 Nov 2012 16:01:23 GMT - alis: attachment set

Lines 4150+ show crash on window close.

Tue, 13 Nov 2012 16:23:32 GMT - Antoine Martin:

Thanks for the effort! OK, just to make sure we know where the bug fires from, can you confirm that the crashes go away if you remove this line:

--- src/wimpiggy/window.py	(revision 2116)
+++ src/wimpiggy/window.py	(working copy)
@@ -292,7 +292,7 @@
     def setup(self):
         h = self._composite.connect("contents-changed", self._forward_contents_changed)
-        self._composite.connect("wimpiggy-configure-event", self.composite_configure_event)
+        #self._composite.connect("wimpiggy-configure-event", self.composite_configure_event)
         self._damage_forward_handle = h
         self._geometry = geometry_with_border(self.client_window)

Tue, 13 Nov 2012 16:50:39 GMT - Antoine Martin: attachment set

this patch should trap errors during our attempts to resize and fail more cleanly

Tue, 13 Nov 2012 16:51:55 GMT - Antoine Martin:

Assuming that the change in comment:11 does solve the issue, can you try the patch above to see if the crash then becomes a more harmless stacktrace? (undoing comment:11 first obviously)

Tue, 13 Nov 2012 17:28:18 GMT - alis:

Yup, disabling the composite_configure_event handler in comment:11 stops the error from occuring. And the windows-configure-gone.patch prevents the crash, with the expected warning in the log:

2012-11-13 12:11:25,284 x_event_filter event=None/Expose
2012-11-13 12:11:25,284 x_event_filter event=None/LeaveNotify
2012-11-13 12:11:25,284 x_event_filter event=None/EnterNotify
2012-11-13 12:11:25,284 x_event_filter event=('wimpiggy-destroy-event', None)/DestroyNotify
2012-11-13 12:11:25,284 DestroyNotify event received
2012-11-13 12:11:25,285   delivering event to window itself
2012-11-13 12:11:25,285   forwarding event to a WindowModel handler's wimpiggy-destroy-event signal
2012-11-13 12:11:25,285   forwarded
2012-11-13 12:11:25,285 x_event_filter event=('wimpiggy-property-notify-event', None)/PropertyNotify
2012-11-13 12:11:25,285 cannot get gdk window for <gtk.gdk.Display object at 0x7fbb8b0ca2d0 (GdkDisplayX11 at 0x9dd200)> : 20971922
2012-11-13 12:11:25,285 _call(True,<built-in function get_pywindow>,(<gtk.gdk.Display object at 0x7fbb8b0ca2d0 (GdkDisplayX11 at 0x9dd200)>, 20971922L),{}) XError: 3
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/wimpiggy/error.py", line 109, in _call
    value = fun(*args, **kwargs)
  File "bindings.pyx", line 418, in wimpiggy.lowlevel.bindings.get_pywindow (wimpiggy/lowlevel/bindings.c:3278)
XError: XError: 3
2012-11-13 12:11:25,285 Some window in our event disappeared before we could handle the event 28/PropertyNotify using ('wimpiggy-property-notify-event', None); so I'm just ignoring it instead. python event=<AdHocStruct object, contents: {'send_event': 0, 'type': 28, 'display': <gtk.gdk.Display object at 0x7fbb8b0ca2d0 (GdkDisplayX11 at 0x9dd200)>}>
2012-11-13 12:11:25,285 x_event_filter event=('wimpiggy-client-message-event', None)/ClientMessage

Thanks! (That's the latest trunk patched, BTW.)

Just out of curiosity, I take it the problem is that we are receiving a PropertyNotify event for the window after we had already received and acted on the destroy-event for it. Do you know why that was happening here so regularly but apparently not normally elsewhere? Do you think the application (SAS) is doing something weird, like hiding/resizing and closing the window in immediate succesion or in the wrong order? What is that PropertyNotify event trying to do?

Wed, 14 Nov 2012 03:21:33 GMT - Antoine Martin: status changed; resolution set

merged a slightly different version in r2127 (and applied to previous stable branches in r2130), we call the resize code (resize_corral_window) via idle_add so this should give the DestroyNotify a chance to be processed first and avoid the warning in the logs.

The problem is not the PropertyNotify, in fact iirc I had seen this sort of behaviour before (scrubbing all window properties before destroying the window - redundant but harmless) and added some code to avoid spitting out errors in the logs, the problem here is the ConfigureNotify event which fires on one window and acts on another.. (keeping window dimensions in sync in an asynchronous x11 world)

Sat, 23 Jan 2021 04:48:27 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/208