Xpra: Ticket #55: timing/race GIL deadlock bug in XSendEvent or socket code

This one is hard to reproduce, but then again sometimes not. I don't think this is a new bug, but the new python-xmodmap code and/or threading make it easier to trigger.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f4caa14e700 (LWP 17887)]
0x000000321cc0eb0c in __libc_send (fd=<optimized out>, buf=<optimized out>, n=<optimized out>, flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:33
33					   (size_t) 0);
(gdb) bt
#0  0x000000321cc0eb0c in __libc_send (fd=<optimized out>, buf=<optimized out>, n=<optimized out>, flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:33
#1  0x00007f4cb0dc94c2 in sock_send (s=0x1a0d9c0, args=<optimized out>) at /usr/src/debug/Python-2.7.2/Modules/socketmodule.c:2725
#2  0x000000322d4dffbb in call_function (oparg=<optimized out>, pp_stack=0x7f4caa14d3a8) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4090
#3  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#4  0x000000322d4e0580 in fast_function (nk=<optimized out>, na=2, n=<optimized out>, pp_stack=0x7f4caa14d4e8, func=<function at remote 0x7f4caf01d320>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4176
#5  call_function (oparg=<optimized out>, pp_stack=0x7f4caa14d4e8) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4111
#6  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#7  0x000000322d4e15a5 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kws=0x7f4cb752e068, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /usr/src/debug/Python-2.7.2/Python/ceval.c:3330
#8  0x000000322d46dd23 in function_call (func=<function at remote 0x7f4caf01d9b0>, arg=
    (<Protocol(_read_queue=<Queue(unfinished_tasks=92, queue=<collections.deque at remote 0x1a41980>, maxsize=5, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5db0>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5db0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x1ad5db0>) at remote 0x1acbfd0>, mutex=<thread.lock at remote 0x1ad5db0>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5db0>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5db0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x1ad5db0>) at remote 0x1acbc90>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5db0>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5db0>, _Condition__waiters=[<thread.lock at remote 0x1ad55f0>], release=<built-in method release of thr...(truncated), kw={}) at /usr/src/debug/Python-2.7.2/Objects/funcobject.c:526
#9  0x000000322d449193 in PyObject_Call (func=<function at remote 0x7f4caf01d9b0>, arg=<optimized out>, kw=<optimized out>) at /usr/src/debug/Python-2.7.2/Objects/abstract.c:2529
#10 0x000000322d4dc31f in ext_do_call (nk=0, na=<optimized out>, flags=<optimized out>, pp_stack=0x7f4caa14d7a8, func=<function at remote 0x7f4caf01d9b0>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4403
#11 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2779
#12 0x000000322d4e0580 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7f4caa14d8e8, func=<function at remote 0x7f4cafd3a050>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4176
#13 call_function (oparg=<optimized out>, pp_stack=0x7f4caa14d8e8) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4111
#14 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#15 0x000000322d4e0580 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7f4caa14da28, func=<function at remote 0x7f4cafd3a1b8>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4176
#16 call_function (oparg=<optimized out>, pp_stack=0x7f4caa14da28) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4111
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#18 0x000000322d4e15a5 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /usr/src/debug/Python-2.7.2/Python/ceval.c:3330
#19 0x000000322d46dc2c in function_call (func=<function at remote 0x7f4cafd3a0c8>, arg=
    (<Thread(_Thread__ident=139967247738624, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5bb0>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5bb0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x1ad5bb0>) at remote 0x1acb350>, _Thread__name='Thread-61', _Thread__daemonic=True, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5d70>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5d70>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x1ad5d70>) at remote 0x1acb390>) at remote 0x1acb710>, _Thread__stderr=<file at remote 0x7f4cb7546270>, _Thread__target=<instancemethod at remote 0x1a24cd0>, _Thread__kwargs={}, _Verbose__verbose=False, _Thread__args=(), _Thread__stopped=False, _Thread__initialized=True) at remote 0x1acb11...(truncated), kw=0x0) at /usr/src/debug/Python-2.7.2/Objects/funcobject.c:526
#20 0x000000322d449193 in PyObject_Call (func=<function at remote 0x7f4cafd3a0c8>, arg=<optimized out>, kw=<optimized out>) at /usr/src/debug/Python-2.7.2/Objects/abstract.c:2529
#21 0x000000322d457dff in instancemethod_call (func=<function at remote 0x7f4cafd3a0c8>, arg=
    (<Thread(_Thread__ident=139967247738624, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5bb0>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5bb0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x1ad5bb0>) at remote 0x1acb350>, _Thread__name='Thread-61', _Thread__daemonic=True, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x1ad5d70>, acquire=<built-in method acquire of thread.lock object at remote 0x1ad5d70>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x1ad5d70>) at remote 0x1acb390>) at remote 0x1acb710>, _Thread__stderr=<file at remote 0x7f4cb7546270>, _Thread__target=<instancemethod at remote 0x1a24cd0>, _Thread__kwargs={}, _Verbose__verbose=False, _Thread__args=(), _Thread__stopped=False, _Thread__initialized=True) at remote 0x1acb11...(truncated), kw=0x0)
    at /usr/src/debug/Python-2.7.2/Objects/classobject.c:2578
#22 0x000000322d449193 in PyObject_Call (func=<instancemethod at remote 0x1a24f00>, arg=<optimized out>, kw=<optimized out>) at /usr/src/debug/Python-2.7.2/Objects/abstract.c:2529
#23 0x000000322d4da457 in PyEval_CallObjectWithKeywords (func=<instancemethod at remote 0x1a24f00>, arg=(), kw=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:3959
#24 0x000000322d510362 in t_bootstrap (boot_raw=0x1a4ed20) at /usr/src/debug/Python-2.7.2/Modules/threadmodule.c:614
#25 0x000000321cc07d90 in start_thread (arg=0x7f4caa14e700) at pthread_create.c:309
#26 0x000000321c8eed0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

And another one:

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x000000322d50c3a5 in PyThread_acquire_lock (lock=0x1cbabe0, waitflag=1) at /usr/src/debug/Python-2.7.2/Python/thread_pthread.h:321
#2  0x000000322d4d9de4 in PyEval_RestoreThread (tstate=0x1ae70a0) at /usr/src/debug/Python-2.7.2/Python/ceval.c:379
#3  0x000000322d4fb4e8 in PyGILState_Ensure () at /usr/src/debug/Python-2.7.2/Python/pystate.c:626
#4  0x00007ff24aa42404 in ?? () from /usr/lib64/python2.7/site-packages/gtk-2.0/gtk/_gtk.so
#5  0x000000321e844754 in g_main_context_check () from /lib64/libglib-2.0.so.0
#6  0x000000321e844fa2 in ?? () from /lib64/libglib-2.0.so.0
#7  0x000000321e8457c5 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#8  0x00000037f694bbd7 in gtk_main () from /usr/lib64/libgtk-x11-2.0.so.0
#9  0x00007ff24aa57fa6 in ?? () from /usr/lib64/python2.7/site-packages/gtk-2.0/gtk/_gtk.so
#10 0x000000322d4dfb13 in call_function (oparg=<optimized out>, pp_stack=0x7fff1aed70f8) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4074
#11 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#12 0x000000322d4e0580 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fff1aed7238, func=<function at remote 0x2119c80>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4176
#13 call_function (oparg=<optimized out>, pp_stack=0x7fff1aed7238) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4111
#14 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#15 0x000000322d4e15a5 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=5, kws=0x1d13910, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /usr/src/debug/Python-2.7.2/Python/ceval.c:3330
#16 0x000000322d4dfadb in fast_function (nk=<optimized out>, na=5, n=<optimized out>, pp_stack=0x7fff1aed7428, func=<function at remote 0x7ff24b836de8>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4186
#17 call_function (oparg=<optimized out>, pp_stack=0x7fff1aed7428) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4111
#18 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#19 0x000000322d4e15a5 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x1cd0800, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /usr/src/debug/Python-2.7.2/Python/ceval.c:3330
#20 0x000000322d4dfadb in fast_function (nk=<optimized out>, na=2, n=<optimized out>, pp_stack=0x7fff1aed7618, func=<function at remote 0x7ff24b834f50>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4186
#21 call_function (oparg=<optimized out>, pp_stack=0x7fff1aed7618) at /usr/src/debug/Python-2.7.2/Python/ceval.c:4111
#22 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:2740
#23 0x000000322d4e15a5 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at /usr/src/debug/Python-2.7.2/Python/ceval.c:3330
#24 0x000000322d4e16d2 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/ceval.c:689
#25 0x000000322d4fb9ec in run_mod (mod=<optimized out>, filename=<optimized out>, globals=
    {'GPollableInputStream': <type at remote 0x1d538b0>, '__builtins__': <module at remote 0x7ff2547d9ad0>, '__file__': '/usr/bin/xpra', 'xpra': <module at remote 0x7ff254713670>, '__package__': None, 'sys': <module at remote 0x7ff2547d9b78>, 'GInitiallyUnowned': <GObjectMeta(__gtype__=<gobject.GType at remote 0x7ff24a68c810>, __doc__=<gobject.GObject.__doc__ at remote 0x7ff25479c0d0>, __module__='gtk') at remote 0x1e2f5c0>, '__name__': '__main__', 'GPollableOutputStream': <type at remote 0x1d82690>, '__doc__': None}, locals=
    {'GPollableInputStream': <type at remote 0x1d538b0>, '__builtins__': <module at remote 0x7ff2547d9ad0>, '__file__': '/usr/bin/xpra', 'xpra': <module at remote 0x7ff254713670>, '__package__': None, 'sys': <module at remote 0x7ff2547d9b78>, 'GInitiallyUnowned': <GObjectMeta(__gtype__=<gobject.GType at remote 0x7ff24a68c810>, __doc__=<gobject.GObject.__doc__ at remote 0x7ff25479c0d0>, __module__='gtk') at remote 0x1e2f5c0>, '__name__': '__main__', 'GPollableOutputStream': <type at remote 0x1d82690>, '__doc__': None}, flags=<optimized out>,
    arena=<optimized out>) at /usr/src/debug/Python-2.7.2/Python/pythonrun.c:1354
#26 0x000000322d4fc7f0 in PyRun_FileExFlags (fp=0x1b88b60, filename=0x7fff1aed83d3 "/usr/bin/xpra", start=<optimized out>, globals=
    {'GPollableInputStream': <type at remote 0x1d538b0>, '__builtins__': <module at remote 0x7ff2547d9ad0>, '__file__': '/usr/bin/xpra', 'xpra': <module at remote 0x7ff254713670>, '__package__': None, 'sys': <module at remote 0x7ff2547d9b78>, 'GInitiallyUnowned': <GObjectMeta(__gtype__=<gobject.GType at remote 0x7ff24a68c810>, __doc__=<gobject.GObject.__doc__ at remote 0x7ff25479c0d0>, __module__='gtk') at remote 0x1e2f5c0>, '__name__': '__main__', 'GPollableOutputStream': <type at remote 0x1d82690>, '__doc__': None}, locals=
    {'GPollableInputStream': <type at remote 0x1d538b0>, '__builtins__': <module at remote 0x7ff2547d9ad0>, '__file__': '/usr/bin/xpra', 'xpra': <module at remote 0x7ff254713670>, '__package__': None, 'sys': <module at remote 0x7ff2547d9b78>, 'GInitiallyUnowned': <GObjectMeta(__gtype__=<gobject.GType at remote 0x7ff24a68c810>, __doc__=<gobject.GObject.__doc__ at remote 0x7ff25479c0d0>, __module__='gtk') at remote 0x1e2f5c0>, '__name__': '__main__', 'GPollableOutputStream': <type at remote 0x1d82690>, '__doc__': None}, closeit=1, flags=
    0x7fff1aed7940) at /usr/src/debug/Python-2.7.2/Python/pythonrun.c:1340
#27 0x000000322d4fd26f in PyRun_SimpleFileExFlags (fp=0x1b88b60, filename=0x7fff1aed83d3 "/usr/bin/xpra", closeit=1, flags=0x7fff1aed7940) at /usr/src/debug/Python-2.7.2/Python/pythonrun.c:944
#28 0x000000322d50e745 in Py_Main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/Python-2.7.2/Modules/main.c:599
#29 0x000000321c82169d in __libc_start_main (main=0x400620 <main>, argc=11, ubp_av=0x7fff1aed7a68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff1aed7a58) at libc-start.c:226
#30 0x0000000000400651 in _start ()

Generally the server will recover after some timeout... but not always.

This may be related to this error which crashes the server occasionally (does having two keys pressed make any difference?)

xpra client disconnected.
clearing keys pressed: {50: 'Shift_L', 37: 'Control_L'}
cancelling key repeat timer: 4985
cancelling key repeat timer: 4983
_focus(0,[]) has_focus=1
widget with focus: None
Focus -> world window
sending WM_TAKE_FOCUS: <gtk.gdk.Window object at 0x274ca00 (GdkWindow at 0x225aa20)>, 7184111L
python: xcb_io.c:221: poll_for_event: Assertion `(((long) (event_sequence) - (long) (dpy->request)) <= 0)' failed.
sending message to 0x40001fL
Aborted

Sometimes accompanied with this and sometimes not:

xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :7.

It is always after a "sending message" which is the XSendEvent X11 code, one use is the take-focus and the other is the clipboard (so simply disabling the clipboard does not prevent the crash) Removing the send_wm_take_focus call does not prevent the crash, _take_focus and looks dodgy...

Note not swallowing errors here, this should always succeed

Maybe it doesn't?

The poll_for_event doc reads:

/* Thread-safety rules:
 *
 * At most one thread can be reading from XCB's event queue at a time.
 * If you are not the current event-reading thread and you need to find
 * out if an event is available, you must wait.
 *
 * The same rule applies for reading replies.
 *
 * A single thread cannot be both the the event-reading and the
 * reply-reading thread at the same time.
 *
 * We always look at both the current event and the first pending reply
 * to decide which to process next.
 *
 * We always process all responses in sequence-number order, which may
 * mean waiting for another thread (either the event_waiter or the
 * reply_waiter) to handle an earlier response before we can process or
 * return a later one. If so, we wait on the corresponding condition
 * variable for that thread to process the response and wake us up.
 */

So maybe the gtk main loop is getting messed up by our X11 calls? Strange because this is all done from the main loop, which should be fine?

Also found this interesting chestnut in _XReply which calls poll_for_event:

/* If some thread is already waiting for events,
 * it will get the first one. That thread must
 * process that event before we can continue. */
/* FIXME: That event might be after this reply,
 * and might never even come--or there might be
 * multiple threads trying to get events. */


Tue, 06 Dec 2011 16:07:36 GMT - Antoine Martin: owner, status, description changed


Tue, 06 Dec 2011 16:51:42 GMT - Antoine Martin:

r342 fixes an invalid threaded call from the network code during "connection_lost" events.

This seems to fix at least one of the crashes which I have managed to reproduce like so:

GDK_SYNCHRONIZE=1 xpra --bind-tcp=0.0.0.0:10001 --no-daemon --no-pulseaudio --session-name=Terminal start :7 -z0 -d all &
DISPLAY=:7 gnome-terminal &
xpra attach :7 -z0 --no-clipboard &

Then attach gdb to both processes to get a backtrace. Setup a timer to kill the xpra client (ie: sleep 5; kill ${PIDOFXPRACLIENT} ) And in the client window (the terminal), hold shift+control+alt When the disconnection happens (client gets killed), the server ends up calling _clear_keys_pressed which calls xtest_fake_key from a non-UI thread... The really weird thing is that the crash does not happen immediately, even with GDK_SYNCHRONIZE=1 but sometime later (sometimes much later) when the next client connects!

Will now manually inspect all the threading bits and maybe also try to add some code to detect when X11/gtk are used from the wrong context.


Tue, 06 Dec 2011 18:55:51 GMT - Antoine Martin: status changed; resolution set

The following code can be used to dump all threads on each ping_echo packet (so we can easily trigger it):

Then patch client and server:

Index: xpra/server.py
===================================================================
--- xpra/server.py	(revision 343)
+++ xpra/server.py	(working copy)
@@ -1253,6 +1253,8 @@
         self.client_load = (l1, l2, l3)
         self.server_latency = sl
         log("ping echo client load=%s, measured server latency=%s", self.client_load, sl)
+        from xpra.util import stacktraces
+        log.info("stacktraces:\n %s", "\n".join(stacktraces()))
     def _process_ping(self, proto, packet):
         assert self.can_ping
Index: xpra/client.py
===================================================================
--- xpra/client.py	(revision 334)
+++ xpra/client.py	(working copy)
@@ -716,6 +724,8 @@
         if cl>=0:
             self.client_latency.append(cl)
         log("ping echo server load=%s, measured client latency=%s", self.server_load, cl)
+        from xpra.util import stacktraces
+        log.info("stacktraces:\n %s", "\n".join(stacktraces()))
     def _process_ping(self, packet):
         assert self.can_ping

For reference, here are the threads we have on the client:

And on the server, the same ones as the client but also:

Note: also went on a wild goose chase earlier as the Xvfb left behind by a previous test run was in an inconsistent state and the pointer/keyboard would not respond to events properly!

Everything looks ok now... closing.


Mon, 20 Feb 2012 19:31:13 GMT - Antoine Martin: milestone changed; version set


Sat, 23 Jan 2021 04:44:19 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/55