Xpra: Ticket #475: server deadlocks when resizing fast updating window (ie: glxgears or mplayer)

It doesn't crash it just seems to be stuck:

(gdb) bt
#0  0x00000034528eb7fd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000003455409f42 in poll (__timeout=-1, __nfds=1, __fds=0x7fff3f1f6fb0) at /usr/include/bits/poll2.h:46
#2  _xcb_conn_wait (c=c@entry=0x1054260, cond=cond@entry=0x7fff3f1f7030, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:414
#3  0x000000345540b37f in wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:399
#4  0x000000345540b492 in xcb_wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:429
#5  0x0000003456041b47 in _XReply (dpy=dpy@entry=0xbe0c60, rep=rep@entry=0x7fff3f1f7140, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:602
#6  0x000000345603d76d in XSync (dpy=0xbe0c60, discard=0) at Sync.c:44
#7  0x000000386365c17e in gdk_flush () from /lib64/libgdk-x11-2.0.so.0
#8  0x00007f85b5f69289 in _wrap_gdk_flush (self=<optimized out>) at gdk.c:17325
#9  0x00000033b2e49dd3 in PyObject_Call (func=<built-in function flush>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#10 0x00007f85b4a627e9 in __pyx_pf_4xpra_3x11_7gtk_x11_12gdk_bindings_42_gw (__pyx_self=<optimized out>, __pyx_v_xwin=12582914L,
    __pyx_v_display=<gtk.gdk.Display at remote 0xfb6780>) at xpra/x11/gtk_x11/gdk_bindings.c:8311
#11 __pyx_pw_4xpra_3x11_7gtk_x11_12gdk_bindings_43_gw (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at xpra/x11/gtk_x11/gdk_bindings.c:8147
#12 0x00000033b2e49dd3 in PyObject_Call (func=<built-in function _gw>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#13 0x00007f85b4a66d0b in __pyx_f_4xpra_3x11_7gtk_x11_12gdk_bindings_x_event_filter (__pyx_v_e_gdk=0x7fff3f1f7530, __pyx_v_gdk_event=<optimized out>,
    __pyx_v_userdata=<optimized out>) at xpra/x11/gtk_x11/gdk_bindings.c:9054
#14 0x000000386365c7b5 in gdk_event_translate () from /lib64/libgdk-x11-2.0.so.0
#15 0x000000386365e968 in _gdk_events_queue () from /lib64/libgdk-x11-2.0.so.0
#16 0x000000386365ea0e in gdk_event_dispatch () from /lib64/libgdk-x11-2.0.so.0
#17 0x00007f85b6f06df6 in g_main_dispatch (context=0x104fbf0) at gmain.c:3054
#18 g_main_context_dispatch (context=context@entry=0x104fbf0) at gmain.c:3630
#19 0x00007f85b6f07148 in g_main_context_iterate (context=0x104fbf0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3701
#20 0x00007f85b6f0754a in g_main_loop_run (loop=0x1e4aa30) at gmain.c:3895
#21 0x0000003862f3ed47 in gtk_main () from /lib64/libgtk-x11-2.0.so.0

Sanitized as:

wait_for_reply()
gtk.gdk.flush()
xpra.x11.gtk_x11.gdk_bindings._gw()
xpra.x11.gtk_x11.gdk_bindings.x_event_filter()
gdk_event_translate()
gdk_event_dispatch()
gtk_main()

And here is another one (with 0.10.x r4000), showing up in a different place:

wait_for_reply()
XGetGeometry()
gdk_window_get_geometry()
resize_corral_window()
xpra.x11.gtk_x11.error.call_synced()
xpra.x11.gtk_x11.window.may_resize_corral_window()

With the X11 server just waiting for something:

(gdb) bt
#0  0x00000034528ed533 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000046692c in WaitForSomething ()
#2  0x0000000000436eb1 in Dispatch ()
#3  0x00000000004266da in main ()

Sometimes it also just kills the X11 server (no xpra backtrace in that case - no X11 backtrace yet..).

In the xpra backtrace case, it looks like we're stuck waiting for a reply from the X11 server, and none is coming... Possibly because we accessed the server from a non-UI thread somewhere. (Ouch!) Could also be some library problems (#422: the AMD OpenCL icd is known to cause weird problems)



Tue, 17 Dec 2013 07:28:20 GMT - Antoine Martin: owner, status, description, summary changed; version set

This is a regression, but not a recent one:


(and it could still be related to lib injection from buggy OpenCL / Nvidia bits.)

Let the bisection "fun" begin (using rgb encoding for testing):

So this crash is caused by r3613 which enables XShm (#345) by default. This can be confirmed by running normally crashing versions (including trunk) without problems by using:

XPRA_XSHM=0 xpra start ...

Problem is that XShm provides a major performance boost, so ideally we want to find a fix for the crash without turning it off completely..


Tue, 17 Dec 2013 08:42:25 GMT - Antoine Martin:

Adding some debug, this is often the last thing that shows before we crash:

2013-12-17 15:29:36,198 resize_corral_window()
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

Interestingly, I also got this error:

shmget failed: error 28 (No space left on device)

Which got me watching the shared memory segments with:

watch "ipcs | wc -l"

And this number goes up with every resize, but never seems to go back down again! Looks like something isn't freeing up the shared memory!


Tue, 17 Dec 2013 09:13:16 GMT - Antoine Martin: attachment set

this seems to fix the leak


Tue, 17 Dec 2013 09:30:52 GMT - Antoine Martin: attachment set

fixes the leak and adds some debug


Tue, 17 Dec 2013 09:42:39 GMT - Antoine Martin:

With the patch above I see:

2013-12-17 16:19:55,287 XShmWrapper.XShmCreateImage(740x416-24) True ID=203
2013-12-17 16:19:55,287 XShmWrapper.shmget(PRIVATE, 1234320 bytes, 1023) shmid=880476204
2013-12-17 16:19:55,287 XShmWrapper.shmat(880476204, NULL, 0) True
2013-12-17 16:19:55,287 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,287 XShmWrapper.get_image(4195268L, 0, 0, 740, 416)
2013-12-17 16:19:55,289 XShmWrapper.get_image(4195268L, 0, 0, 740, 416) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,289 XShmImageWrapper.get_image_pixels() self=XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,291 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, \
     'y': 213, 'x': 94, 'serial': 25676L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,292 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,292 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, \
     'y': 0, 'x': 0, 'serial': 25679L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,293 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,310 XShmWrapper.get_image(4194552L, 17, 301, 480, 13)
2013-12-17 16:19:55,311 XShmWrapper.get_image(4194552L, 17, 301, 480, 13) ref_count=1, \
    returning XShmImageWrapper(BGRX: 17, 301, 480, 13, ID=719)
2013-12-17 16:19:55,312 XShmWrapper.cleanup() ID=203, ref_count=1
2013-12-17 16:19:55,313 XShmWrapper.XShmCreateImage(743x418-24) True ID=204
2013-12-17 16:19:55,313 XShmWrapper.shmget(PRIVATE, 1245268 bytes, 1023) shmid=880508973
2013-12-17 16:19:55,313 XShmWrapper.shmat(880508973, NULL, 0) True
2013-12-17 16:19:55,313 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,314 XShmWrapper.get_image(4195272L, 0, 0, 743, 418)
2013-12-17 16:19:55,315 XShmWrapper.get_image(4195272L, 0, 0, 743, 418) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 743, 418, ID=720)
2013-12-17 16:19:55,316 XShmImageWrapper.free() ID=718, \
    free_callback=<built-in method free_image of xpra.x11.bindings.ximage.XShmWrapper object at 0x3c16be8>
2013-12-17 16:19:55,317 XShmWrapper.free_image() ID=203, closed=1, new ref_count=0
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, has_shm=True
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, image=0x3d6f590L
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

So, it looks to me like we still have an XShmWrapper live (ID=203, ref_count=1) when we call a new get_image following the window resize. We create a new wrapper (ID=204) and hand out a new image (ID=720), then when we free() the older wrapper, things blow up.

Thoughts:


Tue, 17 Dec 2013 15:57:11 GMT - Antoine Martin: attachment set

forces us to not use more than one XShm mapping per window


Tue, 17 Dec 2013 16:01:04 GMT - Antoine Martin: attachment set

improve xshm debugging and add unique ids and thread checking


Tue, 17 Dec 2013 16:09:50 GMT - Antoine Martin:

Fixed in r4963.


0.10.x backports in: r4965 and r4966 - keeping ticket opened for testing.


Thu, 19 Dec 2013 11:20:58 GMT - Antoine Martin: status changed; resolution set

0.10.10 released with the fix after testing ok - closing


Sat, 23 Jan 2021 04:56:43 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/475