xpra icon
Bug tracker and wiki

Opened 5 years ago

Closed 5 years ago

#475 closed defect (fixed)

server deadlocks when resizing fast updating window (ie: glxgears or mplayer)

Reported by: Antoine Martin Owned by: Antoine Martin
Priority: blocker Milestone: 0.11
Component: server Version: 0.10.x
Keywords: Cc:

Description (last modified by Antoine Martin)

It doesn't crash it just seems to be stuck:

(gdb) bt
#0  0x00000034528eb7fd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000003455409f42 in poll (__timeout=-1, __nfds=1, __fds=0x7fff3f1f6fb0) at /usr/include/bits/poll2.h:46
#2  _xcb_conn_wait (c=c@entry=0x1054260, cond=cond@entry=0x7fff3f1f7030, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:414
#3  0x000000345540b37f in wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:399
#4  0x000000345540b492 in xcb_wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:429
#5  0x0000003456041b47 in _XReply (dpy=dpy@entry=0xbe0c60, rep=rep@entry=0x7fff3f1f7140, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:602
#6  0x000000345603d76d in XSync (dpy=0xbe0c60, discard=0) at Sync.c:44
#7  0x000000386365c17e in gdk_flush () from /lib64/libgdk-x11-2.0.so.0
#8  0x00007f85b5f69289 in _wrap_gdk_flush (self=<optimized out>) at gdk.c:17325
#9  0x00000033b2e49dd3 in PyObject_Call (func=<built-in function flush>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#10 0x00007f85b4a627e9 in __pyx_pf_4xpra_3x11_7gtk_x11_12gdk_bindings_42_gw (__pyx_self=<optimized out>, __pyx_v_xwin=12582914L, 
    __pyx_v_display=<gtk.gdk.Display at remote 0xfb6780>) at xpra/x11/gtk_x11/gdk_bindings.c:8311
#11 __pyx_pw_4xpra_3x11_7gtk_x11_12gdk_bindings_43_gw (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at xpra/x11/gtk_x11/gdk_bindings.c:8147
#12 0x00000033b2e49dd3 in PyObject_Call (func=<built-in function _gw>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#13 0x00007f85b4a66d0b in __pyx_f_4xpra_3x11_7gtk_x11_12gdk_bindings_x_event_filter (__pyx_v_e_gdk=0x7fff3f1f7530, __pyx_v_gdk_event=<optimized out>, 
    __pyx_v_userdata=<optimized out>) at xpra/x11/gtk_x11/gdk_bindings.c:9054
#14 0x000000386365c7b5 in gdk_event_translate () from /lib64/libgdk-x11-2.0.so.0
#15 0x000000386365e968 in _gdk_events_queue () from /lib64/libgdk-x11-2.0.so.0
#16 0x000000386365ea0e in gdk_event_dispatch () from /lib64/libgdk-x11-2.0.so.0
#17 0x00007f85b6f06df6 in g_main_dispatch (context=0x104fbf0) at gmain.c:3054
#18 g_main_context_dispatch (context=context@entry=0x104fbf0) at gmain.c:3630
#19 0x00007f85b6f07148 in g_main_context_iterate (context=0x104fbf0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3701
#20 0x00007f85b6f0754a in g_main_loop_run (loop=0x1e4aa30) at gmain.c:3895
#21 0x0000003862f3ed47 in gtk_main () from /lib64/libgtk-x11-2.0.so.0

Sanitized as:

wait_for_reply()
gtk.gdk.flush()
xpra.x11.gtk_x11.gdk_bindings._gw()
xpra.x11.gtk_x11.gdk_bindings.x_event_filter()
gdk_event_translate()
gdk_event_dispatch()
gtk_main()

And here is another one (with 0.10.x r4000), showing up in a different place:

wait_for_reply()
XGetGeometry()
gdk_window_get_geometry()
resize_corral_window()
xpra.x11.gtk_x11.error.call_synced()
xpra.x11.gtk_x11.window.may_resize_corral_window()

With the X11 server just waiting for something:

(gdb) bt
#0  0x00000034528ed533 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000046692c in WaitForSomething ()
#2  0x0000000000436eb1 in Dispatch ()
#3  0x00000000004266da in main ()

Sometimes it also just kills the X11 server (no xpra backtrace in that case - no X11 backtrace yet..).

In the xpra backtrace case, it looks like we're stuck waiting for a reply from the X11 server, and none is coming...
Possibly because we accessed the server from a non-UI thread somewhere. (Ouch!)
Could also be some library problems (#422: the AMD OpenCL icd is known to cause weird problems)

Attachments (4)

xshm-fixleak.patch (1.6 KB) - added by Antoine Martin 5 years ago.
this seems to fix the leak
xshm-fixleak+debug.patch (11.5 KB) - added by Antoine Martin 5 years ago.
fixes the leak and adds some debug
xshm-single.patch (2.8 KB) - added by Antoine Martin 5 years ago.
forces us to not use more than one XShm mapping per window
xshm-debugging.patch (10.9 KB) - added by Antoine Martin 5 years ago.
improve xshm debugging and add unique ids and thread checking

Download all attachments as: .zip

Change History (9)

comment:1 Changed 5 years ago by Antoine Martin

Description: modified (diff)
Owner: changed from Antoine Martin to Antoine Martin
Status: newassigned
Summary: server deadlocks when resizing glxgearsserver deadlocks when resizing fast updating window (ie: glxgears or mplayer)
Version: 0.10.x

This is a regression, but not a recent one:

  • 0.9.x does not seem to be affected (could just be harder to hit..)
  • 0.10.x is affected (maybe a little harder to trigger than 0.11.x? races in resize code can be like that..)


(and it could still be related to lib injection from buggy OpenCL / Nvidia bits.)

Let the bisection "fun" begin (using rgb encoding for testing):

So this crash is caused by r3613 which enables XShm (#345) by default.
This can be confirmed by running normally crashing versions (including trunk) without problems by using:

XPRA_XSHM=0 xpra start ...

Problem is that XShm provides a major performance boost, so ideally we want to find a fix for the crash without turning it off completely..

Last edited 5 years ago by Antoine Martin (previous) (diff)

comment:2 Changed 5 years ago by Antoine Martin

Adding some debug, this is often the last thing that shows before we crash:

2013-12-17 15:29:36,198 resize_corral_window()
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

Interestingly, I also got this error:

shmget failed: error 28 (No space left on device)

Which got me watching the shared memory segments with:

watch "ipcs | wc -l"

And this number goes up with every resize, but never seems to go back down again! Looks like something isn't freeing up the shared memory!

Changed 5 years ago by Antoine Martin

Attachment: xshm-fixleak.patch added

this seems to fix the leak

Changed 5 years ago by Antoine Martin

Attachment: xshm-fixleak+debug.patch added

fixes the leak and adds some debug

comment:3 Changed 5 years ago by Antoine Martin

With the patch above I see:

2013-12-17 16:19:55,287 XShmWrapper.XShmCreateImage(740x416-24) True ID=203
2013-12-17 16:19:55,287 XShmWrapper.shmget(PRIVATE, 1234320 bytes, 1023) shmid=880476204
2013-12-17 16:19:55,287 XShmWrapper.shmat(880476204, NULL, 0) True
2013-12-17 16:19:55,287 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,287 XShmWrapper.get_image(4195268L, 0, 0, 740, 416)
2013-12-17 16:19:55,289 XShmWrapper.get_image(4195268L, 0, 0, 740, 416) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,289 XShmImageWrapper.get_image_pixels() self=XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,291 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, \
     'y': 213, 'x': 94, 'serial': 25676L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,292 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,292 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, \
     'y': 0, 'x': 0, 'serial': 25679L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,293 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,310 XShmWrapper.get_image(4194552L, 17, 301, 480, 13)
2013-12-17 16:19:55,311 XShmWrapper.get_image(4194552L, 17, 301, 480, 13) ref_count=1, \
    returning XShmImageWrapper(BGRX: 17, 301, 480, 13, ID=719)
2013-12-17 16:19:55,312 XShmWrapper.cleanup() ID=203, ref_count=1
2013-12-17 16:19:55,313 XShmWrapper.XShmCreateImage(743x418-24) True ID=204
2013-12-17 16:19:55,313 XShmWrapper.shmget(PRIVATE, 1245268 bytes, 1023) shmid=880508973
2013-12-17 16:19:55,313 XShmWrapper.shmat(880508973, NULL, 0) True
2013-12-17 16:19:55,313 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,314 XShmWrapper.get_image(4195272L, 0, 0, 743, 418)
2013-12-17 16:19:55,315 XShmWrapper.get_image(4195272L, 0, 0, 743, 418) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 743, 418, ID=720)
2013-12-17 16:19:55,316 XShmImageWrapper.free() ID=718, \
    free_callback=<built-in method free_image of xpra.x11.bindings.ximage.XShmWrapper object at 0x3c16be8>
2013-12-17 16:19:55,317 XShmWrapper.free_image() ID=203, closed=1, new ref_count=0
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, has_shm=True
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, image=0x3d6f590L
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

So, it looks to me like we still have an XShmWrapper live (ID=203, ref_count=1) when we call a new get_image following the window resize.
We create a new wrapper (ID=204) and hand out a new image (ID=720), then when we free() the older wrapper, things blow up.

Thoughts:

  • when we resize, we should probably close the wrapper (since it won't work)
  • waiting for the last picture to decrement the semaphore to zero may not be the best approach:
    • maybe we should weak reference the images and copy their pixels out of XShm when we want to free the wrapper
    • or just keep the wrapper around until it is properly freed (all images freed) and just fallback to the non-xshm code path Both solutions require memory copying - I think the second one is cleaner.
Last edited 5 years ago by Antoine Martin (previous) (diff)

Changed 5 years ago by Antoine Martin

Attachment: xshm-single.patch added

forces us to not use more than one XShm mapping per window

Changed 5 years ago by Antoine Martin

Attachment: xshm-debugging.patch added

improve xshm debugging and add unique ids and thread checking

comment:4 Changed 5 years ago by Antoine Martin

  • The reference issue was a red herring: it is not illegal to have more than one XShm handle for the same pixmap, just a bit wasteful - but in our case they will be of different sizes, which is why we need them. The attachment/ticket/475/xshm-single.patch implements a one-mapping-per-window policy, this makes things slower for no real benefit.
  • Even with those fixes, I still got some ominous crashes, including one or two claiming that we were accessing the X11 server multi-threaded. Surely not, right? Then I added lots of debug: attachment/ticket/475/xshm-debugging.patch and eventually found that yes, we are doing something very naughty...
  • The real problem comes from how we free the image wrapper class, which in turn may free the xshm wrapper if it is waiting to be closed: this call may come from another thread - and non-UI threads are not allowed to talk to the X11 server!

Fixed in r4963.


0.10.x backports in: r4965 and r4966 - keeping ticket opened for testing.

Last edited 5 years ago by Antoine Martin (previous) (diff)

comment:5 Changed 5 years ago by Antoine Martin

Resolution: fixed
Status: assignedclosed

0.10.10 released with the fix after testing ok - closing

Note: See TracTickets for help on using tickets.