Xpra: Ticket #1314: osx 1.0 r13763 client hangs when server crashes

Testing for #713 by leaving some windows (including a firefox window playing some random something or other), osx client 1.0 r13763 against fedora 23 server 1.0 r13778, I got a variety of server and gstreamer error messages after a time, leading to a server disconnection of the client.

The client hung, not fully disconnecting after the server disconnected it and not disconnecting after using a control-c ... I had to Force Quit it.

Some server error messages (more for context than anything):

2016-09-19 14:33:55,015 client 2: server is not responding, drawing spinners over the windows
2016-09-19 14:33:56,057 client 2: server is OK again
2016-09-19 14:34:41,614 client 2: Warning: long timer waiting time,
2016-09-19 14:34:41,616 client 2:  UI thread polling waited 10.0 seconds longer than intended (10.5 vs 0.5)
...
2016-09-19 14:57:15,159 client 2:  UI thread polling waited 19.7 seconds longer than intended (20.2 vs 0.5)
2016-09-19 14:57:20,223 sound source pipeline warning: Can't record audio fast enough
2016-09-19 14:57:20,449 sound source                   gstaudiobasesrc.c(866)
2016-09-19 14:57:20,449 sound source                    gst_audio_base_src_create ()
2016-09-19 14:57:20,449 sound source                    /GstPipeline
2016-09-19 14:57:20,450 sound source                   pipeline0/GstPulseSrc
2016-09-19 14:57:20,450 sound source                   src
2016-09-19 14:57:20,450 sound source
Dropped 9600 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
2016-09-19 14:57:35,116 client 2: Warning: long timer waiting time,
2016-09-19 14:57:35,166 client 2:  UI thread polling waited 19.5 seconds longer than intended (20.0 vs 0.5)
2016-09-19 14:57:55,135 client 2: Warning: long timer waiting time,
2016-09-19 14:57:55,161 client 2:  UI thread polling waited 19.5 seconds longer than intended (20.0 vs 0.5)
2016-09-19 14:58:07,015 sound source pipeline warning: Can't record audio fast enough
2016-09-19 14:58:07,158 sound source                   gstaudiobasesrc.c(866)
2016-09-19 14:58:07,218 sound source                    gst_audio_base_src_create ()
2016-09-19 14:58:07,219 sound source                    /GstPipeline
2016-09-19 14:58:07,219 sound source                   pipeline0/GstPulseSrc
2016-09-19 14:58:07,219 sound source                   src
2016-09-19 14:58:07,219 sound source
Dropped 11520 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
...
2016-09-19 15:02:05,087 client 2:  UI thread polling waited 19.5 seconds longer than intended (20.0 vs 0.5)
2016-09-19 15:02:21,763 client 2: server is not responding, drawing spinners over the windows
2016-09-19 15:02:43,392 sound source pipeline warning: Can't record audio fast enough
2016-09-19 15:02:47,509 sound source                   gstaudiobasesrc.c(866)
W: [null-sink] asyncq.c: q overrun, queuing locally
2016-09-19 15:02:55,913 sound source                    gst_audio_base_src_create ()
2016-09-19 15:03:09,793 sound source                    /GstPipeline
2016-09-19 15:03:27,683 sound source                   pipeline0/GstPulseSrc
2016-09-19 15:03:33,455 sound source                   src
2016-09-19 15:03:34,666 client 2: server is OK again
2016-09-19 15:03:34,404 sound source
Dropped 31200 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
2016-09-19 15:04:00,178 client 2: server is not responding, drawing spinners over the windows
2016-09-19 15:04:12,752 sound source pipeline warning: Can't record audio fast enough
2016-09-19 15:04:28,209 sound source                   gstaudiobasesrc.c(866)
2016-09-19 15:04:37,057 sound source                    gst_audio_base_src_create ()
2016-09-19 15:04:39,884 sound source                    /GstPipeline
2016-09-19 15:04:47,753 sound source                   pipeline0/GstPulseSrc
2016-09-19 15:04:39,548 Disconnecting client 10.0.11.102:53336:
2016-09-19 15:04:57,737  client ping timeout (waited 60 seconds without a response)
2016-09-19 15:05:04,838 sound source                   src
2016-09-19 15:05:09,486 client 2: server is OK again
2016-09-19 15:05:24,902 sound source
Dropped 16800 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
...
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 1836, in check_echo_timeout
2016-09-19 15:06:36,852 sound source pipeline warning: Can't record audio fast enough
2016-09-19 15:06:37,877 client 2: server is OK again
    self.disconnect(CLIENT_PING_TIMEOUT, "waited %s seconds without a response" % timeout)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 1105, in drop_client
    self.disconnect_client(proto, reason, *args)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_core.py", line 788, in disconnect_client
2016-09-19 15:06:41,091 sound source                   gstaudiobasesrc.c(866)
    self.disconnect_protocol(protocol, reason, *extra)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 936, in disconnect_protocol
2016-09-19 15:06:41,088 client 2: server is not responding, drawing spinners over the windows
    self.cleanup_protocol(protocol)
  File "/usr/lib64/python2.7/site-packages/xpra/x11/server.py", line 377, in cleanup_protocol
    2016-09-19 15:06:46,317 sound source                    gst_audio_base_src_create ()
X11ServerBase.cleanup_protocol(self, protocol)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 948, in cleanup_protocol
    self.cleanup_source(source)
2016-09-19 15:06:47,388 sound source                    /GstPipeline
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 957, in cleanup_source
2016-09-19 15:06:49,763 sound source                   pipeline0/GstPulseSrc
2016-09-19 15:06:53,206 sound source                   src
    source.close()
2016-09-19 15:07:02,808 sound source
Dropped 13440 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 466, in close
    self.remove_printers()
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 1818, in remove_printers
    remove_printer(k)
  File "/usr/lib64/python2.7/site-packages/xpra/platform/pycups_printing.py", line 296, in remove_printer
    exec_lpadmin(["-x", PRINTER_PREFIX+sanitize_name(name)])
  File "/usr/lib64/python2.7/site-packages/xpra/platform/pycups_printing.py", line 236, in exec_lpadmin
2016-09-19 15:07:45,602 sound source pipeline warning: Can't record audio fast enough
2016-09-19 15:07:50,692 sound source                   gstaudiobasesrc.c(866)
2016-09-19 15:07:55,331 sound source                    gst_audio_base_src_create ()
    2016-09-19 15:08:02,656 sound source                    /GstPipeline
2016-09-19 15:08:07,708 client 2: server is OK again
proc = subprocess.Popen(command, stdin=None, stdout=None, stderr=None, shell=False, close_fds=True, preexec_fn=preexec)
  File "/usr/lib64/python2.7/subprocess.py", line 710, in __init__
    2016-09-19 15:08:16,953 sound source                   pipeline0/GstPulseSrc
2016-09-19 15:08:17,401 client 2: server is not responding, drawing spinners over the windows
errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1231, in _execute_child
    self.pid = os.fork()
2016-09-19 15:08:29,005 sound source                   src
2016-09-19 15:08:34,557 sound source
Dropped 17280 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
2016-09-19 15:08:36,756 client 2: server ping timeout - waited 60 seconds without a response
OSError: [Errno 12] Cannot allocate memory
...
2016-09-19 15:09:22,671 xpra client 2 disconnected.
2016-09-19 15:09:23,928 sound source                   gstaudiobasesrc.c(866)
2016-09-19 15:09:25,193 sound source                    gst_audio_base_src_create ()
2016-09-19 15:09:29,451 sound source                    /GstPipeline
2016-09-19 15:09:30,078 sound source                   pipeline0/GstPulseSrc
2016-09-19 15:09:30,983 sound source                   src
2016-09-19 15:09:31,535 sound source
Dropped 12960 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.
2016-09-19 15:09:32,469 sound source stopping

Meanwhile, client side:

2016-09-19 15:02:29,926 server is not responding, drawing spinners over the windows
2016-09-19 15:02:52,853 server is OK again
2016-09-19 15:02:53,039 server is not responding, drawing spinners over the windows
2016-09-19 15:02:55,639 server ping timeout - waited 60 seconds without a response
2016-09-19 15:02:55,692 sound output stopping
/Users/Schadenfreude/Desktop/xpra-catalog/xpr-ant-1-13763/Xpra.app/Contents/Resources/lib/python/xpra/client/gtk_base/gtk_client_window_base.py:1279: Warning: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
  gtk.Window.destroy(self)
/Users/Schadenfreude/Desktop/xpra-catalog/xpr-ant-1-13763/Xpra.app/Contents/Resources/lib/python/xpra/gtk_common/gtk_util.py:383: Warning: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
  gtk.main()
2016-09-19 15:02:55,727 Connection lost
^C
got signal SIGINT, exiting
Killed: 9

After Force Quitting... I re-attached the client with a -d client, then, once it was connected again, tried a control-c again client side... triggering another hang with the following:

2016-09-19 15:30:26,066 check_server_echo(0) last=True, server_ok=True
2016-09-19 15:30:26,107 check_server_echo(0) last=True, server_ok=True
^C
got signal SIGINT, exiting
2016-09-19 15:30:26,134 UIXpraClient.signal_cleanup()
2016-09-19 15:30:26,134 stop_receiving_sound(True) sound sink=sink_subprocess_wrapper(17588)
2016-09-19 15:30:26,134 sound output stopping
2016-09-19 15:30:26,134 stop_receiving_sound(True) calling <bound method sink_subprocess_wrapper.cleanup of sink_subprocess_wrapper(17588)>
2016-09-19 15:30:26,135 stop_receiving_sound(True) done
2016-09-19 15:30:26,136 UIXpraClient.signal_cleanup() done
2016-09-19 15:30:26,137 signal_disconnect_and_quit(130, exit on signal SIGINT) exit_on_signal=False
2016-09-19 15:30:26,137 disconnect_and_quit(130, exit on signal SIGINT)
2016-09-19 15:30:26,137 GTKXpraClient.quit(130) current exit_code=130
2016-09-19 15:30:26,138 UIXpraClient.cleanup()
2016-09-19 15:30:26,138 XpraClientBase.cleanup() protocol=Protocol(tcp socket: 10.0.11.102:53953 <- 10.0.32.134:1203)
2016-09-19 15:30:26,139 calling <bound method Protocol.close of Protocol(tcp socket: 10.0.11.102:53953 <- 10.0.32.134:1203)>
2016-09-19 15:30:26,139 cleanup done
2016-09-19 15:30:26,144 UIXpraClient.cleanup() calling <class 'xpra.client.gtk_base.gtk_keyboard_helper.GTKKeyboardHelper'>.cleanup()
2016-09-19 15:30:26,144 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.osx_clipboard.OSXClipboardProtocolHelper'>.cleanup()
2016-09-19 15:30:26,146 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.osx_tray.OSXTray'>.cleanup()
2016-09-19 15:30:26,147 UIXpraClient.cleanup() calling <class 'xpra.client.gtk2.gtk2_notifier.GTK2_Notifier'>.cleanup()
2016-09-19 15:30:26,147 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.osx_menu.OSXMenuHelper'>.cleanup()
2016-09-19 15:30:26,147 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.gui.ClientExtras'>.cleanup()
2016-09-19 15:30:26,147 UIXpraClient.cleanup() calling <class 'xpra.codecs.video_helper.VideoHelper'>.cleanup()
2016-09-19 15:30:26,147 destroy_all_windows() destroying 24 / GLClientWindow(24 : gtk2.GLWindowBacking(24, (499, 316), None))
2016-09-19 15:30:26,147 destroy_window(24, GLClientWindow(24 : gtk2.GLWindowBacking(24, (499, 316), None)))
2016-09-19 15:30:26,153 destroy_all_windows() destroying 1 / GLClientWindow(1 : gtk2.GLWindowBacking(1, (499, 316), None))
2016-09-19 15:30:26,153 destroy_window(1, GLClientWindow(1 : gtk2.GLWindowBacking(1, (499, 316), None)))
2016-09-19 15:30:26,157 destroy_all_windows() destroying 2 / GLClientWindow(2 : gtk2.GLWindowBacking(2, (499, 316), None))
2016-09-19 15:30:26,157 destroy_window(2, GLClientWindow(2 : gtk2.GLWindowBacking(2, (499, 316), None)))
2016-09-19 15:30:26,161 destroy_all_windows() destroying 3 / GLClientWindow(3 : gtk2.GLWindowBacking(3, (1511, 826), None))
2016-09-19 15:30:26,161 destroy_window(3, GLClientWindow(3 : gtk2.GLWindowBacking(3, (1511, 826), None)))
2016-09-19 15:30:26,174 destroy_all_windows() destroying 18 / GLClientWindow(18 : gtk2.GLWindowBacking(18, (1108, 688), None))
2016-09-19 15:30:26,174 destroy_window(18, GLClientWindow(18 : gtk2.GLWindowBacking(18, (1108, 688), None)))
2016-09-19 15:30:26,186 XpraClient.clean_mmap() mmap_filename=None
2016-09-19 15:30:26,186 UIXpraClient.cleanup() done
2016-09-19 15:30:26,187 GTKXpraClient.quit(130) cleanup done, main_level=1
2016-09-19 15:30:26,187 GTKXpraClient.quit(130) main loop at level 1, calling gtk quit via timeout
2016-09-19 15:30:26,187 GTKXpraClient.exit() calling <function gtk_main_quit_really at 0x69fa170>
2016-09-19 15:30:26,187 GTKXpraClient.run_main_loop() main loop ended, returning exit_code=130
2016-09-19 15:30:26,187 UIXpraClient.cleanup()
2016-09-19 15:30:26,187 XpraClientBase.cleanup() protocol=None
2016-09-19 15:30:26,187 cleanup done
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.client.gtk_base.gtk_keyboard_helper.GTKKeyboardHelper'>.cleanup()
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.osx_clipboard.OSXClipboardProtocolHelper'>.cleanup()
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.osx_tray.OSXTray'>.cleanup()
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.client.gtk2.gtk2_notifier.GTK2_Notifier'>.cleanup()
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.osx_menu.OSXMenuHelper'>.cleanup()
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.platform.darwin.gui.ClientExtras'>.cleanup()
2016-09-19 15:30:26,188 UIXpraClient.cleanup() calling <class 'xpra.codecs.video_helper.VideoHelper'>.cleanup()
2016-09-19 15:30:26,189 XpraClient.clean_mmap() mmap_filename=None
2016-09-19 15:30:26,189 UIXpraClient.cleanup() done

After which, I had to use Force Quit again.

Some server side logs from the second attempt, with an apparent gap between the client-side control-c and the Disconnecting client message that seems to coincide with the Force quit:

2016-09-19 15:30:26,633 client 4: check_server_echo(0) last=True, server_ok=True
2016-09-19 15:30:26,661 client 4: UIXpraClient.signal_cleanup()
2016-09-19 15:30:26,661 client 4: stop_receiving_sound(True) sound sink=sink_subprocess_wrapper(17588)
2016-09-19 15:30:26,664 client 4: stop_receiving_sound(True) calling <bound method sink_subprocess_wrapper.cleanup of sink_subprocess_wrapper(17588)>
2016-09-19 15:30:26,664 client 4: stop_receiving_sound(True) done
2016-09-19 15:30:26,664 client 4: UIXpraClient.signal_cleanup() done
2016-09-19 15:30:26,665 client 4: signal_disconnect_and_quit(130, exit on signal SIGINT) exit_on_signal=False
2016-09-19 15:30:26,665 client 4: disconnect_and_quit(130, exit on signal SIGINT)
2016-09-19 15:30:26,667 sound source stopping
2016-09-19 15:31:30,735 Disconnecting client 10.0.11.102:53953:
2016-09-19 15:31:30,736  client ping timeout (waited 60 seconds without a response)
2016-09-19 15:31:30,875 xpra client 4 disconnected.


Tue, 20 Sep 2016 04:40:11 GMT - Antoine Martin: owner changed

Your server is having some serious problems:

But looks similar: your server is running out of memory.. Is this caused by the sound accumulating? Underpowered server setup? An application using up too much memory? It could also be the xpra server process accumulating too many video frames for av-async. Knowing which one is using up all the memory will tell us where the fix needs to go.

Those are completely different bugs than the osx client hang and should be filed separately.


As for the OSX hang itself, I am unable to reproduce it. If you can reproduce it, please try running with "-d all". Also try sending a SIGUSR1 to the xpra client process when it is hung, this should make it dump all threads states to the debug output so we can see which ones are stuck and where. (ie: kill -SIGUSR1 $PIDOFXPRACLIENT - you can find the pid using ps -ef | grep xpra, the main xpra process pid should usually be the second column of the first line)


Tue, 20 Sep 2016 22:32:06 GMT - alas:

I am going to presume, for now, that it had more to do with just not bothering to reboot that server vm for months and I just managed to leave enough hanging processes to chew up it's relatively lean memory (though I will see about repro'ing now that I have rebooted it). I'll make a separate ticket if I can do that again.

As for the osx client... I seem to be able to reproduce this at will. Using fedora 23 with a newly built 1.0 r13786 server and using the 1.0 r13763 osx client from your beta repo, any time I use the control-c client side it is hanging.

I'll attach an "excerpt" log with what looks like the interesting bits with just the -d all when I use control-c and also the interesting looking bits from the kill -SIGUSR1 20624 that I tried to use on a separate connection (which didn't kill the client actually, had to use force quit again).

I'll also attach the full client logs in each case, in case there's something there worth looking at.


Tue, 20 Sep 2016 22:33:03 GMT - alas: attachment set

log excerpts


Tue, 20 Sep 2016 22:33:31 GMT - alas: attachment set

-d all and use control-c


Tue, 20 Sep 2016 22:33:57 GMT - alas: attachment set

-d all and use SIGUSR1


Tue, 20 Sep 2016 22:34:19 GMT - alas: owner changed


Wed, 21 Sep 2016 04:34:37 GMT - Antoine Martin: owner changed

I really don't understand this one at all and I just cannot reproduce it all.

All the threads waiting to exit are daemon threads, so they shouldn't prevent the main thread from terminating. Maybe r13789 will help if this is caused by deadlock in the threading module?

If the latest builds still show the problem, please try to figure out when this problem started occurring (nearest revision) and whether this affects all OSX versions or not. (and looking at the code, it should affect all platforms equally.. not just OSX!?)


Wed, 21 Sep 2016 10:59:56 GMT - Antoine Martin:

See also #1292, which also "fixed itself"...

This could also be explained by the churn in #1306: if a library (openssl?) gets out of sync with the rest of the build env...


Fri, 23 Sep 2016 01:18:45 GMT - alas:

It's funny you should mention the "churn" in #1306 - I was testing a 1.0 r13786 and a 1.0 r13814 osx client built by Smo... and I wasn't seeing the same "hang" when trying to use control-c client-side to disconnect.

I will pursue this a bit more very soon, but wanted to mention that for now.


Wed, 28 Sep 2016 01:29:29 GMT - alas: owner changed

Testing a bit with your osx 1.0 r13888 client against a fedora 23 1.0 r13889 server... I'm still seeing a hang when trying to disconnect a client with a control-c.

...
2016-09-27 18:19:49,841 Xpra X11 server version 1.0-r13889 64-bit
2016-09-27 18:19:49,841  running on Linux Fedora 23 TwentyThree
2016-09-27 18:19:49,841 enabled remote logging
2016-09-27 18:19:49,858 Attached to tcp:10.0.32.134:1203 (press Control-C to detach)
2016-09-27 18:19:51,241 sound output using audio codec opus
2016-09-27 18:19:51,242 sound output using container format ogg
^C
got signal SIGINT, exiting
2016-09-27 18:19:54,210 sound output stopping
Exception in thread parse (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "threading.pyc", line 801, in __bootstrap_inner
  File "threading.pyc", line 754, in run
  File "/Users/Schadenfreude/Desktop/xpra-catalog/xpra-ant-1-13888/Xpra.app/Contents/Resources/lib/python/xpra/net/protocol.py", line 682, in _read_parse_thread_loop
  File "/Users/Schadenfreude/Desktop/xpra-catalog/xpra-ant-1-13888/Xpra.app/Contents/Resources/lib/python/xpra/net/protocol.py", line 708, in do_read_parse_thread_loop
  File "Queue.pyc", line 179, in get
  File "threading.pyc", line 384, in notify
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread write (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "threading.pyc", line 801, in __bootstrap_inner
  File "threading.pyc", line 754, in run
  File "/Users/Schadenfreude/Desktop/xpra-catalog/xpra-ant-1-13888/Xpra.app/Contents/Resources/lib/python/xpra/net/protocol.py", line 583, in _write_thread_loop
  File "/Users/Schadenfreude/Desktop/xpra-catalog/xpra-ant-1-13888/Xpra.app/Contents/Resources/lib/python/xpra/net/protocol.py", line 565, in _io_thread_loop
  File "/Users/Schadenfreude/Desktop/xpra-catalog/xpra-ant-1-13888/Xpra.app/Contents/Resources/lib/python/xpra/net/protocol.py", line 585, in _write
  File "Queue.pyc", line 179, in get
  File "threading.pyc", line 384, in notify
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Killed: 9

... with the Killed: 9 triggered by a required force quit.

Interestingly, when I start the client with --opengl=on (to override the Intel graphics card), the control-c doesn't generate any errors client side (though I do still have to use a force quit for the hung client).

Meanwhile, I'm noticing this error server-side with each disconnection, twice:

Traceback (most recent call last):
 File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 1658, in do_send_cursor
   w, h, _xhot, _yhot, serial, pixels, name = cursor_data[2:8]
ValueError: need more than 6 values to unpack

Mon, 10 Oct 2016 13:49:22 GMT - Antoine Martin: status changed; resolution set

The traceback in do_send_cursor has been fixed since r13891.

Diagnostics: running with "-d all" I see that the last few log entries were often related to printer forwarding, with more messages after the hang. Running with "--printing=no" fixes the problem.

Subtle bug fixed in r14104, closing. Feel free to re-open if you hit this again.


Sat, 23 Jan 2021 05:20:52 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/1314