Xpra: Ticket #329: Sound sink underruns seem to lead to spinners and eventual ping timeouts (for win32 clients)

Using a Windows client 0.9.0 r3152 and a Fedora server 0.9.0 r3163 over an internet connection, with several tabs open and video playing, the client starts experiencing sound sink queue underruns. These make the whole client choppy and eventually lead to paint_spinners and a connection crash.

I'm not sure if this is just a matter of connection issues or if there's something else going on. Below are the -d all logs from the time of one of the crashes (from the client side):

2013-05-07 20:00:13,361 mask_to_names(<flags 0 of type GdkModifierType>)=['mod2']
2013-05-07 20:00:13,377 mask_to_names(<flags 0 of type GdkModifierType>) GetKeyState(VK_NUMLOCK)=1
2013-05-07 20:00:13,378 mask_to_names(<flags 0 of type GdkModifierType>)=['mod2']
2013-05-07 20:00:13,392 mask_to_names(<flags 0 of type GdkModifierType>) GetKeyState(VK_NUMLOCK)=1
2013-05-07 20:00:13,394 mask_to_names(<flags 0 of type GdkModifierType>)=['mod2']
2013-05-07 20:00:13,552 check_echo_timeout(1367981953553) last_ping_echoed_time=1367981953553
2013-05-07 20:00:13,563 average server latency=1752.4, using max wait 4.50s
2013-05-07 20:00:18,071 check_server_echo: server is not responding, redrawing spinners over the windows
2013-05-07 20:00:18,076 paint_spinner(<gtk.gdk.CairoContext object at 0x02B00360>, gtk.gdk.Rectangle(0, 0, 834, 716))
2013-05-07 20:00:18,177 paint_spinner(<gtk.gdk.CairoContext object at 0x02B00360>, gtk.gdk.Rectangle(0, 0, 834, 716))
2013-05-07 20:00:18,276 paint_spinner(<gtk.gdk.CairoContext object at 0x02B00360>, gtk.gdk.Rectangle(0, 0, 834, 716))
2013-05-07 20:00:18,377 paint_spinner(<gtk.gdk.CairoContext object at 0x02B00360>, gtk.gdk.Rectangle(0, 0, 834, 716))
2013-05-07 20:00:18,480 sound sink: adding 104 bytes, {}
2013-05-07 20:00:18,483 need_data: 4096 bytes in /GstPipeline:pipeline0/GstAppSrc:src (__main__.GstAppSrc)
2013-05-07 20:00:18,483 sound sink queue underrun
2013-05-07 20:00:18,572 check_server_echo: server is OK again, stopping redraw
2013-05-07 20:00:23,553 check_echo_timeout(1367981963553) last_ping_echoed_time=1367981953553
2013-05-07 20:00:23,555 server ping timeout - waited 60 seconds without a response
2013-05-07 20:00:23,555 XpraClient.quit(2) current exit_code=None
2013-05-07 20:00:23,555 XpraClient.cleanup() client_extras=<xpra.win32.gui.ClientExtras object at 0x02B637D0>
2013-05-07 20:00:23,556 ClientExtras.cleanup() tray=<xpra.win32.win32_tray.Win32Tray instance at 0x04B4A8F0>
2013-05-07 20:00:23,558 ClientExtrasBase.cleanup() session_info_window=None, gtk.main_level()=1
2013-05-07 20:00:23,558 ClientExtrasBase.cleanup() done
2013-05-07 20:00:23,559 ClientExtras.cleanup() calling tray.close=<bound method Win32Tray.close of <xpra.win32.win32_tray.Win32Tray instance at 0x04B4A8F0>>
2013-05-07 20:00:23,559 Win32Tray.close() tray_widget=<xpra.win32.win32_NotifyIcon.win32NotifyIcon instance at 0x04B558A0>
2013-05-07 20:00:23,561 win32NotifyIcon.close() exit_callback=<function tray_exit at 0x04B4B730>
2013-05-07 20:00:23,563 OnDestroy(0,None,None,None) closed=False, exit_callback=None
2013-05-07 20:00:23,565 OnDestroy(..) calling Shell_NotifyIcon(NIM_DELETE, (525884, 0))
2013-05-07 20:00:23,595 OnDestroy(..) calling exit_callback=None
2013-05-07 20:00:23,596 detect_win32_session_events(None) missing handle!
2013-05-07 20:00:23,598 Win32Tray.close() ended
2013-05-07 20:00:23,598 ClientExtras.cleanup() ended
2013-05-07 20:00:23,599 XpraClient.stop_receiving_sound()
2013-05-07 20:00:23,601 SoundPipeline.cleanup()
2013-05-07 20:00:23,601 SoundPipeline.stop()
2013-05-07 20:00:23,628 need_data: 4096 bytes in /GstPipeline:pipeline0/GstAppSrc:src (__main__.GstAppSrc)
2013-05-07 20:00:23,653 SoundPipeline.stop() done
2013-05-07 20:00:23,654 SoundPipeline.cleanup() done
2013-05-07 20:00:23,655 sound: client_signalled_change(<XpraClient object at 0x290e6c0 (xpra+client+XpraClient at 0x238db30)>) is_on=False
2013-05-07 20:00:23,657 spk_off(())
2013-05-07 20:00:23,657 XpraClient.stop_receiving_sound()
2013-05-07 20:00:23,658 stop_receiving_sound: sound not started!
2013-05-07 20:00:23,660 XpraClient.stop_receiving_sound() done
2013-05-07 20:00:23,661 XpraClientBase.cleanup() protocol=Protocol(SocketConnection(('10.0.11.67', 52994) - ('107.6.90.194', 1201)))
2013-05-07 20:00:23,661 connection closed after 140854 packets received (164730258 bytes) and 19349 packets sent (638945 bytes)
2013-05-07 20:00:23,663 XpraClient.clean_mmap() mmap_file=None
2013-05-07 20:00:23,664 write thread: empty marker, exiting
2013-05-07 20:00:23,664 read thread: empty marker, exiting
2013-05-07 20:00:23,665 update_focus(2,False) _focused=2
2013-05-07 20:00:23,667 mask_to_names(<flags 0 of type GdkModifierType>) GetKeyState(VK_NUMLOCK)=1
2013-05-07 20:00:23,668 mask_to_names(<flags 0 of type GdkModifierType>)=['mod2']
2013-05-07 20:00:23,670 <class 'xpra.window_backing.PixmapBacking'>.close() video_decoder=<xpra.x264.codec.Decoder object at 0x02BD4D70>
2013-05-07 20:00:23,677 unknown win32 message: 28 / 0 / 2452
2013-05-07 20:00:23,688 XpraClient.cleanup() done
2013-05-07 20:00:23,690 XpraClient.quit(2) main loop at level 1, calling gtk quit via timeout
2013-05-07 20:00:23,690 average server latency=1752.4, using max wait 4.50s
2013-05-07 20:00:23,693 Connection lost
2013-05-07 20:00:23,694 XpraClient.quit(1) current exit_code=2
2013-05-07 20:00:23,694 XpraClient.cleanup() client_extras=<xpra.win32.gui.ClientExtras object at 0x02B637D0>
2013-05-07 20:00:23,696 ClientExtras.cleanup() tray=None
2013-05-07 20:00:23,698 ClientExtrasBase.cleanup() session_info_window=None, gtk.main_level()=1
2013-05-07 20:00:23,700 ClientExtrasBase.cleanup() done
2013-05-07 20:00:23,701 ClientExtras.cleanup() ended
2013-05-07 20:00:23,703 XpraClientBase.cleanup() protocol=None
2013-05-07 20:00:23,704 XpraClient.clean_mmap() mmap_file=None
2013-05-07 20:00:23,706 XpraClient.cleanup() done
2013-05-07 20:00:23,707 XpraClient.quit(1) main loop at level 1, calling gtk quit via timeout
2013-05-07 20:00:24,190 XpraClient.run() main loop ended, returning exit_code=2
2013-05-07 20:00:24,191 XpraClient.cleanup() client_extras=<xpra.win32.gui.ClientExtras object at 0x02B637D0>
2013-05-07 20:00:24,193 ClientExtras.cleanup() tray=None
2013-05-07 20:00:24,193 ClientExtrasBase.cleanup() session_info_window=None, gtk.main_level()=0
2013-05-07 20:00:24,194 ClientExtrasBase.cleanup() done
2013-05-07 20:00:24,194 ClientExtras.cleanup() ended
2013-05-07 20:00:24,194 XpraClientBase.cleanup() protocol=None
2013-05-07 20:00:24,196 XpraClient.clean_mmap() mmap_file=None
2013-05-07 20:00:24,196 XpraClient.cleanup() done
C:\Program Files (x86)\Xpra>


Wed, 08 May 2013 03:34:44 GMT - Antoine Martin: status, description changed; milestone set

Which codec was this? What sort of internet connection is this? What was the bandwidth usage at the time?

average server latency=1752.4

A latency of over a second is going to be problematic! (maybe sound data is overloading the connection?)

PS: please use shell redirection to capture log files so I don't have to edit the log to remove the idiotic carriage returns added by the obnoxious Windows DOS window.


Wed, 08 May 2013 19:31:52 GMT - alas: attachment set

xpra client logs for internet-cloud session crash


Wed, 08 May 2013 19:34:41 GMT - alas:

Ok, I attached a full log (client-side) of a couple-minute session leading up to a crash.

It's a Comcast cable modem (essentially residential) connection. I didn't get any measurements of bandwidth usage at the time, though.

(It looks like alternating sound sink overruns and underruns... and then a stretch of missing data triggers a disconnection call?)


Thu, 09 May 2013 03:21:04 GMT - alas: attachment set

client disconnect with eof message


Thu, 09 May 2013 03:24:17 GMT - alas:

I just added yet more log files, each with a seemingly different cause for disconnection.

Note: in each case it was just the client connection that was closed; I was able to reattach each time with no loss of state.


Thu, 09 May 2013 07:57:43 GMT - Antoine Martin: owner, status changed

OK, so this is using mp3:

starting mp3 sound sink

To know exactly where the new connection came from, please refer to your server log - which will include the IP and port of the client.

So this looks like a broken/reset TCP connection.

See above for figuring out where the connection came from.


Now, on to the sound side of things.

I see a lot of:

sound sink: adding 104 bytes, {}

which would seem to indicate that there was no sound being played then, right? (IIRC, that's the mp3 packet with just a header and no actual sound data - which we could actually shorten a bit more, but zlib probably does a decent enough job on it already)
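
As a rough illustration of that last point (this is not xpra code, and the bytes are made up): a header-only packet of around 104 bytes with almost no entropy compresses down to a handful of bytes with zlib, so there is little to be gained from shortening it further.

import zlib

# made-up, header-like payload of roughly 104 bytes with very little entropy
fake_header_only_packet = b"\xff\xfb\x90\x64" + b"\x00" * 100
compressed = zlib.compress(fake_header_only_packet)
print(len(fake_header_only_packet), "->", len(compressed))  # e.g. 104 -> roughly 15 bytes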


When running on a line with high latency and/or low bandwidth, you will need to use a leaky queue to allow the sound pipeline to recover. Otherwise, queuing the new sound packets has to wait for the pipeline to drain, which will cause problems.

Please try:

XPRA_SOUND_QUEUE_LEAK=1 or =2, etc.
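
For context, the "leaky queue" suggestion corresponds to the standard GStreamer queue element's leaky behaviour. Below is a minimal PyGObject sketch of the idea - not xpra's actual pipeline code; the GStreamer 1.0 API is shown for brevity (xpra of this era used GStreamer 0.10), and the 80ms value simply mirrors the default queue time discussed further down:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
queue = Gst.ElementFactory.make("queue", "sound-queue")
# leaky=2 ("downstream") drops the oldest queued buffers instead of blocking upstream,
# letting a pipeline that has fallen behind catch up rather than stall
queue.set_property("leaky", 2)
# bound the queue to roughly 80ms of audio (max-size-time is in nanoseconds)
queue.set_property("max-size-time", 80 * Gst.MSECOND)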


Fri, 10 May 2013 07:16:43 GMT - alas:

Adjusting XPRA_SOUND_QUEUE_LEAK to either =1 or =2 seems to cause so much sound choppiness as to be worse than no change.

XPRA_SOUND_QUEUE_TIME set to =70, however, seems to be of some use. I'm not sure if it is solving the issue entirely, but it does seem to be helping with the sound sync with MP3, if nothing else.

Will continue testing and let you know. Bandwidth doesn't seem to vary too much. Latencies up to 4,000 ms, while causing some spinners, didn't seem to crash the connection. Latencies that reached over 6,000 ms, however, caused a definite crash... but only after a very noticeable several seconds of spinners. Likewise, CPU and memory don't appear to be issues, if top in an xterm is any indication.


Fri, 10 May 2013 08:06:47 GMT - Antoine Martin:

Adjusting XPRA_SOUND_QUEUE_LEAK to either =1 or =2 seems to cause so much sound choppiness as to be worse than no change.

The point of this change is to figure out if it solves the disconnection problem or not. If it does, then we can figure out a way of solving the choppiness. Have you figured out why your session was disconnected by a new connection? After all, of the 3 log samples you provided, only one was caused by a connection drop-out (a tcp connection reset, by the looks of things). Best to target the real cause of the disconnections, if that is what you are trying to solve here.

As for setting the queue time to 70 instead of the default 80, I very much doubt this will make any difference at all. Something else must have been different in your tests; best to figure out what that is first.


Fri, 10 May 2013 23:44:19 GMT - alas: attachment set

Narrowing to Remote Host forced disconnect


Fri, 10 May 2013 23:53:02 GMT - alas:

Well, continued testing seems to have narrowed the problem down to repeated forced connection closures by the remote host. I attached another file of logs, with clips of the closure happening something like six times in five minutes. I don't see any indication of a reason (I assume the remote host is the client)... so I wonder if it would be worth the trouble to add some logging of the reasons a client would force a disconnection?

The bandwidth during this time was floating between 500kb/s and 1mb/s; the latency was generally between 100 and 200 ms, with peaks no higher than 500 or 600 ms. I do wonder if the time, which is 3:45-ish PDT and 6:45-ish EDT, might be a factor - a sort of internet rush hour impact for some reason? (The connection was fine for several hours previously, then began disconnecting at a mad pace.)


Sat, 11 May 2013 02:40:18 GMT - Antoine Martin:

If the log is from the client (it is), then the remote host is the server. Please post your server logs.

It may have more info since:

will call connection lost: read connection reset: [Errno 10054] An existing connection was forcibly closed by the remote host

And please confirm if using the leaky queue fixes the problem or not.

Was the bandwidth you posted in b/s or B/s? (confusing since the 'k' and 'm' are in lowercase). 1MB/s would be fairly high, whereas 1Mb/s would be low.
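
(For anyone following along, the arithmetic behind that distinction - purely illustrative, nothing xpra-specific:)

# bits vs bytes per second: a factor of 8 separates the two readings
print(1000000 / 8)  # 1 Mb/s is 125000 B/s (125 kB/s) - on the low side
print(1000000 * 8)  # 1 MB/s is 8000000 b/s (8 Mb/s) - fairly high for a residential uplink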


Tue, 14 May 2013 08:58:57 GMT - Scott Bennett:

Ran the server with debugging. The disconnection happened several times and is consistently reproducible on our San Jose server, which was previously very stable running 0.8.

The log file shows nothing significant; in fact, the server is not aware of the disconnection and cleans up the old connection when the client reconnects.

No other servers are exhibiting this problem and we are looking into it further. Any and all suggestions welcome.


Tue, 14 May 2013 14:36:07 GMT - Antoine Martin:

And please confirm if using the leaky queue fixes the problem or not.

This will help us narrow things down.


The log file shows nothing significant; in fact, the server is not aware of the disconnection and cleans up the old connection when the client reconnects.

That's odd; it should notice sooner or later - and by later I mean no more than one minute.

No other servers are exhibiting this problem and we are looking into it further. Any and all suggestions welcome.

If one version works and another does not, you can always bisect changesets between those two versions... a little bit tedious, but this generally works for isolating the problematic changeset.


Tue, 14 May 2013 22:51:13 GMT - alas:

Using a connection that sidesteps the server firewall seems to reduce the disconnections practically to nil. Have you had any problems with firewalls intercepting xpra connections?

Also, trying the leaky queue options (both =1 and =2, as client options) produced such choppy sound that it wasn't a viable option. I couldn't endure it long enough to see whether it fixed the connection issues or not. (Let me know, though, if there's some other question it might answer, aside from being a fix in its own right.)


Wed, 15 May 2013 03:03:31 GMT - Antoine Martin:

server firewall seems to reduce the disconnections practically to nil

A firewall should not be causing these sorts of problems. Xpra uses a single plain tcp (or ssh?) connection. Maybe the connection uses too much bandwidth (back to my original question) and the firewall starts dropping packets?

produced such choppy sound that it wasn't a viable option

Errrrr. Like I've said already, we're trying to figure out the cause; this is just for testing, so just turn the volume down:

The point of this change is to figure out if it solves the disconnection problem or not. If it does, then we can figure out a way of solving the choppiness.


Wed, 15 May 2013 19:47:09 GMT - alas: attachment set

Firewall flow trace for internet xpra remote session host disconnect


Wed, 15 May 2013 20:48:55 GMT - alas:

Trying with XPRA_SOUND_QUEUE_LEAK=1 & =2 doesn't change the disconnection problem (it just makes the connection time less enjoyable).

I also managed to get a flow-trace through the firewall, the logs of which I have attached.

The most interesting part is the following:

May 15 19:10:52 19:10:49.1907994:CID-0:RT:<InternalClient/49823->ExternalServer/1200;6> matched filter cloud1:
May 15 19:10:52 19:10:49.1907994:CID-0:RT:packet [520] ipid = 22858, @42303f1c
May 15 19:10:52 19:10:49.1907994:CID-0:RT:---- flow_process_pkt: (thd 2): flow_ctxt type 13, common flag 0x0, mbuf 0x42303d00, rtbl_idx = 0
May 15 19:10:52 19:10:49.1907994:CID-0:RT: flow process pak fast ifl 70 in_ifp vlan.1
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  vlan.1:InternalClient/49823->ExternalServer/1200, tcp, flag 18
May 15 19:10:52 19:10:49.1907994:CID-0:RT: find flow: table 0x4bbcb2f8, hash 38608(0xffff), sa InternalClient, da ExternalServer, sp 49823, dp 1200, proto 6, tok 6
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  no session found, start first path. in_tunnel - 0, from_cp_flag - 0
May 15 19:10:52 19:10:49.1907994:CID-0:RT:flow_send_icmp_tcp_rst: Sending tcp-rst
May 15 19:10:52 19:10:49.1907994:CID-0:RT:skip pre-frag: is_tunnel_if- 0, is_if_mtu_configured- 0
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  encap vector
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  no more encapping needed
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  **** pak processing end.
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  packet dropped, first pak not sync
May 15 19:10:52 19:10:49.1907994:CID-0:RT:  flow find session returns error.
May 15 19:10:52 19:10:49.1907994:CID-0:RT: ----- flow_process_pkt rc 0x7 (fp rc -1)
May 15 19:10:52 19:10:49.1907994:CID-0:RT:<ExternalServer/1200->InternalClient/49823;6> matched filter cloud1:
May 15 19:10:52 19:10:49.1908336:CID-0:RT:packet [40] ipid = 32809, @42df9880
May 15 19:10:52 19:10:49.1908367:CID-0:RT:---- flow_process_pkt: (thd 3): flow_ctxt type 8, common flag 0x0, mbuf 0x42df2a80, rtbl_idx = 0

It does indeed look like packet loss is the culprit... but the xpra session info (as displayed from the client) never (well, very rarely) indicates bandwidth much above 1000 KB/s, usually closer to 300-500 KB/s... so it doesn't seem to be the connection using too much bandwidth.


Thu, 16 May 2013 00:28:43 GMT - Antoine Martin:

no session found
Sending tcp-rst

That explains why we saw the tcp connection getting reset.

Why the firewall suddenly decides that it doesn't know about this session is the real problem. We are not doing anything low level with the connection, and from what you are saying the connection is loaded (so it is very unlikely to time out by itself). This looks like a firewall issue to me. Try swapping it for another one.


Thu, 16 May 2013 00:48:17 GMT - alas:

Well, the server doesn't even seem to realize there's been a disconnection, or at least it doesn't seem to care, since it continues to run and I can reconnect to it over and over again.

Unfortunately, the firewall is apparently one of the top three most popular ones and is likely to be used by customers, so we can't just dictate that it not be used.

Is there any way to make the xpra client more tolerant of the occasional dropped packets?


Thu, 16 May 2013 01:10:42 GMT - Antoine Martin:

The server should eventually realize. If not, then this is a bug. Reconnecting is never going to be a problem, since it creates a new tcp connection and the firewall's session tracking starts anew.

I am not suggesting that this firewall should not be used, just that you /try/ with a different one, as we're still trying to identify the real cause.

Xpra does not deal with reset connections: that's tcp, and it has absolutely nothing to do with xpra - and it's nothing we ever want to be dealing with. Btw, we're not talking about packet drops here (tcp deals with those gracefully and more or less transparently).
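
To illustrate what that reset looks like from the client's side, here is a minimal sketch - not xpra code; the host and port are placeholders, and on the Python 2 builds of that era the same condition surfaces as a socket.error with errno 10054/ECONNRESET rather than ConnectionResetError:

import socket

sock = socket.create_connection(("server.example.com", 1200))  # placeholder host and port
try:
    data = sock.recv(4096)  # blocks until data arrives or the connection is torn down
except ConnectionResetError as e:
    # a middlebox injecting a RST looks exactly like the peer resetting the connection:
    # on Windows this is error 10054, "connection forcibly closed by the remote host"
    print("connection reset:", e)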


Wed, 22 May 2013 01:42:51 GMT - alas:

(Running a 0.10 r3276 Windows client and a 0.9.3 r3276 Fedora server.)

Connecting over the internet with ssh (on a port lower than 1024) seems to resist the drops seen with a tcp connection, even with occasional latency spikes as high as 1500 to 4000 ms.

Using ssh over a port higher than 1024 also seems to resist drops, though it does hit occasional spinners. After a while (about 45 minutes) a period of recurring spinners led xpra to become non-responsive, with the server printing the same error message over and over:

2013-05-21 18:30:06,275 delayed_region_timeout: sending now - something is wrong!
2013-05-21 18:30:06,369 get_backlog found some damage acks that have been pending for too long, expiring them: [18753, 18754]

Any insight into what this error is coming from? Is it just a latency issue? Is it something that might be causing dropped tcp connections when there's no ssh involved? (It seems unlikely, but I thought I'd run it by you.)


Wed, 22 May 2013 02:14:25 GMT - Antoine Martin:

Connecting over the internet with ssh (on a port lower than 1024) seems to resist the drops...

How many attempts did you make? A firewall that does better connection tracking for lower ports would be a big problem for its manufacturer. This goes against TCP specs.

Using ssh over a port higher than 1024 also seems to resist drops, though it does hit occasional spinners

As per above, *if* the firewall does something that crazy, it's broken.

a period of recurring spinners led xpra to become non-responsive.. (..) delayed_region_timeout: sending now - something is wrong! (..) Any insight into what this error is coming from?

It means that the client is not receiving some picture packets, or is not acknowledging them. This usually means that the connection is very bad, and probably already broken.
I would have expected xpra (client or server) to disconnect shortly after.


Thu, 04 Jul 2013 19:00:44 GMT - alas: status changed


Fri, 12 Jul 2013 01:49:35 GMT - alas:

Ok, we now have positive confirmation that it was in fact the firewall causing all of this mischief. They are analyzing it. I'll leave this ticket open for now, pending their determination of a fix, in case it somehow requires something to be done by the server/client when going through a firewall.

(The firewall receives a packet from the server and responds with a RST packet back to the server, which then sends a RST packet to the client. Everyone happily disconnects, unexpectedly.)


Sat, 10 Aug 2013 05:35:13 GMT - Antoine Martin: status changed; resolution set

Closing. I very much doubt there is anything we can do at our end: the packets are compressed, so we cannot change their contents. The socket is just a plain TCP socket, and there aren't any options we can set to tell hops in the middle not to mess with it: this is the default behaviour of TCP networks.
If they are resetting the connection because of excessive bandwidth usage, then either bump the limit or file a feature request against xpra for bandwidth limiting.


Sat, 23 Jan 2021 04:51:51 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/329