Running a (slightly customized) 0.14.12 windows client against a (also slightly customized) 0.14.12 fedora 20 client the rendering of the video elements and the lossless elements (text) seems very good until the server starts giving latency/delay warnings.
The video elements continue to display well as these warnings start outputting, but the lossless (text) elements begin to be blurry and increasingly fail to render when scrolling, especially, ceases.
I'm not certain whether this is a cause or maybe just a contributing factor, but here's a bit of server-side output:
Dec 02 10:43:55 node-1000 systemd[1]: Starting Cleanup of Temporary Directories... Dec 02 10:43:55 node-1000 systemd[1]: Started Cleanup of Temporary Directories. ... Dec 02 10:47:30 node-1000 dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' Dec 02 10:47:30 node-1000 NetworkManager[408]: bound to 10.171.171.100 -- renewal in 979 seconds. Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' Dec 02 10:47:30 node-1000 systemd[1]: Starting Network Manager Script Dispatcher Service... Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher' Dec 02 10:47:30 node-1000 dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher' Dec 02 10:47:30 node-1000 systemd[1]: Started Network Manager Script Dispatcher Service. Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,034 starting sound capture using pulseaudio device: Monitor of Null Output Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,315 codec: MPEG-1 Layer 3 (MP3) Dec 02 10:50:18 node-1000 server[653]: 2014-12-02 10:50:18,856 starting sound capture using pulseaudio device: Monitor of Null Output Dec 02 10:50:19 node-1000 server[653]: 2014-12-02 10:50:19,122 codec: MPEG-1 Layer 3 (MP3) Dec 02 10:50:53 node-1000 server[653]: 2014-12-02 10:50:53,757 send_delayed for wid 2, elapsed time 640.0 is above limit of 500.0 - sending now Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,302 send_delayed for wid 2, elapsed time 516.0 is above limit of 500.0 - sending now Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,914 send_delayed for wid 2, elapsed time 593.0 is above limit of 500.0 - sending now Dec 02 10:50:55 node-1000 server[653]: 2014-12-02 10:50:55,673 send_delayed for wid 2, elapsed time 744.0 is above limit of 500.0 - sending now Dec 02 10:51:02 node-1000 server[653]: 2014-12-02 10:51:02,856 send_delayed for wid 2, elapsed time 753.0 is above limit of 500.0 - sending now Dec 02 10:51:03 node-1000 server[653]: 2014-12-02 10:51:03,488 send_delayed for wid 2, elapsed time 617.0 is above limit of 500.0 - sending now Dec 02 10:51:04 node-1000 server[653]: 2014-12-02 10:51:04,082 send_delayed for wid 2, elapsed time 576.0 is above limit of 500.0 - sending now Dec 02 10:51:09 node-1000 server[653]: 2014-12-02 10:51:09,913 send_delayed for wid 2, elapsed time 679.0 is above limit of 500.0 - sending now Dec 02 10:51:10 node-1000 server[653]: 2014-12-02 10:51:10,624 send_delayed for wid 2, elapsed time 696.0 is above limit of 500.0 - sending now Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,268 send_delayed for wid 2, elapsed time 626.0 is above limit of 500.0 - sending now Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,966 send_delayed for wid 2, elapsed time 688.0 is above limit of 500.0 - sending now Dec 02 10:51:17 node-1000 server[653]: 2014-12-02 10:51:17,636 send_delayed for wid 2, elapsed time 502.0 is above limit of 500.0 - sending now Dec 02 10:51:21 node-1000 server[653]: 2014-12-02 10:51:21,819 send_delayed for wid 2, elapsed time 538.0 is above limit of 500.0 - sending now Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,341 send_delayed for wid 2, elapsed time 509.0 is above limit of 500.0 - sending now Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,948 send_delayed for wid 2, elapsed time 597.0 is above limit of 500.0 - sending now Dec 02 10:51:23 node-1000 server[653]: 2014-12-02 10:51:23,514 send_delayed for wid 2, elapsed time 553.0 is above limit of 500.0 - sending now Dec 02 10:51:24 node-1000 server[653]: 2014-12-02 10:51:24,231 send_delayed for wid 2, elapsed time 697.0 is above limit of 500.0 - sending now Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,001 send_delayed for wid 2, elapsed time 757.0 is above limit of 500.0 - sending now Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,533 send_delayed for wid 2, elapsed time 524.0 is above limit of 500.0 - sending now
I was running the client with -d paint
but saw no output except the usual about the MP3 speaker re-starts.
Are there any other, better targeted, debugging flags that might be of more use (client or server side)?
I'm seeing something similar. Here's what I reported to the mailing list:
Lately when using xpra, I've been seeing lots of spinners over top of my windows. (And I mean "lots" in the sense that xpra is pretty much unusable at times.) In the past, xpra was working without these delays.
When this happens, I see messages like this in the client log:
2014-12-02 13:20:53,784 server is not responding, drawing spinners over the windows 2014-12-02 13:21:05,074 server is OK again
In the server log, I see messages like this:
2014-12-02 13:20:23,261 delayed_region_timeout: region is 15001ms old, bad connection? 2014-12-02 13:20:28,963 send_delayed for wid 23, elapsed time 5701.0 is above limit of 500.0 - sending now
Occasionally it is so bad that xpra disconnects. At that point the server log will contain:
2014-12-02 13:24:08,767 Disconnecting client Protocol(SocketConnection(/home/doole/.xpra/reorx-1)): client ping timeout (waited 60 seconds without a response) 2014-12-02 13:24:08,769 xpra client disconnected.
Both my xpra server and client are on the same network (but not the same subnet). If I use "ssh -X" directly to the box, I do not see any sort of delays.
My windows are mostly konsole and gvim.
I'd love to be able to say "this problem started with version X", but I haven't been able to pin it down. Unfortunately, my environment went though multiple changes at the same time (xpra 0.13.x to 0.14.x, Ubuntu 12.04 to 14.04), so I may not be able to get back to a good state for comparison purposes.
I am running without speaker/microphone forwarding but everything else enabled. I have tried various encoding options, but it doesn't seem to make a difference.
My client and server are both running Ubuntu 14.04. I'm currently running xpra 0.14.12 at both the client and server, but I've been seeing this on all the xpra 0.14.x builds that I've tried.
Any suggestions? Anything I can do to help narrow this down further?
I have since tried running with -debug on the client, but it added no new information. (I'm waiting to try -debug on the server until I can afford to recycle my windows.)
r8177 allows us to easily tune the batch settings server side (the same env options could already be used client side):
XPRA_BATCH_MAX_DELAY=120 xpra start
This allows me to more easily get the send_delayed for wid ... - sending now
to fire, but I am not seeing any problems.
How can I reproduce this bug?
I don't do anything special - a couple konsole windows open, one is frequently doing big compiles (so lots of scrolling text). Usually 2-5 gvim windows up, editing text, paging up and down, etc.
r8188 adds the ability to turn off the video regions code, which may help:
XPRA_VIDEO_SUBREGION=0 xpra start
Since this is refresh related, please post the refresh debug log:
xpra start -d refresh,regionrefresh
(the latter debug category is not needed with XPRA_VIDEO_SUBREGION=0
)
@ddoole: are you connecting via ssh or tcp? Is there a firewall between you and the server, a forticlient perhaps? (ie: #742) What does the session info bandwidth graph show?
@ddoole:
-d debug
won't do anything, see xpra -d help
for the list of debug categories - in this particular case, you may also want encoding
or refresh
on the server, and maybe paint
on the client
xpra info
collected when things are going wrong
I think I've just figured out what the problem is: your apps are repainting a lot (see #670 for gvim - #756 may help), and so the >500ms delay is needed. By cutting it short, we're making thing worse, eventually leading to the connection drop.
To answer Antoine's questions:
For totaam:
xpra info output during hang (over VDSL)
Performance graph during hang (over VDSL)
I see the same refresh problem working from home as I do in the office. This is over VDSL instead of local network though.
I attached the xpra info and the session info graphs for a hang.
renamed file for inlining
@ddoole:
-d net
on the server and client
--enable-pings
and see how that influences how quickly the spinners come up
But the more I look at this, the more I think that you may just be having connectivity issues somehow. At first I thought that there was something wrong in the code I just reviewed and modified, but the fact that you are getting the spinners probably invalidates this hypothesis: pings just aren't coming back, or the network / main loops are getting stuck on it.
Not much more I can do on this without more information, preferably in the form of steps so that I can reproduce the problem..
I did a quick experiment with attaching with -d net --enable-pings. I didn't get any additional diagnostics at the client. I saw the spinners just as frequently, but they seemed to go away quicker.
See comment3 for my repro instructions. Running the ClearCase? xcleardiff tool (graphical file compare) is another good way to get the spinner coming up lots.
Well, I've been working on reproducing this, but the send_delayed ...
messages don't seem to be correlating to my lossless element refresh issues.
Also filed #757, about an encoding pipeline BUG... which might be more relevant.
In any case, looks like my supposition should be deemed invalid
, hopefully there will be a solution for ddoole's issues though.
Something new today. I'm in the office and was about to restart my server to add in some of the new options I've been asked to try. I figured I'd reproduce the spinners first before restarting everything.
I'm still getting the spinners reliably, but this time I'm not getting the send_delayed messages in the server log. (My server is still running with no -d option.)
I've changed my server to start with "--no-clipboard --no-mdns -d net" and the problem seems to have gone away. Things feel as responsive as they used to.
In the client logs, I do see occasional "server not responding" messages and the server log has very few "send_delayed" messages. (No additional information in the server log with "-d net".)
I'm guessing these messages are because of the known problems with gvim. Despite what the client log says, I haven't actually seen a spinner.
I'm going to continue experimenting and hopefully pin this down to one of mdns or clipboard. (I'm really hoping it's mdns - I can live without it. I need my clipboard though.)
You may want to add -d clipboard
to the client and server command lines.
Server log with -d clipboard
From my quick tests, it looks like --no-mdns does the trick for me.
To the best of my knowledge, I am not running any clipboard managers. (I go out of my way to kill klipper on any new system I set up.)
I did try running with "--no-mdns -d clipboard" in case anything interesting popped up. No new messages at the client. Nothing on the server said "here's a problem" to me, but I have attached the log if you'd like to take a look.
I'm going to run for a day or two with --no-mdns just to make sure it's not one of those quirks where the problem teases you and disappears for a while. Assuming --no-mdns is the key, anything else I can gather to help you pursue this further?
Thanks! I'm not sure there's much more we can do to debug things, it looks like the python avahi code is swamping the program's main loop, probably waiting for IO somewhere it should not.
I have disabled mdns by default for trunk in r8211 and for v0.14.x in r8212, it will be in the next stable update (probably today).
Feel free to close this ticket if / when that works for you.
I'm still seeing some delays, even with --no-mdns.
Looking at the server log, there are a number of new do_owner_changed() calls logged. Is this significant?
do_owner_changed()
means that something is changing the clipboard contents or ownership.
OK. That's fine then - I was putting stuff I the clipboard.
I'm starting to believe that I was too quick saying --no-mdns was enough. I've started getting lots of spinners again :-(
So I'm back to --no-ndns --no-clipboad and I'll see what happens as I use it more.
It looks like my initial success with --no-mdns --no-clipboard was just fate playing silly buggers with me. I'm back to having lots of spinners even though both were specified on my xpra start.
I'll start playing with the other options that were suggested and see if anything else makes a difference.
Tried connecting with tcp instead of ssh and that hasn't made a difference.
So, at this point I've tried:
And I am still seeing the delays. Any suggestions on other things to try or diagnostics to capture? (I really like xpra and hate the thought of having to go back to something like vnc.)
The xpra info you posted shows that most window updates are sent out of the network layer very very fast: always under 50ms (look for damage.out_latency
).
You must have some very good server hardware or just very very small screen updates.
Which only leaves the network layer (switches, firewalls, OS) and the client.
The only new testing suggestions that I can make at this point are:
etc..
I'm marking this as "invalid" (based on afarr also saying this is invalid in comment #13).
I just reproduced non-responsive windows using ssh -X and konsole with no xpra involved. So that definitely gets xpra off the hook. (I'd tried this before and hadn't seen any problems. A few minutes ago, I realized that my original non-xpra test wasn't 100% comparable to my xpra test. When I tested properly, I hit the hangs.)
So now I need to go find where to open a konsole bug... (In the mean time, it looks like xrvt will be my new friend.)
Thanks so much for your help in trying to narrow this down. Sorry for wasting your time.
I've reverted the mdns changes in r8224.
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/750