Xpra: Ticket #750: lossless element re-painting seems to be overlooked when latency gets high

Running a (slightly customized) 0.14.12 windows client against a (also slightly customized) 0.14.12 fedora 20 client the rendering of the video elements and the lossless elements (text) seems very good until the server starts giving latency/delay warnings.

The video elements continue to display well as these warnings start outputting, but the lossless (text) elements begin to be blurry and increasingly fail to render when scrolling, especially, ceases.

I'm not certain whether this is a cause or maybe just a contributing factor, but here's a bit of server-side output:

Dec 02 10:43:55 node-1000 systemd[1]: Starting Cleanup of Temporary Directories...
Dec 02 10:43:55 node-1000 systemd[1]: Started Cleanup of Temporary Directories.
...
Dec 02 10:47:30 node-1000 dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Dec 02 10:47:30 node-1000 NetworkManager[408]: bound to 10.171.171.100 -- renewal in 979 seconds.
Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Dec 02 10:47:30 node-1000 systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 02 10:47:30 node-1000 dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 02 10:47:30 node-1000 systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,034 starting sound capture using pulseaudio device: Monitor of Null Output
Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,315 codec: MPEG-1 Layer 3 (MP3)
Dec 02 10:50:18 node-1000 server[653]: 2014-12-02 10:50:18,856 starting sound capture using pulseaudio device: Monitor of Null Output
Dec 02 10:50:19 node-1000 server[653]: 2014-12-02 10:50:19,122 codec: MPEG-1 Layer 3 (MP3)
Dec 02 10:50:53 node-1000 server[653]: 2014-12-02 10:50:53,757 send_delayed for wid 2, elapsed time 640.0 is above limit of 500.0 - sending now
Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,302 send_delayed for wid 2, elapsed time 516.0 is above limit of 500.0 - sending now
Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,914 send_delayed for wid 2, elapsed time 593.0 is above limit of 500.0 - sending now
Dec 02 10:50:55 node-1000 server[653]: 2014-12-02 10:50:55,673 send_delayed for wid 2, elapsed time 744.0 is above limit of 500.0 - sending now
Dec 02 10:51:02 node-1000 server[653]: 2014-12-02 10:51:02,856 send_delayed for wid 2, elapsed time 753.0 is above limit of 500.0 - sending now
Dec 02 10:51:03 node-1000 server[653]: 2014-12-02 10:51:03,488 send_delayed for wid 2, elapsed time 617.0 is above limit of 500.0 - sending now
Dec 02 10:51:04 node-1000 server[653]: 2014-12-02 10:51:04,082 send_delayed for wid 2, elapsed time 576.0 is above limit of 500.0 - sending now
Dec 02 10:51:09 node-1000 server[653]: 2014-12-02 10:51:09,913 send_delayed for wid 2, elapsed time 679.0 is above limit of 500.0 - sending now
Dec 02 10:51:10 node-1000 server[653]: 2014-12-02 10:51:10,624 send_delayed for wid 2, elapsed time 696.0 is above limit of 500.0 - sending now
Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,268 send_delayed for wid 2, elapsed time 626.0 is above limit of 500.0 - sending now
Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,966 send_delayed for wid 2, elapsed time 688.0 is above limit of 500.0 - sending now
Dec 02 10:51:17 node-1000 server[653]: 2014-12-02 10:51:17,636 send_delayed for wid 2, elapsed time 502.0 is above limit of 500.0 - sending now
Dec 02 10:51:21 node-1000 server[653]: 2014-12-02 10:51:21,819 send_delayed for wid 2, elapsed time 538.0 is above limit of 500.0 - sending now
Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,341 send_delayed for wid 2, elapsed time 509.0 is above limit of 500.0 - sending now
Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,948 send_delayed for wid 2, elapsed time 597.0 is above limit of 500.0 - sending now
Dec 02 10:51:23 node-1000 server[653]: 2014-12-02 10:51:23,514 send_delayed for wid 2, elapsed time 553.0 is above limit of 500.0 - sending now
Dec 02 10:51:24 node-1000 server[653]: 2014-12-02 10:51:24,231 send_delayed for wid 2, elapsed time 697.0 is above limit of 500.0 - sending now
Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,001 send_delayed for wid 2, elapsed time 757.0 is above limit of 500.0 - sending now
Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,533 send_delayed for wid 2, elapsed time 524.0 is above limit of 500.0 - sending now

I was running the client with -d paint but saw no output except the usual about the MP3 speaker re-starts.

Are there any other, better targeted, debugging flags that might be of more use (client or server side)?



Tue, 02 Dec 2014 21:07:06 GMT - Doug Doole: cc set

I'm seeing something similar. Here's what I reported to the mailing list:


Lately when using xpra, I've been seeing lots of spinners over top of my windows. (And I mean "lots" in the sense that xpra is pretty much unusable at times.) In the past, xpra was working without these delays.

When this happens, I see messages like this in the client log:

2014-12-02 13:20:53,784 server is not responding, drawing spinners over the windows
2014-12-02 13:21:05,074 server is OK again

In the server log, I see messages like this:

2014-12-02 13:20:23,261 delayed_region_timeout: region is 15001ms old, bad connection?
2014-12-02 13:20:28,963 send_delayed for wid 23, elapsed time 5701.0 is above limit of 500.0 - sending now

Occasionally it is so bad that xpra disconnects. At that point the server log will contain:

2014-12-02 13:24:08,767 Disconnecting client Protocol(SocketConnection(/home/doole/.xpra/reorx-1)): client ping timeout (waited 60 seconds without a response)
2014-12-02 13:24:08,769 xpra client disconnected.

Both my xpra server and client are on the same network (but not the same subnet). If I use "ssh -X" directly to the box, I do not see any sort of delays.

My windows are mostly konsole and gvim.

I'd love to be able to say "this problem started with version X", but I haven't been able to pin it down. Unfortunately, my environment went though multiple changes at the same time (xpra 0.13.x to 0.14.x, Ubuntu 12.04 to 14.04), so I may not be able to get back to a good state for comparison purposes.

I am running without speaker/microphone forwarding but everything else enabled. I have tried various encoding options, but it doesn't seem to make a difference.

My client and server are both running Ubuntu 14.04. I'm currently running xpra 0.14.12 at both the client and server, but I've been seeing this on all the xpra 0.14.x builds that I've tried.

Any suggestions? Anything I can do to help narrow this down further?


I have since tried running with -debug on the client, but it added no new information. (I'm waiting to try -debug on the server until I can afford to recycle my windows.)


Wed, 03 Dec 2014 20:47:27 GMT - Antoine Martin: status changed

r8177 allows us to easily tune the batch settings server side (the same env options could already be used client side):

XPRA_BATCH_MAX_DELAY=120 xpra start

This allows me to more easily get the send_delayed for wid ... - sending now to fire, but I am not seeing any problems.

How can I reproduce this bug?


Wed, 03 Dec 2014 21:13:46 GMT - Doug Doole:

I don't do anything special - a couple konsole windows open, one is frequently doing big compiles (so lots of scrolling text). Usually 2-5 gvim windows up, editing text, paging up and down, etc.


Thu, 04 Dec 2014 01:10:28 GMT - Antoine Martin:

r8188 adds the ability to turn off the video regions code, which may help:

XPRA_VIDEO_SUBREGION=0 xpra start

Thu, 04 Dec 2014 01:22:18 GMT - Antoine Martin:

Since this is refresh related, please post the refresh debug log:

xpra start -d refresh,regionrefresh

(the latter debug category is not needed with XPRA_VIDEO_SUBREGION=0)

@ddoole: are you connecting via ssh or tcp? Is there a firewall between you and the server, a forticlient perhaps? (ie: #742) What does the session info bandwidth graph show?


Thu, 04 Dec 2014 05:51:52 GMT - Antoine Martin: owner, status changed

@ddoole:


Thu, 04 Dec 2014 06:34:30 GMT - Antoine Martin: owner, status changed

I think I've just figured out what the problem is: your apps are repainting a lot (see #670 for gvim - #756 may help), and so the >500ms delay is needed. By cutting it short, we're making thing worse, eventually leading to the connection drop.


Thu, 04 Dec 2014 16:53:03 GMT - Doug Doole:

To answer Antoine's questions:

For totaam:


Thu, 04 Dec 2014 18:21:04 GMT - Doug Doole: attachment set

xpra info output during hang (over VDSL)


Thu, 04 Dec 2014 18:21:38 GMT - Doug Doole: attachment set

Performance graph during hang (over VDSL)


Thu, 04 Dec 2014 18:23:19 GMT - Doug Doole:

I see the same refresh problem working from home as I do in the office. This is over VDSL instead of local network though.

I attached the xpra info and the session info graphs for a hang.

renamed file for inlining


Thu, 04 Dec 2014 18:47:26 GMT - Antoine Martin: attachment set

renamed file for inlining


Thu, 04 Dec 2014 19:46:54 GMT - Antoine Martin: owner, status changed

@ddoole:

But the more I look at this, the more I think that you may just be having connectivity issues somehow. At first I thought that there was something wrong in the code I just reviewed and modified, but the fact that you are getting the spinners probably invalidates this hypothesis: pings just aren't coming back, or the network / main loops are getting stuck on it.

Not much more I can do on this without more information, preferably in the form of steps so that I can reproduce the problem..


Thu, 04 Dec 2014 19:59:51 GMT - Doug Doole:


Thu, 04 Dec 2014 20:58:05 GMT - Doug Doole:

I did a quick experiment with attaching with -d net --enable-pings. I didn't get any additional diagnostics at the client. I saw the spinners just as frequently, but they seemed to go away quicker.

See comment3 for my repro instructions. Running the ClearCase? xcleardiff tool (graphical file compare) is another good way to get the spinner coming up lots.


Fri, 05 Dec 2014 02:56:40 GMT - alas:

Well, I've been working on reproducing this, but the send_delayed ... messages don't seem to be correlating to my lossless element refresh issues.

Also filed #757, about an encoding pipeline BUG... which might be more relevant.

In any case, looks like my supposition should be deemed invalid, hopefully there will be a solution for ddoole's issues though.


Fri, 05 Dec 2014 14:38:23 GMT - Doug Doole:

Something new today. I'm in the office and was about to restart my server to add in some of the new options I've been asked to try. I figured I'd reproduce the spinners first before restarting everything.

I'm still getting the spinners reliably, but this time I'm not getting the send_delayed messages in the server log. (My server is still running with no -d option.)


Fri, 05 Dec 2014 15:38:38 GMT - Doug Doole:

I've changed my server to start with "--no-clipboard --no-mdns -d net" and the problem seems to have gone away. Things feel as responsive as they used to.

In the client logs, I do see occasional "server not responding" messages and the server log has very few "send_delayed" messages. (No additional information in the server log with "-d net".)

I'm guessing these messages are because of the known problems with gvim. Despite what the client log says, I haven't actually seen a spinner.

I'm going to continue experimenting and hopefully pin this down to one of mdns or clipboard. (I'm really hoping it's mdns - I can live without it. I need my clipboard though.)


Fri, 05 Dec 2014 17:33:04 GMT - Antoine Martin:

You may want to add -d clipboard to the client and server command lines.


Fri, 05 Dec 2014 18:15:36 GMT - Doug Doole: attachment set

Server log with -d clipboard


Fri, 05 Dec 2014 18:19:48 GMT - Doug Doole:

From my quick tests, it looks like --no-mdns does the trick for me.

To the best of my knowledge, I am not running any clipboard managers. (I go out of my way to kill klipper on any new system I set up.)

I did try running with "--no-mdns -d clipboard" in case anything interesting popped up. No new messages at the client. Nothing on the server said "here's a problem" to me, but I have attached the log if you'd like to take a look.

I'm going to run for a day or two with --no-mdns just to make sure it's not one of those quirks where the problem teases you and disappears for a while. Assuming --no-mdns is the key, anything else I can gather to help you pursue this further?


Fri, 05 Dec 2014 18:40:47 GMT - Antoine Martin:

Thanks! I'm not sure there's much more we can do to debug things, it looks like the python avahi code is swamping the program's main loop, probably waiting for IO somewhere it should not.

I have disabled mdns by default for trunk in r8211 and for v0.14.x in r8212, it will be in the next stable update (probably today).

Feel free to close this ticket if / when that works for you.


Fri, 05 Dec 2014 18:48:56 GMT - Doug Doole:

I'm still seeing some delays, even with --no-mdns.

Looking at the server log, there are a number of new do_owner_changed() calls logged. Is this significant?


Fri, 05 Dec 2014 18:50:23 GMT - Antoine Martin:

do_owner_changed()

means that something is changing the clipboard contents or ownership.


Fri, 05 Dec 2014 19:21:24 GMT - Doug Doole:

OK. That's fine then - I was putting stuff I the clipboard.

I'm starting to believe that I was too quick saying --no-mdns was enough. I've started getting lots of spinners again :-(

So I'm back to --no-ndns --no-clipboad and I'll see what happens as I use it more.


Fri, 05 Dec 2014 21:12:19 GMT - Doug Doole:

It looks like my initial success with --no-mdns --no-clipboard was just fate playing silly buggers with me. I'm back to having lots of spinners even though both were specified on my xpra start.

I'll start playing with the other options that were suggested and see if anything else makes a difference.


Mon, 08 Dec 2014 18:31:03 GMT - Doug Doole:

Tried connecting with tcp instead of ssh and that hasn't made a difference.


Tue, 09 Dec 2014 17:56:23 GMT - Doug Doole:

So, at this point I've tried:

And I am still seeing the delays. Any suggestions on other things to try or diagnostics to capture? (I really like xpra and hate the thought of having to go back to something like vnc.)


Tue, 09 Dec 2014 19:03:52 GMT - Antoine Martin:

The xpra info you posted shows that most window updates are sent out of the network layer very very fast: always under 50ms (look for damage.out_latency). You must have some very good server hardware or just very very small screen updates. Which only leaves the network layer (switches, firewalls, OS) and the client.

The only new testing suggestions that I can make at this point are:

etc..


Tue, 09 Dec 2014 20:24:44 GMT - Doug Doole: status changed; resolution set

I'm marking this as "invalid" (based on afarr also saying this is invalid in comment #13).

I just reproduced non-responsive windows using ssh -X and konsole with no xpra involved. So that definitely gets xpra off the hook. (I'd tried this before and hadn't seen any problems. A few minutes ago, I realized that my original non-xpra test wasn't 100% comparable to my xpra test. When I tested properly, I hit the hangs.)

So now I need to go find where to open a konsole bug... (In the mean time, it looks like xrvt will be my new friend.)

Thanks so much for your help in trying to narrow this down. Sorry for wasting your time.


Tue, 09 Dec 2014 23:24:10 GMT - Antoine Martin:

I've reverted the mdns changes in r8224.


Sat, 23 Jan 2021 05:04:52 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/750