Xpra: Ticket #750: lossless element re-painting seems to be overlooked when latency gets high

Running a (slightly customized) 0.14.12 windows client against a (also slightly customized) 0.14.12 fedora 20 client the rendering of the video elements and the lossless elements (text) seems very good until the server starts giving latency/delay warnings.

The video elements continue to display well as these warnings start outputting, but the lossless (text) elements begin to be blurry and increasingly fail to render when scrolling, especially, ceases.

I'm not certain whether this is a cause or maybe just a contributing factor, but here's a bit of server-side output:

Dec 02 10:43:55 node-1000 systemd[1]: Starting Cleanup of Temporary Directories...
Dec 02 10:43:55 node-1000 systemd[1]: Started Cleanup of Temporary Directories.
...
Dec 02 10:47:30 node-1000 dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Dec 02 10:47:30 node-1000 NetworkManager[408]: bound to 10.171.171.100 -- renewal in 979 seconds.
Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Dec 02 10:47:30 node-1000 systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 02 10:47:30 node-1000 dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 02 10:47:30 node-1000 systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,034 starting sound capture using pulseaudio device: Monitor of Null Output
Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,315 codec: MPEG-1 Layer 3 (MP3)
Dec 02 10:50:18 node-1000 server[653]: 2014-12-02 10:50:18,856 starting sound capture using pulseaudio device: Monitor of Null Output
Dec 02 10:50:19 node-1000 server[653]: 2014-12-02 10:50:19,122 codec: MPEG-1 Layer 3 (MP3)
Dec 02 10:50:53 node-1000 server[653]: 2014-12-02 10:50:53,757 send_delayed for wid 2, elapsed time 640.0 is above limit of 500.0 - sending now
Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,302 send_delayed for wid 2, elapsed time 516.0 is above limit of 500.0 - sending now
Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,914 send_delayed for wid 2, elapsed time 593.0 is above limit of 500.0 - sending now
Dec 02 10:50:55 node-1000 server[653]: 2014-12-02 10:50:55,673 send_delayed for wid 2, elapsed time 744.0 is above limit of 500.0 - sending now
Dec 02 10:51:02 node-1000 server[653]: 2014-12-02 10:51:02,856 send_delayed for wid 2, elapsed time 753.0 is above limit of 500.0 - sending now
Dec 02 10:51:03 node-1000 server[653]: 2014-12-02 10:51:03,488 send_delayed for wid 2, elapsed time 617.0 is above limit of 500.0 - sending now
Dec 02 10:51:04 node-1000 server[653]: 2014-12-02 10:51:04,082 send_delayed for wid 2, elapsed time 576.0 is above limit of 500.0 - sending now
Dec 02 10:51:09 node-1000 server[653]: 2014-12-02 10:51:09,913 send_delayed for wid 2, elapsed time 679.0 is above limit of 500.0 - sending now
Dec 02 10:51:10 node-1000 server[653]: 2014-12-02 10:51:10,624 send_delayed for wid 2, elapsed time 696.0 is above limit of 500.0 - sending now
Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,268 send_delayed for wid 2, elapsed time 626.0 is above limit of 500.0 - sending now
Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,966 send_delayed for wid 2, elapsed time 688.0 is above limit of 500.0 - sending now
Dec 02 10:51:17 node-1000 server[653]: 2014-12-02 10:51:17,636 send_delayed for wid 2, elapsed time 502.0 is above limit of 500.0 - sending now
Dec 02 10:51:21 node-1000 server[653]: 2014-12-02 10:51:21,819 send_delayed for wid 2, elapsed time 538.0 is above limit of 500.0 - sending now
Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,341 send_delayed for wid 2, elapsed time 509.0 is above limit of 500.0 - sending now
Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,948 send_delayed for wid 2, elapsed time 597.0 is above limit of 500.0 - sending now
Dec 02 10:51:23 node-1000 server[653]: 2014-12-02 10:51:23,514 send_delayed for wid 2, elapsed time 553.0 is above limit of 500.0 - sending now
Dec 02 10:51:24 node-1000 server[653]: 2014-12-02 10:51:24,231 send_delayed for wid 2, elapsed time 697.0 is above limit of 500.0 - sending now
Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,001 send_delayed for wid 2, elapsed time 757.0 is above limit of 500.0 - sending now
Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,533 send_delayed for wid 2, elapsed time 524.0 is above limit of 500.0 - sending now

I was running the client with -d paint but saw no output except the usual about the MP3 speaker re-starts.

Are there any other, better targeted, debugging flags that might be of more use (client or server side)?

Tue, 02 Dec 2014 21:07:06 GMT - Doug Doole: cc set

cc dougdoole@… added

I'm seeing something similar. Here's what I reported to the mailing list:

Lately when using xpra, I've been seeing lots of spinners over top of my windows. (And I mean "lots" in the sense that xpra is pretty much unusable at times.) In the past, xpra was working without these delays.

When this happens, I see messages like this in the client log:

2014-12-02 13:20:53,784 server is not responding, drawing spinners over the windows
2014-12-02 13:21:05,074 server is OK again

In the server log, I see messages like this:

2014-12-02 13:20:23,261 delayed_region_timeout: region is 15001ms old, bad connection?
2014-12-02 13:20:28,963 send_delayed for wid 23, elapsed time 5701.0 is above limit of 500.0 - sending now

Occasionally it is so bad that xpra disconnects. At that point the server log will contain:

2014-12-02 13:24:08,767 Disconnecting client Protocol(SocketConnection(/home/doole/.xpra/reorx-1)): client ping timeout (waited 60 seconds without a response)
2014-12-02 13:24:08,769 xpra client disconnected.

Both my xpra server and client are on the same network (but not the same subnet). If I use "ssh -X" directly to the box, I do not see any sort of delays.

My windows are mostly konsole and gvim.

I'd love to be able to say "this problem started with version X", but I haven't been able to pin it down. Unfortunately, my environment went though multiple changes at the same time (xpra 0.13.x to 0.14.x, Ubuntu 12.04 to 14.04), so I may not be able to get back to a good state for comparison purposes.

I am running without speaker/microphone forwarding but everything else enabled. I have tried various encoding options, but it doesn't seem to make a difference.

My client and server are both running Ubuntu 14.04. I'm currently running xpra 0.14.12 at both the client and server, but I've been seeing this on all the xpra 0.14.x builds that I've tried.

Any suggestions? Anything I can do to help narrow this down further?

I have since tried running with -debug on the client, but it added no new information. (I'm waiting to try -debug on the server until I can afford to recycle my windows.)

Wed, 03 Dec 2014 20:47:27 GMT - Antoine Martin: status changed

status changed from new to assigned

r8177 allows us to easily tune the batch settings server side (the same env options could already be used client side):

XPRA_BATCH_MAX_DELAY=120 xpra start

This allows me to more easily get the send_delayed for wid ... - sending now to fire, but I am not seeing any problems.

How can I reproduce this bug?

Wed, 03 Dec 2014 21:13:46 GMT - Doug Doole:

I don't do anything special - a couple konsole windows open, one is frequently doing big compiles (so lots of scrolling text). Usually 2-5 gvim windows up, editing text, paging up and down, etc.

Thu, 04 Dec 2014 01:10:28 GMT - Antoine Martin:

r8188 adds the ability to turn off the video regions code, which may help:

XPRA_VIDEO_SUBREGION=0 xpra start

Thu, 04 Dec 2014 01:22:18 GMT - Antoine Martin:

Since this is refresh related, please post the refresh debug log:

xpra start -d refresh,regionrefresh

(the latter debug category is not needed with XPRA_VIDEO_SUBREGION=0)

@ddoole: are you connecting via ssh or tcp? Is there a firewall between you and the server, a forticlient perhaps? (ie: #742) What does the session info bandwidth graph show?

Thu, 04 Dec 2014 05:51:52 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to Doug Doole
status changed from assigned to new

@ddoole:

gvim is known to cause unnecessary repaints: #670
auto refresh issues: #694
btw: -d debug won't do anything, see xpra -d help for the list of debug categories - in this particular case, you may also want encoding or refresh on the server, and maybe paint on the client
you're seeing some extraordinary delays there - the kind that you would only see when connecting over a cellular network like 2G or 3G...
maybe have a look at the session info network traffic
post xpra info collected when things are going wrong

Thu, 04 Dec 2014 06:34:30 GMT - Antoine Martin: owner, status changed

owner changed from Doug Doole to Antoine Martin
status changed from new to assigned

I think I've just figured out what the problem is: your apps are repainting a lot (see #670 for gvim - #756 may help), and so the >500ms delay is needed. By cutting it short, we're making thing worse, eventually leading to the connection drop.

Thu, 04 Dec 2014 16:53:03 GMT - Doug Doole:

To answer Antoine's questions:

I am connecting using ssh, not tcp.
There is a firewall running on my laptop (iptables). Any rules I should be looking for or adding?

For totaam:

I'm seeing these delays when I'm on the local network at work. Gigabit copper from my laptop to the server. The machines are on separate subnets though.
I'm not at the office today, so it'll be a day or two before I can get information about the local network.

Thu, 04 Dec 2014 18:21:04 GMT - Doug Doole: attachment set

attachment set to ddoole.xpra.info.vdsl

xpra info output during hang (over VDSL)

Thu, 04 Dec 2014 18:21:38 GMT - Doug Doole: attachment set

attachment set to ddoole.xpra.png.vdsl

Performance graph during hang (over VDSL)

Thu, 04 Dec 2014 18:23:19 GMT - Doug Doole:

I see the same refresh problem working from home as I do in the office. This is over VDSL instead of local network though.

I attached the xpra info and the session info graphs for a hang.

Thu, 04 Dec 2014 18:47:26 GMT - Antoine Martin: attachment set

attachment set to ddoole.xpra.vdsl.png

renamed file for inlining

Thu, 04 Dec 2014 19:46:54 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to Doug Doole
status changed from assigned to new

@ddoole:

png encoding isn't the default, because it isn't the most efficient - could be making things worse here, though xpra info seems to disagree with me and indicates that you are not encoding bound
your latency isn't even that high either
don't worry about iptables, it cannot interfere out of the box (it would take some considerable effort) - only some broken layer-7 proprietary firewalls get this wrong
could be worth trying tcp connection just to rule out the transport layer
trying turning off more things: mdns maybe? clipboard?
I have made some minor improvements in r8193 and r8195 (the last one may be worth backporting)
try network layer debugging with -d net on the server and client
try --enable-pings and see how that influences how quickly the spinners come up

But the more I look at this, the more I think that you may just be having connectivity issues somehow. At first I thought that there was something wrong in the code I just reviewed and modified, but the fact that you are getting the spinners probably invalidates this hypothesis: pings just aren't coming back, or the network / main loops are getting stuck on it.

Not much more I can do on this without more information, preferably in the form of steps so that I can reproduce the problem..

Thu, 04 Dec 2014 19:59:51 GMT - Doug Doole:

I know png isn't the most efficient, but the artifacting I get around text when I use H.264 or VP8 drives me nuts. I have tried other encodings though, and it did not make any observable difference.
I tried disabling the clipboard at the client, and that hasn't helped. I'll try disabling at the server and see if it makes a difference.
I'll try the other things you've suggested as well.

Thu, 04 Dec 2014 20:58:05 GMT - Doug Doole:

I did a quick experiment with attaching with -d net --enable-pings. I didn't get any additional diagnostics at the client. I saw the spinners just as frequently, but they seemed to go away quicker.

See comment3 for my repro instructions. Running the ClearCase? xcleardiff tool (graphical file compare) is another good way to get the spinner coming up lots.

Fri, 05 Dec 2014 02:56:40 GMT - alas:

Well, I've been working on reproducing this, but the send_delayed ... messages don't seem to be correlating to my lossless element refresh issues.

Also filed #757, about an encoding pipeline BUG... which might be more relevant.

In any case, looks like my supposition should be deemed invalid, hopefully there will be a solution for ddoole's issues though.

Fri, 05 Dec 2014 14:38:23 GMT - Doug Doole:

Something new today. I'm in the office and was about to restart my server to add in some of the new options I've been asked to try. I figured I'd reproduce the spinners first before restarting everything.

I'm still getting the spinners reliably, but this time I'm not getting the send_delayed messages in the server log. (My server is still running with no -d option.)

Fri, 05 Dec 2014 15:38:38 GMT - Doug Doole:

I've changed my server to start with "--no-clipboard --no-mdns -d net" and the problem seems to have gone away. Things feel as responsive as they used to.

In the client logs, I do see occasional "server not responding" messages and the server log has very few "send_delayed" messages. (No additional information in the server log with "-d net".)

I'm guessing these messages are because of the known problems with gvim. Despite what the client log says, I haven't actually seen a spinner.

I'm going to continue experimenting and hopefully pin this down to one of mdns or clipboard. (I'm really hoping it's mdns - I can live without it. I need my clipboard though.)

Fri, 05 Dec 2014 17:33:04 GMT - Antoine Martin:

mdns: could be causing problems
clipboard: much more likely, covered in the FAQ and bug reporting guidelines - check that nothing else is trying to sync the clipboard at the same time as xpra

You may want to add -d clipboard to the client and server command lines.

Fri, 05 Dec 2014 18:15:36 GMT - Doug Doole: attachment set

attachment set to clipboard.log

Server log with -d clipboard

Fri, 05 Dec 2014 18:19:48 GMT - Doug Doole:

From my quick tests, it looks like --no-mdns does the trick for me.

To the best of my knowledge, I am not running any clipboard managers. (I go out of my way to kill klipper on any new system I set up.)

I did try running with "--no-mdns -d clipboard" in case anything interesting popped up. No new messages at the client. Nothing on the server said "here's a problem" to me, but I have attached the log if you'd like to take a look.

I'm going to run for a day or two with --no-mdns just to make sure it's not one of those quirks where the problem teases you and disappears for a while. Assuming --no-mdns is the key, anything else I can gather to help you pursue this further?

Fri, 05 Dec 2014 18:40:47 GMT - Antoine Martin:

Thanks! I'm not sure there's much more we can do to debug things, it looks like the python avahi code is swamping the program's main loop, probably waiting for IO somewhere it should not.

I have disabled mdns by default for trunk in r8211 and for v0.14.x in r8212, it will be in the next stable update (probably today).

Feel free to close this ticket if / when that works for you.

Fri, 05 Dec 2014 18:48:56 GMT - Doug Doole:

I'm still seeing some delays, even with --no-mdns.

Looking at the server log, there are a number of new do_owner_changed() calls logged. Is this significant?

Fri, 05 Dec 2014 18:50:23 GMT - Antoine Martin:

do_owner_changed()

means that something is changing the clipboard contents or ownership.

Fri, 05 Dec 2014 19:21:24 GMT - Doug Doole:

OK. That's fine then - I was putting stuff I the clipboard.

I'm starting to believe that I was too quick saying --no-mdns was enough. I've started getting lots of spinners again :-(

So I'm back to --no-ndns --no-clipboad and I'll see what happens as I use it more.

Fri, 05 Dec 2014 21:12:19 GMT - Doug Doole:

It looks like my initial success with --no-mdns --no-clipboard was just fate playing silly buggers with me. I'm back to having lots of spinners even though both were specified on my xpra start.

I'll start playing with the other options that were suggested and see if anything else makes a difference.

Mon, 08 Dec 2014 18:31:03 GMT - Doug Doole:

Tried connecting with tcp instead of ssh and that hasn't made a difference.

Tue, 09 Dec 2014 17:56:23 GMT - Doug Doole:

So, at this point I've tried:

different encodings
ssh and tcp communication
turning off every option that has a --no-... switch
built in xvfb as well as xdummy (I normally run with xdummy)

And I am still seeing the delays. Any suggestions on other things to try or diagnostics to capture? (I really like xpra and hate the thought of having to go back to something like vnc.)

Tue, 09 Dec 2014 19:03:52 GMT - Antoine Martin:

The xpra info you posted shows that most window updates are sent out of the network layer very very fast: always under 50ms (look for damage.out_latency). You must have some very good server hardware or just very very small screen updates. Which only leaves the network layer (switches, firewalls, OS) and the client.

The only new testing suggestions that I can make at this point are:

trying different OS / versions
trying older versions of xpra to see when the regression occurred (assuming it is an xpra regression) - client side first (since the server "looks" ok)
try from the same subnet with the same OS

etc..

Tue, 09 Dec 2014 20:24:44 GMT - Doug Doole: status changed; resolution set

status changed from new to closed
resolution set to invalid

I'm marking this as "invalid" (based on afarr also saying this is invalid in comment #13).

I just reproduced non-responsive windows using ssh -X and konsole with no xpra involved. So that definitely gets xpra off the hook. (I'd tried this before and hadn't seen any problems. A few minutes ago, I realized that my original non-xpra test wasn't 100% comparable to my xpra test. When I tested properly, I hit the hangs.)

So now I need to go find where to open a konsole bug... (In the mean time, it looks like xrvt will be my new friend.)

Thanks so much for your help in trying to narrow this down. Sorry for wasting your time.

Tue, 09 Dec 2014 23:24:10 GMT - Antoine Martin:

I've reverted the mdns changes in r8224.

Sat, 23 Jan 2021 05:04:52 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/750