#750 closed defect (invalid)
lossless element re-painting seems to be overlooked when latency gets high
Reported by: | alas | Owned by: | Doug Doole |
---|---|---|---|
Priority: | major | Milestone: | 0.15 |
Component: | server | Version: | 0.14.x |
Keywords: | Cc: | dougdoole@… |
Description
Running a (slightly customized) 0.14.12 windows client against a (also slightly customized) 0.14.12 fedora 20 client the rendering of the video elements and the lossless elements (text) seems very good until the server starts giving latency/delay warnings.
The video elements continue to display well as these warnings start outputting, but the lossless (text) elements begin to be blurry and increasingly fail to render when scrolling, especially, ceases.
I'm not certain whether this is a cause or maybe just a contributing factor, but here's a bit of server-side output:
Dec 02 10:43:55 node-1000 systemd[1]: Starting Cleanup of Temporary Directories... Dec 02 10:43:55 node-1000 systemd[1]: Started Cleanup of Temporary Directories. ... Dec 02 10:47:30 node-1000 dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' Dec 02 10:47:30 node-1000 NetworkManager[408]: bound to 10.171.171.100 -- renewal in 979 seconds. Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' Dec 02 10:47:30 node-1000 systemd[1]: Starting Network Manager Script Dispatcher Service... Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher' Dec 02 10:47:30 node-1000 dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher' Dec 02 10:47:30 node-1000 systemd[1]: Started Network Manager Script Dispatcher Service. Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,034 starting sound capture using pulseaudio device: Monitor of Null Output Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,315 codec: MPEG-1 Layer 3 (MP3) Dec 02 10:50:18 node-1000 server[653]: 2014-12-02 10:50:18,856 starting sound capture using pulseaudio device: Monitor of Null Output Dec 02 10:50:19 node-1000 server[653]: 2014-12-02 10:50:19,122 codec: MPEG-1 Layer 3 (MP3) Dec 02 10:50:53 node-1000 server[653]: 2014-12-02 10:50:53,757 send_delayed for wid 2, elapsed time 640.0 is above limit of 500.0 - sending now Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,302 send_delayed for wid 2, elapsed time 516.0 is above limit of 500.0 - sending now Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,914 send_delayed for wid 2, elapsed time 593.0 is above limit of 500.0 - sending now Dec 02 10:50:55 node-1000 server[653]: 2014-12-02 10:50:55,673 send_delayed for wid 2, elapsed time 744.0 is above limit of 500.0 - sending now Dec 02 10:51:02 node-1000 server[653]: 2014-12-02 10:51:02,856 send_delayed for wid 2, elapsed time 753.0 is above limit of 500.0 - sending now Dec 02 10:51:03 node-1000 server[653]: 2014-12-02 10:51:03,488 send_delayed for wid 2, elapsed time 617.0 is above limit of 500.0 - sending now Dec 02 10:51:04 node-1000 server[653]: 2014-12-02 10:51:04,082 send_delayed for wid 2, elapsed time 576.0 is above limit of 500.0 - sending now Dec 02 10:51:09 node-1000 server[653]: 2014-12-02 10:51:09,913 send_delayed for wid 2, elapsed time 679.0 is above limit of 500.0 - sending now Dec 02 10:51:10 node-1000 server[653]: 2014-12-02 10:51:10,624 send_delayed for wid 2, elapsed time 696.0 is above limit of 500.0 - sending now Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,268 send_delayed for wid 2, elapsed time 626.0 is above limit of 500.0 - sending now Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,966 send_delayed for wid 2, elapsed time 688.0 is above limit of 500.0 - sending now Dec 02 10:51:17 node-1000 server[653]: 2014-12-02 10:51:17,636 send_delayed for wid 2, elapsed time 502.0 is above limit of 500.0 - sending now Dec 02 10:51:21 node-1000 server[653]: 2014-12-02 10:51:21,819 send_delayed for wid 2, elapsed time 538.0 is above limit of 500.0 - sending now Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,341 send_delayed for wid 2, elapsed time 509.0 is above limit of 500.0 - sending now Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,948 send_delayed for wid 2, elapsed time 597.0 is above limit of 500.0 - sending now Dec 02 10:51:23 node-1000 server[653]: 2014-12-02 10:51:23,514 send_delayed for wid 2, elapsed time 553.0 is above limit of 500.0 - sending now Dec 02 10:51:24 node-1000 server[653]: 2014-12-02 10:51:24,231 send_delayed for wid 2, elapsed time 697.0 is above limit of 500.0 - sending now Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,001 send_delayed for wid 2, elapsed time 757.0 is above limit of 500.0 - sending now Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,533 send_delayed for wid 2, elapsed time 524.0 is above limit of 500.0 - sending now
I was running the client with -d paint
but saw no output except the usual about the MP3 speaker re-starts.
Are there any other, better targeted, debugging flags that might be of more use (client or server side)?
Attachments (4)
Change History (32)
comment:2 Changed 7 years ago by
Status: | new → assigned |
---|
r8177 allows us to easily tune the batch settings server side (the same env options could already be used client side):
XPRA_BATCH_MAX_DELAY=120 xpra start
This allows me to more easily get the send_delayed for wid ... - sending now
to fire, but I am not seeing any problems.
How can I reproduce this bug?
comment:3 Changed 7 years ago by
I don't do anything special - a couple konsole windows open, one is frequently doing big compiles (so lots of scrolling text). Usually 2-5 gvim windows up, editing text, paging up and down, etc.
comment:4 Changed 7 years ago by
r8188 adds the ability to turn off the video regions code, which may help:
XPRA_VIDEO_SUBREGION=0 xpra start
comment:5 Changed 7 years ago by
Since this is refresh related, please post the refresh debug log:
xpra start -d refresh,regionrefresh
(the latter debug category is not needed with XPRA_VIDEO_SUBREGION=0
)
@ddoole: are you connecting via ssh or tcp? Is there a firewall between you and the server, a forticlient perhaps? (ie: #742)
What does the session info bandwidth graph show?
comment:6 Changed 7 years ago by
Owner: | changed from Antoine Martin to Doug Doole |
---|---|
Status: | assigned → new |
@ddoole:
- gvim is known to cause unnecessary repaints: #670
- auto refresh issues: #694
- btw:
-d debug
won't do anything, seexpra -d help
for the list of debug categories - in this particular case, you may also wantencoding
orrefresh
on the server, and maybepaint
on the client - you're seeing some extraordinary delays there - the kind that you would only see when connecting over a cellular network like 2G or 3G...
- maybe have a look at the session info network traffic
- post
xpra info
collected when things are going wrong
comment:7 Changed 7 years ago by
Owner: | changed from Doug Doole to Antoine Martin |
---|---|
Status: | new → assigned |
comment:8 Changed 7 years ago by
To answer Antoine's questions:
- I am connecting using ssh, not tcp.
- There is a firewall running on my laptop (iptables). Any rules I should be looking for or adding?
For totaam:
- I'm seeing these delays when I'm on the local network at work. Gigabit copper from my laptop to the server. The machines are on separate subnets though.
- I'm not at the office today, so it'll be a day or two before I can get information about the local network.
Changed 7 years ago by
Attachment: | ddoole.xpra.info.vdsl added |
---|
xpra info output during hang (over VDSL)
Changed 7 years ago by
Attachment: | ddoole.xpra.png.vdsl added |
---|
Performance graph during hang (over VDSL)
comment:9 Changed 7 years ago by
I see the same refresh problem working from home as I do in the office. This is over VDSL instead of local network though.
I attached the xpra info and the session info graphs for a hang.
comment:10 Changed 7 years ago by
Owner: | changed from Antoine Martin to Doug Doole |
---|---|
Status: | assigned → new |
@ddoole:
- png encoding isn't the default, because it isn't the most efficient - could be making things worse here, though xpra info seems to disagree with me and indicates that you are not encoding bound
- your latency isn't even that high either
- don't worry about iptables, it cannot interfere out of the box (it would take some considerable effort) - only some broken layer-7 proprietary firewalls get this wrong
- could be worth trying tcp connection just to rule out the transport layer
- trying turning off more things: mdns maybe? clipboard?
- I have made some minor improvements in r8193 and r8195 (the last one may be worth backporting)
- try network layer debugging with
-d net
on the server and client - try
--enable-pings
and see how that influences how quickly the spinners come up
But the more I look at this, the more I think that you may just be having connectivity issues somehow.
At first I thought that there was something wrong in the code I just reviewed and modified, but the fact that you are getting the spinners probably invalidates this hypothesis: pings just aren't coming back, or the network / main loops are getting stuck on it.
Not much more I can do on this without more information, preferably in the form of steps so that I can reproduce the problem..
comment:11 Changed 7 years ago by
- I know png isn't the most efficient, but the artifacting I get around text when I use H.264 or VP8 drives me nuts. I have tried other encodings though, and it did not make any observable difference.
- I tried disabling the clipboard at the client, and that hasn't helped. I'll try disabling at the server and see if it makes a difference.
- I'll try the other things you've suggested as well.
comment:12 Changed 7 years ago by
I did a quick experiment with attaching with -d net --enable-pings. I didn't get any additional diagnostics at the client. I saw the spinners just as frequently, but they seemed to go away quicker.
See comment3 for my repro instructions. Running the ClearCase? xcleardiff tool (graphical file compare) is another good way to get the spinner coming up lots.
comment:13 Changed 7 years ago by
Well, I've been working on reproducing this, but the send_delayed ...
messages don't seem to be correlating to my lossless element refresh issues.
Also filed #757, about an encoding pipeline BUG... which might be more relevant.
In any case, looks like my supposition should be deemed invalid
, hopefully there will be a solution for ddoole's issues though.
comment:14 Changed 7 years ago by
Something new today. I'm in the office and was about to restart my server to add in some of the new options I've been asked to try. I figured I'd reproduce the spinners first before restarting everything.
I'm still getting the spinners reliably, but this time I'm not getting the send_delayed messages in the server log. (My server is still running with no -d option.)
comment:15 Changed 7 years ago by
I've changed my server to start with "--no-clipboard --no-mdns -d net" and the problem seems to have gone away. Things feel as responsive as they used to.
In the client logs, I do see occasional "server not responding" messages and the server log has very few "send_delayed" messages. (No additional information in the server log with "-d net".)
I'm guessing these messages are because of the known problems with gvim. Despite what the client log says, I haven't actually seen a spinner.
I'm going to continue experimenting and hopefully pin this down to one of mdns or clipboard. (I'm really hoping it's mdns - I can live without it. I need my clipboard though.)
comment:16 Changed 7 years ago by
- mdns: could be causing problems
- clipboard: much more likely, covered in the FAQ and bug reporting guidelines - check that nothing else is trying to sync the clipboard at the same time as xpra
You may want to add -d clipboard
to the client and server command lines.
comment:17 Changed 7 years ago by
From my quick tests, it looks like --no-mdns does the trick for me.
To the best of my knowledge, I am not running any clipboard managers. (I go out of my way to kill klipper on any new system I set up.)
I did try running with "--no-mdns -d clipboard" in case anything interesting popped up. No new messages at the client. Nothing on the server said "here's a problem" to me, but I have attached the log if you'd like to take a look.
I'm going to run for a day or two with --no-mdns just to make sure it's not one of those quirks where the problem teases you and disappears for a while. Assuming --no-mdns is the key, anything else I can gather to help you pursue this further?
comment:18 Changed 7 years ago by
Thanks!
I'm not sure there's much more we can do to debug things, it looks like the python avahi code is swamping the program's main loop, probably waiting for IO somewhere it should not.
I have disabled mdns by default for trunk in r8211 and for v0.14.x in r8212, it will be in the next stable update (probably today).
Feel free to close this ticket if / when that works for you.
comment:19 Changed 7 years ago by
I'm still seeing some delays, even with --no-mdns.
Looking at the server log, there are a number of new do_owner_changed() calls logged. Is this significant?
comment:20 Changed 7 years ago by
do_owner_changed()
means that something is changing the clipboard contents or ownership.
comment:21 Changed 7 years ago by
OK. That's fine then - I was putting stuff I the clipboard.
I'm starting to believe that I was too quick saying --no-mdns was enough. I've started getting lots of spinners again :-(
So I'm back to --no-ndns --no-clipboad and I'll see what happens as I use it more.
comment:22 Changed 7 years ago by
It looks like my initial success with --no-mdns --no-clipboard was just fate playing silly buggers with me. I'm back to having lots of spinners even though both were specified on my xpra start.
I'll start playing with the other options that were suggested and see if anything else makes a difference.
comment:23 Changed 7 years ago by
Tried connecting with tcp instead of ssh and that hasn't made a difference.
comment:24 Changed 7 years ago by
So, at this point I've tried:
- different encodings
- ssh and tcp communication
- turning off every option that has a --no-... switch
- built in xvfb as well as xdummy (I normally run with xdummy)
And I am still seeing the delays. Any suggestions on other things to try or diagnostics to capture? (I really like xpra and hate the thought of having to go back to something like vnc.)
comment:25 Changed 7 years ago by
The xpra info you posted shows that most window updates are sent out of the network layer very very fast: always under 50ms (look for damage.out_latency
).
You must have some very good server hardware or just very very small screen updates.
Which only leaves the network layer (switches, firewalls, OS) and the client.
The only new testing suggestions that I can make at this point are:
- trying different OS / versions
- trying older versions of xpra to see when the regression occurred (assuming it is an xpra regression) - client side first (since the server "looks" ok)
- try from the same subnet with the same OS
etc..
comment:26 Changed 7 years ago by
Resolution: | → invalid |
---|---|
Status: | new → closed |
I'm marking this as "invalid" (based on afarr also saying this is invalid in comment #13).
I just reproduced non-responsive windows using ssh -X and konsole with no xpra involved. So that definitely gets xpra off the hook. (I'd tried this before and hadn't seen any problems. A few minutes ago, I realized that my original non-xpra test wasn't 100% comparable to my xpra test. When I tested properly, I hit the hangs.)
So now I need to go find where to open a konsole bug... (In the mean time, it looks like xrvt will be my new friend.)
Thanks so much for your help in trying to narrow this down. Sorry for wasting your time.
comment:28 Changed 16 months ago by
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/750
I'm seeing something similar. Here's what I reported to the mailing list:
Lately when using xpra, I've been seeing lots of spinners over top of my windows. (And I mean "lots" in the sense that xpra is pretty much unusable at times.) In the past, xpra was working without these delays.
When this happens, I see messages like this in the client log:
In the server log, I see messages like this:
Occasionally it is so bad that xpra disconnects. At that point the server log will contain:
Both my xpra server and client are on the same network (but not the same subnet). If I use "ssh -X" directly to the box, I do not see any sort of delays.
My windows are mostly konsole and gvim.
I'd love to be able to say "this problem started with version X", but I haven't been able to pin it down. Unfortunately, my environment went though multiple changes at the same time (xpra 0.13.x to 0.14.x, Ubuntu 12.04 to 14.04), so I may not be able to get back to a good state for comparison purposes.
I am running without speaker/microphone forwarding but everything else enabled. I have tried various encoding options, but it doesn't seem to make a difference.
My client and server are both running Ubuntu 14.04. I'm currently running xpra 0.14.12 at both the client and server, but I've been seeing this on all the xpra 0.14.x builds that I've tried.
Any suggestions? Anything I can do to help narrow this down further?
I have since tried running with -debug on the client, but it added no new information. (I'm waiting to try -debug on the server until I can afford to recycle my windows.)