Opened 5 months ago

Closed 2 months ago

#2090 closed defect (needinfo)

XPRA does not recover from network congestion (even when network does)

Reported by: Nathan Hallquist
Owned by: Nathan Hallquist
Priority: critical
Milestone: 2.5
Component: network
Version: 2.4.x
Keywords:
Cc: steved424@…

Description

I have been running performance tests on XPRA (between work and home), and I noticed that when my internet connection gets congested with H264 content from my OpenGL window, my xterm windows become very slow and *never* recover unless I restart the client.

I have now tested things on my gigabit work LAN, using a piece of software that rate-limits the MS Windows XPRA client. I found that if I impose a 5kb/s upload limit, things get very slow, and that when I turn off this limit the recovery is poor. I believe, therefore, that I have found a way to reliably simulate the problem that I was seeing from home: when temporary congestion makes things slow, XPRA does not seem to recover. I'm baffled.

Here is my procedure:

  1. rate limit
  2. rotate 3D model until the network congestion spinner appears
  3. turn off rate limit

What happens:

  1. The 3D window remains slow even after several seconds of idle time with the rate limit turned off. This window does recover, but seemingly only after a long period of *continuous* use without congestion. Desired behavior: the window should recover right away.
  2. Even though the activity was in the OpenGL window, the xterm window *never* recovers until I restart the client. Specifically, the time between keypress and echo becomes very long and stays that way. Desired behavior: the xterm should recover instantly.

What I'm using:

  1. head revision server.
  2. Bandwidth detection settings seem to make no difference.
  3. I've tried with several clients.
  4. I've so far only used paramiko ssh (one hop) on the controlled lan tests.
  5. From home I've observed that the problem is worse with the 2hop than with ssh -L from the shell followed by a one-hop (whether the one-hop is tcp:// or ssh:// seems to make little to no difference). I think, therefore, that I may need to multithread the 2hop ssh. I'll send patches soon (I plan on starting with paramiko). Can I put them on this ticket?

Attachments (1)

log1.txt (351.9 KB) - added by Nathan Hallquist 5 months ago.

Change History (19)

comment:1 Changed 5 months ago by Nathan Hallquist

XPRA_FORCE_BATCH=1 seems to solve the problem on the LAN (later I will look at what happens over broadband). I think that somehow what is going on in my OpenGL window is driving the batch delay up. (When I opened the ticket I didn't understand what batch delay did).

If my guess is right, I think the batch delay for "text" windows should drop rapidly after a spike, because sluggish shells are hard to use (compared to GUIs).

comment:2 Changed 5 months ago by Antoine Martin

Owner: changed from Antoine Martin to Nathan Hallquist

Sounds similar to #1911.

Forcing XPRA_FORCE_BATCH=1 will ensure that we let regions accumulate, giving more opportunity for rectangles to get merged and for packet aggregation (#619) - which means better use of the more limited bandwidth.
Maybe we should always batch by default, or at least always batch when we see any congestion.
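
To illustrate why letting regions accumulate can help, here is a minimal sketch of merging the damage rectangles collected during one batch window into a single bounding box, so that one update goes out instead of several. This is not xpra's actual merging code, which is not shown in this ticket; `Rect` and `merge_batch` are hypothetical names invented for the example.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """A damaged screen region (hypothetical type for this sketch)."""
    x: int
    y: int
    width: int
    height: int


def merge_batch(rects: List[Rect]) -> Rect:
    """Return the bounding box of all rectangles damaged during one batch window."""
    if not rects:
        raise ValueError("no damage to merge")
    left = min(r.x for r in rects)
    top = min(r.y for r in rects)
    right = max(r.x + r.width for r in rects)
    bottom = max(r.y + r.height for r in rects)
    return Rect(left, top, right - left, bottom - top)


# Three small updates to adjacent terminal lines collapse into one region.
damage = [Rect(0, 0, 640, 16), Rect(0, 16, 640, 16), Rect(0, 32, 320, 16)]
merged = merge_batch(damage)
```

A bounding box is the crudest possible merge: it can cover pixels that never changed, which is why a real implementation would likely weigh the merged area against the sum of the individual areas before deciding.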

How can I reproduce this easily on my laptop?

comment:3 in reply to:  2 Changed 5 months ago by Nathan Hallquist

My procedure is

  1. Using VirtualGL, run lsprepost with a nontrivial 3D model. I can provide you with lsprepost and a nontrivial data set. I'm guessing that anything that really congests the network will do the trick.
  2. I've only tested on Windows, where I used a program called "NetLimiter 4". I set it to limit download to 50KB/s and upload to 5KB/s. Then things get very slow (as expected). Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

I have found that with:

Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=50

I get pretty good results over broadband and LAN. I'm *not* saying that this is the right thing to do; frankly, I have very little clue about this. I'm going to test performance on an LTE hotspot shortly...

With these settings + dbus video-box hinting XPRA is working better than anything else I've ever tried. I'll soon be deploying it to some more of my users.

Last edited 5 months ago by Antoine Martin (previous) (diff)

comment:4 Changed 5 months ago by Antoine Martin

Owner: changed from Nathan Hallquist to Antoine Martin
Status: changed from new to assigned

> Using VirtualGL run lsprepost with a nontrivial 3D model.
> I can provide you with lsprepost and a nontrivial data set.
> I'm guessing that anything that really congests the network will do the trick.

Can you reproduce with something more widely available? glxgears or glxspheres perhaps?

> I set it to limit download to 50KB/s and upload to 5KB/s.
> Then things get very slow (as expected).

It's a miracle it works at all. 5KB/s is really much lower than anything it was ever designed for!

> Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

Ah.

> Environment=XPRA_FORCE_BATCH=1

It should be fine to change this default.
In almost all cases, batching is the right thing to do. The minimum, which is 5ms, is imperceptible anyway.
VNC servers have it always enabled.

> Environment=XPRA_BATCH_MAX_DELAY=50

I am less sure about this one.
With the mostly dynamic batch delay code, the base value is almost meaningless, except when it is high and causes problems...
The only issue with capping the batch delay is that we have other heuristics that use the batch delay as input.
I'll see what I can do.

comment:5 Changed 5 months ago by Nathan Hallquist

I have not disappeared.

The performance of VirtualGL on LTE hotspot was really great. This is a testament to XPRA, not my hotspot, which isn't that great.

I will get back to you on this once I have finished hooking dbus to ls-prepost. I have gotten our developers to give me a callback, and I have figured out how to extract relative coordinates from the glCanvas. Now I have to write the dbus code...

comment:6 Changed 5 months ago by Antoine Martin

r21267 enables batching by default.
As for the max delay change, my current ideas are:

  • use the existing "soft-expired" mechanism: we could let the batch delay increase, but speculatively allow packets to go out when there is no bottleneck. (the batch delay would end up increasing a lot less since the "actual batch delay" will remain lower)
  • if we have already waited longer than the current batch delay value (ie: not many screen updates in an xterm), then we can lower the batch delay for the next screen update that comes - (ideally ignoring small screen updates...)
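
The second idea can be sketched as follows. This is only an illustration of the principle, under the assumption that each window tracks the time of its previous screen update; it is not the code that was committed, and `time_to_wait` and its parameters are hypothetical names.

```python
def time_to_wait(now_ms: float, last_update_ms: float, batch_delay_ms: float) -> float:
    """How much longer to hold a new screen update before sending it.

    If the window has already been idle for at least one full batch delay
    (typical for an xterm echoing a single keypress), the time spent idle
    counts as time already waited and the update can go out immediately.
    """
    idle_ms = now_ms - last_update_ms
    return max(0.0, batch_delay_ms - idle_ms)


# A busy window: the previous update was 50ms ago, so wait the remaining 50ms.
busy = time_to_wait(now_ms=1000.0, last_update_ms=950.0, batch_delay_ms=100.0)

# An idle window: the previous update was a full second ago, so send right away.
idle = time_to_wait(now_ms=1000.0, last_update_ms=0.0, batch_delay_ms=100.0)
```

The point of crediting idle time is that a mostly quiet window never pays the full batch delay, even while the delay value itself is still elevated from an earlier congestion spike.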
Last edited 5 months ago by Antoine Martin (previous) (diff)

comment:7 Changed 5 months ago by Antoine Martin

Owner: changed from Antoine Martin to Nathan Hallquist
Status: changed from assigned to new

More improvements in:

  • r21273: batching is enabled by default, move some checks to save some cpu cycles
  • r21275: skip waiting unnecessarily when the window was idle for longer than the batch delay + move more code out of hot path

This won't fix your problems, but it will help.
Can you please capture the -d stats debug output of when the batch delay stays high when it shouldn't?

Last edited 5 months ago by Antoine Martin (previous) (diff)

comment:8 Changed 5 months ago by Nathan Hallquist

I've attached log1.txt.

At about 9:50:50 I do a rotation and it's good.
At about 9:51:00 I choke the bandwidth.
At about 9:51:22 I restore the bandwidth. The 3D window remains slow, which is not good. Oddly, my xterm comes right back, which is good.
At about 9:52:30 the lag seems to be mostly gone.

Changed 5 months ago by Nathan Hallquist

Attachment: log1.txt added

comment:9 Changed 5 months ago by Antoine Martin

Owner: changed from Nathan Hallquist to Antoine Martin
Priority: changed from major to critical
Status: changed from new to assigned

Thanks, I can reproduce the problem locally using tc.
This is caused by the latency heuristics: an increase in latency causes the batch delay to go up quickly, but a decrease in latency does not bring it back down quickly enough.
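
A toy model makes the asymmetry concrete. The sketch below is not xpra's heuristic: the multiplicative factors and the 500ms cap are assumptions chosen purely for illustration (only the 5ms minimum comes from this ticket), and `adjust_delay` and `samples_to_recover` are hypothetical names.

```python
MIN_DELAY_MS = 5.0    # minimum batch delay mentioned earlier in this ticket
MAX_DELAY_MS = 500.0  # illustrative cap, assumed for this sketch


def adjust_delay(delay_ms, congested, up=2.0, down=0.95):
    """Multiply the delay by `up` when congested, by `down` otherwise, then clamp."""
    factor = up if congested else down
    return min(MAX_DELAY_MS, max(MIN_DELAY_MS, delay_ms * factor))


def samples_to_recover(congested_samples, up=2.0, down=0.95):
    """Count the uncongested samples needed to get back to the minimum delay."""
    delay = MIN_DELAY_MS
    for _ in range(congested_samples):
        delay = adjust_delay(delay, True, up, down)
    recovery = 0
    while delay > MIN_DELAY_MS:
        delay = adjust_delay(delay, False, up, down)
        recovery += 1
    return recovery


slow = samples_to_recover(5)            # fast ramp-up, gentle decay
fast = samples_to_recover(5, down=0.5)  # decay as aggressive as the ramp-up
```

With a gentle decay factor, a handful of congested samples takes many times more good samples to undo. A window that produces few updates, such as an xterm, generates those samples slowly, so it stays sluggish for a long time in wall-clock terms.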

comment:10 Changed 4 months ago by Antoine Martin

Owner: changed from Antoine Martin to Nathan Hallquist
Status: changed from assigned to new

More fixes:

  • r21290 + r21294 cosmetic
  • r21299: expire regions much more quickly (but don't send until backlog is cleared)

@nathan_lstc: is this now usable? (ignoring the pycuda issue from ticket:2022#comment:72 for now)

comment:11 Changed 4 months ago by Nathan Hallquist

I have 4 computers running XPRA servers right now. Only one of them is running the vanilla yum repo packages. I have just upgraded that one to r21314. Out of the box it seems to work well. Here is my systemd file:

ExecStart=/bin/sh -c "cd ~;PATH=/opt/xpra/bin:$PATH xpra --no-daemon start --no-printing  --start-via-proxy=no --systemd-run=no --start=\"xrdb -merge $HOME/.Xresources\" --start-child=xterm --exit-with-children --mdns=no --xsettings=no :`id -u`"
Environment=PYTHONPATH=/opt/xpra/lib64/python2.7/site-packages
Environment=LD_LIBRARY_PATH=/opt/libjpeg-turbo/lib64/:/usr/local/cuda/lib64:/usr/lib64/xpra
Environment=CUDA_VISIBLE_DEVICES=0

Everything is working right for me (I'll check with the guy having the pycuda issue on Monday). Looking through the patches, I can see they do what I was setting through env variables. Thanks!

One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?

client.batch.delay.50p=112
client.batch.delay.80p=932
client.batch.delay.90p=936
client.batch.delay.avg=412
client.batch.delay.cur=935
client.batch.delay.max=941
client.batch.delay.min=60
client.batch.locked=False
client.batch.max-delay=500
client.batch.min-delay=16
client.batch.timeout-delay=15000
client.window.1.batch.actual_delays.90p=63
client.window.1.batch.actual_delays.avg=57
client.window.1.batch.actual_delays.cur=57
client.window.1.batch.actual_delays.max=106
client.window.1.batch.actual_delays.min=46

Is window.1.batch.actual_delays.cur the batch delay that is currently happening? If so, what about "client.batch.delay.cur"? Right now, one number looks okay the other not so good:

[nathan@bobross lsprepost4.7_centos7]$ xpra info :250 |grep -i delay |grep cur
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
[nathan@bobross lsprepost4.7_centos7]$
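
The same comparison can be scripted rather than done with grep. The sketch below assumes only the `key=value` text format shown in the output above; `per_window_delays` is a hypothetical helper, not part of xpra.

```python
import re
from collections import defaultdict

# Matches lines such as: client.window.8.batch.actual_delays.cur=230
CUR_DELAY = re.compile(
    r"^client\.window\.(\d+)\.batch\.(delay|actual_delays)\.cur=(\d+)$"
)


def per_window_delays(info_text):
    """Map window id -> {"delay": value, "actual_delays": value} from `xpra info` text."""
    windows = defaultdict(dict)
    for line in info_text.splitlines():
        match = CUR_DELAY.match(line.strip())
        if match:
            wid, name, value = match.groups()
            windows[int(wid)][name] = int(value)
    return dict(windows)


# Sample input copied from the output above.
sample = """\
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
"""
delays = per_window_delays(sample)
```

In practice the text could come from something like `subprocess.run(["xpra", "info", ":250"], capture_output=True, text=True).stdout`, which mirrors the command used above.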


comment:12 Changed 4 months ago by Antoine Martin

> One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?

That's coming from get_weighted_list_stats in browser/xpra/trunk/src/xpra/simple_stats.py.

  • "min", "max" and "avg" should be self-explanatory
  • "cur" is the current value
  • "90p" is the 90th percentile (likewise "50p" and "80p")
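
As a rough illustration of what these keys mean, here is a simplified, unweighted sketch that produces the same set of names. It is not the `get_weighted_list_stats` implementation (which, as its name suggests, applies weights); `simple_list_stats` is a hypothetical function and the nearest-rank percentile method is an assumption.

```python
def simple_list_stats(values, percentiles=(50, 80, 90)):
    """Summarize recorded values using the key names seen in `xpra info`."""
    if not values:
        return {}
    ordered = sorted(values)
    count = len(ordered)
    stats = {
        "cur": values[-1],           # most recently recorded value
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(values) / count,
    }
    for p in percentiles:
        # Nearest-rank percentile: the smallest value such that at least
        # p percent of the samples are less than or equal to it.
        rank = (count * p + 99) // 100   # integer ceiling of count * p / 100
        stats[f"{p}p"] = ordered[max(0, rank - 1)]
    return stats


stats = simple_list_stats([60, 112, 932, 936, 941, 935, 100, 80, 90, 70])
```

Read this way, a low "50p" next to a high "90p" means that most samples were small and a minority were very large.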

> Is window.1.batch.actual_delays.cur the batch delay that is currently happening?

For this window, yes.

> If so, what about "client.batch.delay.cur"?

That's the global value, which is only used when creating new windows.

> Right now, one number looks okay the other not so good:
> (..)
> client.window.8.batch.actual_delays.cur=230

Yes, that's too high.
Can you get the -d stats log?

comment:13 Changed 3 months ago by Antoine Martin

Bump.

comment:14 Changed 3 months ago by Antoine Martin

Fixes to the batch delay changes in r21621. (found thanks to ticket:2140#comment:1)
This may explain the high batch delay values. The new dynamic delay code was hiding that somewhat - doing its job, but masking unreasonable values.

comment:15 Changed 3 months ago by Nathan Hallquist

I am far from certain, but I just tried a combination of operations over a 1Gb LAN that felt slow in previous revisions, and in this revision it seemed absolutely smooth. I'm going to do a much more extensive test and get back to you.

[nathan@curry lsprepost4.7_centos7]$ xpra info :250 | grep delay | grep cur
client.batch.delay.cur=8
client.window.1.batch.actual_delays.cur=60
client.window.1.batch.delay.cur=3
client.window.15.batch.actual_delays.cur=150
client.window.15.batch.delay.cur=11
client.window.16.batch.actual_delays.cur=8
client.window.16.batch.delay.cur=8
[nathan@curry lsprepost4.7_centos7]$

comment:16 Changed 2 months ago by Antoine Martin

Bump.

comment:17 Changed 2 months ago by tc424

Cc: steved424@… added

comment:18 Changed 2 months ago by Antoine Martin

Resolution: needinfo
Status: changed from new to closed