Xpra: Ticket #2090: XPRA does not recover from network congestion (even when network does)

I have been running performance tests on XPRA (between work and home) and noticed that if my internet connection gets congested with H264 content from my OpenGL window, my xterm windows become very slow and *never* recover unless I restart the client.

I have now tested things on my gigabit work LAN, using a piece of software that rate-limits the MS Windows XPRA client. I found that if I impose a 5kb/s upload limit things get very slow, and that when I turn off this limit the recovery is poor. I believe, therefore, that I have found a way to reliably simulate the problem that I was seeing from home. When temporary congestion makes things slow, XPRA does not seem to recover. I'm baffled.

Here is my procedure:

rate limit
rotate 3D model until the network congestion spinner appears
turn off rate limit

What happens:

The 3D window remains slow even after several seconds of idle time with the rate limit turned off. This window does recover, but seemingly only after a long period of *continuous* use without congestion. Desired behavior: the window should recover right away.

Even though the action was in the OpenGL window the xterm window *never* recovers until I restart the client. Specifically, the time between keypress and echo becomes very long and stays that way. Desired behavior: the xterm should recover instantly.

What I'm using:

head revision server.
Bandwidth detection settings seem to make no difference.
I've tried with several clients.
I've so far only used paramiko ssh (one hop) on the controlled lan tests.
From home I've observed the problem is worse with the 2hop than with ssh -L from the shell and then a one-hop (whether the one-hop is tcp:// or ssh:// seems to make little to no difference). I think, therefore, I may need to multithread the 2hop ssh. I'll send patches soon (I plan on starting with paramiko). Can I put them on this ticket?

Mon, 24 Dec 2018 22:05:26 GMT - Nathan Hallquist:

XPRA_FORCE_BATCH=1 seems to solve the problem on the LAN (later I will look at what happens over broadband). I think that somehow what is going on in my OpenGL window is driving the batch delay up. (When I opened the ticket I didn't understand what batch delay did).

If my guess is right, I think that the batch delay for "text" windows should plummet rapidly after a spike, because sluggish shells are hard to use (compared to GUIs).

Tue, 25 Dec 2018 19:17:30 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to Nathan Hallquist

Sounds similar to #1911.

Forcing XPRA_FORCE_BATCH=1 will ensure that we let regions accumulate, giving more opportunity for rectangles to get merged and for packet aggregation (#619) - which means better use of the more limited bandwidth. Maybe we should always batch by default, or at least always batch when we see any congestion.
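To illustrate why letting regions accumulate helps on a constrained link, here is a minimal sketch of batching screen updates. It is not Xpra's code: the bounding-box merge, the `batch` helper and the event format are all assumptions made for illustration, and Xpra's real region merging is more selective than a single bounding box.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height


def merge_rects(rects: List[Rect]) -> Rect:
    """Return the bounding box covering all the damage rectangles."""
    x1 = min(x for x, y, w, h in rects)
    y1 = min(y for x, y, w, h in rects)
    x2 = max(x + w for x, y, w, h in rects)
    y2 = max(y + h for x, y, w, h in rects)
    return (x1, y1, x2 - x1, y2 - y1)


def batch(events: List[Tuple[int, Rect]], delay_ms: int) -> List[Rect]:
    """Group (timestamp_ms, rect) events into batches.

    A batch opens with the first event and collects every event that
    arrives less than delay_ms later; each batch becomes one merged
    rectangle (hypothetical policy, for illustration only).
    """
    merged: List[Rect] = []
    current: List[Rect] = []
    start = 0
    for ts, rect in events:
        if current and ts - start >= delay_ms:
            merged.append(merge_rects(current))
            current = []
        if not current:
            start = ts
        current.append(rect)
    if current:
        merged.append(merge_rects(current))
    return merged


events = [
    (0, (0, 0, 10, 10)),
    (2, (10, 0, 10, 10)),
    (4, (0, 10, 10, 10)),
    (30, (5, 5, 2, 2)),
]
print(len(batch(events, 0)), "updates without batching")
print(len(batch(events, 5)), "updates with a 5ms batch delay")
```

With no delay every damage event becomes its own update; with a short delay the three updates that arrive close together are sent as one.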

How can I reproduce this easily on my laptop?

Wed, 26 Dec 2018 23:03:03 GMT - Nathan Hallquist:

My procedure is

Using VirtualGL run lsprepost with a nontrivial 3D model. I can provide you with lsprepost and a nontrivial data set. I'm guessing that anything that really congests the network will do the trick.

I've only tested on Windows, where I used a program called "NetLimiter 4". I set it to limit download to 50KB/s and upload to 5KB/s. Then things get very slow (as expected). Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

I have found that with:

Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=50

I get pretty good results over broadband and LAN. I'm *not* saying that is the right thing to do, frankly, I have very little clue about this. I'm going to test performance on an LTE hotspot shortly...

With these settings + dbus video-box hinting XPRA is working better than anything else I've ever tried. I'll soon be deploying it to some more of my users.

Thu, 27 Dec 2018 19:55:53 GMT - Antoine Martin: owner, status changed

owner changed from Nathan Hallquist to Antoine Martin
status changed from new to assigned

Using VirtualGL run lsprepost with a nontrivial 3D model. I can provide you with lsprepost and a nontrivial data set. I'm guessing that anything that really congests the network will do the trick.

Can you reproduce with something more widely available? glxgears or glxspheres perhaps?

I set it to limit download to 50KB/s and upload to 5KB/s. Then things get very slow (as expected).

It's a miracle it works at all. 5KB/s is really much lower than anything it was ever designed for!

Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

Ah.

Environment=XPRA_FORCE_BATCH=1

It should be fine to change this default. In almost all cases, batching is the right thing to do. The minimum, which is 5ms, is imperceptible anyway. VNC servers have it always enabled.

Environment=XPRA_BATCH_MAX_DELAY=50

I am less sure about this one. With the mostly dynamic batch delay code, the base value is almost meaningless, except when it is high and causes problems... The only issue with capping the batch delay is that we have other heuristics that use the batch delay as input. I'll see what I can do.

Fri, 28 Dec 2018 23:03:29 GMT - Nathan Hallquist:

I have not disappeared.

The performance of VirtualGL on LTE hotspot was really great. This is a testament to XPRA, not my hotspot, which isn't that great.

I will get back with you on this once I have finished hooking dbus to ls-prepost. I have gotten our developers to give me a callback, and I have figured out how to extract relative coordinates from the glCanvas. Now, I have to write the dbus code...

Wed, 02 Jan 2019 09:10:34 GMT - Antoine Martin:

r21267 enables batching by default. As for the max delay change, my current ideas are:

use the existing "soft-expired" mechanism: we could let the batch delay increase, but speculatively allow packets to go out when there is no bottleneck. (the batch delay would end up increasing a lot less since the "actual batch delay" will remain lower)
if we have already waited longer than the current batch delay value (ie: not many screen updates in an xterm), then we can lower the batch delay for the next screen update that comes - (ideally ignoring small screen updates...)

Wed, 02 Jan 2019 20:47:07 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to Nathan Hallquist
status changed from assigned to new

More improvements in:

r21273: batching is enabled by default, move some checks to save some cpu cycles
r21275: skip waiting unnecessarily when the window was idle for longer than the batch delay + move more code out of hot path

This won't fix your problems, but it will help. Can you please capture the -d stats debug output of when the batch delay stays high when it shouldn't?
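The idle check described for r21275 above can be sketched roughly as follows. This is only an illustration of the idea, not the code from that revision; the function name and its parameters are hypothetical.

```python
def remaining_wait(now_ms: float, last_sent_ms: float, batch_delay_ms: float) -> float:
    """How much longer to hold a new screen update before sending it.

    If the window has already been idle for at least the batch delay,
    there is nothing to accumulate, so the update can go out at once.
    """
    idle_ms = now_ms - last_sent_ms
    if idle_ms >= batch_delay_ms:
        return 0.0
    return batch_delay_ms - idle_ms


# an xterm that has been quiet for a full second: no extra wait
print(remaining_wait(now_ms=1000, last_sent_ms=0, batch_delay_ms=100))
# a busy window that sent 50ms ago: wait out the rest of the delay
print(remaining_wait(now_ms=1000, last_sent_ms=950, batch_delay_ms=100))
```

In this model a mostly idle window such as an xterm is not penalised by a high batch delay, because each keypress echo arrives long after the previous update.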

Fri, 04 Jan 2019 19:11:59 GMT - Nathan Hallquist:

I've attached log1.txt.

At about 9:50:50 I do a rotation and it's good.
At about 9:51:00 I choke the bandwidth.
At about 9:51:22 I restore the bandwidth. The 3D window remains slow, which is not good. Oddly, my xterm comes right back, which is good.
At about 9:52:30 the lag seems to be mostly reduced.

Fri, 04 Jan 2019 19:12:31 GMT - Nathan Hallquist: attachment set

attachment set to log1.txt

Fri, 04 Jan 2019 19:30:58 GMT - Antoine Martin: owner, priority, status changed

owner changed from Nathan Hallquist to Antoine Martin
priority changed from major to critical
status changed from new to assigned

Thanks, I can reproduce the problem locally using tc. This is caused by the latency heuristics: an increase in latency causes the batch delay to go up quickly, but a decrease in latency does not bring it back down quickly enough.
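The asymmetry described above can be shown with a toy model. This is not Xpra's heuristic: the update rule, the `rise` and `fall` rates, the latency figures and the 50ms threshold are all assumptions picked to make the effect visible; only the 5ms minimum comes from this ticket.

```python
def update_delay(delay_ms, latency_ms, baseline_ms, rise, fall, min_delay_ms=5.0):
    """Move the batch delay towards a latency-derived target.

    `rise` and `fall` are the fractions of the gap closed per tick when
    the target is above or below the current delay, respectively.
    """
    target = min_delay_ms + max(0.0, latency_ms - baseline_ms)
    rate = rise if target > delay_ms else fall
    return delay_ms + rate * (target - delay_ms)


def ticks_to_recover(rise, fall, threshold_ms=50.0):
    """Congest for 10 ticks, then count the ticks needed to drop below threshold."""
    baseline_ms = 20.0
    delay = 5.0
    for _ in range(10):            # congested: latency jumps to 500ms
        delay = update_delay(delay, 500.0, baseline_ms, rise, fall)
    ticks = 0
    while delay >= threshold_ms:   # congestion gone: latency back to baseline
        delay = update_delay(delay, baseline_ms, baseline_ms, rise, fall)
        ticks += 1
    return ticks


slow = ticks_to_recover(rise=0.5, fall=0.05)  # asymmetric: quick up, slow down
fast = ticks_to_recover(rise=0.5, fall=0.5)   # symmetric: falls as fast as it rose
print("ticks to recover:", slow, "vs", fast)
```

In this model both runs reach the same peak delay, but the one with the slow decay rate stays sluggish long after the link has recovered, which matches the symptom reported here.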

Fri, 11 Jan 2019 13:19:18 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to Nathan Hallquist
status changed from assigned to new

More fixes:

r21290 + r21294 cosmetic
r21299: expire regions much more quickly (but don't send until backlog is cleared)

@nathan_lstc: is this now usable? (ignoring the pycuda issue from ticket:2022#comment:72 for now)

Sun, 13 Jan 2019 13:55:20 GMT - Nathan Hallquist:

I have 4 computers running XPRA servers right now. Only one of them is running vanilla yum-repo. I have just upgraded that one to r21314. Out of the box it seems to work well. Here is my systemd file:

ExecStart=/bin/sh -c "cd ~;PATH=/opt/xpra/bin:$PATH xpra --no-daemon start --no-printing  --start-via-proxy=no --systemd-run=no --start=\"xrdb -merge $HOME/.Xresources\" --start-child=xterm --exit-with-children --mdns=no --xsettings=no :`id -u`"
Environment=PYTHONPATH=/opt/xpra/lib64/python2.7/site-packages
Environment=LD_LIBRARY_PATH=/opt/libjpeg-turbo/lib64/:/usr/local/cuda/lib64:/usr/lib64/xpra
Environment=CUDA_VISIBLE_DEVICES=0

Everything is working right for me (I'll check with the guy having the pycuda issue on Monday). Looking through the patches, I can see they do what I was setting through env variables. Thanks!

One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?

client.batch.delay.50p=112
client.batch.delay.80p=932
client.batch.delay.90p=936
client.batch.delay.avg=412
client.batch.delay.cur=935
client.batch.delay.max=941
client.batch.delay.min=60
client.batch.locked=False
client.batch.max-delay=500
client.batch.min-delay=16
client.batch.timeout-delay=15000

client.window.1.batch.actual_delays.90p=63
client.window.1.batch.actual_delays.avg=57
client.window.1.batch.actual_delays.cur=57
client.window.1.batch.actual_delays.max=106
client.window.1.batch.actual_delays.min=46

Is window.1.batch.actual_delays.cur the batch delay that is currently happening? If so, what about "client.batch.delay.cur"? Right now, one number looks okay the other not so good:

[nathan@bobross lsprepost4.7_centos7]$ xpra info :250 |grep -i delay |grep cur
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
[nathan@bobross lsprepost4.7_centos7]$
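For comparing these values across windows, a small helper can do the same job as the grep pipeline. This is a sketch that only assumes the `key=value` line format shown above; the function names are hypothetical and the sample text is copied from the output in this comment.

```python
import re

SAMPLE = """\
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
"""

# matches the per-window "cur" entries, capturing window id and metric name
WINDOW_KEY = re.compile(r"^client\.window\.(\d+)\.batch\.(delay|actual_delays)\.cur$")


def parse_info(text):
    """Turn 'key=value' lines into a dict, skipping anything else."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def window_delays(info):
    """Map window id -> {'delay': ms, 'actual_delays': ms}."""
    result = {}
    for key, value in info.items():
        match = WINDOW_KEY.match(key)
        if match:
            wid = int(match.group(1))
            result.setdefault(wid, {})[match.group(2)] = int(value)
    return result


info = parse_info(SAMPLE)
for wid, delays in sorted(window_delays(info).items()):
    print("window", wid, delays)
```

To run it against a live session, the text could come from something like `subprocess.check_output(["xpra", "info", ":250"], text=True)` in place of `SAMPLE`.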

Sun, 13 Jan 2019 14:14:11 GMT - Antoine Martin:

One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?

That's coming from get_weighted_list_stats in browser/xpra/trunk/src/xpra/simple_stats.py.

"min", "max" and "avg" should be self explanatory
"cur" is current
"90p" is the 90th percentile (likewise "50p" and "80p")
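As a rough guide to how such a summary can be derived from a list of samples, here is an unweighted sketch. It is not the code from simple_stats.py: the real function is named get_weighted_list_stats and so presumably weights its samples, which this version does not, and the percentile indexing used here is an assumption.

```python
def list_stats(values):
    """Summarise samples (oldest first) the way the "delay" keys are named."""
    ordered = sorted(values)
    count = len(ordered)

    def percentile(p):
        # nearest-rank style lookup, clamped to the last element
        index = min(count - 1, count * p // 100)
        return ordered[index]

    return {
        "cur": values[-1],          # most recent sample
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(values) / count,
        "50p": percentile(50),
        "80p": percentile(80),
        "90p": percentile(90),
    }


samples = [3, 1, 10, 2, 9, 4, 8, 5, 7, 6]
for name, value in list_stats(samples).items():
    print(name, value)
```

Read this way, a "90p" close to "max" with a much lower "50p", as in the client.batch.delay figures quoted earlier, means a large share of the samples sat near the top of the range.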

Is window.1.batch.actual_delays.cur the batch delay that is currently happening?

For this window, yes.

If so, what about "client.batch.delay.cur"?

That's the global value, which is only used when creating new windows.

Right now, one number looks okay the other not so good: (..) client.window.8.batch.actual_delays.cur=230

Yes, that's too high. Can you get the -d stats log?

Thu, 07 Feb 2019 15:14:38 GMT - Antoine Martin:

Bump.

Mon, 11 Feb 2019 03:20:14 GMT - Antoine Martin:

Fixes to the batch delay changes in r21621. (found thanks to ticket:2140#comment:1) This may explain the high batch delay values. The new dynamic delay code was hiding that somewhat - doing its job, but masking unreasonable values.

Wed, 13 Feb 2019 21:59:28 GMT - Nathan Hallquist:

I am far from certain, but I just tried a combination of operations over a 1gb LAN that felt slow in the previous revisions, but in this revision it seemed absolutely smooth. I'm going to make a much more extensive test and get back with you.

[nathan@curry lsprepost4.7_centos7]$ xpra info :250 | grep delay | grep cur
client.batch.delay.cur=8
client.window.1.batch.actual_delays.cur=60
client.window.1.batch.delay.cur=3
client.window.15.batch.actual_delays.cur=150
client.window.15.batch.delay.cur=11
client.window.16.batch.actual_delays.cur=8
client.window.16.batch.delay.cur=8
[nathan@curry lsprepost4.7_centos7]$

Thu, 07 Mar 2019 12:38:00 GMT - Antoine Martin:

Bump.

Fri, 15 Mar 2019 10:14:19 GMT - tc424: cc set

cc steved424@… added

Mon, 18 Mar 2019 02:46:33 GMT - Antoine Martin: status changed; resolution set

status changed from new to closed
resolution set to needinfo

Mon, 10 Feb 2020 08:56:38 GMT - Antoine Martin:

Sat, 23 Jan 2021 05:41:57 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/2090