
Opened 2 years ago

Last modified 3 months ago

#999 assigned defect

catch limited bandwidth issues sooner

Reported by: Antoine Martin
Owned by: Antoine Martin
Priority: major
Milestone: 2.2
Component: encodings
Version: 0.14.x
Keywords:
Cc:

Description

Got some logs which show:

window[2].damage.out_latency.90p : 34
window[2].damage.out_latency.avg : 44
window[2].damage.out_latency.cur : 2
window[2].damage.out_latency.max : 1391
window[2].damage.out_latency.min : 0

This means it takes around 44ms on average to compress a screen update and hand the packet to the network layer, often less.
Except that in some cases it can take 1391ms!!

There is another one, which isn't quite as bad:

window[2].damage.out_latency.90p : 324
window[2].damage.out_latency.avg : 119
window[2].damage.out_latency.cur : 25
window[2].damage.out_latency.max : 408
window[2].damage.out_latency.min : 1

At that point the UI became sluggish, about 0.5s behind the actual actions.

Not entirely sure what we should be doing here: by the time the OS is pushing back to us, it is too late already and things will be slow because there isn't enough bandwidth to service us.

Maybe we can watch the "damage out latency" more carefully and immediately increase the batching delay to prevent further degradation?
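
A rough sketch of that idea (this is not xpra's actual scheduling code; the class name and thresholds are made up for illustration):

from collections import deque

class BatchDelayGovernor:
    """Bump the batch delay as soon as the damage out latency spikes,
    instead of waiting for the OS to push back on the socket."""

    def __init__(self, base_delay_ms=10, max_delay_ms=1000, history=20):
        self.base_delay_ms = base_delay_ms
        self.max_delay_ms = max_delay_ms
        self.delay_ms = base_delay_ms
        self.samples = deque(maxlen=history)    # recent out-latency samples, in ms

    def record_out_latency(self, latency_ms):
        self.samples.append(latency_ms)
        avg = sum(self.samples) / len(self.samples)
        if latency_ms > 4 * max(avg, 1):
            # sudden spike: back off immediately to prevent further degradation
            self.delay_ms = min(self.max_delay_ms, self.delay_ms * 2)
        elif latency_ms <= avg:
            # latency back to normal: decay slowly towards the base delay
            self.delay_ms = max(self.base_delay_ms, int(self.delay_ms * 0.9))
        return self.delay_ms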

Change History (14)

comment:1 Changed 2 years ago by Antoine Martin

Status: new → assigned

comment:2 Changed 2 years ago by Antoine Martin

Re: linux traffic control via tc.
Just as I remembered it, the documentation is absolutely awful.

First, you need to install kernel-modules-extra: see "Is Traffic Control (tc) broken in Fedora 17?" (Bug 823316 - unable to simulate drops with tc / netem).

The documentation found at the Linux Foundation is incomplete: http://www.linuxfoundation.org/collaborate/workgroups/networking/netem. When trying to add the latency, you may get this totally unhelpful message: "RTNETLINK answers: No such file or directory" (which file would that be? no files are involved here at all!)
You need to:

modprobe sch_netem

Added difficulty: for testing, it is much easier to run everything on the same system.
Unfortunately, even when using the system's public network IP, the network subsystem will take a shortcut and route through the loopback device.
So you have to apply the rules there, i.e.:

tc qdisc add dev lo root netem delay 100ms 50ms 25%

And remember to remove the rules when you're done, as they will interfere with lots of other things:

tc qdisc del dev lo root netem
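
For test scripts, it may help to wrap the add/del pair so the rule is always removed even if the test fails. A minimal sketch (the netem() helper is hypothetical; it assumes the script has the privileges to run tc and uses the same commands as above):

import subprocess
from contextlib import contextmanager

@contextmanager
def netem(dev="lo", rule="delay 100ms 50ms 25%"):
    # apply the netem rule for the duration of the block, then always remove it
    subprocess.check_call(["tc", "qdisc", "add", "dev", dev, "root", "netem"] + rule.split())
    try:
        yield
    finally:
        subprocess.check_call(["tc", "qdisc", "del", "dev", dev, "root", "netem"])

# usage:
# with netem("lo", "delay 100ms 50ms 25%"):
#     ... run the test ...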

For the record, some alternatives:

On OSX and FreeBSD, there's also (obviously) ipfw.

On OSX Lion onwards, with Xcode installed: the Network Link Conditioner.

Last edited 2 years ago by Antoine Martin

comment:3 Changed 23 months ago by Antoine Martin

Milestone: 0.16 → 0.17

I have not found a simple solution to this problem - not one that can be merged this late in the release cycle. Re-scheduling. (hopefully some of the changes can be backported).

But I did find a huge bug in the process: r11376. (backported in r11380).

comment:4 Changed 19 months ago by Antoine Martin

  • r12200 (r12202 for v0.16.x branch) should prevent the line jitter from causing drastic changes in the base batch delay
  • r12158 (r12196 for v0.16.x branch) makes it possible to tune the number of soft-expired sends - as this may make things worse on bandwidth constrained links

comment:5 Changed 19 months ago by Antoine Martin

See also #401, #540 and #1135.

Last edited 19 months ago by Antoine Martin

comment:6 Changed 18 months ago by alas

Just as a note (as much so I will remember and/or be able to find it more easily as for anyone else's benefit), I've managed to get some other tc functions to work as well: loss, reorder, delete all, and list active (the examples below are for an eth0 device).

  • To list rules: tc -s qdisc ls dev eth0.
  • To add loss: tc qdisc add dev eth0 root netem loss 1%, or tc qdisc add dev eth0 root netem loss 2% 40% to make it more jittery.
  • To add reorder: tc qdisc add dev eth0 root netem reorder 2%, or tc qdisc add dev eth0 root netem reorder 2% 50% to make it more jittery.
  • To delete all the tc rules: tc qdisc del dev eth0 root.

comment:7 Changed 15 months ago by Antoine Martin

The low-level network code is a bit messy, in large part because of win32 and the way it (doesn't) handle blocking sockets...

  • r13270 ensures we don't penalise win32 clients (workaround is now only applied to win32 shadow servers), backported in r13271
  • r13272: code refactoring / cleanup
  • r13273: more detailed packet accounting

At the moment, we detect the network bottleneck because the network write call takes longer to return; handling WSAEWOULDBLOCK and socket.timeout would be more explicit.
Maybe we shouldn't be using blocking sockets? Or maybe reads can remain blocking, but it would be useful if writes were not, so that we could detect when the network layer cannot handle any more data (assuming we can distinguish the two cases).
Or maybe we need to use different code altogether for win32 and posix?
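
For illustration only, a sketch of what the more explicit detection could look like with a plain non-blocking Python socket (this is not xpra's actual network layer):

import errno
import socket

# WSAEWOULDBLOCK only exists on win32; fall back to EWOULDBLOCK elsewhere
WOULD_BLOCK = (errno.EAGAIN, errno.EWOULDBLOCK, getattr(errno, "WSAEWOULDBLOCK", errno.EWOULDBLOCK))

def try_send(sock, data):
    """Return the number of bytes the kernel accepted, or None when the
    socket buffer is full - i.e. the network layer is pushing back.
    Assumes the socket was made non-blocking with sock.setblocking(False)."""
    try:
        return sock.send(data)
    except (BlockingIOError, socket.timeout):
        return None
    except OSError as e:
        if e.errno in WOULD_BLOCK:
            return None
        raise

The caller could then raise the batch delay (or stop soft-expiring damage regions) as soon as try_send returns None, instead of waiting for a blocking send() to stall.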

Related reading:

Last edited 15 months ago by Antoine Martin

comment:8 Changed 15 months ago by Antoine Martin

Milestone: 0.17 → 1.0

comment:9 Changed 13 months ago by Antoine Martin

Milestone: 1.0 → 3.0

Far too late to make intrusive changes to the network layer.
Some recent fixes and breakage: #1211, #1298, #1134

Good read: https://github.com/TigerVNC/tigervnc/wiki/Latency

Last edited 13 months ago by Antoine Martin

comment:10 Changed 8 months ago by Antoine Martin

Milestone: 3.0 → 2.1

comment:11 Changed 8 months ago by Antoine Martin

See also #619, #401 and #153

Last edited 6 months ago by Antoine Martin

comment:12 Changed 6 months ago by Antoine Martin

The big hurdle for fixing this is that we have a number of queues and threads sitting in between the window damage events and the network sockets.
When things go wrong (network bottleneck, dropped packets, whatever), we need to delay the window pixel capture instead of queuing things up downstream (pixel encoding queue, packet queue, etc).
These buffers were introduced to ensure that we keep the pixel pipeline filled at all times to make the best use of the available bandwidth: highest fps / quality possible.
When tweaking those settings, we want to make sure we don't break the optimal use case.
So maybe we should define baselines before making any changes: one for the optimal use case (gigabit or local connection without mmap) and one for a "slow" network connection (fixed settings we can reproduce reliably with tc). That's on top of the automated perf tests, which will give us another angle on this.

Things to figure out:

  • XPRA_BATCH_ALWAYS=1 - I don't think this would make much of a difference, since we always end up batching anyway - but worth checking
  • XPRA_MAX_SOFT_EXPIRED=0 - how much does this help? It will prevent us from optimistically expiring damage regions before we have received ACK packets. Not sure yet how we would tune this automatically
  • TARGET_LATENCY_TOLERANCE=0 (added in r15616) - how much difference does this make? (probably very little on its own, likely to require XPRA_MAX_SOFT_EXPIRED=0)
  • av-sync: does this even matter?
  • r15617 exposes the "damage.target-latency" for each window: how far is this value from the network latency? It needs to account for the client processing the packet, decoding the picture data, presenting it and sending a new packet, so an extra ~50ms is to be expected (see the sketch after this list).
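
A quick way to eyeball that comparison on a live session. This is a rough sketch: it assumes "xpra info" prints one key=value pair per line and that the key names match the ones quoted in this ticket, which may vary between versions:

import re
import subprocess

def xpra_info():
    # "xpra info" prints one "key=value" pair per line
    out = subprocess.check_output(["xpra", "info"], universal_newlines=True)
    info = {}
    for line in out.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info

info = xpra_info()
ping_avg = [v for k, v in info.items() if k.endswith("client-ping-latency.avg")]
for key, value in sorted(info.items()):
    if re.search(r"window\[\d+\]\.damage\.target-latency", key):
        # expect roughly: network latency + ~50ms for client decode / present / ack
        print("%s = %s (client ping latency avg: %s)" % (key, value, ping_avg))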

Things to do:

  • figure out how to decisively identify network pushback - without breaking blocking / non-blocking socket code... shadow servers, etc - some network implementations will chunk things, or buffer things in the network layer, and others won't.
  • drive the batch delay / soft expired delay more dynamically
Last edited 6 months ago by Antoine Martin

comment:13 Changed 6 months ago by Antoine Martin

Testing with an r15691 Fedora 26 server and a win7 64-bit client, using the default encoding on a 4k screen, connecting over a 100Mbps LAN and using glxspheres to generate a constant high fps.

Here are the settings we change during the tests (showing the default value here):

  • batch-always: XPRA_BATCH_ALWAYS=0
  • max-expired: XPRA_MAX_SOFT_EXPIRED=5
  • latency-tolerance: TARGET_LATENCY_TOLERANCE=20

The settings combination "SET1" is: batch-always=1, max-expired=0, latency-tolerance=0.

Example of tc changes we can make:

  • tc latency: 100 / 50: tc qdisc add dev eth0 root netem delay 100ms 50ms 25%
  • tc loss: 2 / 40: tc qdisc add dev eth0 root netem loss 2% 40%

For collecting statistics:

xpra info | egrep "out_latency.avg|encoder.fps|client-ping-latency.avg"

Sample | Settings                          | FPS   | Client Ping Latency (ms) | Damage Out Latency (ms)
1      | Defaults                          | 30-35 | 4                        | 4
2      | SET1                              | 25-35 | 4                        | 5
3      | Defaults with tc latency 100 / 0  | 25-35 | ~105                     | 4
4      | SET1 with tc latency 100 / 0      | 25-35 | ~105                     | 4
5      | Defaults with tc latency 100 / 50 | 15-30 | ~200                     | ~4 to 500! (but usually very low)
6      | SET1 with tc latency 100 / 50     | 15-30 | ~200, spiked up to 900!  | ~4 to 500! (but usually very low)

Initial notes:

  • tried with rgb encoding, auto is faster! (100Mbps LAN)
  • heisenbug for statistics: "xpra info" slows down the server a little bit, so don't capture data too often..
  • some variation between runs, sometimes as much as 20% - not sure why yet
  • av-sync throws things off..
  • fps goes up very slowly in some cases
  • damage out latency goes very high for the xterm window used to start glxspheres with av-sync enabled: this isn't video, it shouldn't be delayed!
  • seen the client ping latency go as high as 900ms - why?
  • damage out latency can start very high with SET1 + tc latency: 2s!
  • fps varies a lot when we add latency, even more so with jitter
  • the tunables from "SET1" affect even the best case scenario (on a LAN with 4ms latency) - shows that they have a purpose
  • latency affects fps: we should be able to achieve the same fps, no matter what the latency is (within reason) - not far off with latency alone (Sample 1 vs 3), worse with jitter (Sample 1 vs 5)

comment:14 Changed 3 months ago by Antoine Martin

Milestone: 2.1 → 2.2

Re-scheduling.
