Xpra: Ticket #153: xpra over high latency links batches too much

As per this mailing list post, it seems that the damage sequence (which we use to see how far behind the client is) is causing the damage batch delay to go up when it does not need to, which in turn drives the picture latency up.
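
For context, here is a simplified sketch of the feedback mechanism (illustrative only; the real logic lives in calculate_batch_delay in server_source.py): the server stamps every damage packet with a sequence number, the client echoes it back in a "damage-sequence" packet once decoded, and the difference between the two tells us how far behind the client is.

# Simplified illustration of the feedback mechanism - not the actual code.
class DamageFeedback(object):
    def __init__(self):
        self._damage_packet_sequence = 0        # last sequence number sent
        self.last_client_packet_sequence = 0    # last sequence acked by the client

    def damage_packet_sent(self):
        self._damage_packet_sequence += 1

    def process_damage_sequence(self, sequence):
        # the client sends a "damage-sequence" packet once it has decoded a frame
        self.last_client_packet_sequence = sequence

    def client_backlog(self):
        # how many damage packets the client has not yet acknowledged; this
        # delta is what calculate_batch_delay uses to raise or lower the delay
        return self._damage_packet_sequence - self.last_client_packet_sequence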

The only alternative explanation is that somehow the "damage-sequence" packet is having an adverse effect, which is very unlikely. To prove this, simply apply this patch to trunk:

--- src/xpra/server_source.py	(revision 975)
+++ src/xpra/server_source.py	(working copy)
@@ -248,7 +248,7 @@
             if self._damage_data_queue.qsize()>3:
                 #contains pixmaps before they get converted to a packet that goes to the damage_packet_queue
                 update_batch_delay("damage data queue overflow: %s" % self._damage_data_queue.qsize(), logp10(self._damage_data_queue.qsize()-2))
-        if not last_delta:
+        if True:
             return
         #figure out how many pixels behind we are, rather than just the number of packets
         all_unprocessed = list(self._damage_packet_sizes)[-delta:]

If the problem goes away, then it is clearly our batching decision making (the part that is based on client feedback) which is wrong, as I suspect it is.
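
For reference, logp2 and logp10 in the hunk above are xpra's internal scaling helpers; their exact definitions are not reproduced here, but the intent is a gentle logarithmic mapping from "how far over its threshold a queue is" to a factor slightly above 1. Something along these lines, as an assumed illustration only:

import math

# Assumed, illustrative definitions - NOT the actual xpra helpers.
# An overflow of 0 maps to a factor of 1.0 (no change), and the factor
# grows slowly (logarithmically) as the overflow increases.
def logp2(x):
    return math.log(2 + max(0, x), 2)

def logp10(x):
    return math.log(10 + max(0, x), 10)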


The problem is that although the behaviour is suboptimal over high-latency links, I do not know which application to test with or what bandwidth limitations to apply to simulate the link in question (2Mbps down, 512kbps up?).

Using trickle, here is what I tested with:

We try to converge on the best batch delay possible given the limited bandwidth, and it oscillates between 20ms and 400ms, clearly far too wide a range. I think that is because of buffering and "smoothing" over time, which lets lots of packets go through (so we reduce the batch delay); then, when we hit the limit, they start getting delayed (so we increase it). The problem is that each of these decisions arrives late, so the code struggles to cope with the changing conditions and keeps oscillating. Slowing down the rate of change would definitely help in this particular case, but not in others... It might be worth taking the connection latency into account to reduce this rate of change.
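
As a rough illustration of that last idea (a hypothetical sketch, not what the code currently does): shrink each adjustment towards "no change" as the measured round-trip latency grows, so that a high-latency link moves the delay in smaller, slower steps:

# Hypothetical sketch of latency-damped adjustments - not the current code.
def damped_batch_delay(current_delay, factor, rtt, min_delay=5, max_delay=1000):
    # factor>1 means "slow down" (increase the delay), factor<1 means "speed up".
    # The higher the rtt (in seconds), the staler the feedback that produced
    # 'factor', so we trust it less and shrink the step towards 1.0.
    damping = 1.0 / (1.0 + rtt)
    effective = 1.0 + (factor - 1.0) * damping
    return max(min_delay, min(max_delay, int(current_delay * effective)))

With a 300ms RTT, a factor of 1.5 would shrink to roughly 1.38, and at 800ms to about 1.28 - enough to slow the oscillation without stopping the delay from moving at all.
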

Thoughts welcome.



Tue, 26 Jun 2012 17:36:51 GMT - Antoine Martin:

Another area where we ought to make some improvements is the scheduling of calculate_batch_delay: at the moment, it fires no more than 40 times per second and it will not fire unless there are screen updates.

The problem with this is that if there is a period of inactivity, it won't run at all... It might be better to schedule it with a value derived from the current batch delay, with an increasing delay if no new frames come in (so as to not wake up the server too often unnecessarily).
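
Something along these lines, as a sketch (assumed helper names, reusing the gobject.timeout_add mechanism the code already relies on): let the recalculation reschedule itself, and back off progressively when no damage events have arrived since the last run:

# Sketch only - assumed helper names, not the current implementation.
# (gobject, batch and window are as used elsewhere in server_source.py)
def schedule_batch_recalculation(self, wid, window, batch, idle_runs=0):
    def recalculate():
        if self._closed:
            return False                        # source is gone, stop
        if self.had_recent_damage(wid):         # hypothetical helper
            self.calculate_batch_delay(wid, window, batch)
            idle = 0
        else:
            idle = idle_runs + 1                # nothing happened, back off
        self.schedule_batch_recalculation(wid, window, batch, idle)
        return False                            # we reschedule manually
    # wait at least one batch delay, and progressively longer while idle,
    # rather than waking the server up to 40 times a second regardless
    wait = max(25, int(batch.delay * (1 + idle_runs)))
    gobject.timeout_add(wait, recalculate)
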

The other thing is that when we modify this batch delay, we ought to take into account how many times the current value was used, and when it was used. Some sort of moving average would be relevant too: the further we move away from it, the slower the change should be.

Finally, let's try to calculate the link latency: record the send time in damage_packet_sent and compare it with the receive time in _process_damage_sequence (minus the decode_time, which we already have).
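
A minimal sketch of what that could look like (assumed attribute and helper names; only the two hooks named above exist in the code):

# Sketch with assumed names - not the actual implementation.
# (time is already imported in server_source.py)
def damage_packet_sent(self, sequence):
    # remember when each damage packet was handed to the network layer
    self._damage_send_times[sequence] = time.time()

def _process_damage_sequence(self, sequence, decode_time_ms):
    sent = self._damage_send_times.pop(sequence, None)
    if sent is None:
        return
    # elapsed time minus the client-reported decode time approximates the
    # round-trip link latency for this packet
    latency = (time.time() - sent) - decode_time_ms / 1000.0
    self.record_link_latency(latency)   # hypothetical: feed into the batching decisions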


Tue, 26 Jun 2012 19:33:05 GMT - Antoine Martin: status changed

Some really hackish code to try to avoid this oscillation of the batch delay. First we keep historical records of the size of the damage_data_queue, and try to stop increasing the delay so quickly as soon as things are starting to get better (and not just when it is fully back to normal). Also, when lowering the batch delay, we take into account a weighted moving average to try to avoid lowering it too quickly and too far.

This does seem to allow it to narrow down on the correct value better - without oscillation, at the cost of taking longer to lower it... Not ready for merging as it is, but something like it should get applied eventually.

Still TODO:

Please give feedback!

--- src/xpra/server_source.py	(revision 975)
+++ src/xpra/server_source.py	(working copy)
@@ -123,6 +123,7 @@
         self._mmap_bytes_sent = 0
         protocol.source = self
         self._damage_data_queue = Queue()
+        self._damage_data_queue_sizes = maxdeque(100)
         self._damage_packet_queue = Queue()
         self._closed = False
@@ -232,22 +233,55 @@
             self.batch_configs[wid] = batch
         return batch
-    def calculate_batch_delay(self, wid, batch):
+    def calculate_batch_delay(self, wid, window, batch):
         def update_batch_delay(reason, factor=1, delta=0):
-            batch.delay = max(batch.min_delay, min(batch.max_delay, int(100.0*batch.delay*factor)/100.0)-delta)
+            new_delay = max(batch.min_delay, min(batch.max_delay, int(100.0*batch.delay*factor)/100.0)-delta)
+            if len(batch.last_delays)>0:
+                #get the weighted average:
+                tv, tw = 0.0, 0.0
+                now = time.time()
+                for when, delay in batch.last_delays:
+                    #newer matter more, and even more so when factor is high
+                    w = 1.0/(0.1+(now-when)*factor)
+                    tv += delay*w
+                    tw += w
+                avg = tv / tw
+                log.info("update_batch_delay: average=%s, weighted average=%s, new_delay=%s", sum([x for _,x in batch.last_delays])/len(batch.last_delays), avg, new_delay)
+                if new_delay<avg:
+                    new_delay = (3*new_delay+avg)/4
+            batch.delay = new_delay
             batch.last_updated = time.time()
-            log("update_batch_delay: %s, wid=%s, factor=%s, delta=%s, new batch delay=%s", reason, wid, factor, delta, batch.delay)
+            log.info("update_batch_delay: %s, wid=%s, factor=%s, delta=%s, new batch delay=%s", reason, wid, factor, delta, batch.delay)
         last_delta = self.last_client_delta
         delta = self._damage_packet_sequence-self.last_client_packet_sequence
         self.last_client_delta = delta
         if True:
-            if self._damage_packet_queue.qsize()>3:
+            if self._damage_packet_queue.qsize()>2:
                 #packets ready for sending by network layer
                 update_batch_delay("damage packet queue overflow: %s" % self._damage_packet_queue.qsize(), logp2(self._damage_packet_queue.qsize()-2))
-            if self._damage_data_queue.qsize()>3:
+                return
+            if self._damage_data_queue.qsize()>2:
                 #contains pixmaps before they get converted to a packet that goes to the damage_packet_queue
-                update_batch_delay("damage data queue overflow: %s" % self._damage_data_queue.qsize(), logp10(self._damage_data_queue.qsize()-2))
+                qsize = self._damage_data_queue.qsize()
+                factor = logp10(qsize)
+                if len(self._damage_data_queue_sizes)>0:
+                    last_10 = list(self._damage_data_queue_sizes)[-10:]
+                    if max(last_10)==qsize:
+                        log.info("damage data queue overflow: %s worst of %s" % (qsize, last_10))
+                        #it's the worst it's been recently, ouch increase faster
+                        factor += 0.2
+                    elif min(last_10)==qsize:
+                        log.info("damage data queue overflow: %s best of %s" % (qsize, last_10))
+                        #it's the best it's been recently, leave it alone
+                        factor = 1.0
+                    elif len(last_10)>4 and last_10[-1]>qsize and last_10[-2]>qsize:
+                        log.info("damage data queue overflow: %s improving: %s" % (qsize, last_10))
+                        factor = (factor+2)/3.0
+                    else:
+                        log.info("historical=%s, last_10=%s, max=%s, min=%s, qsize=%s" % (self._damage_data_queue_sizes, last_10, max(last_10), min(last_10), qsize))
+                update_batch_delay("damage data queue overflow: %s" % self._damage_data_queue.qsize(), factor)
+                return
         if not last_delta:
             return
         #figure out how many pixels behind we are, rather than just the number of packets
@@ -262,19 +296,21 @@
         last_pixels_behind = sum(last_unprocessed)
         log("calculate_batch_delay: wid=%s, unprocessed=%s, last_unprocessed=%s, pixels_behind=%s, last_pixels_behind=%s", wid, unprocessed, last_unprocessed, pixels_behind, last_pixels_behind)
         if packets_due<=2 and last_packets_due<=2:
-            return update_batch_delay("client is up to date: %s packets in queue" % packets_due, 0.9, 1)
+            return update_batch_delay("client is up to date: %s packets in queue" % packets_due, (7+packets_due)/10.0, 1)
         if packets_due<last_packets_due and pixels_behind<last_pixels_behind:
-            return update_batch_delay("client is catching up: down from %s to %s packets in queue, %s to %s pixels due" % (last_packets_due, packets_due, last_pixels_behind, pixels_behind), 0.9, 1)
-        low_limit, high_limit = 8*1024, 32*1024
+            return update_batch_delay("client is catching up: down from %s to %s packets in queue, %s to %s pixels due" % (last_packets_due, packets_due, last_pixels_behind, pixels_behind), (9-min(2,last_packets_due-packets_due))/10.0, 1)
+        ww, wh = self.get_window_dimensions(window)
+        low_limit, high_limit = ww*wh, ww*wh*4
         if self._mmap and self._mmap_size>0:
-            low_limit, high_limit = 256*1024, 1*1024*1024
+            low_limit *= 4
+            high_limit *= 4
         #things are getting worse or unchanged:
         if pixels_behind<=low_limit:
             if packets_due<=5:
                 return update_batch_delay("client is only %s pixels and only %s packets behind" % (pixels_behind, packets_due), 0.8, 1)
             if packets_due<10 and packets_due<last_packets_due:
                 return update_batch_delay("client is only %s pixels and %s packets behind" % (pixels_behind, packets_due), 0.9, 0)
-        if pixels_behind>=high_limit:
+        if pixels_behind>high_limit:
             return update_batch_delay("client is %s pixels behind!" % pixels_behind, min(1.5, logp2(1.0*pixels_behind/high_limit)))
         if pixels_behind<last_pixels_behind:
             #things are getting better:
@@ -341,7 +377,7 @@
             pixmap = self.get_window_pixmap(wid, window, self._sequence)
             if pixmap:
                 self._process_damage_region(now, pixmap, wid, x, y, w, h, coding, self._sequence, options)
-                batch.last_delays.append(0)
+                batch.last_delays.append((now, 0))
         #record this damage event in the damage_last_events queue:
         now = time.time()
         last_events = self._damage_last_events.setdefault(wid, maxdeque(100))
@@ -360,7 +396,7 @@
             return
         if batch.last_updated+0.025<now:
-            self.calculate_batch_delay(wid, batch)
+            self.calculate_batch_delay(wid, window, batch)
         event_min_time = now-batch.time_unit
         all_pixels = [pixels for event_time,pixels in last_events if event_time>event_min_time]
         beyond_limit = len(all_pixels)>batch.max_events or sum(all_pixels)>batch.max_pixels
@@ -385,7 +421,7 @@
                 log("window %s already removed from delayed list?", wid)
             return False
         log("damage(%s, %s, %s, %s, %s) scheduling batching expiry for sequence %s in %sms", wid, x, y, w, h, self._sequence, batch.delay)
-        batch.last_delays.append(batch.delay)
+        batch.last_delays.append((now, batch.delay))
         gobject.timeout_add(int(batch.delay), send_delayed)
     def send_delayed_regions(self, damage_time, wid, window, damage, coding, sequence, options):
@@ -446,6 +482,7 @@
         if data:
             log("process_damage_regions: adding pixel data %s to queue, elapsed time=%s, queue size=%s", data[:6], int(1000*(time.time()-damage_time)), self._damage_data_queue.qsize())
             self._damage_data_queue.put(data)
+            self._damage_data_queue_sizes.append(self._damage_data_queue.qsize())
     def data_to_packet(self):
         while not self._closed:
Index: src/xpra/server.py
===================================================================
--- src/xpra/server.py	(revision 975)
+++ src/xpra/server.py	(working copy)
@@ -1003,7 +1003,7 @@
         for wid in self._id_to_window.keys():
             batch = source.get_batch_config(wid)
             if batch:
-                for d in batch.last_delays:
+                for _,d in batch.last_delays:
                     batch_delays.append(d)
         if len(batch_delays)>0:
             info["min_batch_delay"] = int(max(batch_delays))

Wed, 27 Jun 2012 02:22:21 GMT - Martin Ebourne:

Link capacity, just tested with scp on a large file:

Upload bandwidth: mean 152.7KB/s
Download bandwidth: mean 590.9KB/s
Idle RTT: min 301ms
Uplink saturated RTT: max 785ms
Downlink saturated RTT: max 454ms

I don't think glxgears is at all a useful application to test with for this behaviour. I'm not interested in frame rate or throughput here, but in interactivity, which is mostly defined by the latency from an action (keypress/mouse) to the reaction (the relevant screen update). The apps I'm using are as per the original post: emacs, xterm, evolution, firefox, rdesktop.

A good demonstration of the interactivity problems is running emacs side by side over old and new xpra. On the old one, every cursor movement is reflected almost immediately: if I hold down the cursor key I see the cursor move through every character as I would locally, and when I let go it stops about where expected. On vanilla 0.4.0 it batches, so the cursor jumps many characters at a time in a very jerky movement, and when I let go I'm not sure whether it has stopped moving or is just pausing before the next update. Editing code in such conditions is challenging.

Another demonstration is clicking with a mouse to open a menu. It should be close to immediate but can take seconds on vanilla 0.4.0 which kills usability.

On to your changes: the one-line change didn't seem to help much. I didn't test it against vanilla 0.4.0 today, but it was painful to use: the cursor was very jerky and applications took up to 5 seconds to respond.

On the other hand the larger patch is excellent. It's really made a huge difference, cursor movement is smooth and the GUI applications seem quite responsive. It's early days but I'll continue to use this and see how I get on.

Redraw is noticeably different from the old version: scrolling in emacs used to paint the screen from top to bottom; now the screen updates as a block, but after a delay. This makes it look more like emacs running locally, but it reduces the immediacy of the feedback and doesn't seem to affect overall redraw time (that is difficult to measure, but I tried using a stopwatch and paging down 10 times, pressing each key as soon as the screen had updated).

Thanks for your help and super quick response!


Sat, 30 Jun 2012 08:28:18 GMT - Antoine Martin:

r982 is a more proper fix for this (details in the commit message); please let me know if this fixes it for you so I can close this ticket. I will follow up with some more improvements in #155.

There were many problems with the old code:

etc..

FYI: glxgears is a great way of testing latency/bandwidth limited links as it allows you to vary the bandwidth usage simply by resizing the window. As you ask xpra to push more pixels through a limited pipe, eventually reaching capacity, it needs to start adjusting the batch delay upwards. This was the easiest way for me to reproduce your latency issues.

The all-important xterm "while true; do dmesg; done" test, which is part of the test suite, now runs fine too!


Thu, 05 Jul 2012 18:39:44 GMT - Antoine Martin:

Much better/cleaner code in r1046 - see commit message for lots of details.


Mon, 09 Jul 2012 06:58:47 GMT - Martin Ebourne:

Ran with r982 all last week, it worked well. Running with r1060 this week.


Mon, 09 Jul 2012 18:40:58 GMT - Antoine Martin: cc set

r1063 Removes all the nasty hardcoded constants and instead relies on simple arithmetic.

Overall, this works much better than r1046 in almost all cases (see the spreadsheet of test results), and although there are some corner-case regressions compared to 0.3.x, these will be dealt with separately if possible. The important metric here is that we seem to manage to prevent the latency from going too high (except with png encoding and pathological tests...)

Note: it may not be possible to satisfy all scenarios with heuristics alone.


Mon, 09 Jul 2012 18:42:19 GMT - Antoine Martin: attachment set

test results comparing 0.3 branch, 0.4.0 (circa r1061) and 0.4.0-rc7 (aka r1063)


Tue, 10 Jul 2012 03:33:34 GMT - Martin Ebourne:

Ok thanks; running that now, with debug_delay enabled. I am using png encoding as jpg did not give sufficient text quality unless on a very high setting, at which point it seemed to give no performance improvement. I use xterm/emacs a lot with quite a small bitmap font on a high-res display, so I need good text reproduction. Hopefully my test is not too pathological.


Tue, 10 Jul 2012 06:50:48 GMT - Antoine Martin:

Just be aware that whenever the debug fires (every 30s or when there are more than 1000 records to print, whichever comes first), it will cause some lag...


Thu, 12 Jul 2012 07:22:40 GMT - Martin Ebourne:

r1063 didn't turn out so well. I ran it with the logging for a day and it was quite sluggish to use. Unfortunately, when I stopped the logging it didn't really improve. I had to revert to r1062 on the server, and it was fine then.

I have a full day's debug_delay log if it helps you at all, though it's 380MB (65MB bzipped) so if you want me to preprocess it to get the interesting stuff let me know by email.


Fri, 13 Jul 2012 20:41:40 GMT - Antoine Martin:

r1098 should improve things for you - whether that is now as good as r1062... I don't know (my automated tests apparently do not capture your typical usage - I wish they did, ideas welcome!)

The one problematic factor is "damage data queue size": it weighs more because items in this queue still have to be processed by all the other queues and then sent out; unfortunately it only counts items, not how many pixels are in them... Something needs to be done to improve this.
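
One possible direction (just a sketch with assumed names, not part of any revision): queue the pixel count alongside each item and keep a running total, so the factor can be driven by the pixels waiting rather than by the number of items:

# Sketch only - assumed names, not part of the current code
# (thread-safety of the counter is ignored for the purpose of the sketch).
def queue_damage_data(self, data, width, height):
    self._damage_data_queue.put((data, width*height))
    self._damage_data_queue_pixels += width*height      # pixels waiting

def dequeue_damage_data(self):
    data, pixels = self._damage_data_queue.get()
    self._damage_data_queue_pixels -= pixels
    return data

# the overflow factor could then compare pixels waiting against the window size,
# e.g. logp10(self._damage_data_queue_pixels/(ww*wh)), instead of using qsize()
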

ok, here are some of the most interesting log extracts, chosen at tipping points: before or after a jump/drop in batch delay.

I found the location of spikes by looking for the new batch delay value in the log extracts with:

cat xpra-r1064-ab  | awk -F "new delay=" '{print $2}' | awk '{print $1}' | more

*

update_batch_delay: wid=9, change factor=44.2%, delay min=49.0, avg=64.3, max=99.7, cur=82.7, w. average=72.7, tot wgt=32.1, hist_w=15.0, new delay=119.4 --
[  ('damage processing latency: avg=17.4, recent=17.0', 1.009869654093447, 0.9993499058538527),
   ('damage send latency: avg=19.9, recent=19.1', 0.9988767378654252, 1.0011502526907246),
   ('network send speed: avg=189347, recent=177379 (KBytes/s)', 1.04786777295379, 1.04786777295379),
   ('client decode speed: avg=11.6, recent=11.6 (MPixels/s)', 0.5533322396555136, 1.0),
   ['delay not updated for 135.0 ms (skipped 3 times)', 0, 0.4617549787366009],
   ('client latency: avg=331.2, recent=334.0', 1.397990316854844, 1.0024348667059213),
   ['damage packet queue size: avg=1.77680196624, recent=1.72500583591, target=1, div=1', 1.3122801851043002, 0.4852180911832026],
   ['damage packet queue pixels: avg=34526.3476559, recent=36544.7159551, target=1, div=1089066.0', 0.03191879574328046, 1.0415651555598613],
   ['damage data queue: avg=3.70385296869, recent=5.5146282146, target=0, div=1', 3.7036828475176695, 4.985079509269735],
   ['client has some ACKs pending: 1', 1.0, 0],
   ['packets/pixels backlog from 0/0 to 0/0', 0.0, 0.5]  ]

That's a 44% increase in the delay, mostly due to the backlog in the damage data queue.

*

update_batch_delay: wid=3, change factor=65.8%, delay min=49.0, avg=117.9, max=320.1, cur=537.0, w. average=178.6, tot wgt=33.0, hist_w=14.6, new delay=890.4 --
[  ('damage processing latency: avg=33.6, recent=45.0', 1.04877707321461, 1.0156410027957719),
   ('damage send latency: avg=46.8, recent=64.0', 1.059864475384888, 1.0233804662676518),
   ('network send speed: avg=95790, recent=54358 (KBytes/s)', 1.4658212520431309, 1.4658212520431309),
   ('client decode speed: avg=22.6, recent=21.6 (MPixels/s)', 0.8991863663126225, 1.288416026946962),
   ['delay not updated for 56.9 ms (skipped 1 times)', 0, 0.0],
   ('client latency: avg=323.2, recent=324.8', 1.3881096435205884, 1.0014330630396993),
   ['damage packet queue size: avg=5.91231821803, recent=8.22406320798, target=1, div=1', 2.768934254729569, 0.6396246761650424],
   ['damage packet queue pixels: avg=87764.8947987, recent=128303.007021, target=1, div=800000.0', 0.1465504231143261, 1.2997704145088427],
   ['damage data queue: avg=8.4125154456, recent=12.3757953354, target=0, div=1', 4.741552772744084, 6.031202546394014],
   ['client has some ACKs pending: 34', 1.0, 0],
   ['packets/pixels backlog from 1/11 to 0/0', 0.0, 0.0] ]

That's a 65% increase, again mostly due to the backlog in the data queue, but also in the packet queue..

*

update_batch_delay: wid=5, change factor=-4.5%, delay min=719.3, avg=1511.4, max=5267.7, cur=778.4, w. average=901.7, tot wgt=13.7, hist_w=5.7, new delay=742.9 --
[  ('damage processing latency: avg=10.1, recent=10.9', 1.0012974014642182, 1.001085368146258),
   ('damage send latency: avg=12.5, recent=12.9', 0.9901946553188158, 0.9994942119306733),
   ('network send speed: avg=142411, recent=108396 (KBytes/s)', 1.2102694480959981, 1.2102694480959981),
   ('client decode speed: avg=6.2, recent=6.2 (MPixels/s)', 0.3239169543398469, 0.999999083426363),
   ['delay not updated for 1085.5 ms (skipped 27 times)', 0, 4.671985090122343],
   ('client latency: avg=720.9, recent=918.6', 1.919049293419958, 1.1132563179902775),
   ['damage packet queue size: avg=1.85173179032, recent=1.59141169452, target=1, div=1', 1.2595885423108755, 0.4269648367628348],
   ['damage packet queue pixels: avg=17460.8428057, recent=21152.9546567, target=1, div=124098.0', 0.15528091512926995, 1.1449971957849034],
   ['damage data queue: avg=0.658772944417, recent=0.764245320806, target=0, div=1', 1.8190511836295682, 2.883258558416875],
   ['client has some ACKs pending: 5', 1.0, 0],
   ['packets/pixels backlog from 0/0 to 1/11', 0.5, 0.7924812503605781]]

The delay should decrease faster here..


Sun, 15 Jul 2012 11:04:44 GMT - Antoine Martin: attachment set

test results comparing 0.3 branch and 0.4.0-rc9 (aka r1100)


Wed, 18 Jul 2012 19:07:11 GMT - Antoine Martin:

more improvements in r1130


Sat, 21 Jul 2012 18:25:27 GMT - Antoine Martin:

Many more fixes in r1144 (see changelog)


Mon, 23 Jul 2012 17:10:23 GMT - Antoine Martin: status changed; resolution set

I believe this is complete.

Feel free to re-open if necessary.


Fri, 27 Jul 2012 06:01:22 GMT - Antoine Martin:

Found some problems still...


Fri, 27 Jul 2012 14:21:28 GMT - Antoine Martin:

I will probably also backport part of r1195, which fixes the delay staying high when the average used in the calculations is close to zero.


Thu, 02 Aug 2012 19:05:12 GMT - aradtech: attachment set

aradtech xpra logfile with latency debugging


Thu, 02 Aug 2012 19:28:02 GMT - aradtech: attachment set

second try of logging issue


Thu, 02 Aug 2012 20:21:32 GMT - Antoine Martin:

The problem seems to be caused by a very large "damage processing and send latency", as seen by records like this one:

2012-08-02 12:15:35,378 update_batch_delay: wid=1, fps=100, last updated 41.99 ms ago, decay=1.0, change factor=1.4%, delay min=6.2, avg=16.4, max=41.2, cur=37.4, w. average=28.5, tot wgt=25.5, hist_w=12.7, new delay=37.9
 ('damage processing latency: avg=1129.3, recent=1661.2 avg=1.12933581412, recent=1.66122890682, target=0.05228515625, div=1.0', 2.99, 1.58)
 ('damage send latency: avg=1155.8, recent=1674.1 avg=1.15583599438, recent=1.67416776013, target=0.0757421875, div=1.0', 2.59, 1.37)
 ('network send speed: avg=177174, recent=155615 (KBytes/s), div=10.2488096084', 1.09, 0.1)
 ('client decode speed: avg=61.5, recent=61.2 (MPixels/s)', 1.0, 0.0)
 ('delay not updated for 41.2 ms (skipped 1 times - highest latency is 1674.1)', 0.0, 0.0)
 ('client latency: lowest=4.4, avg=56.5, recent=31.3 avg=0.056502563324, recent=0.0313947186921, target=0.01, div=1.0', 0.86, 0.17)
 ('damage packet queue size: avg=0.139678134796, recent=0.0549618248307, target=1.0 avg=0.139678134796, recent=0.0549618248307, target=1.0, div=1.0', 0.04, 0.96)
 ('damage packet queue pixels: avg=0.0, recent=0.0, target=1.0 avg=0.0, recent=0.0, target=1.0, div=886784', 0.0, 0.99)
 ('client packets backlog avg=0, recent=0, target=0, div=1.0', 0.0, 0.99)
 ('client pixels backlog avg=0, recent=0, target=0, div=886784', 0.0, 0.99)

2 things:

Will try to reproduce with something I can automate.


Sat, 23 Jan 2021 04:46:59 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/153