Xpra: Ticket #592: Pages sometimes show blurry content

Sometimes Xpra will enter a state where contents on a webpage become unreadable. It's most reproducible with large window sizes, usually after manually resizing the window. I'll attach a couple of screenshots of various buggy states, along with a .txt containing an xpra info.

Easily reproducible steps:

Open Chrome/Firefox?
Navigate to a webpage (I use wikipedia)
Resize window to a large window size manually using the bottom right corner

If this does not induce a buggy state, try resizing the window rapidly from small to big. If it's still not buggy, try going from big to small to big again rapidly.

This state is only ever entered when using the auto quality option. Manually forcing the best quality option will not induce this blurry state. Also of note, disconnecting and reconnecting (without closing chrome) will re-enter the blurry state.

Mon, 09 Jun 2014 18:48:28 GMT - J. Max Mena: attachment set

attachment set to Blurry.png

Mon, 09 Jun 2014 18:49:02 GMT - J. Max Mena: attachment set

attachment set to 592 Slightly Blurry.png

Xpra in a slightly blurry state. It's not as bad as I've seen it, but this is the best I could get for the xpra info.

Mon, 09 Jun 2014 18:51:19 GMT - J. Max Mena: attachment set

attachment set to 592 More Blurry.png

(this is the same file as blurry.png) A more blurry state that I screenshotted last week after it sat for several minutes.

Tue, 10 Jun 2014 01:53:52 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to J. Max Mena

Please include xpra info captured when the window quality is degraded. Or even better, a few times as it is degrading.

From the screenshots, it looks like the picture is downscaled to 2:3 or even 1:2. You can verify this by running without downscaling:

XPRA_SCALING=0 xpra start ..

Tue, 10 Jun 2014 02:12:52 GMT - J. Max Mena: attachment set

attachment set to ticket592xprainfo.txt

Xpra Info as requested...a little bit late.

Tue, 10 Jun 2014 05:24:04 GMT - Antoine Martin:

If you're using this screen layout regularly, we should probably add a modeline for it:

client.screen[0].monitor[0].geometry=(1920, 296, 1600, 900)
client.screen[0].monitor[1].geometry=(0, 0, 1920, 1080)
client.screen[0].size=(3520, 1196)

Though the automatic mode matching found something pretty close:

server.root_window_size=(3600, 1200)

As I mentioned before, I can't reproduce this, but I don't have a 1080p mswindows client to test with (at least not easily, and not with opengl enabled). Does this also happen with Linux + opengl clients? OSX? Does it never recover? I really don't understand why re-connecting does not reset things as all the statistics do get reset in that case. Does r6682 (for trunk) or r6682 (for v0.13.x) help?

the global batch delay is low (<=20ms), which is good:
```
batch.delay.cur=12
batch.actual_delay.cur=20
```

the client latency seems to suffer from some heavy jitter:

client.latency.50p=60
client.latency.80p=159
client.latency.90p=407
client.latency.absmin=1
client.latency.avg=114
client.latency.cur=5
client.latency.max=617
client.latency.min=1

That's a pretty wide range there, from 1ms to 617ms, and the 90 percentile is quite high too (407). What makes this very odd is that the ping latency is actually very good and stable (the server is probably not far at all):

client.ping_latency.50p=2
client.ping_latency.80p=2
client.ping_latency.90p=3
client.ping_latency.avg=3
client.ping_latency.cur=2
client.ping_latency.max=56
client.ping_latency.min=1

The difference between the two is that the "client ping latency" is measured from dedicated ping packets whereas the "client latency" is measured by substracting the decoding time from the time it takes to process a paint request. The two should not be so out of whack. This could be because of the resizing (which is similar to scrolling, only worse), it could end up queuing too many paint requests on the client end, maybe we need to slow down the refresh during resizing.

Please post the server output with -d stats (or you can also enable it at runtime with xpra control :DISPLAY debug enable stats), we should see some record_latency: .. information in the server log telling us where all that time is being spent.

Another thing that is puzzling is why the auto-refresh didn't kick in. Since the picture data is downscaled, the auto-refresh should get scheduled soon after each window paint. -d refresh would tell us more about that.

This is OK now, but maybe a little high at times (max=84) - preventing this from getting too high would help perceived latency and may impact the scaling heuristics (maybe this is how we slow down paint on resize more generically):

damage.data_queue.size.50p=14
damage.data_queue.size.80p=38
damage.data_queue.size.90p=51
damage.data_queue.size.avg=21
damage.data_queue.size.cur=0
damage.data_queue.size.current=0
damage.data_queue.size.max=84
damage.data_queue.size.min=0

Now, the problematic window (excerpts):

window[3].class-instance=('Google-chrome (/home/jimador/.config/google-chrome-unstable)', 'Google-chrome')
window[3].title=Wikipedia – Die freie Enzyklopädie - Google Chrome
window[3].size=(1880, 1012)
window[3].batch.delay.cur=20
window[3].csc=swscale
window[3].csc.dst_format=YUV420P
window[3].csc.dst_height=674
window[3].csc.dst_width=1252
window[3].scaling=(2, 3)
window[3].encoder=x264
window[3].encoding.quality.cur=66
window[3].encoding.speed.cur=97

So we are downscaling by 2:3 because of the low automatic quality. The quality is low (66) because of:

window[3].encoding.quality.backlog_factor=100
window[3].encoding.quality.batch_factor=33
window[3].encoding.quality.latency_factor=3273

This is the documentation around that code:

    # quality:
    #    0    for lowest quality (low bandwidth usage)
    #    100  for best quality (high bandwidth usage)
    # here we try minimize client-latency, packet-backlog and batch.delay

So, the problem comes from the client latency getting too high. If we figure out why that is, we can probably fix this.

Note: this does not explain why re-connecting does not fix the issue, it should.

Tue, 10 Jun 2014 11:30:57 GMT - Antoine Martin:

Please see comment:2, as we really need to figure out what is going on, and not just which changeset fixes the symptoms. In particular, why the auto-refresh didn't kick in, why re-connecting doesn't fix things, etc..

r6703 is an important fix (a bug which has gone unnoticed for far too long - backported), which may reduce performance..
r6704 is more of a band aid: it batches more during resizing, a bit of an ugly workaround
r6707 tries to prevent encoding queue backlogs: the problem here is that the queue is shared for all the windows, and until now we only kept track of the backlog of screen updates after they were queued for sending to the client, not before encoding.

Tue, 10 Jun 2014 18:03:00 GMT - J. Max Mena:

I'm attaching a 592stats.txt file to the ticket. I booted it up, and connected. A few seconds after connecting it became unread-ably blurry. I tried a few tabs, and refreshing and it never went out of blurry, so everything printed is with it being blurry. I re-ran the same test with -d refresh, and it was the exact same; starts out fine, then becomes blurry. I'll upload the output with a 592refresh.txt

This was all running the r6713 Windows client. I'll test the Linux and OSX clients later today and report back.

Also of note, chrome opened up running full-screen both times(or at least really close to it, which is where it was when I last closed it).

Tue, 10 Jun 2014 18:13:16 GMT - J. Max Mena: attachment set

attachment set to 592refresh.txt

Re-ran the test after realizing I never ended the Xpra server just to be sure I got all the logs. This time, I connected and it became blurry, then after opening a wiki entry it snapped back into the normal readable mode. I rapidly resized it and managed to get it back into the buggered state running almost full-screen. Upon getting the buggered state to stick, I ended the session.

Tue, 10 Jun 2014 18:26:04 GMT - J. Max Mena: attachment set

attachment set to 592stats.txt

re-ran the test to get some more in-depth log files. Upon connecting it started out blurry, but after browsing the wikipedia a bit, it became readable again. Once in the readable state I tried to get back to the blurry state. After leaving it on a webpage and resizing the window to a small(roughly 600x600) window, and dragging it back to fullscreen 1080p caused it to enter the blurry state again. Once in the blurry state it would snap back to readable, but then back to the blurry state. Once in the consistently burry state, I haven't found a reliable way out of it. It seems resizing the window is the most reliable way to get it into the blurry state, but I have seen it become this way without any resizing at all before.

Tue, 10 Jun 2014 20:00:09 GMT - J. Max Mena:

Retested with r6713 Fedora 20 and OSX client. I was able to reproduce with the Fedora 20 client, but unable to reproduce with the OSX client.

Wed, 11 Jun 2014 00:52:05 GMT - Antoine Martin:

The tests above were performed with an out of date server:

2014-06-10 11:17:32,859 xpra server version 0.13.1 (r6587)

and

2014-06-10 11:07:41,609 xpra server version 0.13.1 (r6587)

So we don't know anything about the fixes from yesterday. Also, please state which version and revision was used, I had to look at the download area to figure out that r6713 is probably a stable 0.13.4 build I made yesterday, and not the trunk version with the fixes..

Wed, 11 Jun 2014 03:25:18 GMT - Antoine Martin:

Thanks for the logs.

I've added your screen configuration (modeline) in r6737
found a bug (fixed in r6714 - will backport): on rare occasions, we were using uninitialized values, causing some invalid values to get recorded in our internal statistics, this may have caused some problems with the heuristics:
```
record_latency: took   12.5 ms round trip, 1402424289809.3 for echo,    5.0 for decoding of     1902 pixels, \
    -4493765 bytes sent over the network in    7.5 ms, 1402424289804.3 ms for echo
```

(negative number of bytes! timed as 44 years instead of mere milliseconds!) But this should have been very rare, and would probably have resolved itself after about 500 screen updates. (which comes quicker than you would think)

another bug (fixed in r6717 - will backport): this would have caused the auto-refresh to fail, and explains why the picture stays bad even when it doesn't get updated (but this doesn't explain why the picture goes bad in the first place)

I believe this is the big one: r6721 (can also cause the picture to stay bad) - unfortunately far too intrusive to backport... so all I can suggest is that you disable scaling until the next release.

this one makes a massive difference when watching videos in a non-fullscreen part of a window: r6728. Unfortunately it sort of breaks auto-refresh, so this can't be backported and I'll need to do more work on this.

r6729 + r6730 + r6732 also help in preventing the auto-refresh from kicking in too often (as it kills performance when we also have video playing)

r6733 smoothes things a bit

there are spikes in screen update processing latency in your log samples. Some that take well over a second to roundtrip:

record_latency: took 1152.2 ms round trip, 1142.8 for echo,  203.0 for decoding of   691456 pixels, \
    583865 bytes sent over the network in  949.2 ms,  939.8 ms for echo
record_latency: took  991.2 ms round trip,  989.7 for echo,  290.0 for decoding of  1478912 pixels, \
    66217 bytes sent over the network in  701.2 ms,  699.7 ms for echo

This should be less of a problem with the changes in trunk from comment:3, but doesn't explain why it is happening.

It looks like the problem is bandwidth related, which could explain why I can't reproduce it on my fast LAN. It is a bit odd how this seems to happen with nothing preceding it:

2014-06-10 11:18:35,577 record_latency: took    4.8 ms round trip,    4.8 for echo,    3.0 for decoding of     7290 pixels, \
    320 bytes sent over the network in    1.8 ms,    1.8 ms for echo
2014-06-10 11:18:38,521 record_latency: took 2185.6 ms round trip, 2185.5 for echo,  309.0 for decoding of   521696 pixels, \
    47284 bytes sent over the network in 1876.6 ms, 1876.5 ms for echo

Why would 47KB take almost 2 seconds to round trip? Especially when the previous update, sent 3 seconds before took no time at all (1.8ms). Could there be some random drop outs on your lan?

Wed, 11 Jun 2014 18:38:56 GMT - J. Max Mena:

Retested with r6737 Windows client against a r6737 Fedora 20 Server. I re-ran similar tests, such as perusing Wikipedia at a large resolution and looking up a few German words in wiktionary. Other than some slight blurring when scrolling or resizing the window, I could not get the blurry state to stick for a significant amount of time. At one point I thought I had it stuck but it snapped back into a readable state. It is still reproducible, but almost impossible to get the blurry state to stick. Now it will go between being blurry and readable. Sometimes it stays readable, sometimes it'll stop on blurry for a period before going back to 'hunting' for the readable state.

I'll attach some more stats and refresh log files shortly.

Wed, 11 Jun 2014 18:40:19 GMT - J. Max Mena: attachment set

attachment set to 592r6737stats.txt

-d stats piped into a .txt while perusing wikipedia. Never got the blurry state to stick around. A couple of times it jumped back into being blurry, but it recovered quickly (less than a second).

Wed, 11 Jun 2014 18:41:47 GMT - J. Max Mena: attachment set

attachment set to 592r6737refresh.txt

Similar test as the stats file. Again, could not get the blurry state to stick. Sometimes when navigating back to a tab it would be blurry for a short period of time, but it recovered.

Thu, 12 Jun 2014 03:24:35 GMT - Antoine Martin:

Please check with current 0.13.x, is that good enough?

I am moving some of the remaining issues to #596

Tue, 17 Jun 2014 20:02:11 GMT - J. Max Mena:

Retested with 13.6 r6783 Win7 64-bit client and Fedora 20 server:

Blurriness is still reproducible with large window sizes and scrolling, however blurriness seems to disappear relatively quickly.
However, if a video is embedded in a website(like a YouTube? embedded video), this causes the blurriness to stick noticeably harder while playing. Pausing the video causes the page to return to "normal" behavior (Somewhat expected, based off what I read from #596)
Also of note, partial webpage refreshes, like hiding comment sections, are causing the whole webpage to become blurry for a short period.(again, somewhat expected)

I can provide some more logs if requested, and screen-shots as well. Blurriness has definitely improved in this latest test.

Wed, 18 Jun 2014 10:49:04 GMT - Antoine Martin: status changed; resolution set

status changed from new to closed
resolution set to fixed

This will have to do for 0.13, closing, please see #596.

Sat, 23 Jan 2021 05:00:18 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/592