xpra icon
Bug tracker and wiki

Opened 7 months ago

Closed 6 months ago

#1781 closed defect (fixed)

Refresh bug in certain applications

Reported by: J. Max Mena Owned by: J. Max Mena
Priority: major Milestone: 2.3
Component: encodings Version: trunk
Keywords: Cc:

Description

I finally managed to reliably reproduce this - it's easiest in Firefox and Chrome on https://sfbay.craigslist.org/search/cta (a site for classifieds - this sub-site happens to be for cars, I've gotten it to trigger on the Motorcycles one as well).

My client and server are both Fedora 26 machines running trunk r18661 built from source.

I started my server with: XPRA_BANDWIDTH_DETECTION=0 xpra start :13 --bind-tcp=0.0.0.0: --start-new-commands=yes --start-child=xterm --start-child=google-chrome --start-via-proxy=no --systemd-run=no --no-daemon (running without the bandwidth limits makes it easier to detect) and attached with xpra attach tcp:ip.

Once attached I nagivated to the link I provided above. Just mousing over a few of the ads (and cycling through the photos) is enough to induce a state where the client is constantly repainting. Continued use of this site makes the paints more and more aggressive to the point where it becomes almost impossible to type into the Xterm.

I ran two different sessions - one to get the xpra info of both a fresh and a bugged state. And then a second one with -d refresh, compress enabled. On the second run I got into the bugged state almost instantly - I let it sit for at least 10-20 seconds before trying to stop the session with the Xterm. Hopefully that's enough.

Attachments (9)

1781freshinfo.txt (149.2 KB) - added by J. Max Mena 7 months ago.
Fresh xpra info
1781buggedinfo.txt (150.8 KB) - added by J. Max Mena 7 months ago.
After stuck in a constant refresh state
1781refreshcompresstrimmed.log (325.2 KB) - added by J. Max Mena 7 months ago.
uncompressed and trimmed log - the first few seconds, and the last few seconds (at the point where it was sitting and not being interacted with)
1781log.tar (20.0 KB) - added by J. Max Mena 7 months ago.
tarball of the entire log - tarred with -caf
1781log.tar.xz (192.7 KB) - added by J. Max Mena 7 months ago.
(properly) compressed full log
1781damagerefreshregionsubregion.log (3.1 MB) - added by J. Max Mena 6 months ago.
1781loground2.log (1.0 MB) - added by J. Max Mena 6 months ago.
1781loginfo.txt (170.6 KB) - added by J. Max Mena 6 months ago.
1781 paint bug.png (701.9 KB) - added by J. Max Mena 6 months ago.
Stuck paint state - note the boxes along the edge

Change History (22)

Changed 7 months ago by J. Max Mena

Attachment: 1781freshinfo.txt added

Fresh xpra info

Changed 7 months ago by J. Max Mena

Attachment: 1781buggedinfo.txt added

After stuck in a constant refresh state

Changed 7 months ago by J. Max Mena

uncompressed and trimmed log - the first few seconds, and the last few seconds (at the point where it was sitting and not being interacted with)

Changed 7 months ago by J. Max Mena

Attachment: 1781log.tar added

tarball of the entire log - tarred with -caf

comment:1 Changed 7 months ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena

Your log file is empty:

$ tar -xvf 1781log.tar 
1781refreshcompress.log
$ ls -la 1781*
-rw-rw-r--. 1 antoine antoine 20480 Mar  3 12:40 1781log.tar
-rw-rw-r--. 1 antoine antoine 10240 Mar  3 00:38 1781refreshcompress.log
$ wc -l 1781refreshcompress.log 
0 1781refreshcompress.log

In any case, never put a single file in a tar file. That never does anything useful except increasing the file size, tar is not a compressed format, it is archival format.
Use xz instead.

As for the log sample attached, it is trimmed just around the time where things got interesting so I don't really know how things got into this state.

Changed 7 months ago by J. Max Mena

Attachment: 1781log.tar.xz added

(properly) compressed full log

comment:2 Changed 7 months ago by J. Max Mena

Owner: changed from J. Max Mena to Antoine Martin

Well I've learned something today:

  • The a flag on tar will auto use the correct compression based on the name, rather than magically auto-selecting the proper one based on what I assumed it was before

I've properly tarred (and verified that it's correct) the full log from start to finish.

Last edited 7 months ago by J. Max Mena (previous) (diff)

comment:3 Changed 7 months ago by Antoine Martin

As per comment:1:

never put a single file in a tar file. That never does anything useful except increasing the file size, tar is not a compressed format, it is archival format. Use xz instead.

200	1781log.tar.xz
196	1781refreshcompress.log.xz

comment:4 Changed 7 months ago by Antoine Martin

Status: newassigned

I believe this is caused by r17540, triggering a refresh loop.
This bug is present in the 2.2.x branch, but I believe that the bandwidth detection changes in 2.3 (#999) make it easier to trigger.

FYI: you can generally avoid refresh related problems by turning off auto-refresh.

r18683 disables lossy refresh in the 2.2.x branch.
I will find a better solution for 2.3.

Last edited 7 months ago by Antoine Martin (previous) (diff)

comment:5 Changed 7 months ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena
Status: assignednew

My guess in comment:4 was based on log events like this one:

compress:   0.1ms for    2x1    pixels at 1126,965  for wid=2     using      webp with ratio 525.0%  (    0KB to     0KB), sequence  3608, client_options={'flush': 3, 'quality': 100, 'rgb_format': 'BGRX'}
compress:   0.3ms for    3x16   pixels at 1126,22   for wid=2     using      webp with ratio  21.9%  (    0KB to     0KB), sequence  1165, client_options={'flush': 3, 'quality': 100, 'rgb_format': 'BGRX'}

We shouldn't be using "webp" for such tiny screen updates.

But it turns out that this may not be fixed by r18683, r18685 is likely to be a more correct solution. (not sure this should be backported though).

It's not clear to me which part of the screen was refreshing unnecessarily. Can you specify the coordinates or at least take a screenshot with paint boxes turned on.
And assuming that you can reproduce this reliably still, please try turning off features to see if they are involved in this, ie: video region detection, refresh delay, etc..

comment:6 Changed 7 months ago by J. Max Mena

Owner: changed from J. Max Mena to Antoine Martin
  • Upped client and server to r18711

I no longer see the excessive flickering that made it obvious what was going on. However, my session still bogs down to the point of being unusable in the same manner, and the logs still show a flood of Webp refreshes -d refresh (see below). Turning auto-refresh off xpra control :13 auto-refresh 0 makes the issue go away; alternatively detaching and re-attaching help, but the issue will crop up again eventually.

2018-03-13 13:17:53,688 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:53,688 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(474, 78, 637, 28), refresh regions=[])
2018-03-13 13:17:53,735 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:53,901 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,067 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,290 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,368 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,492 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,585 auto refresh:  h264 screen update (actual quality=  0, lossy= True), added pixels to refresh regions, scheduling refresh in 150ms (pct=0, batch=28) (region=rectangle(0, 73, 1128, 804), refresh regions=[])
2018-03-13 13:17:54,586 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 73), refresh regions=[])

I also managed to get a traceback once or twice(something about video subregion), but can't seem to get it again for this comment.

comment:7 Changed 7 months ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena

Hmmm, looking at this again, maybe r18712 will fix this.
This code is far from being new though, so I would have expected it to be reported sooner (every version going back to pre 1.0 releases would be affected, unless somehow a changeset like r17480 made it easier to trigger).

If you still hit the issue, please provide "-d damage,refresh,regionrefresh,subregion" with more than a few lines and maybe also with "-d encoding" added. And if it does fire similar large-ish webp refreshes, does the region being refreshed match xpra info | grep video_subregion?

The traceback could be critical, please always save and report ALL tracebacks.

comment:8 Changed 6 months ago by J. Max Mena

Owner: changed from J. Max Mena to Antoine Martin

Upped client and server to r18739:

I can still trigger this exact same state fairly reliably - by opening Chrome and navigating to [sfbay.craigslist.org] and navigating a few of the ads (cars and motorcycles in particular). Oddly enough it was far more difficult to trigger while the logs were running, at least in my case. It took a few tries yesterday afternoon, but I managed to get it fairly quickly this morning before the log file got way too large. Running with the OpenGL paint boxes makes it much easier to tell when you've gotten into the correct state - it's most apparent when you see a ton of small paint boxes along the edges of the windows. I'll attach the log you asked for.

NOTE: I switched from a browser and terminal start children to two start terminal children, and I used one for starting the browser and the other to stop the session quickly.

As you mentioned as well, I double checked to see if they matched from xpra info | grep video_subregion and....they kind of do.

From my Xpra info (different session than the logs I'll be attaching):

window.3.video_subregion.rectangle=(1, 1, 1127, 965)
window.3.video_subregion.score=99
window.3.video_subregion.scores.(1, 1, 1127, 965)=99
window.3.video_subregion.set-at=7354
window.3.video_subregion.supported=True
window.3.video_subregion.time=65861
window.3.video_subregion.video-mode=False
window.3.video_subregion.width=1127
window.3.video_subregion.x=1
window.3.video_subregion.y=1

and from the logs (same session as the one I got the info from):

2018-03-16 09:45:43,432 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1, 106, 4, 856), refresh regions=[])
2018-03-16 09:45:43,470 remove_refresh_region(rectangle(161, 301, 6, 13)) updated refresh regions=[]
2018-03-16 09:45:43,470 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(161, 301, 6, 13), refresh regions=[])
2018-03-16 09:45:43,479 refresh_subregion([R(0, 3, 1, 960)])
2018-03-16 09:45:43,503 remove_refresh_region(rectangle(1, 1, 1126, 105)) updated refresh regions=[]
2018-03-16 09:45:43,504 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1, 1, 1126, 105), refresh regions=[])
2018-03-16 09:45:43,505 remove_refresh_region(rectangle(1126, 1, 2, 383)) updated refresh regions=[]
2018-03-16 09:45:43,505 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1126, 1, 2, 383), refresh regions=[])
2018-03-16 09:45:43,600 refresh_subregion([R(0, 0, 1128, 3), R(1124, 3, 4, 959), R(0, 962, 1128, 4)])
2018-03-16 09:45:43,602 refresh_subregion([R(3, 0, 1122, 1)])
2018-03-16 09:45:43,602 remove_refresh_region(rectangle(0, 0, 1129, 967)) updated refresh regions=[]
2018-03-16 09:45:43,602 refresh_subregion([R(1, 1, 1126, 105), R(1, 106, 4, 856), R(1124, 106, 3, 856), R(1, 962, 1126, 3)])

Note the (1, 1, 1126, 105) from the logs and the (1, 1, 1127, 965) from the info. The start coordinates are the same, and the ending X-coordinates almost match, but the subregion from the info is much taller than what's getting auto-refreshed. That feels like it's too big of a coincidence, but I have a feeling you know a lot more about what's going on here.

Changed 6 months ago by J. Max Mena

comment:9 Changed 6 months ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena

Note the (1, 1, 1126, 105) from the logs and the (1, 1, 1127, 965) from the info.

That's very likely to be because the browser paints the video area using horizontal bands. This data from the log file uses a different geometry but the idea is the same (here with 58 pixel high bands):

damage(5, 48, 1119, 58, {})    wid=3, scheduling batching expiry for sequence 1809 in 34.0 ms
damage(5, 106, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
damage(5, 164, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
...
damage(5, 860, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
damage(5, 918, 1119, 44, {})   wid=3, using existing delayed {} regions created 0.0ms ago

We then run some clever code to identify the "real" video region from that. (usually by aggregating the horizontal bands together)
We also round down the dimensions (1127 becomes 1126) because video encoders often use colourspace subsampling, which requires even dimensions to operate.

This log file looks like it would be useful, but I'm not sure where to look (~25000 lines in there): did the problem appear at the start? If not, when?
Do you have the corresponding "paint boxes" screenshot so I can try to reconcile? Or at least the "xpra info"?

comment:10 Changed 6 months ago by J. Max Mena

I managed to get it into the stuck state in the last few seconds of the log - when the flood of webp paints start is usually when it's in the stuck state.

I'll try again and post a new log with screenshots and an xpra info.

comment:11 Changed 6 months ago by J. Max Mena

Owner: changed from J. Max Mena to Antoine Martin

Okay got myself stuck into the state really quick with the logs running, so I'll attach them and a screenshot and an Xpra info.

Changed 6 months ago by J. Max Mena

Attachment: 1781loground2.log added

Changed 6 months ago by J. Max Mena

Attachment: 1781loginfo.txt added

Changed 6 months ago by J. Max Mena

Attachment: 1781 paint bug.png added

Stuck paint state - note the boxes along the edge

comment:12 Changed 6 months ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena

From the log, we see refreshes following the last add_video_refresh on line 7802:

refresh_subregion([R(1124, 3, 4, 184), R(1124, 207, 4, 25), R(1124, 251, 4, 694), R(1124, 958, 4, 3), R(1124, 961, 4, 1), R(0, 962, 1128, 1)])

Which would match the right and bottom edges of the window where the black paint boxes are on the screenshot: window.3.size=(1129, 967).
Just noticed that webp was missing from the default paint colours definition, which is why it showed up as black rather than pink, this is fixed in r18745.

Then each one of those regions gets refreshed, ie:

auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 962, 1128, 1), refresh regions=[])

Yet, after about 130ms, we're back to refreshing the same list again. Without seeing any damage(..) events in between.

That's because the video area shrank and we then request the areas that are no longer managed as video to get refreshed... and the method fires repeatedly instead of just once. The minimal fix for this is r18747.
This bug affected all versions that have video region support. (some recent optimizations may have made it easier to hit)

r18748 is a more complete code update in this area and r18749 is another related minor improvement.

comment:13 Changed 6 months ago by J. Max Mena

Resolution: fixed
Status: newclosed

Meant to close this one but somehow I never got around to it. I'll just blame that on being retasked.

Anyways, post r18749, I no longer see the flickering or excessive webp refreshes. I do see some fairly consistent small updates after a short time in some cases, but I can't rule out odd Firefox or Chrome painting behavior as being a culprit, and it doesn't seem to have any notable impact on my session like I was seeing before.

Closing

Note: See TracTickets for help on using tickets.