Xpra: Ticket #249: OSX's UI thread is blocked / decoding without using the UI thread to serialize access

As of r2662, we serialize the decoding of frames by doing the initial processing of draw packets from the UI thread - this prevents a race with the "new-window" packets. (see #242 for details)

It would be nice to avoid this, as we always try to minimize the time spent in the UI thread to maximize interactivity/latency/throughput.

Also, OSX has a nasty habit of blocking the UI thread, see: https://mail.gnome.org/archives/gtk-osx-users-list/2013-February/msg00005.html

Unfortunately, it's not as simple as reverting the change for OSX... All callbacks run from the UI thread at present, that's because:

at least one of the callbacks does the actual UI paint
the code that updates the backing pixmap is GDK code, which therefore requires the UI lock..)

And so the ACK won't be sent until the UI thread resumes. Which means it can take many seconds before that happens, and the server will slow down the refresh rate drastically, and may even print some warnings to the log... OSX PITA.

Wed, 06 Feb 2013 07:34:35 GMT - Antoine Martin: attachment set

attachment set to decode-nonUIthread-for-osx.patch

changes OSX to not use the UI thread to schedule decode, which is not enough to fix the problem

Fri, 08 Feb 2013 04:28:42 GMT - Antoine Martin: status, milestone changed; owner set

owner set to Antoine Martin
status changed from new to accepted
milestone changed from 0.8 to 0.9

Just thought of a better solution: using the patch above, when we see from the paint thread that requests aren't being processed by the UI thread (for longer than N ms), we can:

tell the server not to bother sending any updates (new "pause" packet?)
cancel all pending window repaints (a simple flag)
schedule a call to reset everything: when the UI thread runs again, we clear the repaint flag and tell the server to start again (new "resume" packet?)

We may also be able to bring back the osx and non-osx codepath back together by always doing these checks in the paint thread and scheduling the _do_draw call via idle_add (which would solve the race for osx).

Wed, 20 Mar 2013 14:19:23 GMT - Antoine Martin: milestone changed

milestone changed from 0.9 to 0.10

Sun, 07 Jul 2013 06:45:52 GMT - Antoine Martin: status changed

status changed from accepted to new

r3786 implements a fix with the approach from comment:1, see changeset for full details, including testing setup.

Note: when testing via env vars, we may encounter slower than normal xpra client exit and even bugs on ctrl-c or exit - nothing to worry about unless this is reproducible without the env vars.

What is not solved yet by this changeset:

a number of paint events may have accumulated on the client before we tell the server to stop sending them, and those will be full decompressed frames so potentially quite big... this memory will be locked until we resume.
still TODO: re-send all the metadata attributes as they may have changed (or we could just send them anyway since they are generally quite small?)

Sun, 07 Jul 2013 06:46:12 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to alas

please test (see comment:3 - preferably using a real mac as client)

Note: I am seeing sound overruns on my mac vm (#362 again) and the mac menu is not populated yet in trunk (needs to be hooked again)

Thu, 11 Jul 2013 00:33:38 GMT - alas: attachment set

attachment set to osxXpraTests.txt

Initial runs of osx client (beta) tests

Thu, 11 Jul 2013 00:41:05 GMT - alas:

Using an osx 10.6.8 laptop (real mac)... Starting with either the gui or command-line, all of the PNG and JPEG encodings produced start-child xterms that were hollow and non-responsive (server-side logs indicated that it was convinced that there was focus, client-side indicated some serious, or at least verbose, errors were going on).

With the rgb encoding, the client behaved indistinguishably from a windows client (r3778 like, complete with all the lazarus browser focus issues in the same places)... except for the inability to copy & paste (the clipboard once again reaching out from hell).

I've attached an account of steps and server & client side logs.

Thu, 11 Jul 2013 01:35:45 GMT - Antoine Martin:

Note: this ticket is about UI thread lock issues on osx when accessing the global menu.

There seems to be an OpenGL issue which I am adding to #226 The focus issues belong in #214 and #366 (and which encoding is used is probably not relevant as this is race and is probably fairly random) The clipboard issue is already recorded as #318

The PNG and JPEG errors look like an issue with pillow/PIL:

ImportError: The _imaging C module is not installed

This one is client side and explains why PNG/JPEG do not work with the mac client - should be fixed in r3817

Then in the server logs there is another one: "Suspension not allowed here" which is this this PIL/Pillow bug and we only trigger it in 0.10 servers because we now optimize the compression in some cases (depending on the speed parameter) - nothing to do with osx, this should occur with all clients, moved to #375

The warnings:

WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

Can be fixed by applying a patch to pygobject at build time, see https://bugzilla.redhat.com/show_bug.cgi?id=790053

More comments hidden in the log file:

Quit from tray again closes but server doesn't recognize the disconnection.

That's because the tray and global menu aren't hooked up, so we exit without cleaning up. The server should notice sooner or later (a minute maximum)

Fri, 12 Jul 2013 02:53:50 GMT - alas:

Scrolling over the global menu doesn't create any interaction at all. Clicking on the global menu freezes the xpra session while it is out of focus. Once focus returns to the xpra session it un-freezes (though after about 10 minutes a sound backlog causes a sound stutter for a few seconds - before recovering rather gracefully).

...
2013-07-11 19:51:01,897 get_backlog found some damage acks that have been pending for too long, expiring them: [9076]
2013-07-11 19:51:16,884 delayed_region_timeout: sending now - something is wrong!
2013-07-11 19:51:16,945 get_backlog found some damage acks that have been pending for too long, expiring them: [9077]

Fri, 12 Jul 2013 04:21:49 GMT - Antoine Martin:

Hah, damn, we shouldn't be getting those messages:

get_backlog found some damage acks that have been pending for too long, expiring them: (...)
delayed_region_timeout: sending now - something is wrong!

When we suspend the screen updates (and also when we resume), we clear the statistics containing the backlog list.

The only places where we will check this backlog list are (checking the list is what fires those error messages):

in damage(..) when processing new damage requests, but this is guarded by a check of the suspended flag.
in may_send_delayed but this is meant to shortcut out since there should not be any pending damage_delayed areas: this variable is only assigned in damage(..) so the same guard applies.

And there can be no threading issues here because all these functions run from the UI thread.

afarr:

are you sure your server is fully up to date? (full release no please)
those error messages showed up whilst the mouse was in the global menu? or after it left? how long after it entered? (roughly)

Fri, 12 Jul 2013 21:24:23 GMT - alas:

The server is fedora 18 and was updated a few days ago. (Re-updated today, then ran the following.)

[JimBeam@jimbeam PIL]$ uname -a
Linux jimbeam.spikes.eng 3.9.6-200.fc18.x86_64 #1 SMP Thu Jun 13 18:56:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The messages showed up after I used the mouse to change focus from the global menu back to the xpra browser session.

Today I just got the following after 5 min (with x264 encoding), on the server side. 2013-07-12 13:27:22,094 not starting sound as we are suspended ... the video didn't continue where it had stopped... and the sound did not resume.

I'll attach a file with client side logs and further elaboration.

Fri, 12 Jul 2013 21:49:53 GMT - alas: attachment set

attachment set to osxX264GlobalMenuTests.txt

osx r3830 tests of global menu pausing, client logs

Fri, 12 Jul 2013 23:33:30 GMT - alas:

Testing showed the same results, as far as global menu is concerned, with vpx encoding.

Sat, 13 Jul 2013 04:07:16 GMT - Antoine Martin:

Please don't mix log samples and comments - makes it hard to know where to look for things, to reply, and to distinguish one sample from another. Especially if some of the log samples are one line long, they should be inlined in the ticket directly using code blocks.
Also, if there are sound issues, focus issues, etc. They do not belong in this ticket, we already have tickets for those.
The log mentions just 2 attempts, were there more? what was the difference? How repeatable were the problems?

Did you see these messages again? (server-side):

get_backlog found some damage acks that have been pending for too long, expiring them

If so, please post the server log in debug mode (-d all).

When focus returns to the xpra browser session the session un-pauses, but the video doesn't restart from the point where it was paused. (..)

FYI: the session should resume when the global menu loses focus, not necessarily when the session itself re-gains focus.
There is no way we can safely resume the video from the point where it was stopped, we have to drop frames at some point (the ones that occurred whilst the UI was stalled) so we do that as soon as we resume: we jump straight to the current state instead of playing all the stalled frames first. (there may still be a few stalled frames in the buffer - far fewer than before, a couple of seconds maximum) In fact, that is the whole point of this ticket: we don't accumulate frames unnecessarily on the client, as this would eventually cause an out-of-memory condition and a client crash (or worse).

Previously, the client's memory usage would grow by many MB/s when the UI was stalled. With this latest code, it will only grow for the first couple of seconds before the server stops sending any more frames. (we could even drop those stalled buffers on the client, but I don't think it is worth the extra work, the real solution worth spending time on is to find a way not to stall the UI thread at all)

Sat, 13 Jul 2013 16:04:35 GMT - alas:

There were three or four tests run with each x264 and vpx. The case I began the x264 logs with was the fourth try, which seemed the best behaved (because there didn't seem to be erratic sound and video problems confusing issues). I included some information from the first and second tests, which included some mention of those other problems, just in case the interactions might potentially shed some light on one or the other of the issues behaviors.

The vpx tests, as far as the global menu behavior (and as far as the sound and video) produced the same results, as far as bugs- thus I didn't post any logs about the vpx (here, instead posting the sound issue I found there in #379).

No, I didn't see any more of the get_backlog errors, which is why they aren't in the log I posted.

The server-side message that it wasn't going to resume sound after the server pause showed up after the global menu lost focus every time. 2013-07-12 14:00:46,830 not starting sound as we are suspended This message is output whenever the global menu related suspension ends.

The client outputs a few messages that it is going to restart and it does indeed restart. I mentioned the detail of it jumping to the end of the video rather than having paused it because I wasn't sure what the expected behavior was- if it was to pause then I wanted you to know it was not doing so. Since that isn't, then it was behaving as expected.

It looks like the client is handling the pause relatively elegantly now. It just remains to convince the server to restart the sound, despite having been suspended.

Sun, 14 Jul 2013 03:18:29 GMT - Antoine Martin:

No, I didn't see any more of the get_backlog errors, which is why they aren't in the log I posted.

Then I suspect, either your client or server were not up to date when you did see those messages. Took me a long time to investigate..

not starting sound as we are suspended

I believe this message occurs because the overrun events are delivered via the UI thread, and so they get delayed until the UI thread resumes. By the time they get sent, we have suspended things already, including the sound. I may turn it into a debug (hidden by default) message since it looks hard to avoid.

But what I am really interested in is seeing if we do try to resume the sound or not when the client's UI thread resumes, and if we are, where it gets stuck. Please post debug logs of both client and server, trimmed to show just the time around the UI thread suspend/resume cycle. (you can use tail -f -n 0 then ctrl-C to grab from an existing log file)

(btw, not sure how, but you seem to have managed to add an 's' to 'sound' in your cut&paste - I've edited it out)

Sun, 14 Jul 2013 05:06:11 GMT - Antoine Martin: status changed; resolution set

status changed from new to closed
resolution set to fixed

And since the "sound not restarting" info belongs in #379, I am closing this ticket.

The proper fix for all the UI thread stall is to use a newer toolkit, either gtk3 (#90) or qt, or even going native using objc/pyobjc.

Thu, 16 Jun 2016 12:27:40 GMT - Antoine Martin:

Found an explanation for what is happening with the UI thread in OSX : runLoop options: Add a timer to the main run loop in default mode that writes out something every second and enter the menu with your mouse. You will see the difference …

So it sounds like this should be fixed in GTK or glib.

Sat, 23 Jan 2021 04:49:35 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/249