Xpra: Ticket #249: OSX's UI thread is blocked / decoding without using the UI thread to serialize access

As of r2662, we serialize the decoding of frames by doing the initial processing of draw packets from the UI thread - this prevents a race with the "new-window" packets. (see #242 for details)

It would be nice to avoid this, as we always try to minimize the time spent in the UI thread to maximize interactivity/latency/throughput.

Also, OSX has a nasty habit of blocking the UI thread, see: https://mail.gnome.org/archives/gtk-osx-users-list/2013-February/msg00005.html

Unfortunately, it's not as simple as reverting the change for OSX... All callbacks run from the UI thread at present, that's because:

And so the ACK won't be sent until the UI thread resumes. Which means it can take many seconds before that happens, and the server will slow down the refresh rate drastically, and may even print some warnings to the log... OSX PITA.



Wed, 06 Feb 2013 07:34:35 GMT - Antoine Martin: attachment set

changes OSX to not use the UI thread to schedule decode, which is not enough to fix the problem


Fri, 08 Feb 2013 04:28:42 GMT - Antoine Martin: status, milestone changed; owner set

Just thought of a better solution: using the patch above, when we see from the paint thread that requests aren't being processed by the UI thread (for longer than N ms), we can:

We may also be able to bring back the osx and non-osx codepath back together by always doing these checks in the paint thread and scheduling the _do_draw call via idle_add (which would solve the race for osx).


Wed, 20 Mar 2013 14:19:23 GMT - Antoine Martin: milestone changed


Sun, 07 Jul 2013 06:45:52 GMT - Antoine Martin: status changed

r3786 implements a fix with the approach from comment:1, see changeset for full details, including testing setup.

Note: when testing via env vars, we may encounter slower than normal xpra client exit and even bugs on ctrl-c or exit - nothing to worry about unless this is reproducible without the env vars.


What is not solved yet by this changeset:


Sun, 07 Jul 2013 06:46:12 GMT - Antoine Martin: owner changed

please test (see comment:3 - preferably using a real mac as client)

Note: I am seeing sound overruns on my mac vm (#362 again) and the mac menu is not populated yet in trunk (needs to be hooked again)


Thu, 11 Jul 2013 00:33:38 GMT - alas: attachment set

Initial runs of osx client (beta) tests


Thu, 11 Jul 2013 00:41:05 GMT - alas:

Using an osx 10.6.8 laptop (real mac)... Starting with either the gui or command-line, all of the PNG and JPEG encodings produced start-child xterms that were hollow and non-responsive (server-side logs indicated that it was convinced that there was focus, client-side indicated some serious, or at least verbose, errors were going on).

With the rgb encoding, the client behaved indistinguishably from a windows client (r3778 like, complete with all the lazarus browser focus issues in the same places)... except for the inability to copy & paste (the clipboard once again reaching out from hell).

I've attached an account of steps and server & client side logs.


Thu, 11 Jul 2013 01:35:45 GMT - Antoine Martin:

Note: this ticket is about UI thread lock issues on osx when accessing the global menu.

There seems to be an OpenGL issue which I am adding to #226 The focus issues belong in #214 and #366 (and which encoding is used is probably not relevant as this is race and is probably fairly random) The clipboard issue is already recorded as #318

The PNG and JPEG errors look like an issue with pillow/PIL:

ImportError: The _imaging C module is not installed

This one is client side and explains why PNG/JPEG do not work with the mac client - should be fixed in r3817

Then in the server logs there is another one: "Suspension not allowed here" which is this this PIL/Pillow bug and we only trigger it in 0.10 servers because we now optimize the compression in some cases (depending on the speed parameter) - nothing to do with osx, this should occur with all clients, moved to #375

The warnings:

WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

Can be fixed by applying a patch to pygobject at build time, see https://bugzilla.redhat.com/show_bug.cgi?id=790053

More comments hidden in the log file:

Quit from tray again closes but server doesn't recognize the disconnection.

That's because the tray and global menu aren't hooked up, so we exit without cleaning up. The server should notice sooner or later (a minute maximum)


Fri, 12 Jul 2013 02:53:50 GMT - alas:

Scrolling over the global menu doesn't create any interaction at all. Clicking on the global menu freezes the xpra session while it is out of focus. Once focus returns to the xpra session it un-freezes (though after about 10 minutes a sound backlog causes a sound stutter for a few seconds - before recovering rather gracefully).

...
2013-07-11 19:51:01,897 get_backlog found some damage acks that have been pending for too long, expiring them: [9076]
2013-07-11 19:51:16,884 delayed_region_timeout: sending now - something is wrong!
2013-07-11 19:51:16,945 get_backlog found some damage acks that have been pending for too long, expiring them: [9077]

Fri, 12 Jul 2013 04:21:49 GMT - Antoine Martin:

Hah, damn, we shouldn't be getting those messages:

get_backlog found some damage acks that have been pending for too long, expiring them: (...)
delayed_region_timeout: sending now - something is wrong!


When we suspend the screen updates (and also when we resume), we clear the statistics containing the backlog list.

The only places where we will check this backlog list are (checking the list is what fires those error messages):


And there can be no threading issues here because all these functions run from the UI thread.


afarr:


Fri, 12 Jul 2013 21:24:23 GMT - alas:

The server is fedora 18 and was updated a few days ago. (Re-updated today, then ran the following.)

[JimBeam@jimbeam PIL]$ uname -a
Linux jimbeam.spikes.eng 3.9.6-200.fc18.x86_64 #1 SMP Thu Jun 13 18:56:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The messages showed up after I used the mouse to change focus from the global menu back to the xpra browser session.

Today I just got the following after 5 min (with x264 encoding), on the server side. 2013-07-12 13:27:22,094 not starting sound as we are suspended ... the video didn't continue where it had stopped... and the sound did not resume.

I'll attach a file with client side logs and further elaboration.


Fri, 12 Jul 2013 21:49:53 GMT - alas: attachment set

osx r3830 tests of global menu pausing, client logs


Fri, 12 Jul 2013 23:33:30 GMT - alas:

Testing showed the same results, as far as global menu is concerned, with vpx encoding.


Sat, 13 Jul 2013 04:07:16 GMT - Antoine Martin:

Please don't mix log samples and comments - makes it hard to know where to look for things, to reply, and to distinguish one sample from another. Especially if some of the log samples are one line long, they should be inlined in the ticket directly using code blocks.
Also, if there are sound issues, focus issues, etc. They do not belong in this ticket, we already have tickets for those.
The log mentions just 2 attempts, were there more? what was the difference? How repeatable were the problems?


Did you see these messages again? (server-side):

get_backlog found some damage acks that have been pending for too long, expiring them

If so, please post the server log in debug mode (-d all).


When focus returns to the xpra browser session the session un-pauses, but the video doesn't restart from the point where it was paused. (..)


FYI: the session should resume when the global menu loses focus, not necessarily when the session itself re-gains focus.
There is no way we can safely resume the video from the point where it was stopped, we have to drop frames at some point (the ones that occurred whilst the UI was stalled) so we do that as soon as we resume: we jump straight to the current state instead of playing all the stalled frames first. (there may still be a few stalled frames in the buffer - far fewer than before, a couple of seconds maximum) In fact, that is the whole point of this ticket: we don't accumulate frames unnecessarily on the client, as this would eventually cause an out-of-memory condition and a client crash (or worse).


Previously, the client's memory usage would grow by many MB/s when the UI was stalled. With this latest code, it will only grow for the first couple of seconds before the server stops sending any more frames. (we could even drop those stalled buffers on the client, but I don't think it is worth the extra work, the real solution worth spending time on is to find a way not to stall the UI thread at all)


Sat, 13 Jul 2013 16:04:35 GMT - alas:

There were three or four tests run with each x264 and vpx. The case I began the x264 logs with was the fourth try, which seemed the best behaved (because there didn't seem to be erratic sound and video problems confusing issues). I included some information from the first and second tests, which included some mention of those other problems, just in case the interactions might potentially shed some light on one or the other of the issues behaviors.

The vpx tests, as far as the global menu behavior (and as far as the sound and video) produced the same results, as far as bugs- thus I didn't post any logs about the vpx (here, instead posting the sound issue I found there in #379).

No, I didn't see any more of the get_backlog errors, which is why they aren't in the log I posted.

The server-side message that it wasn't going to resume sound after the server pause showed up after the global menu lost focus every time. 2013-07-12 14:00:46,830 not starting sound as we are suspended This message is output whenever the global menu related suspension ends.

The client outputs a few messages that it is going to restart and it does indeed restart. I mentioned the detail of it jumping to the end of the video rather than having paused it because I wasn't sure what the expected behavior was- if it was to pause then I wanted you to know it was not doing so. Since that isn't, then it was behaving as expected.

It looks like the client is handling the pause relatively elegantly now. It just remains to convince the server to restart the sound, despite having been suspended.


Sun, 14 Jul 2013 03:18:29 GMT - Antoine Martin:

No, I didn't see any more of the get_backlog errors, which is why they aren't in the log I posted.


Then I suspect, either your client or server were not up to date when you did see those messages. Took me a long time to investigate..


not starting sound as we are suspended

I believe this message occurs because the overrun events are delivered via the UI thread, and so they get delayed until the UI thread resumes. By the time they get sent, we have suspended things already, including the sound. I may turn it into a debug (hidden by default) message since it looks hard to avoid.


But what I am really interested in is seeing if we do try to resume the sound or not when the client's UI thread resumes, and if we are, where it gets stuck. Please post debug logs of both client and server, trimmed to show just the time around the UI thread suspend/resume cycle. (you can use tail -f -n 0 then ctrl-C to grab from an existing log file)


(btw, not sure how, but you seem to have managed to add an 's' to 'sound' in your cut&paste - I've edited it out)


Sun, 14 Jul 2013 05:06:11 GMT - Antoine Martin: status changed; resolution set

And since the "sound not restarting" info belongs in #379, I am closing this ticket.


The proper fix for all the UI thread stall is to use a newer toolkit, either gtk3 (#90) or qt, or even going native using objc/pyobjc.


Thu, 16 Jun 2016 12:27:40 GMT - Antoine Martin:

Found an explanation for what is happening with the UI thread in OSX : runLoop options: Add a timer to the main run loop in default mode that writes out something every second and enter the menu with your mouse. You will see the difference …

So it sounds like this should be fixed in GTK or glib.


Sat, 23 Jan 2021 04:49:35 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/249