Xpra: Ticket #569: crash running glxgears and gtkperf in a loop

Fedora 20 x86_64 server, Win XP client.

Started the following client applications from the xterm:

It eventually crashed the xpra server, without any error message...

And now that I have gdb attached to the server process so that I can get a backtrace, it is refusing to crash!

Can you reproduce?

You can attach gdb to the xpra server process with:

gdb /usr/bin/python $XPRA_SERVER_PID
> continue

Then when it crashes, get the backtrace with:

> backtrace

You can find the current value of $XPRA_SERVER_PID with:

pidof -x xpra

Then just leave it all running until you get the crash (if it crashes at all).
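Since the crash may take hours to show up, gdb can also be run non-interactively, so that it waits unattended and writes the backtrace to a file. This is a minimal sketch assuming Python 3.7+ and gdb on the server. The helper names and the xpra-gdb.log filename are hypothetical. It performs the same steps as the interactive session (`gdb /usr/bin/python PID`, then `continue`, then a backtrace) using gdb's real `-batch` and `-ex` options. It asks for `thread apply all backtrace` because the xpra server is multi-threaded, and the other threads' stacks may be useful too.

```python
import subprocess

def find_xpra_pids():
    """Return the PIDs printed by `pidof -x xpra`, or [] if there are none (or no pidof)."""
    try:
        result = subprocess.run(["pidof", "-x", "xpra"], capture_output=True, text=True)
    except FileNotFoundError:
        return []
    return [int(pid) for pid in result.stdout.split()]

def gdb_batch_argv(pid, python="/usr/bin/python"):
    """Non-interactive equivalent of `gdb /usr/bin/python PID`, then `continue`, then a backtrace."""
    return [
        "gdb", "-batch",
        "-ex", "continue",                    # resume the attached process, wait until it stops
        "-ex", "thread apply all backtrace",  # then print a backtrace for every thread
        python, str(pid),                     # program (for symbols) and the PID to attach to
    ]

def capture_backtrace(pid, logfile="xpra-gdb.log"):
    """Attach gdb to pid and block until the process stops; all gdb output goes to logfile."""
    with open(logfile, "w") as log:
        return subprocess.run(gdb_batch_argv(pid), stdout=log, stderr=subprocess.STDOUT).returncode
```

For example, `capture_backtrace(find_xpra_pids()[0])` assumes that the first PID pidof reports is the server. If several xpra processes are running, check which one is the server first. Attaching normally requires running as the same user as the server, or as root, depending on the system's ptrace restrictions.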

It would also be useful to keep an eye on CPU and memory usage. Even though this is a pathological test, the worst possible application to run in xpra, we should handle it as gracefully as we can.
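To keep an eye on CPU and memory over a multi-hour run without watching top, the relevant /proc files can be sampled periodically. This is a Linux-only sketch with hypothetical function names. It assumes you pass it the PIDs to watch, for example from `pidof -x xpra` and `pidof Xorg`. Memory is reported the way ps computes %MEM (resident set size over total physical RAM). CPU is the utime+stime delta over the sampling interval, so a multi-threaded process can exceed 100%.

```python
import os
import time

def parse_kb(text, field):
    """Return the kB value of 'Field:  1234 kB' from /proc/<pid>/status or /proc/meminfo text."""
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)

def parse_cpu_ticks(stat_text):
    """Sum utime + stime (fields 14 and 15 of /proc/<pid>/stat), in clock ticks."""
    # comm (field 2) may contain spaces or ')', so split after the last ')':
    # the remaining list starts at field 3 (state), so field N is at index N - 3.
    rest = stat_text.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])

def read_text(path):
    with open(path) as f:
        return f.read()

def mem_percent(pid):
    """Resident memory as a percentage of physical RAM, like the %MEM column of ps."""
    rss_kb = parse_kb(read_text(f"/proc/{pid}/status"), "VmRSS")
    total_kb = parse_kb(read_text("/proc/meminfo"), "MemTotal")
    return 100.0 * rss_kb / total_kb

def monitor(pids, interval=60.0):
    """Print CPU% and memory% for each pid every `interval` seconds, until interrupted."""
    hz = os.sysconf("SC_CLK_TCK")
    last = {pid: parse_cpu_ticks(read_text(f"/proc/{pid}/stat")) for pid in pids}
    while True:
        time.sleep(interval)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for pid in pids:
            ticks = parse_cpu_ticks(read_text(f"/proc/{pid}/stat"))
            # ticks consumed since the last sample, as a share of wall-clock time
            cpu = 100.0 * (ticks - last[pid]) / hz / interval
            last[pid] = ticks
            print(f"{stamp} pid={pid} cpu={cpu:.1f}% mem={mem_percent(pid):.1f}%", flush=True)
```

For example, `monitor([xpra_pid, xorg_pid])` prints one timestamped line per process per minute. Redirecting that output to a file leaves a record that shows when memory started climbing.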

Fri, 09 May 2014 14:42:01 GMT - Antoine Martin: description changed

Fri, 09 May 2014 15:31:36 GMT - Antoine Martin: description changed

Sat, 10 May 2014 01:04:58 GMT - alas:

I tried the Windows 0.12.5 r6398 client against a Fedora 19 server also running r6398, for about 75 minutes.

No luck (it wouldn't crash).

Client-side CPU oscillates between about 30% and 70%, and memory hovers around 80% (my Windows machine doesn't have much spare memory, just FYI, in case that affects that figure).

Server-side CPU oscillates between about 90% and 111%, though memory use seems negligible (Xorg is consistently using about 37%, while glxgears is closer to 0.3%). I wasn't able to check those server-side numbers while I was running the second round of tests for #567 against an xpra session hosted by a different user on the same Fedora 19 test VM. On the other hand, whatever extra stress that caused wasn't enough to induce a crash either.

Does this test need to be run against a Fedora 20 server? (I haven't taken the time to install glxgears, or much else, on the Fedora 20 server that I supposedly have available.) Come Monday I should be able to run this for hours and use other machines to get work done, so let me know if it needs to be Fedora 20, and I'll abuse whatever systems are needed.

Sat, 10 May 2014 02:59:59 GMT - Antoine Martin:

OK, Fedora 19 should do just as well, but it is worth running again on Monday just to be sure.

If neither of us can reproduce it, we'll just assume it was a fluke and close this ticket.

Keep an eye on the xpra server RAM usage, as gtkperf is likely to expose leaks there if there are any.

Mon, 12 May 2014 22:31:58 GMT - alas:

I ran the test again, same client, same server. Technically I ran it a couple of times, but the first time through I tried to move the glxgears window out from under the winking gtkperf window, and in the process caused everything to crash. (I might try to reproduce that soon.)

Without moving any windows, the test ran like a champ for just shy of 4 hours. I did have to restart glxgears a couple of times, though; I'm not sure whether it just times out or whether that is something I should follow up on.

The client CPU and memory usage were about the same as they had been all along (40-85% CPU, about 86% memory). The xpra process wasn't using any significant CPU or memory, but Xorg's memory usage climbed steadily: from about 30% at one hour, to about 60% after two, to about 90% by three. It seemed to flatten at that point and settle back to about 88% until the end, when it climbed to about 93.9% and then flattened while the session went into spinners.

I had typed continue into gdb, and whatever was going on, it mostly just froze. It had been tracing things all along, until it gave me this:

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f8a6c4f1700 (LWP 651)]
0x0000003ef200e5bb in send () from /lib64/libpthread.so.0

I tried to get a backtrace, but there was no response, just messages about more threads exiting:

[Thread 0x7f8a6bcf0700 (LWP 652) exited]
[Thread 0x7f8a6c4f1700 (LWP 651) exited]
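gdb stops on SIGPIPE by default (its default handling for that signal is stop, print, pass), but in a Python process a SIGPIPE raised inside send() is not necessarily the crash being hunted. CPython sets SIGPIPE to SIG_IGN at startup, so writing to a dead connection fails with EPIPE, which the code can handle, instead of killing the process. Assuming that is what happened here, running `handle SIGPIPE nostop noprint pass` in gdb before `continue` lets it keep waiting for a real crash. The sketch below demonstrates the Python side in Python 3, with hypothetical function names. The 0.12/0.13 server ran on Python 2, where the same condition surfaces as socket.error with errno EPIPE.

```python
import signal
import socket

def sigpipe_ignored():
    """True if this process ignores SIGPIPE, which CPython arranges at startup."""
    return signal.getsignal(signal.SIGPIPE) == signal.SIG_IGN

def send_to_closed_peer():
    """Write to a socket whose peer has gone away and report how Python surfaces it."""
    a, b = socket.socketpair()     # a connected pair standing in for server and client
    b.close()                      # the "client" goes away
    try:
        a.send(b"x")               # the kernel sends SIGPIPE here, which is ignored...
    except BrokenPipeError:        # ...so send() fails with EPIPE instead
        return "BrokenPipeError"
    finally:
        a.close()
    return "sent"
```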

When the client finally lost connection, it output the following:

2014-05-12 15:02:23,806 server is OK again
2014-05-12 15:02:30,667 server is not responding, drawing spinners over the windows
2014-05-12 15:03:29,236 server ping timeout - waited 60 seconds without a response
2014-05-12 15:03:29,641 Connection lost
C:\Program Files (x86)\Xpra\library.zip\xpra\client\gtk_base\gtk_client_window_base.py:103: GtkWarning: gtk_widget_map: assertion `gtk_widget_get_visible (widget)' failed

Meanwhile, the server output this:

2014-05-12 15:01:23,465 delayed_region_timeout: something is wrong, is the connection dead?
2014-05-12 15:01:45,062 XShmWrapper.shmget(PRIVATE, 656180 bytes, 1023) failed, bytes_per_line=2180, width=545, height=300
2014-05-12 15:02:06,516 disabling XShm support following irrecoverable error
2014-05-12 15:03:59,211 read connection reset or aborted for SocketConnection(('', 1201) - ('', 57536))
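The XShm failure line can be cross-checked with a little arithmetic. Assuming 4 bytes per pixel, bytes_per_line = 545 × 4 = 2180. Then 2180 × 300 = 654000, exactly one 2180-byte row short of the 656180 bytes requested, so the request appears to include one extra row. That is inferred from the numbers, not checked against xpra's code. The third shmget() argument, 1023, is 0o1777: IPC_CREAT (0o1000) plus 0o777 permissions. The log does not include errno, so the cause of the failure isn't certain. Given Xorg's memory use near the end of the run, running out of memory (or hitting a System V shared memory limit) is one plausible explanation. This sketch only reproduces the arithmetic; the function names are hypothetical.

```python
IPC_CREAT = 0o1000  # value of IPC_CREAT in Linux <sys/ipc.h>

def xshm_request(width, height, bytes_per_pixel=4, extra_rows=1):
    """Recompute bytes_per_line and the segment size from the XShmWrapper log line.
    bytes_per_pixel=4 and extra_rows=1 are inferred from the logged numbers, not from xpra's source."""
    bytes_per_line = width * bytes_per_pixel
    return bytes_per_line, bytes_per_line * (height + extra_rows)

def decode_shmget_flags(flags):
    """Split shmget()'s shmflg argument into the IPC_CREAT bit and the permission bits."""
    return bool(flags & IPC_CREAT), oct(flags & 0o777)

print(xshm_request(545, 300))      # (2180, 656180)
print(decode_shmget_flags(1023))   # (True, '0o777')
```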

I never did manage to get a backtrace. I hope the rest of this is of some use.

Tue, 13 May 2014 04:01:54 GMT - Antoine Martin:

I tried to move the glxgears window out from under the winking gtkperf window, and in the process caused everything to crash. (I might try to reproduce that soon.)

Yes please.

Xorg memory had been climbing all along - from about 30% at one hour, to about 60% after two, up to about 90%...

That would explain the spinners and the disconnection. If the memory is full, the system is going to swap and collapse soon after. This particular bug is already tracked in #535, so we can ignore memory-related problems for this ticket.

Thu, 15 May 2014 01:15:28 GMT - alas:

I tried the test again with the same server and client versions.

Moving the glxgears window out from under the winking gtkperf window, while a challenge because of focus racing issues, caused no problems. To be sure, I periodically moved the glxgears window around the display, to a second display, back under the gtkperf window, back out from under it, and so on: no issues.

I tried disconnecting the client, then reconnecting, then moving the glxgears window around the displays again: still no issues.

I let the test run for two hours, periodically moving the glxgears window around and around: still no issues.

Oddly, this time the Xorg process was only at around 17% memory after 2 hours (after 2 hours in the previous test, it was at around 60%). Also, the glxgears process never stopped, so I never had to go back to the xterm to relaunch it.

I doubt it's surprising or notable, but glxgears generally stopped spinning while the gtkperf window was active.

Maybe it just wasn't as hot today and the test VM server just behaved better?

Anything else you'd like me to try for this case?

Thu, 15 May 2014 05:22:14 GMT - Antoine Martin:

afarr: were the versions the same as last time? It would be worth trying different encodings. As you said, maybe the server was behaving better (or worse) than last time, and using a different encoding.

Thu, 15 May 2014 22:12:16 GMT - alas:

The versions were the same: 0.12.5 r6398 Windows client, 0.13.0 r6398 Fedora 19 server.

I repeated the test for 2 hours with --encoding=jpeg, which performed closer to what I saw in the first test: glxgears had to be restarted a number of times, and by around the 1:45 mark its FPS dropped to 13.701, then all the way down to 0.002. Moving the glxgears window around didn't cause any particular problems, though, and there were no crashes.

I repeated it for another half hour with --encoding=rgb, reconnecting to the still-running server session. Performance was about the same (glxgears required no restarting).

I repeated it yet again for another half hour with --encoding=png, again reconnecting to the still-running server session. Performance was, again, about the same, though glxgears did require one restart.

No sign of any crashes though.

Sat, 17 May 2014 11:19:42 GMT - Antoine Martin: status changed; resolution set

You had to keep restarting glxgears because xpra is so swamped by gtkperf's requests to create and destroy windows that it struggles to cope with any amount of pixels for glxgears.

If it still hasn't crashed, I think we can assume this is working as well as it should. Closing.

Sat, 23 Jan 2021 04:59:38 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/569