#118 closed defect (fixed)
xpra server dies overnight
Reported by: | Doug Doole | Owned by: | Antoine Martin |
---|---|---|---|
Priority: | critical | Milestone: | 0.3 |
Component: | server | Version: | 0.2.0 |
Keywords: | Cc: |
Description (last modified by )
I've been running xpra 0.2.0 on Ubuntu (11.10 for the server, 11.04 for the client) for the last two days. My client disconnects overnight, and on both days, when I attempt to connect the next morning, the client does not attach:
    doole@andammo:~$ xpra attach ssh:reorx:1
    Connection failed: [Errno 2] No such file or directory
    connection lost: empty marker in read queue
    Connection lost
I took a look at .xpra/reorx-1.log, but I don't see anything obvious. I have attached both reorx-1.log and reorx-1.log.old.
On the server, I can see the xpra process running:
    doole@reorx:~$ ps -ef | grep xpra
    doole      915     1 52 Apr26 ?        11:05:18 /usr/bin/python /opt/xpra/bin/xpra start :1
but I am unable to stop it:
    doole@reorx:~$ xpra stop :1
    Connection failed: [Errno 2] No such file or directory
    doole@reorx:~$ xpra stop
    Usage:
        xpra start DISPLAY
        xpra attach [DISPLAY]
        xpra detach [DISPLAY]
        xpra screenshot filename [DISPLAY]
        xpra version [DISPLAY]
        xpra stop [DISPLAY]
        xpra list
        xpra upgrade DISPLAY
    xpra: error: cannot find a live server to connect to
So I manually kill the xpra session (kill 915 - process is gone) and restart xpra. When I reconnect from the client, all my windows are still present (so killing xpra didn't wipe out the window manager for some reason) although the windows have all moved to +0+0 on the screen.
In case it matters, my line to start xpra is:
    /usr/bin/xpra start :1
I was previously using 0.0.7.34 and did not see this behaviour.
I suspect that you'll need more diagnostics - just let me know what to gather.
Attachments (5)
Change History (20)
Changed 10 years ago by
Attachment: | reorx-1.log added |
---|
Changed 10 years ago by
Attachment: | reorx-1.log.old added |
---|
comment:1 Changed 10 years ago by
Description: | modified (diff) |
---|---|
Status: | new → accepted |
comment:2 Changed 10 years ago by
Hah, I think I got it, from the log:
    wimpiggy.selection.AlreadyOwned
    removing socket /home/doole/.xpra/reorx-1
When you tried to start the server again, the new server failed because one was already running, but during its cleanup code it deleted what it thought was its own socket, when it was in fact the socket of the active server.
Until I can fix this bug, just make sure you don't start the server again if one is already running, or use the "--use-display" flag, which will force the old server to exit and start the new one.
comment:3 Changed 10 years ago by
Hmmm, looking at the code, that cannot be the case: create_unix_domain_socket is called before we register the cleanup_socket handler, so this would have failed if the socket still existed... So something must be deleting that socket...
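To illustrate the ordering argument (again a sketch with assumed names, not the actual xpra code): binding a unix domain socket fails with EADDRINUSE while the socket path still exists, so a cleanup handler that is only registered after a successful bind can never end up installed for a pre-existing socket.

```python
# Illustration only: cleanup is registered after the socket is created.
import atexit
import os
import socket

def create_unix_domain_socket(path):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)       # raises OSError (EADDRINUSE) if `path` already exists
    sock.listen(5)
    return sock

def start_server(path):
    sock = create_unix_domain_socket(path)    # fails first if the socket is still there...
    atexit.register(os.unlink, path)          # ...so this handler is never registered
    return sock
```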
Both logs say that there is already a window manager running for the :1 session; can you try with a clean/brand new display number, clean logs and "-d all"?
comment:4 Changed 10 years ago by
And maybe even run a script every few minutes to check the state of the socket in .xpra:

    while true; do date; ls -la ~/.xpra; xpra list; sleep 60; done
And pipe that to a logfile to check in the morning?
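If a shell loop is awkward to leave running, a rough Python equivalent that appends everything to a logfile might look like this (the logfile path and the 60-second interval are just examples, not anything xpra provides):

```python
#!/usr/bin/env python3
# Record a timestamp, the contents of ~/.xpra and the output of "xpra list"
# every minute, so the state leading up to the failure can be reconstructed.
import os
import subprocess
import time

LOGFILE = os.path.expanduser("~/xpra-track.log")   # example path
XPRA_DIR = os.path.expanduser("~/.xpra")

while True:
    with open(LOGFILE, "a") as log:
        log.write("==== %s ====\n" % time.ctime())
        for cmd in (["ls", "-la", XPRA_DIR], ["xpra", "list"]):
            result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT,
                                    universal_newlines=True)
            log.write("$ %s\n%s\n" % (" ".join(cmd), result.stdout))
    time.sleep(60)
```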
Maybe a cron job is running at a specific time that breaks something? Or maybe the DHCP lease is renewed (shouldn't matter - but maybe the hostname or domainname changes?), or the network drops for a few seconds, or...
comment:5 Changed 10 years ago by
I needed to do a reboot of my server machine, so I was able to run the script on a completely fresh machine. As luck would have it, it failed fairly quickly.
The other thing I realized is that I had a cron task to restart xpra if it failed. I stopped that job.
I have attached two files:
- xpra.track is the output of your script. (I added in a "host reorx" call as well, just in case the IP address of the machine changed. It didn't.)
- reorx-1.log.20120430 is the log from start to failure. The interesting observation here is that a whole bunch of "New connection received" messages came in, but I wasn't creating any new connections. (I just have the one active connection from the client.)
Also, although "xpra list" is saying there are no sessions, xpra is still running at the server:
    doole@reorx:~$ ps -ef | grep xpra
    doole     2035     1 31 13:59 ?        00:08:53 /usr/bin/python /opt/xpra/bin/xpra start :1
    doole     2505  2504  0 14:01 ?        00:00:00 /usr/bin/python /opt/xpra/bin/xpra _proxy :1
The client session that I had established before the failure is still usable (I can even spawn new windows from within the session.)
Changed 10 years ago by
Attachment: | xpra.track added |
---|
Changed 10 years ago by
Attachment: | reorx-1.log.20120430 added |
---|
comment:6 Changed 10 years ago by
Shoot, I forgot "-d all" when I started xpra on that last run. Trying again...
Changed 10 years ago by
Attachment: | reorx-1.log.21020430-2 added |
---|
comment:7 Changed 10 years ago by
Just added reorx-1.log.20120430-2. This is the server log with "-d all". (It looks like whatever the problem is happens fairly quickly.)
At the end of the log there will be a bunch of events from when I touched the window managed by xpra. This occurred after the server lost track of the session.
comment:8 Changed 10 years ago by
I just tried another run. This time I didn't establish any client connection and the server still failed. (I have the server log with "-d all" if you think it will be useful.)
In all the runs I did today, the failure seems to happen about 20 minutes after starting the server. (That may have been the case as well last week, but since the failure doesn't kill an established connection, I wouldn't have noticed.)
comment:9 Changed 10 years ago by
Priority: | major → critical |
---|
OK, that was easy with the reorx-1.log.20120430 logfile (the earlier reorx-1.log files were no good because they were overwritten when starting/trying-to-start a new server):

    (...)
    New connection received
    too many connections (20), ignoring new one

Something keeps connecting (every minute from what you are saying - this will be the script above), but fails to disconnect (or we just miss the event?), eventually causing the server to refuse new connections; the "xpra list" client then comes in, fails to connect and decides that the socket is dead and deletes it!
What I need to do:
- ensure connections either become the live one or are dropped after a few seconds (that should already be the case - bug)
- try to distinguish a rejected connection from a dead socket, and only mark it as dead if it is the latter (see the sketch below)
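One way the second point could work, sketched here as an assumption rather than as the actual fix: only treat the socket as dead when nothing is listening on it (ECONNREFUSED), and leave it alone when a server accepts the connection, even if that server then rejects the client because it is overloaded.

```python
# Sketch only: one possible way to tell a dead socket from a busy server.
import errno
import socket

def probe_socket(path, timeout=5):
    """Return "dead", "alive" or "unknown" for a unix domain socket path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except socket.error as e:
        if e.errno in (errno.ECONNREFUSED, errno.ENOENT):
            return "dead"       # nothing is listening any more: safe to clean up
        return "unknown"        # timeout, permissions, ...: better not to delete
    else:
        return "alive"          # something accepted, even if it rejects us later
    finally:
        sock.close()
```

Only the "dead" case would justify removing the socket file.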
comment:10 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
Fixed in r778: there was a race in the network protocol threading code which meant the server did not always unregister connections... leading to a DoS.
This may warrant a 0.2.1 release.
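For reference, the general pattern such a fix has to enforce (a generic sketch, not the actual r778 change): registration and unregistration of connections must be serialized, and every code path that tears a connection down, including the network thread's error handling, must go through the same unregister step.

```python
# Generic sketch of serialized connection tracking, not xpra's real code.
import threading

MAX_CONNECTIONS = 20     # the limit seen in the log above

class ConnectionRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._connections = set()

    def register(self, conn):
        with self._lock:
            if len(self._connections) >= MAX_CONNECTIONS:
                return False               # refuse: too many connections
            self._connections.add(conn)
            return True

    def unregister(self, conn):
        # Must be called from every path that closes a connection,
        # otherwise the count only ever grows and new clients are refused.
        with self._lock:
            self._connections.discard(conn)
```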
comment:12 Changed 10 years ago by
I have confirmed that r779 fixes the problem. Thanks.
One thing is still puzzling me though: when I first saw the problem, what was causing all the connection attempts that would mess up the server? The only external connection attempt should have been my client establishing a session, and that wouldn't happen more than a few times in a 24-hour period.
So where did the rest of the connection attempts come from?
comment:13 Changed 10 years ago by
If you are running winswitch, it will run the equivalent of "xpra list" regularly (in particular whenever there is utmp traffic: logins/logouts, etc.)
comment:14 Changed 10 years ago by
I'm not using winswitch - just xpra by itself.
Looking at the log, I don't see a lot of connection attempts. So while you've fixed the problem, I'm just curious how I hit it. (Oh well, not a big deal.)
comment:15 Changed 16 months ago by
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/118
The "Connection failed: [Errno 2] No such file or directory" error is a little odd: it means the socket is gone - why is not entirely clear.
Can you check if the xpra process still has it open?
After that, xpra stop will fail, and connect also..
Maybe you could try to strace the process next time this happens to see what it is doing, or just run the server in debug mode (with "-d all") so we can get some diagnostics.